Elasticsearch: IKAnalyzer

2022-08-12 20:14:58

Chinese analyzer: IKAnalyzer

Reference article

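In Elasticsearch (1) through Elasticsearch (6) we covered the basics, and every field value was in English.
In real projects most of the text is Chinese, and Elasticsearch's built-in analyzers handle Chinese poorly.
Elasticsearch ships with several built-in analyzers: the standard analyzer, the simple analyzer,
the whitespace analyzer, and the language analyzers. If we do not specify an analyzer,
Elasticsearch uses the standard analyzer by default.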
POST _analyze
{
"analyzer": "ANALYZER_TYPE",
"text": "测试中文分词器"
}
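The request above can also be issued from a script. The following is a minimal sketch, assuming a node reachable at http://localhost:9200 (that URL and the helper names `build_analyze_request` and `token_texts` are our own, not part of the original post). It only builds the request and reads token strings out of a parsed response.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract just the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


# Assumed local node; adjust the URL for your own cluster.
request = build_analyze_request("http://localhost:9200", "standard", "测试中文分词器")
```

Passing `request` to `urllib.request.urlopen` and feeding the result to `json.load` would send it and give a dictionary that `token_texts` can read.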

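The test result shows that the standard analyzer strips the punctuation and then splits the text one character at a time. This is the Chinese search problem mentioned in the previous chapter, and clearly not the tokenization we want. Next, let's look at a Chinese analyzer.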
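To make the behaviour described above concrete, here is a rough imitation in Python of what the standard analyzer does to Chinese text: drop punctuation, then emit one token per character. This is only an illustration for Chinese input, not the analyzer's real implementation (for example, it would also split English words into letters), and the function name is our own.

```python
import unicodedata


def per_character_tokens(text):
    """Drop punctuation and whitespace, then emit one token per remaining character."""
    tokens = []
    for ch in text:
        category = unicodedata.category(ch)
        # Unicode categories starting with P are punctuation, Z are separators.
        if category.startswith("P") or category.startswith("Z"):
            continue
        tokens.append(ch)
    return tokens


print(per_character_tokens("测试中文分词器"))
```

A search for the word 测试 then has to be matched against the separate tokens 测 and 试, which is why the results are poor.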

Installation

Download a release directly: https://github.com/medcl/elasticsearch-analysis-ik/releases
https://github.com/348786639/elasticsearch-analysis-ik
(1) Download and install; the plugin version must match the ES version. We use 5.2.0 here
git clone -b v5.2.0 git@github.com:medcl/elasticsearch-analysis-ik.git
(2) mvn package
(3) cd /usr/local/elasticsearch-5.2.0/plugins/
(4) mkdir ik
(5) copy and unzip target/releases/elasticsearch-analysis-ik-{version}.zip to your-es-root/plugins/ik
(6) restart elasticsearch
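Because the plugin version has to match the ES version, it can help to derive the git tag and zip name from a single version string. A small sketch, assuming the naming pattern used in the steps above (tag `v5.2.0`, file `elasticsearch-analysis-ik-{version}.zip`); both helper names are hypothetical.

```python
def ik_release_names(es_version):
    """Return the git tag and release zip name that correspond to an ES version."""
    version = es_version.strip().lstrip("v")
    return "v" + version, "elasticsearch-analysis-ik-" + version + ".zip"


def versions_match(es_version, plugin_version):
    """True when the plugin was built for exactly this ES version."""
    return es_version.strip().lstrip("v") == plugin_version.strip().lstrip("v")


tag, zip_name = ik_release_names("5.2.0")
print(tag, zip_name)
```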
Create an index and specify its mapping
PUT /ik_index 
{ 
"mappings": { 
"ik_type":{ 
"_all": { 
"analyzer": "ik_max_word", 
"search_analyzer": "ik_max_word", 
"term_vector": "no", 
"store": "false" 
}, 
"properties": { 
"content": { 
"type": "text", 
"analyzer": "ik_max_word", 
"search_analyzer": "ik_max_word", 
"include_in_all": "true", 
"boost": 8 
} 
} 
} 
} 
} 
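The same index body can be built as a Python dictionary, which makes the repeated analyzer name a single parameter. This is a sketch of the 5.x-style mapping used in this post; `_all`, `include_in_all` and mapping types were changed or removed in later Elasticsearch releases, so treat it as specific to this setup. The function name `ik_mapping` is our own.

```python
import json


def ik_mapping(type_name, field, analyzer="ik_max_word", boost=8):
    """Return the index-creation body from this post as a dict (Elasticsearch 5.x style)."""
    return {
        "mappings": {
            type_name: {
                "_all": {
                    "analyzer": analyzer,
                    "search_analyzer": analyzer,
                    "term_vector": "no",
                    "store": "false",
                },
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": analyzer,
                        "search_analyzer": analyzer,
                        "include_in_all": "true",
                        "boost": boost,
                    }
                },
            }
        }
    }


# Serialize the body so it can be sent with PUT /ik_index.
print(json.dumps(ik_mapping("ik_type", "content"), indent=2))
```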
Insert a document
PUT /ik_index/ik_type/1 
{ 
"content":"测试中文分词器" 
} 
View the tokenization result (shown as a screenshot in the original post)
Query result: not found
GET /ik_index/ik_type/_search 
{ 
"query": { 
"match": { 
"content": "测" 
} 
} 
} 
Query result: found
GET /ik_index/ik_type/_search 
{ 
"query": { 
"match": { 
"content": "测试" 
} 
} 
} 
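Why the first query misses and the second hits: a match query analyzes the query string and, by default, matches a document if at least one resulting term was indexed for that field. The sketch below imitates that check. The token lists are assumptions for illustration (the original screenshot of the real ik_max_word output is not available), and `match_any` is our own helper.

```python
def match_any(indexed_tokens, query_tokens):
    """Return True if at least one query token was indexed (match query, default OR)."""
    indexed = set(indexed_tokens)
    return any(token in indexed for token in query_tokens)


# Assumed ik_max_word tokens for the document "测试中文分词器" (illustrative only).
indexed = ["测试", "中文", "分词器", "分词", "器"]

print(match_any(indexed, ["测"]))    # query "测" is assumed to analyze to the single token 测
print(match_any(indexed, ["测试"]))  # query "测试" is assumed to analyze to the single token 测试
```

With IK, 测 alone is never indexed as a token for this document, so only the whole word 测试 finds it.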

Dynamically loading hot words

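Every year new buzzwords, movie titles, and the like appear online, for example 攀登者, 网红, and 蓝瘦香菇. We do not want these to be split into pieces, so we need a custom dictionary. There are three ways to do this:
1. Use a local dictionary file listed in the configuration file; this requires a restart to take effect.
2. Configure a remote dictionary file, which allows hot updates (no ES restart).
3. Modify the IK source code to store the hot words in MySQL (no ES restart).
The IK configuration files are in es/plugins/ik/config:
IKAnalyzer.cfg.xml: configures the custom dictionaries
main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed here is kept together as one token
quantifier.dic: units and measure words
suffix.dic: suffixes
surname.dic: Chinese surnames
stopword.dic: English stop words
The two most important built-in IK dictionary files:
main.dic: the built-in Chinese words; text is segmented according to the words in this file
stopword.dic: the English stop words
Stop words, for example: a the and at but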
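We define custom hot words through a custom dictionary file:
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic</entry>
<!-- Users can configure their own extension stop-word dictionaries here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->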
</properties> 
To configure several dictionaries, separate them with semicolons:
<entry key="ext_dict">custom/mydict.dic;xxxx.dic</entry> 
In this setup the custom dictionary lives in:
/usr/local/elasticsearch-6.3.2/plugins/ik/config/custom 
[es@iZwz9278r1bks3b80puk6fZ custom]$ cat mydict.dic 
许金锭 
网红 
醉品 
攀登者 
蓝瘦香菇 
[es@iZwz9278r1bks3b80puk6fZ custom]$ ^C 
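Before restarting ES it can be useful to check which dictionary files the configuration actually lists. The sketch below reads the `entry` elements of an IKAnalyzer.cfg.xml-style document and splits each value on semicolons. It is our own helper for inspection, not how the plugin itself reads the file, and `SAMPLE_CFG` is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for IKAnalyzer.cfg.xml (DOCTYPE omitted for brevity).
SAMPLE_CFG = """<properties>
  <comment>IK Analyzer extension configuration</comment>
  <entry key="ext_dict">custom/mydict.dic;xxxx.dic</entry>
  <entry key="ext_stopwords"></entry>
</properties>"""


def read_entries(xml_text):
    """Map each entry key to the list of semicolon-separated values it holds."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.findall("entry"):
        value = (entry.text or "").strip()
        entries[entry.get("key")] = [p.strip() for p in value.split(";") if p.strip()]
    return entries


print(read_entries(SAMPLE_CFG))
```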
POST 54288.top:9200/_analyze
{ 
"analyzer":"ik_max_word", 
"text":"许金锭喜欢看攀登者蓝瘦香菇" 
} 
Compare the two results below (before and after loading the custom dictionary), looking at 许金锭 and 蓝瘦香菇
{ 
"tokens": [ 
{ 
"token": "许", 
"start_offset": 0, 
"end_offset": 1, 
"type": "CN_CHAR", 
"position": 0 
}, 
{ 
"token": "金锭", 
"start_offset": 1, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 1 
}, 
{ 
"token": "喜欢", 
"start_offset": 3, 
"end_offset": 5, 
"type": "CN_WORD", 
"position": 2 
}, 
{ 
"token": "看", 
"start_offset": 5, 
"end_offset": 6, 
"type": "CN_CHAR", 
"position": 3 
}, 
{ 
"token": "攀登者", 
"start_offset": 6, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 4 
}, 
{ 
"token": "攀登", 
"start_offset": 6, 
"end_offset": 8, 
"type": "CN_WORD", 
"position": 5 
}, 
{ 
"token": "者", 
"start_offset": 8, 
"end_offset": 9, 
"type": "CN_CHAR", 
"position": 6 
}, 
{ 
"token": "蓝", 
"start_offset": 9, 
"end_offset": 10, 
"type": "CN_CHAR", 
"position": 7 
}, 
{ 
"token": "瘦", 
"start_offset": 10, 
"end_offset": 11, 
"type": "CN_CHAR", 
"position": 8 
}, 
{ 
"token": "香菇", 
"start_offset": 11, 
"end_offset": 13, 
"type": "CN_WORD", 
"position": 9 
} 
] 
} 
POST 54288.top:9200/_analyze (same request, after loading the custom dictionary)
{ 
"tokens": [ 
{ 
"token": "许金锭", 
"start_offset": 0, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 0 
}, 
{ 
"token": "金锭", 
"start_offset": 1, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 1 
}, 
{ 
"token": "喜欢", 
"start_offset": 3, 
"end_offset": 5, 
"type": "CN_WORD", 
"position": 2 
}, 
{ 
"token": "看", 
"start_offset": 5, 
"end_offset": 6, 
"type": "CN_CHAR", 
"position": 3 
}, 
{ 
"token": "攀登者", 
"start_offset": 6, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 4 
}, 
{ 
"token": "攀登", 
"start_offset": 6, 
"end_offset": 8, 
"type": "CN_WORD", 
"position": 5 
}, 
{ 
"token": "者", 
"start_offset": 8, 
"end_offset": 9, 
"type": "CN_CHAR", 
"position": 6 
}, 
{ 
"token": "蓝瘦香菇", 
"start_offset": 9, 
"end_offset": 13, 
"type": "CN_WORD", 
"position": 7 
}, 
{ 
"token": "香菇", 
"start_offset": 11, 
"end_offset": 13, 
"type": "CN_WORD", 
"position": 8 
} 
] 
} 
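The difference between the two results can be imitated with a very small dictionary matcher: emit every dictionary word found in the text, and fall back to single characters where no word applies. This is a simplified illustration, not IK's actual algorithm (the real output above also contains 者, for instance), and `BASE` is an assumed tiny stand-in for main.dic.

```python
def segment(text, dictionary):
    """Emit every dictionary word found in text, plus single characters no word covers."""
    max_len = max((len(word) for word in dictionary), default=0)
    spans = []
    covered = [False] * len(text)
    for start in range(len(text)):
        # Try the longest candidate first, down to a single character.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                for i in range(start, end):
                    covered[i] = True
    spans.extend((i, i + 1) for i, hit in enumerate(covered) if not hit)
    # Order by start offset, longer words first.
    spans.sort(key=lambda span: (span[0], -span[1]))
    return [text[start:end] for start, end in spans]


BASE = {"金锭", "喜欢", "攀登者", "攀登", "香菇"}   # assumed built-in words
CUSTOM = {"许金锭", "网红", "醉品", "蓝瘦香菇"}     # words from mydict.dic
TEXT = "许金锭喜欢看攀登者蓝瘦香菇"

print(segment(TEXT, BASE))
print(segment(TEXT, BASE | CUSTOM))
```

Adding the custom words is enough for 许金锭 and 蓝瘦香菇 to come out as whole tokens, while the shorter words inside them (金锭, 香菇) are still emitted.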
Hot update using a remote file
POST 54288.top:9200/_analyze 
{ 
"analyzer":"ik_max_word", 
"text":"黄逸飞和谭林超一起去看中国机长" 
} 
Result (before the remote dictionary is configured)
{ 
"tokens": [ 
{ 
"token": "黄", 
"start_offset": 0, 
"end_offset": 1, 
"type": "CN_CHAR", 
"position": 0 
}, 
{ 
"token": "逸", 
"start_offset": 1, 
"end_offset": 2, 
"type": "CN_CHAR", 
"position": 1 
}, 
{ 
"token": "飞", 
"start_offset": 2, 
"end_offset": 3, 
"type": "CN_CHAR", 
"position": 2 
}, 
{ 
"token": "和", 
"start_offset": 3, 
"end_offset": 4, 
"type": "CN_CHAR", 
"position": 3 
}, 
{ 
"token": "谭", 
"start_offset": 4, 
"end_offset": 5, 
"type": "CN_CHAR", 
"position": 4 
}, 
{ 
"token": "林", 
"start_offset": 5, 
"end_offset": 6, 
"type": "CN_CHAR", 
"position": 5 
}, 
{ 
"token": "超一", 
"start_offset": 6, 
"end_offset": 8, 
"type": "CN_WORD", 
"position": 6 
}, 
{ 
"token": "一起", 
"start_offset": 7, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 7 
}, 
{ 
"token": "一", 
"start_offset": 7, 
"end_offset": 8, 
"type": "TYPE_CNUM", 
"position": 8 
}, 
{ 
"token": "起", 
"start_offset": 8, 
"end_offset": 9, 
"type": "COUNT", 
"position": 9 
}, 
{ 
"token": "去看", 
"start_offset": 9, 
"end_offset": 11, 
"type": "CN_WORD", 
"position": 10 
}, 
{ 
"token": "看中", 
"start_offset": 10, 
"end_offset": 12, 
"type": "CN_WORD", 
"position": 11 
}, 
{ 
"token": "中国", 
"start_offset": 11, 
"end_offset": 13, 
"type": "CN_WORD", 
"position": 12 
}, 
{ 
"token": "机长", 
"start_offset": 13, 
"end_offset": 15, 
"type": "CN_WORD", 
"position": 13 
} 
] 
} 
1. Create a dictionary file my_dict.txt (UTF-8 encoded) and place it in the web server's root directory so that it is easy to access
[root@iZwz9278r1bks3b80puk6fZ blog]# cat my_dict.txt 
黄逸飞 
谭林超 
中国机长 
[root@iZwz9278r1bks3b80puk6fZ blog]# 
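The plugin fetches this file over HTTP. As far as we know from the IK plugin's README, the URL should return `Last-Modified` and `ETag` headers, and the word list is reloaded when either of them changes; verify this against the plugin version you run. Most web servers send these headers for static files. The sketch below only illustrates that change-detection idea; the function names are our own and it is not the plugin's code.

```python
import hashlib
from email.utils import formatdate


def dictionary_headers(content, mtime):
    """Build the two headers a server could send for the dictionary file."""
    return {
        "Last-Modified": formatdate(mtime, usegmt=True),
        "ETag": '"' + hashlib.sha256(content).hexdigest() + '"',
    }


def needs_reload(previous, current):
    """True when either header differs, i.e. the word list should be fetched again."""
    return (
        previous.get("Last-Modified") != current.get("Last-Modified")
        or previous.get("ETag") != current.get("ETag")
    )


old = dictionary_headers("黄逸飞\n谭林超\n中国机长\n".encode("utf-8"), 0)
new = dictionary_headers("黄逸飞\n谭林超\n中国机长\n杨力生\n".encode("utf-8"), 60)
print(needs_reload(old, new))
```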
2. Edit the configuration file; remote_ext_dict is set to a URL
vim IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionaries here -->
<entry key="ext_dict">custom/mydict.dic</entry>
<!-- Users can configure their own extension stop-word dictionaries here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">http://www.54288.top/my_dict.txt</entry>
<!-- Users can configure a remote extension stop-word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties> 
{ 
"tokens": [ 
{ 
"token": "黄逸飞", 
"start_offset": 0, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 0 
}, 
{ 
"token": "和", 
"start_offset": 3, 
"end_offset": 4, 
"type": "CN_CHAR", 
"position": 1 
}, 
{ 
"token": "谭林超", 
"start_offset": 4, 
"end_offset": 7, 
"type": "CN_WORD", 
"position": 2 
}, 
{ 
"token": "超一", 
"start_offset": 6, 
"end_offset": 8, 
"type": "CN_WORD", 
"position": 3 
}, 
{ 
"token": "一起", 
"start_offset": 7, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 4 
}, 
{ 
"token": "一", 
"start_offset": 7, 
"end_offset": 8, 
"type": "TYPE_CNUM", 
"position": 5 
}, 
{ 
"token": "起", 
"start_offset": 8, 
"end_offset": 9, 
"type": "COUNT", 
"position": 6 
}, 
{ 
"token": "去看", 
"start_offset": 9, 
"end_offset": 11, 
"type": "CN_WORD", 
"position": 7 
}, 
{ 
"token": "看中", 
"start_offset": 10, 
"end_offset": 12, 
"type": "CN_WORD", 
"position": 8 
}, 
{ 
"token": "中国机长", 
"start_offset": 11, 
"end_offset": 15, 
"type": "CN_WORD", 
"position": 9 
}, 
{ 
"token": "中国", 
"start_offset": 11, 
"end_offset": 13, 
"type": "CN_WORD", 
"position": 10 
}, 
{ 
"token": "机长", 
"start_offset": 13, 
"end_offset": 15, 
"type": "CN_WORD", 
"position": 11 
} 
] 
} 
Now edit the file to check whether the hot update works: add the new word 杨力生 and compare the results before and after (no ES restart)
POST 54288.top:9200/_analyze 
{ 
"analyzer":"ik_max_word", 
"text":"黄逸飞和谭林超还有杨力生一起去看中国机长" 
} 
{ 
"tokens": [ 
{ 
"token": "黄逸飞", 
"start_offset": 0, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 0 
}, 
{ 
"token": "和", 
"start_offset": 3, 
"end_offset": 4, 
"type": "CN_CHAR", 
"position": 1 
}, 
{ 
"token": "谭林超", 
"start_offset": 4, 
"end_offset": 7, 
"type": "CN_WORD", 
"position": 2 
}, 
{ 
"token": "还有", 
"start_offset": 7, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 3 
}, 
{ 
"token": "杨", 
"start_offset": 9, 
"end_offset": 10, 
"type": "CN_CHAR", 
"position": 4 
}, 
{ 
"token": "力", 
"start_offset": 10, 
"end_offset": 11, 
"type": "CN_CHAR", 
"position": 5 
}, 
{ 
"token": "生", 
"start_offset": 11, 
"end_offset": 12, 
"type": "CN_CHAR", 
"position": 6 
}, 
{ 
"token": "一起", 
"start_offset": 12, 
"end_offset": 14, 
"type": "CN_WORD", 
"position": 7 
}, 
{ 
"token": "一", 
"start_offset": 12, 
"end_offset": 13, 
"type": "TYPE_CNUM", 
"position": 8 
}, 
{ 
"token": "起", 
"start_offset": 13, 
"end_offset": 14, 
"type": "COUNT", 
"position": 9 
}, 
{ 
"token": "去看", 
"start_offset": 14, 
"end_offset": 16, 
"type": "CN_WORD", 
"position": 10 
}, 
{ 
"token": "看中", 
"start_offset": 15, 
"end_offset": 17, 
"type": "CN_WORD", 
"position": 11 
}, 
{ 
"token": "中国机长", 
"start_offset": 16, 
"end_offset": 20, 
"type": "CN_WORD", 
"position": 12 
}, 
{ 
"token": "中国", 
"start_offset": 16, 
"end_offset": 18, 
"type": "CN_WORD", 
"position": 13 
}, 
{ 
"token": "机长", 
"start_offset": 18, 
"end_offset": 20, 
"type": "CN_WORD", 
"position": 14 
} 
] 
} 
The log (a screenshot in the original post) shows that 杨力生 has been loaded
The modified my_dict.txt:
黄逸飞 
谭林超 
中国机长 
杨力生 
{ 
"tokens": [ 
{ 
"token": "黄逸飞", 
"start_offset": 0, 
"end_offset": 3, 
"type": "CN_WORD", 
"position": 0 
}, 
{ 
"token": "和", 
"start_offset": 3, 
"end_offset": 4, 
"type": "CN_CHAR", 
"position": 1 
}, 
{ 
"token": "谭林超", 
"start_offset": 4, 
"end_offset": 7, 
"type": "CN_WORD", 
"position": 2 
}, 
{ 
"token": "还有", 
"start_offset": 7, 
"end_offset": 9, 
"type": "CN_WORD", 
"position": 3 
}, 
{ 
"token": "杨力生", 
"start_offset": 9, 
"end_offset": 12, 
"type": "CN_WORD", 
"position": 4 
}, 
{ 
"token": "一起", 
"start_offset": 12, 
"end_offset": 14, 
"type": "CN_WORD", 
"position": 5 
}, 
{ 
"token": "一", 
"start_offset": 12, 
"end_offset": 13, 
"type": "TYPE_CNUM", 
"position": 6 
}, 
{ 
"token": "起", 
"start_offset": 13, 
"end_offset": 14, 
"type": "COUNT", 
"position": 7 
}, 
{ 
"token": "去看", 
"start_offset": 14, 
"end_offset": 16, 
"type": "CN_WORD", 
"position": 8 
}, 
{ 
"token": "看中", 
"start_offset": 15, 
"end_offset": 17, 
"type": "CN_WORD", 
"position": 9 
}, 
{ 
"token": "中国机长", 
"start_offset": 16, 
"end_offset": 20, 
"type": "CN_WORD", 
"position": 10 
}, 
{ 
"token": "中国", 
"start_offset": 16, 
"end_offset": 18, 
"type": "CN_WORD", 
"position": 11 
}, 
{ 
"token": "机长", 
"start_offset": 18, 
"end_offset": 20, 
"type": "CN_WORD", 
"position": 12 
} 
] 
} 
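Comparing two long `_analyze` responses by eye is error-prone, so a short script can list what changed. A minimal sketch, assuming both responses have been parsed into dictionaries; the helper names are our own and the sample data is a shortened excerpt of the two results above.

```python
def token_texts(response):
    """Extract just the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def added_tokens(before, after):
    """Tokens present after the dictionary update but not before."""
    seen = set(token_texts(before))
    return [t for t in token_texts(after) if t not in seen]


def removed_tokens(before, after):
    """Tokens present before the dictionary update but not after."""
    seen = set(token_texts(after))
    return [t for t in token_texts(before) if t not in seen]


# Shortened excerpts of the responses before and after adding 杨力生.
before = {"tokens": [{"token": "还有"}, {"token": "杨"}, {"token": "力"}, {"token": "生"}, {"token": "一起"}]}
after = {"tokens": [{"token": "还有"}, {"token": "杨力生"}, {"token": "一起"}]}

print(added_tokens(before, after))
print(removed_tokens(before, after))
```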
