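Introduction: what an analyzer is and which built-in analyzers are available
What is an analyzer
- A tool that breaks a piece of user-supplied text into multiple terms according to a defined set of rules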
- example: The best 3-points shooter is Curry!
Commonly used built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- stop analyzer
- language analyzer
- pattern analyzer
standard analyzer
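- The standard analyzer is the default; it is used whenever no analyzer is specified. It splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, drops most punctuation, and lowercases the terms.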
- POST localhost:9200/_analyze
{
"analyzer": "standard",
"text": "The best 3-points shooter is Curry!"
}
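- A rough Python sketch of what the request above returns, for readers without a running node. It is an approximation only: the real standard analyzer follows Unicode Text Segmentation, so it treats apostrophes, decimals, and non-Latin scripts differently from the `\w+` regex used here. The name `standard_like_tokens` is hypothetical.

```python
import re
from typing import List


def standard_like_tokens(text: str) -> List[str]:
    """Approximate the standard analyzer: take runs of word characters, then lowercase."""
    return [match.group(0).lower() for match in re.finditer(r"\w+", text)]


print(standard_like_tokens("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Note that the hyphen splits "3-points" into two terms and that the trailing "!" is dropped.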
simple analyzer
- The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and lowercases every term.
- POST localhost:9200/_analyze
{
"analyzer": "simple",
"text": "The best 3-points shooter is Curry!"
}
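- The rule above can be sketched in Python as follows. This is an illustration under the assumption that "letter" means a Unicode letter, not the actual Lucene implementation; `simple_like_tokens` is a hypothetical name.

```python
import re
from typing import List


def simple_like_tokens(text: str) -> List[str]:
    """Keep only runs of letters (digits and punctuation act as separators), lowercased."""
    # [^\W\d_] matches word characters that are neither digits nor underscores, i.e. letters.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


print(simple_like_tokens("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Unlike the standard analyzer, the digit "3" disappears because it is not a letter.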
whitespace analyzer
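- The whitespace analyzer splits the text into terms whenever it encounters a whitespace character. It does not lowercase the terms or strip punctuation.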
- POST localhost:9200/_analyze
{
"analyzer": "whitespace",
"text": "The best 3-points shooter is Curry!"
}
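- A minimal Python sketch of the same behavior, assuming that splitting on any run of whitespace is all that happens; `whitespace_tokens` is a hypothetical name.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace only; case and punctuation are left untouched."""
    return text.split()


print(whitespace_tokens("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

The terms keep their original case, "3-points" stays whole, and "Curry!" keeps its exclamation mark.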
stop analyzer
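- The stop analyzer is much like the simple analyzer; the only difference is that it also removes stop words, using the English stop-word list by default.
- Stop words come from a predefined list, for example: the, a, an, this, of, at.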
- POST localhost:9200/_analyze
{
"analyzer": "stop",
"text": "The best 3-points shooter is Curry!"
}
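- A Python sketch of "simple analyzer plus stop-word removal". The stop-word set below is only an illustrative subset (the words listed above plus "is"), not the full default English list, and the names `ILLUSTRATIVE_STOP_WORDS` and `stop_like_tokens` are hypothetical.

```python
import re
from typing import List

# Illustrative subset only; the real default English list is longer.
ILLUSTRATIVE_STOP_WORDS = {"the", "a", "an", "this", "of", "at", "is"}


def stop_like_tokens(text: str) -> List[str]:
    """Tokenize like the simple analyzer, then drop stop words."""
    letters_only = [token.lower() for token in re.findall(r"[^\W\d_]+", text)]
    return [token for token in letters_only if token not in ILLUSTRATIVE_STOP_WORDS]


print(stop_like_tokens("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```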
language analyzer
- Analyzers for specific languages, for example english (the English analyzer). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
- POST localhost:9200/_analyze
{
"analyzer": "english",
"text": "The best 3-points shooter is Curry!"
}
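- Language analyzers such as english also remove stop words and stem terms, which is not practical to imitate in a few lines. Instead, here is a sketch of sending the request above from Python with only the standard library. The helper names `build_analyze_request` and `analyze` are hypothetical, and the address `http://localhost:9200` is taken from the examples in these notes.

```python
import json
import urllib.request
from typing import List


def build_analyze_request(analyzer: str, text: str,
                          base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer: str, text: str) -> List[str]:
    """Send the request and return the token strings; requires a running node."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as response:
        return [entry["token"] for entry in json.load(response)["tokens"]]
```

With a node running locally, `analyze("english", "The best 3-points shooter is Curry!")` would return the terms produced by the english analyzer; the same helper works for any analyzer name used in these notes.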
pattern analyzer
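- Splits the text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters). Terms are lowercased by default.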
- POST localhost:9200/_analyze
{
"analyzer": "pattern",
"text": "The best 3-points shooter is Curry!"
}
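- The default behavior can be sketched in Python as below, assuming the default pattern `\W+` and lowercasing. Python's `re` module and Java's regex engine differ in some Unicode details, so treat this as an approximation; `pattern_like_tokens` is a hypothetical name.

```python
import re
from typing import List


def pattern_like_tokens(text: str, pattern: str = r"\W+") -> List[str]:
    """Split the text wherever the pattern matches, drop empty pieces, and lowercase."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]


print(pattern_like_tokens("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']

# A custom separator pattern, here splitting on commas only.
print(pattern_like_tokens("Curry,Thompson,Green", pattern=r","))
# ['curry', 'thompson', 'green']
```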
Choosing an analyzer
- PUT localhost:9200/my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "team_name": {
        "type": "text"
      },
      "position": {
        "type": "text"
      },
      "play_year": {
        "type": "long"
      },
      "jerse_no": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
- PUT localhost:9200/my_index/_doc/1
{
"name": "库里",
"team_name": "勇士",
"position": "控球后卫",
"play_year": 10,
"jerse_no": "30",
"title": "The best 3-points shooter is Curry!"
}
- POST localhost:9200/my_index/_search
{
  "query": {
    "match": {
      "title": "Curry!"
    }
  }
}
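- Why the search above finds the document: the title field is indexed with my_analyzer (type whitespace), and by default a match query analyzes the query text with the same analyzer, so the indexed term and the query term are both "Curry!". A minimal Python sketch of that matching logic follows; it ignores scoring, and the names `whitespace_terms` and `match_any` are hypothetical.

```python
from typing import List


def whitespace_terms(text: str) -> List[str]:
    """Stand-in for the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


def match_any(indexed_text: str, query_text: str) -> bool:
    """Return True if at least one query term equals an indexed term (OR semantics)."""
    indexed_terms = set(whitespace_terms(indexed_text))
    return any(term in indexed_terms for term in whitespace_terms(query_text))


title = "The best 3-points shooter is Curry!"
print(match_any(title, "Curry!"))  # True: the term "Curry!" was indexed as-is
print(match_any(title, "curry"))   # False: no lowercasing, no punctuation stripping
```

If the title field used the standard analyzer instead, both the indexed text and the query would be reduced to "curry", so either spelling would match.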