ElasticSearch(7.2.2): Introduction to Analyzers and How to Use Them

2019-11-04 16:23:32

Overview: what an analyzer is, and which analyzers are built in

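What is an analyzer
  • A tool that breaks a piece of user-supplied text into individual terms according to a set of rules
  • Example: The best 3-points shooter is Curry!
Commonly used built-in analyzers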
  • standard analyzer
  • simple analyzer
  • whitespace analyzer
  • stop analyzer
  • language analyzer
  • pattern analyzer
standard analyzer
  • The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.
  • POST localhost:9200/_analyze
{
	 "analyzer": "standard",
	 "text": "The best 3-points shooter is Curry!"
}
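
To make the result of this request easier to picture, here is a minimal Python sketch that approximates the standard analyzer on plain ASCII text. It is not Elasticsearch's implementation (the real tokenizer follows Unicode text segmentation rules); the function name standard_like is hypothetical and exists only for this illustration.

```python
import re

def standard_like(text):
    """Rough stand-in for the standard analyzer on simple ASCII input."""
    # Collect runs of word characters, then lowercase each token.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

print(standard_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Note that the hyphen in 3-points splits the word into two terms, and no stop words are removed.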
simple analyzer
  • The simple analyzer splits text into terms whenever it encounters a character that is not a letter, and lowercases every term.
  • POST localhost:9200/_analyze
{
	 "analyzer": "simple",
	 "text": "The best 3-points shooter is Curry!"
}
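
The behavior described for the simple analyzer can be sketched in a few lines of Python. This is only an approximation under the assumption that "letter" means what Python's regex engine treats as a letter; simple_like is a hypothetical name, not an Elasticsearch API.

```python
import re

def simple_like(text):
    """Approximate the simple analyzer: letters only, lowercased."""
    # [^\W\d_] matches word characters that are neither digits nor underscores,
    # so digits and punctuation act as term boundaries and are dropped.
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]

print(simple_like("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Compared with the standard analyzer, the digit 3 disappears because it is not a letter.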
whitespace analyzer
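  • The whitespace analyzer splits text into terms whenever it encounters a whitespace character. It does not lowercase terms or strip punctuation.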
  • POST localhost:9200/_analyze
{
	 "analyzer": "whitespace",
	 "text": "The best 3-points shooter is Curry!"
}
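
As a rough illustration of whitespace-only splitting, the following Python sketch mirrors the idea with str.split. The helper name whitespace_like is hypothetical, and the sketch makes no claim about how Elasticsearch handles unusual whitespace characters.

```python
def whitespace_like(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    # str.split() with no arguments splits on runs of whitespace
    # and discards leading and trailing whitespace.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

Case, the hyphen, and the trailing exclamation mark are all preserved, which matters later when searching a field that uses this analyzer.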
stop analyzer
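  • The stop analyzer is very similar to the simple analyzer, with one difference: it also removes stop words. By default it uses the English stop word list.
  • Stop words come from a predefined list, for example the, a, an, this, of, at, and so on.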
  • POST localhost:9200/_analyze
{
	 "analyzer": "stop",
	 "text": "The best 3-points shooter is Curry!"
}
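
The stop analyzer can be pictured as the simple analyzer followed by a stop word filter. The Python sketch below assumes a stop list modeled on Lucene's default English set; treat the exact list as an assumption and verify it against your Elasticsearch version. The names STOP_WORDS and stop_like are hypothetical.

```python
import re

# Assumed to mirror the default English stop word list; verify before relying on it.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def stop_like(text, stop_words=STOP_WORDS):
    """Approximate the stop analyzer: letter tokens, lowercased, stop words removed."""
    tokens = [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]
    return [tok for tok in tokens if tok not in stop_words]

print(stop_like("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Both the and is are dropped here, in addition to the digit that the simple analyzer already discards.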
language analyzer
  • Analyzers tailored to a specific language, for example english. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
  • POST localhost:9200/_analyze
{
	 "analyzer": "english",
	 "text": "The best 3-points shooter is Curry!"
}
pattern analyzer
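  • Splits text into terms using a regular expression. The default pattern is \W+ (one or more non-word characters), and terms are lowercased by default.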
  • POST localhost:9200/_analyze
{
	 "analyzer": "pattern",
	 "text": "The best 3-points shooter is Curry!"
}
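
Regex-based splitting is easy to try out locally. The sketch below approximates the pattern analyzer with Python's re.split; Python and Java regular expressions differ in some details, so this is an illustration rather than an exact replica. The function pattern_like and its parameters are hypothetical.

```python
import re

def pattern_like(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on a regex, optionally lowercase."""
    # re.split can yield empty strings at the edges (e.g. after a trailing "!"),
    # so those are filtered out.
    tokens = [tok for tok in re.split(pattern, text) if tok]
    return [tok.lower() for tok in tokens] if lowercase else tokens

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']

# A different separator pattern, here commas and semicolons.
print(pattern_like("Curry,Thompson;Green", pattern=r"[,;]"))
# ['curry', 'thompson', 'green']
```

The regular expression describes the separators between terms, not the terms themselves.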
Choosing an analyzer
  • PUT localhost:9200/my_index
{
	"settings": {
		"analysis": {
			"analyzer": {
				"my_analyzer": {
					"type": "whitespace"
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"name": {
				"type": "text"
			},
			"team_name": {
				"type": "text"
			},
			"position": {
				"type": "text"
			},
			"play_year": {
				"type": "long"
			},
			"jerse_no": {
				"type": "keyword"
			},
			"title": {
				"type": "text",
				"analyzer": "my_analyzer"
			}
		}
	}
}
  • PUT localhost:9200/my_index/_doc/1
{
	 "name": "库里",
	 "team_name": "勇士",
	 "position": "控球后卫",
	 "play_year": 10,
	 "jerse_no": "30",
	 "title": "The best 3-points shooter is Curry!"
 }
  • POST localhost:9200/my_index/_search
{
	"query": {
		"match": {
			"title": "Curry!"
		}
	}
}
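
Why does a query for Curry! find the document? Because title uses a whitespace-type analyzer, the indexed term keeps its capital letter and exclamation mark, and the query text is analyzed the same way. The Python sketch below models this with plain sets; it assumes a match query succeeds when any analyzed query term is present in the field, and the helper names are hypothetical.

```python
def whitespace_tokens(text):
    """Whitespace-style analysis: split on whitespace, keep case and punctuation."""
    return text.split()

# Terms stored for the title field of document 1.
indexed_terms = set(whitespace_tokens("The best 3-points shooter is Curry!"))

def matches(query_text):
    """True if at least one analyzed query term equals an indexed term."""
    return any(term in indexed_terms for term in whitespace_tokens(query_text))

print(matches("Curry!"))  # True: the exact term "Curry!" was indexed
print(matches("Curry"))   # False: the indexed term includes the "!"
print(matches("curry!"))  # False: the whitespace analyzer does not lowercase
```

If case-insensitive or punctuation-insensitive matching is wanted on title, an analyzer such as standard would be a more forgiving choice.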
