Elasticsearch Analyzer
The core of Elasticsearch full-text search is Text Analysis, and Text Analysis is carried out by an Analyzer.
1 Types of Analyzer
1.1 Built-in Analyzer
Elasticsearch ships with a number of analyzers that work out of the box. The Standard Analyzer is the default and is sufficient for most scenarios. The built-in analyzers are listed below, followed by a short Analyze API sketch:

- Standard Analyzer: splits text into terms at word boundaries, as determined by the Unicode Text Segmentation algorithm; it removes most punctuation and lowercases terms.
- Simple Analyzer: splits text into terms at any non-letter character and lowercases terms.
- Whitespace Analyzer: splits text into terms at whitespace characters; it does not lowercase terms.
- Stop Analyzer: like the Simple Analyzer, but additionally removes stop words.
- Keyword Analyzer: a no-op analyzer that does not split the text at all, treating the entire input as a single term.
- Pattern Analyzer: splits text using a regular expression.
- Language Analyzers: language-specific analyzers such as English and French.
- Fingerprint Analyzer: mainly used for duplicate-detection scenarios.
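A quick way to compare built-in analyzers is the Analyze API (covered in section 4). A minimal sketch with an arbitrary sample sentence:

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "The QUICK Brown-Foxes."
}

The Whitespace Analyzer returns the terms [The, QUICK, Brown-Foxes.] as-is; rerunning the same request with "analyzer": "standard" instead yields the lowercased, punctuation-free terms [the, quick, brown, foxes].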
1.2 Custom Analyzer
If the built-in analyzers do not meet your needs, you can create an analyzer of type custom, composed of:

- zero or more character filters
- one tokenizer
- zero or more token filters
Character filters, tokenizers, and token filters each come in two flavors: built-in and custom.
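A minimal sketch of such a definition (the index name my-index and analyzer name my_custom_analyzer are placeholders), chaining the built-in html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}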
2 Structure of an Elasticsearch Analyzer
In general, an Elasticsearch Analyzer is a pipeline of character filters, a tokenizer, and token filters, in that order; there must be exactly one tokenizer.
2.1 Character filter
A character filter preprocesses the character stream before it reaches the tokenizer.

- HTML Strip Character Filter: strips HTML elements such as <b> from the text and decodes HTML entities, e.g. turning &amp; into &.
- Mapping Character Filter: replaces every occurrence of a configured key with its associated value, much like applying a Java Map<String, String> lookup to the text.
- Pattern Replace Character Filter: replaces characters matched by a regular expression.
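A sketch of the HTML Strip Character Filter in isolation, paired with the keyword tokenizer so that the filtered text comes back as a single token:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>I&apos;m so happy</b>"
}

The returned token is I'm so happy: the <b> element is removed and the &apos; entity is decoded.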
2.2 Tokenizer
A tokenizer performs the actual tokenization; it also records each token's type, its position, and the start and end character offsets of the token within the original text. Elasticsearch ships with more than ten built-in tokenizers, falling into three categories: Word Oriented Tokenizers, Partial Word Tokenizers, and Structured Text Tokenizers.
2.2.1 Word Oriented Tokenizer
A Word Oriented Tokenizer splits text into individual words. The most commonly used ones are listed below, followed by a sketch:

- Standard Tokenizer: splits text into terms at word boundaries, as determined by the Unicode Text Segmentation algorithm; it removes most punctuation.
- Letter Tokenizer: splits text into terms at any non-letter character.
- Lowercase Tokenizer: like the Letter Tokenizer, but additionally lowercases each term.
- Whitespace Tokenizer: splits text into terms at whitespace characters.
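For example, the Letter Tokenizer splits an instance name on its - separators (a sketch):

GET /_analyze
{
  "tokenizer": "letter",
  "text": "sline-admin-webapp"
}

This returns the three terms sline, admin, and webapp.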
2.2.2 Partial Word Tokenizer
Partial Word Tokenizer以partial word
为维度进行分词。
- N-Gram Tokenizer,quick → [qu, ui, ic, ck]。
- Edge N-Gram Tokenizer,quick → [q, qu, qui, quic, quick]。
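Note that by default the Edge N-Gram Tokenizer only emits grams of length 1 to 2; producing the full sequence above requires raising max_gram, e.g.:

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5,
    "token_chars": ["letter"]
  },
  "text": "quick"
}

This returns q, qu, qui, quic, and quick.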
2.2.3 Structured Text Tokenizer
A Structured Text Tokenizer is designed for structured text such as IDs, email addresses, and paths; a sketch follows the list.

- Keyword Tokenizer: does not split the text at all, treating the entire input as a single term.
- Pattern Tokenizer: splits text using a regular expression.
- Path Hierarchy Tokenizer: /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz].
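The last example, reproduced with the Analyze API:

GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}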
2.3 Token filter
A token filter post-processes the tokens produced by the tokenizer. Elasticsearch ships with dozens of built-in token filters, which are not enumerated here.
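As a quick sketch, chaining the lowercase and stop token filters behind the standard tokenizer lowercases every token and then drops English stop words:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Foxes"
}

The token The is lowercased and then removed as a stop word, leaving [quick, brown, foxes].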
3 Specify the analyzer for a text field
The analyzer mapping parameter sets the analyzer for a specific field. Once set, that analyzer is used to analyze the field's text at both index time and search time (unless overridden for search with search_analyzer).
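A minimal sketch (the articles index and title field are placeholders):

PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}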
4 Analyze API
The Analyze API lets you run Text Analysis directly and inspect the resulting tokens.
4.1 Request
| Method | URL |
| --- | --- |
| GET / POST | /_analyze |
| GET / POST | /{index}/_analyze |
4.2 Path parameters
| Parameter | Required | Description |
| --- | --- | --- |
| index | false | Analyze the text with the analyzer of a specific field in this index |
4.3 Query parameters
| Parameter | Required | Description |
| --- | --- | --- |
| analyzer | false | The analyzer to apply, i.e. a pipeline of character filters, a tokenizer, and token filters |
| char_filter | false | Character filters |
| tokenizer | false | Tokenizer |
| filter | false | Token filters |
| text | true | The text to analyze |
| field | false | Derive the analyzer from this field; if set, the index path parameter must also be provided |
| normalizer | false | A normalizer, which turns the text into a single term |
Normalizer: a Normalizer is a simplified Analyzer without the tokenizer stage; in other words, a normalizer can only ever produce a single token.
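A sketch of defining a normalizer and attaching it to a keyword field (the index, normalizer, and field names are placeholders):

PUT /my-index-2
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}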
4.4 Trying it out
GET /_analyze
{
"tokenizer": "standard",
"text": "sline-admin-webapp"
}
The response is as follows:
{
"tokens": [
{
"token": "sline",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "admin",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "webapp",
"start_offset": 12,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
}
]
}
5 Building a Custom Analyzer
5.1 Requirement
We collect and store microservice logs with Filebeat, Logstash, and Elasticsearch, and we need fuzzy (substring) search on the moduleName field. moduleName is the name of a microservice instance; it contains only English letters and the - separator.
5.2 Implementation
First, no built-in tokenizer gives us character-level terms out of the box here. So we use a character filter to rewrite moduleName into a -single character- form; for example, sline-webapp becomes -s-l-i-n-e---w-e-b-a-p-p-. The standard tokenizer then splits the result into one term per character. Next, we update the index template so that this custom analyzer is applied to the moduleName field at both index time and search time. Finally, fuzzy matching is done with a match_phrase query.
5.2.1 Update the index template
PUT /_index_template/sline-system-log-template
{
"index_patterns": [
"elk-*"
],
"template": {
"settings": {
"lifecycle": {
"name": "sline-system-log-ilm-policy",
"rollover_alias": "sline-system-log-ilm-policy-alias"
},
"number_of_shards": "1",
"max_result_window": "1000000",
"number_of_replicas": "1",
"analysis": {
"analyzer": {
"sline_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": [
"sline_char_filter"
]
}
},
"char_filter": {
"sline_char_filter": {
"type": "mapping",
"mappings": [
"a => -a-",
"b => -b-",
"c => -c-",
"d => -d-",
"e => -e-",
"f => -f-",
"g => -g-",
"h => -h-",
"i => -i-",
"j => -j-",
"k => -k-",
"l => -l-",
"m => -m-",
"n => -n-",
"o => -o-",
"p => -p-",
"q => -q-",
"r => -r-",
"s => -s-",
"t => -t-",
"u => -u-",
"v => -v-",
"w => -w-",
"x => -x-",
"y => -y-",
"z => -z-"
]
}
}
}
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"@version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"className": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lineNum": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"logLevel": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"methodName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"moduleName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"systemName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"threadName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"timestamp": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
5.2.2 Verify that the index template took effect
The check is straightforward: inspect a newly created log index.
GET /elk-2021.01.31
{
"elk-2021.01.31": {
"aliases": {},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"@version": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"className": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lineNum": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"logLevel": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"methodName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"moduleName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"systemName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "sline_analyzer"
},
"threadName": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"timestamp": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"index": {
"lifecycle": {
"name": "sline-system-log-ilm-policy",
"rollover_alias": "sline-system-log-ilm-policy-alias"
},
"number_of_shards": "1",
"provided_name": "elk-2021.01.31",
"max_result_window": "1000000",
"creation_date": "1612051512356",
"analysis": {
"analyzer": {
"sline_analyzer": {
"type": "custom",
"char_filter": [
"sline_char_filter"
],
"tokenizer": "standard"
}
},
"char_filter": {
"sline_char_filter": {
"type": "mapping",
"mappings": [
"a => -a-",
"b => -b-",
"c => -c-",
"d => -d-",
"e => -e-",
"f => -f-",
"g => -g-",
"h => -h-",
"i => -i-",
"j => -j-",
"k => -k-",
"l => -l-",
"m => -m-",
"n => -n-",
"o => -o-",
"p => -p-",
"q => -q-",
"r => -r-",
"s => -s-",
"t => -t-",
"u => -u-",
"v => -v-",
"w => -w-",
"x => -x-",
"y => -y-",
"z => -z-"
]
}
}
},
"number_of_replicas": "1",
"uuid": "C7rmtPmVTAeaN_6dW0vwfA",
"version": {
"created": "7090199"
}
}
}
}
}
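With the template in place, we can sanity-check the analyzer with the Analyze API before querying. The search term web is analyzed into the consecutive single-letter tokens w, e, and b, which is exactly what the phrase query below looks for in order:

GET /elk-2021.01.31/_analyze
{
  "analyzer": "sline_analyzer",
  "text": "web"
}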
5.2.3 Fuzzy search
GET /elk-2021.01.31/_search
{
"from": 0,
"size": 10,
"timeout": "10s",
"_source": {
"exclude": [
"@version",
"@timestamp"
]
},
"track_total_hits": true,
"query": {
"bool": {
"must": [
{
"match_phrase": {
"moduleName": "web"
}
}
]
}
},
"sort": [
{
"timestamp.keyword": {
"order": "desc"
}
}
]
}
The search results are as follows:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "elk-2021.01.31",
"_type": "_doc",
"_id": "JB9aYHcBbZJ5iJayD4Mj",
"_score": null,
"_source": {
"systemName": "ccn",
"logLevel": "INFO",
"moduleName": "ccn-webapp",
"lineNum": "1093",
"methodName": "getAndStoreFullRegistry",
"className": "com.netflix.discovery.DiscoveryClient",
"message": "Getting all instance registry info from the eureka server",
"threadName": "main",
"timestamp": "2021-01-31 09:27:29"
},
"sort": [
"2021-01-31 09:27:29"
]
},
{
"_index": "elk-2021.01.31",
"_type": "_doc",
"_id": "KB9aYHcBbZJ5iJayD4M2",
"_score": null,
"_source": {
"systemName": "ccn",
"logLevel": "INFO",
"moduleName": "ccn-webapp",
"lineNum": "60",
"methodName": "<init>",
"className": "com.netflix.discovery.InstanceInfoReplicator",
"message": "InstanceInfoReplicator onDemand update allowed rate per min is 4",
"threadName": "main",
"timestamp": "2021-01-31 09:27:29"
},
"sort": [
"2021-01-31 09:27:29"
]
}
]
}
}
6 References
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html