Elasticsearch Analyzer

2022-12-01 21:32:22

Text analysis is the core of Elasticsearch full-text search, and text analysis is carried out by an analyzer.

1 Types of analyzers

1.1 Built-in Analyzer

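Elasticsearch ships with several ready-to-use analyzers. The Standard Analyzer is the default and is sufficient for most use cases.

  • Standard Analyzer: splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation and lowercases the terms.
  • Simple Analyzer: splits text into terms at every non-letter character and lowercases the terms.
  • Whitespace Analyzer: splits text into terms at whitespace. It does not lowercase the terms.
  • Stop Analyzer: like the Simple Analyzer, but it also removes stop words.
  • Keyword Analyzer: a no-op analyzer that does not split the text and emits the whole input as a single term.
  • Pattern Analyzer: splits text using a regular expression.
  • Language Analyzers: language-specific analyzers such as English and French.
  • Fingerprint Analyzer: produces a fingerprint of the text, mainly used for duplicate detection.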
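
To make the differences between some of these analyzers concrete, here is a rough pure-Python imitation of the Simple, Whitespace, and Keyword analyzers. It is only a sketch of the behavior described above, not Elasticsearch code; the function names and the sample sentence are made up for illustration.

```python
import re

def simple_analyzer(text):
    # Split at every non-letter character, then lowercase each term.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]

def whitespace_analyzer(text):
    # Split at whitespace only; case and punctuation are preserved.
    return text.split()

def keyword_analyzer(text):
    # No splitting at all: the whole input is one term.
    return [text]

sample = "The QUICK Brown-Foxes jumped 2 times"
print(simple_analyzer(sample))
print(whitespace_analyzer(sample))
print(keyword_analyzer(sample))
```

Note how the digit 2 disappears under the simple analyzer because it is not a letter, while the whitespace analyzer keeps Brown-Foxes as one term.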

1.2 Custom Analyzer

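If the built-in analyzers do not meet your needs, you can create an analyzer of type custom, which combines:

  • zero or more character filters
  • exactly one tokenizer
  • zero or more token filters

Each of the character filters, the tokenizer, and the token filters can be either built-in or custom.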

2 Structure of an Elasticsearch analyzer

In general, an Elasticsearch analyzer is a chain of character filters, a tokenizer, and token filters, applied in that order. There must be exactly one tokenizer.
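
The chaining can be pictured as plain function composition. The following is a minimal sketch with made-up stand-in functions (strip_commas, lowercase, drop_stopwords); it is not how Elasticsearch implements analysis, it only shows the order in which the three stages run.

```python
def run_analyzer(text, char_filters, tokenizer, token_filters):
    # 1. Character filters rewrite the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer turns the string into a list of tokens.
    tokens = tokenizer(text)
    # 3. Token filters rewrite the token list.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

def strip_commas(text):
    return text.replace(",", " ")

def lowercase(tokens):
    return [token.lower() for token in tokens]

def drop_stopwords(tokens):
    return [token for token in tokens if token not in {"the", "a"}]

result = run_analyzer(
    "The Quick,Brown Fox",
    char_filters=[strip_commas],
    tokenizer=str.split,
    token_filters=[lowercase, drop_stopwords],
)
print(result)
```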

2.1 Character filter

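Character filters preprocess the raw character stream before it reaches the tokenizer.

  • HTML Strip Character Filter: strips HTML elements such as <b> and decodes HTML entities such as &amp;.
  • Mapping Character Filter: replaces every occurrence of the configured strings with their configured replacements, much like a key-to-value lookup table.
  • Pattern Replace Character Filter: replaces characters that match a regular expression.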
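
The three character filters can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions: the real HTML Strip filter handles more cases (for example, it replaces block-level tags with newlines), and Elasticsearch writes back-references as $1 where Python uses \1. The function names are illustrative.

```python
import html
import re

def html_strip(text):
    # Drop tags, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def mapping_filter(text, mappings):
    # Single left-to-right pass; the longest key wins when several match.
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mappings[match.group(0)], text)

def pattern_replace(text, pattern, replacement):
    # Regular-expression based replacement.
    return re.sub(pattern, replacement, text)

print(html_strip("<b>Tom &amp; Jerry</b>"))
print(mapping_filter(":) and :(", {":)": "_happy_", ":(": "_sad_"}))
print(pattern_replace("123-456-789", r"(\d+)-(?=\d)", r"\1_"))
```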

2.2 Tokenizer

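The tokenizer splits the text into tokens and records, for each token, its type, its position, and the character offsets of its start and end. Elasticsearch ships with more than ten built-in tokenizers, which fall into three groups: Word Oriented Tokenizers, Partial Word Tokenizers, and Structured Text Tokenizers.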
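
As an illustration of the bookkeeping a tokenizer does, the sketch below imitates a whitespace tokenizer in Python and records the same attributes that the Analyze API reports (token, start_offset, end_offset, type, position). It is an approximation for explanation only.

```python
import re

def whitespace_tokenize(text):
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(0),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "word",
            "position": position,           # ordinal of the token in the stream
        })
    return tokens

for token in whitespace_tokenize("quick brown fox"):
    print(token)
```

Offsets are what highlighting relies on, while positions are what phrase queries rely on.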

2.2.1 Word Oriented Tokenizer

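Word Oriented Tokenizers split text into individual words. The most commonly used ones are:

  • Standard Tokenizer: splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation.
  • Letter Tokenizer: splits text into terms at every non-letter character.
  • Lowercase Tokenizer: like the Letter Tokenizer, but it also lowercases each token.
  • Whitespace Tokenizer: splits text into terms at whitespace.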
2.2.2 Partial Word Tokenizer

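Partial Word Tokenizers break words into fragments.

  • N-Gram Tokenizer: emits n-grams of each word. For example, with min_gram and max_gram both set to 2, quick → [qu, ui, ic, ck].
  • Edge N-Gram Tokenizer: emits n-grams anchored to the start of each word. For example, with min_gram 1 and max_gram 5, quick → [q, qu, qui, quic, quick].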
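
Both partial-word tokenizers are easy to sketch in Python. The functions below treat the whole input as one word and take min_gram and max_gram as arguments named after the Elasticsearch settings; they are an illustration, not the library's implementation.

```python
def ngram(text, min_gram, max_gram):
    # Slide over every start position and emit each allowed length.
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

def edge_ngram(text, min_gram, max_gram):
    # Only prefixes: every gram is anchored at the first character.
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]

print(ngram("quick", 2, 2))
print(edge_ngram("quick", 1, 5))
```
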
2.2.3 Structured Text Tokenizer

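Structured Text Tokenizers are meant for structured text such as IDs, email addresses, and paths.

  • Keyword Tokenizer: does not split the text and emits the whole input as a single term.
  • Pattern Tokenizer: splits text using a regular expression.
  • Path Hierarchy Tokenizer: /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz].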
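
The path example can be reproduced with a short Python function. This is a sketch of the idea (emit every ancestor of the path); the delimiter argument mirrors the tokenizer's delimiter setting, and the rest is illustrative.

```python
def path_hierarchy(path, delimiter="/"):
    parts = path.split(delimiter)
    tokens = []
    for depth in range(1, len(parts) + 1):
        prefix = delimiter.join(parts[:depth])
        if prefix:  # skip the empty prefix produced by a leading delimiter
            tokens.append(prefix)
    return tokens

print(path_hierarchy("/foo/bar/baz"))
```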

2.3 Token filter

Token filters post-process the token stream produced by the tokenizer. Elasticsearch ships with several dozen built-in token filters, which are not listed here.

3 Specify the analyzer for a text field

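The analyzer mapping parameter sets the analyzer for a specific field. Once it is set, that analyzer is used for text analysis at both index time and search time, unless a different search_analyzer is configured for the field.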
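
As a small illustration, the helper below builds the body one might send when creating an index whose text field uses a given analyzer. The field name title and the helper itself are hypothetical; only type, analyzer, and search_analyzer are real mapping parameters.

```python
import json

def text_field_mapping(field, analyzer, search_analyzer=None):
    # "analyzer" applies at index time and, by default, at search time too.
    properties = {"type": "text", "analyzer": analyzer}
    if search_analyzer is not None:
        # Optional override used only when analyzing query text.
        properties["search_analyzer"] = search_analyzer
    return {"mappings": {"properties": {field: properties}}}

body = text_field_mapping("title", "whitespace")
print(json.dumps(body, indent=4))
```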

4 Analyze API

Text analysis can be run on demand through the Analyze API.

4.1 Request

| Request method | URL |
| --- | --- |
| GET or POST | /_analyze or /{index}/_analyze |

4.2 Path parameters

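| Parameter | Required | Description |
| --- | --- | --- |
| index | false | Index whose analyzers, including the analyzer of a specific field, are used for the analysis |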

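4.3 Request body parameters

| Parameter | Required | Description |
| --- | --- | --- |
| analyzer | false | Name of the analyzer to apply, that is, a chain of character filters, a tokenizer, and token filters |
| char_filter | false | Character filters to apply |
| tokenizer | false | Tokenizer to apply |
| filter | false | Token filters to apply |
| text | true | Text to analyze |
| field | false | Field from which to take the analyzer; when this parameter is used, the index path parameter must also be given |
| normalizer | false | Normalizer used to turn the text into a single term |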

Note: a normalizer is a simplified analyzer that has no tokenizer. In other words, a normalizer always emits exactly one token.
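
The difference from an analyzer can be sketched as follows: a normalizer only applies character-level transformations and never splits. The two stand-in filters below (lowercasing and a rough ASCII folding based on Unicode decomposition) are illustrative choices, not a statement about which filters a particular normalizer uses.

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text, filters):
    for text_filter in filters:
        text = text_filter(text)
    # There is no tokenizer, so the result is always a single term.
    return [text]

print(normalize("Café-WebApp", [str.lower, ascii_fold]))
```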

4.4 Try it out

GET /_analyze
{
    "tokenizer": "standard",
    "text": "sline-admin-webapp"
}

The response looks like this:

{
    "tokens": [
        {
            "token": "sline",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "admin",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "webapp",
            "start_offset": 12,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}
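
The same request can be issued from a script. The sketch below uses only the Python standard library and assumes an unsecured cluster reachable at http://localhost:9200, which is an assumption and not something stated in this article; adjust the URL and add authentication as needed.

```python
import json
import urllib.request

def build_analyze_request(base_url, body, index=None):
    # /_analyze for ad-hoc analysis, /{index}/_analyze to use an index's analyzers.
    path = f"/{index}/_analyze" if index else "/_analyze"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze_tokens(base_url, body, index=None):
    # Sends the request and returns only the token strings from the response.
    request = build_analyze_request(base_url, body, index)
    with urllib.request.urlopen(request) as response:
        return [item["token"] for item in json.load(response)["tokens"]]

request = build_analyze_request(
    "http://localhost:9200",  # assumed address of a local cluster
    {"tokenizer": "standard", "text": "sline-admin-webapp"},
)
print(request.get_method(), request.full_url)
```

Calling analyze_tokens with the same arguments against a running cluster should return the three tokens shown in the response above.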

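5 Custom analyzer

5.1 Requirement

Microservice logs are collected and stored with Filebeat, Logstash, and Elasticsearch. We need fuzzy (substring) search on the moduleName field. moduleName is the instance name of a microservice, and it contains only English letters and the - separator.

5.2 Implementation

First, the word-oriented built-in tokenizers such as standard do not split a word into individual characters. We therefore use a character filter that wraps every letter of moduleName in hyphens (-x-); for example, sline-webapp becomes -s--l--i--n--e---w--e--b--a--p--p-. The standard tokenizer then splits this string into single-letter tokens. Next, we update the index template so that this custom analyzer is applied to the moduleName field at both index time and search time. Finally, a match_phrase query provides the substring match.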
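
The whole idea can be checked offline with a small Python simulation. It is only a sketch: the tokenizer is an ASCII-only stand-in for the standard tokenizer, and phrase matching is reduced to "the query tokens appear consecutively in the document tokens", which is the essence of match_phrase without slop.

```python
import re
import string

# Same rule as the mapping character filter: every lowercase letter x becomes -x-.
MAPPINGS = {letter: f"-{letter}-" for letter in string.ascii_lowercase}

def char_filter(text):
    return "".join(MAPPINGS.get(ch, ch) for ch in text)

def tokenize(text):
    # ASCII-only approximation of the standard tokenizer: hyphens separate tokens.
    return re.findall(r"[A-Za-z0-9]+", text)

def analyze(text):
    return tokenize(char_filter(text))

def phrase_matches(indexed_value, query):
    doc, phrase = analyze(indexed_value), analyze(query)
    if not phrase:
        return False
    width = len(phrase)
    return any(doc[i:i + width] == phrase for i in range(len(doc) - width + 1))

print(char_filter("sline-webapp"))
print(analyze("sline-webapp"))
print(phrase_matches("ccn-webapp", "web"))
```

Because uppercase letters are not in the mapping, this scheme only gives character-level matching for lowercase names.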

5.2.1 Update the index template
PUT /_index_template/sline-system-log-template
{
    "index_patterns": [
        "elk-*"
    ],
    "template": {
        "settings": {
            "lifecycle": {
                "name": "sline-system-log-ilm-policy",
                "rollover_alias": "sline-system-log-ilm-policy-alias"
            },
            "number_of_shards": "1",
            "max_result_window": "1000000",
            "number_of_replicas": "1",
            "analysis": {
                "analyzer": {
                    "sline_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "char_filter": [
                            "sline_char_filter"
                        ]
                    }
                },
                "char_filter": {
                    "sline_char_filter": {
                        "type": "mapping",
                        "mappings": [
                            "a => -a-",
                            "b => -b-",
                            "c => -c-",
                            "d => -d-",
                            "e => -e-",
                            "f => -f-",
                            "g => -g-",
                            "h => -h-",
                            "i => -i-",
                            "j => -j-",
                            "k => -k-",
                            "l => -l-",
                            "m => -m-",
                            "n => -n-",
                            "o => -o-",
                            "p => -p-",
                            "q => -q-",
                            "r => -r-",
                            "s => -s-",
                            "t => -t-",
                            "u => -u-",
                            "v => -v-",
                            "w => -w-",
                            "x => -x-",
                            "y => -y-",
                            "z => -z-"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date"
                },
                "@version": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "className": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "lineNum": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "logLevel": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "message": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "methodName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "moduleName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "systemName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "threadName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "timestamp": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}
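
Typing the 26 mapping rules by hand is error-prone, so they can also be generated. The snippet below is a convenience sketch that prints the char_filter fragment used in the template above; paste the output into the template or build the full body around it.

```python
import json
import string

# One rule per lowercase letter, in the "key => value" syntax of the mapping filter.
mappings = [f"{letter} => -{letter}-" for letter in string.ascii_lowercase]

char_filter = {
    "sline_char_filter": {
        "type": "mapping",
        "mappings": mappings,
    }
}

print(json.dumps(char_filter, indent=4))
```
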
5.2.2 Verify that the index template took effect

To verify, inspect the details of a newly created log index.

GET /elk-2021.01.31

The response is as follows:

{
    "elk-2021.01.31": {
        "aliases": {},
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date"
                },
                "@version": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "className": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "lineNum": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "logLevel": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "message": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "methodName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "moduleName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "systemName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    },
                    "analyzer": "sline_analyzer"
                },
                "threadName": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "timestamp": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        },
        "settings": {
            "index": {
                "lifecycle": {
                    "name": "sline-system-log-ilm-policy",
                    "rollover_alias": "sline-system-log-ilm-policy-alias"
                },
                "number_of_shards": "1",
                "provided_name": "elk-2021.01.31",
                "max_result_window": "1000000",
                "creation_date": "1612051512356",
                "analysis": {
                    "analyzer": {
                        "sline_analyzer": {
                            "type": "custom",
                            "char_filter": [
                                "sline_char_filter"
                            ],
                            "tokenizer": "standard"
                        }
                    },
                    "char_filter": {
                        "sline_char_filter": {
                            "type": "mapping",
                            "mappings": [
                                "a => -a-",
                                "b => -b-",
                                "c => -c-",
                                "d => -d-",
                                "e => -e-",
                                "f => -f-",
                                "g => -g-",
                                "h => -h-",
                                "i => -i-",
                                "j => -j-",
                                "k => -k-",
                                "l => -l-",
                                "m => -m-",
                                "n => -n-",
                                "o => -o-",
                                "p => -p-",
                                "q => -q-",
                                "r => -r-",
                                "s => -s-",
                                "t => -t-",
                                "u => -u-",
                                "v => -v-",
                                "w => -w-",
                                "x => -x-",
                                "y => -y-",
                                "z => -z-"
                            ]
                        }
                    }
                },
                "number_of_replicas": "1",
                "uuid": "C7rmtPmVTAeaN_6dW0vwfA",
                "version": {
                    "created": "7090199"
                }
            }
        }
    }
}
5.2.3 Substring (fuzzy) search
GET /elk-2021.01.31/_search
{
    "from": 0,
    "size": 10,
    "timeout": "10s",
    "_source": {
        "excludes": [
            "@version",
            "@timestamp"
        ]
    },
    "track_total_hits": true,
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "moduleName": "web"
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "timestamp.keyword": {
                "order": "desc"
            }
        }
    ]
}

The search returns the following result:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [
            {
                "_index": "elk-2021.01.31",
                "_type": "_doc",
                "_id": "JB9aYHcBbZJ5iJayD4Mj",
                "_score": null,
                "_source": {
                    "systemName": "ccn",
                    "logLevel": "INFO",
                    "moduleName": "ccn-webapp",
                    "lineNum": "1093",
                    "methodName": "getAndStoreFullRegistry",
                    "className": "com.netflix.discovery.DiscoveryClient",
                    "message": "Getting all instance registry info from the eureka server",
                    "threadName": "main",
                    "timestamp": "2021-01-31 09:27:29"
                },
                "sort": [
                    "2021-01-31 09:27:29"
                ]
            },
            {
                "_index": "elk-2021.01.31",
                "_type": "_doc",
                "_id": "KB9aYHcBbZJ5iJayD4M2",
                "_score": null,
                "_source": {
                    "systemName": "ccn",
                    "logLevel": "INFO",
                    "moduleName": "ccn-webapp",
                    "lineNum": "60",
                    "methodName": "<init>",
                    "className": "com.netflix.discovery.InstanceInfoReplicator",
                    "message": "InstanceInfoReplicator onDemand update allowed rate per min is 4",
                    "threadName": "main",
                    "timestamp": "2021-01-31 09:27:29"
                },
                "sort": [
                    "2021-01-31 09:27:29"
                ]
            }
        ]
    }
}

6 References

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-overview.html
  2. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
  3. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html
  4. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
  5. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
  6. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
