1. Character Filters (official docs)
A character filter runs before the tokenizer and strips or rewrites unwanted characters in the input text. Elasticsearch ships three built-in character filters:
(1) HTML Strip (official docs)
Removes HTML tags from the text. Its main parameter, escaped_tags, lists the tags to keep. Example:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          // the tokenizer to use
          "tokenizer": "keyword",
          // the character filter attached to this analyzer
          "char_filter": "custom_char_filter"
        }
      },
      // character filter definitions
      "char_filter": {
        "custom_char_filter": {
          // character filter type
          "type": "html_strip",
          // HTML tags to keep (skip) when filtering
          "escaped_tags": [
            "a"
          ]
        }
      }
    }
  }
}
Test the filter:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
Result:
{
  "tokens" : [
    {
      "token" : """this is address of baidu<a>baidu</a>
baidu content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}
As the result shows, every HTML tag except <a> was stripped.
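The filter's behavior can be approximated in plain Python. The regex sketch below is only an illustration, not Elasticsearch's actual (Lucene-based) implementation; unlike the real html_strip, it simply deletes non-whitelisted tags instead of replacing block tags with newlines.

```python
import re

def html_strip(text, escaped_tags=()):
    """Rough approximation of the html_strip char filter:
    remove HTML tags except those named in escaped_tags."""
    def repl(match):
        tag_name = match.group(1).lower()
        if tag_name in escaped_tags:
            return match.group(0)  # keep whitelisted tags verbatim
        return ""                  # drop all other tags
    return re.sub(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>", repl, text)

text = "this is address of baidu<a>baidu</a><p>baidu content</p>"
print(html_strip(text, escaped_tags=("a",)))
# this is address of baidu<a>baidu</a>baidu content
```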
(2) Mapping (official docs)
Replaces matched strings with configured values, commonly used for sensitive-word masking. Example:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["custom_char_filter", "custom_mapping_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter": {
          "type": "mapping",
          // replace every occurrence of "baidu" or "is" with **
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        }
      }
    }
  }
}
Run the analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
Result:
{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}
On top of html_strip, the mapping filter masked the sensitive words baidu and is.
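Conceptually, the mapping filter performs simple string substitution. The Python sketch below is a simplified illustration (the real filter matches all rules in a single pass over the text, rather than applying each replacement sequentially):

```python
def mapping_filter(text, mappings):
    """Apply "key=>value" rules to the text, mimicking the mapping char filter."""
    for rule in mappings:
        src, dst = rule.split("=>")
        text = text.replace(src.strip(), dst.strip())
    return text

print(mapping_filter("this is address of baidu", ["baidu=>**", "is=>**"]))
# th** ** address of **
```

Note that "is" also matches inside "this", which is why the real filter is a character filter, not a token filter: it operates on the raw character stream before tokenization.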
(3) Pattern Replace (official docs)
Replaces structured content that a regular expression can match. Example:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["custom_char_filter", "custom_mapping_filter", "custom_pattern_replace_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter": {
          "type": "mapping",
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        },
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      }
    }
  }
}
Building on (1) and (2), this adds custom_pattern_replace_filter, which performs a regex replacement; here it masks the middle four digits of a phone number. (Backslashes must be escaped in the JSON body, so the regex \d is written as \\d.)
The analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>telphone:13311112222"]
}
Result:
{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
telphone:133****2222""",
      "start_offset" : 0,
      "end_offset" : 76,
      "type" : "word",
      "position" : 0
    }
  ]
}
The phone number 13311112222 was masked as 133****2222.
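The same masking can be reproduced with the identical regex in Python, which is a quick way to verify a pattern before putting it into the filter definition ($1/$2 in the ES replacement become \1/\2 in Python):

```python
import re

# Same pattern and replacement as custom_pattern_replace_filter.
phone_pattern = re.compile(r"(\d{3})\d{4}(\d{4})")
masked = phone_pattern.sub(r"\1****\2", "telphone:13311112222")
print(masked)  # telphone:133****2222
```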
2. Token Filters (official docs)
There are too many token filters to cover here; see the official docs for the full list. Below are a few commonly used ones.
(1) Synonym filter (synonym)
Step 1: create an analysis folder under the config directory of the Elasticsearch installation and add a synonym.txt file containing the synonym rules; repeat this on every node in the cluster.
Step 2: point the filter settings at the synonym file:
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
This example uses the ik tokenizer; if it is unfamiliar, see the article "ES 中文分词器 ik".
After creating the index, run:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "啦啦啦,呵呵呵,啧啧啧"
}
Result:
{
  "tokens" : [
    {
      "token" : "嘻嘻",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "嘻嘻",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "嘎嘎",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "嘎嘎",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "么",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "么",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "么",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "SYNONYM",
      "position" : 5
    }
  ]
}
Every term matched by a synonym rule was replaced with its configured synonym.
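Since the synonym.txt contents are not shown above, the sketch below uses hypothetical rules in the Solr synonym format ("a,b => c") that the synonym filter accepts, just to illustrate the token-level replacement:

```python
def load_synonyms(lines):
    """Parse Solr-style "a,b => c" rules into a replacement table."""
    table = {}
    for line in lines:
        if "=>" in line:
            srcs, dst = line.split("=>")
            for src in srcs.split(","):
                table[src.strip()] = dst.strip()
    return table

# Hypothetical rules; the article's actual synonym.txt is not shown.
rules = load_synonyms(["啦啦啦 => 嘻嘻", "呵呵呵 => 嘻嘻"])
tokens = ["啦啦啦", "呵呵呵", "哈哈哈"]
print([rules.get(t, t) for t in tokens])
# ['嘻嘻', '嘻嘻', '哈哈哈']
```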
(2) Stop words (stop) (official docs)
Terms configured as stop words are dropped from the token stream and never enter the inverted index.
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "friend" ]
        }
      }
    }
  }
}
Create the index with the settings above, then run:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "You and me IS friend"
}
Result:
{
  "tokens" : [
    {
      "token" : "you",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 1
    }
  ]
}
Note: a stop-word file path can also be specified, similar to the ik tokenizer's dictionary files; see the official docs for details.
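The stop filter's effect on a token stream can be sketched in a few lines of Python (an illustration only, not the Lucene implementation):

```python
def stop_filter(tokens, stopwords, ignore_case=True):
    """Drop stop words from a token stream, mimicking the stop token filter."""
    stops = {w.lower() for w in stopwords} if ignore_case else set(stopwords)
    return [t for t in tokens
            if (t.lower() if ignore_case else t) not in stops]

print(stop_filter(["you", "and", "me", "is", "friend"], ["and", "is", "friend"]))
# ['you', 'me']
```

With ignore_case set, "IS" is dropped just like "is", matching the analyze result above.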