ES Character Filters & Token Filters

2022-09-21 08:33:44

1. Character Filters (official docs)

A character filter preprocesses the raw text before it is handed to the tokenizer, stripping or replacing unwanted characters. Elasticsearch provides three built-in character filters:

(1) HTML Strip (official docs)

Strips HTML tags from the text. Its main parameter, escaped_tags, lists the tags that should be kept. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          // the tokenizer to use
          "tokenizer":"keyword",
          // the character filter(s) applied before tokenizing
          "char_filter":["custom_char_filter"]
        }
      },
      // character filter definitions
      "char_filter": {
        "custom_char_filter":{
          // the type of character filter
          "type":"html_strip",
          // HTML tags that will not be stripped
          "escaped_tags": [
            "a"
          ]
        }
      }
    }
  }
}

Test the filter:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}

The result:

{
  "tokens" : [
    {
      "token" : """this is address of baidu<a>baidu</a>
baidu content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}

The result shows that every HTML tag except <a> was stripped; note that the removed block-level <p> tags were replaced with line breaks.
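The _analyze call above only exercises the analyzer. For the char filter to take effect on documents, the analyzer must also be attached to a field in the mapping. A minimal sketch, where the content field name is illustrative:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["custom_char_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      // hypothetical text field that uses the custom analyzer at index time
      "content": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}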

(2) Mapping (official docs)

Replaces configured strings wherever they occur; commonly used to mask sensitive words. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["custom_char_filter","custom_mapping_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter":{
          "type":"html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter":{
          "type": "mapping",
          // replace every occurrence of baidu or is with **
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        }
      }
    }
  }
}

Run the analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}

The result:

{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
""",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}

On top of html_strip, the mapping filter masked the sensitive words baidu and is. Note that the replacement is a plain substring match, so the is inside this was masked as well, yielding th**.
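When the word list is long or changes frequently, the rules can live in a file instead of the settings. A sketch assuming a hypothetical rules file config/analysis/sensitive.txt on every node, with one baidu=>** style rule per line:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["custom_mapping_filter"]
        }
      },
      "char_filter": {
        "custom_mapping_filter": {
          "type": "mapping",
          // hypothetical file, resolved relative to the config directory
          "mappings_path": "analysis/sensitive.txt"
        }
      }
    }
  }
}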

(3) Pattern Replace (official docs)

Replaces content that matches a regular expression, which makes it suitable for structured text. Example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["custom_char_filter","custom_mapping_filter","custom_pattern_replace_filter"]
        }
      },
      "char_filter": {
        "custom_char_filter":{
          "type":"html_strip",
          "escaped_tags": [
            "a"
          ]
        },
        "custom_mapping_filter":{
          "type": "mapping",
          "mappings": [
            "baidu=>**",
            "is=>**"
          ]
        },
        "custom_pattern_replace_filter":{
          "type":"pattern_replace",
          "pattern": "(\d{3})\d{4}(\d{4})",
          "replacement": "$1****$2"
        }
      }
    }
  }
}

Building on (1) and (2), a custom_pattern_replace_filter is added to perform regex replacement, here used to mask phone numbers. Note that backslashes in the pattern must be doubled (\\d) so they survive JSON string escaping.

The analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>telphone:13311112222"]
}

The result:

{
  "tokens" : [
    {
      "token" : """th** ** address of **<a>**</a>
** content
telphone:133****2222""",
      "start_offset" : 0,
      "end_offset" : 76,
      "type" : "word",
      "position" : 0
    }
  ]
}

The phone number 13311112222 was masked as 133****2222.
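A char filter can also be tried without creating an index by defining it inline in the _analyze request, which is convenient while iterating on the regex. A minimal sketch:

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d{3})\\d{4}(\\d{4})",
      "replacement": "$1****$2"
    }
  ],
  "text": "telphone:13311112222"
}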

2. Token Filters (official docs)

There are far too many token filters to cover them all (see the official docs); below are a few commonly used ones.

(1) Synonym filter: synonym

Step 1: create an analysis folder under the config directory of the Elasticsearch installation and add a synonym.txt file containing the synonym rules. Repeat this on every node in the cluster.
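synonym.txt uses the Solr synonym format: one rule per line, either a comma-separated group of equivalent terms or an explicit replacement written with =>. The rules below are only an illustration consistent with the test output further down, not the exact file used in the original post:

# explicit replacement: terms on the left are rewritten to the term on the right
啦啦啦 => 嘻嘻
呵呵呵 => 嘎嘎
啧啧啧 => 么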

Step 2: point the index settings at the synonym file:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":["synonym"]
        }
      },
      "filter": {
        "synonym":{
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

The ik tokenizer is used here; if you are not familiar with it, see the earlier post "ES 中文分词器ik".

After creating the index, run the analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text":"啦啦啦,呵呵呵,啧啧啧"
}

The result:

{
  "tokens" : [
    {
      "token" : "嘻嘻",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "嘻嘻",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "嘎嘎",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "SYNONYM",
      "position" : 1
    },
    {
      "token" : "嘎嘎",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "SYNONYM",
      "position" : 2
    },
    {
      "token" : "么",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 3
    },
    {
      "token" : "么",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "SYNONYM",
      "position" : 4
    },
    {
      "token" : "么",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "SYNONYM",
      "position" : 5
    }
  ]
}

Every matching term in the input was replaced by the synonym configured for it in synonym.txt.
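For quick experiments, the rules can also be inlined in the settings via the synonyms parameter instead of a file, which avoids copying synonym.txt to every node. A sketch using the same illustrative rules as above:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          // inline rules instead of synonyms_path
          "synonyms": ["啦啦啦 => 嘻嘻", "呵呵呵 => 嘎嘎"]
        }
      }
    }
  }
}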

(2) Stop words: stop (official docs)

Stop words specified in the settings are removed from the token stream, so they never make it into the inverted index.

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter":{
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is","friend" ]
        }
      }
    }
  }
}

Create the index with the settings above, then run the following analyze request:

GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text":"You and me IS friend"
}

The result:

{
  "tokens" : [
    {
      "token" : "you",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "me",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 1
    }
  ]
}

Note: a stop-word file path can also be configured, similar to the ik tokenizer's dictionary files; see the official docs for details.
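A sketch of that file-based variant, assuming a hypothetical stopwords.txt under config/analysis with one word per line; the stopwords parameter alternatively accepts predefined language lists such as _english_:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter": {
          "type": "stop",
          "ignore_case": true,
          // hypothetical file, resolved relative to the config directory
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}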
