A look at the ngram, edge-ngram, and shingle tokenizers

2024-09-23 18:03:21

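A few points to note about the n-gram family (ngram, edge-ngram, shingle):

1. ngram and edge-ngram work at the character level inside the text. They are usually specified at index time and not at search time.

2. 1-grams (unigrams), bigrams, and trigrams refer to grams made of 1, 2, and 3 units respectively.

3. min_gram and max_gram set the minimum and maximum length, in characters, of the grams that are produced. output_unigrams belongs to shingle: setting it to false suppresses the single-word (1-gram) tokens.

4. shingle builds n-grams at the word level rather than the character level. In Elasticsearch it is a token filter applied after a tokenizer.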
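To make points 1 and 3 concrete, here is a minimal pure-Python sketch of what character-level grams look like. It is not Elasticsearch code: the function names `ngrams` and `edge_ngrams` are made up for illustration, and the sketch assumes the default `token_chars` setting, under which the whole input (spaces included) is treated as one token.

```python
def ngrams(text, min_gram, max_gram):
    """Sliding-window character grams: every start position, every allowed length."""
    out = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                out.append(text[start:start + size])
    return out


def edge_ngrams(text, min_gram, max_gram):
    """Character grams anchored at the beginning of the text only."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


if __name__ == "__main__":
    print(edge_ngrams("Quick Foxes.", 1, 2))   # ['Q', 'Qu']
    print(ngrams("hello world", 1, 2)[:6])     # ['h', 'he', 'e', 'el', 'l', 'll']
```

Because edge grams are all prefixes, they suit search-as-you-type matching, while the sliding-window variant also matches fragments in the middle of a word at the cost of many more tokens.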

DELETE myind_ngram
PUT myind_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "myngram":{
          "tokenizer":"mytokenizer"
        }
      },
      "tokenizer": {
        "mytokenizer":{
          "type":"edge_ngram",
          "min_gram":1,
          "max_gram":2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content":{
        "type":"text",
        "analyzer": "myngram"
      }
    }
  }
}


POST myind_ngram/_analyze
{
  "text":"Quick Foxes.",
  "analyzer":"myngram"
}

POST _analyze
{
  "text":"hello world",
  "tokenizer":{"type":"ngram","min_gram":1,"max_gram":2}
}

POST _analyze
{
  "text":"Quick Foxes.",
  "tokenizer":{"type":"edge_ngram","min_gram":1,"max_gram":10}
}

POST _analyze
{
  "text":"Quick Foxes Are You Ok",
  "tokenizer":"standard",
  "filter":[{"type":"shingle","output_unigrams":false}]
}
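
The shingle request above can be approximated in a few lines of Python to show which tokens to expect. This is only a sketch under stated assumptions: `shingles` is a hypothetical helper, `str.split()` stands in for the standard tokenizer (good enough for this punctuation-free sentence), and the defaults of 2 for both the minimum and maximum shingle size mirror the filter's documented defaults.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True, sep=" "):
    """Word-level n-grams: join runs of adjacent tokens with a separator."""
    out = []
    for start, token in enumerate(tokens):
        if output_unigrams:
            out.append(token)  # the single word itself
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out


if __name__ == "__main__":
    words = "Quick Foxes Are You Ok".split()
    print(shingles(words, output_unigrams=False))
    # ['Quick Foxes', 'Foxes Are', 'Are You', 'You Ok']
```

With `output_unigrams` left at its default of true, the single words would appear alongside the two-word shingles.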
