Exploring Elasticsearch: Suggester API (Part 1)

2021-01-26 10:28:30

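Introduction

Modern search engines usually offer a suggest-as-you-type feature that auto-completes or corrects the query while the user is typing. Helping the user enter more precise keywords improves how well documents match in the search that follows. When you search on Google, it auto-completes at first; once the input reaches a certain length and can no longer be completed, for example because a word is misspelled, it starts suggesting similar words or sentences.

Official documentation (version 6.8): https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters.html

In Elasticsearch (ES), this kind of functionality is implemented through the Suggester API.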

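  • How it works: the input text is broken into tokens, then similar terms are looked up in the indexed field and returned
  • ES provides four kinds of suggesters for different use cases
    • Term Suggester: spelling correction; suggests the correct word for a misspelled input
    • Phrase Suggester: phrase-level correction; corrects and completes whole phrases rather than single words
    • Completion Suggester: auto-completion; completes the whole input from its first part
    • Context Suggester: context-aware completion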

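Term Suggester

A suggester is a special kind of search. "text" holds the text supplied with the call, usually whatever the user typed in the UI. In the example in this section, the user input "lucen" is a misspelling; it is looked up in the specified field, "body", and because the term is not found there (suggest_mode missing), suggested terms are returned.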

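  • text: the suggest text; required
  • field: the field to fetch candidate suggestions from; required
  • analyzer: the analyzer used to analyze the suggest text; defaults to the search analyzer of the suggest field
  • size: the maximum number of corrections returned per suggest text token
  • sort: how suggestions are sorted per suggest text term. Two possible values:
    • score: sort by score first, then by document frequency, then by the term itself
    • frequency: sort by document frequency first, then by similarity score, then by the term itself
  • suggest_mode: controls for which suggest text terms suggestions are produced. Three possible values:
    • missing: only provide suggestions for suggest text terms that are not in the index. This is the default
    • popular: only suggest terms that occur in more documents than the original suggest text term
    • always: suggest any matching suggestions based on the terms in the suggest text
  • max_edits: the maximum edit distance a candidate suggestion may have. Can only be 1 or 2; any other value results in a bad request error. Defaults to 2
  • prefix_length: the minimum number of leading characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; misspellings usually don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)
  • min_word_length: the minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)
  • shard_size: the maximum number of suggestions to retrieve from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to the size option. Setting this to a value higher than size can be useful to get more accurate document frequencies for spelling corrections, at the cost of performance. Because terms are partitioned among shards, shard-level document frequencies of spelling corrections may not be precise; increasing this value makes them more precise
  • max_inspections: a factor that is multiplied with shard_size in order to inspect more candidate spelling corrections at the shard level. Can improve accuracy at the cost of performance. Defaults to 5
  • min_doc_freq: the minimum threshold, in number of documents, that a suggestion should appear in. Can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f, which means it is not enabled. If a value higher than 1 is specified, the number cannot be fractional. Shard-level document frequencies are used for this option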
  • max_term_freq:The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g., 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified, then fractional can not be specified. Defaults to 0.01f. This can be used to exclude high frequency terms — which are usually spelled correctly — from being spellchecked. This also improves the spellcheck performance. The shard level document frequencies are used for this option.
  • string_distance:Which string distance implementation to use for comparing how similar suggested terms are. Five possible values can be specified:
    • internal: The default based on damerau_levenshtein but highly optimized for comparing string distance for terms inside the index.
    • damerau_levenshtein: String distance algorithm based on Damerau-Levenshtein algorithm.
    • levenshtein: String distance algorithm based on Levenshtein edit distance algorithm.
    • jaro_winkler: String distance algorithm based on Jaro-Winkler algorithm.
    • ngram: String distance algorithm based on character n-grams.
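
To make max_edits and prefix_length more concrete, the following is a minimal, self-contained Java sketch of how candidate terms could be filtered by edit distance. It uses the textbook Levenshtein algorithm rather than Lucene's optimized internal implementation, and the class and method names (EditDistanceSketch, levenshtein, isCandidate, candidates) are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a textbook Levenshtein distance, not Lucene's "internal" implementation.
class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // A candidate must share the first prefixLength characters with the input
    // and differ from it by at least 1 and at most maxEdits edits.
    static boolean isCandidate(String input, String term, int maxEdits, int prefixLength) {
        if (input.length() < prefixLength || term.length() < prefixLength) {
            return false;
        }
        if (!input.regionMatches(0, term, 0, prefixLength)) {
            return false;
        }
        int distance = levenshtein(input, term);
        return distance > 0 && distance <= maxEdits;
    }

    // Filters a list of indexed terms down to the candidates for one input token.
    static List<String> candidates(String input, List<String> indexedTerms, int maxEdits, int prefixLength) {
        List<String> result = new ArrayList<>();
        for (String term : indexedTerms) {
            if (isCandidate(input, term, maxEdits, prefixLength)) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "rocks", "rock", "elastic", "elk");
        // Same defaults as the term suggester: max_edits = 2, prefix_length = 1.
        System.out.println(candidates("lucen", terms, 2, 1));
    }
}
```

With these sample terms, only "lucene" survives for the input "lucen": it is one insertion away and shares the first character, while every other term fails the prefix check.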

Create a test index and add a document:

PUT /suggest_article/
{
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "text"
        }
      }
    }
  }
}

PUT suggest_article/_doc/1
{
  "body":"lucene is very cool"
}

Index documents 2 to 6 in the same way, with the following bodies:

"body":"Elasticsearch builds on top of lucene"
"body":"Elasticsearch rocks"
"body":"elastic is the company behind ELK stack"
"body":"Elk stack rocks"
"body":"elasticsearch is rock solid"

Search API

POST suggest_article/_search
{
  "from": 0, 
  "size": 10,
  "query": {
    "match": {
      "body": "lucen rock"
    }
  },
  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "missing",  // popular  always
        "field": "body"
      }
    }
  }
}

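Note: for Chinese queries, use the simple analyzer ("analyzer": "simple") for the suggest text, so that query-time analysis does not split the search phrase into separate terms.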

Result:

{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.1149852,
    "hits" : [
      {
        "_index" : "suggest_article",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.1149852,
        "_source" : {
          "body" : "elasticsearch is rock solid"
        }
      }
    ]
  },
  "suggest" : {
    "term-suggestion" : [
      {
        "text" : "lucen",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lucene",
            "score" : 0.8,
            "freq" : 2
          }
        ]
      },
      {
        "text" : "rock",
        "offset" : 6,
        "length" : 4,
        "options" : [
          {
            "text" : "rocks",
            "score" : 0.75,
            "freq" : 2
          }
        ]
      }
    ]
  }
}

Java API:

Suggest request (keyword is the text typed into the search box):

SearchSourceBuilder builder = new SearchSourceBuilder();
TermSuggestionBuilder termSuggestionBuilder = SuggestBuilders.termSuggestion("body").text(keyword);
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("term-suggestion", termSuggestionBuilder);
builder.suggest(suggestBuilder);


Reading the suggest response:

Suggest suggest = searchResponse.getSuggest();
// The name must match the one used in the request ("term-suggestion").
TermSuggestion termSuggestion = suggest.getSuggestion("term-suggestion");
for (TermSuggestion.Entry entry : termSuggestion.getEntries()) {
    for (TermSuggestion.Entry.Option option : entry) {
        String suggestText = option.getText().string(); // suggested text
        float score = option.getScore();
    }
}

Note: suggestions can be requested for several fields, each with its own suggestion.

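Summary: the term suggester first runs the input text through an analyzer (so the result depends on which analyzer is used) to break it into individual terms, then produces suggestions for each term independently, without considering the relationships between terms. The suggestions for each term, if any, are collected in that term's options list and returned together. The term suggester works at the level of terms, not documents, and is mainly intended for spelling correction.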


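Phrase Suggester

The phrase suggester adds logic on top of the term suggester to select entire corrected phrases instead of individual tokens, weighting candidates with an n-gram language model. It takes the relationships between terms into account, such as whether they occur together in the indexed text, how close they are to each other, and how frequent they are. In practice, this suggester can make better decisions about which tokens to pick based on co-occurrence and frequency.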

  • field: the name of the field used for the n-gram lookups of the language model; required
  • gram_size: sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field. If the field uses a shingle filter, gram_size is set to max_shingle_size if not explicitly set.
  • real_word_error_likelihood: the likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.
  • confidence: defines a factor applied to the input phrase's score, which is used as a threshold for other suggestion candidates. Only candidates that score higher than the threshold are included in the result. For instance, a confidence level of 1.0 only returns suggestions that score higher than the input phrase. If set to 0.0, the top N candidates are returned. The default is 1.0.
  • max_errors: the maximum percentage of the terms considered to be misspellings in order to form a correction. Accepts a float in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is 1.0, meaning only corrections with at most one misspelled term are returned. Setting this too high can hurt performance; low values such as 1 or 2 are recommended, otherwise the time spent in suggest calls might exceed the time spent in query execution.
  • separator: the separator used to separate terms in the bigram field. If not set, the whitespace character is used.
  • size: the number of candidates generated for each individual query term. Low numbers such as 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.
  • analyzer: sets the analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field passed via field.
  • shard_size: the maximum number of suggested terms to retrieve from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to 5.
  • text: the suggest text
  • highlight: highlights the corrected terms so the user can see which original terms were changed; configured with pre_tag and post_tag
  • collate: checks each suggestion against the specified query to prune suggestions for which no matching documents exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion was generated. The query must be specified and it can be templated; see search templates for more information. The current suggestion is automatically made available as the {{suggestion}} variable, which should be used in your query. You can still specify your own template params; the suggestion value is added to the variables you specify. Additionally, you can specify prune to control whether all phrase suggestions are returned; when set to true, each suggestion has an additional option collate_match, which is true if matching documents for the phrase were found and false otherwise. The default value for prune is false.

Search API:

POST suggest_article/_search
{
  "suggest": {
    "phrase-suggestion": {
      "text": "lucne and elasticsear rock",
      "phrase": {
        "field": "body",
        "max_errors":2, # maximum number of terms that may be misspelled
        "confidence":0,
        "direct_generator":[{
          "field":"body",
          "suggest_mode":"always"
        }],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

{
  "took" : 99,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "phrase-suggestion" : [
      {
        "text" : "lucne and elasticsear rock",
        "offset" : 0,
        "length" : 26,
        "options" : [
          {
            "text" : "lucne and elasticsearch rocks",
            "highlighted" : "lucne and <em>elasticsearch rocks</em>",
            "score" : 0.12709484
          },
          {
            "text" : "lucne and elasticsearch rock",
            "highlighted" : "lucne and <em>elasticsearch</em> rock",
            "score" : 0.10422645
          },
          {
            "text" : "lucne and elasticsear rocks",
            "highlighted" : "lucne and elasticsear <em>rocks</em>",
            "score" : 0.10036137
          },
          {
            "text" : "lucne and elasticsear rock",
            "highlighted" : "lucne and elasticsear rock",
            "score" : 0.082303174
          },
          {
            "text" : "lucene and elasticsear rock",
            "highlighted" : "<em>lucene</em> and elasticsear rock",
            "score" : 0.030959692
          }
        ]
      }
    ]
  }
}

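Custom highlight tags: "pre_tag":"<b id='d1' class='t1' style='color:red;font-size:18px;'>", "post_tag":"</b>"

Note: highlighting of suggester results differs slightly from highlighting of query results.
For example, the custom tags here are pre_tag and post_tag rather than the pre_tags and post_tags used for query highlighting.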

Java API:

Suggest request (keyword is the text typed into the search box):

SearchSourceBuilder builder = new SearchSourceBuilder();
PhraseSuggestionBuilder phraseSuggestBuilder = SuggestBuilders.phraseSuggestion("body")
                    .text(keyword)
                    .highlight("<em>", "</em>")
                    .maxErrors(2) // maximum number of terms that may be misspelled
                    .analyzer("simple")
                    .confidence(0)
                    .size(10); 

SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("phrase-suggestion", phraseSuggestBuilder);
builder.suggest(suggestBuilder);

Note: to get suggestions for several fields, create one PhraseSuggestionBuilder per field.

Reading the suggest response:

public static Map<String, List> suggestPhraseResponse(SearchResponse result) {

    Map<String, List> resultResponse = Maps.newHashMap();
    List suggestions = Lists.newArrayList();
    List<Map<String,Object>> hits = Lists.newArrayList();
    if (null != result) {

        Iterator<SearchHit> iterator = result.getHits().iterator();
        while (iterator.hasNext()) {
            Map<String, Object> hit = new HashMap<>();
            SearchHit searchHit = iterator.next();

            hit.put("matches", searchHit.getSourceAsMap());
            hit.put("score", searchHit.getScore());

            hits.add(hit);
        }

        Suggest suggest = result.getSuggest();
        // The name must match the one used in the request ("phrase-suggestion").
        PhraseSuggestion phraseSuggestion = suggest.getSuggestion("phrase-suggestion");
        for (PhraseSuggestion.Entry entry : phraseSuggestion){
            for (PhraseSuggestion.Entry.Option option : entry){
                Map<String, Object> optionMap = Maps.newHashMap();
                String text = option.getText().string();
                float score = option.getScore();
                String highlighted =  option.getHighlighted().string();

                optionMap.put("text", text);
                optionMap.put("score", score);
                optionMap.put("highlighted", highlighted);

                suggestions.add(optionMap);
            }
        }
    }
    resultResponse.put("hits", hits);
    resultResponse.put("suggestions", suggestions);

    return resultResponse;
}

Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams, or shingles, that do not exist in the index) and frequent grams (those that appear at least once in the index). The smoothing model is selected by setting the smoothing parameter to one of the following options. Each smoothing model supports specific properties that can be configured.

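  • stupid_backoff: a simple backoff model that backs off to lower-order n-gram models if the higher-order count is 0, discounting the lower-order model by a constant factor. The default discount is 0.4. Stupid backoff is the default model.
  • laplace: a smoothing model that uses additive smoothing, where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.
  • linear_interpolation: a smoothing model that takes the weighted mean of the unigrams, bigrams, and trigrams based on user-supplied weights (lambdas). Linear interpolation has no default values; all parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.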
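
As a rough illustration of what these models compute, here is a small Java sketch of additive (Laplace) smoothing and stupid backoff over bigram counts. It follows the textbook formulas and is only an approximation of the idea; it is not the scorer Elasticsearch uses internally, and the class SmoothingSketch and its methods are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: textbook bigram scoring, not the scorer Elasticsearch uses internally.
class SmoothingSketch {
    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();
    private long totalTokens = 0;

    // Counts unigrams and adjacent bigrams of a whitespace-separated sentence.
    void addSentence(String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], 1, Integer::sum);
            totalTokens++;
            if (i > 0) {
                bigrams.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    // Additive (Laplace) smoothing: alpha is added to every count,
    // so unseen bigrams keep a small non-zero probability.
    double laplace(String prev, String word, double alpha) {
        int pair = bigrams.getOrDefault(prev + " " + word, 0);
        int context = unigrams.getOrDefault(prev, 0);
        return (pair + alpha) / (context + alpha * unigrams.size());
    }

    // Stupid backoff: use the bigram estimate when the bigram was seen,
    // otherwise back off to the unigram estimate multiplied by a constant discount.
    double stupidBackoff(String prev, String word, double discount) {
        int pair = bigrams.getOrDefault(prev + " " + word, 0);
        if (pair > 0) {
            return (double) pair / unigrams.get(prev);
        }
        return discount * unigrams.getOrDefault(word, 0) / (double) totalTokens;
    }

    public static void main(String[] args) {
        SmoothingSketch model = new SmoothingSketch();
        model.addSentence("elasticsearch rocks");
        model.addSentence("elk stack rocks");
        model.addSentence("elasticsearch is rock solid");
        System.out.println(model.laplace("elasticsearch", "rocks", 0.5));
        System.out.println(model.stupidBackoff("elasticsearch", "rocks", 0.4));
        System.out.println(model.stupidBackoff("elasticsearch", "solid", 0.4));
    }
}
```

The seen bigram "elasticsearch rocks" scores well under both models, while the unseen bigram "elasticsearch solid" still receives a small, discounted score instead of zero, which is the point of smoothing.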

An example request that selects Laplace smoothing:

POST _search
{
  "suggest": {
    "text" : "obel prize",
    "simple_phrase" : {
      "phrase" : {
        "field" : "title.trigram",
        "size" : 1,
        "smoothing" : {
          "laplace" : {
            "alpha" : 0.7
          }
        }
      }
    }
  }
}

Candidate Generators

The phrase suggester uses candidate generators to produce a list of possible terms for each term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms to form suggestion candidates.

Currently only one type of candidate generator is supported, the direct_generator. The phrase suggest API accepts a list of generators under the key direct_generator; each generator in the list is called per term in the original text.

Direct Generators

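  • field: the field to fetch the candidate suggestions from
  • size: the maximum number of corrections returned per suggest text token.
  • suggest_mode: controls which suggestions each generator includes; takes the same values as for the term suggester (missing, popular, always)
  • max_edits: the maximum edit distance a candidate suggestion may have. Can only be 1 or 2; any other value results in a bad request error. Defaults to 2.
  • prefix_length: the minimum number of leading characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; misspellings usually don't occur in the first few characters of a term, as with English words. (The old name "prefix_len" is deprecated.)
  • min_word_length: the minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)
  • max_inspections: a factor that is multiplied with shard_size in order to inspect more candidate spelling corrections at the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
  • min_doc_freq: the minimum threshold, in number of documents, that a suggestion should appear in. Can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f, which means it is not enabled. If a value higher than 1 is specified, the number cannot be fractional. Shard-level document frequencies are used for this option.
  • max_term_freq: the maximum threshold, in number of documents, in which a suggest text token can exist in order to be included. Can be a relative percentage (for example 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, it cannot be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms, which are usually spelled correctly, from being spellchecked. It also improves spellcheck performance. Shard-level document frequencies are used for this option.
  • pre_filter: a filter (analyzer) applied to each token passed to this candidate generator. It is applied to the original token before candidates are generated.
  • post_filter: a filter (analyzer) applied to each generated token before it is passed to the actual phrase scorer.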

The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This overcomes the limitation that direct generators require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.

POST _search
{
  "suggest": {
    "text" : "obel prize",
    "simple_phrase" : {
      "phrase" : {
        "field" : "title.trigram",
        "size" : 1,
        "direct_generator" : [ {
          "field" : "title.trigram",
          "suggest_mode" : "always"
        }, {
          "field" : "title.reverse",
          "suggest_mode" : "always",
          "pre_filter" : "reverse",
          "post_filter" : "reverse"
        } ]
      }
    }
  }
}

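Summary: the phrase suggester does not handle Chinese very well. For Chinese queries, use the simple analyzer ("analyzer":"simple") for the suggest text, so that query-time analysis does not split the search phrase into separate terms.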


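Completion Suggester

The completion suggester provides auto-complete/search-as-you-type functionality. This is a navigational feature that guides users to relevant results as they type, improving search precision. It is not meant for spell correction or did-you-mean functionality like the term or phrase suggesters.

Ideally, auto-complete should be as fast as the user types, giving instant feedback relevant to what has already been typed. The completion suggester is therefore optimized for speed: it uses data structures that enable fast lookups but are costly to build and are stored in memory.

Its main use case is auto-completion. Here a query has to be sent to the backend on every keystroke, so when the user types quickly the requirements on backend response time are strict. The implementation therefore uses a different data structure from the two suggesters above: instead of relying on the inverted index, the analyzed data is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, which makes prefix lookups extremely fast. However, an FST can only be used for prefix lookups, which is also the main limitation of the Completion Suggester.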

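  • analyzer: the index analyzer to use; defaults to simple
  • search_analyzer: the search analyzer to use; defaults to the value of analyzer
  • preserve_separators: preserves the separators; defaults to true. If disabled, a suggestion for foof can match a field starting with Foo Fighters
  • preserve_position_increments: enables position increments; defaults to true
  • max_input_length: limits the length of a single input; defaults to 50 UTF-16 code points. This limit is only used at index time to reduce the total number of characters per input string, which prevents massive inputs from bloating the underlying data structure. Most use cases are unaffected by the default, since prefix completions seldom grow beyond a handful of characters.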
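
The prefix-only behaviour described above can be sketched in plain Java. In this example a sorted map stands in for the FST, which is a deliberate simplification (a real FST is far more compact and is built by Lucene); the class PrefixLookupSketch, its methods, and the sample inputs and weights are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a sorted map stands in for the FST used by the completion suggester.
class PrefixLookupSketch {
    private final TreeMap<String, Integer> inputs = new TreeMap<>();

    // Stores one lowercased input together with its weight.
    void add(String input, int weight) {
        inputs.put(input.toLowerCase(), weight);
    }

    // Returns up to size inputs that start with the prefix, highest weight first.
    List<String> complete(String prefix, int size) {
        String from = prefix.toLowerCase();
        // Every key starting with the prefix sorts between from and from + '\uffff'.
        String to = from + Character.MAX_VALUE;
        List<Map.Entry<String, Integer>> matches =
                new ArrayList<>(inputs.subMap(from, true, to, false).entrySet());
        matches.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(size, matches.size()); i++) {
            result.add(matches.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixLookupSketch lookup = new PrefixLookupSketch();
        lookup.add("Nevermind", 10);
        lookup.add("Nirvana", 3);
        lookup.add("Nine Inch Nails", 7);
        System.out.println(lookup.complete("ni", 5));   // inputs starting with "ni", by weight
        System.out.println(lookup.complete("vana", 5)); // empty: "vana" is not a prefix of any input
    }
}
```

The second lookup returns nothing even though "vana" occurs inside "nirvana", which mirrors the limitation noted above: only prefixes of the stored inputs can match.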

Indexing

You index suggestions like any other field. A suggestion is made of an input and an optional weight attribute. An input is the expected text to be matched by a suggestion query, and the weight determines how the suggestions will be scored. Indexing a suggestion is as follows:

PUT completion_article/_doc/1?refresh
{
    "suggest" : {
        "input": [ "Nevermind", "Nirvana" ],
        "weight" : 34
    }
}
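  • input: the text to store and match against; either a single string or an array of strings. This field is mandatory.
  • weight: a positive integer, or a string containing a positive integer, that defines a weight and allows you to rank your suggestions. This field is optional.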

You can index multiple suggestions for a document as follows:

PUT completion_article/_doc/1?refresh
{
    "suggest" : [
        {
            "input": "Nevermind",
            "weight" : 10
        },
        {
            "input": "Nirvana",
            "weight" : 3
        }
    ]
}

You can also use the following shorthand form. Note that you cannot specify a weight with the shorthand.

PUT completion_article/_doc/1?refresh
{
  "suggest" : [ "Nevermind", "Nirvana" ]
}

Queries

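Suggesting works as usual, except that you have to specify the suggest type as completion. Suggestions are near real-time, which means new suggestions can be made visible by a refresh, and documents once deleted are never shown. Note that the request below uses the standalone _suggest endpoint, which was deprecated in 5.0 in favor of _search; on 6.8, wrap the same body in a "suggest" object and send it to _search, as in the later examples. This request: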

POST music/_suggest?pretty
{
    "song-suggest" : {
        "prefix" : "nir",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Response:
{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "song-suggest" : [ {
    "text" : "nir",
    "offset" : 0,
    "length" : 3,
    "options" : [ {
      "text" : "Nirvana",
      "_index": "music",
      "_type": "song",
      "_id": "1",
      "_score": 1.0,
      "_source": {
        "suggest": ["Nevermind", "Nirvana"]
      }
    } ]
  } ]
}

The _source meta-field must be enabled (which is the default) for _source to be returned with suggestions.

The configured weight for a suggestion is returned as _score. The text field uses the input of your indexed suggestion. Suggestions return the full document _source by default. The size of the _source can impact performance due to disk fetch and network transport overhead. To save some network overhead, filter out unnecessary fields from the _source using source filtering to minimize _source size. Note that the _suggest endpoint doesn't support source filtering, but using suggest on the _search endpoint does:

POST music/_search?size=0
{
    "_source": "suggest",
    "suggest": {
        "song-suggest" : {
            "prefix" : "nir",
            "completion" : {
                "field" : "suggest"
            }
        }
    }
}

PUT /completion_article/
{
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "completion"
        }
      }
    }
  }
}

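Note: to use this feature, give the field a special mapping (type completion), which indexes the field values for fast completions.
1. An index analyzer can be set on the body field; it affects how the FST is encoded and therefore which prefixes match.
2. A search analyzer only takes effect when it is set in the mapping: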
"type": "completion",
"analyzer": "trigram_analyzer",
"search_analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50


PUT completion_article/_doc/1
{
  "body":"lucene is very cool"
}

Index documents 2 to 6 in the same way, with the following bodies:

"body":"Elasticsearch builds on top of lucene"
"body":"Elasticsearch rocks"
"body":"elastic is the company behind ELK stack"
"body":"Elk stack rocks"
"body":"elasticsearch is rock solid"

Search API:

POST completion_article/_search
{ "size": 0,
  "_source": {
    "includes": [
      "body"
    ],
    "excludes": []
  },
  "suggest": {
    "completion-suggest": {
      "prefix": "elastic i",
      "completion": {
        "field": "body",
        "skip_duplicates": true // de-duplicate the suggestions
      }
    }
  }
}

Response:

{
  "took" : 42,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "completion-suggest" : [
      {
        "text" : "elastic i",
        "offset" : 0,
        "length" : 9,
        "options" : [
          {
            "text" : "elastic is the company behind ELK stack",
            "_index" : "completion_article",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.0,
            "_source" : {
              "body" : "elastic is the company behind ELK stack"
            }
          }
        ]
      }
    ]
  }
}

Java API:

Suggest request:

SearchSourceBuilder builder = new SearchSourceBuilder();
CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion("body")
        .prefix(keyword)
        .skipDuplicates(true) // de-duplicate the suggestions
        .size(10);
String[] source = {"body"};
SuggestBuilder suggestBuilder = new SuggestBuilder();
suggestBuilder.addSuggestion("completion-suggest", completionSuggestionBuilder);
builder.suggest(suggestBuilder).fetchSource(source, null);

Reading the suggest response:

if (RestStatus.OK.equals(searchResponse.status())) {
    // Get the suggestions; the name must match the one used in the request.
    Suggest suggest = searchResponse.getSuggest();
    CompletionSuggestion completionSuggestion = suggest.getSuggestion("completion-suggest");
    for (CompletionSuggestion.Entry entry : completionSuggestion.getEntries()) {
        for (CompletionSuggestion.Entry.Option option : entry) {
            String suggestText = option.getText().string();
        }
    }
}

Note: to de-duplicate suggestions, call .skipDuplicates(true).

When set to true, this option can slow down search because more suggestions need to be visited to find the top N.

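Note that the Completion Suggester also runs the original data through analysis at index time. Depending on the analyzer, some words may be transformed or removed, which affects how the FST is encoded and therefore what a prefix lookup matches.

For example, reindex with a mapping that changes the analyzer to "english":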

PUT /completion_article_analyzer/
{
  "mappings": {
    "_doc": {
      "properties": {
        "body": {
          "type": "completion",
          "analyzer": "english"
        }
      }
    }
  }
}

PUT completion_article_analyzer/_doc/6
{
  "body":"elasticsearch is rock solid"
}

Search API:

POST completion_article_analyzer/_search
{ "size": 0,
  "suggest": {
    "completion_article": {
      "prefix": "elastic i",
      "completion": {
        "field": "body"
      }
    }
  }
}

Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "completion_article" : [
      {
        "text" : "elastic i",
        "offset" : 0,
        "length" : 9,
        "options" : [ ]
      }
    ]
  }
}

The options list is empty. The english analyzer strips stop words, and "is" is one of them, so the trailing "i" in the input has nothing to match.
Analysis of the indexed text:
POST _analyze
{
  "analyzer":"english",
  "text": "elasticsearch is rock solid"
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "rock",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "solid",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
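The FST encodes only these three tokens, and by default also records their positions in the document and the separators. When the user searches for "elastic i", the input is split into "elastic" and "i"; the FST has no encoding for "i", so the match fails.

Searching for "elastic is" returns results again: this time the input also passes through the english analyzer, which strips "is" at search time, so only the prefix "elastic" is looked up in the FST, and it matches.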

POST completion_article_analyzer/_search
{ "size": 0,
  "suggest": {
    "completion_article": {
      "prefix": "elastic is",
      "completion": {
        "field": "body"
      }
    }
  }
}
Result:
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "completion_article" : [
      {
        "text" : "elastic is",
        "offset" : 0,
        "length" : 10,
        "options" : [
          {
            "text" : "elastic is the company behind ELK stack",
            "_index" : "completion_article_analyzer",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 1.0,
            "_source" : {
              "body" : "elastic is the company behind ELK stack"
            }
          }
        ]
      }
    ]
  }
}
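
The difference between the "elastic i" and "elastic is" lookups can be reproduced with a small Java sketch. It is a simplified model under stated assumptions: the stop word list is a tiny hand-picked stand-in for the english analyzer's list, stemming and position increments are ignored, and the class StopWordPrefixSketch with its analyze and matches methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models analysis as lowercasing plus stop word removal,
// then compares the analyzed prefix against the analyzed input.
class StopWordPrefixSketch {
    // Assumption: a tiny stand-in for the english analyzer's stop word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "the", "a", "an", "of"));
    // Separator placed between the surviving tokens, loosely like preserve_separators.
    private static final char SEPARATOR = '\u001F';

    // Lowercases, splits on whitespace, drops stop words, and joins with the separator.
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(String.valueOf(SEPARATOR), kept);
    }

    // A lookup matches when the analyzed prefix is a prefix of the analyzed input.
    static boolean matches(String prefix, String indexedInput) {
        return analyze(indexedInput).startsWith(analyze(prefix));
    }

    public static void main(String[] args) {
        String doc = "elasticsearch is rock solid";
        // "i" is not a stop word, so it survives analysis and breaks the prefix match.
        System.out.println(matches("elastic i", doc));
        // "is" is dropped at search time, leaving only the prefix "elastic".
        System.out.println(matches("elastic is", doc));
    }
}
```

In this model the first lookup fails because the analyzed prefix contains a separator followed by "i", while the second succeeds because only "elastic" remains after stop word removal.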

Fuzzy queries

The Completion Suggester also supports fuzzy queries, which means a typo in the input can still produce results.

POST music/_suggest?pretty
{
    "song-suggest" : {
        "prefix" : "nor",
        "completion" : {
            "field" : "suggest",
            "fuzzy" : {
                "fuzziness" : 2
            }
        }
    }
}

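Fuzzy queries accept the following fuzzy-specific parameters:

  • fuzziness: the fuzziness factor; defaults to AUTO. See the "Fuzziness" section of the Elasticsearch reference for allowed settings.
  • transpositions: if set to true, a transposition counts as one change instead of two; defaults to true
  • min_length: the minimum length of the input before fuzzy suggestions are returned; defaults to 3
  • prefix_length: the minimum length of the input that is not checked for fuzzy alternatives; defaults to 1
  • unicode_aware: if true, all measurements (such as fuzzy edit distance, transpositions, and lengths) are made in Unicode code points instead of bytes. This is slightly slower than raw bytes, so it defaults to false.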
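
The effect of the transpositions flag can be shown with a short Java sketch of edit distance with and without adjacent swaps. This is the textbook optimal-string-alignment variant, assumed here purely for illustration; it is not the automaton-based matching that Lucene performs, and the class name TranspositionSketch is made up.

```java
// Illustrative only: textbook edit distance with an optional adjacent-transposition rule.
class TranspositionSketch {

    // When transpositions is true, swapping two adjacent characters costs one edit;
    // when false, the swap has to be expressed as two ordinary edits.
    static int distance(String a, String b, boolean transpositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "nri" is "nir" with two adjacent letters swapped.
        System.out.println(distance("nri", "nir", true));
        System.out.println(distance("nri", "nir", false));
    }
}
```

With transpositions enabled the swapped input is one edit away from "nir"; without it the same input is two edits away, so a fuzziness of 1 would no longer reach it.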

Regex queries

The Completion Suggester also supports regex queries, which means you can express the prefix as a regular expression.

POST music/_suggest?pretty
{
    "song-suggest" : {
        "regex" : "n[ever|i]r",
        "completion" : {
            "field" : "suggest"
        }
    }
}

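Regex queries accept the following regex-specific parameters:

  • flags: possible flags are ALL (default), ANYSTRING, COMPLEMENT, EMPTY, INTERSECTION, INTERVAL, or NONE. See regexp-syntax in the Elasticsearch reference for their meaning.
  • max_determinized_states: regular expressions are dangerous because it is easy to accidentally create an innocuous-looking one that requires an exponential number of internal determinized automaton states (and the corresponding RAM and CPU) for Lucene to execute. Lucene prevents this with the max_determinized_states setting (defaults to 10000). You can raise this limit to allow more complex regular expressions to execute.

Summary: the completion suggester is aimed at auto-completion and does not correct misspelled terms.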


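Context Suggester

An extension of the Completion Suggester.

Category context

We can attach category information to documents to make suggestions more precise.

For example, for the input "维生素" (vitamin):

  • Drug-related: drug
  • Supplement-related: supplement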

An example index definition (settings and mappings) with a category context:

{
    "indexName":"drug",
    "indexSource":{
        "settings":{
            "number_of_shards":1,
            "number_of_replicas":2,
            "index":{
                "analysis":{
                    "filter":{
                        "bigram_filter":{
                            "max_shingle_size":"2",
                            "min_shingle_size":"2",
                            "output_unigrams":"false",
                            "type":"shingle"
                        },
                        "trigram_filter":{
                            "max_shingle_size":"3",
                            "min_shingle_size":"2",
                            "type":"shingle"
                        },
                        "my_synonym":{
                            "type":"synonym",
                            "synonyms_path":"analysis/synonym.txt"
                        }
                    },
                    "analyzer":{
                        "trigram_analyzer":{
                            "filter":[
                                "lowercase",
                                "trigram_filter"
                            ],
                            "type":"custom",
                            "tokenizer":"standard"
                        },
                        "index_ansj_analyzer":{
                            "filter":[
                                "my_synonym",
                                "asciifolding"
                            ],
                            "type":"custom",
                            "tokenizer":"index_ansj"
                        },
                        "comma":{
                            "pattern":",",
                            "type":"pattern"
                        },
                        "lowercase_ngram_1_2":{
                            "filter":"lowercase",
                            "tokenizer":"ngram_1_2_tokenizer"
                        },
                        "bigram_analyzer":{
                            "filter":[
                                "lowercase",
                                "bigram_filter"
                            ],
                            "type":"custom",
                            "tokenizer":"standard"
                        },
                        "pinyin_analyzer":{
                            "tokenizer":"my_pinyin"
                        }
                    },
                    "tokenizer":{
                        "my_pinyin":{
                            "lowercase":"true",
                            "keep_original":"false",
                            "keep_first_letter":"true",
                            "keep_separate_first_letter":"true",
                            "type":"pinyin",
                            "limit_first_letter_length":"16",
                            "keep_full_pinyin":"true",
                            "keep_none_chinese_in_joined_full_pinyin":"true",
                            "keep_joined_full_pinyin":"true"
                        },
                        "ngram_1_2_tokenizer":{
                            "token_chars":[
                                "letter",
                                "digit"
                            ],
                            "min_gram":"1",
                            "type":"nGram",
                            "max_gram":"2"
                        }
                    }
                }
            }
        },
        "mappings":{
            "properties":{
                "categoryfirst":{
                    "type":"keyword"
                },
                "categorysecond":{
                    "type":"keyword"
                },
                "commonname":{
                    "type":"completion",
                    "analyzer":"trigram_analyzer",
                    "preserve_separators":true,
                    "preserve_position_increments":true,
                    "max_input_length":50,
                    "contexts":[
                        {
                            "type":"category",
                            "name":"spu_category"
                        }
                    ],
                    "fields":{
                        "ansj":{
                            "type":"text",
                            "analyzer":"index_ansj_analyzer"
                        },
                        "text":{
                            "type":"text"
                        },
                        "pinyincompletion":{
                            "type":"completion",
                            "analyzer":"pinyin_analyzer",
                            "preserve_separators":true,
                            "preserve_position_increments":true,
                            "search_analyzer":"simple",
                            "max_input_length":50
                        },
                        "keyword":{
                            "type":"keyword"
                        },
                        "pinyin":{
                            "type":"text",
                            "boost":10,
                            "term_vector":"with_offsets",
                            "analyzer":"pinyin_analyzer"
                        },
                        "shingle":{
                            "type":"text",
                            "analyzer":"trigram_analyzer"
                        }
                    }
                },
                "ctime":{
                    "type":"date",
                    "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                },
                "doctorteamhotid":{
                    "type":"keyword"
                },
                "drugid":{
                    "type":"keyword"
                },
                "drugname":{
                    "type":"completion",
                    "analyzer":"trigram_analyzer",
                    "preserve_separators":true,
                    "preserve_position_increments":true,
                    "max_input_length":50,
                    "contexts":[
                        {
                            "type":"category",
                            "name":"spu_category"
                        }
                    ],
                    "fields":{
                        "ansj":{
                            "type":"text",
                            "analyzer":"index_ansj_analyzer"
                        },
                        "text":{
                            "type":"text"
                        },
                        "pinyincompletion":{
                            "type":"completion",
                            "analyzer":"pinyin_analyzer",
                            "search_analyzer":"simple",
                            "preserve_separators":true,
                            "preserve_position_increments":true,
                            "max_input_length":50
                        },
                        "keyword":{
                            "type":"keyword"
                        },
                        "pinyin":{
                            "type":"text",
                            "boost":10,
                            "term_vector":"with_offsets",
                            "analyzer":"pinyin_analyzer"
                        },
                        "shingle":{
                            "type":"text",
                            "analyzer":"trigram_analyzer"
                        }
                    }
                },
                "drugtype":{
                    "type":"keyword"
                },
                "factoryname":{
                    "type":"text",
                    "fields":{
                        "ansj":{
                            "type":"text",
                            "analyzer":"index_ansj_analyzer"
                        },
                        "keyword":{
                            "type":"keyword"
                        },
                        "pinyin":{
                            "type":"text",
                            "boost":10,
                            "term_vector":"with_offsets",
                            "analyzer":"pinyin_analyzer"
                        },
                        "shingle":{
                            "type":"text",
                            "analyzer":"trigram_analyzer"
                        }
                    },
                    "copy_to":[
                        "text_all"
                    ]
                },
                "id":{
                    "type":"keyword"
                },
                "included":{
                    "type":"keyword"
                },
                "indextype":{
                    "type":"keyword"
                },
                "iscfda":{
                    "type":"keyword"
                },
                "medicineaccuratenum":{
                    "type":"keyword",
                    "copy_to":[
                        "text_all"
                    ]
                },
                "prescription":{
                    "type":"keyword"
                },
                "relation":{
                    "type":"join",
                    "eager_global_ordinals":true,
                    "relations":{
                        "drug-spu":[
                            "drug-doctorteamhot",
                            "drug-sku"
                        ]
                    }
                },
                "text_all":{
                    "type":"text"
                },
                "utime":{
                    "type":"date",
                    "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                }
            }
        }
    }
}

PUT drug/_doc/310346585771
{
  "indextype": "drug-spu",
  "categoryfirst": "drug",
  "iscfda": "1",
  "utime": "2020-12-29 09:37:04",
  "drugname": {
    "input": "", // a single input can be given as a plain string
    "contexts": {
      "spu_category": "drug"
    }
  },
  "relation": "drug-spu",
  "medicineaccuratenum": "国药准字Z20150067",
  "commonname": {
    "input": [
      "维生素E乳膏"  // multiple inputs are given as an array
    ],
    "contexts": {
      "spu_category": "drug"
    }
  },
  "prescription": "0",
  "ctime": "2020-12-23 22:00:18",
  "id": "310346585771",
  "categorysecond": "4171957ed83b25fa5727b1dd034eed50",
  "included": "1",
  "drugtype": "3",
  "factoryname": "中国医学科学院皮肤病医院"
}

PUT drug/_doc/310346508974
{
  "indextype": "drug-spu",
  "categoryfirst": "drug",
  "iscfda": "1",
  "utime": "2020-12-29 09:37:04",
  "drugname": {
    "input": [
      ""
    ],
    "contexts": {
      "spu_category": "supplement"
    }
  },
  "relation": "drug-spu",
  "medicineaccuratenum": "国药准字Z20150067",
  "commonname": {
    "input": [
      "维生素D滴剂"
    ],
    "contexts": {
      "spu_category": "supplement"
    }
  },
  "prescription": "0",
  "ctime": "2020-12-23 21:45:39",
  "id": "310346508974",
  "categorysecond": "a1332770e38c9146e4376fff033fe715",
  "included": "0",
  "drugtype": "0",
  "factoryname": ""
}

Search API

POST drug/_search
{
  "_source": {
    "includes": [
      "commonname",
      "drugname"
    ],
    "excludes": []
  },
  "suggest": {
    "commonname-completionsuggest": {
      "prefix": "STC踝控",
      "completion": {
        "field": "commonname",
        "size": 10,
        "skip_duplicates": true,
        "contexts": {
          "spu_category": [
            {
              "context": "others",
              "boost": 1,
              "prefix": false
            }
          ]
        }
      }
    },
    "drugname-completionsuggest": {
      "prefix": "STC踝控",
      "completion": {
        "field": "drugname",
        "size": 10,
        "skip_duplicates": true,
        "contexts": {
          "spu_category": [
            {
              "context": "others",
              "boost": 1,
              "prefix": false
            }
          ]
        }
      }
    }
  }
}

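Result (this sample response was captured for the prefix "维生素" under a suggestion named MY_SUGGESTION, not for the exact request above):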
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "suggest" : {
    "MY_SUGGESTION" : [
      {
        "text" : "维生素",
        "offset" : 0,
        "length" : 3,
        "options" : [
          {
            "text" : "维生素E乳膏",
            "_index" : "drug-20.12.30-103610",
            "_type" : "_doc",
            "_id" : "310346585771",
            "_score" : 1.0,
            "_ignored" : [
              "drugname.pinyincompletion",
              "drugname"
            ],
            "_source" : {
              "indextype" : "drug-spu",
              "categoryfirst" : "drug",
              "iscfda" : "1",
              "utime" : "2020-12-29 09:37:04",
              "drugname" : {
                "input" : [
                  ""
                ],
                "contexts" : {
                  "spu_category" : "drug"
                }
              },
              "relation" : "drug-spu",
              "medicineaccuratenum" : "国药准字Z20150067",
              "commonname" : {
                "input" : [
                  "维生素E乳膏"
                ],
                "contexts" : {
                  "spu_category" : "drug"
                }
              },
              "prescription" : "0",
              "ctime" : "2020-12-23 22:00:18",
              "id" : "310346585771",
              "categorysecond" : "4171957ed83b25fa5727b1dd034eed50",
              "included" : "1",
              "drugtype" : "3",
              "factoryname" : "中国医学科学院皮肤病医院"
            },
            "contexts" : {
              "spu_category" : [
                "drug"
              ]
            }
          }
        ]
      }
    ]
  }
}

Java API

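// Elasticsearch classes used below. ToXContent lives in org.elasticsearch.common.xcontent
// in 6.x and early 7.x clients; later versions moved it to org.elasticsearch.xcontent.
import org.elasticsearch.common.xcontent.ToXContent;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;
import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;
import org.elasticsearch.search.suggest.completion.context.CategoryQueryContext;

// searchRequest is the application's own request object.
String keyword = searchRequest.getKeyword();
String category = searchRequest.getCategory();

SearchSourceBuilder builder = new SearchSourceBuilder();
if (StringUtils.isNotBlank(keyword)) {

    // One completion suggester per completion field in the mapping.
    CompletionSuggestionBuilder commonnameBuilder = SuggestBuilders.completionSuggestion("commonname")
            .prefix(keyword).skipDuplicates(true)
            .size(10);

    CompletionSuggestionBuilder drugnameBuilder = SuggestBuilders.completionSuggestion("drugname")
            .prefix(keyword).skipDuplicates(true)
            .size(10);

    // The pinyin sub-fields are completion fields without contexts.
    CompletionSuggestionBuilder pinyinDrugnameBuilder = SuggestBuilders.completionSuggestion("drugname.pinyincompletion")
            .prefix(keyword).skipDuplicates(true)
            .size(10);
    CompletionSuggestionBuilder pinyinCommonnameBuilder = SuggestBuilders.completionSuggestion("commonname.pinyincompletion")
            .prefix(keyword).skipDuplicates(true)
            .size(10);

    if (StringUtils.isNotBlank(category)) {

        // Filter on the "spu_category" context: exact category match, boost 1.
        CategoryQueryContext context = CategoryQueryContext.builder()
                .setBoost(1)
                .setCategory(category)
                .setPrefix(false).build();
        Map<String, List<? extends ToXContent>> categoryMap = Maps.newHashMap();
        List<CategoryQueryContext> categoryList = Lists.newArrayList();
        categoryList.add(context);
        categoryMap.put("spu_category", categoryList);

        commonnameBuilder.contexts(categoryMap);
        drugnameBuilder.contexts(categoryMap);
    }

    // Return only the two suggestion fields in _source.
    String[] source = {"commonname", "drugname"};
    SuggestBuilder suggestBuilder = new SuggestBuilder();
    suggestBuilder.addSuggestion("commonname-completionsuggest", commonnameBuilder)
            .addSuggestion("drugname-completionsuggest", drugnameBuilder)
            .addSuggestion("pinyincommonname-completionsuggest", pinyinCommonnameBuilder)
            .addSuggestion("pinyindrugname-completionsuggest", pinyinDrugnameBuilder);
    builder.suggest(suggestBuilder).fetchSource(source, null);
}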
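Parameters of a category query context:
  • context: the value of the category to filter or boost on. This is mandatory.
  • boost: the factor by which the suggestion score is boosted. The score is computed by multiplying boost by the suggestion weight. Defaults to 1.
  • prefix: whether the category value should be treated as a prefix. For example, if set to true, you can filter on the categories type1, type2 and so on by specifying the category prefix type. Defaults to false.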
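
To make the interaction of context, boost and prefix concrete, here is a minimal in-memory sketch in plain Java. It is not Elasticsearch code: the class and field names (CategoryContextSketch, Entry, Clause, Scored) and the sample data are hypothetical, and taking the highest score when several clauses match is an assumption of this toy model. It only mimics the rule described above: keep suggestions whose category matches a clause (exactly, or by prefix when prefix is true) and score them as boost multiplied by weight.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CategoryContextSketch {
    /** A stored suggestion: its text, the category indexed with it, and its weight. */
    static final class Entry {
        final String text; final String category; final int weight;
        Entry(String text, String category, int weight) {
            this.text = text; this.category = category; this.weight = weight;
        }
    }

    /** One query-time clause: the context value, the boost, and the prefix flag. */
    static final class Clause {
        final String context; final int boost; final boolean prefix;
        Clause(String context, int boost, boolean prefix) {
            this.context = context; this.boost = boost; this.prefix = prefix;
        }
    }

    /** A suggestion that survived filtering, with its computed score. */
    static final class Scored {
        final String text; final int score;
        Scored(String text, int score) { this.text = text; this.score = score; }
    }

    static List<Scored> suggest(List<Entry> entries, String typed, List<Clause> clauses) {
        List<Scored> out = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.text.startsWith(typed)) continue;      // completion is prefix-based
            int best = -1;
            for (Clause c : clauses) {
                boolean match = c.prefix
                        ? e.category.startsWith(c.context) // prefix = true
                        : e.category.equals(c.context);    // prefix = false (default)
                if (match) best = Math.max(best, c.boost * e.weight);
            }
            if (best >= 0) out.add(new Scored(e.text, best)); // no matching clause: filtered out
        }
        out.sort((a, b) -> Integer.compare(b.score, a.score)); // highest score first
        return out;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("vitamin e cream", "type1", 2),
                new Entry("vitamin d drops", "type2", 1),
                new Entry("vitamin c tablets", "other", 5));
        for (Scored s : suggest(entries, "vita", Arrays.asList(new Clause("type", 1, true)))) {
            System.out.println(s.text + " -> " + s.score);
        }
    }
}
```

With the clause ("type", boost 1, prefix true), both type1 and type2 entries pass the filter while the "other" entry is dropped, even though it has the highest weight.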

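Geo location context

A geo context lets you associate one or more geo points or geohashes with a suggestion at index time. At query time, suggestions can be filtered and boosted if they lie within a certain distance of a specified geo location.

Internally, geo points are encoded as geohashes with the specified precision.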
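
As an illustration of what "encoded as a geohash with a given precision" means, the following is a small standalone sketch of the standard geohash algorithm in plain Java. It is not taken from Elasticsearch; the class name GeoHashSketch and the sample coordinates are chosen here for demonstration only. Precision is simply the number of base-32 characters produced.

```java
class GeoHashSketch {
    // Geohash base-32 alphabet (digits plus letters without a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair as a geohash with the given number of characters. */
    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean useLon = true; // bits alternate between longitude and latitude, longitude first
        int bits = 0;
        int value = 0;
        while (hash.length() < precision) {
            if (useLon) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { value = (value << 1) | 1; lonMin = mid; }
                else            { value = value << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latMin = mid; }
                else            { value = value << 1;       latMax = mid; }
            }
            useLon = !useLon;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(57.64911, 10.40744, 6)); // prints "u4pruy"
    }
}
```

A lower precision yields a shorter hash that covers a larger cell, and the shorter hash is always a prefix of the longer one for the same point. This is why geohash prefixes are a convenient way to group nearby locations.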

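Suggester summary

In terms of precision: Completion > Phrase > Term, while recall runs in the opposite order. In terms of performance, the Completion Suggester is the fastest; if it meets the business requirements, using only the Completion Suggester for prefix matching is the ideal choice. The Phrase and Term suggesters search the inverted index, so their performance should be considerably lower by comparison. Keep the amount of data in the indices used by suggesters as small as possible; ideally, after a warm-up period, the whole index can be mapped into memory.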
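
The precision/recall trade-off can be illustrated with a toy comparison in plain Java. This is only a sketch under stated assumptions: a sorted set stands in for the completion suggester's in-memory structure, plain Levenshtein distance stands in for the term suggester's string distance, and the class name SuggesterTradeoffSketch and the sample terms are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class SuggesterTradeoffSketch {
    /** Completion-style lookup: every term that starts with the typed prefix. */
    static List<String> byPrefix(TreeSet<String> terms, String prefix) {
        // All strings with this prefix sort between prefix and prefix + '\uffff'.
        return new ArrayList<>(terms.subSet(prefix, true, prefix + Character.MAX_VALUE, true));
    }

    /** Term-style lookup: every term within maxEdits edits of the input. */
    static List<String> byEditDistance(TreeSet<String> terms, String input, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (levenshtein(input, t) <= maxEdits) out.add(t);
        }
        return out;
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList("lucene", "lucid", "elastic"));
        System.out.println(byPrefix(terms, "lucen"));          // prints [lucene]
        System.out.println(byPrefix(terms, "lucne"));          // prints []
        System.out.println(byEditDistance(terms, "lucne", 2)); // prints [lucene, lucid]
    }
}
```

The prefix lookup returns only exact continuations of what was typed and finds nothing for the misspelling "lucne" (high precision, low recall). The edit-distance lookup recovers "lucene" but also returns the less relevant "lucid" (higher recall, lower precision), and it has to compare against every candidate term.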
