Using Elasticsearch: Rare Terms Aggregation (new in version 7.3)

Introduction

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.3/search-aggregations-bucket-rare-terms-aggregation.html

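In many cases, when we run a terms aggregation, we want the buckets that contain the most matching documents. Sometimes, however, we want the opposite: the rare terms. We could sort a terms aggregation by ascending count, but on large data sets this kind of aggregation can easily lead to unbounded error. For this purpose Elasticsearch provides the Rare Terms Aggregation.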

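It uses a resource-efficient algorithm with predictable results. It is an aggregation for identifying the data in the long tail of a keyword distribution, that is, terms with low document counts. Technically, the rare terms aggregation maintains a map of terms together with a counter for each value. Each time a term is encountered, its counter is incremented. If the counter exceeds a predefined threshold, the term is removed from the map and inserted into a cuckoo filter. If a term is later found in the cuckoo filter, it is assumed to have been removed from the map earlier and is treated as "common". This aggregation is designed to be more memory-efficient than the alternatives: setting the size of a terms aggregation to MAX_LONG, or ordering a terms aggregation by ascending count (which can lead to unbounded error).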
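
The counting scheme described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Elasticsearch implements it: the function name `rare_terms` and the sample `genres` list are made up, and a plain `set` stands in for the cuckoo filter (a real cuckoo filter is a compact, approximate structure that can report false positives, while a set is exact).

```python
def rare_terms(values, max_doc_count=1):
    """Return {term: count} for terms seen at most max_doc_count times."""
    counts = {}     # candidate rare terms and their counters
    common = set()  # exact stand-in for the approximate cuckoo filter

    for term in values:
        if term in common:
            # Found in the filter: the term was evicted earlier, so it is "common".
            continue
        counts[term] = counts.get(term, 0) + 1
        if counts[term] > max_doc_count:
            # Over the threshold: drop it from the map and remember it in the filter.
            del counts[term]
            common.add(term)
    return counts


# Hypothetical sample data, not taken from the best_games index.
genres = ["Sports"] * 5 + ["Strategy"] * 2 + ["Puzzle"] * 3
result = rare_terms(genres, max_doc_count=2)
print(result)  # {'Strategy': 2}
```

The map only ever holds terms that are still candidates for being rare, which is where the memory saving over counting every term comes from.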

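Rare terms aggregation has many use cases. For example, SIEM users are often interested in rare events, which are sometimes suspected to be signs of a security incident. Rare terms aggregation is a new feature that Elastic introduced in version 7.3.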

Aggregation search

Preparing the data

First, download the test data:

best_games_json_data.zip

Then import the data into Elasticsearch through Kibana.

During the import, choose year as the Time field and specify the matching date format.

Name the index best_games.

terms aggregation

To illustrate the problem, let's first use a terms aggregation and sort the results in ascending order:

GET best_games/_search
{
  "size": 0,
  "aggs": {
    "normal_genre": {
      "terms": {
        "field": "genre",
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}

This returns the following result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 500,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "normal_genre" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 208,
      "buckets" : [
        {
          "key" : "Strategy",
          "doc_count" : 2
        },
        {
          "key" : "Adventure",
          "doc_count" : 7
        },
        {
          "key" : "Puzzle",
          "doc_count" : 8
        },
        {
          "key" : "Simulation",
          "doc_count" : 18
        },
        {
          "key" : "Fighting",
          "doc_count" : 19
        },
        {
          "key" : "Platform",
          "doc_count" : 34
        },
        {
          "key" : "Racing",
          "doc_count" : 35
        },
        {
          "key" : "Misc",
          "doc_count" : 44
        },
        {
          "key" : "Role-Playing",
          "doc_count" : 49
        },
        {
          "key" : "Sports",
          "doc_count" : 76
        }
      ]
    }
  }
}

We can see that the Strategy key has two documents, and that the buckets are sorted by document count in ascending order.

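This approach may seem fine, and it appears to meet our needs completely. However, the official Elastic documentation for the terms aggregation contains a warning about it.

According to that warning, sorting by ascending count is discouraged. As the number of documents grows it can produce errors, especially when the search spans multiple shards. To overcome this problem, we need to use the rare terms aggregation.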
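
To see why ascending order goes wrong across shards, here is a small simulation. It is a deliberately simplified sketch: the per-shard counts are invented, the helper names `shard_bottom` and `merged_ascending` are hypothetical, and each shard returns exactly `size` candidates (Elasticsearch actually asks each shard for more than `size`, which reduces the problem but does not remove it).

```python
from collections import Counter


def shard_bottom(counter, size):
    """Each shard returns only its `size` least frequent terms."""
    return sorted(counter.items(), key=lambda kv: (kv[1], kv[0]))[:size]


def merged_ascending(shards, size):
    """Merge per-shard candidates, summing only the counts that were returned."""
    merged = Counter()
    for shard in shards:
        for term, count in shard_bottom(shard, size):
            merged[term] += count
    return sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))[:size]


# Hypothetical per-shard document counts for three genres.
shard_a = Counter({"Sports": 1, "Puzzle": 5, "Racing": 6})
shard_b = Counter({"Sports": 90, "Puzzle": 4, "Racing": 7})

approx = merged_ascending([shard_a, shard_b], size=1)
exact = sorted((shard_a + shard_b).items(), key=lambda kv: (kv[1], kv[0]))[:1]

print(approx)  # [('Sports', 1)]
print(exact)   # [('Puzzle', 9)]
```

Sports looks rare on one shard, so it wins the merged ranking with a count of 1, even though it has 91 documents overall. The shard that holds most of the Sports documents never reports them, because Sports is not among its rarest terms.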

Rare terms aggregation

First, we run the following query:

GET best_games/_search
{
  "size": 0,
  "aggs": {
    "rare_genre": {
      "rare_terms": {
        "field": "genre",
        "max_doc_count": 1
      }
    }
  }
}

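Here we set max_doc_count to 1. The max_doc_count parameter controls the upper limit on the document count a term may have. Unlike the terms aggregation, the rare terms aggregation has no size parameter to control how many values are returned. This means that every term meeting the max_doc_count criterion is returned. The rare terms aggregation works this way to avoid the ascending-order problem that affects the terms aggregation.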

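However, this does mean that a poorly chosen value can return a large number of results. To limit the danger of this setting, the maximum value of max_doc_count is 100.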
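
If you build such requests from code, you can check the limit before sending anything. The helper below is a hypothetical sketch, not part of any Elasticsearch client: the name `rare_terms_body` is made up, the upper bound of 100 comes from the text above, and the lower bound of 1 is an assumption.

```python
import json

MAX_MAX_DOC_COUNT = 100  # upper limit stated in the text above


def rare_terms_body(field, max_doc_count=1, agg_name="rare_values"):
    """Build a rare_terms request body, rejecting out-of-range thresholds."""
    # Assumption: values below 1 are not meaningful for a document count limit.
    if not 1 <= max_doc_count <= MAX_MAX_DOC_COUNT:
        raise ValueError(
            f"max_doc_count must be between 1 and {MAX_MAX_DOC_COUNT}, got {max_doc_count}"
        )
    return {
        "size": 0,  # only the aggregation is wanted, no hits
        "aggs": {
            agg_name: {
                "rare_terms": {"field": field, "max_doc_count": max_doc_count}
            }
        },
    }


body = rare_terms_body("genre", max_doc_count=10, agg_name="rare_genre")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after `GET best_games/_search` in the Kibana Console.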

For our example above, the result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 500,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "rare_genre" : {
      "buckets" : [ ]
    }
  }
}

In other words, every genre has more than one document, so no term has a document count of 1 or less. What if we change the query to:

GET best_games/_search
{
  "size": 0,
  "aggs": {
    "rare_genre": {
      "rare_terms": {
        "field": "genre",
        "max_doc_count": 10
      }
    }
  }
}

Then the result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 500,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "rare_genre" : {
      "buckets" : [
        {
          "key" : "Strategy",
          "doc_count" : 2
        },
        {
          "key" : "Adventure",
          "doc_count" : 7
        },
        {
          "key" : "Puzzle",
          "doc_count" : 8
        }
      ]
    }
  }
}
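
The two rare_terms responses are consistent with the earlier terms aggregation output: keeping only the buckets whose doc_count is at most max_doc_count gives the same keys. The short check below copies the bucket values from the terms response shown earlier; the helper name `rare_buckets` is made up for illustration, and this filtering is only a way to reason about the results, not how Elasticsearch computes them.

```python
# Buckets copied from the ascending terms aggregation response shown earlier.
terms_buckets = [
    {"key": "Strategy", "doc_count": 2},
    {"key": "Adventure", "doc_count": 7},
    {"key": "Puzzle", "doc_count": 8},
    {"key": "Simulation", "doc_count": 18},
    {"key": "Fighting", "doc_count": 19},
    {"key": "Platform", "doc_count": 34},
    {"key": "Racing", "doc_count": 35},
    {"key": "Misc", "doc_count": 44},
    {"key": "Role-Playing", "doc_count": 49},
    {"key": "Sports", "doc_count": 76},
]


def rare_buckets(buckets, max_doc_count):
    """Keep the buckets whose document count does not exceed the threshold."""
    return [b for b in buckets if b["doc_count"] <= max_doc_count]


rare_1 = [b["key"] for b in rare_buckets(terms_buckets, 1)]
rare_10 = [b["key"] for b in rare_buckets(terms_buckets, 10)]

print(rare_1)   # []
print(rare_10)  # ['Strategy', 'Adventure', 'Puzzle']
```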
