Using Elasticsearch: the top_hits aggregation

Introduction

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

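The top_hits metric aggregator keeps track of the most relevant documents being aggregated. It is intended to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.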

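The top_hits aggregator can effectively be used to group a result set by certain fields via a bucket aggregator. One or more bucket aggregators determine by which properties the result set is sliced.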

Options:

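from - The offset of the first result to fetch.
size - The maximum number of top matching hits to return per bucket. By default, the top three matching hits are returned.
sort - How the top matching hits are sorted. By default, hits are sorted by the score of the main query.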
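
As a minimal sketch of how these three options fit together, the Python helper below builds a top_hits clause as a plain dictionary. The function name build_top_hits and its parameter names are hypothetical and chosen only for illustration; the keys from, size and sort are the ones from the option list above.

```python
import json

def build_top_hits(offset=0, size=3, sort=None):
    """Build a top_hits clause as a plain dict (hypothetical helper)."""
    clause = {"from": offset, "size": size}
    # Leaving "sort" out keeps the default ordering by query score.
    if sort is not None:
        clause["sort"] = sort
    return {"top_hits": clause}

# Two hits per bucket, largest "bytes" first.
clause = build_top_hits(size=2, sort=[{"bytes": {"order": "desc"}}])
print(json.dumps(clause, indent=2))
```

The resulting dictionary can be placed under a named sub-aggregation inside a bucket aggregation.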

Top_hits

Preparing the data

We use the official Sample web logs data set that ships with Kibana as our index (kibana_sample_data_logs).

Top hits aggregation

First, let's run a simple aggregation on the host field:

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.keyword",
        "size": 2
      }
    }
  }
}

The search above asks for two buckets (limited to two here to keep the example simple), keyed on the value of host. The result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hosts" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 2807,
      "buckets" : [
        {
          "key" : "artifacts.elastic.co",
          "doc_count" : 6488
        },
        {
          "key" : "www.elastic.co",
          "doc_count" : 4779
        }
      ]
    }
  }
}
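
To make the numbers in this response easier to read, here is a small sketch that adds up the bucket counts. The response dictionary is a trimmed copy of the output above, and summarize_terms is a hypothetical helper name. Assuming each document carries exactly one host value, the two doc_count values plus sum_other_doc_count give the number of documents that fell into any bucket. That figure is larger than the 10000 shown under hits.total because, by default, Elasticsearch counts hits exactly only up to 10000 and then reports the relation gte.

```python
# Trimmed copy of the response shown above.
response = {
    "hits": {"total": {"value": 10000, "relation": "gte"}},
    "aggregations": {
        "hosts": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 2807,
            "buckets": [
                {"key": "artifacts.elastic.co", "doc_count": 6488},
                {"key": "www.elastic.co", "doc_count": 4779},
            ],
        }
    },
}

def summarize_terms(agg):
    """Return (docs in the returned buckets, docs in all buckets)."""
    in_buckets = sum(b["doc_count"] for b in agg["buckets"])
    return in_buckets, in_buckets + agg["sum_other_doc_count"]

in_buckets, overall = summarize_terms(response["aggregations"]["hosts"])
print(in_buckets, overall)  # 11267 14074
```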

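Now the requirement is this: for each of these buckets, we want the first few documents in a sort order of our choosing, as in the following search: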

GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.keyword",
        "size": 2
      },
      "aggs": {
        "most_bytes": {
          "top_hits": {
            "sort": [
              {
                "bytes": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "bytes",
                "hosts",
                "ip",
                "clientip"
              ]
            },
            "size": 2
          }
        }
      }
    }
  }
}

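The request above nests a top_hits metric aggregation inside the terms aggregation as a sub-aggregation, so top_hits runs once for each bucket. Within each bucket, the documents are sorted by bytes in descending order and only the top two are returned: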

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hosts" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 2807,
      "buckets" : [
        {
          "key" : "artifacts.elastic.co",
          "doc_count" : 6488,
          "most_bytes" : {
            "hits" : {
              "total" : {
                "value" : 6488,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "kibana_sample_data_logs",
                  "_type" : "_doc",
                  "_id" : "KacW8XYBL1uEtTd-YKP_",
                  "_score" : null,
                  "_source" : {
                    "bytes" : 19929,
                    "ip" : "127.155.255.9",
                    "clientip" : "127.155.255.9"
                  },
                  "sort" : [
                    19929
                  ]
                },
                {
                  "_index" : "kibana_sample_data_logs",
                  "_type" : "_doc",
                  "_id" : "7KcW8XYBL1uEtTd-ZKn1",
                  "_score" : null,
                  "_source" : {
                    "bytes" : 19904,
                    "ip" : "100.177.58.231",
                    "clientip" : "100.177.58.231"
                  },
                  "sort" : [
                    19904
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "www.elastic.co",
          "doc_count" : 4779,
          "most_bytes" : {
            "hits" : {
              "total" : {
                "value" : 4779,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "kibana_sample_data_logs",
                  "_type" : "_doc",
                  "_id" : "lacW8XYBL1uEtTd-abM-",
                  "_score" : null,
                  "_source" : {
                    "bytes" : 19986,
                    "ip" : "233.204.30.48",
                    "clientip" : "233.204.30.48"
                  },
                  "sort" : [
                    19986
                  ]
                },
                {
                  "_index" : "kibana_sample_data_logs",
                  "_type" : "_doc",
                  "_id" : "dacW8XYBL1uEtTd-WpMO",
                  "_score" : null,
                  "_source" : {
                    "bytes" : 19956,
                    "ip" : "129.237.102.30",
                    "clientip" : "129.237.102.30"
                  },
                  "sort" : [
                    19956
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

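The response shows that the two hosts, artifacts.elastic.co and www.elastic.co, each return two documents, sorted by bytes in descending order. The returned _source contains only bytes, ip and clientip: the request also lists hosts in its includes, but no field of that name comes back (the field used for bucketing is host, so host was presumably intended).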
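
When such a response is consumed in application code, the nesting can be flattened in a few lines. The sketch below is only an illustration: top_hits_by_bucket is a hypothetical helper, and response is a hand-trimmed copy of the output above that keeps one hit per bucket.

```python
def top_hits_by_bucket(response, terms_name, top_hits_name):
    """Map each bucket key to the list of _source dicts of its top hits."""
    buckets = response["aggregations"][terms_name]["buckets"]
    return {
        b["key"]: [hit["_source"] for hit in b[top_hits_name]["hits"]["hits"]]
        for b in buckets
    }

# Hand-trimmed copy of the response above (one hit per bucket).
response = {
    "aggregations": {
        "hosts": {
            "buckets": [
                {
                    "key": "artifacts.elastic.co",
                    "doc_count": 6488,
                    "most_bytes": {"hits": {"hits": [
                        {"_source": {"bytes": 19929, "ip": "127.155.255.9"}},
                    ]}},
                },
                {
                    "key": "www.elastic.co",
                    "doc_count": 4779,
                    "most_bytes": {"hits": {"hits": [
                        {"_source": {"bytes": 19986, "ip": "233.204.30.48"}},
                    ]}},
                },
            ]
        }
    }
}

result = top_hits_by_bucket(response, "hosts", "most_bytes")
for host, docs in result.items():
    print(host, [d["bytes"] for d in docs])
```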

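Attentive readers may notice that this resembles field collapsing, which I introduced earlier. The difference is that field collapsing returns a single result per group, namely the top document under the sort order we specify. We can, of course, also return several more results per group in inner_hits.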
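
For comparison, a field collapsing request for the same question might look like the sketch below, built as a Python dictionary. It relies on the collapse and inner_hits options of the search API; the inner_hits name most_bytes and the helper name build_collapse_query are chosen here for illustration, and the request has not been run against the sample data.

```python
import json

def build_collapse_query(field, sort_field, inner_size):
    """Build a search body that collapses hits on `field` (hypothetical helper)."""
    order = [{sort_field: {"order": "desc"}}]
    return {
        # Top-level sort decides which document represents each group.
        "sort": order,
        "collapse": {
            "field": field,
            # inner_hits returns additional documents for each group.
            "inner_hits": {
                "name": "most_bytes",
                "size": inner_size,
                "sort": order,
            },
        },
    }

body = build_collapse_query("host.keyword", "bytes", 2)
print(json.dumps(body, indent=2))
```

Unlike the aggregation version, the groups here come back as ordinary search hits, so this body would be sent without "size": 0.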
