Elasticsearch使用:Bucket aggregation

2021-01-29 10:59:13 浏览数 (1)

简介

Elasticsearch 除了在搜索方面非常之快,对数据分析也是非常重要的一面。正确理解 Bucket aggregation 对我们使用 Kibana 非常重要。Elasticsearch 提供了非常多的 aggregation  [ˌæɡrɪˈɡeɪʃn] 可以供我们使用。其中 Bucket aggregation 对于初学者来说也是比较不容易理解的一个。在今天的这篇文章中,我来重点讲述这个。

官方文档:https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html

简单地说:一个桶代表一个具有共同标准的文档集合。存储桶(bucket)是聚合的关键要素。比如,我们想分析每个月的log流量:

存储桶聚合(bucket aggregation)不像指标聚合(Metric aggregation)那样计算字段的指标,而是创建文档存储桶。 每个存储桶都与一个标准(取决于聚合类型)相关联,该标准确定当前上下文中的文档是否“落入”其中。 换句话说,存储桶有效地定义了文档集。 除了存储桶本身之外,存储桶聚合还计算并返回落入每个存储桶的文档数量。

与指标聚合相反,存储桶聚合可以保存子聚合。 这些子聚合将针对其“父”存储桶聚合创建的存储桶进行聚合。

有不同的存储桶聚合器,每个聚合器都有不同的“存储桶”策略。 一些定义单个存储桶,一些定义固定数量的多个存储桶,另一些定义在聚合过程中动态创建存储桶。

尽管存储桶聚合不计算指标,但它们可以包含可以为存储桶聚合生成的每个存储桶计算指标的指标子聚合。 这使存储桶聚合对于粒度表示和分析Elasticsearch索引非常有用。 在本文中,我们将重点介绍直方图(histogram),范围(range),过滤器(filter)和术语(terms)等存储桶聚合。

id

user

age

country

category

1

Bill

30

FR

A

2

Marie

32

US

A

3

Clarie

32

US

A

4

Tom

44

DE

B

5

John

40

US

B

6

Emma

26

US

B

在上面的表格中,我们可以看到一个很规整的关于用户的名单。每天用户具有 id,user,age,country 及 category。当这些数据被存于到 Elasticsearch 中后,会变成一个一个的文档:

为了把这些数据导入到 Elasticsearch 之中,我们以通过如下的方法来进行。我们先创建这个 users 索引。

代码语言:javascript复制
PUT users
{
  "mappings": {
    "properties": {
      "age": {
        "type": "long"
      },
      "category": {
        "type": "keyword"
      },
      "country": {
        "type": "keyword"
      },
      "user": {
        "type": "keyword"
      }
    }
  }
}

上面创建了一个 users 的索引,同时也定义了它的 mapping。我们可以通过如下bulk API把数据导入:

代码语言:javascript复制
POST _bulk
{ "index" : { "_index" : "users", "_id": 1} }
{"user":"bill", "age": 30, "country": "FR", "category": "A"}
{ "index" : { "_index" : "users", "_id": 2} }
{"user":"Marie", "age": 32, "country": "US", "category": "A"}
{ "index" : { "_index" : "users", "_id": 3} }
{"user":"Clarie", "age": 32, "country": "US", "category": "A"}
{ "index" : { "_index" : "users", "_id": 4} }
{"user":"Tom", "age": 44, "country": "DE", "category": "B"}
{ "index" : { "_index" : "users", "_id": 5} }
{"user":"John", "age": 40, "country": "US", "category": "B"}
{ "index" : { "_index" : "users", "_id": 6} }
{"user":"Emma", "age": 26, "country": "US", "category": "B"}

假如我们想得到这些个用户的平均年龄,我们很容易通过 Elasticsearch 的 Avg aggregation 来得到。

代码语言:javascript复制
GET users/_search
{
  "size": 0,
  "aggs": {
    "avg_of_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

结果:
{
  "took" : 677,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_of_age" : {
      "value" : 34.0
    }
  }
}

那么他们的平均年龄是34岁。

接下来我们开始谈我们的重点了:Bucket aggregation。

简单地说:Bucket aggregation 是一种把具有相同标准的数据分组数据的方法。创建存储桶:

  • 收集具有共同标准的文件
  • 可以具有一个或多个与其关联的指标
  • bucket 每个存储桶的文档数(文档数)是默认指标

首先,我们可以按照 cetegory 进行分类:我们从上面的表格可以看出来 category A 有3个文档,而 category B 有3个文档。

上面的结果可以通过如下来查询:

代码语言:javascript复制
GET users/_search
{
  "size": 0,
  "aggs": {
    "categoeries": {
      "terms": {
        "field": "category"
      }
    }
  }
}

结果:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "categoeries" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 3
        },
        {
          "key" : "B",
          "doc_count" : 3
        }
      ]
    }
  }
}

上面我们得到了每个 category 的数量是我们想要的,但是在很多的情况下,我们更想得到在这每天 category 下的一些指标,比如每个 category 的平均年龄是多少?这样我实际上是在以 category 为 key 的存储桶里来求平均值。

我们可以通过如下的方法来得到这个:我们可以在每个category的桶里来做平均值。我们可以添加metrics来实际。

代码语言:javascript复制
GET users/_search
{
  "size": 0,
  "aggs": {
    "categoeries": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "avg_age_per_category": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "categoeries" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 3,
          "avg_age_per_category" : {
            "value" : 31.333333333333332
          }
        },
        {
          "key" : "B",
          "doc_count" : 3,
          "avg_age_per_category" : {
            "value" : 36.666666666666664
          }
        }
      ]
    }
  }
}

这样,我们就得到了在每个 category 下的平均年龄。我们可以再进一步想得到在每个 category 下的每个国家的平均年龄。显然这个时候,我们需要使用到 country 这个桶,这桶是在 category 桶下面的另外一个桶。

添加子桶聚合:

代码语言:javascript复制
GET users/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category"
      },
      "aggs": {
        "countries": {
          "terms": {
            "field": "country"
          },
          "aggs": {
            "average_age": {
              "avg": {
                "field": "age"
              }
            }
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "categories" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "A",
          "doc_count" : 3,
          "countries" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "US",
                "doc_count" : 2,
                "average_age" : {
                  "value" : 32.0
                }
              },
              {
                "key" : "FR",
                "doc_count" : 1,
                "average_age" : {
                  "value" : 30.0
                }
              }
            ]
          }
        },
        {
          "key" : "B",
          "doc_count" : 3,
          "countries" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "US",
                "doc_count" : 2,
                "average_age" : {
                  "value" : 33.0
                }
              },
              {
                "key" : "DE",
                "doc_count" : 1,
                "average_age" : {
                  "value" : 44.0
                }
              }
            ]
          }
        }
      ]
    }
  }
}

在这一节里,我们主要通过 terms aggregation 来展示存储桶的概念。我在下面用一个具体的例子来详细描述更多桶的操作。

桶聚合

为了说明介绍中提到的各种存储桶聚合,我们首先创建一个新的 “sports” 索引,该索引存储 “althlete” 文档的集合。 索引映射将包含诸如运动员的位置,姓名,等级,运动,年龄,进球数和场位置(例如防守者)之类的字段。 让我们创建映射:

代码语言:javascript复制
PUT sports
{
  "mappings": {
    "properties": {
      "birthdate": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "location": {
        "type": "geo_point"
      },
      "name": {
        "type": "keyword"
      },
      "rating": {
        "type": "integer"
      },
      "sport": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "goals": {
        "type": "integer"
      },
      "role": {
        "type": "keyword"
      },
      "score_weight": {
        "type": "float"
      }
    }
  }
}

一旦 mapping 创建成功,我们就可以使用 Elasticsearch 所提供的 Bulk API 来把我们数据导入到我们的索引中。我们可以通过一个REST 调用就把所有的数据导入到 Elasticsearch 中。

代码语言:javascript复制
POST sports/_bulk
{"index":{"_index":"sports"}}
{"name":"Michael","birthdate":"1989-10-1","sport":"Football","rating":["5","4"],"location":"46.22,-68.45","age":"23","goals":"43","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Bob","birthdate":"1989-11-2","sport":"Football","rating":["3","4"],"location":"45.21,-68.35","age":"33","goals":"54","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Jim","birthdate":"1988-10-3","sport":"Football","rating":["3","2"],"location":"45.16,-63.58","age":"28","goals":"73","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Joe","birthdate":"1992-5-20","sport":"Basketball","rating":["4","3"],"location":"45.22,-68.53","age":"18","goals":"848","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Tim","birthdate":"1992-2-28","sport":"Basketball","rating":["3","3"],"location":"46.22,-68.85","age":"28","goals":"942","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Alfred","birthdate":"1990-9-9","sport":"Football","rating":["2","2"],"location":"45.12,-68.35","age":"25","goals":"53","score_weight":"4","role":"defender"}
{"index":{"_index":"sports"}}
{"name":"Jeff","birthdate":"1990-4-1","sport":"Hockey","rating":["2","3"],"location":"46.12,-68.55","age":"26","goals":"93","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Will","birthdate":"1988-3-1","sport":"Hockey","rating":["4","4"],"location":"46.25,-84.25","age":"27","goals":"124","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Mick","birthdate":"1989-10-1","sport":"Football","rating":["3","4"],"location":"46.22,-68.45","age":"35","goals":"56","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Pong","birthdate":"1989-11-2","sport":"Basketball","rating":["1","3"],"location":"45.21,-68.35","age":"34","goals":"1483","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Ray","birthdate":"1988-10-3","sport":"Football","rating":["2","2"],"location":"45.16,-63.58","age":"31","goals":"84","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Ping","birthdate":"1992-5-20","sport":"Basketball","rating":["4","3"],"location":"45.22,-68.53","age":"27","goals":"1328","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Duke","birthdate":"1992-2-28","sport":"Hockey","rating":["5","2"],"location":"46.22,-68.85","age":"41","goals":"218","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Hal","birthdate":"1990-9-9","sport":"Hockey","rating":["4","2"],"location":"45.12,-68.35","age":"18","goals":"148","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Charge","birthdate":"1990-4-1","sport":"Football","rating":["3","2"],"location":"44.19,-82.55","age":"19","goals":"34","score_weight":"4","role":"defender"}
{"index":{"_index":"sports"}}
{"name":"Barry","birthdate":"1988-3-1","sport":"Football","rating":["5","2"],"location":"36.45,-79.15","age":"20","goals":"48","score_weight":"4","role":"defender"}
{"index":{"_index":"sports"}}
{"name":"Bank","birthdate":"1988-3-1","sport":"Handball","rating":["6","4"],"location":"46.25,-54.53","age":"25","goals":"150","score_weight":"4","role":"defender"}
{"index":{"_index":"sports"}}
{"name":"Bingo","birthdate":"1988-3-1","sport":"Handball","rating":["10","7"],"location":"46.25,-68.55","age":"29","goals":"143","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"James","birthdate":"1988-3-1","sport":"Basketball","rating":["10","8"],"location":"41.25,-69.55","age":"36","goals":"1284","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Wayne","birthdate":"1988-3-1","sport":"Hockey","rating":["10","10"],"location":"46.21,-68.55","age":"25","goals":"113","score_weight":"3","role":"midfielder"}
{"index":{"_index":"sports"}}
{"name":"Brady","birthdate":"1988-3-1","sport":"Handball","rating":["10","10"],"location":"63.24,-84.55","age":"29","goals":"443","score_weight":"2","role":"forward"}
{"index":{"_index":"sports"}}
{"name":"Lewis","birthdate":"1988-3-1","sport":"Football","rating":["10","10"],"location":"56.25,-74.55","age":"24","goals":"49","score_weight":"3","role":"midfielder"}

这样我们的数据就录入到 Elasticsearch 中了。在下面,我们就用不同的存储桶来对我们的数据进行统计。

Filter(s) Aggregations

桶聚合支持单过滤器聚合和多过滤器聚合。

单个过滤器聚合根据与过滤器定义中指定的查询或字段值匹配的所有文档构造单个存储桶。 当您要标识一组符合特定条件的文档时,单过滤器聚合很有用。

例如,我们可以使用单过滤器聚合来查找所有具有 “defender” 角色的运动员,并计算每个过滤桶的平均目标。 过滤器配置如下所示:

代码语言:javascript复制
POST sports/_search
{
  "size": 0,
  "aggs": {
    "defender_filter": {
      "filter": {
        "term": {
          "role": "defender"
        }
      },
      "aggs": {
        "avg_goals": {
          "avg": {
            "field": "goals"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "defender_filter" : {
      "doc_count" : 4,
      "avg_goals" : {
        "value" : 71.25
      }
    }
  }
}

如你所见,filter 聚合包含一个 “term” 字段,该字段指定文档中的字段以搜索特定值(在本例中为 “defender”)。 Elasticsearch 将遍历所有文档,并检查 “role” 字段中是否包含 “defender”。 然后将与该值匹配的文档添加到聚合生成的单个存储桶中。此输出表明我们集合中所有后卫的平均进球数为71.25。

这是单过滤器聚合的示例。 但是,在 Elasticsearch 中,你可以选择使用 filter 聚合指定多个过滤器。 这是一个多值聚合,其中每个存储桶都对应一个特定的过滤器。 我们可以修改上面的示例以过滤 defender 和 foward:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "athletes": {
      "filters": {
        "filters": {
          "defenders": {
            "term": {
              "role": "defender"
            }
          },
          "forwards": {
            "term": {
              "role": "forward"
            }
          }
        }
      },
      "aggs": {
        "avg_goals": {
          "avg": {
            "field": "goals"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "athletes" : {
      "buckets" : {
        "defenders" : {
          "doc_count" : 4,
          "avg_goals" : {
            "value" : 71.25
          }
        },
        "forwards" : {
          "doc_count" : 9,
          "avg_goals" : {
            "value" : 661.0
          }
        }
      }
    }
  }
}

我们可以看出来我们同时有两个桶,分别对应我们的两个 filter。每一个 filter 都检查 role 的值为 defender 或者 forward。

Terms Aggregation

也许我们对 Terms aggregation 并不陌生。我们刚才在一开始已经使用了 terms aggregation。

术语聚合会在文档的指定字段中搜索唯一值,并为找到的每个唯一值构建存储桶。 与过滤器聚合不同,术语聚合的任务不是将结果限制为特定值,而是查找文档中给定字段的所有唯一值。

看一下下面的示例,我们试图为 “sport” 字段中找到的每个唯一值创建一个存储桶。 这项操作的结果是,我们将为索引中的每种运动提供四个独特的存储桶:Football,Handball,Hockey 和 Basketbalk。 然后,我们将使用平均子聚合计算每种运动的平均目标:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "sports": {
      "terms": {
        "field": "sport"
      },
      "aggs": {
        "avg_scoring": {
          "avg": {
            "field": "goals"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "sports" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Football",
          "doc_count" : 9,
          "avg_scoring" : {
            "value" : 54.888888888888886
          }
        },
        {
          "key" : "Basketball",
          "doc_count" : 5,
          "avg_scoring" : {
            "value" : 1177.0
          }
        },
        {
          "key" : "Hockey",
          "doc_count" : 5,
          "avg_scoring" : {
            "value" : 139.2
          }
        },
        {
          "key" : "Handball",
          "doc_count" : 3,
          "avg_scoring" : {
            "value" : 245.33333333333334
          }
        }
      ]
    }
  }
}

如你所见,术语聚合为索引中的每种运动类型构造了四个存储桶。 每个存储桶包含 doc_count(属于存储桶的文档数)和每个运动的平均子聚合。

Histogram Aggregation

直方图聚合使我们可以根据指定的时间间隔构造存储桶。 属于每个间隔的值将形成一个间隔存储桶。 例如,假设我们要使用5年间隔将直方图聚合应用于 “age” 字段。 在这种情况下,直方图聚合将在我们的文档集中找到最小和最大年龄,并将每个文档与指定的时间间隔相关联。 每个文档的 “age” 字段将向下舍入到最接近的时间间隔存储桶。 例如,假设我们的时间间隔值为5,存储分区大小为6,则年龄32会四舍五入为30。

直方图聚合的公式如下所示:

代码语言:javascript复制
bucket_key = Math.floor((value - offset) / interval) * interval   offset

请注意,时间间隔必须为正十进制数,而偏移量必须为 [0,offset] 范围内的十进制。

让我们使用直方图聚合来生成篮球中目标间隔为200的存储桶。

代码语言:javascript复制
POST sports/_search
{
  "size": 0,
  "aggs": {
    "baskketball_filter": {
      "filter": {
        "term": {
          "sport": "Basketball"
        }
      },
      "aggs": {
        "goals_histogram": {
          "histogram": {
            "field": "goals",
            "interval": 200
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "baskketball_filter" : {
      "doc_count" : 5,
      "goals_histogram" : {
        "buckets" : [
          {
            "key" : 800.0,
            "doc_count" : 2
          },
          {
            "key" : 1000.0,
            "doc_count" : 0
          },
          {
            "key" : 1200.0,
            "doc_count" : 2
          },
          {
            "key" : 1400.0,
            "doc_count" : 1
          }
        ]
      }
    }
  }
}

上面的回答表明,没有目标落在0-200、200-400、400-600和600-800区间内。 因此,第一个存储区从800-1000间隔开始。 因此,值最小的文档将确定最小存储桶(最小key的存储桶)。 相应地,具有最高值的文档将确定最大存储桶(具有最高key的存储桶)。

此外,该响应还显示有零个文档落在[1000,1200)范围内。 这意味着没有运动员得分在1000到1200个目标之间。 默认情况下,Elasticsearch用空存储桶填充此类空白。 您可以使用min_doc_count设置通过请求最小计数不为零的存储桶来更改此行为。 例如,如果我们将 min_doc_count 的值设置为1,则直方图将仅针对其中包含不少于1个文档的间隔构造存储桶。 让我们修改查询,将min_doc_count设置为1。

代码语言:javascript复制
POST sports/_search
{
  "size": 0,
  "aggs": {
    "baskketball_filter": {
      "filter": {
        "term": {
          "sport": "Basketball"
        }
      },
      "aggs": {
        "goals_histogram": {
          "histogram": {
            "field": "goals",
            "interval": 200,
            "min_doc_count": 1
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "baskketball_filter" : {
      "doc_count" : 5,
      "goals_histogram" : {
        "buckets" : [
          {
            "key" : 800.0,
            "doc_count" : 2
          },
          {
            "key" : 1200.0,
            "doc_count" : 2
          },
          {
            "key" : 1400.0,
            "doc_count" : 1
          }
        ]
      }
    }
  }
}

我们还可以使用 extended_bounds 设置来强制直方图聚合,以根据特定的最小值开始构建其存储桶,并继续构建存储桶直至达到最大值(即使不再有文档)。 仅当将 min_doc_count 设置为0时,才使用 extended_bounds。(如果min_doc_count大于0,则不会返回空存储桶)。 看一下这个查询:

代码语言:javascript复制
POST sports/_search
{
  "size": 0,
  "aggs": {
    "baskketball_filter": {
      "filter": {
        "term": {
          "sport": "Basketball"
        }
      },
      "aggs": {
        "goals_histogram": {
          "histogram": {
            "field": "goals",
            "interval": 200,
            "min_doc_count": 0,
            "extended_bounds": {
              "min": 0,
              "max": 1600
            }
          }
        }
      }
    }
  }
}

结果:在这里,我们为存储桶指定了0(最小值)和1600(最大值)。 因此,响应应如下所示:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "baskketball_filter" : {
      "doc_count" : 5,
      "goals_histogram" : {
        "buckets" : [
          {
            "key" : 0.0,
            "doc_count" : 0
          },
          {
            "key" : 200.0,
            "doc_count" : 0
          },
          {
            "key" : 400.0,
            "doc_count" : 0
          },
          {
            "key" : 600.0,
            "doc_count" : 0
          },
          {
            "key" : 800.0,
            "doc_count" : 2
          },
          {
            "key" : 1000.0,
            "doc_count" : 0
          },
          {
            "key" : 1200.0,
            "doc_count" : 2
          },
          {
            "key" : 1400.0,
            "doc_count" : 1
          },
          {
            "key" : 1600.0,
            "doc_count" : 0
          }
        ]
      }
    }
  }
}

如您所见,即使第一个存储桶和最后一个存储桶根本没有任何值,也会生成从0到结束1600的所有存储桶。

Date histogram aggregation

这个聚合类似于正常的直方图,但只能与日期或日期范围值一起使用。 由于日期在 Elasticsearch 中内部以长值表示,因此也可以但不准确地对日期使用正常的直方图。 这两个 AP I的主要区别在于,可以使用日期/时间表达式指定间隔。 基于时间的数据需要特殊的支持,因为基于时间的间隔并不总是固定的长度。

在我们的数据中有一个叫做 birthdate 的字段。在建立我们的 index pattern 时,我们可以选择这个字段作为 time series 的timest amp。下面是我们的数据的时序图。

我们可以使用 date histogram aggregation 来统计我们的运动员出生年月分别情况:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "date_histogram": {
        "field": "birthdate",
        "interval": "year"
      }
    }
  }
}

结果:
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "birthdays" : {
      "buckets" : [
        {
          "key_as_string" : "1988-01-01T00:00:00.000Z",
          "key" : 567993600000,
          "doc_count" : 10
        },
        {
          "key_as_string" : "1989-01-01T00:00:00.000Z",
          "key" : 599616000000,
          "doc_count" : 4
        },
        {
          "key_as_string" : "1990-01-01T00:00:00.000Z",
          "key" : 631152000000,
          "doc_count" : 4
        },
        {
          "key_as_string" : "1991-01-01T00:00:00.000Z",
          "key" : 662688000000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1992-01-01T00:00:00.000Z",
          "key" : 694224000000,
          "doc_count" : 4
        }
      ]
    }
  }
}
上面的结果显示在1988年到1998年之间的运动员有10个,在1989和1990年之间的有4位。

当然我们也可以针对每个年龄段的 goals 的平均值:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "date_histogram": {
        "field": "birthdate",
        "interval": "year"
      },
      "aggs": {
        "average_goals": {
          "avg": {
            "field": "goals"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "birthdays" : {
      "buckets" : [
        {
          "key_as_string" : "1988-01-01T00:00:00.000Z",
          "key" : 567993600000,
          "doc_count" : 10,
          "average_goals" : {
            "value" : 251.1
          }
        },
        {
          "key_as_string" : "1989-01-01T00:00:00.000Z",
          "key" : 599616000000,
          "doc_count" : 4,
          "average_goals" : {
            "value" : 409.0
          }
        },
        {
          "key_as_string" : "1990-01-01T00:00:00.000Z",
          "key" : 631152000000,
          "doc_count" : 4,
          "average_goals" : {
            "value" : 82.0
          }
        },
        {
          "key_as_string" : "1991-01-01T00:00:00.000Z",
          "key" : 662688000000,
          "doc_count" : 0,
          "average_goals" : {
            "value" : null
          }
        },
        {
          "key_as_string" : "1992-01-01T00:00:00.000Z",
          "key" : 694224000000,
          "doc_count" : 4,
          "average_goals" : {
            "value" : 834.0
          }
        }
      ]
    }
  }
}
我们可以看到在1992年的这个年龄段的 average_goals 是最高的,达到834。

Range Aggregation

通过此存储桶聚合,可以轻松根据用户定义的范围构建存储桶。 Elasticsearch 将检查从你指定的数字字段中提取的每个值,并将其与范围进行比较,然后将该值放入相应的范围。 请注意,此聚合包括起始值,但不包括每个范围的起始值。

让我们为运动索引中的 “age” 字段创建一个范围汇总:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "goal_ranges": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "to": 20
          },
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30
          }
        ]
      }
    }
  }
}

结果:如您所见,我们为查询指定了三个范围。 这意味着Elasticsearch将创建与每个范围相对应的三个存储桶。 上面的查询应产生以下输出:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "goal_ranges" : {
      "buckets" : [
        {
          "key" : "*-20.0",
          "to" : 20.0,
          "doc_count" : 3
        },
        {
          "key" : "20.0-30.0",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 13
        },
        {
          "key" : "30.0-*",
          "from" : 30.0,
          "doc_count" : 6
        }
      ]
    }
  }
}

如输出所示,我们指数中最多的运动员年龄在20至30岁之间。

为了使范围更易理解,我们可以为每个范围自定义键名,如下所示:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "goal_ranges": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "key": "start-of-career",
            "to": 20
          },
          {
            "key": "mid-of-career",
            "from": 20,
            "to": 30
          },
          {
            "key": "end-of-cereer",
            "from": 30
          }
        ]
      }
    }
  }
}

结果:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "goal_ranges" : {
      "buckets" : [
        {
          "key" : "start-of-career",
          "to" : 20.0,
          "doc_count" : 3
        },
        {
          "key" : "mid-of-career",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 13
        },
        {
          "key" : "end-of-cereer",
          "from" : 30.0,
          "doc_count" : 6
        }
      ]
    }
  }
}

我们可以使用统计子聚合将更多信息添加到范围。 此汇总将为每个范围提供最小值,最大值,平均值和总和。 让我们来看看:

代码语言:javascript复制
GET sports/_search
{
  "size": 0,
  "aggs": {
    "goal_ranges": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "key": "start-of-career",
            "to": 20
          },
          {
            "key": "mid-of-career",
            "from": 20,
            "to": 30
          },
          {
            "key": "end-of-cereer",
            "from": 30
          }
        ]
      },
      "aggs": {
        "age_stats": {
          "stats": {
            "field": "age"
          }
        }
      }
    }
  }
}

结果:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "goal_ranges" : {
      "buckets" : [
        {
          "key" : "start-of-career",
          "to" : 20.0,
          "doc_count" : 3,
          "age_stats" : {
            "count" : 3,
            "min" : 18.0,
            "max" : 19.0,
            "avg" : 18.333333333333332,
            "sum" : 55.0
          }
        },
        {
          "key" : "mid-of-career",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 13,
          "age_stats" : {
            "count" : 13,
            "min" : 20.0,
            "max" : 29.0,
            "avg" : 25.846153846153847,
            "sum" : 336.0
          }
        },
        {
          "key" : "end-of-cereer",
          "from" : 30.0,
          "doc_count" : 6,
          "age_stats" : {
            "count" : 6,
            "min" : 31.0,
            "max" : 41.0,
            "avg" : 35.0,
            "sum" : 210.0
          }
        }
      ]
    }
  }
}

Geo Distance Aggregation

使用地理距离聚合,您可以定义一个原点和到该点的一组距离范围。然后,聚合将评估每个geo_point值到原点的距离,并确定文档属于哪个范围。如果文档的geo_point值与原点之间的距离落入该存储桶的距离范围内,则该文档被视为属于该存储桶。

在下面的示例中,我们的原点的纬度值为46.22,经度值为-68.85。我们使用原点46.22,-68.85的字符串格式,其中第一个值定义纬度,第二个值定义经度。或者,您可以使用对象格式-{“ lat”:46.22,“ lon”:-68.85}或数组格式:[-68.85,46.22]基于GeoJson标准,其中第一个数字是lon和第二个是拉特

另外,我们以km值创建三个范围。默认距离单位是m(米),因此我们需要在“ unit”(单位)字段中明确设置km。其他受支持的距离单位是mi(英里),in(英寸),yd(码),cm(厘米)和mm(毫米)。

代码语言:javascript复制
POST sports/_search
{
  "size": 0,
  "aggs": {
    "althlete_location": {
      "geo_distance": {
        "field": "location",
        "origin": {
          "lat": 46.22,
          "lon": -68.85
        },
        "ranges": [
          {
            "to": 200
          },
          {
            "from": 200,
            "to": 400
          },
          {
            "from": 400
          }
        ]
      }
    }
  }
}

结果:
{
  "took" : 27,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 22,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "althlete_location" : {
      "buckets" : [
        {
          "key" : "*-200.0",
          "from" : 0.0,
          "to" : 200.0,
          "doc_count" : 2
        },
        {
          "key" : "200.0-400.0",
          "from" : 200.0,
          "to" : 400.0,
          "doc_count" : 0
        },
        {
          "key" : "400.0-*",
          "from" : 400.0,
          "doc_count" : 20
        }
      ]
    }
  }
}

结果表明,居住在距起点不超过200 km的运动员有2名,居住在距起点超过400 km的运动员有20名。

IP Range Aggregation

Elasticsearch还具有对IP范围的内置支持。 IP聚合的工作方式与其他范围聚合类似。 让我们为IP地址创建索引映射,以说明此聚合的工作方式:

代码语言:javascript复制
PUT ips
{
  "mappings": {
    "properties": {
      "ip_addr": {
        "type": "ip"
      }
    }
  }
}

POST ips/_bulk
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.0"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.1"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.2"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.3"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.4"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.5"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.6"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.7"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.8"}
{"index":{"_index":"ips"}}
{"ip_addr":"172.16.0.9"}

现在,当索引中包含一些数据时,让我们创建一个IP范围聚合:

代码语言:javascript复制
GET ips/_search
{
  "size": 0,
  "aggs": {
    "ip_ranges": {
      "ip_range": {
        "field": "ip_addr",
        "ranges": [
          {
            "to": "172.16.0.4"
          },
          {
            "from": "172.16.0.4"
          }
        ]
      }
    }
  }
}

结果:
{
  "took" : 557,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ip_ranges" : {
      "buckets" : [
        {
          "key" : "*-172.16.0.4",
          "to" : "172.16.0.4",
          "doc_count" : 4
        },
        {
          "key" : "172.16.0.4-*",
          "from" : "172.16.0.4",
          "doc_count" : 6
        }
      ]
    }
  }
}

0 人点赞