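Introduction
The aggregations framework helps provide aggregated data based on a search query. It is built on simple building blocks called aggregations, which can be composed to build complex summaries of the data.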
Aggregation
DELETE twitter
PUT twitter
{
"mappings": {
"properties": {
"DOB": {
"type": "date"
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"city": {
"type": "keyword"
},
"country": {
"type": "keyword"
},
"location": {
"type": "geo_point"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"province": {
"type": "keyword"
},
"uid": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Use the bulk API to import our data into Elasticsearch:
POST _bulk
{"index":{"_index":"twitter","_id":1}}
{"user":"张三","message":"今儿天气不错啊,出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}, "DOB": "1999-04-01"}
{"index":{"_index":"twitter","_id":2}}
{"user":"老刘","message":"出发,下一站云南!","uid":3,"age":22,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}, "DOB": "1997-04-01"}
{"index":{"_index":"twitter","_id":3}}
{"user":"李四","message":"happy birthday!","uid":4,"age":25,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}, "DOB": "1994-04-01"}
{"index":{"_index":"twitter","_id":4}}
{"user":"老贾","message":"123,gogogo","uid":5,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}, "DOB": "1989-04-01"}
{"index":{"_index":"twitter","_id":5}}
{"user":"老王","message":"Happy BirthDay My Friend!","uid":6,"age":26,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}, "DOB": "1993-04-01"}
{"index":{"_index":"twitter","_id":6}}
{"user":"老吴","message":"好友来了都今天我生日,好友来了,什么 birthday happy 就成!","uid":7,"age":28,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}, "DOB": "1991-04-01"}
Put simply, the aggregation syntax looks like this:
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>] } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
We can usually write aggs in place of "aggregations" above.
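The nesting in the syntax above can be sketched in Python as a small dictionary builder. This is only an illustration: the `agg` helper is a hypothetical name and not part of any Elasticsearch client, and the field names `city` and `age` come from the mapping used in this article.

```python
import json


def agg(name, agg_type, body, sub_aggs=None):
    """Build one named aggregation (hypothetical helper, not an Elasticsearch API)."""
    node = {agg_type: body}
    if sub_aggs:
        # Sub-aggregations sit next to the aggregation type, under "aggs".
        node["aggs"] = sub_aggs
    return {name: node}


# A terms aggregation with an avg sub-aggregation, expressed with the helper.
request_body = {
    "size": 0,
    "aggs": agg(
        "cities", "terms", {"field": "city"},
        sub_aggs=agg("average_age", "avg", {"field": "age"}),
    ),
}

print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted as the body of a `GET twitter/_search` request.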
range aggregation
We can split users into age ranges and count how many users fall into each range:
GET twitter/_search
{
"size": 0,
"aggs": {
"age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
}
}
}
}
Result:
{
"took" : 681,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age" : {
"buckets" : [
{
"key" : "20.0-30.0",
"from" : 20.0,
"to" : 30.0,
"doc_count" : 5
},
{
"key" : "30.0-40.0",
"from" : 30.0,
"to" : 40.0,
"doc_count" : 1
},
{
"key" : "40.0-50.0",
"from" : 40.0,
"to" : 50.0,
"doc_count" : 0
}
]
}
}
}
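Note that we set size to 0 above. For aggregations we are not interested in the matching documents (hits) themselves, only in the aggregation results.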
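In a range aggregation each bucket includes its from value and excludes its to value, which is why the user aged 30 is counted in the 30-40 bucket rather than in 20-30. The following is a minimal local sketch of that rule in plain Python. It does not talk to Elasticsearch, and the `range_buckets` function is a name made up for this illustration.

```python
ages = [20, 22, 25, 30, 26, 28]  # the age values from the bulk request
ranges = [(20, 30), (30, 40), (40, 50)]


def range_buckets(values, ranges):
    """Count values per range; the lower bound is inclusive, the upper exclusive."""
    return {
        f"{float(lo)}-{float(hi)}": sum(1 for v in values if lo <= v < hi)
        for lo, hi in ranges
    }


buckets = range_buckets(ages, ranges)
print(buckets)
```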
date_range aggregation
We can use date_range to count the documents that fall within given date ranges:
POST twitter/_search
{
"size": 0,
"aggs": {
"birth_range": {
"date_range": {
"field": "DOB",
"format": "yyyy-MM-dd",
"ranges": [
{
"from": "1989-01-01",
"to": "1990-01-01"
},
{
"from": "1991-01-01",
"to": "1992-01-01"
}
]
}
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"birth_range" : {
"buckets" : [
{
"key" : "1989-01-01-1990-01-01",
"from" : 5.99616E11,
"from_as_string" : "1989-01-01",
"to" : 6.31152E11,
"to_as_string" : "1990-01-01",
"doc_count" : 1
},
{
"key" : "1991-01-01-1992-01-01",
"from" : 6.62688E11,
"from_as_string" : "1991-01-01",
"to" : 6.94224E11,
"to_as_string" : "1992-01-01",
"doc_count" : 1
}
]
}
}
}
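The from and to values in the response are milliseconds since the Unix epoch, while from_as_string and to_as_string render the same instants using the requested format. A short sketch of the conversion, assuming UTC (the `millis_to_date` helper is a hypothetical name):

```python
from datetime import datetime, timezone


def millis_to_date(millis):
    """Convert epoch milliseconds to an ISO date string in UTC."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.date().isoformat()


# Bounds of the two buckets as they appear in the response above.
bounds = [5.99616E11, 6.31152E11, 6.62688E11, 6.94224E11]
dates = [millis_to_date(b) for b in bounds]
print(dates)
```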
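terms aggregation
We can also use a terms aggregation to find how often a value occurs. In the terms aggregation below, we take the documents whose message matches "happy birthday" (by default, match returns documents that contain either word) and group them by city.
GET twitter/_search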
{
"query": {
"match": {
"message": "happy birthday"
}
},
"size": 0,
"aggs": {
"city": {
"terms": {
"field": "city",
"size": 10
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"city" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "北京",
"doc_count" : 2
},
{
"key" : "上海",
"doc_count" : 1
}
]
}
}
}
As shown above, of the documents that match "happy birthday", two come from 北京 (Beijing) and one comes from 上海 (Shanghai).
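To make the two steps explicit (filter by the query, then count per city), here is a rough local re-creation in Python. The tokenizer is a simplification assumed for this sketch: it lowercases the text and keeps only ASCII word tokens, which is enough for these six messages but is not how the Elasticsearch analyzer is actually implemented.

```python
import re
from collections import Counter

# (message, city) pairs copied from the bulk request
docs = [
    ("今儿天气不错啊,出去转转去", "北京"),
    ("出发,下一站云南!", "北京"),
    ("happy birthday!", "北京"),
    ("123,gogogo", "北京"),
    ("Happy BirthDay My Friend!", "北京"),
    ("好友来了都今天我生日,好友来了,什么 birthday happy 就成!", "上海"),
]


def latin_tokens(text):
    # Simplified stand-in for text analysis: lowercase ASCII word tokens only.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


query_terms = {"happy", "birthday"}
# A match query uses OR by default: any shared term makes the document match.
matching_cities = [city for message, city in docs if latin_tokens(message) & query_terms]
# Equivalent of terms with "size": 10, ordered by document count.
city_counts = Counter(matching_cities).most_common(10)
print(city_counts)
```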
We can also use a script to generate terms that do not exist in the index and aggregate on them. For example, the following script produces a count of users by birth year:
POST twitter/_search
{
"size": 0,
"aggs": {
"birth_year": {
"terms": {
"script": {
"source": "2019 - doc['age'].value"
},
"size": 10
}
}
}
}
Result:
{
"took" : 14,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"birth_year" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1989",
"doc_count" : 1
},
{
"key" : "1991",
"doc_count" : 1
},
{
"key" : "1993",
"doc_count" : 1
},
{
"key" : "1994",
"doc_count" : 1
},
{
"key" : "1997",
"doc_count" : 1
},
{
"key" : "1999",
"doc_count" : 1
}
]
}
}
}
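The script hard-codes 2019, so it only yields real birth years while the age values are current as of that year. As a sanity check on the sample data, the sketch below compares the script's arithmetic with the year taken from the DOB field. It runs locally in Python and the variable names are illustrative only.

```python
from collections import Counter

# (age, DOB) pairs copied from the bulk request
users = [
    (20, "1999-04-01"), (22, "1997-04-01"), (25, "1994-04-01"),
    (30, "1989-04-01"), (26, "1993-04-01"), (28, "1991-04-01"),
]

# What the script computes for each document: 2019 - doc['age'].value
years_from_script = Counter(2019 - age for age, _ in users)
# The same statistic derived from the stored date of birth
years_from_dob = Counter(int(dob[:4]) for _, dob in users)

print(sorted(years_from_script.items()))
```

For this data set both approaches agree, so deriving the year from DOB would give the same buckets without depending on a fixed reference year.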
Histogram Aggregation
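A multi-bucket, value-source-based aggregation that can be applied to numeric values or numeric range values extracted from the documents. It dynamically builds fixed-size buckets (the size is also called the interval) over the values.
GET twitter/_search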
{
"size": 0,
"aggs": {
"age_distribution": {
"histogram": {
"field": "age",
"interval": 2
}
}
}
}
Result:
{
"took" : 54,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_distribution" : {
"buckets" : [
{
"key" : 20.0,
"doc_count" : 1
},
{
"key" : 22.0,
"doc_count" : 1
},
{
"key" : 24.0,
"doc_count" : 1
},
{
"key" : 26.0,
"doc_count" : 1
},
{
"key" : 28.0,
"doc_count" : 1
},
{
"key" : 30.0,
"doc_count" : 1
}
]
}
}
}
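Each value is assigned to a bucket by rounding it down to the nearest multiple of the interval, which is why the user aged 25 lands in the bucket with key 24. A minimal sketch of that rounding rule in Python, with `histogram_key` as a made-up helper name and the offset assumed to be 0 as in the request above:

```python
import math
from collections import Counter

ages = [20, 22, 25, 30, 26, 28]  # the age values from the bulk request
interval = 2


def histogram_key(value, interval, offset=0):
    """Round a value down to the start of its fixed-size bucket."""
    return math.floor((value - offset) / interval) * interval + offset


histogram = sorted(Counter(histogram_key(a, interval) for a in ages).items())
print(histogram)
```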
date_histogram
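This aggregation is similar to the normal histogram, but it can only be used with date or date range values. Because dates are represented internally in Elasticsearch as long values, it is possible, though less accurate, to use the normal histogram on dates as well.
GET twitter/_search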
{
"size": 0,
"aggs": {
"age_distribution": {
"date_histogram": {
"field": "DOB",
"interval": "year"
}
}
}
}
Result:
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_distribution" : {
"buckets" : [
{
"key_as_string" : "1989-01-01T00:00:00.000Z",
"key" : 599616000000,
"doc_count" : 1
},
{
"key_as_string" : "1990-01-01T00:00:00.000Z",
"key" : 631152000000,
"doc_count" : 0
},
{
"key_as_string" : "1991-01-01T00:00:00.000Z",
"key" : 662688000000,
"doc_count" : 1
},
{
"key_as_string" : "1992-01-01T00:00:00.000Z",
"key" : 694224000000,
"doc_count" : 0
},
{
"key_as_string" : "1993-01-01T00:00:00.000Z",
"key" : 725846400000,
"doc_count" : 1
},
{
"key_as_string" : "1994-01-01T00:00:00.000Z",
"key" : 757382400000,
"doc_count" : 1
},
{
"key_as_string" : "1995-01-01T00:00:00.000Z",
"key" : 788918400000,
"doc_count" : 0
},
{
"key_as_string" : "1996-01-01T00:00:00.000Z",
"key" : 820454400000,
"doc_count" : 0
},
{
"key_as_string" : "1997-01-01T00:00:00.000Z",
"key" : 852076800000,
"doc_count" : 1
},
{
"key_as_string" : "1998-01-01T00:00:00.000Z",
"key" : 883612800000,
"doc_count" : 0
},
{
"key_as_string" : "1999-01-01T00:00:00.000Z",
"key" : 915148800000,
"doc_count" : 1
}
]
}
}
}
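The response contains one bucket per calendar year between the earliest and latest DOB, including years with no documents. The Python sketch below rebuilds those buckets locally, assuming UTC. Note also that in newer Elasticsearch versions the interval parameter of date_histogram has been superseded by calendar_interval and fixed_interval, so the request may need adjusting depending on the version in use.

```python
from collections import Counter
from datetime import date, datetime, timezone

# DOB values copied from the bulk request
dobs = ["1999-04-01", "1997-04-01", "1994-04-01",
        "1989-04-01", "1993-04-01", "1991-04-01"]

docs_per_year = Counter(date.fromisoformat(d).year for d in dobs)

buckets = []
for year in range(min(docs_per_year), max(docs_per_year) + 1):
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    buckets.append({
        "key_as_string": start.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        "key": int(start.timestamp() * 1000),  # epoch milliseconds
        "doc_count": docs_per_year.get(year, 0),  # empty years get 0
    })

for bucket in buckets:
    print(bucket)
```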
cardinality aggregation
We can also use a cardinality aggregation to count how many distinct cities there are:
GET twitter/_search
{
"size": 0,
"aggs": {
"number_of_cities": {
"cardinality": {
"field": "city"
}
}
}
}
Result:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"number_of_cities" : {
"value" : 2
}
}
}
Metric aggregations
We can use metric aggregations to compute statistics over numeric data. For example, what is the average age of all users? We can use the following aggregation:
GET twitter/_search
{
"size": 0,
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_age" : {
"value" : 25.166666666666668
}
}
}
We can also compute the statistic only over the documents of users in 北京 (Beijing):
POST twitter/_search
{
"size": 0,
"query": {
"match": {
"city": "北京"
}
},
"aggs": {
"average_age_beijing": {
"avg": {
"field": "age"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_age_beijing" : {
"value" : 24.6
}
}
}
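Above, we first query for all users in 北京 and then compute the average age over those documents.
Aggregations normally run on the results of the query. Elasticsearch provides a special global aggregation that runs over all documents, unaffected by the query.
POST twitter/_search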
{
"size": 0,
"query": {
"match": {
"city": "北京"
}
},
"aggs": {
"average_age_beijing": {
"avg": {
"field": "age"
}
},
"average_age_all": {
"global": {},
"aggs": {
"age_global_avg": {
"avg": {
"field": "age"
}
}
}
}
}
}
Above, we added a global aggregation under average_age_all. Its average is computed over all 6 documents rather than only the 5 北京 documents matched by the query. The result is:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_age_beijing" : {
"value" : 24.6
},
"average_age_all" : {
"doc_count" : 6,
"age_global_avg" : {
"value" : 25.166666666666668
}
}
}
}
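The difference between the query-scoped average and the global one can be checked by hand. The sketch below recomputes both numbers locally in Python from the sample data. It is only an arithmetic check, not a call to Elasticsearch.

```python
# (city, age) pairs copied from the bulk request
users = [("北京", 20), ("北京", 22), ("北京", 25),
         ("北京", 30), ("北京", 26), ("上海", 28)]


def average(values):
    return sum(values) / len(values)


# Scoped to the query: only documents whose city is 北京
beijing_ages = [age for city, age in users if city == "北京"]
average_age_beijing = average(beijing_ages)

# Global scope: every document, regardless of the query
average_age_all = average([age for _, age in users])

print(len(beijing_ages), average_age_beijing)
print(len(users), average_age_all)
```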
We can also compute summary statistics for the age field, for example:
GET twitter/_search
{
"size": 0,
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_stats" : {
"count" : 6,
"min" : 20.0,
"max" : 30.0,
"avg" : 25.166666666666668,
"sum" : 151.0
}
}
}
Here we can see the number of values together with the minimum, maximum, average, and sum, all returned in a single response.
We can also retrieve just the maximum age:
GET twitter/_search
{
"size": 0,
"aggs": {
"age_max": {
"max": {
"field": "age"
}
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_max" : {
"value" : 30.0
}
}
}
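Aggregations typically operate on values extracted from the aggregated document set. These values can be extracted from a specific field using the field key inside the aggregation body, or they can be produced by a script. We can also use a script to transform each field value before it is aggregated:
GET twitter/_search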
{
"size": 0,
"aggs": {
"average_age_1.5": {
"avg": {
"field": "age",
"script": {
"source": "_value * params.correction",
"params": {
"correction": 1.5
}
}
}
}
}
}
Result:
{
"took" : 24,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_age_1.5" : {
"value" : 37.75
}
}
}
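The aggregation above multiplies each age by 1.5 before averaging. For a constant factor, this gives the same result as multiplying the average by 1.5.
We can also aggregate directly on a script without specifying a field. This lets us combine several fields into one value and aggregate on the result:
GET twitter/_search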
{
"size": 0,
"aggs": {
"average_2_times_age": {
"avg": {
"script": {
"source": "doc['age'].value * params.times",
"params": {
"times": 2.0
}
}
}
}
}
}
Result:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_2_times_age" : {
"value" : 50.333333333333336
}
}
}
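Both script variants apply the script to each value first and then average the transformed values. A small Python check of the two results above, using the age values from the sample data (the variable names are illustrative only):

```python
ages = [20, 22, 25, 30, 26, 28]  # the age values from the bulk request


def avg(values):
    return sum(values) / len(values)


# Value script: "_value * params.correction" with correction = 1.5
average_age_1_5 = avg([age * 1.5 for age in ages])

# Script without a field: "doc['age'].value * params.times" with times = 2.0
average_2_times_age = avg([age * 2.0 for age in ages])

print(average_age_1_5, average_2_times_age)
```

Because the factor is constant, scaling each value and scaling the final average are equivalent. That would no longer hold for a non-linear script.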
Percentile aggregation
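A percentile indicates the point below which a given percentage of the observed values fall. For example, the 95th percentile is the value that is greater than 95% of the observed values. This aggregation computes one or more percentiles over numeric values extracted from the aggregated documents. The values can be extracted from a specific numeric field in the documents or generated by a provided script.
Percentiles are often used to find outliers. In a normal distribution, the 0.13th and 99.87th percentiles represent three standard deviations from the mean. Any data that falls outside three standard deviations is usually considered an anomaly, which makes percentiles very useful from a statistical point of view.
Here is a simple example that shows how to use the percentiles aggregation:
GET twitter/_search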
{
"size": 0,
"aggs": {
"age_quartiles": {
"percentiles": {
"field": "age",
"percents": [
25,
50,
75,
100
]
}
}
}
}
Result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"age_quartiles" : {
"values" : {
"25.0" : 22.0,
"50.0" : 25.5,
"75.0" : 28.0,
"100.0" : 30.0
}
}
}
}
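Above, we used the field named age, which is a numeric field. The percentiles aggregation tells us the ages at the 25th, 50th, 75th, and 100th percentiles.
We can see that 25% of the users are aged 22.0 or younger, 50% are aged 25.5 or younger, 75% are aged 28.0 or younger, and all of them are aged 30 or younger. The 50th percentile (the median) is not the same as the average age, which we compute again here for comparison: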
GET twitter/_search
{
"size": 0,
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
Result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_age" : {
"value" : 25.166666666666668
}
}
}
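The gap between the 50th percentile and the average is simply the difference between a median and a mean. The sketch below reproduces both numbers with Python's standard statistics module. It is a local check only: Elasticsearch computes percentiles with an approximation algorithm, so other percentiles (such as the 25th) are not guaranteed to match a textbook interpolation.

```python
import statistics

ages = [20, 22, 25, 30, 26, 28]  # the age values from the bulk request

# Median: midpoint of the two middle values of the sorted list (25 and 26)
median_age = statistics.median(ages)

# Mean: sum of all ages divided by the number of users (151 / 6)
mean_age = statistics.mean(ages)

print(median_age, mean_age)
```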
More complex aggregations
We can combine the bucket and metric aggregations above to build more complex searches:
GET twitter/_search
{
"size": 0,
"aggs": {
"cities": {
"terms": {
"field": "city",
"order": {
"average_age": "desc"
},
"size": 5
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
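Above, we first use terms to create one bucket per city and then compute the average age of the documents in each bucket. By default, the buckets are ordered by document count, from most to fewest. In the search above, we explicitly order by average_age in descending order. The returned result is:
"aggregations" : {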
"cities" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "上海",
"doc_count" : 1,
"average_age" : {
"value" : 28.0
}
},
{
"key" : "北京",
"doc_count" : 5,
"average_age" : {
"value" : 24.6
}
}
]
}
}
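The result shows two cities: 上海 (Shanghai) and 北京 (Beijing). The 上海 bucket contains 1 document and the 北京 bucket contains 5. The average age for each city is computed as well. Because we sorted by average_age in descending order, 上海 comes first: its average age is higher than that of 北京.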
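The combination of a terms bucket, an avg sub-aggregation, and ordering by that sub-aggregation can be mirrored locally as "group, summarize, sort". This is a plain Python sketch over the sample data, not the way Elasticsearch executes the request.

```python
from collections import defaultdict

# (city, age) pairs copied from the bulk request
users = [("北京", 20), ("北京", 22), ("北京", 25),
         ("北京", 30), ("北京", 26), ("上海", 28)]

# Bucket step: one list of ages per city (the terms aggregation)
ages_by_city = defaultdict(list)
for city, age in users:
    ages_by_city[city].append(age)

# Metric step: average age inside each bucket (the avg sub-aggregation)
buckets = [
    {"key": city, "doc_count": len(ages), "average_age": sum(ages) / len(ages)}
    for city, ages in ages_by_city.items()
]

# Order by the sub-aggregation, descending, and keep at most 5 buckets
buckets.sort(key=lambda bucket: bucket["average_age"], reverse=True)
buckets = buckets[:5]

for bucket in buckets:
    print(bucket)
```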