学习管道聚合,是为了完成更复杂的聚合分析,通过本文,你将对管道聚合的各种类型的功用和使用场景有一个全面的掌握。当遇到聚合需求时,可以快速反应,选用合适的聚合类型。
ps:本文基于ES 7.7.1【文末附《管道聚合详解》xmind 获取方式】
管道聚合详解
前两天,我们已经学习ES的桶聚合和指标聚合,这是学习 Pipeline Agg 的基础,如果对这两个聚合还没有整体概念的伙伴,可点击:ES系列14:你知道25种(桶聚合)Bucket Aggs 类型各自的使用场景么、ES系列15:ES的指标聚合有哪些呢?在这里,我都给你总结好了
在掌握了Bucket Agg 和 Metric Agg 后,对于很多聚合分析场景,就已经可以得心应手了,如果有更为复杂的,比如说二次聚合,可能就需要用到管道聚合 Pipeline Agg 了。
什么叫 Pipeline Agg 呢?就是管道聚合:对其他聚合结果进行二次聚合。注意,管道聚合不能具有子聚合,但是根据其类型,它可以引用buckets_path 允许管道聚合链接的另一个管道。例如,您可以将两个导数链接在一起以计算第二个导数(即导数的导数)。
在系统学习管道聚合之前,我们需要先掌握管道聚合的必填参数 buckets_path 的语法。
01 buckets_path 的语法
1.1 与操作的聚合对象同级,就是 Agg_Name
代码语言:javascript复制POST /_search
{
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"calendar_interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" }
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" }
}
}
}
}
}
1.2 与操作的聚合对象不同级,就是 Agg_Name的相对路径,多个 Agg_Name间使用“>”连接
代码语言:javascript复制POST /_search
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"max_monthly_sales": {
"max_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
1.3 操作对象是多值桶某个key 的聚合,就是 Agg_Name1['keyName']>Agg_Name2
代码语言:javascript复制POST /_search
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"sale_type": {
"terms": {
"field": "type"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"hat_vs_bag_ratio": {
"bucket_script": {
"buckets_path": {
"hats": "sale_type['hat']>sales",
"bags": "sale_type['bag']>sales"
},
"script": "params.hats / params.bags"
}
}
}
}
}
}
1.4 特殊路径
代码语言:javascript复制POST /_search
{
"aggs": {
"my_date_histo": {
"date_histogram": {
"field":"timestamp",
"calendar_interval":"day"
},
"aggs": {
"the_movavg": {
"moving_avg": { "buckets_path": "_count" }
}
}
}
}
}
或者
POST /sales/_search
{
"size": 0,
"aggs": {
"histo": {
"date_histogram": {
"field": "date",
"calendar_interval": "day"
},
"aggs": {
"categories": {
"terms": {
"field": "category"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "categories._bucket_count"
},
"script": {
"source": "params.count != 0"
}
}
}
}
}
}
}
了解了buckets_path 的语法,接下来我们就详细学习管道聚合,在学习之前,我们要知道管道聚合根据输出结果的位置分为Parent【结果内嵌到现有的聚合分析结果中】 和 Sibling【结果和现有分析结果同级】 两类。
02 Parent 分类
2.1 Bucket Script 桶脚本聚合
场景示例:计算出每月T恤销售额与总销售额的比例百分比
代码语言:javascript复制POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"t-shirts": {
"filter": {
"term": {
"type": "t-shirt"
}
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"t-shirt-percentage": {
"bucket_script": {
"buckets_path": {
"tShirtSales": "t-shirts>sales",
"totalSales": "total_sales"
},
"script": "params.tShirtSales / params.totalSales * 100"
}
}
}
}
}
}
结果:
代码语言:javascript复制{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 200.0
}
},
"t-shirt-percentage": {
"value": 36.36363636363637
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 10.0
}
},
"t-shirt-percentage": {
"value": 16.666666666666664
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 175.0
}
},
"t-shirt-percentage": {
"value": 46.666666666666664
}
}
]
}
}
}
2.2 Bucket Selector 桶选择器聚合
场景示例:获取当月总销售额超过200的存储分区桶
代码语言:javascript复制POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"sales_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"totalSales": "total_sales"
},
"script": "params.totalSales > 200"
}
}
}
}
}
}
结果:
代码语言:javascript复制{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
}
}
]
}
}
}
2.3 Bucket Sort 桶排序聚合
场景示例:按降序返回总销售额最高的3个月相对应的存储桶
代码语言:javascript复制POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"sales_bucket_sort": {
"bucket_sort": {
"sort": [
{"total_sales": {"order": "desc"}}
],
"size": 3,
"from": 0
}
}
}
}
}
}
结果:
代码语言:javascript复制 "aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
},
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60.0
},
}
]
}
}
2.4 Cumulative Cardinality 累积基数聚合
场景示例1:获取网站每天的新访问者总的累计数量【后一天会累加前一天的,就是以第一天为基准】
代码语言:javascript复制GET /user_hits/_search
{
"size": 0,
"aggs" : {
"users_per_day" : {
"date_histogram" : {
"field" : "timestamp",
"calendar_interval" : "day"
},
"aggs": {
"distinct_users": {
"cardinality": {
"field": "user_id"
}
},
"total_new_users": {
"cumulative_cardinality": {
"buckets_path": "distinct_users"
}
}
}
}
}
}
结果:
代码语言:javascript复制 "aggregations": {
"users_per_day": {
"buckets": [
{
"key_as_string": "2019-01-01T00:00:00.000Z",
"key": 1546300800000,
"doc_count": 2,
"distinct_users": {
"value": 2
},
"total_new_users": {
"value": 2
}
},
{
"key_as_string": "2019-01-02T00:00:00.000Z",
"key": 1546387200000,
"doc_count": 2,
"distinct_users": {
"value": 2
},
"total_new_users": {
"value": 3
}
},
{
"key_as_string": "2019-01-03T00:00:00.000Z",
"key": 1546473600000,
"doc_count": 3,
"distinct_users": {
"value": 3
},
"total_new_users": {
"value": 4
}
}
]
}
}
}
结果分析:
请注意,第二天2019-01-02拥有两个不同的用户,
但total_new_users累积管道agg生成的指标仅增加到三个。
这意味着当天的两个用户中只有一个是新用户,
而在前一天已经看到了另一个用户。
第三天再次发生这种情况,三个用户中只有一个是全新的。
添加聚合:derivative 增量累积基数
场景示例2:获取网站每天增加了多少新用户【数据以前一天为基准】
代码语言:javascript复制GET /user_hits/_search
{
"size": 0,
"aggs" : {
"users_per_day" : {
"date_histogram" : {
"field" : "timestamp",
"calendar_interval" : "day"
},
"aggs": {
"distinct_users": {
"cardinality": {
"field": "user_id"
}
},
"total_new_users": {
"cumulative_cardinality": {
"buckets_path": "distinct_users"
}
},
"incremental_new_users": {
"derivative": {
"buckets_path": "total_new_users"
}
}
}
}
}
}
结果:
代码语言:javascript复制 "aggregations": {
"users_per_day": {
"buckets": [
{
"key_as_string": "2019-01-01T00:00:00.000Z",
"key": 1546300800000,
"doc_count": 2,
"distinct_users": {
"value": 2
},
"total_new_users": {
"value": 2
}
},
{
"key_as_string": "2019-01-02T00:00:00.000Z",
"key": 1546387200000,
"doc_count": 2,
"distinct_users": {
"value": 2
},
"total_new_users": {
"value": 3
},
"incremental_new_users": {
"value": 1.0
}
},
{
"key_as_string": "2019-01-03T00:00:00.000Z",
"key": 1546473600000,
"doc_count": 3,
"distinct_users": {
"value": 3
},
"total_new_users": {
"value": 4
},
"incremental_new_users": {
"value": 1.0
}
}
]
}
}
}
2.5 Cumulative Sum 累积总和聚合
场景示例:计算到当月为止,每月累计销售金额的总和
代码语言:javascript复制POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"cumulative_sales": {
"cumulative_sum": {
"buckets_path": "sales"
}
}
}
}
}
}
结果:
代码语言:javascript复制 "aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
},
"cumulative_sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
},
"cumulative_sales": {
"value": 610.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
},
"cumulative_sales": {
"value": 985.0
}
}
]
}
}
}
2.6 剩下4个Parent类型的管道聚合
先了解下,如有需要再深入学习
03 Sibling 分类
3.1 六种统计类管道聚合
我觉得对于这6种管道聚合,应该不用一一介绍了,就和指标聚合类似,Avg、Max、Min、Sum,一看就知道是什么意思,简单看一个示例即可。
场景示例:计算每月销售额总量的平均值
代码语言:javascript复制POST /_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "date",
"calendar_interval": "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"avg_monthly_sales": {
"avg_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
结果:
代码语言:javascript复制{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"avg_monthly_sales": {
"value": 328.33333333333333
}
}
}
对于 Stats Bucket 统计数据桶聚合,我们看下响应结果,知道统计了哪些指标即可:
代码语言:javascript复制 "aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60.0,
"max": 550.0,
"avg": 328.3333333333333,
"sum": 985.0
}
}
}
Extended Stats Bucket 扩展统计数据桶聚合,也直接看统计结果:
代码语言:javascript复制示例结果:
"stats_monthly_sales": {
"count": 3,
"min": 60.0,
"max": 550.0,
"avg": 328.3333333333333,
"sum": 985.0,
"sum_of_squares": 446725.0,
"variance": 41105.55555555556,
"std_deviation": 202.74505063146563,
"std_deviation_bounds": {
"upper": 733.8234345962646,
"lower": -77.15676792959795
}
}
}
2.2 Percentiles Bucket 百分数桶聚合
场景示例:计算每月总销售额存储桶对应的百分比位置的金额
代码语言:javascript复制POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"calendar_interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"percentiles_monthly_sales": {
"percentiles_bucket": {
"buckets_path": "sales_per_month>sales",
"percents": [ 25.0, 50.0, 75.0 ]
}
}
}
}
结果:
代码语言:javascript复制 "aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"percentiles_monthly_sales": {
"values" : {
"25.0": 375.0,
"50.0": 375.0,
"75.0": 550.0
}
}
}
}
写在最后
对于需要使用聚合分析的小伙伴,建议一定要对ES的3种聚合有一个整体的概念,知道ES的聚合能做哪些数据操作,从而面对各种聚合分析的需求时候,才能快速反应,知道该用什么样的操作,而不是绞尽脑汁,使用自己仅知道的Max、Sum等简单聚合去组合DSL。