Exploring Elasticsearch: range data types and aggregations (new in version 7.4)

2021-01-11 11:01:17

Introduction

Elasticsearch has a family of field data types called range types. The following range types are currently supported:

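| Data type | Description |
| --- | --- |
| integer_range | A range of signed 32-bit integers, with a minimum value of -2^31 and a maximum value of 2^31-1. |
| float_range | A range of single-precision 32-bit IEEE 754 floating point values. |
| long_range | A range of signed 64-bit integers, with a minimum value of -2^63 and a maximum value of 2^63-1. |
| double_range | A range of double-precision 64-bit IEEE 754 floating point values. |
| date_range | A range of date values, represented as unsigned 64-bit integer milliseconds elapsed since the system epoch. |
| ip_range | A range of IP values supporting IPv4 or IPv6 (or mixed) addresses. |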

Searching range data types

Here is a simple example that shows how range data types work. First, we create an index named range_index and define its mapping at the same time:

PUT range_index
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "expected_attendees": {
        "type": "integer_range"
      },
      "time_frame": {
        "type": "date_range",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

PUT range_index/_doc/1?refresh
{
  "expected_attendees": {
    "gte": 10,
    "lte": 20
  },
  "time_frame": {
    "gte": "2015-10-31 12:00:00",
    "lte": "2015-11-01"
  }
}

In the document above we supplied two range values, corresponding to the integer_range and date_range fields defined in the mapping.

Next, we can use a term query to search the integer_range field expected_attendees:

GET range_index/_search
{
  "query": {
    "term": {
      "expected_attendees": {
        "value": "10"
      }
    }
  }
}

Result:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "range_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "expected_attendees" : {
            "gte" : 10,
            "lte" : 20
          },
          "time_frame" : {
            "gte" : "2015-10-31 12:00:00",
            "lte" : "2015-11-01"
          }
        }
      }
    ]
  }
}

The document matches because 10 lies within the 10-20 range stored in it. To verify that the search really works, we can run another query:

GET range_index/_search
{
  "query": {
    "term": {
      "expected_attendees": {
        "value": "40"
      }
    }
  }
}

Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Because 40 is outside the 10-20 range, the search returns no hits.
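The matching rule behind these two term queries can be sketched in a few lines of Python. This is only a model of the behaviour shown above, not Elasticsearch's implementation; the `IntRange` class and the `term_matches` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """Closed integer range, mirroring the gte/lte bounds stored in the document."""
    gte: int
    lte: int


def term_matches(field: IntRange, value: int) -> bool:
    """Model of a term query on a range field: the value must lie inside the range."""
    return field.gte <= value <= field.lte


# Same bounds as expected_attendees in document 1 (assumed for this sketch).
expected_attendees = IntRange(gte=10, lte=20)

print(term_matches(expected_attendees, 10))  # 10 is inside 10-20
print(term_matches(expected_attendees, 40))  # 40 is outside 10-20
```

Both bounds are inclusive here because the document was indexed with gte and lte.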

Similarly, we can search against the date range:

GET range_index/_search
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2015-10-31",
        "lte": "2015-11-01",
        "relation": "within"
      }
    }
  }
}

Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "range_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "expected_attendees" : {
            "gte" : 10,
            "lte" : 20
          },
          "time_frame" : {
            "gte" : "2015-10-31 12:00:00",
            "lte" : "2015-11-01"
          }
        }
      }
    ]
  }
}

Conversely, if we search with a time interval that lies outside the stored range:

GET range_index/_search
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2017-10-31",
        "lte": "2018-11-01"
      }
    }
  }
}

Result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
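The first date query sets relation to within, while the second omits it, in which case the default relation, intersects, applies. The sketch below models the three relations a range query accepts on a range field (intersects, within, contains). It uses plain integers instead of dates to keep the comparison logic visible; `range_matches` is a hypothetical helper, not an Elasticsearch API.

```python
from typing import Tuple

Range = Tuple[int, int]  # (gte, lte), both bounds inclusive


def range_matches(doc: Range, query: Range, relation: str = "intersects") -> bool:
    """Model of how a range query compares a stored range with the query range."""
    doc_lo, doc_hi = doc
    q_lo, q_hi = query
    if relation == "intersects":
        # The two ranges share at least one value.
        return doc_lo <= q_hi and q_lo <= doc_hi
    if relation == "within":
        # The stored range lies entirely inside the query range.
        return q_lo <= doc_lo and doc_hi <= q_hi
    if relation == "contains":
        # The stored range entirely covers the query range.
        return doc_lo <= q_lo and q_hi <= doc_hi
    raise ValueError(f"unknown relation: {relation}")


doc = (10, 20)  # illustrative stored range
print(range_matches(doc, (5, 25), "within"))
print(range_matches(doc, (12, 15), "contains"))
print(range_matches(doc, (15, 30)))  # default relation: intersects
print(range_matches(doc, (30, 40)))  # no overlap at all
```

With within, a query range narrower than the stored range does not match, even though the two overlap.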

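Aggregating on range data types

In this section we demonstrate aggregations on range data types. This is a new feature in the Elasticsearch 7.4 release.

Aggregating on range fields makes it easier to count the ranges that overlap a particular bucket. For example, a date histogram aggregation on a range field lets you count the phone calls in progress during a particular minute, or the employees on vacation on a given day.

We will again use the sports data set from earlier. First, we create an index and its mapping:
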
PUT sports
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "birthdate": {
        "type": "date",
        "format": "date_optional_time"
      },
      "goals": {
        "type": "integer"
      },
      "location": {
        "type": "geo_point"
      },
      "name": {
        "type": "keyword"
      },
      "rating": {
        "type": "integer"
      },
      "role": {
        "type": "keyword"
      },
      "score_weight": {
        "type": "float"
      },
      "sport": {
        "type": "keyword"
      },
      "age_range": {
        "type": "integer_range"
      }
    }
  }
}

Note the age_range field above: its type is integer_range. We use the Bulk API provided by Elasticsearch to import the following data:

POST _bulk
{"index":{"_index":"sports"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Football", "rating": ["5", "4"],  "location":"46.22,-68.45","goals": "43","score_weight":"3","role":"midfielder","age": 30, "age_range": {"gte": 27, "lte": 30}  }
{"index":{"_index":"sports"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["4", "4"], "location":"46.25,-84.25", "goals": "124", "score_weight":"2", "role":"forward", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Football", "rating": ["3", "4"],  "location":"46.22,-68.45","goals": "56","score_weight":"3", "role":"midfielder", "age": 30, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Basketball", "rating": ["1", "3"],  "location":"45.21,-68.35","goals": "1483","score_weight":"2", "role":"forward", "age": 30, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Ray", "birthdate":"1988-10-3", "sport":"Football", "rating": ["2", "2"],  "location":"45.16,-63.58","goals": "84", "score_weight":"3", "role":"midfielder", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Basketball", "rating": ["4", "3"],  "location":"45.22,-68.53","goals": "1328", "score_weight":"2", "role":"forward", "age": 27, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Hockey", "rating": ["5", "2"],  "location":"46.22,-68.85", "goals": "218", "score_weight":"2", "role":"forward", "age": 27, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Hockey", "rating": ["4", "2"],  "location":"45.12,-68.35","goals": "148", "score_weight":"3", "role":"midfielder", "age": 29, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Football", "rating": ["3", "2"], "location":"44.19,-82.55","goals": "34", "score_weight":"4", "role":"defender", "age": 29, "age_range": {"gte": 27, "lte": 30}}
{"index":{"_index":"sports"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Football", "rating": ["5", "2"], "location":"36.45,-79.15", "score_weight":"4", "role":"defender", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["6", "4"], "location":"46.25,-54.53", "goals": "150", "score_weight":"4", "role":"defender", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["10", "7"], "location":"46.25,-68.55", "goals": "143", "score_weight":"3", "role":"midfielder", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"41.25,-69.55", "goals": "1284", "score_weight":"2", "role":"forward", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.21,-68.55", "goals": "113", "score_weight":"3", "role":"midfielder", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Handball", "rating": ["10", "10"], "location":"63.24,-84.55","goals": "443", "score_weight":"2", "role":"forward", "age": 31, "age_range": {"gte": 31, "lte": 32} }
{"index":{"_index":"sports"}}
{"name":"Lewis", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"56.25,-74.55","goals": "49", "score_weight":"3", "role":"midfielder", "age": 31, "age_range": {"gte": 31, "lte": 32} }
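The bulk request body above is newline-delimited JSON: each document is preceded by an action line, and the body has to end with a newline. If the data lives in a program rather than in a console, such a body could be assembled roughly as follows. `build_bulk_body` is a hypothetical helper and the two sample documents are abbreviated versions of the ones above.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document source
    # Every line, including the last one, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


players = [
    {"name": "Michael", "age": 30, "age_range": {"gte": 27, "lte": 30}},
    {"name": "Will", "age": 31, "age_range": {"gte": 31, "lte": 32}},
]
body = build_bulk_body("sports", players)
print(body, end="")
```

The resulting string can be sent as the body of POST _bulk by whatever HTTP client is in use.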

Note that the data defines two age brackets, 27-30 and 31-32, stored in the age_range field. Below we refer to them as range 1 and range 2.

First, let's run a histogram aggregation on the age field:

GET sports/_search
{
  "size": 0,
  "aggs": {
    "age_distogram": {
      "histogram": {
        "field": "age",
        "interval": 1
      }
    }
  }
}

This builds a histogram of the age distribution of our documents. The result is:

{
  "took" : 233,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 16,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_distogram" : {
      "buckets" : [
        {
          "key" : 27.0,
          "doc_count" : 2
        },
        {
          "key" : 28.0,
          "doc_count" : 0
        },
        {
          "key" : 29.0,
          "doc_count" : 2
        },
        {
          "key" : 30.0,
          "doc_count" : 3
        },
        {
          "key" : 31.0,
          "doc_count" : 9
        }
      ]
    }
  }
}
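For a plain numeric field, each value falls into exactly one bucket, whose key is the value rounded down to a multiple of the interval. The following sketch reproduces the counts above from the age values in the bulk data. It assumes an offset of 0 and fills empty buckets between the smallest and largest key, as the response does for key 28; `histogram` is a hypothetical helper, not Elasticsearch code.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def histogram(values: Iterable[float], interval: float) -> Dict[float, int]:
    """Bucket each value under floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) for v in values)
    if not counts:
        return {}
    lo, hi = min(counts), max(counts)
    # Emit every bucket between the first and last key, including empty ones.
    return {i * interval: counts.get(i, 0) for i in range(lo, hi + 1)}


# Ages of the 16 players, in the order they appear in the bulk request.
ages = [30, 31, 30, 30, 31, 27, 27, 29, 29, 31, 31, 31, 31, 31, 31, 31]
print(histogram(ages, 1))  # {27: 2, 28: 0, 29: 2, 30: 3, 31: 9}
```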

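We can also visualize this in Kibana:

[Figure: Kibana histogram of document counts by age]

The chart shows how the number of documents is distributed across ages.

Let's take a closer look at one of our documents:
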
"_source" : {
    "name":"Michael",
    "birthdate":"1989-10-1",
    "sport":"Football",
    "rating":[
        "5",
        "4"
    ],
    "location":"46.22,-68.45",
    "goals":"43",
    "score_weight":"3",
    "role":"midfielder",
    "age":30,
    "age_range":{
        "gte":27,
        "lte":30
    }
}

The document contains a field called age_range, which defines the age bracket the athlete belongs to. We can use this field to compute statistics over our data:

GET sports/_search
{
  "size": 0,
  "aggs": {
    "age_histogram": {
      "histogram": {
        "field": "age_range",
        "interval": 3
      }
    }
  }
}

Here we aggregate on age_range. The result is:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 16,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_histogram" : {
      "buckets" : [
        {
          "key" : 27.0,
          "doc_count" : 7
        },
        {
          "key" : 30.0,
          "doc_count" : 16
        }
      ]
    }
  }
}

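The result contains two buckets. The bucket with key 27 covers ages from 27 up to, but not including, 30 (our interval is 3). Its doc_count is 7: the bucket overlaps only range 1 (27-30), and range 1 holds 7 documents. The bucket with key 30 has a doc_count of 16, which is the total number of documents. Why is that?

The bucket with key 30 covers ages from 30 up to, but not including, 33. Range 1 (27-30) includes 30, and range 2 (31-32) lies entirely inside the bucket, so the bucket overlaps both ranges. Every document whose range overlaps a bucket is counted in that bucket, so the count is the sum of the documents in range 1 and range 2, that is, all 16 documents.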
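The counting rule can be checked with a small Python model: a document is added to every bucket its range overlaps, so a single document may be counted more than once. This is a sketch of the behaviour seen in the response, not Elasticsearch's implementation; `range_histogram` is a hypothetical helper, and the input reflects the bulk data (7 documents with range 27-30 and 9 with range 31-32).

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def range_histogram(ranges: Iterable[Tuple[int, int]], interval: int) -> Dict[int, int]:
    """Count, for each bucket, the ranges that overlap it (bounds inclusive)."""
    counts: Counter = Counter()
    for gte, lte in ranges:
        first = math.floor(gte / interval)  # index of the bucket holding the lower bound
        last = math.floor(lte / interval)   # index of the bucket holding the upper bound
        for i in range(first, last + 1):
            counts[i * interval] += 1       # the range touches every bucket in between
    return dict(sorted(counts.items()))


age_ranges = [(27, 30)] * 7 + [(31, 32)] * 9
print(range_histogram(age_ranges, 3))  # {27: 7, 30: 16}
```

Because documents can land in several buckets, the bucket counts add up to more than the 16 documents in the index.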

If we set interval to 2 instead, let's look at the result again:

GET sports/_search
{
  "size": 0,
  "aggs": {
    "age_histogram": {
      "histogram": {
        "field": "age_range",
        "interval": 2
      }
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 16,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age_histogram" : {
      "buckets" : [
        {
          "key" : 26.0,
          "doc_count" : 7
        },
        {
          "key" : 28.0,
          "doc_count" : 7
        },
        {
          "key" : 30.0,
          "doc_count" : 16
        },
        {
          "key" : 32.0,
          "doc_count" : 9
        }
      ]
    }
  }
}

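The first bucket covers ages 26-27. Because 27 is in range 1 and range 1 holds 7 documents, its count is 7. The bucket with key 28 covers ages 28-29; 29 is also in range 1, so its count is 7 as well. The bucket with key 30 covers ages 30-31: 30 is in range 1 and 31 is in range 2, so its count is the sum of both ranges, 16. The bucket with key 32 covers ages 32-33. Only 32 is covered, by range 2, which holds 9 documents, so the count of this bucket is 9.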
