Getting Started with Elasticsearch (3)

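In this article, we look at how to analyze data with Elasticsearch and give a brief introduction to analyzers. Before starting, you should have completed the two previous exercises:

Getting Started with Elasticsearch (1): how to work with documents

Getting Started with Elasticsearch (2): how to search data

We build our index from the documents used in those two exercises and use it throughout this article.

Data analysis matters a great deal to many businesses. It helps us quickly spot problems in production and operations, and correct them or raise alerts in real time.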

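Introduction to Aggregations

The aggregations framework helps provide aggregated data based on a search query. It is built from simple building blocks called aggregations, which can be composed to build complex summaries of the data.

An aggregation can be seen as a unit of work that builds analytic information over a set of documents. The context of the execution defines what this document set is (for example, a top-level aggregation executes within the context of the executed query and filters of the search request).

There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into four main families:

Bucketing

A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the bucket criteria are evaluated on every document in the context, and when a criterion matches, the document is considered to "fall into" the relevant bucket. By the end of the aggregation process, we end up with a list of buckets, each with a set of documents that "belong" to it.

Metric

Aggregations that keep track of and compute metrics over a set of documents.

Matrix

A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.

Pipeline

Aggregations that aggregate the output of other aggregations and their associated metrics.

Next comes the interesting part. Since each bucket effectively defines a document set (all documents belonging to that bucket), aggregations can be associated at the bucket level, and they then execute within the context of that bucket. This is where the real power of aggregations lies: aggregations can be nested!

Note 1: Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations are computed for the buckets that their parent aggregation generates. There is no hard limit on the level or depth of nesting (an aggregation can be nested under a "parent" aggregation that is itself a sub-aggregation of another, higher-level aggregation).

Note 2: Aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on long values whose absolute value is greater than 2^53.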
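
The 2^53 limit in Note 2 comes from the 64-bit double format, which has a 53-bit significand. The following Python sketch only illustrates that number format with ordinary floats; it is not Elasticsearch code, and the helper name survives_double is made up for this illustration.

```python
# A double has a 53-bit significand, so integers above 2**53
# cannot all be represented exactly.
LIMIT = 2 ** 53


def survives_double(n: int) -> bool:
    """Return True if the integer n is unchanged after a round trip through a double."""
    return int(float(n)) == n


print(survives_double(LIMIT))      # True
print(survives_double(LIMIT + 1))  # False: rounds to a neighbouring value
print(survives_double(LIMIT + 2))  # True: this even number is still representable
```

Any long value that fails this round trip would be aggregated as its nearest double, which is why results on such values can only be approximate.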

An aggregation request is part of the search API, and it can be sent with or without a query.

Preparing the Data

To make the examples clearer, we first modify the twitter data from the previous articles slightly. We add a new field, DOB (date of birth). We also change the type of the province, city, and country fields to keyword. Run the following:

DELETE twitter

PUT twitter
{
  "mappings": {
    "properties": {
      "DOB": {
        "type": "date"
      },
      "address": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "age": {
        "type": "long"
      },
      "city": {
        "type": "keyword"
      },
      "country": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_point"
      },
      "message": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "province": {
        "type": "keyword"
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Then we use the bulk API again to load the data into Elasticsearch:

POST _bulk
{"index":{"_index":"twitter","_id":1}}
{"user":"张三","message":"今儿天气不错啊，出去转转去","uid":2,"age":20,"city":"北京","province":"北京","country":"中国","address":"中国北京市海淀区","location":{"lat":"39.970718","lon":"116.325747"}, "DOB": "1999-04-01"}
{"index":{"_index":"twitter","_id":2}}
{"user":"老刘","message":"出发，下一站云南！","uid":3,"age":22,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区台基厂三条3号","location":{"lat":"39.904313","lon":"116.412754"}, "DOB": "1997-04-01"}
{"index":{"_index":"twitter","_id":3}}
{"user":"李四","message":"happy birthday!","uid":4,"age":25,"city":"北京","province":"北京","country":"中国","address":"中国北京市东城区","location":{"lat":"39.893801","lon":"116.408986"}, "DOB": "1994-04-01"}
{"index":{"_index":"twitter","_id":4}}
{"user":"老贾","message":"123,gogogo","uid":5,"age":30,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区建国门","location":{"lat":"39.718256","lon":"116.367910"}, "DOB": "1989-04-01"}
{"index":{"_index":"twitter","_id":5}}
{"user":"老王","message":"Happy BirthDay My Friend!","uid":6,"age":26,"city":"北京","province":"北京","country":"中国","address":"中国北京市朝阳区国贸","location":{"lat":"39.918256","lon":"116.467910"}, "DOB": "1993-04-01"}
{"index":{"_index":"twitter","_id":6}}
{"user":"老吴","message":"好友来了都今天我生日，好友来了,什么 birthday happy 就成!","uid":7,"age":28,"city":"上海","province":"上海","country":"中国","address":"中国上海市闵行区","location":{"lat":"31.175927","lon":"121.383328"}, "DOB": "1991-04-01"}

Aggregation Operations

Put simply, the aggregation syntax looks like this:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]  } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

We can also write aggs in place of "aggregations" above.

Next, we run some simple operations on our data to make this more concrete.

range aggregation

We can split users into age bands and count how many users fall into each band:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 22
          },
          {
            "from": 22,
            "to": 25
          },
          {
            "from": 25,
            "to": 30
          }
        ]
      }
    }
  }
}

Here we use a range aggregation with the age bands defined above. The query returns one bucket per age band. The result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "age" : {
      "buckets" : [
        {
          "key" : "20.0-22.0",
          "from" : 20.0,
          "to" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : "22.0-25.0",
          "from" : 22.0,
          "to" : 25.0,
          "doc_count" : 1
        },
        {
          "key" : "25.0-30.0",
          "from" : 25.0,
          "to" : 30.0,
          "doc_count" : 3
        }
      ]
    }
  }
}

Note that we set size to 0 above, because for an aggregation we do not care about the returned hits. If we set it to 1 instead, we see the following output:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "user" : "张三",
          "message" : "今儿天气不错啊，出去转转去",
          "uid" : 2,
          "age" : 20,
          "city" : "北京",
          "province" : "北京",
          "country" : "中国",
          "address" : "中国北京市海淀区",
          "location" : {
            "lat" : "39.970718",
            "lon" : "116.325747"
          },
          "DOB" : "1999-04-01"
        }
      }
    ]
  },
  "aggregations" : {
    "age" : {
      "buckets" : [
        {
          "key" : "20.0-22.0",
          "from" : 20.0,
          "to" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : "22.0-25.0",
          "from" : 22.0,
          "to" : 25.0,
          "doc_count" : 1
        },
        {
          "key" : "25.0-30.0",
          "from" : 25.0,
          "to" : 30.0,
          "doc_count" : 3
        }
      ]
    }
  }
}

As we can see, one of the documents now also appears in the output.
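
To make the bucket counts in the response easier to check by hand, here is a minimal Python sketch of the same bucketing rule: in a range aggregation, from is inclusive and to is exclusive. The ages are copied from the sample data, and the function name range_buckets is a hypothetical helper, not part of any Elasticsearch client.

```python
# Ages of the six sample users, in document order.
AGES = [20, 22, 25, 30, 26, 28]


def range_buckets(values, ranges):
    """Count values per range; the lower bound is inclusive, the upper bound exclusive."""
    buckets = []
    for low, high in ranges:
        count = sum(1 for v in values if low <= v < high)
        buckets.append({"key": f"{float(low)}-{float(high)}", "doc_count": count})
    return buckets


for bucket in range_buckets(AGES, [(20, 22), (22, 25), (25, 30)]):
    print(bucket)
```

The user aged 30 is counted in none of these buckets, because 30 is excluded by the upper bound of the last range.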

date_range aggregation

We can use date_range to count the documents that fall within given time ranges:

POST twitter/_search
{
  "size": 0,
  "aggs": {
    "birth_range": {
      "date_range": {
        "field": "DOB",
        "format": "yyyy-MM-dd",
        "ranges": [
          {
            "from": "1989-01-01",
            "to": "1990-01-01"
          },
          {
            "from": "1991-01-01",
            "to": "1992-01-01"
          }
        ]
      }
    }
  }
}

Above, we query for documents whose date of birth (DOB) falls between 1989-01-01 and 1990-01-01, and between 1991-01-01 and 1992-01-01. The result is:

  "aggregations" : {
    "birth_range" : {
      "buckets" : [
        {
          "key" : "1989-01-01-1990-01-01",
          "from" : 5.99616E11,
          "from_as_string" : "1989-01-01",
          "to" : 6.31152E11,
          "to_as_string" : "1990-01-01",
          "doc_count" : 1
        },
        {
          "key" : "1991-01-01-1992-01-01",
          "from" : 6.62688E11,
          "from_as_string" : "1991-01-01",
          "to" : 6.94224E11,
          "to_as_string" : "1992-01-01",
          "doc_count" : 1
        }
      ]
    }
  }
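
The from and to values in the response, such as 5.99616E11, are the range bounds expressed as milliseconds since the Unix epoch. The Python sketch below reproduces those numbers and the two counts, assuming dates are interpreted as UTC midnight; the helper names are illustrative only.

```python
from datetime import datetime, timezone

# Dates of birth copied from the sample documents.
DOBS = ["1999-04-01", "1997-04-01", "1994-04-01", "1989-04-01", "1993-04-01", "1991-04-01"]


def to_epoch_millis(date_text: str) -> int:
    """Parse a yyyy-MM-dd date as UTC midnight and return epoch milliseconds."""
    parsed = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def date_range_count(dates, start, end):
    """Count dates in [start, end), compared as epoch milliseconds."""
    lo, hi = to_epoch_millis(start), to_epoch_millis(end)
    return sum(1 for d in dates if lo <= to_epoch_millis(d) < hi)


print(to_epoch_millis("1989-01-01"))                       # 599616000000
print(date_range_count(DOBS, "1989-01-01", "1990-01-01"))  # 1
print(date_range_count(DOBS, "1991-01-01", "1992-01-01"))  # 1
```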

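terms aggregation

We can also use a terms aggregation to see how often each value of a field occurs. In the terms aggregation below, we take all documents whose message matches "happy birthday" and group them by city.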

GET twitter/_search
{
  "query": {
    "match": {
      "message": "happy birthday"
    }
  },
  "size": 0,
  "aggs": {
    "city": {
      "terms": {
        "field": "city",
        "size": 10
      }
    }
  }
}

Note that the size of 10 here means the top 10 cities. The aggregation result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "city" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "北京",
          "doc_count" : 2
        },
        {
          "key" : "上海",
          "doc_count" : 1
        }
      ]
    }
  }
}

Above, we can see that among the documents matching "happy birthday", two come from 北京 (Beijing) and one comes from 上海 (Shanghai).
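
The query-then-aggregate flow can be sketched in plain Python. This is a rough approximation under stated assumptions: the tokenizer below only handles lowercase ASCII words, and a document matches if it shares at least one token with the query, which mirrors the default OR behaviour of a match query. The names latin_tokens and terms_by_city are hypothetical.

```python
import re
from collections import Counter

# (message, city) pairs copied from the sample documents.
DOCS = [
    ("今儿天气不错啊，出去转转去", "北京"),
    ("出发，下一站云南！", "北京"),
    ("happy birthday!", "北京"),
    ("123,gogogo", "北京"),
    ("Happy BirthDay My Friend!", "北京"),
    ("好友来了都今天我生日，好友来了,什么 birthday happy 就成!", "上海"),
]


def latin_tokens(text):
    """Very rough tokenizer: lowercase runs of ASCII letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def terms_by_city(docs, query):
    """Count cities over the documents that share at least one token with the query."""
    wanted = latin_tokens(query)
    matching = [city for message, city in docs if latin_tokens(message) & wanted]
    return Counter(matching).most_common()


print(terms_by_city(DOCS, "happy birthday"))  # [('北京', 2), ('上海', 1)]
```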

We can also use a script to generate terms that do not exist in the index and aggregate on them. For example, the following script produces a count of documents by birth year:

POST twitter/_search
{
  "size": 0,
  "aggs": {
    "birth_year": {
      "terms": {
        "script": {
          "source": "2019 - doc['age'].value"
        }, 
        "size": 10
      }
    }
  }
}

Above, the script:

        "script": {
          "source": "2019 - doc['age'].value"
        }

derives the birth year from the age, and the resulting counts are:

  "aggregations" : {
    "birth_year" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "1989",
          "doc_count" : 1
        },
        {
          "key" : "1991",
          "doc_count" : 1
        },
        {
          "key" : "1993",
          "doc_count" : 1
        },
        {
          "key" : "1994",
          "doc_count" : 1
        },
        {
          "key" : "1997",
          "doc_count" : 1
        },
        {
          "key" : "1999",
          "doc_count" : 1
        }
      ]
    }
  }

Above, the keys are 1989, 1991, 1993, and so on. None of these keys exists in any field of our original documents.
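
As a cross-check, the same computed terms can be derived in a few lines of Python. This is only a sketch of the idea (compute a value per document, then count it); the function name birth_year_terms is made up, and the ordering by count and then key is chosen to match the response shown above.

```python
from collections import Counter

# Ages copied from the sample documents.
AGES = [20, 22, 25, 30, 26, 28]


def birth_year_terms(ages, reference_year=2019):
    """Mimic the script `2019 - doc['age'].value` followed by a terms count."""
    counts = Counter(str(reference_year - age) for age in ages)
    # Order by document count (descending), then by key (ascending).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(birth_year_terms(AGES))
```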

Histogram Aggregation

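A multi-bucket, values-source-based aggregation that can be applied to numeric values or numeric range values extracted from the documents. It dynamically builds fixed-size buckets (the size is also called the interval) over the values.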

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_distribution": {
      "histogram": {
        "field": "age",
        "interval": 2
      }
    }
  }
}

The result:

  "aggregations" : {
    "age_distribution" : {
      "buckets" : [
        {
          "key" : 20.0,
          "doc_count" : 1
        },
        {
          "key" : 22.0,
          "doc_count" : 1
        },
        {
          "key" : 24.0,
          "doc_count" : 1
        },
        {
          "key" : 26.0,
          "doc_count" : 1
        },
        {
          "key" : 28.0,
          "doc_count" : 1
        },
        {
          "key" : 30.0,
          "doc_count" : 1
        }
      ]
    }
  }

The output shows one document in the 20-22 age band, and one in the 22-24 band.
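
Each value is assigned to a bucket whose key is the value rounded down to a multiple of the interval. A minimal Python sketch of that rule, with no offset and using a hypothetical helper named histogram:

```python
import math
from collections import Counter

# Ages copied from the sample documents.
AGES = [20, 22, 25, 30, 26, 28]


def histogram(values, interval):
    """Assign each value to the bucket key floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return sorted(counts.items())


print(histogram(AGES, 2))  # [(20, 1), (22, 1), (24, 1), (26, 1), (28, 1), (30, 1)]
```

The user aged 25 lands in the bucket with key 24, which covers ages from 24 up to but not including 26.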

date_histogram

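This aggregation is similar to the normal histogram, but it can only be used with date or date range values. Because dates are represented internally in Elasticsearch as long values, it is possible, but not as accurate, to use the normal histogram on dates as well.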

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_distribution": {
      "date_histogram": {
        "field": "DOB",
        "calendar_interval": "year"
      }
    }
  }
}

Above, we use DOB as the field for date_histogram and bucket the documents at one-year intervals. The result:

  "aggregations" : {
    "age_distribution" : {
      "buckets" : [
        {
          "key_as_string" : "1989-01-01T00:00:00.000Z",
          "key" : 599616000000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "1990-01-01T00:00:00.000Z",
          "key" : 631152000000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1991-01-01T00:00:00.000Z",
          "key" : 662688000000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "1992-01-01T00:00:00.000Z",
          "key" : 694224000000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1993-01-01T00:00:00.000Z",
          "key" : 725846400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "1994-01-01T00:00:00.000Z",
          "key" : 757382400000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "1995-01-01T00:00:00.000Z",
          "key" : 788918400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1996-01-01T00:00:00.000Z",
          "key" : 820454400000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1997-01-01T00:00:00.000Z",
          "key" : 852076800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "1998-01-01T00:00:00.000Z",
          "key" : 883612800000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "1999-01-01T00:00:00.000Z",
          "key" : 915148800000,
          "doc_count" : 1
        }
      ]
    }
  }

The result shows one document with a DOB between 1989-01-01 and 1990-01-01, and no documents between 1990-01-01 and 1991-01-01.
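
Note that the response also lists years with a doc_count of 0. The following Python sketch shows the same idea for yearly buckets: count per year, then fill the gaps between the first and last year. It is a simplification that only looks at the year part of each date, and yearly_histogram is a hypothetical name.

```python
from collections import Counter

# Dates of birth copied from the sample documents.
DOBS = ["1999-04-01", "1997-04-01", "1994-04-01", "1989-04-01", "1993-04-01", "1991-04-01"]


def yearly_histogram(dates):
    """Count dates per calendar year and include empty years in between."""
    counts = Counter(int(d[:4]) for d in dates)
    first, last = min(counts), max(counts)
    return [(year, counts.get(year, 0)) for year in range(first, last + 1)]


for year, doc_count in yearly_histogram(DOBS):
    print(year, doc_count)
```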

cardinality aggregation

We can also use the cardinality aggregation to count how many distinct cities there are:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "number_of_cities": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}

Running the query above gives:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "number_of_cities" : {
      "value" : 2
    }
  }
}

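It shows that we have two cities: 北京 (Beijing) and 上海 (Shanghai). Although each appears many times across the documents, there are only two distinct values.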
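
For a data set this small, the result equals an exact distinct count, which can be sketched in Python with a set. Keep in mind that the cardinality aggregation itself is an approximate count (it is based on the HyperLogLog++ algorithm), so on high-cardinality fields its answer may differ slightly from an exact count like this one. The helper name exact_cardinality is illustrative.

```python
# City of each sample document, in document order.
CITIES = ["北京", "北京", "北京", "北京", "北京", "上海"]


def exact_cardinality(values):
    """Return the exact number of distinct values."""
    return len(set(values))


print(exact_cardinality(CITIES))  # 2
```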

Metric aggregations

We can use metric aggregations to compute statistics over numeric data. For example, to find the average age of all users, we can use the following aggregation:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "average_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

The result is:

  "aggregations" : {
    "average_age" : {
      "value" : 25.166666666666668
    }
  }

The average age across all users is 25.166666666666668.

We can also restrict the statistic to the documents of users in 北京 (Beijing):

POST twitter/_search
{
  "size": 0,
  "query": {
    "match": {
      "city": "北京"
    }
  },
  "aggs": {
    "average_age_beijing": {
      "avg": {
        "field": "age"
      }
    }
  }
}

Above, we first query for all users in Beijing and then average the age over those documents. The result:

  "aggregations" : {
    "average_age_beijing" : {
      "value" : 24.6
    }
  }

Aggregations normally run over the search results of the query. Elasticsearch also provides a special global aggregation, which runs over all documents regardless of the query.

POST twitter/_search
{
  "size": 0,
  "query": {
    "match": {
      "city": "北京"
    }
  },
  "aggs": {
    "average_age_beijing": {
      "avg": {
        "field": "age"
      }
    },
    "average_age_all": {
      "global": {},
      "aggs": {
        "age_global_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Above, we add a global aggregation under average_age_all, so its average is computed over all 6 documents rather than only the 5 Beijing documents matched by the query. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "average_age_beijing" : {
      "value" : 24.6
    },
    "average_age_all" : {
      "doc_count" : 6,
      "age_global_avg" : {
        "value" : 25.166666666666668
      }
    }
  }
}
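
The difference between the query scope and the global scope can be sketched in plain Python: one average is taken over the filtered users, the other over everyone. The data is copied from the sample documents; the function names and the result keys query_scope and global_scope are hypothetical.

```python
# (city, age) pairs copied from the sample documents.
USERS = [("北京", 20), ("北京", 22), ("北京", 25), ("北京", 30), ("北京", 26), ("上海", 28)]


def average(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


def scoped_averages(users, city):
    """Average age within the query scope (one city) and the global scope (all users)."""
    return {
        "query_scope": average(age for c, age in users if c == city),
        "global_scope": average(age for _, age in users),
    }


print(scoped_averages(USERS, "北京"))
```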

We can also compute summary statistics over age, for example:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

The statistics are as follows:

  "aggregations" : {
    "age_stats" : {
      "count" : 6,
      "min" : 20.0,
      "max" : 30.0,
      "avg" : 25.166666666666668,
      "sum" : 151.0
    }
  }

Here we can see the number of values together with the minimum, maximum, average, and sum.
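
The five numbers returned by stats are simple to reproduce by hand. A minimal Python sketch over the sample ages, using a hypothetical helper named stats:

```python
# Ages copied from the sample documents.
AGES = [20, 22, 25, 30, 26, 28]


def stats(values):
    """Compute count, min, max, avg, and sum over a non-empty list of numbers."""
    numbers = [float(v) for v in values]
    total = sum(numbers)
    return {
        "count": len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "avg": total / len(numbers),
        "sum": total,
    }


print(stats(AGES))
```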

We can also ask for just the maximum age:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_max": {
      "max": {
        "field": "age"
      }
    }
  }
}

The result:

  "aggregations" : {
    "age_max" : {
      "value" : 30.0
    }
  }

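Aggregations usually operate on values extracted from the aggregated document set. These values can be extracted from a specific field, using the field key inside the aggregation body, or produced by a script. We can use a script to transform the values before they are aggregated: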

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "average_age_1.5": {
      "avg": {
        "field": "age",
        "script": {
          "source": "_value * params.correction",
          "params": {
            "correction": 1.5
          }
        }
      }
    }
  }
}

This aggregation multiplies each age by 1.5 before averaging, which gives the same result as multiplying the average by 1.5. Running it gives:

  "aggregations" : {
    "average_age_1.5" : {
      "value" : 37.75
    }
  }

As expected, the result is 1.5 times the earlier 25.166666666666668.

We can also aggregate using a script alone, without naming a specific field. This lets us combine several fields into one computed value and aggregate on that:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "average_2_times_age": {
      "avg": {
        "script": {
          "source": "doc['age'].value * params.times",
          "params": {
            "times": 2.0
          }
        }
      }
    }
  }
}

Here we do not use field at all. The script alone produces the values to aggregate:

  "aggregations" : {
    "average_2_times_age" : {
      "value" : 50.333333333333336
    }
  }
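
Both scripted variants boil down to the same thing: compute a value per document, then average the computed values. The Python sketch below models the script as an ordinary function; avg_of_script is a made-up helper, and the lambdas stand in for the two script sources used above.

```python
# Ages copied from the sample documents.
AGES = [20, 22, 25, 30, 26, 28]


def avg_of_script(values, script):
    """Apply `script` to every value, then return the average of the results."""
    computed = [script(v) for v in values]
    return sum(computed) / len(computed)


# Stand-in for "_value * params.correction" with correction = 1.5.
print(avg_of_script(AGES, lambda age: age * 1.5))  # 37.75

# Stand-in for "doc['age'].value * params.times" with times = 2.0.
print(avg_of_script(AGES, lambda age: age * 2.0))
```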

Percentile aggregation

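A percentile indicates the point below which a given percentage of the observed values fall. For example, the 95th percentile is the value that is greater than 95% of the observed values. This aggregation computes one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted from a specific numeric field in the documents or generated by a provided script.

Percentiles are often used to find outliers. In a normal distribution, the 0.13th and 99.87th percentiles lie three standard deviations from the mean. Data that falls outside three standard deviations is often considered an anomaly, which makes this very useful from a statistical point of view.

Here is a simple example that shows how the percentiles aggregation is used: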

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "age_quartiles": {
      "percentiles": {
        "field": "age",
        "percents": [
          25,
          50,
          75,
          100
        ]
      }
    }
  }
}

Above, we use the field named age, which is a numeric field. With the percentiles aggregation we can find the age at the 25th, 50th, 75th, and 100th percentiles. The result:

  "aggregations" : {
    "age_quartiles" : {
      "values" : {
        "25.0" : 22.0,
        "50.0" : 25.5,
        "75.0" : 28.0,
        "100.0" : 30.0
      }
    }
  }
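
To see where numbers such as 25.5 come from, here is a small Python sketch that interpolates between the sorted ages, treating the i-th sorted value as sitting at quantile (i + 0.5) / n. This particular interpolation rule is an assumption chosen because it reproduces the values shown above on this tiny data set; Elasticsearch itself computes percentiles with an approximate algorithm (TDigest by default), so on larger data the results need not agree exactly.

```python
# Ages copied from the sample documents.
AGES = [20, 22, 25, 30, 26, 28]


def percentile(values, percent):
    """Interpolated percentile, with sorted value i placed at quantile (i + 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    position = percent / 100 * n - 0.5
    position = min(max(position, 0.0), n - 1.0)  # clamp to the valid index range
    lower = int(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


for p in (25, 50, 75, 100):
    print(p, percentile(AGES, p))
```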

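We can see that 25% of the users are aged 22.0 or younger, 50% are aged 25.5 or younger, and everyone is aged 30 or younger. Note that the 50th percentile (the median) is not the same as the average age we computed earlier: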

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "average_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

The average age is:

  "aggregations" : {
    "average_age" : {
      "value" : 25.166666666666668
    }
  }

More Complex Aggregations

We can combine the bucket and metric aggregations above to build more complex searches:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "cities": {
      "terms": {
        "field": "city",
        "order": {
          "average_age": "desc"
        }, 
        "size": 5
      },
      "aggs": {
        "average_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

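Above, we first use terms to create one bucket per city, and then compute the average age of the documents in each bucket. By default, the buckets are ordered by document count, from most to fewest. In the search above, we instead order them by average_age in descending order. The result is: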

  "aggregations" : {
    "cities" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "上海",
          "doc_count" : 1,
          "average_age" : {
            "value" : 28.0
          }
        },
        {
          "key" : "北京",
          "doc_count" : 5,
          "average_age" : {
            "value" : 24.6
          }
        }
      ]
    }
  }

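The output shows two cities: 上海 (Shanghai) and 北京 (Beijing). There is 1 document for Shanghai and there are 5 documents for Beijing, and the average age is computed for each city. Because we sort by average_age in descending order, Shanghai comes first: its average age is higher than Beijing's.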
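
The nesting can be sketched in Python as a group-by followed by a per-group metric and a sort on that metric. This is only an illustration of the logic on the sample data; cities_by_average_age is a hypothetical helper.

```python
from collections import defaultdict

# (city, age) pairs copied from the sample documents.
USERS = [("北京", 20), ("北京", 22), ("北京", 25), ("北京", 30), ("北京", 26), ("上海", 28)]


def cities_by_average_age(users, size=5):
    """Bucket users by city, compute each bucket's average age, sort by it descending."""
    ages_by_city = defaultdict(list)
    for city, age in users:
        ages_by_city[city].append(age)
    buckets = [
        {"key": city, "doc_count": len(ages), "average_age": sum(ages) / len(ages)}
        for city, ages in ages_by_city.items()
    ]
    buckets.sort(key=lambda bucket: bucket["average_age"], reverse=True)
    return buckets[:size]


for bucket in cities_by_average_age(USERS):
    print(bucket)
```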

If you want to learn more about aggregations, see my other articles:

Elasticsearch: an introduction to aggregations
Elasticsearch: an introduction to pipeline aggregations
Elasticsearch: a thorough look at bucket aggregations in Elasticsearch

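Introduction to Analyzers

Elasticsearch can return search results within seconds. An important reason is that when a document is stored, its data is also indexed, which makes later searches fast. Put simply, when a document enters Elasticsearch, its text passes through an analysis stage before the resulting terms are written to the index.

That analysis stage is performed by an analyzer. An analyzer has three parts: char filters, a tokenizer, and token filters. Their roles are as follows: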

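Char filter: a character filter performs cleanup tasks, such as stripping HTML tags.
Tokenizer: the next step is to split the text into terms called tokens. This is done by the tokenizer. The split can follow any rule (for example, whitespace). For more details on tokenizers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.
Token filter: once tokens are created, they are passed to the token filters, which normalize them. A token filter can change tokens, remove terms, or add terms.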
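
The three stages can be pictured as three functions applied in order. The Python sketch below is a toy pipeline under simplifying assumptions (a regex that strips anything tag-like, whitespace splitting, and lowercasing); it is not how Elasticsearch implements analysis, and all function names are made up for the illustration.

```python
import re


def strip_html(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the char filter, then the tokenizer, then the token filter."""
    return lowercase_filter(whitespace_tokenizer(strip_html(text)))


print(analyze("<b>Happy</b> Birthday"))  # ['happy', 'birthday']
```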

Elasticsearch already ships with a rich set of analyzers. We can also create our own, or combine existing char filters, tokenizers, and token filters into a new analyzer, and we can define a separate analyzer for each field in a document. If you are interested in analyzers, see https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html.

The standard analyzer is the default analyzer in Elasticsearch:

No char filter
Uses the standard tokenizer
Lowercases the tokens and can optionally remove stop words. By default the stop words setting is _none_, meaning no stop words are filtered.

Below we briefly show what an analyzer produces.

GET twitter/_analyze
{
  "text": [
    "Happy Birthday"
  ],
  "analyzer": "standard"
}

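In the request above, we analyze the string "Happy Birthday" with the standard analyzer. The response contains two tokens, happy and birthday, both converted to lowercase, together with their position information in the text.

Many people are curious about how Chinese text is split. Let's run a simple experiment: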

GET twitter/_analyze
{
  "text": [
    "生日快乐"
  ],
  "analyzer": "standard"
}

The result is:

{
  "tokens" : [
    {
      "token" : "生",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "日",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "快",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "乐",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

We can see four tokens, one per character, and their type is now <IDEOGRAPHIC>. Next, we try the simple analyzer on a string that contains a period:

GET twitter/_analyze
{
  "text": [
    "Happy.Birthday"
  ],
  "analyzer": "simple"
}

The result is:

{
  "tokens" : [
    {
      "token" : "happy",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "birthday",
      "start_offset" : 6,
      "end_offset" : 14,
      "type" : "word",
      "position" : 1
    }
  ]
}

We can see that the "." in the string is treated as a separator, splitting Happy.Birthday into two tokens.
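
The simple analyzer breaks text at any character that is not a letter and lowercases the pieces. A rough Python approximation, using a regular expression for runs of letters and a hypothetical function name simple_analyze, reproduces the tokens and offsets shown above:

```python
import re


def simple_analyze(text):
    """Split on anything that is not a letter, then lowercase each token."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in simple_analyze("Happy.Birthday"):
    print(token)
```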

The next request uses the keyword tokenizer:

GET twitter/_analyze
{
  "text": ["Happy Birthday"],
  "tokenizer": "keyword"
}

With the keyword tokenizer, the whole string, however long, is treated as a single token. This is very useful for term-level searches and for aggregations. The analysis result is:

{
  "tokens" : [
    {
      "token" : "Happy Birthday",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}

We can also use filters to process the tokens, for example:

GET twitter/_analyze
{
  "text": ["Happy Birthday"],
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}

After this processing, the token becomes:

{
  "tokens" : [
    {
      "token" : "happy birthday",
      "start_offset" : 0,
      "end_offset" : 14,
      "type" : "word",
      "position" : 0
    }
  ]
}

We can also use a tokenizer on its own to analyze text:

standard tokenizer

POST _analyze
{
  "tokenizer": "standard",
  "text": "Those who dare to fail miserably can achieve greatly."
}

It produces the following tokens:

[Those, who, dare, to, fail, miserably, can, achieve, greatly]

keyword tokenizer

POST _analyze
{
  "tokenizer": "keyword",
  "text": "Los Angeles"
}

The result is:

{
  "tokens" : [
    {
      "token" : "Los Angeles",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}

If you are interested in analyzers, you can read more at https://www.elastic.co/guide/en/elasticsearch/reference/7.3/analysis-analyzers.html.

For further study, see my article "Elasticsearch: analyzer".

With this, we have covered the basics of Elasticsearch. All the scripts above can be downloaded from:

https://github.com/liu-xiao-guo/es-scripts-7.3

To learn more about the Elastic Stack, see the official documentation: https://www.elastic.co/guide/index.html

Original link: https://elasticstack.blog.csdn.net/article/details/99621105
