Elasticsearch Mapping

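An Elasticsearch mapping defines how documents are structured: for example, which fields a document has, the data type of each field, and which fields are indexed. This article covers four topics in turn: mapping types, mapping parameters, mapping fields, and mapping explosion.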

1 Mapping Type

Elasticsearch supports two kinds of mapping: dynamic mapping and explicit mapping.

1.1 Dynamic Mapping

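With dynamic mapping, you can index a document without first creating the index, let alone defining an explicit mapping: Elasticsearch automatically adds any new field found in the document to the index mapping. In addition, when a new string field is dynamically mapped as text, a keyword sub-field is generated for it by default. Dynamic mapping comes down to two steps:

Automatic detection of new fields
Automatic addition of new fields to the mapping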

Suppose an index named my-index-000001 already exists with the following mapping:

{
    "my-index-000001": {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text"
                },
                "date": {
                    "type": "date",
                    "store": true
                },
                "title": {
                    "type": "text",
                    "store": true
                }
            }
        }
    }
}

Next, we index a document that contains a new field, author:

PUT /my-index-000001/_doc/1
{
    "content": "i am optimus prime",
    "title": "transformers",
    "date": "2021-01-19",
    "author": "optimus prime"
}

Finally, retrieve the mapping of my-index-000001 (for example with GET /my-index-000001/_mapping) and confirm that it now contains the author field:

{
    "my-index-000001": {
        "mappings": {
            "properties": {
                "author": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "content": {
                    "type": "text"
                },
                "date": {
                    "type": "date",
                    "store": true
                },
                "title": {
                    "type": "text",
                    "store": true
                }
            }
        }
    }
}

1.2 Explicit Mapping

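Explicit mapping lets you define documents more precisely: for example, which fields are full-text fields, which fields are numeric, the format of date values, and custom rules for dynamic mapping.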
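
As an illustration, an explicit mapping can be supplied when the index is created. The request below is a minimal sketch: the index name my-index-000002 and the field names title, views, and published are hypothetical and chosen only for this example.

```
# Hypothetical index and field names, for illustration only
PUT /my-index-000002
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "views": {
                "type": "integer"
            },
            "published": {
                "type": "date",
                "format": "yyyy-MM-dd"
            }
        }
    }
}
```

Here title is a full-text field, views is numeric, and published only accepts dates written as yyyy-MM-dd.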

2 Mapping Parameter

2.1 type

The type parameter declares the data type of a field.

2.2 analyzer

Only text fields support the analyzer mapping parameter.

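The analyzer parameter specifies the analyzer used for text analysis of a text field, both at index time and at search time. If you want the same text field to use different analyzers at index time and at search time, declare the search-time analyzer separately with search_analyzer.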

{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_index_analyzer",
                "search_analyzer": "my_search_analyzer"
            }
        }
    }
}
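
The analyzer names in the mapping above must refer to analyzers that exist in the index. One way to define them is in the index settings, sketched here under the assumption that the goal is prefix (search-as-you-type) matching. The edge_ngram filter named autocomplete_filter and its gram sizes are illustrative choices, not part of the original example.

```
# Assumed definitions for my_index_analyzer and my_search_analyzer
PUT /my-index-000001
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "my_index_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"]
                },
                "my_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_index_analyzer",
                "search_analyzer": "my_search_analyzer"
            }
        }
    }
}
```

With these definitions, each indexed word is expanded into its prefixes while the query text is only lowercased, so a match query for tr can find a title containing transformers.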

2.3 boost

The boost is applied only for term queries.
Index time boost is deprecated. Instead, the field mapping boost is applied at query time.

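boost raises the weight of a field; its default value is 1. Typically you assign a boost to a specific field: the larger the value, the more that field contributes to the final relevance score.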

2.3.1 Option 1: set boost in the mapping

PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "boost": 2 
      },
      "content": {
        "type": "text"
      }
    }
  }
}

GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick brown fox"
      }
    }
  }
}

2.3.2 Option 2: set boost in the query

GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "quick brown fox",
        "boost": 2
      }
    }
  }
}

2.4 dynamic

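The dynamic parameter controls whether dynamic mapping is enabled. It accepts the following values:

Value	Description
true	Default. New fields are added to the mapping automatically.
runtime	New fields are added to the mapping as runtime fields.
false	New fields are not added to the mapping. They are neither indexed nor searchable, although they still appear in _source.
strict	If a new field is detected, an exception is thrown and the document is rejected.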
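
The dynamic parameter can be set for the whole mapping and overridden on individual object fields. The following sketch assumes a fresh index; the field names user, name, and social_networks are illustrative.

```
# Reject unknown fields everywhere except under user.social_networks
PUT /my-index-000001
{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "user": {
                "properties": {
                    "name": {
                        "type": "text"
                    },
                    "social_networks": {
                        "dynamic": true,
                        "properties": {}
                    }
                }
            }
        }
    }
}
```

With this mapping, a document that introduces a new field under user.social_networks should be accepted, while a document that introduces a new top-level field should be rejected.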

2.5 doc_values

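Doc values are an on-disk data structure built when a document is indexed. They store the same values as _source, but in a column-oriented layout that is far more efficient for sorting and aggregations. Almost all field types support doc_values; the notable exceptions are text and annotated_text fields.

doc_values defaults to true for every field that supports it. If you are sure that you will never sort or aggregate on a field, or access its value from a script, you can set doc_values to false to save disk space.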
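
A minimal sketch of disabling doc values on a single field follows. The field names are illustrative, and session_id is assumed to be used only for lookups, never for sorting, aggregations, or scripts.

```
# status_code keeps the default (doc_values: true); session_id opts out
PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "status_code": {
                "type": "keyword"
            },
            "session_id": {
                "type": "keyword",
                "doc_values": false
            }
        }
    }
}
```

The session_id field can still be queried, because it is still indexed.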

2.6 enabled

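Elasticsearch tries to index every field, but sometimes you only want to keep a field without indexing it, because you never need to search or aggregate on it. In that case, set enabled to false. The enabled parameter can be applied only to the top-level mapping definition and to fields of type object. When enabled is false, Elasticsearch skips parsing the contents of the field entirely, but the JSON can still be retrieved from _source. The default value is true.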

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "user_id": {
                "type": "keyword"
            },
            "last_updated": {
                "type": "date"
            },
            "session_data": {
                "type": "object",
                "enabled": false
            }
        }
    }
}

PUT /my-index-000001/_doc/session_1
{
    "user_id": "kimchy",
    "session_data": {
        "arbitrary_object": {
            "some_array": [
                "foo",
                "bar",
                {
                    "baz": 2
                }
            ]
        }
    },
    "last_updated": "2015-12-06T18:20:22"
}

GET /my-index-000001/_doc/session_1
{
    "_index": "my-index-000001",
    "_type": "_doc",
    "_id": "session_1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "user_id": "kimchy",
        "session_data": {
            "arbitrary_object": {
                "some_array": [
                    "foo",
                    "bar",
                    {
                        "baz": 2
                    }
                ]
            }
        },
        "last_updated": "2015-12-06T18:20:22"
    }
}

2.7 copy_to

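The copy_to parameter copies the values of multiple fields into one or more target fields, each of which can then be queried as a single field. A field that is populated through copy_to does not appear in _source. If you often search across several fields, copy_to lets you search fewer fields, which speeds up the search.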

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "first_name": {
                "type": "text",
                "copy_to": "full_name"
            },
            "last_name": {
                "type": "text",
                "copy_to": "full_name"
            },
            "full_name": {
                "type": "text"
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "first_name": "John",
    "last_name": "Smith"
}

GET /my-index-000001/_search
{
    "query": {
        "match": {
            "full_name": {
                "query": "John Smith",
                "operator": "and"
            }
        }
    }
}
{
    "took": 128,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "my-index-000001",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "first_name": "John",
                    "last_name": "Smith"
                }
            }
        ]
    }
}

2.8 fields

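You may want to index the same field in different ways: for example, to run full-text search on it and also use it for sorting or aggregations, or to analyze it with several different analyzers. The fields parameter (multi-fields) serves this purpose.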

2.8.1 Scenario 1: full-text search plus sorting and aggregations

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "city": {
                "type": "text",
                "fields": {
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
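
With the mapping above, city is used for full-text search and city.raw for sorting and aggregations. The request below is a sketch; the query text york and the aggregation name cities are illustrative.

```
# Search on the analyzed field, sort and aggregate on the keyword sub-field
GET /my-index-000001/_search
{
    "query": {
        "match": {
            "city": "york"
        }
    },
    "sort": {
        "city.raw": "asc"
    },
    "aggs": {
        "cities": {
            "terms": {
                "field": "city.raw"
            }
        }
    }
}
```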

2.8.2 Scenario 2: multiple analyzers on the same field

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "english": {
                        "type": "text",
                        "analyzer": "english"
                    }
                }
            }
        }
    }
}

2.9 format

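The format parameter specifies the formats in which date strings are expected, so that Elasticsearch can parse them into milliseconds since the epoch.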
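
Several formats can be combined with ||, and each one is tried in turn until a date string parses. The sketch below assumes a fresh index; the particular formats listed are an illustrative choice.

```
# Accept "2015-01-01 12:10:30", "2015-01-01", or a millisecond timestamp
PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "date": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}
```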

2.10 ignore_above

Strings longer than the ignore_above setting are not indexed, but the whole string is still present, intact, in the _source field. The limit is expressed in characters.
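
The effect can be observed with a small experiment. This is a sketch assuming a fresh index; the field name message, the limit of 20, and the sample values are illustrative.

```
# Values longer than 20 characters are kept in _source but not indexed
PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "message": {
                "type": "keyword",
                "ignore_above": 20
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "message": "Syntax error"
}

PUT /my-index-000001/_doc/2
{
    "message": "Syntax error with some long stacktrace"
}

GET /my-index-000001/_search
{
    "query": {
        "term": {
            "message": "Syntax error with some long stacktrace"
        }
    }
}
```

The term query is expected to return no hits, because the value of document 2 exceeds the limit and was never indexed, whereas GET /my-index-000001/_doc/2 should still return the full string in _source.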

2.11 index

The index parameter controls whether a field is indexed. The default value is true.

2.12 normalizer

A normalizer is a simplified analyzer that contains only character filters and token filters. Because it has no tokenizer, it always produces a single token.
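
A normalizer is defined in the index settings and referenced from a keyword field. In the sketch below, the normalizer name my_normalizer, the field name tag, and the sample value are assumptions made for illustration.

```
# Lowercase and strip accents before indexing the keyword as one token
PUT /my-index-000001
{
    "settings": {
        "analysis": {
            "normalizer": {
                "my_normalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "tag": {
                "type": "keyword",
                "normalizer": "my_normalizer"
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "tag": "BÀR"
}

GET /my-index-000001/_search
{
    "query": {
        "term": {
            "tag": "bar"
        }
    }
}
```

Because the value is normalized to bar at index time, and the normalizer is also applied to the query term, the term query should match the document.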

2.13 null_value

The null_value parameter replaces explicit null values with a specified value at index time. Without it, a null value cannot be indexed and therefore cannot be searched. For example:

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "status_code": {
                "type": "keyword",
                "null_value": "NULL"
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "status_code": null
}

GET /my-index-000001/_search
{
    "query": {
        "term": {
            "status_code": "NULL"
        }
    }
}
{
    "took": 66,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "my-index-000001",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "status_code": null
                }
            }
        ]
    }
}

2.14 position_increment_gap

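When a field with multiple values is indexed, Elasticsearch inserts a gap between the positions of the individual values. The size of the gap is set by position_increment_gap, which defaults to 100. A phrase query helps to illustrate the parameter.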

PUT /my-index-000001/_doc/1
{
  "names": [ "John Abraham", "Lincoln Smith"]
}

GET /my-index-000001/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln"
            }
        }
    }
}
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

Clearly, no document matched. Why not? Analyzing the two values with the _analyze API reveals the reason:

GET /_analyze
{
    "tokenizer": "standard",
    "text": [ "John Abraham", "Lincoln Smith"]
}
{
    "tokens": [
        {
            "token": "John",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "Abraham",
            "start_offset": 5,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "Lincoln",
            "start_offset": 13,
            "end_offset": 20,
            "type": "<ALPHANUM>",
            "position": 102
        },
        {
            "token": "Smith",
            "start_offset": 21,
            "end_offset": 26,
            "type": "<ALPHANUM>",
            "position": 103
        }
    ]
}

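The token positions confirm that position_increment_gap defaults to 100: Lincoln, the first token of the second value, would be at position 2 if the two values were simply concatenated, but its actual position is 102. The position is computed as follows:

final position = original position + position_increment_gap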

To let the phrase query match anyway, we can relax it with the slop parameter, which sets how many position moves are allowed between the query terms:

GET /my-index-000001/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 100
            }
        }
    }
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.010358453,
        "hits": [
            {
                "_index": "my-index-000001",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.010358453,
                "_source": {
                    "names": [
                        "John Abraham",
                        "Lincoln Smith"
                    ]
                }
            }
        ]
    }
}

With slop, phrase matching becomes much more flexible. Even the reversed phrase matches when slop is large enough:

GET /my-index-000001/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Lincoln Abraham",
                "slop": 102
            }
        }
    }
}
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.010158896,
        "hits": [
            {
                "_index": "my-index-000001",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.010158896,
                "_source": {
                    "names": [
                        "John Abraham",
                        "Lincoln Smith"
                    ]
                }
            }
        ]
    }
}
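
Instead of relaxing every query with slop, the gap itself can be changed in the mapping. The sketch below sets position_increment_gap to 0 on a separate, hypothetical index named my-index-000002, so that the values of names are indexed as if they were adjacent.

```
# With a gap of 0, a phrase can span two values of the same field
PUT /my-index-000002
{
    "mappings": {
        "properties": {
            "names": {
                "type": "text",
                "position_increment_gap": 0
            }
        }
    }
}

PUT /my-index-000002/_doc/1
{
    "names": [ "John Abraham", "Lincoln Smith" ]
}

GET /my-index-000002/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
```

Here the phrase query should match without any slop. Whether that is desirable depends on the data: the default gap of 100 exists precisely to stop phrases from matching across unrelated values.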

2.15 similarity

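The similarity parameter selects the scoring algorithm used to compute relevance. It accepts the following values:

Value	Description
BM25	The Okapi BM25 algorithm; the default.
classic	The TF/IDF algorithm; deprecated since Elasticsearch 7.0.0.
boolean	A simple boolean similarity for cases where full-text ranking is not needed; the score depends only on whether the query terms match.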
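
The similarity is chosen per field in the mapping. The following sketch assumes a fresh index; the field names default_field and boolean_sim_field are illustrative.

```
# default_field uses BM25; boolean_sim_field scores on match/no match only
PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "default_field": {
                "type": "text"
            },
            "boolean_sim_field": {
                "type": "text",
                "similarity": "boolean"
            }
        }
    }
}
```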

2.16 store

It’s also possible to store an individual field’s values by using the store mapping option. You can use the stored_fields parameter to include these stored values in the search response.

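By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but its original value cannot be retrieved on its own. Usually this does not matter, because the field value is already part of _source, which is stored by default. The store parameter defaults to false, so when is it worth setting it to true explicitly? Suppose a document has three fields, title, date, and content, and content is very long. If you only want to retrieve title and date, set store to true on those two fields: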

PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "store": true
            },
            "date": {
                "type": "date",
                "store": true
            },
            "content": {
                "type": "text"
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "title": "Some short title",
    "date": "2015-01-01",
    "content": "A very long content field..."
}

GET /my-index-000001/_search
{
    "stored_fields": [
        "title",
        "date"
    ]
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "my-index-000001",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "fields": {
                    "date": [
                        "2015-01-01T00:00:00.000Z"
                    ],
                    "title": [
                        "Some short title"
                    ]
                }
            }
        ]
    }
}

3 Mapping Field

Put simply, a document is a collection of fields. This chapter looks at fields from two angles: field types and field data types.

3.1 Field Type

3.1.1 Metadata Field

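Metadata field	Description
_id	Unique identifier of the document
_source	The original JSON passed in when the document was indexed; the _source field itself is not indexed, but it is stored
_type	Mapping type of the document
_index	Index to which the document belongs
_routing	Routing value of the document; defaults to the document ID
_doc_count	Number of documents that a pre-aggregated document represents; used when bucket aggregations compute doc_count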

3.1.2 Runtime Field

Runtime fields are still in beta as of Elasticsearch 7.11, so they are not covered here.

3.2 Field Data Type

3.2.1 Text

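The text data type is intended for full-text content. By default, Elasticsearch analyzes such content with the standard analyzer. Text analysis consists of three stages: pre-tokenization processing (character filters), tokenization, and post-tokenization processing (token filters). By default, text fields cannot be used for sorting or aggregations. For structured text such as email addresses, domain names, status codes, and tags, prefer the keyword data type.

Parameter	Default
analyzer	standard analyzer
boost	1.0
fielddata	false
fields	none
index	true
position_increment_gap	100
store	false
search_analyzer	Defaults to the analyzer setting
search_quote_analyzer	Defaults to the search_analyzer setting
similarity	BM25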

3.2.2 Keyword

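In terms of use cases, the keyword data type is typically used for fields that need sorting or aggregations. In terms of content, it suits structured text such as email addresses, domain names, and tags. In terms of search, it suits exact, term-level matching rather than full-text search. In fact, once a field is mapped as keyword, Elasticsearch does not analyze it at all. The parameter list reflects this: unlike text, keyword has no analyzer parameter.

Parameter	Default
boost	1.0
doc_values	true
fields	none
ignore_above	2147483647
index	true
null_value	null
store	false
similarity	BM25
normalizer	null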

3.2.3 IP

The ip data type holds IPv4 and IPv6 addresses.

Parameter	Default
boost	1.0
doc_values	true
index	true
null_value	null
store	false
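
ip fields are commonly queried with CIDR notation. The sketch below assumes a fresh index; the field name ip_addr and the addresses are illustrative.

```
# Index one address, then match every address in the 192.168.0.0/16 block
PUT /my-index-000001
{
    "mappings": {
        "properties": {
            "ip_addr": {
                "type": "ip"
            }
        }
    }
}

PUT /my-index-000001/_doc/1
{
    "ip_addr": "192.168.1.1"
}

GET /my-index-000001/_search
{
    "query": {
        "term": {
            "ip_addr": "192.168.0.0/16"
        }
    }
}
```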

3.2.4 Numeric

Elasticsearch supports the following main numeric data types:

Numeric data type	Range
long	[-2^63, 2^63-1]
integer	[-2^31, 2^31-1]
short	[-32768, 32767]
byte	[-128, 127]
float	Single-precision 32-bit IEEE 754 floating point number
double	Double-precision 64-bit IEEE 754 floating point number
unsigned_long	[0, 2^64-1]

Parameter	Default
boost	1.0
doc_values	true
index	true
null_value	null
store	false

3.2.5 Date

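JSON has no date data type, so a date in Elasticsearch can take any of the following forms:

A formatted date string, such as "2015-01-01" or "2015/01/01 12:10:30"
A 13-digit integer timestamp (milliseconds since the epoch)
A 10-digit integer timestamp (seconds since the epoch), which requires the epoch_second format to be configured

Parameter	Default
boost	1.0
doc_values	true
format	strict_date_optional_time||epoch_millis
locale	Root Locale
index	true
null_value	null
store	false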

3.2.6 Boolean

Parameter	Default
boost	1.0
doc_values	true
index	true
null_value	null
store	false

3.2.7 Object

Parameter	Default
dynamic	true
enabled	true
properties	(omitted)

4 Mapping Explosion & Mapping Limit Setting

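Mapping explosion occurs when too many fields are defined in an index, which can cause out-of-memory errors and situations that are difficult to recover from. To guard against mapping explosion, whether it comes from dynamic or explicit mapping, Elasticsearch provides the following limit settings:

Setting	Description	Default
index.mapping.total_fields.limit	Maximum number of fields in an index	1000
index.mapping.depth.limit	Maximum depth of a field, measured as the number of inner objects	20
index.mapping.nested_fields.limit	Maximum number of distinct nested mappings in an index	50
index.mapping.nested_objects.limit	Maximum number of nested JSON objects that a single document can contain across all nested types	10000
index.mapping.field_name_length.limit	Maximum length of a field name	Long.MAX_VALUE (no limit)

These limits can be changed through the index settings API, for example: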

PUT /<index-name>/_settings
{
    "settings": {
        "index": {
            "mapping": {
                "total_fields.limit": 100,
                "depth.limit": 10,
                "nested_fields.limit": 20,
                "nested_objects.limit": 100,
                "field_name_length.limit": 200
            }
        }
    }
}
