Elasticsearch Study Notes


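Configuration

The configuration files are located in the config directory.

elasticsearch.yml: Elasticsearch settings

cluster.name: the cluster name; nodes with the same cluster name are treated as members of the same cluster
node.name: the node name, used to tell the nodes of a cluster apart
network.host / http.port: the network address and port used by the HTTP and transport services
path.data: where data is stored
path.logs: where logs are stored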

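Development vs. Production mode

The mode is decided by whether the transport address is bound to localhost (network.host)
In Development mode, failed configuration checks are reported as warnings at startup
In Production mode, failed configuration checks are reported as errors at startup and the node exits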

A second way to change settings

bin/elasticsearch -E<setting name>=<setting value>

jvm.options: JVM parameters

log4j2.properties: logging configuration

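API

/_cat/nodes

Outputs information about the nodes in the cluster

/_cat/nodes?v

Outputs the node information in verbose form (with column headers); a * in the master column marks the master node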

/_cluster/stats

Outputs detailed information about the cluster

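REST API

REST stands for REpresentational State Transfer
The URL identifies the resource, such as an Index or a Document
The HTTP method identifies the operation on the resource: GET retrieves, POST updates, PUT creates, DELETE deletes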

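Index API

es has a dedicated Index API for creating, updating and deleting indices and their settings

PUT /${index_name} : create an index
GET _cat/indices : list the existing indices
DELETE /${index_name} : delete an index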
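
The convention above (the URL names the resource, the HTTP method names the operation) can be sketched in Python. This is only an illustration: the function names are hypothetical and not part of any Elasticsearch client, and nothing is sent over the network; each helper just returns the method and path that the Index API calls listed above would use.

```python
# Hypothetical helpers (not part of any Elasticsearch client library).
# Each one returns the (HTTP method, path) pair for an Index API call.

def create_index(index_name):
    """PUT /<index_name> creates an index."""
    return ("PUT", "/" + index_name)


def list_indices():
    """GET /_cat/indices lists the existing indices."""
    return ("GET", "/_cat/indices")


def delete_index(index_name):
    """DELETE /<index_name> deletes an index."""
    return ("DELETE", "/" + index_name)


if __name__ == "__main__":
    calls = (create_index("test_index"), list_indices(), delete_index("test_index"))
    for method, path in calls:
        print(method, path)
```

Keeping the method and path as plain data makes it easy to pass them to whatever HTTP client is in use.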

Document API

Creating documents

Create a document with an explicit id

# If the index does not exist when a document is created, es automatically creates the index and type
# request
# index_name/type/id
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 1
}

# response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_version": 1,  # incremented by 1 on every operation that changes the document; this is the basis of the locking mechanism
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

Create a document without specifying an id

# request
POST /test_index/doc
{
    "username": "tom",
    "age": 20
}

# response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "Mj-H2ABSmWv7ZHR8Oa3",  # generated automatically
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}

Retrieving documents

Retrieve a document by its id

# request
# index_name/type/id
GET /test_index/doc/1

# 200 response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {  # the original document data
        "username": "alfred",
        "age": 1
    }
}

# 404 response
{
    "_index": "test_index",
    "_type": "doc",
    "_id": "2",  # an id that does not exist
    "found": false
}

Searching documents

# request
# Use _search and send the query as JSON in the HTTP body
GET /test_index/doc/_search
{
    "query": {
        "term": {  # match the document whose id is 1
            "_id": "1"
        }
    }
}

# response
{
    "took": 0,  # query time in ms
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,  # total number of matching documents
        "max_score": 1,
        "hits": [  # array of matching documents with details; the first 10 by default
            {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_version": 1,
                "_score": 1,  # the document's score
                "_source": {  # the original document data
                    "username": "alfred",
                    "age": 1
                }
            },
            {
                ...
            }
        ]
    }
}

Bulk writes

es can create several documents in a single request, which reduces network overhead and improves write throughput

# request
# Supported action types:
#   index   creates a document, overwriting it if it already exists
#   create  creates a document, failing if it already exists
#   update  updates a document
#   delete  deletes a document
POST _bulk
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"username":"alfred","age":10}
{"delete":{"_index":"test_index","_type":"doc","_id":1}}
{"update":{"_id":"2","_index":"test_index","_type":"doc"}}
{"doc":{"age":"20"}}

# response
{
    "took": 33,  # time taken in ms
    "errors": false,
    "items": [  # the result of each bulk operation
        {
            "index": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "3",
                "_version": 1,
                "result": "created",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 201
            }
        },
        {
            "delete": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_version": 2,
                "result": "deleted",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 200
            }
        },
        {
            "update": {
                "_index": "test_index",
                "_type": "doc",
                "_id": "2",
                "_version": 2,
                "result": "updated",
                "_shards": {
                    "total": 2,
                    "successful": 1,
                    "failed": 0
                },
                "_seq_no": 0,
                "_primary_term": 1,
                "status": 200
            }
        }
    ]
}
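
The body of a bulk request is newline-delimited JSON: one action line, followed by a source line for the actions that carry data, with a newline terminating the final line. A minimal Python sketch of assembling such a body is given here. The helper name build_bulk_body is hypothetical, and the code only builds the string; sending it is left to whichever HTTP client is in use.

```python
import json


def build_bulk_body(operations):
    """Build a newline-delimited JSON body for POST _bulk.

    operations is a list of (action, metadata, payload) tuples, where
    payload is None for actions without a source line (delete).
    """
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:
            lines.append(json.dumps(payload))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ops = [
        ("index", {"_index": "test_index", "_type": "doc", "_id": 3},
         {"username": "alfred", "age": 10}),
        ("delete", {"_index": "test_index", "_type": "doc", "_id": 1}, None),
        ("update", {"_index": "test_index", "_type": "doc", "_id": "2"},
         {"doc": {"age": "20"}}),
    ]
    print(build_bulk_body(ops), end="")
```

Generating each line with json.dumps avoids hand-written punctuation mistakes, which are easy to make in a body that is not a single JSON document.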

Bulk reads

# request
GET /_mget
{
    "docs": [
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1"
        },
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": 2
        }
    ]
}

# response
{
    "docs": [
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "found": false  # not found
        },
        {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_version": 2,
            "found": true,
            "_source": {
                "username": "lee",
                "age": "20"
            }
        }
    ]
}

Analyze API

es provides an API for testing analysis, which makes it easy to verify how text is tokenized; the endpoint is _analyze

Test with an explicitly specified analyzer

# request
POST _analyze
{
    "analyzer": "standard",  # the analyzer
    "text": "hello world!"   # the text to test
}

# response
{
    "tokens": [
        {
            "token": "hello",    # the resulting token
            "start_offset": 0,   # start offset
            "end_offset": 5,     # end offset
            "type": "<ALPHANUM>",
            "position": 0        # token position
        },
        {
            "token": "world",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Test against a field of an index

# request
POST test_index/_analyze
{
    "field": "username",    # the field to test
    "text": "hello world!"  # the text to test
}

Test with a custom analyzer

# request
POST _analyze
{
    "tokenizer": "standard",
    "filter": ["lowercase"],  # custom analyzer
    "text": "Hello World!"
}

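Common Elasticsearch terms

Document: a unit of document data, comparable to a row in MySQL
Index: every Document is stored in an Index, which is made up of documents that share the same fields; comparable to a table in MySQL
Type: the data type within an index; an index currently allows only one Type, and the concept may be removed in later versions
Node: a running instance of es, the building block of a cluster
Cluster: one or more nodes that together provide the service
Field: an attribute of a document
Query DSL: the query syntax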

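Document

A JSON object made up of fields (Field). Common data types:
String: text, keyword
Numeric: long, integer, short, byte, double, float, half_float, scaled_float
Boolean: boolean
Date: date
Binary: binary
Range: integer_range, float_range, long_range, double_range, date_range

Every document has a unique id, which is either specified by the user or generated automatically by es

Metadata (Document MetaData) describes the document itself:
_index: the name of the index the document belongs to
_type: the name of the type the document belongs to
_id: the unique id of the document
_uid: a composite id made up of _type and _id (in 6.x _type no longer plays a role, so it is the same as _id)
_source: the original JSON data of the document; the content of every field can be read from here
_all: combines the content of all fields into this one field; disabled by default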

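Index

Comparable to a table in MySQL
An index stores documents (Document) that share the same structure
Every index has its own mapping, which defines the field names and types
A cluster can hold many indices, for example:
nginx-log-2017-01-01
nginx-log-2017-01-02
nginx-log-2017-01-03
When storing nginx logs, one index per day can be created, named by date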

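Mapping

Similar to a table schema in a database:

Defines the field names of an Index
Defines the field types, such as numeric, string or boolean
Defines inverted index settings, such as whether a field is indexed and whether positions are recorded

Example

# request
GET /test_index/_mapping

# response
{
    "test_index": {  # the index
        "mappings": {
            "doc": {  # the type
                "properties": {
                    "age": {
                        "type": "integer"
                    },
                    "username": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}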

Custom Mapping

Example:

# request
PUT my_index
{
    "mappings": {  # the mappings keyword
        "doc": {  # the type
            "properties": {
                "title": {
                    "type": "text"
                },
                "name": {
                    "type": "keyword"
                },
                "age": {
                    "type": "integer"
                }
            }
        }
    }
}

# response
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "my_index"
}

Once a field type has been set it cannot be changed directly, because the inverted index built by Lucene cannot be modified after it is generated
To change a type, create a new index and then reindex into it
Adding new fields is allowed
The dynamic parameter controls how new fields are handled:
true (default): new fields are added automatically
false: new fields are not added to the mapping; the document is still written normally, but the new fields cannot be searched
strict: the document is rejected with an error

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "dynamic": false,
            "properties": {
                "user": {
                    "properties": {
                        "name": {
                            "type": "text"
                        },
                        "social_networks": {
                            "dynamic": true,
                            "properties": {}
                        }
                    }
                }
            }
        }
    }
}
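
The three values of dynamic can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions: it only looks at top-level field names, ignores field types and nested objects, and the names apply_dynamic and StrictDynamicMappingError are made up for the example.

```python
class StrictDynamicMappingError(Exception):
    """Raised when dynamic is "strict" and a document has an unmapped field."""


def apply_dynamic(mapped_fields, document, dynamic=True):
    """Return the set of mapped field names after writing `document`.

    mapped_fields: field names already present in the mapping
    document:      dict of field name -> value (top level only)
    dynamic:       True, False or "strict"
    """
    mapped = set(mapped_fields)
    new_fields = set(document) - mapped
    if not new_fields:
        return mapped
    if dynamic is False:
        # The document is accepted, but the new fields stay out of the
        # mapping and therefore cannot be searched.
        return mapped
    if dynamic == "strict":
        raise StrictDynamicMappingError(
            "unmapped fields not allowed: %s" % sorted(new_fields))
    # dynamic is True: the new fields are added to the mapping.
    return mapped | new_fields


if __name__ == "__main__":
    doc = {"user": "alfred", "age": 1}
    print(sorted(apply_dynamic({"user"}, doc, True)))
    print(sorted(apply_dynamic({"user"}, doc, False)))
```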

copy_to

Copies the value of a field into a target field, which gives an effect similar to _all
The target field does not appear in _source; it is only used for searching

PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "first_name": {
                    "type": "text",
                    "copy_to": "full_name"
                },
                "last_name": {
                    "type": "text",
                    "copy_to": "full_name"
                },
                "full_name": {
                    "type": "text"
                }
            }
        }
    }
}

PUT my_index/doc/1
{
    "first_name": "John",
    "last_name": "Smith"
}

GET my_index/_search
{
    "query": {
        "match": {
            "full_name": {
                "query": "John Smith",
                "operator": "and"
            }
        }
    }
}
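
Why the search on full_name matches can be shown with a toy model in Python. It is a deliberately simplified sketch: analysis is reduced to lowercasing and splitting on whitespace, and index_document and match_and are hypothetical names, not Elasticsearch APIs.

```python
def index_document(document, copy_to):
    """Return searchable terms per field for one document.

    copy_to maps a source field to the target field its terms are copied to.
    The document itself (the _source) is left untouched.
    """
    terms_by_field = {}
    for field, value in document.items():
        terms = value.lower().split()
        terms_by_field.setdefault(field, set()).update(terms)
        target = copy_to.get(field)
        if target is not None:
            terms_by_field.setdefault(target, set()).update(terms)
    return terms_by_field


def match_and(terms_by_field, field, query):
    """Mimic a match query with operator "and": every query term must be present."""
    indexed = terms_by_field.get(field, set())
    return all(term in indexed for term in query.lower().split())


if __name__ == "__main__":
    source = {"first_name": "John", "last_name": "Smith"}
    terms = index_document(
        source, {"first_name": "full_name", "last_name": "full_name"})
    print(match_and(terms, "full_name", "John Smith"))
    print(match_and(terms, "first_name", "John Smith"))
    print("full_name" in source)
```

The copied terms exist only on the search side, which is why full_name can be queried even though it never shows up in the stored document.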

index

Controls whether the field is indexed. The default is true, meaning the field is indexed; false means it is not indexed and therefore cannot be searched

# request
PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "cookie": {
                    "type": "text",
                    "index": false
                }
            }
        }
    }
}

PUT my_index/doc/1
{
    "cookie": "name=alfred"
}

GET my_index/_search
{
    "query": {
        "match": {
            "cookie": "name"
        }
    }
}

# response
{
    "error": {
        "root_cause": [
            ......
            "index": "my_index",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "Cannot search on field [cookie] since it is not indexed"
            }
        ]
    },
    "status": 400
}

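index_options controls what is recorded in the inverted index; there are four settings:
docs: records only the doc id
freqs: records the doc id and term frequencies
positions: records the doc id, term frequencies and term positions
offsets: records the doc id, term frequencies, term positions and character offsets
text fields default to positions; all other types default to docs
The more that is recorded, the more space is used

# request
PUT my_index
{
    "mappings": {
        "doc": {
            "properties": {
                "cookie": {
                    "type": "text",
                    "index_options": "offsets"
                }
            }
        }
    }
}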

null_value
The strategy applied when a field holds a null value. The default is null, meaning es ignores the value. Setting null_value gives the field a default value that is indexed instead.

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "properties": {
                "status_code": {
                    "type": "keyword",
                    "null_value": "NULL"
                }
            }
        }
    }
}

Data types

Core data types
String: text, keyword
Numeric: long, integer, short, byte, double, float, half_float, scaled_float
Date: date
Boolean: boolean
Binary: binary
Range: integer_range, float_range, long_range, double_range, date_range

Complex data types
Array: array
Object: object
Nested: nested object

Geo data types
geo_point
geo_shape

Specialized types
ip: stores IP addresses
completion: provides auto-completion
token_count: stores the number of tokens
murmur3: stores the hash of a string
percolator
join

Multi-fields

Multi-fields allow the same field to be configured in different ways, for example with different analyzers. A typical case is pinyin search on people's names: it only takes adding a sub-field named pinyin to the name field

# mapping
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "username": {
                        "type": "text",
                        "fields": {
                            "pinyin": {
                                "type": "text",
                                "analyzer": "pinyin"
                            }
                        }
                    }
                }
            }
        }
    }
}

# request
GET test_index/_search
{
    "query": {
        "match": {
            "username.pinyin": "hanhan"
        }
    }
}

Dynamic Mapping

es can detect the types of document fields automatically, which lowers the effort needed to get started:

# request
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 1
}

GET /test_index/_mapping

# response
# es automatically detects age as long and username as text
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "age": {
                        "type": "long"
                    },
                    "username": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}

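es detects field types from the JSON types of the document's fields. The supported mappings are:

JSON type	es type
null	ignored
boolean	boolean
floating point number	float
integer	long
object	object
array	decided by the type of the first non-null value
string	date if it matches a date format (enabled by default); float or long if it matches a number (disabled by default); otherwise text with a keyword sub-field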

# request
PUT /test_index/doc/1
{
    "username": "alfred",
    "age": 14,
    "birth": "1988-10-10",
    "married": false,
    "year": "18",
    "tags": ["boy", "fashion"],
    "money": 100.1
}

GET /test_index/_mapping

# response
{
    "test_index": {
        "mappings": {
            "doc": {
                "properties": {
                    "age": {
                        "type": "long"
                    },
                    "birth": {
                        "type": "date"
                    },
                    "married": {
                        "type": "boolean"
                    },
                    "money": {
                        "type": "float"
                    },
                    "tags": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "username": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    },
                    "year": {
                        "type": "text",
                        "fields": {
                            "keyword": {
                                "type": "keyword",
                                "ignore_above": 256
                            }
                        }
                    }
                }
            }
        }
    }
}
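
The detection rules can be approximated in a few lines of Python. This is a rough sketch, not the actual es implementation: date detection is reduced to one regular expression for ISO-style dates, the keyword sub-field is left out, and infer_es_type is a hypothetical name.

```python
import re

# Simplified stand-ins for the real detection logic.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\S+)?$")
INTEGER = re.compile(r"^-?\d+$")
DECIMAL = re.compile(r"^-?\d+\.\d+$")


def infer_es_type(value, date_detection=True, numeric_detection=False):
    """Map a decoded JSON value to an es type name (None means ignored)."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_es_type(item, date_detection, numeric_detection)
        return None
    if isinstance(value, str):
        if date_detection and ISO_DATE.match(value):
            return "date"
        if numeric_detection and INTEGER.match(value):
            return "long"
        if numeric_detection and DECIMAL.match(value):
            return "float"
        return "text"
    raise TypeError("not a JSON value: %r" % (value,))


if __name__ == "__main__":
    doc = {"username": "alfred", "age": 14, "birth": "1988-10-10",
           "married": False, "year": "18", "tags": ["boy", "fashion"],
           "money": 100.1}
    for field, value in sorted(doc.items()):
        print(field, infer_es_type(value))
```

With the default flags the sketch leaves "18" as text, mirroring the mapping above, where year is a text field because numeric detection is off.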

The date formats used for automatic date detection can be configured to suit different needs
The default is ["strict_date_optional_time", "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
strict_date_optional_time is the ISO datetime format; in full it looks like this:
YYYY-MM-DDThh:mm:ssTZD (eg 1997-07-16T19:20:30+01:00)
dynamic_date_formats sets custom date formats
date_detection switches automatic date detection off

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "dynamic_date_formats": ["MM/dd/yyyy"]
        }
    }
}

PUT my_index/my_type/1
{
    "create_date": "09/25/2015"
}

# switch automatic date detection off
PUT my_index
{
    "mappings": {
        "my_type": {
            "date_detection": false
        }
    }
}

A string that contains a number is not detected as a numeric type by default, because numbers appearing in strings is perfectly reasonable
numeric_detection switches on the detection of numbers in strings:

# request
PUT my_index
{
    "mappings": {
        "my_type": {
            "numeric_detection": true
        }
    }
}

PUT my_index/my_type/1
{
    "my_float": "1.0",
    "my_integer": "1"
}

GET my_index/_mapping

# response
{
    "my_index": {
        "mappings": {
            "my_type": {
                "numeric_detection": true,
                "properties": {
                    "my_float": {
                        "type": "float"
                    },
                    "my_integer": {
                        "type": "long"
                    }
                }
            }
        }
    }
}

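Dynamic Templates

Dynamic templates set field types dynamically, based on the data type es detected, the field name and so on. They make the following possible:
All strings are mapped as keyword, that is, not analyzed by default
All fields whose names start with message are mapped as text, that is, analyzed
All fields whose names start with long_ are mapped as long
All fields detected as double are mapped as float to save space

API

# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [  # an array, so several matching rules can be given
                {
                    "strings": {  # the name of the template
                        "match_mapping_type": "string",  # the matching rule
                        "mapping": {  # the mapping to apply
                            "type": "keyword"
                        }
                    }
                }
            ]
        }
    }
}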

Matching rule parameters

match_mapping_type: matches the field type detected by es, such as boolean, long or string
match/unmatch: matches the field name
path_match/path_unmatch: matches the path, used for the inner fields of object types

# Map strings as keyword by default
# By default es maps a string as text and adds a keyword sub-field
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {
                            "type": "keyword"
                        }
                    }
                }
            ]
        }
    }
}

# Map every field whose name starts with message as text
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "message_as_text": {
                        "match_mapping_type": "string",
                        "match": "message*",
                        "mapping": {
                            "type": "text"
                        }
                    }
                }
            ]
        }
    }
}

# Map double as float to save space
# request
PUT test_index
{
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {
                    "double_as_float": {
                        "match_mapping_type": "double",
                        "mapping": {
                            "type": "float"
                        }
                    }
                }
            ]
        }
    }
}
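
How the rules combine when several templates are defined can be sketched in Python. Templates are checked in order and the first one that matches is used. The sketch makes a few simplifying assumptions: only match_mapping_type, match and unmatch are handled, wildcard matching is approximated with fnmatchcase from the standard library, and pick_mapping is a hypothetical name.

```python
from fnmatch import fnmatchcase


def pick_mapping(field_name, detected_type, templates):
    """Return (template name, mapping) of the first matching template, or None."""
    for template in templates:
        # Each template is a single-entry dict: {name: rule}.
        name, rule = next(iter(template.items()))
        wanted_type = rule.get("match_mapping_type")
        if wanted_type is not None and wanted_type != detected_type:
            continue
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        return name, rule["mapping"]
    return None


TEMPLATES = [
    {"message_as_text": {"match_mapping_type": "string",
                         "match": "message*",
                         "mapping": {"type": "text"}}},
    {"strings_as_keywords": {"match_mapping_type": "string",
                             "mapping": {"type": "keyword"}}},
    {"double_as_float": {"match_mapping_type": "double",
                         "mapping": {"type": "float"}}},
]

if __name__ == "__main__":
    print(pick_mapping("message_body", "string", TEMPLATES))
    print(pick_mapping("username", "string", TEMPLATES))
    print(pick_mapping("price", "double", TEMPLATES))
```

Because the first match wins, the more specific message* template has to be listed before the catch-all string template.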

Elasticsearch CRUD

Create

# request  /{Index}/{Type}/{id}
POST /accounts/person/1
{
    "name": "John",
    "lastname": "Doe",
    "job_description": "Systems administrator and Linux specialit"
}

# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": true
}

Read

Unlike Create, Read uses GET and sends no request body

# request  /{Index}/{Type}/{id}
GET /accounts/person/1

# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "name": "John",
        "lastname": "Doe",
        "job_description": "Systems administrator and Linux specialit"
    }
}

Update

# request
POST /accounts/person/1/_update
{
    "doc": {
        "job_description": "Systems administrator and Linux specialist"
    }
}

# response
{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 2,
    "result": "updated",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

Delete

# request
# delete a single document
DELETE /accounts/person/1
# delete the whole index
DELETE /accounts

# response (for the document delete)
{
    "found": true,
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 3,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

Elasticsearch Query

Query String

# request
GET /accounts/person/_search?q=john

Query DSL

# request
GET /accounts/person/_search
{
    "query": {
        "match": {
            "name": "john"
        }
    }
}

Elasticsearch Ingest Node

Because Filebeat lacks data transformation capabilities, Elastic added a new node type, the Elasticsearch Ingest Node, to fill the gap: it transforms data before it is written to es

Pipeline API

Plugins

Filter Plugin - dissect

Parses data by splitting on delimiters, which avoids the high CPU cost of grok parsing

%{clientip} %{ident} %{auth} [%{timestamp}] "%{request}" %{response} %{bytes} "%{referrer}" "%{agent}"
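
The delimiter-based idea behind dissect can be sketched in Python. This is a simplified illustration, not the plugin's implementation: it handles only plain %{field} placeholders, the function name dissect and the sample log line are made up for the example, and modifiers supported by the real plugin are left out.

```python
import re


def dissect(pattern, line):
    """Extract fields from `line` using the literal text between placeholders.

    Returns a dict of field name -> value, or None when a delimiter is missing.
    """
    # re.split with a capture group yields: literal, field, literal, field, ...
    parts = re.split(r"%\{(\w+)\}", pattern)
    if not line.startswith(parts[0]):
        return None
    pos = len(parts[0])
    fields = {}
    for i in range(1, len(parts), 2):
        name, delimiter = parts[i], parts[i + 1]
        if delimiter == "":
            end = len(line)  # no delimiter follows: take the rest of the line
        else:
            end = line.find(delimiter, pos)
            if end == -1:
                return None
        fields[name] = line[pos:end]
        pos = end + len(delimiter)
    return fields


PATTERN = ('%{clientip} %{ident} %{auth} [%{timestamp}] "%{request}" '
           '%{response} %{bytes} "%{referrer}" "%{agent}"')

if __name__ == "__main__":
    # Hypothetical access-log line, used only to exercise the sketch.
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /index.html HTTP/1.0" 200 2326 '
              '"http://example.com/start.html" "Mozilla/4.08"')
    for key, value in dissect(PATTERN, sample).items():
        print(key, "=", value)
```

Each field is found with a plain substring search for the next delimiter, which is why this style of parsing is cheaper than evaluating regular expressions the way grok does.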

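Predefined analyzers

standard

The default analyzer

tokenizer:

standard

token filters:

standard
lower case
stop (disabled by default)

Features:

Splits text into words and supports multiple languages
Lowercases the tokens

simple

tokenizer:

lower case

Features:

Splits on non-letter characters
Lowercases the tokens

whitespace

tokenizer:

whitespace

Features:

Splits on whitespace

stop

Stop words are function words such as particles and other modifiers, for example the, an, 的 and 这; the stop analyzer removes them

tokenizer:

lower case

token filters:

stop

Features:

Adds stop word handling on top of simple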
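
The differences between the simple, whitespace and stop analyzers described above can be approximated in Python. This is a rough sketch rather than the Lucene implementation: the function names are hypothetical, and STOP_WORDS is a tiny illustrative list, not the list es actually ships.

```python
import re

# Illustrative stop word list only; the built-in list is longer.
STOP_WORDS = {"the", "an", "a"}

# Runs of letters: word characters that are neither digits nor underscores.
LETTERS = re.compile(r"[^\W\d_]+")


def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyze(text):
    """Split on non-letter characters and lowercase each token."""
    return [token.lower() for token in LETTERS.findall(text)]


def stop_analyze(text):
    """Same as simple_analyze, with stop words removed afterwards."""
    return [token for token in simple_analyze(text) if token not in STOP_WORDS]


if __name__ == "__main__":
    sample = "The quick brown-fox's 2 dogs"
    print(whitespace_analyze(sample))
    print(simple_analyze(sample))
    print(stop_analyze(sample))
```

Running the three functions on the same sentence makes the trade-off visible: whitespace keeps the most of the original text, simple drops digits and punctuation, and stop additionally drops very common words.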

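keyword

tokenizer: keyword

Features:

Does not tokenize; the whole input is emitted as a single token

pattern

tokenizer:

pattern

token filters:

lower case
stop

Features:

The delimiter is defined by a regular expression
The default is \W+, that is, non-word characters act as delimiters

language

Features:

Provides analyzers for 30+ common languages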

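Chinese word segmentation

IK

Segments both Chinese and English words; supports modes such as ik_smart and ik_max_word
Supports custom dictionaries and hot updates of the segmentation dictionary

jieba

The most popular word segmentation system in Python; supports segmentation and part-of-speech tagging
Supports traditional Chinese segmentation, custom dictionaries, parallel segmentation and more

HanLP

A Java toolkit made up of a series of models and algorithms

THULAC

A Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University; it provides Chinese word segmentation and part-of-speech tagging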
