Elasticsearch Learning (3): Common built-in Elasticsearch analyzers, and installing the IK Chinese analyzer online and offline

Analyzers

Common analyzers built into Elasticsearch

standard analyzer

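Sample sentence to analyze: Set the shape to semi-transparent by calling set_trans(5)

The standard analyzer is the default analyzer in Elasticsearch. It splits text on word boundaries, which works well for English. The resulting tokens are: set, the, shape, to, semi, transparent, by, calling, set_trans, 5. It lowercases every token and drops common symbols such as the hyphen (-) and parentheses, but by default it does not remove stop words (the, a, an, and so on).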

GET _analyze
{
  "text": "Set the shape to semi-transparent by calling set_trans(5)",
  "analyzer": "standard"
}

The request above runs the default standard analyzer on the sentence Set the shape to semi-transparent by calling set_trans(5). The response is:

{
  "tokens" : [
    {
      "token" : "set",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "the",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "shape",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "to",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "semi",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "transparent",
      "start_offset" : 22,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "by",
      "start_offset" : 34,
      "end_offset" : 36,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "calling",
      "start_offset" : 37,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "set_trans",
      "start_offset" : 45,
      "end_offset" : 54,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "5",
      "start_offset" : 55,
      "end_offset" : 56,
      "type" : "<NUM>",
      "position" : 9
    }
  ]
}
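To read the tokens out of a response like the one above, a short Python sketch can help. It assumes the response has already been parsed into a dict. The helper names `token_strings` and `source_slices` and the trimmed two-token sample are illustrative and not part of Elasticsearch.

```python
# Sample sentence and a trimmed copy of the _analyze response shown above
# (only the first and last tokens are kept to stay short).
text = "Set the shape to semi-transparent by calling set_trans(5)"
response = {
    "tokens": [
        {"token": "set", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "5", "start_offset": 55, "end_offset": 56,
         "type": "<NUM>", "position": 9},
    ]
}


def token_strings(resp):
    """Return only the token text from an _analyze response (hypothetical helper)."""
    return [t["token"] for t in resp["tokens"]]


def source_slices(resp, original):
    """Map each token back to the slice of the original text it came from."""
    return [original[t["start_offset"]:t["end_offset"]] for t in resp["tokens"]]


print(token_strings(response))
print(source_slices(response, text))
```

The offsets index into the original sentence, so the first slice is still the capitalized Set even though the emitted token is the lowercased set.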

simple analyzer

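The simple analyzer.

Resulting tokens: set, the, shape, to, semi, transparent, by, calling, set, trans. It splits the text at every character that is not a letter and lowercases each token, so the digit 5 is dropped and set_trans becomes two tokens. It is rarely used because it often breaks apart terms that belong together.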

GET _analyze
{
  "text": "Set the shape to semi-transparent by calling set_trans(5)",
  "analyzer": "simple"
}
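The splitting rule of the simple analyzer can be approximated in a few lines of Python. This is only a sketch of the behavior described above, not the Lucene implementation, and `simple_like_tokens` is a hypothetical name.

```python
import re


def simple_like_tokens(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    return [word.lower() for word in re.findall(r"[^\W\d_]+", text)]


sentence = "Set the shape to semi-transparent by calling set_trans(5)"
print(simple_like_tokens(sentence))
```

For the sample sentence this yields the same ten tokens listed above: the underscore splits set_trans and the digit 5 disappears.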

whitespace analyzer

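The whitespace analyzer.

Resulting tokens: Set, the, shape, to, semi-transparent, by, calling, set_trans(5). It splits only on whitespace such as spaces and tabs, and it neither lowercases tokens nor strips punctuation. It is rarely used because case and punctuation stay attached to the tokens.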

GET _analyze
{
  "text": "Set the shape to semi-transparent by calling set_trans(5)",
  "analyzer": "whitespace"
}
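For comparison, whitespace-only splitting can be sketched in Python as below. The function name `whitespace_like_tokens` is an assumption made for illustration.

```python
def whitespace_like_tokens(text):
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # str.split() with no argument splits on any run of spaces, tabs or newlines
    # and leaves case and punctuation untouched.
    return text.split()


sentence = "Set the shape to semi-transparent by calling set_trans(5)"
print(whitespace_like_tokens(sentence))
```

The capital S in Set, the hyphen in semi-transparent and the parentheses in set_trans(5) all survive, which matches the token list above.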

language analyzer

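Language analyzers, such as the English analyzer (english).

Resulting tokens: set, shape, semi, transpar, call, set_tran, 5. The english analyzer tokenizes much like the standard analyzer, then lowercases the tokens, removes stop words, and stems each token, which normalizes singular and plural forms and verb tenses (calling becomes call, transparent becomes transpar).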

GET _analyze
{
  "text": "Set the shape to semi-transparent by calling set_trans(5)",
  "analyzer": "english"
}
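The stop-word step can be illustrated on its own with a minimal Python sketch. It starts from the standard analyzer tokens for the sample sentence and uses a small, hand-picked stop-word set, which is an assumption and not the full list Elasticsearch uses. Stemming is not reproduced here.

```python
# Tokens produced by the standard analyzer for the sample sentence.
standard_tokens = ["set", "the", "shape", "to", "semi", "transparent",
                   "by", "calling", "set_trans", "5"]

# Illustrative subset of English stop words (assumption: not the complete list).
STOP_WORDS = {"a", "an", "the", "to", "by"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


print(remove_stop_words(standard_tokens))
```

The analyzer would then stem the remaining tokens, turning transparent, calling and set_trans into transpar, call and set_tran.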

Note: the commonly used built-in analyzers are designed for English. The default standard analyzer, for example, splits Chinese text into one token per character.
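What one token per character means for a Chinese phrase can be shown with a tiny Python sketch. It only mimics the effect and does not call Elasticsearch. `one_token_per_character` is a hypothetical helper.

```python
def one_token_per_character(text):
    """Mimic per-character tokenization of Chinese text, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


print(one_token_per_character("中华人民共和国国歌"))
```

Words such as 国歌 never appear as tokens in this output, which is the gap a Chinese analyzer like IK is meant to fill.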

2 Installing the Chinese analyzer

2.1 Enter the container

docker exec -it es /bin/bash

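2.2 Install IK

The plugin is downloaded from GitHub, so this step can take a while. The IK release should match the Elasticsearch version (6.8.4 here).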

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.4/elasticsearch-analysis-ik-6.8.4.zip
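Because the version number appears twice in the download URL, it is easy to mistype. The Python sketch below builds the URL and the install command from a single version string. It assumes other releases follow the same URL pattern as the 6.8.4 link above, and the function names are hypothetical.

```python
IK_RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_release_url(es_version):
    """Build the IK archive URL for a given Elasticsearch version (assumed pattern)."""
    return f"{IK_RELEASES}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


def install_command(es_version):
    """Return the plugin install command to run inside the container."""
    return f"./bin/elasticsearch-plugin install {ik_release_url(es_version)}"


print(install_command("6.8.4"))
```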

2.3 Restart the container

docker restart es

2.4 Offline installation

1 Download the archive

https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.8.4/elasticsearch-analysis-ik-6.8.4.zip

2 Upload the IK archive to the /tmp directory on the Docker host

3 Copy the archive into the container (the container is named elasticsearch in these steps; substitute the name of your own container)

docker cp /tmp/elasticsearch-analysis-ik-6.8.4.zip elasticsearch:/usr/share/elasticsearch/plugins

4 Enter the container

docker exec -it elasticsearch /bin/bash

5 Create the plugin directory

mkdir /usr/share/elasticsearch/plugins/ik

6 Move the archive into the ik directory

mv /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik-6.8.4.zip /usr/share/elasticsearch/plugins/ik

7 Change into the directory

cd /usr/share/elasticsearch/plugins/ik

8 Unzip the archive

unzip elasticsearch-analysis-ik-6.8.4.zip

9 Delete the archive

rm -rf elasticsearch-analysis-ik-6.8.4.zip

After deleting the archive, exit the container and restart it as in 2.3 so that Elasticsearch loads the plugin.
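The offline steps repeat the archive name several times, so a version mismatch between commands is an easy mistake. As a rough sketch, the Python helper below generates the command sequence from one container name and one version. It only builds strings and runs nothing. The function name, the (where, command) tuple layout and the final restart entry are assumptions added for illustration.

```python
def offline_install_commands(container, version, host_dir="/tmp"):
    """Return (where, command) pairs for the offline IK install (hypothetical helper)."""
    archive = f"elasticsearch-analysis-ik-{version}.zip"
    plugins = "/usr/share/elasticsearch/plugins"
    return [
        ("host", f"docker cp {host_dir}/{archive} {container}:{plugins}"),
        ("host", f"docker exec -it {container} /bin/bash"),
        ("container", f"mkdir {plugins}/ik"),
        ("container", f"mv {plugins}/{archive} {plugins}/ik"),
        ("container", f"cd {plugins}/ik"),
        ("container", f"unzip {archive}"),
        ("container", f"rm -rf {archive}"),
        # Assumption: restart afterwards so Elasticsearch loads the new plugin.
        ("host", f"docker restart {container}"),
    ]


for where, command in offline_install_commands("elasticsearch", "6.8.4"):
    print(f"[{where}] {command}")
```

Deriving every archive name from the same `version` argument keeps the copy, move, unzip and delete commands consistent with the release that was downloaded.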

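2.5 Test the IK analyzer

IK provides two analyzers, ik_max_word and ik_smart.

ik_max_word: splits the text at the finest granularity and exhausts the possible combinations. For example, it splits “中华人民共和国国歌” into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,国,国歌”.

ik_smart: splits the text at the coarsest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国,国歌”.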

GET _analyze
{
  "text" : "中华人民共和国国歌",
  "analyzer": "ik_max_word"
}



GET _analyze
{
  "text" : "中华人民共和国国歌",
  "analyzer": "ik_smart"
}
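The difference between the two modes can be checked with a small Python sketch. The token lists are copied from the description in this section, not produced by running IK here, and real output may vary with the IK version and dictionary. `covers_text` is a hypothetical helper.

```python
text = "中华人民共和国国歌"

# Token lists copied from the article's description (not real IK output).
ik_max_word = ["中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
               "人", "民", "共和国", "共和", "国", "国歌"]
ik_smart = ["中华人民共和国", "国歌"]


def covers_text(tokens, original):
    """True if the tokens, joined in order, reproduce the original text."""
    return "".join(tokens) == original


print(covers_text(ik_smart, text))              # coarse tokens tile the text exactly
print(set(ik_smart) <= set(ik_max_word))        # every coarse token is also a fine token
print(all(tok in text for tok in ik_max_word))  # fine tokens are overlapping substrings
```

In this example ik_smart yields a non-overlapping segmentation of the phrase, while ik_max_word adds overlapping sub-words on top of it.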
