- Create an index
- Query an index
- Delete an index
- Close an index
- Open an index
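The IK Analyzer for Chinese Text
Installing and Using the IK Analyzer
- The default standard analyzer is designed around English-style word boundaries; it splits Chinese text into single characters.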
GET /_analyze
{
"analyzer": "standard",
"text": ["中华人民共和国人民大会堂"]
}
Response:
{
"tokens" : [
{
"token" : "中",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "华",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "人",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "民",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "共",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "和",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "人",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "民",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "大",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "会",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 10
},
{
"token" : "堂",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 11
}
]
}
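● What we want is two tokens: 中华人民共和国 and 人民大会堂. The standard analyzer cannot produce them.
● The IK analyzer is one of the most widely used Chinese analyzers for ES.
Installing the IK Analyzer
See this article.
IK Analyzer Basics
● ik_smart: performs the coarsest-grained split. For example, it splits "中华人民共和国人民大会堂" into "中华人民共和国" and "人民大会堂".
● ik_max_word: performs the finest-grained split, exhausting the possible combinations. For example, it splits "中华人民共和国人民大会堂" into "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民大会堂", "人民大会", and "大会堂".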
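The difference between the two granularities can be illustrated with a toy dictionary matcher. This is only a sketch under stated assumptions: the class `SegmentationSketch`, its small dictionary, and both methods are hypothetical, and IK's real segmentation and disambiguation logic is more involved than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class SegmentationSketch {
    // Toy dictionary (an assumption for this sketch); the real main.dic is far larger.
    static final Set<String> DICT = Set.of(
            "中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
            "人民", "共和国", "人民大会堂", "人民大会", "大会堂", "大会");

    // Coarse-grained: at each position take the longest dictionary word, then jump past it.
    static List<String> coarse(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1; // fall back to a single character when nothing matches
            for (int e = text.length(); e > pos; e--) {
                if (DICT.contains(text.substring(pos, e))) {
                    end = e;
                    break;
                }
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    // Fine-grained: emit every dictionary word found at every start position.
    static List<String> fine(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int e = text.length(); e > start; e--) {
                String candidate = text.substring(start, e);
                if (DICT.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "中华人民共和国人民大会堂";
        System.out.println(coarse(text));
        System.out.println(fine(text));
    }
}
```

The coarse variant yields few, long tokens, while the fine variant yields many overlapping ones, which is the trade-off between ik_smart and ik_max_word described above.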
Using the IK Analyzer
Example:
● Create an index that uses ik_max_word at index time and ik_smart at search time:
PUT /my_index
{
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}
Insert a document:
PUT /my_index/_doc/1
{
"name":"中华人民共和国人民大会堂"
}
Search:
GET /my_index/_search?q=共和国
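To check which tokens each analyzer produces for a given text, the `_analyze` endpoint can also be called from Java. The following is a rough sketch using the JDK's built-in HTTP client (Java 11+); the class name `AnalyzeRequestSketch`, the helper methods, and the `http://localhost:9200` address are assumptions for illustration, and the JSON escaping only covers quotes and backslashes.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class AnalyzeRequestSketch {
    // Minimal JSON string escaping for this sketch: only quotes and backslashes are handled.
    static String jsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds (but does not send) a POST /<index>/_analyze request.
    static HttpRequest analyze(String baseUrl, String index, String analyzer, String text) {
        String body = "{\"analyzer\":" + jsonString(analyzer)
                + ",\"text\":" + jsonString(text) + "}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Assumed local node address; adjust to your cluster.
        String es = "http://localhost:9200";
        HttpRequest indexSide = analyze(es, "my_index", "ik_max_word", "中华人民共和国人民大会堂");
        HttpRequest searchSide = analyze(es, "my_index", "ik_smart", "共和国");
        System.out.println(indexSide.method() + " " + indexSide.uri());
        System.out.println(searchSide.method() + " " + searchSide.uri());
        // To actually send a request against a running node:
        // HttpClient.newHttpClient().send(indexSide, HttpResponse.BodyHandlers.ofString());
    }
}
```

Comparing the two responses shows whether the search-time tokens are among the index-time tokens, which is what a match depends on.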
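IK Configuration Files
The IK Analyzer's Configuration Files
The IK configuration files live in the ES/plugins/ik/config directory.
● IKAnalyzer.cfg.xml: configures custom dictionaries.
● main.dic (important): IK's built-in Chinese dictionary, with more than 270,000 entries. Any word listed here is kept together as a single token.
● preposition.dic: prepositions.
● quantifier.dic: units and measure words.
● suffix.dic: suffixes.
● surname.dic: Chinese surnames.
● stopword.dic (important): English stop words.
Custom Dictionaries
Building your own dictionary:
○ New popular words appear every year, such as 网红, 蓝瘦香菇, and 喊麦, and they are usually missing from the built-in dictionary.
○ Steps:
① Create a mydict.dic file and add the new words to it.
② Register mydict.dic in IKAnalyzer.cfg.xml.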
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionary here -->
<entry key="ext_dict">mydict.dic</entry>
<!-- Configure your own extension stop word dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- Configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Configure a remote extension stop word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
③ Restart ES.
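A dictionary file is plain text with one word per line and should be saved as UTF-8. As a small sketch of step ①, the following writes such a file from Java; the class `DictFileSketch` and its method are hypothetical names, and the target path is whatever you pass in (a temporary file by default).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class DictFileSketch {
    // Writes one word per line in UTF-8, skipping null or blank entries.
    static void writeDict(Path file, List<String> words) throws IOException {
        List<String> cleaned = new ArrayList<>();
        for (String word : words) {
            if (word != null && !word.trim().isEmpty()) {
                cleaned.add(word.trim());
            }
        }
        Files.write(file, cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Pass the real mydict.dic path as the first argument; otherwise a temp file is used.
        Path target = args.length > 0 ? Path.of(args[0]) : Files.createTempFile("mydict", ".dic");
        writeDict(target, List.of("网红", "蓝瘦香菇", "喊麦"));
        System.out.println("wrote " + target);
    }
}
```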
Building your own stop word list:
○ Words such as 了, 的, 地, and 得 are usually not worth indexing or searching.
○ Steps:
① Create ext_stopword.dic and add common Chinese stop words to it.
② Register ext_stopword.dic in IKAnalyzer.cfg.xml.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Configure your own extension dictionary here -->
<entry key="ext_dict">mydict.dic</entry>
<!-- Configure your own extension stop word dictionary here -->
<entry key="ext_stopwords">ext_stopword.dic</entry>
<!-- Configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Configure a remote extension stop word dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
③ Restart ES.
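Hot-Updating the Dictionary with MySQL
Hot Updates
● Manually adding new words to the ES extension dictionary every time is painful:
○ ES must be restarted after every change before it takes effect.
○ ES is distributed and may have hundreds of nodes; editing them one by one is impractical.
● Hot update: without stopping ES, we add new words in some external place and ES loads them without a restart.
● Hot update options:
○ Use the hot-update mechanism that IK supports natively: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag HTTP response headers.
○ Modify the IK source code so that it periodically loads new words from MySQL. This is the recommended option.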
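The header-based option boils down to remembering the last seen validators and reloading when either one changes. Below is a minimal sketch of that comparison only; the class `RemoteDictChangeSketch` is hypothetical and does not reproduce IK's own monitor code or its HTTP handling.

```java
import java.util.Objects;

class RemoteDictChangeSketch {
    private String lastModified; // last seen Last-Modified header value
    private String eTag;         // last seen ETag header value

    // Returns true when either header differs from the previously seen value,
    // and remembers the new values for the next check.
    boolean changed(String newLastModified, String newETag) {
        boolean differs = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        lastModified = newLastModified;
        eTag = newETag;
        return differs;
    }

    public static void main(String[] args) {
        RemoteDictChangeSketch monitor = new RemoteDictChangeSketch();
        // First sighting counts as a change, so the dictionary is loaded once.
        System.out.println(monitor.changed("Tue, 15 Dec 2020 06:02:00 GMT", "\"v1\""));
        // Same headers again: nothing to reload.
        System.out.println(monitor.changed("Tue, 15 Dec 2020 06:02:00 GMT", "\"v1\""));
        // A new ETag signals that the word list was edited.
        System.out.println(monitor.changed("Tue, 15 Dec 2020 06:02:00 GMT", "\"v2\""));
    }
}
```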
Steps
- Download the source code
- Modify the source code:
① Create a HotDictReloadThread that repeatedly calls Dictionary.getSingleton().reLoadMainDict().
package org.wltea.analyzer.dic;
import org.apache.logging.log4j.Logger;
import org.wltea.analyzer.help.ESPluginLoggerFactory;
/**
* Thread that reloads the dictionary
*
* @author 许大仙
* @version 1.0
* @since 2020-12-15 14:02
*/
public class HotDictReloadThread implements Runnable {
private static Logger logger = ESPluginLoggerFactory.getLogger(HotDictReloadThread.class.getName());
@Override
public void run() {
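// No sleep in this loop: the pause between rounds comes from Thread.sleep inside the MySQL loaders
while (true) {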
logger.info("----------reload hot dict from mysql--------------");
Dictionary.getSingleton().reLoadMainDict();
}
}
}
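The busy loop above depends on the loader methods sleeping. An alternative worth considering is a scheduled executor that owns the interval itself. This is only a sketch: `ScheduledReloadSketch` and `start` are hypothetical names, and in the plugin the `reload` task would be something like `() -> Dictionary.getSingleton().reLoadMainDict()` with the `Thread.sleep` calls removed from the loaders.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ScheduledReloadSketch {
    // Runs the reload task immediately and then again after each fixed delay.
    static ScheduledExecutorService start(Runnable reload, long intervalMillis) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "hot-dict-reload");
            t.setDaemon(true); // do not keep the JVM alive just for reloading
            return t;
        });
        pool.scheduleWithFixedDelay(() -> {
            try {
                reload.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("reload failed: " + e);
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch threeRuns = new CountDownLatch(3);
        ScheduledExecutorService pool = start(threeRuns::countDown, 100);
        threeRuns.await(5, TimeUnit.SECONDS);
        pool.shutdownNow();
        System.out.println("remaining runs: " + threeRuns.getCount());
    }
}
```

Keeping the interval in the scheduler also means a failed database query cannot turn the loop into a tight spin.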
② Add the MySQL driver dependency to pom.xml:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.21</version>
</dependency>
③ Create an es database in MySQL and run the following script to create its tables:
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for hot_stopwords
-- ----------------------------
DROP TABLE IF EXISTS `hot_stopwords`;
CREATE TABLE `hot_stopwords` (
`stopword` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`id` bigint(20) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
-- ----------------------------
-- Table structure for hot_words
-- ----------------------------
DROP TABLE IF EXISTS `hot_words`;
CREATE TABLE `hot_words` (
`word` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`id` bigint(20) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
SET FOREIGN_KEY_CHECKS = 1;
④ Create a jdbc-reload.properties file in the project's config directory:
jdbc.url=jdbc:mysql://localhost:3306/es?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true&useSSL=false&serverTimezone=GMT%2B8&allowPublicKeyRetrieval=true&nullCatalogMeansCurrent=true
jdbc.user=root
jdbc.password=123456
jdbc.reload.sql=select word from hot_words
jdbc.reload.stopword.sql=select stopword as word from hot_stopwords
jdbc.reload.interval=5000
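The reload interval is read as a number of milliseconds, so a missing or malformed value would otherwise surface as an exception on every reload round. As a sketch of a more defensive read, assuming hypothetical helper names (`ReloadConfigSketch`, `parse`, `intervalMillis`) and parsing from a string only to keep the example self-contained:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ReloadConfigSketch {
    // Parses properties-file content; the plugin would read jdbc-reload.properties from disk instead.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns jdbc.reload.interval in milliseconds, or the fallback when missing or not a number.
    static long intervalMillis(Properties props, long fallbackMillis) {
        String raw = props.getProperty("jdbc.reload.interval");
        if (raw == null) {
            return fallbackMillis;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        Properties props = parse("jdbc.reload.sql=select word from hot_words\n"
                + "jdbc.reload.interval=5000\n");
        System.out.println(props.getProperty("jdbc.reload.sql"));
        System.out.println(intervalMillis(props, 1000));
    }
}
```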
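⑤ Modify the initial() method in Dictionary:
/**
* Initializes the dictionary. IK Analyzer initializes its dictionary through static methods of the Dictionary class,
* so loading starts only when Dictionary is first used, which lengthens the first segmentation call. This method offers a way to initialize the dictionary while the application is loading.
*/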
public static synchronized void initial(Configuration cfg) {
if (singleton == null) {
synchronized (Dictionary.class) {
if (singleton == null) {
singleton = new Dictionary(cfg);
singleton.loadMainDict();
singleton.loadSurnameDict();
singleton.loadQuantifierDict();
singleton.loadSuffixDict();
singleton.loadPrepDict();
singleton.loadStopWordDict();
// ********* MySQL monitoring thread *********
new Thread(new HotDictReloadThread()).start();
if (cfg.isEnableRemoteDict()) {
// Start the monitoring tasks
for (String location : singleton.getRemoteExtDictionarys()) {
// 10 is the initial delay (adjustable) and 60 is the interval, both in seconds
pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
}
for (String location : singleton.getRemoteExtStopWordDictionarys()) {
pool.scheduleAtFixedRate(new Monitor(location), 10, 60, TimeUnit.SECONDS);
}
}
}
}
}
}
⑥ Modify the loadMainDict() method in Dictionary:
/**
* Loads the main dictionary and the extension dictionaries
*/
private void loadMainDict() {
// Create the main dictionary instance
_MainDict = new DictSegment((char) 0);
// Read the main dictionary file
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_MAIN);
loadDictFile(_MainDict, file, false, "Main Dict");
// Load the extension dictionaries
this.loadExtDict();
// Load the remote custom dictionaries
this.loadRemoteExtDict();
// *********** Load the dictionary from MySQL ***********
this.loadMySQLExtDict();
}
Then add the loadMySQLExtDict() method that it calls, together with the driver registration:
private static Properties prop = new Properties();
static {
try {
// Connector/J 8.x driver class; com.mysql.jdbc.Driver is the deprecated legacy name
Class.forName("com.mysql.cj.jdbc.Driver");
} catch (ClassNotFoundException ex) {
logger.error("mysql driver not found exception", ex);
}
}
/**
* Loads the hot-update dictionary from MySQL
*/
private void loadMySQLExtDict() {
Connection conn = null;
Statement stmt = null;
ResultSet rs = null;
try {
Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
prop.load(new FileInputStream(file.toFile()));
logger.info("[==========]jdbc-reload.properties");
for(Object key : prop.keySet()) {
logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
}
logger.info("[==========]query hot dict from mysql, " + prop.getProperty("jdbc.reload.sql") + "......");
conn = DriverManager.getConnection(
prop.getProperty("jdbc.url"),
prop.getProperty("jdbc.user"),
prop.getProperty("jdbc.password"));
stmt = conn.createStatement();
rs = stmt.executeQuery(prop.getProperty("jdbc.reload.sql"));
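while(rs.next()) {
String theWord = rs.getString("word");
// The word column is nullable, so skip NULL or blank rows
if (theWord == null || theWord.trim().isEmpty()) {
continue;
}
logger.info("[==========]hot word from mysql: " + theWord);
_MainDict.fillSegment(theWord.trim().toCharArray());
}
// Pause before the next reload round; the interval is in milliseconds
Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
} catch (Exception e) {
logger.error("error", e);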
} finally {
if(rs != null) {
try {
rs.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
if(stmt != null) {
try {
stmt.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
if(conn != null) {
try {
conn.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
}
}
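The nested finally blocks above can be expressed more compactly with try-with-resources. The following is a sketch of that shape, not the plugin's code: `JdbcWordLoaderSketch`, `normalize`, and `readWords` are hypothetical names, and the caller is assumed to supply an open `Connection` and a query that returns a `word` column.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class JdbcWordLoaderSketch {
    // Trims a raw column value; returns null for NULL or blank values so callers can skip them.
    static String normalize(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Reads the "word" column of every row; the statement and result set are closed automatically.
    static List<String> readWords(Connection conn, String sql) throws SQLException {
        List<String> words = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                String word = normalize(rs.getString("word"));
                if (word != null) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // No database is needed to see the cleaning step in isolation.
        System.out.println(normalize("  网红  "));
        System.out.println(normalize("   "));
    }
}
```

Each returned word could then be passed to `_MainDict.fillSegment(word.toCharArray())` as in the method above.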
⑦ Modify the loadStopWordDict() method in Dictionary:
/**
* Loads the built-in and user-defined stop word dictionaries
*/
private void loadStopWordDict() {
// Create the stop word dictionary instance
_StopWords = new DictSegment((char) 0);
// Read the main stop word dictionary file
Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_DIC_STOP);
loadDictFile(_StopWords, file, false, "Main Stopwords");
// Load the extension stop word dictionaries
List<String> extStopWordDictFiles = getExtStopWordDictionarys();
if (extStopWordDictFiles != null) {
for (String extStopWordDictName : extStopWordDictFiles) {
logger.info("[Dict Loading] " + extStopWordDictName);
// Read the extension dictionary file
file = PathUtils.get(extStopWordDictName);
loadDictFile(_StopWords, file, false, "Extra Stopwords");
}
}
// Load the remote stop word dictionaries
List<String> remoteExtStopWordDictFiles = getRemoteExtStopWordDictionarys();
for (String location : remoteExtStopWordDictFiles) {
logger.info("[Dict Loading] " + location);
List<String> lists = getRemoteWords(location);
// Skip the location if its dictionary cannot be found
if (lists == null) {
logger.error("[Dict Loading] " + location + " load failed");
continue;
}
for (String theWord : lists) {
if (theWord != null && !"".equals(theWord.trim())) {
// Load the remote words into the in-memory stop word dictionary
logger.info(theWord);
_StopWords.fillSegment(theWord.trim().toLowerCase().toCharArray());
}
}
}
// *********** Load stop words from MySQL ************
this.loadMySQLStopwordDict();
}
// Loads stop words from MySQL
private void loadMySQLStopwordDict() {
Connection conn = null;
Statement stmt = null;
ResultSet rs = null;
try {
Path file = PathUtils.get(getDictRoot(), "jdbc-reload.properties");
prop.load(new FileInputStream(file.toFile()));
logger.info("[==========]jdbc-reload.properties");
for(Object key : prop.keySet()) {
logger.info("[==========]" + key + "=" + prop.getProperty(String.valueOf(key)));
}
logger.info("[==========]query hot stopword dict from mysql, " + prop.getProperty("jdbc.reload.stopword.sql") + "......");
conn = DriverManager.getConnection(
prop.getProperty("jdbc.url"),
prop.getProperty("jdbc.user"),
prop.getProperty("jdbc.password"));
stmt = conn.createStatement();
rs = stmt.executeQuery(prop.getProperty("jdbc.reload.stopword.sql"));
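while(rs.next()) {
String theWord = rs.getString("word");
// The stopword column is nullable, so skip NULL or blank rows
if (theWord == null || theWord.trim().isEmpty()) {
continue;
}
logger.info("[==========]hot stopword from mysql: " + theWord);
_StopWords.fillSegment(theWord.trim().toCharArray());
}
// Pause before the next reload round; the interval is in milliseconds
Thread.sleep(Integer.valueOf(String.valueOf(prop.get("jdbc.reload.interval"))));
} catch (Exception e) {
logger.error("error", e);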
} finally {
if(rs != null) {
try {
rs.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
if(stmt != null) {
try {
stmt.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
if(conn != null) {
try {
conn.close();
} catch (SQLException e) {
logger.error("error", e);
}
}
}
}
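⑧ Package the project with mvn package.
⑨ Replace elasticsearch-analysis-ik-7.10.0.jar in the release archive with the jar you just built.
⑩ Copy the jdbc-reload.properties file into the config directory, and copy the MySQL driver jar into the ik directory.
⑪ Restart ES and watch the log; it shows the messages we print, such as the loaded configuration.
⑫ Add words and stop words in MySQL.
⑬ Test whether the hot update works:
GET /_analyze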
{
"analyzer": "ik_smart",
"text": ["大忽悠爱忽悠"]
}
Index Management with the Java API
Create an Index
package com.dhy;
import org.elasticsearch.action.admin.indices.alias.Alias;
import org.elasticsearch.action.support.ActiveShardCount;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.TimeValue;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
@SpringBootTest
public class ElkApplicationTests {
@Autowired
private RestHighLevelClient client;
/**
*
<pre>
* PUT /index
* {
* "settings":{...},
* "mappings":{
* "properties":{
* ...
* }
* },
* "aliases":{
* "default_index":{}
* }
* }
* </pre>
*/
@Test
public void testCreateIndex() throws IOException {
// Create the request
CreateIndexRequest request = new CreateIndexRequest("my_index");
// Configure the index settings
request.settings(Settings.builder().put("number_of_shards", "1").put("number_of_replicas", "1").build());
// Configure the mappings
Map<String, Object> field1 = new HashMap<>();
field1.put("type", "text");
Map<String, Object> field2 = new HashMap<>();
field2.put("type", "text");
Map<String,Object> properties = new HashMap<>();
properties.put("field1", field1);
properties.put("field2", field2);
Map<String,Object> mappings = new HashMap<>();
mappings.put("properties",properties);
request.mapping(mappings);
// Configure the alias
request.alias(new Alias("default_index"));
// --------------- Optional parameters -------------
// Request timeout of 5 seconds
request.setTimeout(TimeValue.timeValueSeconds(5));
// Master node timeout of 5 seconds
request.setMasterTimeout(TimeValue.timeValueSeconds(5));
// Number of active shard copies to wait for before the create index API returns
request.waitForActiveShards(ActiveShardCount.from(1));
// Execute the request
CreateIndexResponse response = client.indices().create(request, RequestOptions.DEFAULT);
// Read the response
boolean acknowledged = response.isAcknowledged();
System.out.println("acknowledged = " + acknowledged);
boolean shardsAcknowledged = response.isShardsAcknowledged();
System.out.println("shardsAcknowledged = " + shardsAcknowledged);
String index = response.index();
System.out.println("index = " + index);
}
}
Query an Index
package com.sunxiaping.elk;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.client.indices.GetIndexResponse;
import org.elasticsearch.cluster.metadata.AliasMetadata;
import org.elasticsearch.cluster.metadata.MappingMetadata;
import org.elasticsearch.common.settings.Settings;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.IOException;
import java.util.List;
import java.util.Map;
@SpringBootTest
public class ElkApplicationTests {
@Autowired
private RestHighLevelClient client;
/**
* Checks whether an index exists and retrieves its information
*/
@Test
public void testExistIndex() throws IOException {
GetIndexRequest request = new GetIndexRequest("my_index");
// Parameters
request.local(false); // false: read the state from the master node rather than the local node
request.humanReadable(true); // Return values in a human-readable format
request.includeDefaults(false); // Whether to return all default settings for each index
// Check whether the index exists
boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
System.out.println("exists = " + exists);
// Retrieve the index
GetIndexResponse response = client.indices().get(request, RequestOptions.DEFAULT);
Map<String, List<AliasMetadata>> aliases = response.getAliases();
System.out.println("aliases = " + aliases);
Map<String, MappingMetadata> mappings = response.getMappings();
System.out.println("mappings = " + mappings);
Map<String, Settings> settings = response.getSettings();
System.out.println("settings = " + settings);
}
}
Delete an Index
package com.sunxiaping.elk;
import org.elasticsearch.action.admin.indices.delete.DeleteIndexRequest;
import org.elasticsearch.action.support.master.AcknowledgedResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.IOException;
@SpringBootTest
public class ElkApplicationTests {
@Autowired
private RestHighLevelClient client;
/**
* Deletes an index
*
* @throws IOException
*/
@Test
public void testDeleteIndex() throws IOException {
DeleteIndexRequest request = new DeleteIndexRequest("my_index");
AcknowledgedResponse response = client.indices().delete(request, RequestOptions.DEFAULT);
boolean acknowledged = response.isAcknowledged();
System.out.println("acknowledged = " + acknowledged);
}
}
Close an Index
package com.sunxiaping.elk;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CloseIndexRequest;
import org.elasticsearch.client.indices.CloseIndexResponse;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.IOException;
@SpringBootTest
public class ElkApplicationTests {
@Autowired
private RestHighLevelClient client;
/**
* Closes an index: a closed index is blocked for both read and write operations until it is reopened
*
* @throws IOException
*/
@Test
public void testCloseIndex() throws IOException {
CloseIndexRequest request = new CloseIndexRequest("my_index");
CloseIndexResponse response = client.indices().close(request, RequestOptions.DEFAULT);
boolean acknowledged = response.isAcknowledged();
System.out.println("acknowledged = " + acknowledged);
}
}
Open an Index
package com.sunxiaping.elk;
import org.elasticsearch.action.admin.indices.open.OpenIndexRequest;
import org.elasticsearch.action.admin.indices.open.OpenIndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import java.io.IOException;
@SpringBootTest
public class ElkApplicationTests {
@Autowired
private RestHighLevelClient client;
/**
* Opens an index
*
* @throws IOException
*/
@Test
public void testOpenIndex() throws IOException {
OpenIndexRequest request = new OpenIndexRequest("my_index");
OpenIndexResponse response = client.indices().open(request, RequestOptions.DEFAULT);
boolean acknowledged = response.isAcknowledged();
System.out.println("acknowledged = " + acknowledged);
}
}