Introduction
This is a Java implementation of Chinese word segmentation based on n-gram, CRF, and HMM models. Segmentation speed reaches roughly 2 million characters per second (tested on a MacBook Air), with accuracy above 96%. It currently implements Chinese word segmentation, Chinese person-name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging. It can be applied to natural-language-processing tasks and suits any project with high demands on segmentation quality.
Maven dependency:
<dependency>
<groupId>org.ansj</groupId>
<artifactId>ansj_seg</artifactId>
<version>5.1.6</version>
</dependency>
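A minimal quick-start sketch (ToAnalysis and Result come from ansj_seg 5.x; the sample sentence is just an illustration):
import org.ansj.domain.Result;
import org.ansj.splitWord.analysis.ToAnalysis;

public class QuickStart {
    public static void main(String[] args) {
        // ToAnalysis is the recommended general-purpose mode (see "Segmentation modes" below)
        Result result = ToAnalysis.parse("欢迎使用ansj_seg中文分词");
        System.out.println(result); // terms are printed as word/nature, e.g. 欢迎/v
    }
}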
Configuration file: library.properties
dic=src/main/resources/userLibrary
Dictionary folder: userLibrary
doctor.dic
....
hospital.dic
- Project home: https://github.com/NLPchina/ansj_seg
- Project wiki: https://github.com/NLPchina/ansj_seg/wiki
- Project documentation site: http://nlpchina.github.io/ansj_seg/
- Jar artifact: https://mvnrepository.com/artifact/org.ansj/ansj_seg
Ansj In Elasticsearch
Official repository: https://github.com/NLPchina/elasticsearch-analysis-ansj
Manual installation
- Download (pick the version matching your Elasticsearch): https://github.com/NLPchina/elasticsearch-analysis-ansj/releases
- Copy the zip into Elasticsearch's plugins folder, create a new subfolder there, and unzip the zip into it
- After installation, restart Elasticsearch; the startup log should show the plugin being loaded:
[2021-07-16T18:18:20,286][INFO ][o.e.p.PluginsService ] [node-1] loaded module [x-pack-watcher]
[2021-07-16T18:18:20,286][INFO ][o.e.p.PluginsService ] [node-1] loaded plugin [analysis-icu]
[2021-07-16T18:18:20,287][INFO ][o.e.p.PluginsService ] [node-1] loaded plugin [elasticsearch-analysis-ansj]
[2021-07-16T18:18:21,892][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : base_ansj
[2021-07-16T18:18:21,892][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : index_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : query_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : dic_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : nlp_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : base_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : index_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : query_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : dic_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : nlp_ansj
[2021-07-16T18:18:23,190][INFO ][o.e.x.s.a.s.FileRolesStore] [node-1] parsed [0] roles from file [/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/config/roles.yml]
[2021-07-16T18:18:23,886][DEBUG][o.e.a.ActionModule ] [node-1] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2021-07-16T18:18:24,207][INFO ][o.e.d.DiscoveryModule ] [node-1] using discovery type [zen] and seed hosts providers [settings]
[2021-07-16T18:18:24,470][INFO ][ansj-initializer ] [node-1] try to load ansj config file: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/config/elasticsearch-analysis-ansj/ansj.cfg.yml
[2021-07-16T18:18:24,471][INFO ][ansj-initializer ] [node-1] try to load ansj config file: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/plugins/elasticsearch-analysis-ansj-7.2.1.0-release/config/ansj.cfg.yml
[2021-07-16T18:18:24,472][INFO ][ansj-initializer ] [node-1] load ansj config: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/plugins/elasticsearch-analysis-ansj-7.2.1.0-release/config/ansj.cfg.yml
[2021-07-16T18:18:24,481][WARN ][o.a.u.MyStaticValue ] [node-1] not find ansj_library.properties. reason: access denied ("java.io.FilePermission" "/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1" "read")
[2021-07-16T18:18:24,482][WARN ][o.a.u.MyStaticValue ] [node-1] not find library.properties. reason: access denied ("java.io.FilePermission" "/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1" "read")
[2021-07-16T18:18:24,482][WARN ][o.a.u.MyStaticValue ] [node-1] not find library.properties in classpath use it by default !
[2021-07-16T18:18:24,485][INFO ][o.a.d.i.File2Stream ] [node-1] path to stream ansj_library.properties
[2021-07-16T18:18:24,486][ERROR][ansj-initializer ] [node-1] ansj_library.properties load err: org.ansj.exception.LibraryException: org.ansj.exception.LibraryException: path :ansj_library.properties file:/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/ansj_library.properties not found or can not to read
[2021-07-16T18:18:24,489][INFO ][o.a.d.i.File2Stream ] [node-1] path to stream default.dic
[2021-07-16T18:18:24,489][ERROR][o.a.l.DicLibrary ] [node-1] Init dic library error :java.security.AccessControlException: access denied ("java.io.FilePermission" "default.dic" "read"), path: default.dic
[2021-07-16T18:18:24,489][INFO ][o.a.d.i.File2Stream ] [node-1] path to stream dic
[2021-07-16T18:18:24,490][ERROR][o.a.l.DicLibrary ] [node-1] Init dic library error :org.ansj.exception.LibraryException: path :dic file:/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/dic not found or can not to read, path: dic
[2021-07-16T18:18:24,491][INFO ][o.a.d.i.File2Stream ] [node-1] path to stream library/ambiguity.dic
[2021-07-16T18:18:24,491][ERROR][o.a.l.AmbiguityLibrary ] [node-1] Init ambiguity library error :java.security.AccessControlException: access denied ("java.io.FilePermission" "library/ambiguity.dic" "read"), path: library/ambiguity.dic
[2021-07-16T18:18:25,423][INFO ][o.a.l.DATDictionary ] [node-1] init core library ok use time : 872
[2021-07-16T18:18:25,676][INFO ][o.a.l.NgramLibrary ] [node-1] init ngram ok use time :250
[2021-07-16T18:18:25,681][INFO ][ansj-initializer ] [node-1] init ansj plugin ok , goodluck youyou
[2021-07-16T18:18:25,882][INFO ][o.e.n.Node ] [node-1] initialized
[2021-07-16T18:18:25,883][INFO ][o.e.n.Node ] [node-1] starting ...
[2021-07-16T18:18:31,009][INFO ][o.e.t.TransportService ] [node-1] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}, {[::1]:9300}
[2021-07-16T18:18:31,015][WARN ][o.e.b.BootstrapChecks ] [node-1] the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
[2021-07-16T18:18:31,019][INFO ][o.e.c.c.Coordinator ] [node-1] cluster UUID [Fv0qn48ET1W7xvReE6QfcA]
[2021-07-16T18:18:31,027][INFO ][o.e.c.c.ClusterBootstrapService] [node-1] no discovery configuration found, will perform best-effort cluster bootstrapping after [3s] unless existing master is discovered
[2021-07-16T18:18:31,201][INFO ][o.e.c.s.MasterService ] [node-1] elected-as-master ([1] nodes joined)[{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 23, version: 208, reason: master node changed {previous [], current [{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true}]}
[2021-07-16T18:18:31,717][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [], current [{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true}]}, term: 23, version: 208, reason: Publication{term=23, version=208}
[2021-07-16T18:18:31,741][INFO ][o.e.h.AbstractHttpServerTransport] [node-1] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}, {[::1]:9200}
[2021-07-16T18:18:31,742][INFO ][o.e.n.Node ] [node-1] started
[2021-07-16T18:18:31,814][INFO ][o.e.c.s.ClusterSettings ] [node-1] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2021-07-16T18:18:31,936][INFO ][o.e.l.LicenseService ] [node-1] license [5cb32362-2a00-4d6d-8c72-0aa541924fd4] mode [basic] - valid
[2021-07-16T18:18:31,944][INFO ][o.e.g.GatewayService ] [node-1] recovered [9] indices into cluster_state
[2021-07-16T18:18:34,830][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana_task_manager][0]] ...]).
[2021-07-16T18:18:42,093][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-1] [.monitoring-es-7-2021.07.16] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[0], mappings [_doc]
[2021-07-16T18:18:42,614][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-7-2021.07.16][0]] ...]).
Installation with elasticsearch-plugin
- Use the elasticsearch-plugin tool in Elasticsearch's bin directory. From the Elasticsearch home directory, run:
./bin/elasticsearch-plugin install https://github.com/NLPchina/elasticsearch-analysis-ansj/releases/download/v6.7.0/elasticsearch-analysis-ansj-6.7.0.0-release.zip
- After installation, restart Elasticsearch
Important:
1. When installing the plugin, always create a folder such as elasticsearch-ansj-6.7.0-release first and unzip the plugin zip into it; otherwise Elasticsearch keeps reporting errors.
2. Configure the plugin according to its README.
Note: the relevant configuration options are documented in the README.md of elasticsearch-analysis-ansj.
Segmentation modes
Name | User dictionary | Number recognition | Person-name recognition | Organization recognition | New-word discovery |
---|---|---|---|---|---|
BaseAnalysis | X | X | X | X | X |
ToAnalysis | √ | √ | √ | X | X |
DicAnalysis | √ | √ | √ | X | X |
IndexAnalysis | √ | √ | √ | X | X |
NlpAnalysis | √ | √ | √ | √ | √ |
Example comparing the five modes (the analyzer classes live in org.ansj.splitWord.analysis):
String str = "洁面仪配合洁面深层清洁毛孔 清洁鼻孔面膜碎觉使劲挤才能出一点点皱纹 脸颊毛孔修复的看不见啦 草莓鼻历史遗留问题没辙 脸和脖子差不多颜色的皮肤才是健康的 长期使用安全健康的比同龄人显小五到十岁 28岁的妹子看看你们的鱼尾纹" ;
System.out.println(BaseAnalysis.parse(str));
System.out.println(ToAnalysis.parse(str));
System.out.println(DicAnalysis.parse(str));
System.out.println(IndexAnalysis.parse(str));
System.out.println(NlpAnalysis.parse(str));
ToAnalysis: precise segmentation
Precise segmentation is Ansj's house recommendation. It strikes a good balance among ease of use, stability, accuracy, and speed. If you are trying Ansj for the first time and want it to work out of the box, this mode is the safe choice.
DicAnalysis: user-dictionary-first segmentation
This mode gives the user-defined dictionary priority. If your user dictionary is good enough, or your application depends heavily on custom vocabulary, DicAnalysis is strongly recommended; in many respects its results are better than those of ToAnalysis.
NlpAnalysis: segmentation with new-word discovery
NLP segmentation is the mode that keeps surprising you: it can recognize out-of-vocabulary words. Its drawbacks are lower speed and stability. Note that "slow" here is only relative to the other Ansj modes; it still runs at roughly 400,000 characters per second.
IndexAnalysis: index-oriented segmentation
As the name suggests, this mode is suited to full-text retrieval engines such as Lucene. It mainly balances the following two concerns (see the sketch after the list):
- Recall: the segmentation should cover as many plausible terms as possible. For "上海虹桥机场南路" the recalled result is [上海/ns, 上海虹桥机场/nt, 虹桥/ns, 虹桥机场/nz, 机场/n, 南路/nr].
- Precision: precision inherently conflicts with recall. Ansj's strength is that it sidesteps this conflict neatly. For a classic ambiguous phrase such as "旅游和服务", a recall-first segmenter would typically produce "旅游 和服 服务"; Ansj never emits terms that cross term boundaries, i.e. recalled terms are only finer-grained splits of the precise segmentation. This largely resolves the conflict.
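A minimal sketch of the recall behavior described above, reusing the document's example phrase (the exact output depends on your dictionary version):
import org.ansj.splitWord.analysis.IndexAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

public class IndexRecallDemo {
    public static void main(String[] args) {
        String phrase = "上海虹桥机场南路";
        // precise segmentation: one coarse-grained path
        System.out.println(ToAnalysis.parse(phrase));
        // index-oriented segmentation: additionally recalls finer-grained sub-terms
        System.out.println(IndexAnalysis.parse(phrase));
    }
}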
BaseAnalysis: minimal-granularity segmentation
This mode guarantees only the most basic segmentation, with very fine word granularity; the vocabulary involved is roughly 100,000 words.
Configuration file
If you want more global settings to take effect when calling the library, a configuration file is indispensable. In Ansj the configuration file is named library.properties; this name is a fixed convention and cannot be changed.
Field | Default | Description |
---|---|---|
isNameRecognition | true | whether to enable person-name recognition |
isNumRecognition | true | whether to enable number recognition |
isQuantifierRecognition | true | whether to merge numbers with classifiers |
isRealName | false | whether to return the original surface form; by default the normalized form is returned |
isSkipUserDefine | false | whether the user dictionary skips words it already contains |
dic | "library/default.dic" | path of the user-defined dictionary |
dic_[key] | "your dictionary path" | separate user dictionaries for different corpora |
ambiguity | "library/ambiguity.dic" | path of the ambiguity dictionary |
ambiguity_[key] | "library/ambiguity.dic" | separate ambiguity dictionaries for different corpora |
crf | null | path of the CRF model; unset means the built-in default |
crf_[key] | "your model path" | separate CRF models for different corpora |
synonyms | "the default synonym dictionary" | path of the synonym dictionary |
synonyms_[key] | "your synonym dictionary path" | separate synonym dictionaries for different corpora |
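Besides library.properties, the boolean switches in the table above can also be flipped at runtime through the public static fields of org.ansj.util.MyStaticValue. A sketch, assuming the 5.x field names (verify them against your version):
import org.ansj.splitWord.analysis.ToAnalysis;
import org.ansj.util.MyStaticValue;

public class RuntimeConfigDemo {
    public static void main(String[] args) {
        // equivalent to the library.properties entries of the same names
        MyStaticValue.isNameRecognition = true;        // person-name recognition on
        MyStaticValue.isQuantifierRecognition = false; // do not merge numbers with classifiers
        MyStaticValue.isRealName = true;               // return the original surface form
        System.out.println(ToAnalysis.parse("一九九五年，邓颖超在北京"));
    }
}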
Default configuration file format
#path of userLibrary this is default library
dic=library/default.dic
#redress dic file path
ambiguityLibrary=library/ambiguity.dic
#set real name
isRealName=true
#isNameRecognition default true
isNameRecognition=true
#isNumRecognition default true
isNumRecognition=true
#digital quantifier merge default true
isQuantifierRecognition=true
Since version 5.1.0, dictionaries can be loaded in several ways, and you can also implement your own loading interface. The following schemes are currently supported (a sketch of a custom loader follows the list):
- Load from a file: file://c:/dic.txt; the old plain style c:/dic.txt still works
- Load from a jar: jar://org.ansj.dic.DicReader|/crf.model — after the jar:// prefix comes the fully qualified class name, then the path of the dictionary file inside that class's jar
- Load via JDBC: jdbc://jdbc:mysql://192.168.10.103:3306/dic_table?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull|username|password|select name as name,nature,freq from dic where type=1
- Load from a URL: url://http://maven.nlpcn.org/down/library/default.dic
- You can also implement your own loading scheme by extending PathToStream
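A minimal sketch of such a custom loader, assuming PathToStream in ansj 5.x declares an abstract toStream(String) method (check the exact signature and how custom schemes are registered in your version); Memory2Stream and its fixed entry are hypothetical:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.ansj.dic.PathToStream;

public class Memory2Stream extends PathToStream {
    @Override
    public InputStream toStream(String path) {
        // a real loader would interpret "path" and fetch the content, e.g. from a config service;
        // here we serve a single DicLibrary entry (word \t nature \t frequency)
        String dic = "自定义词\tuserDefine\t1000\n";
        return new ByteArrayInputStream(dic.getBytes(StandardCharsets.UTF_8));
    }
}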
Dictionary categories
Ansj currently supports the following ways of working with user-defined dictionaries:
- Loading from files
- via the configuration file
- via a path in code
- Operating on dictionaries in memory
- insert
- delete
- update
Configuration file: library.properties
#synonym dictionary
synonyms=src/main/resources/userLibrary/synonyms.dic
Note: all dictionaries are loaded lazily.
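Because loading is lazy, the first parse that needs a dictionary pays the I/O cost. A warm-up sketch, assuming DicLibrary.get() from ansj 5.x forces the default user dictionary to load:
import org.ansj.library.DicLibrary;
import org.ansj.splitWord.analysis.ToAnalysis;

public class WarmUp {
    public static void main(String[] args) {
        DicLibrary.get();          // touch the user dictionary so it loads now
        ToAnalysis.parse("预热");   // the first parse also initializes the core library
    }
}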
Ansj supports five dictionary types: DicLibrary, StopLibrary, CrfLibrary, AmbiguityLibrary, and SynonymsLibrary. Fields within a dictionary entry are separated by tabs (\t), not spaces.
- DicLibrary: user-defined dictionary; entry format: word \t nature \t frequency
小 a 57969
高 a 57483
长 a 40281
重要 a 37557
老 a 33423
- StopLibrary: stop-word dictionary; entry format: word \t stop type (the type may be empty)
is
a
#
v nature
.*了 regex
- CrfLibrary: CRF model, in binary format
- AmbiguityLibrary: ambiguity dictionary; entry format: 'word0 \t nature0 \t word1 \t nature1 ...'
习近平 nr
李民 nr 工作 vn
三个 m 和尚 n
的确 d 定 v 不 v
大 a 和尚 n
张三 nr 和 c
动漫 n 游戏 n
邓颖超 nr 生前 t
- SynonymsLibrary: synonym dictionary; entry format: 'word0 \t word1 \t word2 \t word3', with an equivalence strategy: w1=w2, w2=w3, hence w1=w3
人类 生人 全人类
人手 人员 人口 人丁 口 食指
劳力 劳动力 工作者
匹夫 个人
家伙 东西 货色 厮 崽子 兔崽子 狗崽子 小子 杂种 畜生 混蛋 王八蛋 竖子 鼠辈 小崽子
User-defined dictionary
Ansj In SpringBoot
Loading dictionaries from files, method 1:
Configuration file: library.properties
dic=src/main/resources/userLibrary
Dictionary folder: userLibrary
doctor.dic
hospital.dic
default.dic
arer.dic
Note: this loads every .dic file under userLibrary at once.
Start the service; the log confirms the dictionaries loaded successfully:
2021-07-16 19:37:05.879 INFO 5908 --- [ main] org.ansj.library.DicLibrary : load dic use time:7 path is : src/main/resources/userLibrary
@Override
public Map<String, String> dicAnalysis(String keyword) {
Map<String, String> itemsMap = new HashMap<>();
List<Term> terms = DicAnalysis.parse(keyword).getTerms();
terms.stream().forEach(v -> {
log.info("DicAnalysis term:{}", v);
itemsMap.put(v.getName(), v.getNatureStr());
});
return itemsMap;
}
/**
* Custom segmentation
* @param keyword
*/
@RequestMapping(value = "/dicAnalysis")
public @ResponseBody
Map<String, String> dicAnalysis(String keyword) {
return dicAnalysis.dicAnalysis(keyword);
}
localhost:2000/ansj-master/dic/dicAnalysis?keyword=宁夏回族自治区
Loading dictionaries from files, method 2:
package com.ansj.master.ansj.core;
import com.ansj.master.ansj.constant.SystemConstants;
import lombok.extern.slf4j.Slf4j;
import org.ansj.domain.Term;
import org.ansj.library.DicLibrary;
import org.ansj.splitWord.analysis.DicAnalysis;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.tire.library.Library;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.stereotype.Component;
import javax.annotation.PostConstruct;
import java.util.ArrayList;
import java.util.List;
@Slf4j
@Component
public class InitDictionarys {
@Autowired
private ResourcePatternResolver resourcePatResolver;
private List<String> forestsName = new ArrayList<>();
/**
* Initialize and load the dictionaries
*/
@PostConstruct
private void initAnsjLibrary() {
try {
// load the user-defined dictionaries
for(Resource resource: resourcePatResolver.getResources(SystemConstants.CLASS_PATH)) {
Forest forest = Library.makeForest(resource.getInputStream());
String key = "dic_" + resource.getFilename().replace(".dic", "");
DicLibrary.put(key, key, forest);
forestsName.add(key);
}
} catch (Exception e) {
e.printStackTrace();
}
}
/**
* Segment with Ansj using only the custom dictionaries
*/
public List<Term> parseWithLibrary(String keyWord) {
return DicAnalysis.parse(keyWord, DicLibrary.gets(forestsName)).getTerms();
}
}
@Override
public List<Term> parseWithLibrary(String keyword) {
return initDictionarys.parseWithLibrary(keyword);
}
/**
* Segment using the loaded custom dictionaries
* @param keyword
* @return
*/
@RequestMapping(value = "/parseWithLibrary")
public @ResponseBody
List<Term> parseWithLibrary(String keyword) {
return dicAnalysis.parseWithLibrary(keyword);
}
localhost:2000/ansj-master/dic/parseWithLibrary?keyword=宁夏回族自治区
Dynamic insert & delete:
/**
* Insert a new word
* @param keyword
* @return
*/
@RequestMapping(value = "/addAnalysis")
public String addAnalysis(String keyword, String str) {
return dicAnalysis.addAnalysis(keyword, str);
}
/**
* Delete a word
* @param key
* @return
*/
@RequestMapping(value = "/delAnalysis")
public String delAnalysis(String key, String str) {
return dicAnalysis.delAnalysis(key, str);
}
@Override
public String addAnalysis(String keyword, String str) {
DicLibrary.insert("dic", keyword, "userDefine", 1000);
List<Term> terms = ToAnalysis.parse(str).getTerms();
System.out.println("增加新词例子:" terms);
return str;
}
@Override
public String delAnalysis(String key, String str) {
// Delete a word; only entries in the user-defined dictionary can be removed.
DicLibrary.remove(key);
List<Term> terms = ToAnalysis.parse(str).getTerms();
System.out.println("删除用户自定义词典例子:" terms);
return str;
}
localhost:2000/ansj-master/dic/addAnalysis?keyword=ansj中文分词&str=我觉得Ansj中文分词是一个不错的系统!我是王婆!
Insert-new-word example: [我/r, 觉得/v, ansj中文分词/userDefine, 是/v, 一个/m, 不错/a, 的/u, 系统/n, !/w, 我/r, 是/v, 王婆/nr, !/w]
localhost:2000/ansj-master/dic/delAnalysis?key=dic&str=我觉得Ansj中文分词是一个不错的系统!我是王婆!
Delete-user-dictionary example: [我/r, 觉得/v, ansj/en, 中文/nz, 分词/v, 是/v, 一个/m, 不错/a, 的/u, 系统/n, !/w, 我/r, 是/v, 王婆/nr, !/w]
Ansj In Elasticsearch
Synonyms
Ansj In SpringBoot
1. Add the dictionaries
synonyms.dic: a copy of the official synonym dictionary
userLibrary.dic: add the new entry "老朽 a 2000"
2. Configuration file library.properties
#synonym dictionary
synonyms=src/main/resources/userLibrary/synonyms.dic
#user-defined dictionary
dic_custom=src/main/resources/userLibrary/userLibrary.dic
3. Test code
@RestController
@Slf4j
@RequestMapping(value = "/synonyms")
public class SynonymsController {
@Resource
private SynonymsService synonymsService;
/**
* Synonyms
* @param keyword
* @return
*/
@RequestMapping(value = "/synonymsWithLibrary")
@ResponseBody
public Map<String, Object> synonymsWithLibrary(String keyword) {
return synonymsService.synonymsWithLibrary(keyword);
}
/**
* Test Ansj synonyms with the user dictionary
* @param keyword
* @return
*/
@RequestMapping(value = "/synonymsWithUserLibrary")
public @ResponseBody Map<String, Object> synonymsWithUserLibrary(String keyword) {
return synonymsService.synonymsWithUserLibrary(keyword);
}
}
@Service
@Slf4j
public class SynonymsServiceImpl implements SynonymsService {
@Override
public Map<String, Object> synonymsWithLibrary(String keyword) {
Map<String, Object> synonymsMap = new HashMap<>();
// use the default synonym dictionary
SynonymsRecgnition synonymsRecgnition = new SynonymsRecgnition();
String str = "我国中国就是华夏,也是天朝";
for (Term term : ToAnalysis.parse("我国中国就是华夏")) {
System.out.println(term.getName() + "\t" + term.getSynonyms());
}
System.out.println("-------------init library------------------");
for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
System.out.println(term.getName() + "\t" + term.getSynonyms());
}
System.out.println("---------------insert----------------");
SynonymsLibrary.insert(SynonymsLibrary.DEFAULT, new String[] { "中国", "我国" });
for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
System.out.println(term.getName() + "\t" + term.getSynonyms());
}
System.out.println("---------------append----------------");
SynonymsLibrary.append(SynonymsLibrary.DEFAULT, new String[] { "中国", "华夏", "天朝" });
for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
System.out.println(term.getName() + "\t" + term.getSynonyms());
}
System.out.println("---------------remove----------------");
SynonymsLibrary.remove(SynonymsLibrary.DEFAULT, "我国");
for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
System.out.println(term.getName() + "\t" + term.getSynonyms());
}
return synonymsMap;
}
@Override
public Map<String, Object> synonymsWithUserLibrary(String keyword) {
Map<String, Object> synonymsMap = new HashMap<>();
new DicAnalysis()
.setForests(DicLibrary.get("dic_custom")) // segment with the custom dictionary first
.parseStr(keyword)
.recognition(new SynonymsRecgnition("synonyms")).forEach(t -> {
System.out.println("Name: " t.getName());
System.out.println("Nature: " t.getNatureStr());
System.out.println("Synonyms: " t.getSynonyms());
System.out.println("Offset: " t.getOffe());
System.out.println("RealName: " t.getRealName());
synonymsMap.put("Name: ", t.getName());
synonymsMap.put("Nature: ", t.getNatureStr());
synonymsMap.put("Synonyms: ", t.getSynonyms());
synonymsMap.put("Offset: ", t.getOffe());
synonymsMap.put("RealName: ", t.getRealName());
});
return synonymsMap;
}
}
Requests:
1.localhost:2000/ansj-master/synonyms/synonymsWithLibrary
我国 null
中国 null
就是 null
华夏 null
-------------init library------------------
我国 [本国, 我国]
中国 [中原, 华夏, 中华, 华, 赤县, 神州, 九州, 赤县神州, 炎黄, 中国, 礼仪之邦]
就是 [即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏 [中原, 华夏, 中华, 华, 赤县, 神州, 九州, 赤县神州, 炎黄, 中国, 礼仪之邦]
, null
也 [吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是 [凡是, 凡, 是, 大凡, 举凡]
天朝 null
---------------insert----------------
我国 [中国, 我国]
中国 [中国, 我国]
就是 [即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏 null
, null
也 [吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是 [凡是, 凡, 是, 大凡, 举凡]
天朝 null
---------------append----------------
我国 [我国, 中国, 华夏, 天朝]
中国 [我国, 中国, 华夏, 天朝]
就是 [即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏 [我国, 中国, 华夏, 天朝]
, null
也 [吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是 [凡是, 凡, 是, 大凡, 举凡]
天朝 [我国, 中国, 华夏, 天朝]
---------------remove----------------
我国 null
中国 [中国, 华夏, 天朝]
就是 [即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏 [中国, 华夏, 天朝]
, null
也 [吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是 [凡是, 凡, 是, 大凡, 举凡]
天朝 [中国, 华夏, 天朝]
2.localhost:2000/ansj-master/synonyms/synonymsWithUserLibrary?keyword=老朽
{
"Synonyms: ": [
"年老",
"老",
"上岁数",
"上年纪",
"高大",
"苍老",
"衰老",
"年高",
"年迈",
"老迈",
"高迈",
"白头",
"皓首",
"大年",
"老大",
"老朽",
"朽迈",
"老态龙钟",
"年事已高",
"鹤发鸡皮",
"鸡皮鹤发",
"行将就木",
"大龄",
"七老八十",
"早衰",
"老弱病残",
"年迈体弱",
"老态",
"古稀之年",
"年逾古稀"
],
"Name: ": "老朽",
"Offset: ": 0,
"RealName: ": "老朽",
"Nature: ": "a"
}
Ansj In Elasticsearch
Dictionary path: elasticsearch/elasticsearch-6.7.2/config/analysis
Index settings:
{
"settings": {
"index": {
"number_of_shards": "1",
"provided_name": "sphinx-diseasedoctor-21.07.16-102554",
"creation_date": "1626402354046",
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"index_ansj_analyzer": {
"filter": [
"my_synonym",
"asciifolding"
],
"type": "custom",
"tokenizer": "index_ansj"
},
"comma": {
"pattern": ",",
"type": "pattern"
}
}
},
"number_of_replicas": "2",
"uuid": "380mXY-ITsCFuOfFSGRqhw",
"version": {
"created": "6070299"
}
}
}
}
Stop words
Ansj In SpringBoot
1. Test code
Note: Ansj can only add stop words one by one through its API; there is no configuration-based way to add them, so no entry is needed in library.properties.
@RestController
@Slf4j
@RequestMapping(value = "/stop")
public class StopController {
@Resource
private StopService stopService;
/**
* Stop words
* @param keyword
* @return
*/
@RequestMapping(value = "/stopWithLibrary")
@ResponseBody
public Map<String, Object> stopWithLibrary(String keyword) {
return stopService.stopWithLibrary(keyword);
}
/**
* Custom stop words
* @param keyword
*/
@RequestMapping(value = "/stopWithUserLibrary")
@ResponseBody
public Map<String, Object> stopWithUserLibrary(String keyword){
return stopService.stopWithUserLibrary(keyword);
}
/**
* Insert custom stop words
* @param keyword
*/
@RequestMapping(value = "/insertStopWordsLibrary")
@ResponseBody
public Map<String, Object> insertStopWordsLibrary(String keyword){
return stopService.insertStopWordsLibrary(keyword);
}
/**
* Clear custom stop words
* @param keyword
*/
@RequestMapping(value = "/clearStopWordsLibrary")
@ResponseBody
public Map<String, Object> clearStopWordsLibrary(String keyword){
return stopService.clearStopWordsLibrary(keyword);
}
}
public interface StopService {
Map<String, Object> stopWithLibrary(String keyword);
Map<String, Object> stopWithUserLibrary(String keyword);
Map<String, Object> insertStopWordsLibrary(String keyword);
Map<String, Object> clearStopWordsLibrary(String keyword);
}
@Service
@Slf4j
public class StopServiceImpl implements StopService {
@Override
public Map<String, Object> stopWithLibrary(String keyword) {
Map<String, Object> stopMap = new HashMap<>();
Result parse = ToAnalysis.parse(keyword);
System.out.println(parse);
StopRecognition filter = new StopRecognition();
// stop all words of a given nature; e.g. adding nr would drop person names from the result
filter.insertStopNatures("uj");
filter.insertStopNatures("ul");
filter.insertStopNatures("null");
filter.insertStopWords("生活");
filter.insertStopRegexes("国.*?");
Result modifResult = parse.recognition(filter);
for (Term term : modifResult) {
stopMap.put(term.getName(), term.getNatureStr());
}
System.out.println(modifResult);
return stopMap;
}
@Override
public Map<String, Object> stopWithUserLibrary(String keyword) {
Map<String, Object> stopMap = new HashMap<>();
StopLibrary.insertStopWords(StopLibrary.DEFAULT, "的", "呵呵", "哈哈", "噢", "啊");
Result terms = ToAnalysis.parse(keyword);
// apply the stop-word library
System.out.println(terms.recognition(StopLibrary.get()));
return stopMap;
}
@Override
public Map<String, Object> insertStopWordsLibrary(String keyword) {
Map<String, Object> stopMap = new HashMap<>();
Result parse = ToAnalysis.parse(keyword);
System.out.println(parse);
StopRecognition filter = new StopRecognition();
Collection<String> filterWords = new ArrayList<>();
filterWords.add("国家");
filter.insertStopWords(filterWords);
Result modifResult = parse.recognition(filter);
for (Term term : modifResult) {
stopMap.put(term.getName(), term.getNatureStr());
}
System.out.println(modifResult);
return stopMap;
}
@Override
public Map<String, Object> clearStopWordsLibrary(String keyword) {
Map<String, Object> stopMap = new HashMap<>();
Result parse = ToAnalysis.parse(keyword);
System.out.println(parse);
StopRecognition filter = new StopRecognition();
filter.insertStopWords("咖啡");
filter.clear();
Result modifResult = parse.recognition(filter);
for (Term term : modifResult) {
stopMap.put(term.getName(), term.getNatureStr());
}
System.out.println(modifResult);
return stopMap;
}
}
Request 1: localhost:2000/ansj-master/stop/stopWithLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,的/u,质量/n,提高/v,了/u
Request 2: localhost:2000/ansj-master/stop/stopWithUserLibrary?keyword=英文版是小田亲自翻译的
英文版/n,小田/nr,亲自/d
Request 3: localhost:2000/ansj-master/stop/insertStopWordsLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,的/u,生活/vn,质量/n,提高/v,了/u
Request 4: localhost:2000/ansj-master/stop/clearStopWordsLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
Ansj In Elasticsearch
Dictionary file path: /elasticsearch/elasticsearch-6.7.2/config/ansj_dic/dic
Configuration file path: /elasticsearch/elasticsearch-6.7.2/config/elasticsearch-analysis-ansj/ansj.cfg.yml
stop: config/ansj_dic/dic/stopLibrary.dic
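A hypothetical stopLibrary.dic, following the StopLibrary format shown earlier (word, optionally followed by a tab-separated type of nature or regex):
的
呵呵
w	nature
.*了	regex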