Elasticsearch Tokenization: The Ansj Analyzer

2021-08-13 10:49:03

Introduction

Ansj is a Java implementation of Chinese word segmentation based on n-gram, CRF, and HMM models. Segmentation speed reaches roughly 2,000,000 characters per second (measured on a MacBook Air), with accuracy above 96%. It currently supports Chinese word segmentation, Chinese person-name recognition, user-defined dictionaries, keyword extraction, automatic summarization, and keyword tagging. It can be applied to natural language processing tasks and is suitable for any project with high demands on segmentation quality.

Maven dependency:
<dependency>
	<groupId>org.ansj</groupId>
	<artifactId>ansj_seg</artifactId>
	<version>5.1.6</version>
</dependency>

Configuration file: library.properties
dic=src/main/resources/userLibrary


Dictionary folder: userLibrary
	doctor.dic
	....
	hospital.dic
  • Project repository: https://github.com/NLPchina/ansj_seg
  • Project wiki: https://github.com/NLPchina/ansj_seg/wiki
  • Project documentation site: http://nlpchina.github.io/ansj_seg/
  • Jar artifact: https://mvnrepository.com/artifact/org.ansj/ansj_seg

Ansj In Elasticsearch

Official plugin repository: https://github.com/NLPchina/elasticsearch-analysis-ansj

Manual Installation

  • Download (choose the version matching your Elasticsearch): https://github.com/NLPchina/elasticsearch-analysis-ansj/releases
  • Copy the zip into Elasticsearch's plugins directory, create a new folder there, and unzip the zip into that folder
  • After installation, restart Elasticsearch
On startup, the Elasticsearch log shows the plugin being loaded and the Ansj analyzers and tokenizers being registered:
[2021-07-16T18:18:20,286][INFO ][o.e.p.PluginsService     ] [node-1] loaded module [x-pack-watcher]
[2021-07-16T18:18:20,286][INFO ][o.e.p.PluginsService     ] [node-1] loaded plugin [analysis-icu]
[2021-07-16T18:18:20,287][INFO ][o.e.p.PluginsService     ] [node-1] loaded plugin [elasticsearch-analysis-ansj]
[2021-07-16T18:18:21,892][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : base_ansj
[2021-07-16T18:18:21,892][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : index_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : query_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : dic_ansj
[2021-07-16T18:18:21,893][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer tokenizer named : nlp_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : base_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : index_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : query_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : dic_ansj
[2021-07-16T18:18:21,915][INFO ][o.a.e.p.AnalysisAnsjPlugin] [node-1] regedit analyzer provider named : nlp_ansj
[2021-07-16T18:18:23,190][INFO ][o.e.x.s.a.s.FileRolesStore] [node-1] parsed [0] roles from file [/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/config/roles.yml]
[2021-07-16T18:18:23,886][DEBUG][o.e.a.ActionModule       ] [node-1] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
[2021-07-16T18:18:24,207][INFO ][o.e.d.DiscoveryModule    ] [node-1] using discovery type [zen] and seed hosts providers [settings]
[2021-07-16T18:18:24,470][INFO ][ansj-initializer         ] [node-1] try to load ansj config file: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/config/elasticsearch-analysis-ansj/ansj.cfg.yml
[2021-07-16T18:18:24,471][INFO ][ansj-initializer         ] [node-1] try to load ansj config file: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/plugins/elasticsearch-analysis-ansj-7.2.1.0-release/config/ansj.cfg.yml
[2021-07-16T18:18:24,472][INFO ][ansj-initializer         ] [node-1] load ansj config: /Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/plugins/elasticsearch-analysis-ansj-7.2.1.0-release/config/ansj.cfg.yml
[2021-07-16T18:18:24,481][WARN ][o.a.u.MyStaticValue      ] [node-1] not find ansj_library.properties. reason: access denied ("java.io.FilePermission" "/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1" "read")
[2021-07-16T18:18:24,482][WARN ][o.a.u.MyStaticValue      ] [node-1] not find library.properties. reason: access denied ("java.io.FilePermission" "/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1" "read")
[2021-07-16T18:18:24,482][WARN ][o.a.u.MyStaticValue      ] [node-1] not find library.properties in classpath use it by default !
[2021-07-16T18:18:24,485][INFO ][o.a.d.i.File2Stream      ] [node-1] path to stream ansj_library.properties
[2021-07-16T18:18:24,486][ERROR][ansj-initializer         ] [node-1] ansj_library.properties load err: org.ansj.exception.LibraryException: org.ansj.exception.LibraryException:  path :ansj_library.properties file:/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/ansj_library.properties not found or can not to read
[2021-07-16T18:18:24,489][INFO ][o.a.d.i.File2Stream      ] [node-1] path to stream default.dic
[2021-07-16T18:18:24,489][ERROR][o.a.l.DicLibrary         ] [node-1] Init dic library error :java.security.AccessControlException: access denied ("java.io.FilePermission" "default.dic" "read"), path: default.dic
[2021-07-16T18:18:24,489][INFO ][o.a.d.i.File2Stream      ] [node-1] path to stream dic
[2021-07-16T18:18:24,490][ERROR][o.a.l.DicLibrary         ] [node-1] Init dic library error :org.ansj.exception.LibraryException:  path :dic file:/Users/lihuan/Documents/opt/elasticsearch-cluster/elasticsearch-7.2.1/elasticsearch-7.2.1/dic not found or can not to read, path: dic
[2021-07-16T18:18:24,491][INFO ][o.a.d.i.File2Stream      ] [node-1] path to stream library/ambiguity.dic
[2021-07-16T18:18:24,491][ERROR][o.a.l.AmbiguityLibrary   ] [node-1] Init ambiguity library error :java.security.AccessControlException: access denied ("java.io.FilePermission" "library/ambiguity.dic" "read"), path: library/ambiguity.dic
[2021-07-16T18:18:25,423][INFO ][o.a.l.DATDictionary      ] [node-1] init core library ok use time : 872
[2021-07-16T18:18:25,676][INFO ][o.a.l.NgramLibrary       ] [node-1] init ngram ok use time :250
[2021-07-16T18:18:25,681][INFO ][ansj-initializer         ] [node-1] init ansj plugin ok , goodluck youyou
[2021-07-16T18:18:25,882][INFO ][o.e.n.Node               ] [node-1] initialized
[2021-07-16T18:18:25,883][INFO ][o.e.n.Node               ] [node-1] starting ...
[2021-07-16T18:18:31,009][INFO ][o.e.t.TransportService   ] [node-1] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}, {[::1]:9300}
[2021-07-16T18:18:31,015][WARN ][o.e.b.BootstrapChecks    ] [node-1] the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
[2021-07-16T18:18:31,019][INFO ][o.e.c.c.Coordinator      ] [node-1] cluster UUID [Fv0qn48ET1W7xvReE6QfcA]
[2021-07-16T18:18:31,027][INFO ][o.e.c.c.ClusterBootstrapService] [node-1] no discovery configuration found, will perform best-effort cluster bootstrapping after [3s] unless existing master is discovered
[2021-07-16T18:18:31,201][INFO ][o.e.c.s.MasterService    ] [node-1] elected-as-master ([1] nodes joined)[{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 23, version: 208, reason: master node changed {previous [], current [{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true}]}
[2021-07-16T18:18:31,717][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [], current [{node-1}{0bCCNIOgT96vNrRVaTnXXQ}{6aRu-ht-QreEGbLVlQerqw}{localhost}{127.0.0.1:9300}{xpack.installed=true}]}, term: 23, version: 208, reason: Publication{term=23, version=208}
[2021-07-16T18:18:31,741][INFO ][o.e.h.AbstractHttpServerTransport] [node-1] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}, {[::1]:9200}
[2021-07-16T18:18:31,742][INFO ][o.e.n.Node               ] [node-1] started
[2021-07-16T18:18:31,814][INFO ][o.e.c.s.ClusterSettings  ] [node-1] updating [xpack.monitoring.collection.enabled] from [false] to [true]
[2021-07-16T18:18:31,936][INFO ][o.e.l.LicenseService     ] [node-1] license [5cb32362-2a00-4d6d-8c72-0aa541924fd4] mode [basic] - valid
[2021-07-16T18:18:31,944][INFO ][o.e.g.GatewayService     ] [node-1] recovered [9] indices into cluster_state
[2021-07-16T18:18:34,830][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana_task_manager][0]] ...]).
[2021-07-16T18:18:42,093][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-1] [.monitoring-es-7-2021.07.16] creating index, cause [auto(bulk api)], templates [.monitoring-es], shards [1]/[0], mappings [_doc]
[2021-07-16T18:18:42,614][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-7-2021.07.16][0]] ...]).
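Once the node has started, the installation can be checked with the `_analyze` API, using one of the tokenizer names registered in the log above (the sample text here is arbitrary):

```json
GET /_analyze
{
  "analyzer": "index_ansj",
  "text": "中文分词测试"
}
```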

Plugin Installation

  • Use the elasticsearch-plugin tool in Elasticsearch's bin directory. From the Elasticsearch home directory, run the following command:

./bin/elasticsearch-plugin install https://github.com/NLPchina/elasticsearch-analysis-ansj/releases/download/v6.7.0/elasticsearch-analysis-ansj-6.7.0.0-release.zip
  • After installation, restart Elasticsearch

Important:

1. When installing the tokenizer plugin manually, always create a folder first (e.g. elasticsearch-ansj-6.7.0-release) and unzip the plugin zip into it; otherwise Elasticsearch keeps reporting errors.

2. Configure the plugin according to its README.

Note: see the README.md of elasticsearch-analysis-ansj for the relevant configuration options.

Tokenization Modes

| Mode | User dictionary | Number recognition | Person-name recognition | Organization-name recognition | New-word discovery |
| --- | --- | --- | --- | --- | --- |
| BaseAnalysis | ✗ | ✗ | ✗ | ✗ | ✗ |
| ToAnalysis | ✓ | ✓ | ✓ | ✗ | ✗ |
| DicAnalysis | ✓ | ✓ | ✓ | ✗ | ✗ |
| IndexAnalysis | ✓ | ✓ | ✓ | ✗ | ✗ |
| NlpAnalysis | ✓ | ✓ | ✓ | ✓ | ✓ |

A quick comparison of the five modes on the same input:
String str = "洁面仪配合洁面深层清洁毛孔 清洁鼻孔面膜碎觉使劲挤才能出一点点皱纹 脸颊毛孔修复的看不见啦 草莓鼻历史遗留问题没辙 脸和脖子差不多颜色的皮肤才是健康的 长期使用安全健康的比同龄人显小五到十岁 28岁的妹子看看你们的鱼尾纹";

System.out.println(BaseAnalysis.parse(str));
System.out.println(ToAnalysis.parse(str));
System.out.println(DicAnalysis.parse(str));
System.out.println(IndexAnalysis.parse(str));
System.out.println(NlpAnalysis.parse(str));

ToAnalysis: Precise Segmentation

Precise segmentation is Ansj's recommended default. It strikes a good balance between ease of use, stability, accuracy, and segmentation speed. If you are trying Ansj for the first time, or want something that works out of the box, this mode is the safe choice.

DicAnalysis: User-Dictionary-First Segmentation

This mode gives the user-defined dictionary priority. If your custom dictionary is good enough, or your use case places heavy demands on it, DicAnalysis is strongly recommended; in many respects its results are better than those of ToAnalysis.

NlpAnalysis: Segmentation with New-Word Discovery

NLP segmentation is the mode that keeps surprising you: it can recognize out-of-vocabulary words. Its drawbacks are lower speed and stability. ("Slow" here is only relative to the other modes; it still processes around 400,000 characters per second.)

IndexAnalysis: Index-Oriented Segmentation

Index-oriented segmentation is, as the name suggests, suited to full-text retrieval systems such as Lucene. It mainly balances two concerns:

  • Recall: the segmentation result should cover the text as fully as possible. For example, for "上海虹桥机场南路" the recalled terms are [上海/ns, 上海虹桥机场/nt, 虹桥/ns, 虹桥机场/nz, 机场/n, 南路/nr].
  • Precision: precision is inherently in tension with recall, and Ansj's strength is that it sidesteps the conflict neatly. Take the classic ambiguous phrase "旅游和服务": a tokenizer that simply maximizes recall would emit "旅游 和服 服务", but Ansj never produces terms that cross a boundary of the precise segmentation. In other words, the recalled terms are only finer-grained subdivisions of the precise result, which resolves the conflict nicely.

BaseAnalysis: Minimal-Granularity Segmentation

This mode only guarantees the most basic segmentation, with very fine word granularity; its vocabulary covers roughly 100,000 words.

Configuration File

If you want to apply global settings at call time, a configuration file is indispensable. In Ansj the configuration file is named library.properties; this name is a fixed convention and cannot be changed.

| Field | Default | Description |
| --- | --- | --- |
| isNameRecognition | true | enable person-name recognition |
| isNumRecognition | true | enable number recognition |
| isQuantifierRecognition | true | merge numbers with their measure words |
| isRealName | false | return the original surface form of words (by default the normalized form is returned) |
| isSkipUserDefine | false | skip entries in the user dictionary that duplicate existing words |
| dic | "library/default.dic" | path of the user-defined dictionary |
| dic_[key] | "your dictionary path" | use a different user dictionary per corpus |
| ambiguity | "library/ambiguity.dic" | path of the ambiguity dictionary |
| ambiguity_[key] | "library/ambiguity.dic" | use a different ambiguity dictionary per corpus |
| crf | null | path of the CRF model (unset means the default model) |
| crf_[key] | "your model path" | use a different CRF model per corpus |
| synonyms | "the default synonyms dictionary" | path of the synonyms dictionary |
| synonyms_[key] | "your synonyms dictionary path" | use a different synonyms dictionary per corpus |

The default configuration file format:

#path of userLibrary this is default library
dic=library/default.dic

#redress dic file path
ambiguityLibrary=library/ambiguity.dic

#set real name
isRealName=true

#isNameRecognition default true
isNameRecognition=true

#isNumRecognition default true
isNumRecognition=true

#digital quantifier merge default true
isQuantifierRecognition=true

Since version 5.1.0 you can load your dictionaries in several ways, and you can also implement your own dictionary-loading interface. The following schemes are currently supported:

  • From a file: file://c:/dic.txt (the old plain form c:/dic.txt still works)
  • From a jar: jar://org.ansj.dic.DicReader|/crf.model. Paths starting with jar:// give the fully qualified class name first, then the path of the dictionary file within that class's jar
  • From JDBC: jdbc://jdbc:mysql://192.168.10.103:3306/dic_table?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull|username|password|select name as name,nature,freq from dic where type=1
  • From a URL: url://http://maven.nlpcn.org/down/library/default.dic
  • You can also implement your own loading scheme by extending PathToStream
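As an illustration, the schemes above could be combined in a single library.properties; all paths here, and the dic_news key, are hypothetical:

```properties
# load the default user dictionary from a local file
dic=file://c:/dic.txt
# load a CRF model bundled inside a jar
crf=jar://org.ansj.dic.DicReader|/crf.model
# load a per-corpus user dictionary from a URL
dic_news=url://http://maven.nlpcn.org/down/library/default.dic
```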

Dictionary Types

Ansj currently supports the following ways of working with user-defined dictionaries:

  • Loading a dictionary from a file
    • via the configuration file
    • via a path given in code
  • Operating on a dictionary in memory
    • insert
    • delete
    • update
Configuration file: library.properties

#synonyms support
synonyms=src/main/resources/userLibrary/synonyms.dic

Note: all dictionaries are loaded lazily.

Ansj supports five dictionary types: DicLibrary, StopLibrary, CrfLibrary, AmbiguityLibrary, and SynonymsLibrary. Fields within a dictionary entry are separated by tabs (\t), not spaces.
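Because fields are tab-separated, a dictionary entry can be split with the standard library alone; a minimal sketch, using the sample DicLibrary entry from below:

```java
public class DicLineParser {
    public static void main(String[] args) {
        // a DicLibrary entry: word <tab> nature (part of speech) <tab> frequency
        String line = "重要\ta\t37557";
        String[] cols = line.split("\t");
        // splitting on a space instead would leave the whole line as one field
        System.out.println(cols[0] + " / " + cols[1] + " / " + cols[2]);
        // prints: 重要 / a / 37557
    }
}
```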

  • DicLibrary: the user-defined dictionary; entry format: word \t nature \t frequency
小	a	57969
高	a	57483
长	a	40281
重要	a	37557
老	a	33423
  • StopLibrary: the stop-word dictionary; entry format: word \t stop-word type (the type may be empty)
is
a
#
v	nature
.*了	regex
  • CrfLibrary: the CRF model, stored in a binary format
  • AmbiguityLibrary: the ambiguity dictionary; entry format: 'word0 \t nature0 \t word1 \t nature1 ...'
习近平	nr
李民	nr	工作	vn
三个	m	和尚	n
的确	d	定	v	不	v
大	a	和尚	n
张三	nr	和	c
动漫	n	游戏	n
邓颖超	nr	生前	t 
  • SynonymsLibrary: the synonyms dictionary; entry format: 'word0 \t word1 \t word2 \t word3'. Synonymy is applied transitively: if w1=w2 and w2=w3, then w1=w3.
人类	生人	全人类
人手	人员	人口	人丁	口	食指
劳力	劳动力	工作者
匹夫	个人
家伙	东西	货色	厮	崽子	兔崽子	狗崽子	小子	杂种	畜生	混蛋	王八蛋	竖子	鼠辈	小崽子

User-Defined Dictionaries

Ansj In SpringBoot

Loading dictionaries from a file, method 1:

Configuration file: library.properties
dic=src/main/resources/userLibrary

Dictionary folder: userLibrary
doctor.dic
hospital.dic
default.dic
arer.dic

Note: this loads every .dic file under userLibrary at once.
When the service starts, the following log line confirms that the dictionaries were loaded:
2021-07-16 19:37:05.879  INFO 5908 --- [           main] org.ansj.library.DicLibrary              : load dic use time:7 path is : src/main/resources/userLibrary

@Override
public Map<String, String> dicAnalysis(String keyword) {

    Map<String, String> itemsMap = new HashMap<>();
    List<Term> terms = DicAnalysis.parse(keyword).getTerms();
    terms.stream().forEach(v -> {
        log.info("DicAnalysis term:{}", v);
        itemsMap.put(v.getName(), v.getNatureStr());
    });
    return itemsMap;
}

/**
 * Dictionary-based segmentation
 * @param keyword text to segment
 */
@RequestMapping(value = "/dicAnalysis")
public @ResponseBody
Map<String, String> dicAnalysis(String keyword) {
    return dicAnalysis.dicAnalysis(keyword);
}

localhost:2000/ansj-master/dic/dicAnalysis?keyword=宁夏回族自治区

Loading dictionaries from a file, method 2:

package com.ansj.master.ansj.core;

import com.ansj.master.ansj.constant.SystemConstants;
import lombok.extern.slf4j.Slf4j;
import org.ansj.domain.Term;
import org.ansj.library.DicLibrary;
import org.ansj.splitWord.analysis.DicAnalysis;
import org.nlpcn.commons.lang.tire.domain.Forest;
import org.nlpcn.commons.lang.tire.library.Library;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.ResourcePatternResolver;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import java.util.ArrayList;
import java.util.List;

@Slf4j
@Component
public class InitDictionarys {

    @Autowired
    private ResourcePatternResolver resourcePatResolver;

    private List<String> forestsName = new ArrayList<>();

    /**
     * Load the dictionaries at startup
     */
    @PostConstruct
    private void initAnsjLibrary() {

        try {
            // load each user-defined dictionary found on the classpath
            for (Resource resource : resourcePatResolver.getResources(SystemConstants.CLASS_PATH)) {
                Forest forest = Library.makeForest(resource.getInputStream());

                String key = "dic_" + resource.getFilename().replace(".dic", "");
                DicLibrary.put(key, key, forest);
                forestsName.add(key);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Segment with Ansj using only the user-defined dictionaries
     */
    public List<Term> parseWithLibrary(String keyWord) {
        return DicAnalysis.parse(keyWord, DicLibrary.gets(forestsName)).getTerms();
    }
}


@Override
public List<Term> parseWithLibrary(String keyword) {
    return initDictionarys.parseWithLibrary(keyword);
}

/**
 * Segment using the user-defined dictionaries
 * @param keyword text to segment
 * @return segmentation terms
 */
@RequestMapping(value = "/parseWithLibrary")
public @ResponseBody
List<Term> parseWithLibrary(String keyword) {
    return dicAnalysis.parseWithLibrary(keyword);
}

localhost:2000/ansj-master/dic/parseWithLibrary?keyword=宁夏回族自治区

Dynamic insert & remove:

/**
 * Add a word to the user dictionary
 * @param keyword word to add
 * @return the segmented text
 */
@RequestMapping(value = "/addAnalysis")
public String addAnalysis(String keyword, String str) {
    return dicAnalysis.addAnalysis(keyword, str);
}

/**
 * Remove a user dictionary
 * @param key dictionary key
 * @return the segmented text
 */
@RequestMapping(value = "/delAnalysis")
public String delAnalysis(String key, String str) {
    return dicAnalysis.delAnalysis(key, str);
}

@Override
public String addAnalysis(String keyword, String str) {

    DicLibrary.insert("dic", keyword, "userDefine", 1000);
    List<Term> terms = ToAnalysis.parse(str).getTerms();
    System.out.println("After inserting the new word: " + terms);
    return str;
}

@Override
public String delAnalysis(String key, String str) {

    // remove a dictionary; only user-defined dictionaries can be removed
    DicLibrary.remove(key);
    List<Term> terms = ToAnalysis.parse(str).getTerms();
    System.out.println("After removing the user dictionary: " + terms);
    return str;
}

localhost:2000/ansj-master/dic/addAnalysis?keyword=ansj中文分词&str=我觉得Ansj中文分词是一个不错的系统!我是王婆!
After inserting the new word: [我/r, 觉得/v, ansj中文分词/userDefine, 是/v, 一个/m, 不错/a, 的/u, 系统/n, !/w, 我/r, 是/v, 王婆/nr, !/w]

localhost:2000/ansj-master/dic/delAnalysis?key=dic&str=我觉得Ansj中文分词是一个不错的系统!我是王婆!
After removing the user dictionary: [我/r, 觉得/v, ansj/en, 中文/nz, 分词/v, 是/v, 一个/m, 不错/a, 的/u, 系统/n, !/w, 我/r, 是/v, 王婆/nr, !/w]

Ansj In Elasticsearch

Synonyms

Ansj In SpringBoot

1. Add dictionaries:
    synonyms.dic: a copy of the official synonyms dictionary
    userLibrary.dic: add the new entry "老朽	a	2000"
2. Configuration file library.properties:
    #synonyms support
    synonyms=src/main/resources/userLibrary/synonyms.dic
    #user-defined dictionary support
    dic_custom=src/main/resources/userLibrary/userLibrary.dic
3. Test code:
@RestController
@Slf4j
@RequestMapping(value = "/synonyms")
public class SynonymsController {

    @Resource
    private SynonymsService synonymsService;

    /**
     * Synonym lookup using the default dictionary
     * @param keyword text to segment
     * @return terms with their synonyms
     */
    @RequestMapping(value = "/synonymsWithLibrary")
    @ResponseBody
    public Map<String, Object> synonymsWithLibrary(String keyword) {
        return synonymsService.synonymsWithLibrary(keyword);
    }

    /**
     * Synonym lookup using a user-defined dictionary
     * @param keyword text to segment
     * @return terms with their synonyms
     */
    @RequestMapping(value = "/synonymsWithUserLibrary")
    public @ResponseBody Map<String, Object> synonymsWithUserLibrary(String keyword) {
        return synonymsService.synonymsWithUserLibrary(keyword);
    }
}
@Service
@Slf4j
public class SynonymsServiceImpl implements SynonymsService {


    @Override
    public Map<String, Object> synonymsWithLibrary(String keyword) {

        Map<String, Object> synonymsMap = new HashMap<>();

        // use the default synonyms dictionary
        SynonymsRecgnition synonymsRecgnition = new SynonymsRecgnition();

        String str = "我国中国就是华夏,也是天朝";

        for (Term term : ToAnalysis.parse("我国中国就是华夏")) {
            System.out.println(term.getName() + "\t" + term.getSynonyms());
        }

        System.out.println("-------------init library------------------");

        for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
            System.out.println(term.getName() + "\t" + term.getSynonyms());
        }

        System.out.println("---------------insert----------------");
        SynonymsLibrary.insert(SynonymsLibrary.DEFAULT, new String[] { "中国", "我国" });

        for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
            System.out.println(term.getName() + "\t" + term.getSynonyms());
        }

        System.out.println("---------------append----------------");
        SynonymsLibrary.append(SynonymsLibrary.DEFAULT, new String[] { "中国", "华夏", "天朝" });

        for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
            System.out.println(term.getName() + "\t" + term.getSynonyms());
        }

        System.out.println("---------------remove----------------");
        SynonymsLibrary.remove(SynonymsLibrary.DEFAULT, "我国");

        for (Term term : ToAnalysis.parse(str).recognition(synonymsRecgnition)) {
            System.out.println(term.getName() + "\t" + term.getSynonyms());
        }
        return synonymsMap;
    }

    @Override
    public Map<String, Object> synonymsWithUserLibrary(String keyword) {

        Map<String, Object> synonymsMap = new HashMap<>();

        new DicAnalysis()
                .setForests(DicLibrary.get("dic_custom")) // segment with the custom dictionary first
                .parseStr(keyword)
                .recognition(new SynonymsRecgnition("synonyms")).forEach(t -> {
            System.out.println("Name: " + t.getName());
            System.out.println("Nature: " + t.getNatureStr());
            System.out.println("Synonyms: " + t.getSynonyms());
            System.out.println("Offset: " + t.getOffe());
            System.out.println("RealName: " + t.getRealName());

            synonymsMap.put("Name: ", t.getName());
            synonymsMap.put("Nature: ", t.getNatureStr());
            synonymsMap.put("Synonyms: ", t.getSynonyms());
            synonymsMap.put("Offset: ", t.getOffe());
            synonymsMap.put("RealName: ", t.getRealName());
        });
        return synonymsMap;
    }
}
Requests:

1. localhost:2000/ansj-master/synonyms/synonymsWithLibrary
我国	null
中国	null
就是	null
华夏	null
-------------init library------------------
我国	[本国, 我国]
中国	[中原, 华夏, 中华, 华, 赤县, 神州, 九州, 赤县神州, 炎黄, 中国, 礼仪之邦]
就是	[即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏	[中原, 华夏, 中华, 华, 赤县, 神州, 九州, 赤县神州, 炎黄, 中国, 礼仪之邦]
,	null
也	[吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是	[凡是, 凡, 是, 大凡, 举凡]
天朝	null
---------------insert----------------
我国	[中国, 我国]
中国	[中国, 我国]
就是	[即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏	null
,	null
也	[吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是	[凡是, 凡, 是, 大凡, 举凡]
天朝	null
---------------append----------------
我国	[我国, 中国, 华夏, 天朝]
中国	[我国, 中国, 华夏, 天朝]
就是	[即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏	[我国, 中国, 华夏, 天朝]
,	null
也	[吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是	[凡是, 凡, 是, 大凡, 举凡]
天朝	[我国, 中国, 华夏, 天朝]
---------------remove----------------
我国	null
中国	[中国, 华夏, 天朝]
就是	[即使, 即若, 即令, 即或, 即便, 就算, 就是, 尽管, 哪怕, 不怕, 纵使, 纵令, 纵然, 纵, 饶, 即, 就, 便, 虽, 即使如此, 不畏]
华夏	[中国, 华夏, 天朝]
,	null
也	[吗, 呢, 吧, 乎, 啊, 否, 欤, 耶, 邪, 为, 哉, 也, 也罢, 与否]
是	[凡是, 凡, 是, 大凡, 举凡]
天朝	[中国, 华夏, 天朝]

2. localhost:2000/ansj-master/synonyms/synonymsWithUserLibrary?keyword=老朽
{
    "Synonyms: ": [
        "年老",
        "老",
        "上岁数",
        "上年纪",
        "高大",
        "苍老",
        "衰老",
        "年高",
        "年迈",
        "老迈",
        "高迈",
        "白头",
        "皓首",
        "大年",
        "老大",
        "老朽",
        "朽迈",
        "老态龙钟",
        "年事已高",
        "鹤发鸡皮",
        "鸡皮鹤发",
        "行将就木",
        "大龄",
        "七老八十",
        "早衰",
        "老弱病残",
        "年迈体弱",
        "老态",
        "古稀之年",
        "年逾古稀"
    ],
    "Name: ": "老朽",
    "Offset: ": 0,
    "RealName: ": "老朽",
    "Nature: ": "a"
}

Ansj In Elasticsearch

Dictionary path: elasticsearch/elasticsearch-6.7.2/config/analysis

Index settings:

{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "provided_name": "sphinx-diseasedoctor-21.07.16-102554",
      "creation_date": "1626402354046",
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          }
        },
        "analyzer": {
          "index_ansj_analyzer": {
            "filter": [
              "my_synonym",
              "asciifolding"
            ],
            "type": "custom",
            "tokenizer": "index_ansj"
          },
          "comma": {
            "pattern": ",",
            "type": "pattern"
          }
        }
      },
      "number_of_replicas": "2",
      "uuid": "380mXY-ITsCFuOfFSGRqhw",
      "version": {
        "created": "6070299"
      }
    }
  }
}
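With these settings in place, the custom analyzer (with its synonym filter) can be exercised through the index's `_analyze` API; the sample text here is arbitrary:

```json
POST /sphinx-diseasedoctor-21.07.16-102554/_analyze
{
  "analyzer": "index_ansj_analyzer",
  "text": "我国中国就是华夏"
}
```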

Stop Words

Ansj In SpringBoot

1. Test code

Note: Ansj only exposes methods for adding stop words one at a time; they cannot be added through configuration, so no entry in library.properties is needed.
@RestController
@Slf4j
@RequestMapping(value = "/stop")
public class StopController {

    @Resource
    private StopService stopService;

    /**
     * Stop-word filtering with a StopRecognition filter
     * @param keyword text to segment
     * @return filtered terms
     */
    @RequestMapping(value = "/stopWithLibrary")
    @ResponseBody
    public Map<String, Object> stopWithLibrary(String keyword) {
        return stopService.stopWithLibrary(keyword);
    }

    /**
     * Stop-word filtering with the default StopLibrary
     * @param keyword text to segment
     */
    @RequestMapping(value = "/stopWithUserLibrary")
    @ResponseBody
    public Map<String, Object> stopWithUserLibrary(String keyword){
        return stopService.stopWithUserLibrary(keyword);
    }

    /**
     * Add custom stop words
     * @param keyword text to segment
     */
    @RequestMapping(value = "/insertStopWordsLibrary")
    @ResponseBody
    public Map<String, Object> insertStopWordsLibrary(String keyword){
        return stopService.insertStopWordsLibrary(keyword);
    }

    /**
     * Clear custom stop words
     * @param keyword text to segment
     */
    @RequestMapping(value = "/clearStopWordsLibrary")
    @ResponseBody
    public Map<String, Object> clearStopWordsLibrary(String keyword){
        return stopService.clearStopWordsLibrary(keyword);
    }
}
public interface StopService {

    Map<String, Object> stopWithLibrary(String keyword);

    Map<String, Object> stopWithUserLibrary(String keyword);

    Map<String, Object> insertStopWordsLibrary(String keyword);

    Map<String, Object> clearStopWordsLibrary(String keyword);
}
@Service
@Slf4j
public class StopServiceImpl implements StopService {

    @Override
    public Map<String, Object> stopWithLibrary(String keyword) {

        Map<String, Object> stopMap = new HashMap<>();

        Result parse = ToAnalysis.parse(keyword);

        System.out.println(parse);

        StopRecognition filter = new StopRecognition();

        // filter terms by part of speech: adding "nr", for example, would drop person names from the result
        filter.insertStopNatures("uj");
        filter.insertStopNatures("ul");
        filter.insertStopNatures("null");
        filter.insertStopWords("生活");
        filter.insertStopRegexes("国.*?");

        Result modifResult = parse.recognition(filter);

        for (Term term : modifResult) {
            stopMap.put(term.getName(), term.getNatureStr());
        }

        System.out.println(modifResult);

        return stopMap;
    }


    @Override
    public Map<String, Object> stopWithUserLibrary(String keyword) {

        Map<String, Object> stopMap = new HashMap<>();

        StopLibrary.insertStopWords(StopLibrary.DEFAULT, "的", "呵呵", "哈哈", "噢", "啊");
        Result terms = ToAnalysis.parse(keyword);
        // apply the stop-word list
        System.out.println(terms.recognition(StopLibrary.get()));

        return stopMap;
    }

    @Override
    public Map<String, Object> insertStopWordsLibrary(String keyword) {

        Map<String, Object> stopMap = new HashMap<>();

        Result parse = ToAnalysis.parse(keyword);
        System.out.println(parse);
        StopRecognition filter = new StopRecognition();
        Collection<String> filterWords = new ArrayList<>();
        filterWords.add("国家");
        filter.insertStopWords(filterWords);
        Result modifResult = parse.recognition(filter);
        for (Term term : modifResult) {
            stopMap.put(term.getName(), term.getNatureStr());
        }
        System.out.println(modifResult);
        return stopMap;
    }

    @Override
    public Map<String, Object> clearStopWordsLibrary(String keyword) {

        Map<String, Object> stopMap = new HashMap<>();

        Result parse = ToAnalysis.parse(keyword);
        System.out.println(parse);
        StopRecognition filter = new StopRecognition();
        filter.insertStopWords("咖啡");
        filter.clear();
        Result modifResult = parse.recognition(filter);
        for (Term term : modifResult) {
            stopMap.put(term.getName(), term.getNatureStr());
        }
        System.out.println(modifResult);
        return stopMap;
    }
}
Request 1: localhost:2000/ansj-master/stop/stopWithLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,的/u,质量/n,提高/v,了/u

Request 2: localhost:2000/ansj-master/stop/stopWithUserLibrary?keyword=英文版是小田亲自翻译的
英文版/n,小田/nr,亲自/d

Request 3: localhost:2000/ansj-master/stop/insertStopWordsLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,的/u,生活/vn,质量/n,提高/v,了/u

Request 4: localhost:2000/ansj-master/stop/clearStopWordsLibrary?keyword=咖啡国家的生活质量提高了
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u
咖啡/n,国家/n,的/u,生活/vn,质量/n,提高/v,了/u

Ansj In Elasticsearch

Dictionary file path: /elasticsearch/elasticsearch-6.7.2/config/ansj_dic/dic

Configuration file path: /elasticsearch/elasticsearch-6.7.2/config/elasticsearch-analysis-ansj/ansj.cfg.yml

stop: config/ansj_dic/dic/stopLibrary.dic

Ambiguity Words
