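Compiled from the latest official documentation on GitHub.
1 Installing with pip in the Terminal
hanlp relies on deep learning frameworks such as PyTorch and TensorFlow and is aimed at professional NLP engineers, researchers, and workloads with large volumes of local data. It requires Python 3.6 to 3.10 and supports Windows, although *nix is recommended. It runs on CPU, but a GPU or TPU is recommended. Install the PyTorch version as follows.
Turn off any network proxy during installation.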
- STEP1
pip install hanlp
Output:
(MyTest) C:\Users\Lenovo\PycharmProjects\MyTest>pip install hanlp
Collecting hanlp
Downloading hanlp-2.1.0b52-py3-none-any.whl (651 kB)
---------------------------------------- 651.5/651.5 kB 1.2 MB/s eta 0:00:00
Collecting pynvml
Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
---------------------------------------- 53.1/53.1 kB ? eta 0:00:00
Collecting transformers>=4.1.1
Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
---------------------------------------- 7.2/7.2 MB 5.5 MB/s eta 0:00:00
Collecting hanlp-trie>=0.0.4
Downloading hanlp_trie-0.0.5.tar.gz (6.7 kB)
Preparing metadata (setup.py) ... done
Collecting toposort==1.5
Downloading toposort-1.5-py2.py3-none-any.whl (7.6 kB)
Collecting hanlp-common>=0.0.19
Downloading hanlp_common-0.0.19.tar.gz (28 kB)
Preparing metadata (setup.py) ... done
Collecting termcolor
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Requirement already satisfied: hanlp-downloader in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from hanlp) (0.0.25)
Collecting torch>=1.6.0
Downloading torch-1.13.1-cp37-cp37m-win_amd64.whl (162.6 MB)
---------------------------------------- 162.6/162.6 MB 6.3 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.91
Downloading sentencepiece-0.1.99-cp37-cp37m-win_amd64.whl (977 kB)
---------------------------------------- 977.7/977.7 kB 10.3 MB/s eta 0:00:00
Collecting phrasetree
Downloading phrasetree-0.0.8.tar.gz (42 kB)
---------------------------------------- 42.2/42.2 kB 2.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting typing-extensions
Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting regex!=2019.12.17
Downloading regex-2023.10.3-cp37-cp37m-win_amd64.whl (269 kB)
---------------------------------------- 269.9/269.9 kB 17.3 MB/s eta 0:00:00
Collecting filelock
Downloading filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting importlib-metadata
Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting huggingface-hub<1.0,>=0.14.1
Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
---------------------------------------- 268.8/268.8 kB 8.3 MB/s eta 0:00:00
Requirement already satisfied: requests in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from transformers>=4.1.1->hanlp) (2.31.0)
Collecting tqdm>=4.27
Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
---------------------------------------- 78.3/78.3 kB ? eta 0:00:00
Collecting numpy>=1.17
Downloading numpy-1.21.6-cp37-cp37m-win_amd64.whl (14.0 MB)
---------------------------------------- 14.0/14.0 MB 11.7 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
Downloading tokenizers-0.13.3-cp37-cp37m-win_amd64.whl (3.5 MB)
---------------------------------------- 3.5/3.5 MB 12.3 MB/s eta 0:00:00
Collecting safetensors>=0.3.1
Downloading safetensors-0.4.0-cp37-none-win_amd64.whl (277 kB)
---------------------------------------- 277.3/277.3 kB 17.8 MB/s eta 0:00:00
Collecting pyyaml>=5.1
Downloading PyYAML-6.0.1-cp37-cp37m-win_amd64.whl (153 kB)
---------------------------------------- 153.2/153.2 kB 9.5 MB/s eta 0:00:00
Collecting packaging>=20.0
Downloading packaging-23.2-py3-none-any.whl (53 kB)
---------------------------------------- 53.0/53.0 kB ? eta 0:00:00
Collecting fsspec
Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
---------------------------------------- 143.0/143.0 kB ? eta 0:00:00
Collecting colorama
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting zipp>=0.5
Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2.0.7)
Requirement already satisfied: idna<4,>=2.5 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.4)
Building wheels for collected packages: hanlp-common, hanlp-trie, phrasetree
Building wheel for hanlp-common (setup.py) ... done
Created wheel for hanlp-common: filename=hanlp_common-0.0.19-py3-none-any.whl size=30650 sha256=d3135f8a0e8bde4ff02320c6c84f1d809a9357f9ae2524a5bd99d4a096d2db2e
Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\f2\70\bf\57226335746d58210d202e3a64428b8e3b4d57ca373f26d77b
Building wheel for hanlp-trie (setup.py) ... done
Created wheel for hanlp-trie: filename=hanlp_trie-0.0.5-py3-none-any.whl size=6831 sha256=87b214b03fe0473f53b8b12a34ed2a2bee54a152b96c9531ad22d6044b6eb790
Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\69\ce\b1\c15e96cb4d3b170d002be4b8fd14c2185d32111080f352b3e6
Building wheel for phrasetree (setup.py) ... done
Created wheel for phrasetree: filename=phrasetree-0.0.8-py3-none-any.whl size=44234 sha256=e86b74c1ad7ebacc6dceeace9ea3b9452a2d4367ebf2080b253bb4bfef8fb53d
Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\c2\81\3f\3ed1a1f06d94d021590de96e6953e44854599db1cd90d66846
Successfully built hanlp-common hanlp-trie phrasetree
Installing collected packages: toposort, tokenizers, sentencepiece, phrasetree, zipp, typing-extensions, termcolor, safetensors, regex, pyyaml, pynvml, packaging, numpy, hanlp-common, fsspec, filelock, colorama, tqdm, torch, importlib-metadata, hanlp-trie, huggingface-hub, transformers, hanlp
Successfully installed colorama-0.4.6 filelock-3.12.2 fsspec-2023.1.0 hanlp-2.1.0b52 hanlp-common-0.0.19 hanlp-trie-0.0.5 huggingface-hub-0.16.4 importlib-metadata-6.7.0 numpy-1.21.6 packaging-23.2 phrasetree-0.0.8 pynvml-11.5.0 pyyaml-6.0.1 regex-2023.10.3 safetensors-0.4.0 sentencepiece-0.1.99 termcolor-2.3.0 tokenizers-0.13.3 toposort-1.5 torch-1.13.1 tqdm-4.66.1 transformers-4.30.2 typing-extensions-4.7.1 zipp-3.15.0
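Before moving on, it can help to confirm that the interpreter falls inside the Python range stated above and that the packages can be found. The following is a minimal sketch using only the standard library; the helper names is_supported_python and is_installed are made up for this illustration and are not part of hanlp.

```python
import importlib.util
import sys


def is_supported_python(version=None):
    """Return True if (major, minor) lies in the 3.6 to 3.10 range quoted above."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 6) <= (major, minor) <= (3, 10)


def is_installed(package):
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("Python version supported:", is_supported_python())
    for name in ("hanlp", "torch", "transformers"):
        print(name, "installed:", is_installed(name))
```

Locating a package with find_spec avoids importing torch just to check that it is present, which keeps the check fast.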
- STEP2
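The first run pre-downloads an archive of roughly 600 MB. Note that the paths in the output below point into the pyhanlp package (HanLP 1.8.4), so the hanlp command shown here appears to come from pyhanlp rather than from the hanlp 2.1 package installed in STEP1.
Pre-download:
hanlp
Output:
下载 http://download.hanlp.com/hanlp-1.8.4-release.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\hanlp-1.8.4-release.zip
100% 1.8 MiB 727.1 KiB/s ETA: 0 s [=============================================================]
下载 https://file.hankcs.com/hanlp/data-for-1.7.5.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\data-for-1.8.4.zip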
100% 637.7 MiB 89.3 KiB/s ETA: 0 s [=============================================================]
解压 data.zip...
usage: hanlp [-h] [-v] {segment,parse,serve,update} ...
HanLP: Han Language Processing v1.8.4
positional arguments:
{segment,parse,serve,update}
which task to perform?
segment word segmentation
parse dependency parsing
serve start http server
update update jar and data of HanLP
optional arguments:
-h, --help show this help message and exit
-v, --version show installed versions of HanLP
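2 A first hanlp demo
Syntax:
class hanlp_common.document.Document(*args, **kwargs)
A dict structure holding parsed annotations. A Document is a subclass of dict and supports every dict interface. In addition, it provides interfaces for handling various linguistic structures. Its str and dict representations are compatible with JSON serialization.
Parameters:
*args – An iterator of key-value pairs.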
**kwargs – Arguments from ** operator.
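The description above boils down to "a dict whose string form is JSON". A minimal sketch of that idea, assuming nothing about the real implementation: MiniDocument is a hypothetical stand-in written for this post, not the hanlp_common class.

```python
import json


class MiniDocument(dict):
    """Hypothetical stand-in for Document: a dict whose str() is JSON."""

    def __str__(self):
        # ensure_ascii=False keeps Chinese tokens readable instead of \uXXXX escapes.
        return json.dumps(self, ensure_ascii=False, indent=2)


doc = MiniDocument(tok=[["晓美焰", "来到", "北京"]], pos=[["NR", "VV", "NR"]])

# Every dict interface still works because MiniDocument is a dict.
print(list(doc.keys()))

# The string form parses back into an equal plain dict.
print(json.loads(str(doc)) == dict(doc))
```

Subclassing dict rather than wrapping one means existing code that expects a mapping keeps working unchanged.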
2.1 Example demo:
# Create a document
from hanlp_common.document import Document
doc = Document(
    tok=[["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
    pos=[["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]],
    ner=[[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
          ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
    dep=[[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
          [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]]
)
# print(doc) or str(doc) to get its JSON representation
print(doc)
print("----------annotation-----------")
# Access an annotation by its task name
print(doc['tok'])
print("----------count_sentences-----------")
# Get number of sentences
print(f'It has {doc.count_sentences()} sentence(s)')
print("----------n-th sentence-----------")
# Access the n-th sentence
print(doc.squeeze(0)['tok'])
# Pretty print it right in your console or notebook
print("----------pretty_print-----------")
doc.pretty_print()
# To save the pretty prints in a str
pretty_text: str = '\n\n'.join(doc.to_pretty())
print("----------squeeze-----------")
print(doc.squeeze(i=0))
print("----------to_conll()-----------")
print(doc.to_conll())
print("----------to_dict()-----------")
print(doc.to_dict())
print("----------to_json-----------")
print(doc.to_json())
print("----------to_pretty-----------")
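# Note: to_pretty is referenced without parentheses here, so this prints the
# bound method object rather than the pretty text; call doc.to_pretty() for the text
print(doc.to_pretty)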
print("----------translate-----------")
print(doc.translate('zh'))
Console output:
C:\Users\Lenovo\anaconda3\envs\MyTest\python.exe C:/Users/Lenovo/PycharmProjects/MyTest/1113/hanLP/HanLP.py
{
"tok": [
["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
],
"pos": [
["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
],
"ner": [
[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"dep": [
[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
]
}
----------annotation-----------
[['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']]
----------count_sentences-----------
It has 1 sentence(s)
----------n-th sentence-----------
['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']
----------pretty_print-----------
Dep Tree Tok Relation Po Tok NER Type
─────────── ─── ──────── ── ─── ────────────────
┌─► 晓美焰 nsubj NR 晓美焰 ───►PERSON
┌────┬──┴── 来到 root VV 来到
│ │ ┌─► 北京 name NR 北京 ◄─┐
│ └─►└── 立方庭 dobj NR 立方庭 ◄─┴►LOCATION
└─►┌─────── 参观 conj VV 参观
│ ┌───► 自然 compound NN 自然 ◄─┐
│ │┌──► 语义 compound NN 语义 │
│ ││┌─► 科技 compound NN 科技 ├►ORGANIZATION
└─►└┴┴── 公司 dobj NN 公司 ◄─┘
----------squeeze-----------
{
"tok": [
"晓美焰",
"来到",
"北京",
"立方庭",
"参观",
"自然",
"语义",
"科技",
"公司"
],
"pos": [
"NR",
"VV",
"NR",
"NR",
"VV",
"NN",
"NN",
"NN",
"NN"
],
"ner": [
["晓美焰", "PERSON", 0, 1],
["北京立方庭", "LOCATION", 2, 4],
["自然语义科技公司", "ORGANIZATION", 5, 9]
],
"dep": [
[2, "nsubj"],
[0, "root"],
[4, "name"],
[2, "dobj"],
[2, "conj"],
[9, "compound"],
[9, "compound"],
[9, "compound"],
[5, "dobj"]
]
}
----------to_conll()-----------
1 晓美焰 _ NR _ _ 2 nsubj _ _
2 来到 _ VV _ _ 0 root _ _
3 北京 _ NR _ _ 4 name _ _
4 立方庭 _ NR _ _ 2 dobj _ _
5 参观 _ VV _ _ 2 conj _ _
6 自然 _ NN _ _ 9 compound _ _
7 语义 _ NN _ _ 9 compound _ _
8 科技 _ NN _ _ 9 compound _ _
9 公司 _ NN _ _ 5 dobj _ _
----------to_dict()-----------
{'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}
----------to_json-----------
{
"tok": [
["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
],
"pos": [
["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
],
"ner": [
[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"dep": [
[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
]
}
----------to_pretty-----------
<bound method Document.to_pretty of {'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}>
----------translate-----------
{
"tok": [
["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
],
"pos": [
["专有名词", "其他动词", "专有名词", "专有名词", "其他动词", "其他名词", "其他名词", "其他名词", "其他名词"]
],
"ner": [
[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
],
"dep": [
[[2, "名词性主语"], [0, "核心关系"], [4, "name"], [2, "直接宾语"], [2, "连接性状语"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "直接宾语"]]
]
}
Process finished with exit code 0
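Reading the output, the last two numbers of each ner entry look like a half-open token span, and each dep entry looks like a 1-based head index plus a relation, with 0 marking the root. That reading is inferred from the data above rather than quoted from the documentation; the sketch below checks it against the demo sentence in plain Python. The helper names entity_text and head_word are made up for this illustration.

```python
tok = ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
ner = [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
       ["自然语义科技公司", "ORGANIZATION", 5, 9]]
dep = [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
       [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]


def entity_text(tokens, begin, end):
    """Join the tokens covered by a half-open [begin, end) token span."""
    return "".join(tokens[begin:end])


def head_word(tokens, head):
    """Resolve a 1-based head index; 0 denotes the artificial root."""
    return "ROOT" if head == 0 else tokens[head - 1]


# Each entity string should equal the concatenation of its token span.
for text, label, begin, end in ner:
    print(label, text, entity_text(tok, begin, end) == text)

# Each token points at its head word through the dependency relation.
for word, (head, relation) in zip(tok, dep):
    print(f"{word} --{relation}--> {head_word(tok, head)}")
```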
3 Explanation of the methods used in the demo
3.1 Counting sentences
count_sentences() → int
Count number of sentences in this document.
- Returns:
Number of sentences.
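One way to picture the count is to look at the shape of the tok field: a flat list of strings is one sentence, a list of lists is several. The sketch below works on a plain dict and assumes the field is named tok; it illustrates the idea and is not the library's source.

```python
def count_sentences(doc):
    """Count sentences in a dict of annotations keyed by task name."""
    tok = doc["tok"]
    # A flat list of strings is a single sentence; a list of lists is many.
    if tok and isinstance(tok[0], str):
        return 1
    return len(tok)


nested = {"tok": [["晓美焰", "来到", "北京"], ["参观", "公司"]]}
flat = {"tok": ["晓美焰", "来到", "北京"]}
print(count_sentences(nested))
print(count_sentences(flat))
```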
3.2 Getting a value by key prefix
get_by_prefix(prefix: str)
Get value by the prefix of a key.
- Parameters:
prefix – The prefix of a key. If multiple keys are matched, only the first one will be used.
- Returns:
The value assigned with the matched key.
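The "first match wins" rule can be sketched on a plain dict as follows. Returning None when nothing matches is an assumption of this sketch, and the tok/coarse value is invented for the example.

```python
def get_by_prefix(doc, prefix):
    """Return the value of the first key that starts with prefix, else None."""
    if not prefix:
        return None
    for key, value in doc.items():
        if key.startswith(prefix):
            return value
    return None


doc = {
    "tok/fine": [["晓美焰", "来到"]],
    "tok/coarse": [["晓美焰来到"]],
    "pos/ctb": [["NR", "VV"]],
}
print(get_by_prefix(doc, "tok"))  # two keys match; the first inserted one is used
print(get_by_prefix(doc, "ner"))  # no key matches
```

Because dicts preserve insertion order, "first" here means the key that was added earliest.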
3.3 Pretty-printing linguistic structures
pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)
Print a pretty text representation which visualizes linguistic structures.
- Parameters:
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to print a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
3.4 Squeezing a dimension
squeeze(i=0)
Squeeze the dimension of each field into one. It’s intended to convert a nested document like [[sent_i]] to [sent_i]. When there are multiple sentences, only the i-th one will be returned. Note this is not an inplace operation.
- Parameters:
i – Keep the element at index for all lists.
- Returns:
A squeezed document with only one sentence.
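The effect of squeezing can be sketched on a plain dict: keep the i-th element of every list field and return a new mapping, leaving the input alone. This is an illustration of the behaviour described above, not the library code.

```python
def squeeze(doc, i=0):
    """Return a new dict keeping only the i-th element of every list field."""
    return {key: value[i] if isinstance(value, list) else value
            for key, value in doc.items()}


doc = {
    "tok": [["晓美焰", "来到"], ["参观", "公司"]],
    "pos": [["NR", "VV"], ["VV", "NN"]],
}
second = squeeze(doc, i=1)
print(second)
print(len(doc["tok"]))  # the original still holds both sentences
```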
3.5 Converting to CoNLL format
to_conll(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]]
Convert to CoNLLSentence.
- Parameters:
tok (str) – Field name for tok.
lem (str) – Field name for lem.
pos (str) – Field name for upos.
dep (str) – Field name for dependency parsing.
sdp (str) – Field name for semantic dependency parsing.
- Returns:
A CoNLLSentence representation.
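The ten-column rows shown in the demo output can be reproduced by hand, which makes the layout easier to read. The sketch below assumes CoNLL-X column order and a tab separator (the output above only shows whitespace), and the function name to_conll_lines is made up for this illustration.

```python
def to_conll_lines(tok, pos, dep):
    """Render one sentence as tab-separated, 10-column CoNLL-style rows."""
    lines = []
    for index, (form, tag, (head, relation)) in enumerate(zip(tok, pos, dep), start=1):
        # Columns: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL PHEAD PDEPREL
        columns = [index, form, "_", tag, "_", "_", head, relation, "_", "_"]
        lines.append("\t".join(str(column) for column in columns))
    return lines


tok = ["晓美焰", "来到", "北京"]
pos = ["NR", "VV", "NR"]
dep = [[2, "nsubj"], [0, "root"], [2, "dobj"]]
print("\n".join(to_conll_lines(tok, pos, dep)))
```

Fields with no annotation are written as an underscore, matching the placeholder used in the output above.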
3.6 Converting to a JSON-compatible dict
to_dict()
Convert to a json compatible dict.
- Returns:
A dict representation.
3.7 Converting the document to a JSON string
to_json(ensure_ascii=False, indent=2) → str
Convert to json string.
- Parameters:
ensure_ascii – False to allow for non-ascii text.
indent – Indent per nested structure.
- Returns:
A text representation in str.
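The two parameters have the same names as the ones accepted by json.dumps in the standard library, so their effect can be tried out without hanlp. A small sketch on a plain dict, showing why ensure_ascii=False matters for Chinese text:

```python
import json

doc = {"tok": [["晓美焰", "来到"]]}

readable = json.dumps(doc, ensure_ascii=False, indent=2)
escaped = json.dumps(doc, ensure_ascii=True, indent=2)

print(readable)  # Chinese characters are written as-is
print(escaped)   # Chinese characters become \uXXXX escapes

# Both strings decode to the same data; only the encoding of non-ASCII text differs.
print(json.loads(readable) == json.loads(escaped))
```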
3.8 Getting a printable pretty text representation
to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]]
Convert to a pretty text representation which can be printed to visualize linguistic structures.
- Parameters:
tok – Token key.
lem – Lemma key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
con – Constituency parsing key.
show_header – True to include a header which indicates each field with its name.
html – True to output HTML format so that non-ASCII characters can align correctly.
- Returns:
A pretty string.
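Since the signature allows either a single string or a list of strings, code that stores the result is easier to write with a small normalising step. A minimal sketch; pretty_to_text is a hypothetical helper and the blank-line separator is a choice made here, not something the documentation prescribes.

```python
from typing import List, Union


def pretty_to_text(pretty: Union[str, List[str]]) -> str:
    """Collapse the str-or-list result into one printable string."""
    if isinstance(pretty, str):
        return pretty
    # One block per sentence, separated by a blank line.
    return "\n\n".join(pretty)


print(pretty_to_text("single sentence block"))
print(pretty_to_text(["first sentence block", "second sentence block"]))
```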
3.9 Translating tags
translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')
Translate tags for each annotation. This is an inplace operation.
- Parameters:
lang – Target language to be translated to.
tok – Token key.
pos – Part-of-speech key.
dep – Dependency parse tree key.
sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.
ner – Named entity key.
srl – Semantic role labeling key.
As of November 16, 2023, the languages listed as supported under hanlp.utils.lang are:
Simplified Chinese (zh)
Traditional Chinese (zh-tw)
English (en)
Japanese (ja)
Korean (ko)
French (fr)
German (de)
Spanish (es)
Russian (ru)
For each of these languages, hanlp is described as providing common models for part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling. Only zh is exercised in the demo above.
- Returns:
The translated document.
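In the demo output, translate('zh') replaced tags such as NR and nsubj with Chinese names while leaving name and compound untouched, and it changed the document in place. That behaviour can be sketched with plain lookup tables. The tables below are hypothetical and contain only the pairs visible in the demo output; the real library keeps its own mappings.

```python
# Hypothetical lookup tables, filled only with pairs visible in the demo output.
POS_ZH = {"NR": "专有名词", "VV": "其他动词", "NN": "其他名词"}
DEP_ZH = {"nsubj": "名词性主语", "root": "核心关系", "dobj": "直接宾语", "conj": "连接性状语"}


def translate_tags(doc, pos_table, dep_table):
    """Replace tags in place; tags missing from a table are left unchanged."""
    doc["pos"] = [[pos_table.get(tag, tag) for tag in sent] for sent in doc["pos"]]
    doc["dep"] = [[[head, dep_table.get(rel, rel)] for head, rel in sent]
                  for sent in doc["dep"]]
    return doc


doc = {
    "tok": [["晓美焰", "来到", "北京", "立方庭"]],
    "pos": [["NR", "VV", "NR", "NR"]],
    "dep": [[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"]]],
}
result = translate_tags(doc, POS_ZH, DEP_ZH)
print(result["pos"])
print(result["dep"])
```

Falling back to the original tag when no translation exists mirrors what the demo shows for name and compound.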