Installing and Using HanLP in a Python Environment

2024-07-25 15:30:29

Compiled from the latest official documentation on GitHub.

1 Installing with pip in the Terminal

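HanLP's pip package relies on deep learning frameworks such as PyTorch and TensorFlow. It is aimed at professional NLP engineers, researchers, and scenarios that process massive amounts of data locally. It requires Python 3.6 to 3.10; Windows is supported, but *nix is recommended. It runs on a CPU, although a GPU/TPU is recommended. To install the PyTorch version: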

Turn off any proxy node while installing.

  • STEP1
pip install hanlp

Output:

(MyTest) C:\Users\Lenovo\PycharmProjects\MyTest>pip install hanlp
Collecting hanlp
  Downloading hanlp-2.1.0b52-py3-none-any.whl (651 kB)
     ---------------------------------------- 651.5/651.5 kB 1.2 MB/s eta 0:00:00
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ---------------------------------------- 53.1/53.1 kB ? eta 0:00:00
Collecting transformers>=4.1.1
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ---------------------------------------- 7.2/7.2 MB 5.5 MB/s eta 0:00:00
Collecting hanlp-trie>=0.0.4
  Downloading hanlp_trie-0.0.5.tar.gz (6.7 kB)
  Preparing metadata (setup.py) ... done
Collecting toposort==1.5
  Downloading toposort-1.5-py2.py3-none-any.whl (7.6 kB)
Collecting hanlp-common>=0.0.19
  Downloading hanlp_common-0.0.19.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Requirement already satisfied: hanlp-downloader in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from hanlp) (0.0.25)
Collecting torch>=1.6.0
  Downloading torch-1.13.1-cp37-cp37m-win_amd64.whl (162.6 MB)
     ---------------------------------------- 162.6/162.6 MB 6.3 MB/s eta 0:00:00
Collecting sentencepiece>=0.1.91
  Downloading sentencepiece-0.1.99-cp37-cp37m-win_amd64.whl (977 kB)
     ---------------------------------------- 977.7/977.7 kB 10.3 MB/s eta 0:00:00
Collecting phrasetree
  Downloading phrasetree-0.0.8.tar.gz (42 kB)
     ---------------------------------------- 42.2/42.2 kB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting typing-extensions
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting regex!=2019.12.17
  Downloading regex-2023.10.3-cp37-cp37m-win_amd64.whl (269 kB)
     ---------------------------------------- 269.9/269.9 kB 17.3 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.12.2-py3-none-any.whl (10 kB)
Collecting importlib-metadata
  Downloading importlib_metadata-6.7.0-py3-none-any.whl (22 kB)
Collecting huggingface-hub<1.0,>=0.14.1
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ---------------------------------------- 268.8/268.8 kB 8.3 MB/s eta 0:00:00
Requirement already satisfied: requests in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from transformers>=4.1.1->hanlp) (2.31.0)
Collecting tqdm>=4.27
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
     ---------------------------------------- 78.3/78.3 kB ? eta 0:00:00
Collecting numpy>=1.17
  Downloading numpy-1.21.6-cp37-cp37m-win_amd64.whl (14.0 MB)
     ---------------------------------------- 14.0/14.0 MB 11.7 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp37-cp37m-win_amd64.whl (3.5 MB)
     ---------------------------------------- 3.5/3.5 MB 12.3 MB/s eta 0:00:00
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.0-cp37-none-win_amd64.whl (277 kB)
     ---------------------------------------- 277.3/277.3 kB 17.8 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0.1-cp37-cp37m-win_amd64.whl (153 kB)
     ---------------------------------------- 153.2/153.2 kB 9.5 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
     ---------------------------------------- 53.0/53.0 kB ? eta 0:00:00
Collecting fsspec
  Downloading fsspec-2023.1.0-py3-none-any.whl (143 kB)
     ---------------------------------------- 143.0/143.0 kB ? eta 0:00:00
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting zipp>=0.5
  Downloading zipp-3.15.0-py3-none-any.whl (6.8 kB)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (2.0.7)
Requirement already satisfied: idna<4,>=2.5 in c:\users\lenovo\anaconda3\envs\mytest\lib\site-packages (from requests->transformers>=4.1.1->hanlp) (3.4)

Building wheels for collected packages: hanlp-common, hanlp-trie, phrasetree
  Building wheel for hanlp-common (setup.py) ... done
  Created wheel for hanlp-common: filename=hanlp_common-0.0.19-py3-none-any.whl size=30650 sha256=d3135f8a0e8bde4ff02320c6c84f1d809a9357f9ae2524a5bd99d4a096d2db2e
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\f2\70\bf\57226335746d58210d202e3a64428b8e3b4d57ca373f26d77b
  Building wheel for hanlp-trie (setup.py) ... done
  Created wheel for hanlp-trie: filename=hanlp_trie-0.0.5-py3-none-any.whl size=6831 sha256=87b214b03fe0473f53b8b12a34ed2a2bee54a152b96c9531ad22d6044b6eb790
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\69\ce\b1\c15e96cb4d3b170d002be4b8fd14c2185d32111080f352b3e6
  Building wheel for phrasetree (setup.py) ... done
  Created wheel for phrasetree: filename=phrasetree-0.0.8-py3-none-any.whl size=44234 sha256=e86b74c1ad7ebacc6dceeace9ea3b9452a2d4367ebf2080b253bb4bfef8fb53d
  Stored in directory: c:\users\lenovo\appdata\local\pip\cache\wheels\c2\81\3f\3ed1a1f06d94d021590de96e6953e44854599db1cd90d66846
Successfully built hanlp-common hanlp-trie phrasetree
Installing collected packages: toposort, tokenizers, sentencepiece, phrasetree, zipp, typing-extensions, termcolor, safetensors, regex, pyyaml, pynvml, packaging, numpy, hanlp-common, fsspec, filelock, colorama, tqdm, torch, importlib-metadata, hanlp-trie, huggingface-hub, transformers, hanlp
Successfully installed colorama-0.4.6 filelock-3.12.2 fsspec-2023.1.0 hanlp-2.1.0b52 hanlp-common-0.0.19 hanlp-trie-0.0.5 huggingface-hub-0.16.4 importlib-metadata-6.7.0 numpy-1.21.6 packaging-23.2 phrasetree-0.0.8 pynvml-11.5.0 pyyaml-6.0.1 regex-2023.10.3 safetensors-0.4.0 sentencepiece-0.1.99 termcolor-2.3.0 tokenizers-0.13.3 toposort-1.5 torch-1.13.1 tqdm-4.66.1 transformers-4.30.2 typing-extensions-4.7.1 zipp-3.15.0
  • STEP2

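The first run requires downloading a compressed data package of about 600 MB. Judging by the paths and the version banner in the output below, this command and its download belong to the pyhanlp package (HanLP 1.x) rather than to the hanlp 2.x package installed in STEP1.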

Pre-download:

hanlp

Output:

下载 http://download.hanlp.com/hanlp-1.8.4-release.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\hanlp-1.8.4-release.zip

100%   1.8 MiB 727.1 KiB/s ETA:  0 s [=============================================================]
下载 https://file.hankcs.com/hanlp/data-for-1.7.5.zip 到 C:\Users\Lenovo\anaconda3\envs\MyTest\lib\site-packages\pyhanlp\static\data-for-1.8.4.zip
100% 637.7 MiB  89.3 KiB/s ETA:  0 s [=============================================================]
解压 data.zip...
usage: hanlp [-h] [-v] {segment,parse,serve,update} ...

HanLP: Han Language Processing v1.8.4

positional arguments:
  {segment,parse,serve,update}
                        which task to perform?
    segment             word segmentation
    parse               dependency parsing
    serve               start http server
    update              update jar and data of HanLP

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show installed versions of HanLP
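
Before moving on, it can help to confirm that the interpreter falls inside the Python 3.6 to 3.10 range quoted in section 1 and that the packages can be found. The following is a minimal sketch that uses only the standard library. The helper names `python_supported` and `is_installed` are hypothetical and are not part of HanLP.

```python
import sys
from importlib import util


def python_supported(version=None, low=(3, 6), high=(3, 10)):
    """Return True if (major, minor) lies in the range quoted in this article."""
    major_minor = tuple((version or sys.version_info)[:2])
    return low <= major_minor <= high


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    # Module names taken from the install log and download paths above
    for name in ("hanlp", "hanlp_common", "pyhanlp"):
        print(name, "installed:", is_installed(name))
```

Using `find_spec` rather than a plain `import` avoids loading heavy dependencies such as torch just to check that a package is present.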

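2 A First HanLP Demo

Syntax:

class hanlp_common.document.Document(*args, **kwargs)

A dict structure that holds parsed annotations. A Document is a subclass of dict and supports every interface of dict. In addition, it provides interfaces for working with various linguistic structures. Its str and dict representations are compatible with JSON serialization.

Parameters: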

*args – An iterator of key-value pairs.

**kwargs – Arguments from ** operator.

2.1 Example demo:

# Create a document
from hanlp_common.document import Document

doc = Document(
    tok=[["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]],
    pos=[["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]],
    ner=[[["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4],
          ["自然语义科技公司", "ORGANIZATION", 5, 9]]],
    dep=[[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
          [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]]
)

# print(doc) or str(doc) to get its JSON representation
print(doc)

print("----------annotation-----------")
# Access an annotation by its task name
print(doc['tok'])

print("----------count_sentences-----------")
# Get number of sentences
print(f'It has {doc.count_sentences()} sentence(s)')

print("----------n-th sentence-----------")
# Access the n-th sentence
print(doc.squeeze(0)['tok'])

# Pretty print it right in your console or notebook
print("----------pretty_print-----------")
doc.pretty_print()

# To save the pretty prints in a str
pretty_text: str = '\n\n'.join(doc.to_pretty())

print("----------squeeze-----------")
print(doc.squeeze(i=0))

print("----------to_conll()-----------")
print(doc.to_conll())

print("----------to_dict()-----------")
print(doc.to_dict())

print("----------to_json-----------")
print(doc.to_json())

print("----------to_pretty-----------")
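# Note: this prints the bound method object because the call parentheses are missing;
# print(doc.to_pretty()) would print the pretty text instead
print(doc.to_pretty)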

print("----------translate-----------")
print(doc.translate('zh'))

Console output:

C:\Users\Lenovo\anaconda3\envs\MyTest\python.exe C:/Users/Lenovo/PycharmProjects/MyTest/1113/hanLP/HanLP.py
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
  ]
}
----------annotation-----------
[['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']]
----------count_sentences-----------
It has 1 sentence(s)
----------n-th sentence-----------
['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']
----------pretty_print-----------
Dep Tree   	Tok	Relation	Po	Tok	NER Type        
───────────	───	────────	──	───	────────────────
        ┌─►	晓美焰	nsubj   	NR	晓美焰	───►PERSON      
┌────┬──┴──	来到 	root    	VV	来到 	                
│    │  ┌─►	北京 	name    	NR	北京 	◄─┐             
│    └─►└──	立方庭	dobj    	NR	立方庭	◄─┴►LOCATION    
└─►┌───────	参观 	conj    	VV	参观 	                
   │  ┌───►	自然 	compound	NN	自然 	◄─┐             
   │  │┌──►	语义 	compound	NN	语义 	  │             
   │  ││┌─►	科技 	compound	NN	科技 	  ├►ORGANIZATION
   └─►└┴┴──	公司 	dobj    	NN	公司 	◄─┘             
----------squeeze-----------
{
  "tok": [
    "晓美焰",
    "来到",
    "北京",
    "立方庭",
    "参观",
    "自然",
    "语义",
    "科技",
    "公司"
  ],
  "pos": [
    "NR",
    "VV",
    "NR",
    "NR",
    "VV",
    "NN",
    "NN",
    "NN",
    "NN"
  ],
  "ner": [
    ["晓美焰", "PERSON", 0, 1],
    ["北京立方庭", "LOCATION", 2, 4],
    ["自然语义科技公司", "ORGANIZATION", 5, 9]
  ],
  "dep": [
    [2, "nsubj"],
    [0, "root"],
    [4, "name"],
    [2, "dobj"],
    [2, "conj"],
    [9, "compound"],
    [9, "compound"],
    [9, "compound"],
    [5, "dobj"]
  ]
}
----------to_conll()-----------
1	晓美焰	_	NR	_	_	2	nsubj	_	_
2	来到	_	VV	_	_	0	root	_	_
3	北京	_	NR	_	_	4	name	_	_
4	立方庭	_	NR	_	_	2	dobj	_	_
5	参观	_	VV	_	_	2	conj	_	_
6	自然	_	NN	_	_	9	compound	_	_
7	语义	_	NN	_	_	9	compound	_	_
8	科技	_	NN	_	_	9	compound	_	_
9	公司	_	NN	_	_	5	dobj	_	_
----------to_dict()-----------
{'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}
----------to_json-----------
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]
  ]
}
----------to_pretty-----------
<bound method Document.to_pretty of {'tok': [['晓美焰', '来到', '北京', '立方庭', '参观', '自然', '语义', '科技', '公司']], 'pos': [['NR', 'VV', 'NR', 'NR', 'VV', 'NN', 'NN', 'NN', 'NN']], 'ner': [[['晓美焰', 'PERSON', 0, 1], ['北京立方庭', 'LOCATION', 2, 4], ['自然语义科技公司', 'ORGANIZATION', 5, 9]]], 'dep': [[[2, 'nsubj'], [0, 'root'], [4, 'name'], [2, 'dobj'], [2, 'conj'], [9, 'compound'], [9, 'compound'], [9, 'compound'], [5, 'dobj']]]}>
----------translate-----------
{
  "tok": [
    ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
  ],
  "pos": [
    ["专有名词", "其他动词", "专有名词", "专有名词", "其他动词", "其他名词", "其他名词", "其他名词", "其他名词"]
  ],
  "ner": [
    [["晓美焰", "PERSON", 0, 1], ["北京立方庭", "LOCATION", 2, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "dep": [
    [[2, "名词性主语"], [0, "核心关系"], [4, "name"], [2, "直接宾语"], [2, "连接性状语"], [9, "compound"], [9, "compound"], [9, "compound"], [5, "直接宾语"]]
  ]
}

Process finished with exit code 0
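
The ner entries in the output above each carry a pair of token offsets after the label. Comparing them with tok suggests a half-open range: the entity text equals the tokens from the begin index up to, but not including, the end index. The sketch below checks this reading against the demo data using plain lists. The helper names `span_text` and `spans_consistent` are hypothetical and are not part of hanlp_common.

```python
# Tokens and entities copied from the demo output above (single sentence)
tok = ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
ner = [
    ["晓美焰", "PERSON", 0, 1],
    ["北京立方庭", "LOCATION", 2, 4],
    ["自然语义科技公司", "ORGANIZATION", 5, 9],
]


def span_text(tokens, begin, end):
    """Join the tokens in the half-open range [begin, end)."""
    return "".join(tokens[begin:end])


def spans_consistent(tokens, entities):
    """Return True if every entity text matches the tokens its offsets cover."""
    return all(span_text(tokens, b, e) == text for text, _label, b, e in entities)


print(spans_consistent(tok, ner))
```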

3 Demo Methods Explained

3.1 Counting sentences

count_sentences() → int

Count number of sentences in this document.

  • Returns:

Number of sentences.
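
To make the idea concrete, here is a plain-dict sketch of how a sentence count could be derived from the tok field. It assumes that a flat list of strings is one sentence and a list of lists holds one inner list per sentence. This is an illustration of the idea, not HanLP's actual implementation.

```python
def count_sentences(annotations, key="tok"):
    """Count sentences in a dict of annotations (illustrative sketch only)."""
    tokens = annotations[key]
    # A flat list of strings is treated as a single sentence
    if tokens and isinstance(tokens[0], str):
        return 1
    # Otherwise each inner list is one sentence
    return len(tokens)


nested = {"tok": [["晓美焰", "来到", "北京"], ["参观", "公司"]]}
flat = {"tok": ["晓美焰", "来到", "北京"]}
print(count_sentences(nested), count_sentences(flat))
```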

3.2 Getting a value by key prefix

get_by_prefix(prefix: str)

Get value by the prefix of a key.

  • Parameters:

prefix – The prefix of a key. If multiple keys are matched, only the first one will be used.

  • Returns:

The value assigned with the matched key.
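
The behaviour described above, where only the first matching key is used, can be sketched on a plain dict as follows. The key names such as `tok/fine` are illustrative assumptions, and this function is a stand-in rather than the library's own code.

```python
def get_by_prefix(annotations, prefix):
    """Return the value of the first key that starts with prefix, else None."""
    for key, value in annotations.items():
        if key.startswith(prefix):
            return value
    return None


# Hypothetical keys, chosen only to show two keys sharing one prefix
doc = {
    "tok/fine": [["晓美焰", "来到", "北京"]],
    "tok/coarse": [["晓美焰", "来到北京"]],
    "pos": [["NR", "VV", "NR"]],
}
print(get_by_prefix(doc, "tok"))
```

Because dicts keep insertion order in Python 3.7 and later, "first" here means the key that was inserted first.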

3.3 Pretty-printing linguistic structures

pretty_print(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False)

Print a pretty text representation which visualizes linguistic structures.

  • Parameters:

tok – Token key.

lem – Lemma key.

pos – Part-of-speech key.

dep – Dependency parse tree key.

sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

ner – Named entity key.

srl – Semantic role labeling key.

con – Constituency parsing key.

show_header – True to print a header which indicates each field with its name.

html – True to output HTML format so that non-ASCII characters can align correctly.
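
The html option exists because non-ASCII text is hard to align in plain text. One common reason is that CJK characters usually occupy two columns in a monospaced terminal, so padding by character count falls short. The sketch below measures display width with the standard library. The helpers `display_width` and `pad_to_width` are hypothetical and do not describe how HanLP pads its columns.

```python
import unicodedata


def display_width(text):
    """Approximate terminal width: wide and fullwidth characters count as 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_to_width(text, width):
    """Right-pad text with spaces until it reaches the given display width."""
    return text + " " * max(0, width - display_width(text))


for word in ("晓美焰", "来到", "root"):
    print(repr(pad_to_width(word, 8)), len(word), display_width(word))
```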

3.4 Squeezing dimensions

squeeze(i=0)

Squeeze the dimension of each field into one. It’s intended to convert a nested document like [[sent_i]] to [sent_i]. When there are multiple sentences, only the i-th one will be returned. Note this is not an inplace operation.

  • Parameters:

i – Keep the element at index i for all lists.

  • Returns:

A squeezed document with only one sentence.
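
As a rough illustration of the [[sent_i]] to [sent_i] conversion, the following sketch does the same on a plain dict and returns a new dict instead of modifying the input. It assumes every field is a list with one entry per sentence, and it is not the library's implementation.

```python
def squeeze(annotations, i=0):
    """Return a new dict that keeps only the i-th sentence of every field."""
    return {key: value[i] for key, value in annotations.items()}


doc = {
    "tok": [["晓美焰", "来到", "北京"], ["参观", "公司"]],
    "pos": [["NR", "VV", "NR"], ["VV", "NN"]],
}
second = squeeze(doc, i=1)
print(second)
print(doc["tok"])  # the original is left untouched
```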

3.5 Converting to CoNLL format

to_conll(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp') → Union[hanlp_common.conll.CoNLLSentence, List[hanlp_common.conll.CoNLLSentence]]

Convert to CoNLLSentence.

  • Parameters:

tok (str) – Field name for tok.

lem (str) – Field name for lem.

pos (str) – Field name for upos.

dep (str) – Field name for dependency parsing.

sdp (str) – Field name for semantic dependency parsing.

  • Returns:

A CoNLLSentence representation.
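
The to_conll() output in the console listing of section 2.1 has ten tab-separated columns per token, with the token index starting at 1 and a head of 0 marking the root. The sketch below rebuilds those rows from the demo's tok, pos, and dep lists. The function `to_conll_lines` is a hypothetical helper that mimics the printed layout. It does not use hanlp_common.conll.

```python
# Single-sentence annotations copied from the demo
tok = ["晓美焰", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司"]
pos = ["NR", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN"]
dep = [[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"], [2, "conj"],
       [9, "compound"], [9, "compound"], [9, "compound"], [5, "dobj"]]


def to_conll_lines(tokens, tags, arcs):
    """Build ten-column rows; unknown fields are filled with underscores."""
    lines = []
    for idx, (form, tag, (head, rel)) in enumerate(zip(tokens, tags, arcs), start=1):
        # Column order follows the demo output: id, form, lemma, tag, ...
        columns = [str(idx), form, "_", tag, "_", "_", str(head), rel, "_", "_"]
        lines.append("\t".join(columns))
    return lines


print("\n".join(to_conll_lines(tok, pos, dep)))
```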

3.6 Converting to a JSON-compatible dict

to_dict()

Convert to a json compatible dict.

  • Returns:

A dict representation.

3.7 Converting the document to a JSON string

to_json(ensure_ascii=False, indent=2) → str

Convert to json string.

  • Parameters:

ensure_ascii – False to allow for non-ascii text.

indent – Indent per nested structure.

  • Returns:

A text representation in str.
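
The effect of the two parameters can be seen with the standard json module, which accepts arguments of the same names. This sketch serializes a plain dict rather than a Document, so it only illustrates what ensure_ascii and indent do.

```python
import json

doc = {"tok": [["晓美焰", "来到", "北京"]]}

# ensure_ascii=False keeps Chinese characters readable in the output
readable = json.dumps(doc, ensure_ascii=False, indent=2)
# ensure_ascii=True replaces them with \uXXXX escape sequences
escaped = json.dumps(doc, ensure_ascii=True, indent=2)

print(readable)
print(escaped)
```

Both strings decode back to the same data, so the choice only affects readability and file size.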

3.8 Getting a printable pretty text representation

to_pretty(tok='tok', lem='lem', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl', con='con', show_header=True, html=False) → Union[str, List[str]]

Convert to a pretty text representation which can be printed to visualize linguistic structures.

  • Parameters:

tok – Token key.

lem – Lemma key.

pos – Part-of-speech key.

dep – Dependency parse tree key.

sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

ner – Named entity key.

srl – Semantic role labeling key.

con – Constituency parsing key.

show_header – True to include a header which indicates each field with its name.

html – True to output HTML format so that non-ASCII characters can align correctly.

  • Returns:

A pretty string.
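
Since the signature declares the return type as Union[str, List[str]], calling code may want to normalize the result before writing it to a file or log. The helper `as_text` below is a hypothetical convenience, written without importing HanLP. It joins a list with blank lines, in the same way the demo joins the result of doc.to_pretty().

```python
def as_text(pretty, separator="\n\n"):
    """Return one string whether pretty is a str or a list of str."""
    if isinstance(pretty, str):
        return pretty
    return separator.join(pretty)


print(as_text("single sentence rendering"))
print(as_text(["first sentence rendering", "second sentence rendering"]))
```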

3.9 Translating tags

translate(lang, tok='tok', pos='pos', dep='dep', sdp='sdp', ner='ner', srl='srl')

Translate tags for each annotation. This is an inplace operation.

  • Parameters:

lang – Target language to be translated to.

tok – Token key.

pos – Part-of-speech key.

dep – Dependency parse tree key.

sdp – Semantic dependency tree/graph key. SDP visualization has not been implemented yet.

ner – Named entity key.

srl – Semantic role labeling key.

As of November 16, 2023, the languages listed as supported in hanlp.utils.lang are:

Simplified Chinese (zh)
Traditional Chinese (zh-tw)
English (en)
Japanese (ja)
Korean (ko)
French (fr)
German (de)
Spanish (es)
Russian (ru)

For each of these languages, hanlp is described as supporting the common part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labeling models. Only zh is exercised in the demo above.

  • Returns:

The translated document.
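
The demo output for translate('zh') shows that tags with a known translation are replaced and the rest, such as name and compound, are left unchanged. The sketch below reproduces that behaviour on a plain dict with lookup tables copied from the demo output. The function `translate_tags` and the two tables are illustrative assumptions, not HanLP's localization data.

```python
# Tag translations observed in the demo output above
POS_ZH = {"NR": "专有名词", "VV": "其他动词", "NN": "其他名词"}
DEP_ZH = {"nsubj": "名词性主语", "root": "核心关系",
          "dobj": "直接宾语", "conj": "连接性状语"}


def translate_tags(annotations, pos_table, dep_table):
    """Replace known tags in place; unknown tags are kept as they are."""
    annotations["pos"] = [
        [pos_table.get(tag, tag) for tag in sent] for sent in annotations["pos"]
    ]
    annotations["dep"] = [
        [[head, dep_table.get(rel, rel)] for head, rel in sent]
        for sent in annotations["dep"]
    ]
    return annotations


doc = {
    "pos": [["NR", "VV", "NR", "NR"]],
    "dep": [[[2, "nsubj"], [0, "root"], [4, "name"], [2, "dobj"]]],
}
result = translate_tags(doc, POS_ZH, DEP_ZH)
print(result)
```

Because the input dict itself is modified, keep a copy first if the original tags are still needed afterwards.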
