A while ago I took a brief look at simple Chinese word segmentation with tensorflow_text [1]. Combined with what I had been learning about Rasa, the idea came up to imitate Rasa's Jieba tokenizer and build a TensorFlow Text tokenizer.
Creating a Rasa tokenizer mainly involves the following steps:
1. Setup
2. Tokenizer
3. Registry File
4. Train and Test
5. Conclusion
Understanding the Jieba tokenizer code
To get started on a custom component, let's take the JiebaTokenizer [2] source code as a testbed and print the segmentation results inside its tokenize method:
...
def tokenize(self, message: Message, attribute: Text) -> List[Token]:
    import jieba

    text = message.get(attribute)
    # Materialize the generator so it can be printed and then reused below
    tokenized = list(jieba.tokenize(text))
    print('******')
    print(f"{[t for t in tokenized]}")
    print('******')
    tokens = [Token(word, start) for (word, start, end) in tokenized]
    return self._apply_token_pattern(tokens)
...
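As a quick sanity check of what jieba.tokenize yields, here is a minimal standalone sketch (run outside Rasa; the example sentence is the one used later in this post):

import jieba

# jieba.tokenize yields (word, start, end) tuples with *character* offsets
for word, start, end in jieba.tokenize("我想找地方吃饭"):
    print(word, start, end)
# expected output along the lines of: 我 0 1, 想 1 2, 找 2 3, 地方 3 5, 吃饭 5 7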
In the config, add the custom component:
language: zh
pipeline:
  - name: components.fanlyJiebaTokenizer.JiebaTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: '(?u)\b\w+\b'
  - name: KeywordIntentClassifier
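The pipeline above references the copied tokenizer by module path, so Rasa imports it from a components package inside the project. The actual file is not shown in the post; a minimal sketch of one possible layout (an assumed path components/fanlyJiebaTokenizer.py, subclassing the built-in tokenizer instead of copying the whole file) could look like this:

# components/fanlyJiebaTokenizer.py (assumed location, derived from the module path above)
from typing import List, Text

from rasa.nlu.tokenizers.jieba_tokenizer import JiebaTokenizer as RasaJiebaTokenizer
from rasa.nlu.tokenizers.tokenizer import Token
from rasa.shared.nlu.training_data.message import Message


class JiebaTokenizer(RasaJiebaTokenizer):
    """Built-in Jieba tokenizer, patched to log the raw segmentation output."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import jieba

        text = message.get(attribute)
        tokenized = list(jieba.tokenize(text))
        print(tokenized)
        return self._apply_token_pattern(
            [Token(word, start) for (word, start, end) in tokenized]
        )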
Train and test:
NLU model loaded. Type a message and press enter to parse it.
Next message:
我想找地方吃饭
******
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/cz/kq5sssg12jx887hj62hwczrr0000gn/T/jieba.cache
Loading model cost 0.729 seconds.
Prefix dict has been built successfully.
[('我', 0, 1), ('想', 1, 2), ('找', 2, 3), ('地方', 3, 5), ('吃饭', 5, 7)]
******
{
  "text": "我想找地方吃饭",
  "intent": {
    "name": "eat_search",
    "confidence": 1.0
  },
  "entities": []
}
Next message:
Building the TF-Text tokenizer
Note: Rasa currently only supports TensorFlow 2.3, while the latest TensorFlow Text requires TensorFlow 2.4. To make them compatible, we download the Rasa source code and bump the version constraints it declares for TensorFlow and the related dependencies, so that TensorFlow Text's Chinese segmentation can be used.
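After relaxing the pins and reinstalling, a quick check (a sketch; the exact resolved versions are an assumption) confirms the environment actually ended up with a TensorFlow / TensorFlow Text pair that ships HubModuleTokenizer:

import tensorflow as tf
import tensorflow_text as tftext

# TF Text releases track TensorFlow minor versions; 2.4 is the first with HubModuleTokenizer
print(tf.__version__)
print(tftext.__version__)
print(hasattr(tftext, "HubModuleTokenizer"))  # should be True once the 2.4 wheels are installed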
In the Rasa source tree, under the path:

/rasa/nlu/tokenizers

create the file tensorflow_text_tokenizer.py:
import glob
import logging
import os
import shutil
import typing
from typing import Any, Dict, List, Optional, Text

from rasa.nlu.components import Component
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.shared.nlu.training_data.message import Message

logger = logging.getLogger(__name__)

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


class TensorFlowTextTokenizer(Tokenizer):
    """This tokenizer is a wrapper for tensorflow_text (https://www.tensorflow.org/tutorials/tensorflow_text/intro)."""

    supported_language_list = ["zh"]

    defaults = {
        # TF Hub handle of the Chinese segmentation model
        "model_handle": "https://hub.tensorflow.google.cn/google/zh_segmentation/1",
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens
        "token_pattern": None,
    }

    def __init__(self, component_config: Dict[Text, Any] = None) -> None:
        """Construct a new tokenizer using the TensorFlow framework."""
        super().__init__(component_config)

    @classmethod
    def required_packages(cls) -> List[Text]:
        return ["tensorflow", "tensorflow_text"]

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        import tensorflow_text as tftext
        import tensorflow as tf

        # Set the URL of the segmentation model and build the tokenizer from TF Hub
        self.model_handle = self.component_config.get("model_handle")
        segmenter = tftext.HubModuleTokenizer(self.model_handle)

        text = message.get(attribute)
        print(text)

        # Segment the text, also collecting the start/end offset of each token
        tokens, starts, ends = segmenter.tokenize_with_offsets(text)
        tokens_list = tokens.numpy()
        starts_list = starts.numpy()

        print('******')
        print(f"{[t.decode('utf-8') for t in tokens_list]}")
        print(f"{[t for t in starts_list]}")
        print('******')

        # Decode the byte tokens to str so downstream components receive text
        tokensData = [
            Token(tokens_list[i].decode("utf-8"), starts_list[i])
            for i in range(len(tokens_list))
        ]
        return self._apply_token_pattern(tokensData)
This first pass imitates the Jieba tokenizer code and prints the log directly so we can inspect the segmentation output.
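Before wiring the component into a pipeline, the same segmentation call can be tried on its own (a minimal sketch using the model handle from the defaults above; the first call downloads the model from TF Hub):

import tensorflow_text as tftext

MODEL_HANDLE = "https://hub.tensorflow.google.cn/google/zh_segmentation/1"
segmenter = tftext.HubModuleTokenizer(MODEL_HANDLE)

# tokenize_with_offsets returns the tokens plus each token's start/end offsets
tokens, starts, ends = segmenter.tokenize_with_offsets("我想找地方吃饭")
print([t.decode("utf-8") for t in tokens.numpy()])  # e.g. ['我', '想', '找', '地方', '吃饭']
print(starts.numpy())                               # offsets into the UTF-8 string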
Register the new component in registry.py:
from rasa.nlu.tokenizers.tensorflow_text_tokenizer import TensorFlowTextTokenizer
...
component_classes = [
    # utils
    SpacyNLP,
    MitieNLP,
    HFTransformersNLP,
    # tokenizers
    MitieTokenizer,
    SpacyTokenizer,
    WhitespaceTokenizer,
    ConveRTTokenizer,
    JiebaTokenizer,
    TensorFlowTextTokenizer,
    ...
]
Testing
In the examples directory, we init a demo directly with the Rasa source checkout's execution environment:
poetry run rasa init
Add a group of test data to nlu.yml:
nlu:
- intent: eat_search
  examples: |
    - 我想找地方吃饭
    - 我想吃[火锅](food)了
    - 找个吃[拉面](food)的地方
    - 附近有什么好吃的地方吗?
With that, the data is ready for training. In config.yml, add the pipeline, which includes the TensorFlowTextTokenizer we just created:
language: zh
pipeline:
  - name: TensorFlowTextTokenizer
  - name: CRFEntityExtractor
  - name: CountVectorsFeaturizer
    OOV_token: oov
    token_pattern: '(?u)\b\w+\b'
  - name: KeywordIntentClassifier
Everything is in place, so let's train and see how the segmentation performs:
# train the NLU model
poetry run rasa train nlu
And check the test results:
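The interactive prompt from the earlier Jieba test ("NLU model loaded. Type a message and press enter to parse it.") comes from Rasa's NLU shell, so presumably the same command is used here to poke at the new tokenizer:

# interactive NLU testing against the freshly trained model
poetry run rasa shell nlu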
Summary
The next step is to polish the TensorFlow Text Tokenizer's segmentation, submit the code to Rasa, and see whether there is a chance to get involved in the Rasa open-source project.
Also note: the Starts returned by tensorflow_text are offsets into the UTF-8 encoded text (byte offsets), not character indices like Jieba's.
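A small illustration of that note in plain Python: the Jieba log earlier reported character positions (e.g. ('地方', 3, 5)), whereas a byte offset counts positions in the UTF-8 string, where each of these CJK characters takes 3 bytes.

text = "我想找地方吃饭"
token = "地方"

char_start = text.index(token)                        # 3: character index, as in the Jieba log
byte_start = len(text[:char_start].encode("utf-8"))   # 9: byte offset into the UTF-8 string
print(char_start, byte_start)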
References
[1] tensorflow_text 简单中文分词使用 https://www.yemeishu.com/2021/01/16/tf-text-1/
[2] JiebaTokenizer https://github.com/RasaHQ/rasa/blob/master/rasa/nlu/tokenizers/jieba_tokenizer.py