An Optimization Pass on Text Segmentation and Stop-Word Removal

2019-12-18 16:49:26

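While processing a QA corpus earlier, word segmentation and stop-word removal took a long time, so I looked up some material and optimized these two steps. The results are summarized below.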

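Text segmentation

Using jieba's built-in parallel segmentation

Calling jieba.enable_parallel(4) before segmenting is all it takes. I did not do this here, though, mainly because I was worried that the order of the segmented output might get mixed up. (jieba's README describes this mode as splitting the text by line, segmenting the lines in separate Python processes, and then merging the results; it is not supported on Windows.)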

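Using jieba_fast

This is a CPython extension library that is used the same way as jieba; see its GitHub page. The project page describes it as follows:

It uses CPython to rewrite the functions in the jieba segmentation library that compute the DAG and the Viterbi function in the HMM, which improves speed substantially.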
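Because jieba_fast keeps jieba's interface, its README suggests importing it under the name jieba so that existing code keeps working. The sketch below takes that one step further and falls back to plain jieba when jieba_fast is not installed. The helper name `load_tokenizer` and the sample sentence are illustrative assumptions, not part of the original code.

```python
import importlib


def load_tokenizer(candidates=("jieba_fast", "jieba")):
    """Return the first importable module in `candidates`, or None if none is installed.

    Hypothetical helper: it only tries the module names in order.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Bind whichever library is available to the name the rest of the code uses.
jieba = load_tokenizer()
if jieba is not None:
    # cut() yields tokens lazily, so wrap it in list() to print them.
    print(jieba.__name__, list(jieba.cut("我来到北京清华大学")))
```

With this arrangement the rest of the pipeline calls `jieba.cut(...)` as before, and the faster backend is picked up whenever it is available.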

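Stop-word removal

Speeding up with a dictionary

At first I read the stop words into a list and then checked each word against that list, which was slow. The original code: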

def get_stopwords(self, stopwords_path):
    # Read one stop word per line into a list.
    # Every later "word in stop_words" check scans this list from the start.
    stop_words = []
    with open(stopwords_path, "r", encoding="utf-8") as stop_f:
        for line in stop_f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            stop_words.append(line)
    # print('HIT stop-word list length: ' + str(len(stop_words)))
    return stop_words

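After the change I built a stop-word dictionary instead, which roughly doubled the speed. Checking membership in a list compares the word against the entries one by one, while a dictionary finds it through a hash lookup whose cost does not grow with the number of stop words. The code: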

def get_stopwords(self, stopwords_path):
    # Read one stop word per line into a dict keyed by the word,
    # so that later "word in stop_words" checks are hash lookups.
    stop_words = {}
    with open(stopwords_path, "r", encoding="utf-8") as stop_f:
        for line in stop_f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            stop_words[line] = line
    # print('HIT stop-word list length: ' + str(len(stop_words)))
    return stop_words
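Since only the keys of the dictionary are ever used, a set would give the same hash-based lookup without storing a value for each word. The following is a minimal sketch of that variant together with the filtering step and a rough timing comparison. The function names `load_stopwords` and `remove_stopwords` and the synthetic word lists are assumptions made for illustration; they are not the author's code or data, and the timings will differ from one machine to another.

```python
import timeit


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stop_words):
    """Keep the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]


# Synthetic data, used only to compare lookup cost (not a real stop-word list).
words = [f"w{i}" for i in range(2000)]
as_list, as_set = words, set(words)
tokens = [f"w{i}" for i in range(0, 4000, 4)]  # half of these are "stop words"

t_list = timeit.timeit(lambda: remove_stopwords(tokens, as_list), number=5)
t_set = timeit.timeit(lambda: remove_stopwords(tokens, as_set), number=5)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Both containers produce the same filtered output; only the cost of each `in` check changes, and the gap widens as the stop-word list grows.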

Summary

With the changes above, the code ran about 4 times faster, which is a clear improvement.
