How to Generate a Custom Inverse Document Frequency (IDF) Text Corpus

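When jieba extracts keywords, it relies on an inverse document frequency (IDF) text corpus. Besides the corpus that ships with jieba, you can generate and use your own.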


import jieba
import jieba.analyse

topK = 5
file_name = 'test.txt'

with open(file_name, 'rb') as f:
    content = f.read()

# The IDF corpus used for keyword extraction can be switched to the path of a custom corpus
jieba.analyse.set_idf_path("../extra_dict/idf.txt.big")
tags = jieba.analyse.extract_tags(content, topK=topK)
print(", ".join(tags))

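TF-IDF

"Term Frequency" (TF)

TF is the total number of times a word appears in a document.

This measure is usually normalized as TF = (number of times the word appears in the document) / (total number of words in the document), which keeps the result from favoring long documents (the same word tends to have a higher raw count in a long document than in a short one).

"Inverse Document Frequency" (IDF)

The fewer documents contain a word, the larger its IDF value, which indicates that the word discriminates well between documents.
IDF = log_e((total number of documents in the corpus N + 1) / (number of documents containing the word N(x) + 1)); the + 1 keeps the denominator from being 0.

$$IDF(x) = \log\frac{N + 1}{N(x) + 1}$$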
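
To make the smoothed formula concrete, here is a minimal sketch that evaluates it with the natural logarithm. The function name `smoothed_idf` and the document counts are made up for illustration.

```python
import math

def smoothed_idf(n_docs, n_docs_with_word):
    """IDF(x) = ln((N + 1) / (N(x) + 1)); the + 1 keeps the denominator non-zero."""
    return math.log((n_docs + 1) / (n_docs_with_word + 1))

# Hypothetical corpus of 1000 documents
rare = smoothed_idf(1000, 9)      # word found in 9 documents
common = smoothed_idf(1000, 999)  # word found in almost every document
unseen = smoothed_idf(1000, 0)    # word found in no document: still well defined
```

The rarer the word, the larger the value, and a word that never occurs no longer causes a division by zero.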

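TF-IDF = term frequency (TF) × inverse document frequency (IDF)

TF-IDF is proportional to the number of times a word appears in a document and inversely proportional to how often the word appears across the language as a whole.

It is a statistical method that evaluates how important a keyword or term is to a document within a corpus or collection of files. A keyword's importance is proportional to the number of times it appears in the document, but inversely proportional to how frequently it appears across the corpus.

Main idea: if a keyword appears with high frequency (TF) in one document but rarely in other documents, it is considered to discriminate well between documents.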
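
The main idea can be illustrated with a small sketch that scores the words of one document against a toy corpus. The documents are assumed to be already segmented (in practice the tokens would come from jieba), and both the helper name `tf_idf_scores` and the corpus are invented for this example. It uses the normalized TF and the smoothed IDF described above.

```python
import math
from collections import Counter

def tf_idf_scores(doc_tokens, corpus):
    """Score every word of doc_tokens with normalized TF times smoothed IDF."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc_tokens)                    # normalized term frequency
        df = sum(1 for doc in corpus if word in doc)    # documents containing the word
        idf = math.log((n_docs + 1) / (df + 1))         # smoothed inverse document frequency
        scores[word] = tf * idf
    return scores

# Hypothetical, pre-segmented corpus of three documents
corpus = [
    ["我们", "喜欢", "分词", "分词"],
    ["我们", "喜欢", "音乐"],
    ["我们", "讨论", "天气"],
]
scores = tf_idf_scores(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus the word that is frequent in the first document and absent elsewhere ranks first, while the word shared by every document scores zero.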

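IDF text corpus

In jieba's TF-IDF model, calling the keyword extraction function jieba.analyse.extract_tags() uses the default IDF corpus.

The IDF corpus is an idf dictionary that the jieba maintainers computed over a large body of text with the formula

$$IDF = \log\frac{\text{total number of documents in the corpus}}{\text{number of documents containing term } w + 1}$$

Its keys are the words produced by segmentation, and its values are the corresponding IDF values.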
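
As a rough illustration of what such a dictionary looks like on disk, the sketch below parses text in a "word value" layout, one pair per line separated by a single space. The helper name `load_idf` and the sample values are made up. The median is computed because, as far as I know, jieba falls back to the median IDF for words that are missing from the file; treat that detail as an assumption to verify against your jieba version.

```python
import io

def load_idf(lines):
    """Parse 'word value' lines into a {word: float} dictionary."""
    idf = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, value = line.split(" ")
        idf[word] = float(value)
    return idf

# Made-up sample content; a real file would be opened with encoding='utf-8'
sample = io.StringIO("分词 7.5\n我们 2.1\n语料库 9.3\n")
idf = load_idf(sample)

# Median value, usable as a fallback for words that are not in the dictionary
median_idf = sorted(idf.values())[len(idf) // 2]
```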

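Computing a custom IDF text corpus

1. Read the text files, segment them, and remove stop words to build the all_dict dictionary;

2. Compute the IDF values into the idf_dict dictionary and save it to a txt file.

0. Main function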


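import math
import os
import jieba

corpus_path = './'  # root of the raw corpus, one subdirectory per category (placeholder)
seg_path = './'     # root of the segmented corpus (placeholder; use a different directory than corpus_path)
all_dict = dict()   # word -> number of documents containing it
total = 0           # total number of documents processed

# catelist is not defined in the original; it is assumed here to be the category subdirectories of corpus_path.
catelist = [d for d in os.listdir(corpus_path) if os.path.isdir(os.path.join(corpus_path, d))]

# Walk through every file in every category directory
for mydir in catelist:
    class_path = corpus_path + mydir + "/"    # path of the category subdirectory
    seg_dir = seg_path + mydir + "/"          # matching directory for the segmented output
    if not os.path.exists(seg_dir):           # create the directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)        # all files under class_path
    for file_path in file_list:               # iterate over the files of this category
        fullname = class_path + file_path     # full path of the file
        content = readfile(fullname).strip()  # read the file content

        outstr = get_cut_word(content)        # segment the content
        # Save the segmented text; joined with spaces so word boundaries survive
        savefile(seg_dir + file_path, " ".join(outstr))

        # Update the count of documents containing each word
        all_dict = get_all_dict(outstr, all_dict)
        total += 1

# Build idf_dict once all documents have been counted
idf_dict = get_idf_dict(all_dict, total)
# Save as txt; the file must be utf-8 encoded, otherwise jieba cannot read it.
with open('wdic.txt', 'w', encoding='utf-8') as fw:
    for k in idf_dict:
        if k != '\n':
            fw.write(k + ' ' + idf_dict[k] + '\n')  # one "word idf" pair per line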

1. Segment the text and remove stop words


def get_cut_word(content):
    content_seg = jieba.cut(content.strip(), cut_all=False)  # segment the content
    stopwords = set(stopwordslist('./stopwords'))            # a set makes membership tests fast
    outstr = []

    for word in content_seg:
        # Skip stop words as well as tabs, newlines, and spaces
        if word in stopwords or word in ('\t', '\n', ' '):
            continue
        outstr.append(word)
    print('Segmentation finished.')
    return outstr

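2. Count the documents that contain each word

For each document, check which words occur in its segmentation result, and accumulate the number of documents containing each word. This yields the all_dict dictionary, whose keys are words and whose values are the number of documents containing that word.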


def get_all_dict(outstr, all_dict):
    # temp_dict records each distinct word of this document once (presence only, not the count).
    temp_dict = {}
    for word in outstr:
        temp_dict[word] = 1
    # all_dict maps each word to the number of documents that contain it.
    # This loop runs once per document, after temp_dict is complete.
    for key in temp_dict:
        num = all_dict.get(key, 0)
        all_dict[key] = num + 1
    print('Word dictionary updated.')
    return all_dict
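
The same document-frequency count can be written more compactly with `collections.Counter`. This is an alternative sketch, not the code used in this article; the function name `document_frequency` and the sample documents are invented, and the documents are assumed to be already segmented.

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Return a Counter mapping each word to the number of documents containing it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set() ensures a word counts once per document
    return df

# Hypothetical, pre-segmented documents
docs = [
    ["分词", "分词", "工具"],
    ["分词", "语料"],
    ["分词", "天气", "天气"],
]
df = document_frequency(docs)
```

Repeated occurrences inside one document do not inflate the count, which is the property the IDF calculation depends on.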

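3. Compute the IDF values and save them to a txt file

The keys of idf_dict are words and the values are the corresponding IDF values. The idf_dict dictionary is the generated IDF corpus.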


import math

def get_idf_dict(all_dict, total):
    idf_dict = {}
    for key in all_dict:
        # Keep only keys made up entirely of Chinese characters (U+4E00 to U+9FA5).
        # Keys stay as str: in Python 3, bytes cannot be compared with str.
        if key and all('\u4e00' <= ch <= '\u9fa5' for ch in key):
            p = '%.10f' % (math.log10(total / (all_dict[key] + 1)))
            idf_dict[key] = p
    print('IDF dictionary built.')
    return idf_dict
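
One detail worth spelling out is the Chinese-character filter. Comparing a whole string against the range endpoints is a lexicographic comparison, so in practice it mostly tests the first character. A minimal sketch of a per-character check follows; the helper name `is_chinese_word` is hypothetical.

```python
def is_chinese_word(word):
    """True only if word is non-empty and every character lies in U+4E00..U+9FA5."""
    return bool(word) and all("\u4e00" <= ch <= "\u9fa5" for ch in word)

# Whole-string comparison lets a mixed token through because only '分' decides the result
naive_check = "\u4e00" <= "分词2" <= "\u9fa5"
strict_check = is_chinese_word("分词2")
```

Here `naive_check` accepts the mixed token while `strict_check` rejects it.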

Loading custom stop words


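def stopwordslist(path="./stopwords"):
    """
    Load stop words from every .txt file in the given directory.
    """
    stopwords_list = []
    for filename in os.listdir(path):
        file_ext = os.path.splitext(filename)[1]
        if file_ext == '.txt':
            # The original calls a read_text() helper that is not shown. It is assumed
            # here to return one stop word per line of a utf-8 encoded file.
            with open(os.path.join(path, filename), 'r', encoding='utf-8') as f:
                stopwords_list.extend(line.strip() for line in f if line.strip())
    # Remove duplicates
    stopwords = list(set(stopwords_list))
    print('Stop words loaded.')
    return stopwords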

Reading and saving files


def readfile(path):
    """
    Read the text content of a file.
    """
    ftextlist = []
    # gb18030 is a superset of GBK and GB2312, so it decodes most legacy Chinese text files
    with open(path, 'r', encoding='gb18030') as f:
        for line in f:
            line = line.strip('\n').replace('\u3000', ' ')  # '\u3000' is the full-width space
            ftextlist.append(line)
    filetxt = "".join(ftextlist)
    # In Python 3 the text is already str, so replace() takes str arguments, not encoded bytes
    filetxt = filetxt.replace("\r\n", "")  # remove line breaks
    filetxt = filetxt.replace(" ", "")     # remove extra spaces
    print(f'Finished reading {path}.')
    return filetxt

def savefile(path, content):
    """
    Save content to a file.
    """
    # 'a' appends, so re-running the script adds to any existing output file
    with open(path, 'a', encoding='utf-8') as f:
        f.write(content)
    print(f'Finished writing {path}.')
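
readfile assumes every input file is gb18030 encoded. If a corpus mixes encodings, one possible approach is to read the raw bytes and try several encodings in order. The sketch below is an assumption about how that could look, not part of the original workflow; the name `decode_text` and the encoding order are illustrative.

```python
def decode_text(raw, encodings=("utf-8", "gb18030")):
    """Decode raw bytes with the first encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("none of the candidate encodings could decode the data")

# The same text stored in two different encodings decodes to the same string
from_gb = decode_text("中文分词".encode("gb18030"))
from_utf8 = decode_text("中文分词".encode("utf-8"))
```

utf-8 is tried first because it is strict and rejects most gb18030 byte sequences, whereas gb18030 accepts a much wider range of bytes and could silently mis-decode utf-8 input.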

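Summary

The key part of this article is shown below: loop over all documents, count the number of documents that contain each word, and compute the IDF values.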


import math
import jieba

# lines is assumed to be an iterable of documents, one string per document
all_dict = {}
idf_dict = {}
total = 0
for line in lines:  # line is a single document
    temp_dict = {}
    total += 1
    cut_line = jieba.cut(line, cut_all=False)
    for word in cut_line:
        temp_dict[word] = 1
    for key in temp_dict:
        num = all_dict.get(key, 0)
        all_dict[key] = num + 1
for key in all_dict:
    p = '%.10f' % (math.log10(total / (all_dict[key] + 1)))
    idf_dict[key] = p
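
As a final step, the dictionary has to be written out in the "word value" layout, utf-8 encoded. The following is a minimal sketch under the assumption that the documents are already segmented; `build_idf`, `write_idf_file`, the sample documents, and the file name are all hypothetical.

```python
import math
import os
import tempfile

def build_idf(tokenized_docs):
    """Compute log10(total / (document_count + 1)) for every word."""
    total = len(tokenized_docs)
    doc_count = {}
    for tokens in tokenized_docs:
        for word in set(tokens):  # count each word once per document
            doc_count[word] = doc_count.get(word, 0) + 1
    return {word: math.log10(total / (n + 1)) for word, n in doc_count.items()}

def write_idf_file(idf, path):
    """Write one 'word value' pair per line, utf-8 encoded."""
    with open(path, "w", encoding="utf-8") as fw:
        for word, value in idf.items():
            fw.write(f"{word} {value:.10f}\n")

# Hypothetical, pre-segmented documents
docs = [["分词", "工具"], ["分词", "语料"], ["分词", "天气"]]
idf = build_idf(docs)
idf_path = os.path.join(tempfile.mkdtemp(), "custom_idf.txt")
write_idf_file(idf, idf_path)
```

The resulting path can then be handed to jieba.analyse.set_idf_path() before calling jieba.analyse.extract_tags(). Note that with this formula a word found in every document gets a slightly negative value, which matters mostly for very small corpora.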
