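HMM-Based Chinese Part-of-Speech Tagging (POS Tagging)

The code in this post is based on Teacher Xu's code, to which I have added some comments of my own. Many thanks!

1. Part-of-Speech Tagging

1.1 Concept

For background, see the expert introduction "A Brief Introduction to Chinese Part-of-Speech Tagging".

1.2 Task

Given the tagged corpus corpus4pos_tagging.txt, train a model and use it to predict the part-of-speech tags of a given text.

Part of the tagged corpus is shown below: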

19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  （/w  附/v  图片/n  １/m  张/q  ）/w   
19980101-01-001-003/m  （/w  一九九七年/t  十二月/t  三十一日/t  ）/w  
19980101-01-001-004/m  １２月/t  ３１日/t  ，/w  发表/v  １９９８年/t  新年/t  讲话/n  《/w  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  》/w  。/w  （/w  新华社/nt  记者/n  兰/nr  红光/nr  摄/Vg  ）/w  
19980101-01-001-005/m  同胞/n  们/k  、/w  朋友/n  们/k  、/w  女士/n  们/k  、/w  先生/n  们/k  ：/w  
19980101-01-001-006/m  在/p  １９９８年/t  来临/v  之际/f  ，/w  我/r  十分/m  高兴/a  地/u  通过/p  [中央/n  人民/n  广播/vn  电台/n]nt  、/w  [中国/ns  国际/n  广播/vn  电台/n]nt  和/c  [中央/n  电视台/n]nt  ，/w  向/p  全国/n  各族/r  人民/n  ，/w  向/p  [香港/ns  特别/a  行政区/n]ns  同胞/n  、/w  澳门/ns  和/c  台湾/ns  同胞/n  、/w  海外/s  侨胞/n  ，/w  向/p  世界/n  各国/r  的/u  朋友/n  们/k  ，/w  致以/v  诚挚/a  的/u  问候/vn  和/c  良好/a  的/u  祝愿/vn  ！/w

1.3 Preprocessing

Text processing (corpusSplit function): strip whitespace, split each line into word/tag tokens, remove the special bracket characters, and store the resulting sentences in a list.
Data split (out function): distribute the sentences across 20 files (18 training sets, 1 development set, 1 test set). The training sets are cumulative: train.i holds every sentence whose index modulo 20 is at most i, so train.17 is the largest.

# corpusSplit.py
import re   # regular expressions
import sys


def corpusSplit(infile, sentenceList):  # split the corpus into sentences
    fdi = open(infile, 'r', encoding='utf-8')  # open the raw corpus
    fullStopDict = {"。": 1, "；": 1, "？": 1, "！": 1}  # sentence-ending punctuation
    for line in fdi:
        text = line.strip()  # strip leading and trailing whitespace
        if text == "":
            continue
        else:
            infs = text.split()  # split the line into word/tag tokens
            sentence = []
            flag = True  # becomes False if the sentence contains a malformed token
            for s in infs:
                w_p = s.split("/")  # [word, pos]
                if len(w_p) == 2:
                    word = w_p[0]
                    if word.startswith("["):
                        word = word.replace("[", "")  # drop the "[" that opens a compound entity
                    pos = w_p[1]
                    pos = re.sub(r"\].*", "", pos)  # drop "]" and the compound tag after it
                    if word == "" or pos == "":
                        flag = False
                    else:
                        sentence.append(word + "/" + pos)
                    if word in fullStopDict:
                        if flag == True:
                            sentenceList.append(" ".join(sentence))  # tokens joined by spaces
                        flag = True
                        sentence = []
                else:
                    flag = False
            if sentence != [] and flag == True:
                sentenceList.append(" ".join(sentence))
    fdi.close()


def out(sentenceList, out_dir):  # write the sentences to 20 files, 18 of them training files
    fdo_train_list = []
    for i in range(18):
        fdo_train = open(out_dir + "/train.%d" % (i), "w", encoding='utf-8')
        fdo_train_list.append(fdo_train)
    fdo_dev = open(out_dir + "/dev.txt", "w", encoding='utf-8')
    fdo_test = open(out_dir + "/test.txt", "w", encoding='utf-8')
    for sindx in range(len(sentenceList)):
        if sindx % 20 < 18:
            for i in range(sindx % 20, 18):  # later training files receive more sentences
                fdo_train_list[i].write(sentenceList[sindx] + "\n")
        elif sindx % 20 == 18:
            fdo_dev.write(sentenceList[sindx] + "\n")   # development set
        elif sindx % 20 == 19:
            fdo_test.write(sentenceList[sindx] + "\n")  # test set
    for i in range(18):
        fdo_train_list[i].close()   # every opened file is closed again
    fdo_dev.close()
    fdo_test.close()


'''
# Alternative: read the paths from the command line
try:
    infile = sys.argv[1]
    out_dir = sys.argv[2]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile out_dir\n")
    sys.exit(-1)
'''
# step 1 : split the corpus into sentences
infile = "./data/corpus4pos_tagging.txt"
out_dir = "./data"
sentenceList = []
corpusSplit(infile, sentenceList)
# step 2 : write the output files
out(sentenceList, out_dir)
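
The bracket handling in corpusSplit is easier to follow on a single token. The snippet below is a minimal sketch rather than part of the original script; the helper name `clean_token` is hypothetical. It applies the same two steps (drop a leading `[` from the word, drop `]` and everything after it from the tag) to tokens taken from the corpus sample above.

```python
import re


def clean_token(token):
    """Split one 'word/pos' token and strip compound-entity brackets.

    Returns (word, pos), or None when the token is malformed.
    clean_token is a hypothetical helper used only for illustration.
    """
    parts = token.split("/")
    if len(parts) != 2:
        return None
    word, pos = parts
    if word.startswith("["):
        word = word.replace("[", "")  # "[中央" -> "中央"
    pos = re.sub(r"\].*", "", pos)    # "n]nt" -> "n"
    if word == "" or pos == "":
        return None
    return word, pos


if __name__ == "__main__":
    for tok in ["[中央/n", "电台/n]nt", "迈向/v"]:
        print(tok, "->", clean_token(tok))
```

One consequence of this cleaning is that the tag of the whole bracketed entity (for example the `nt` after `]`) is discarded; only the tags of the individual words are kept.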

Sample of the processed text:

19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w
但/c 前进/v 的/u 道路/n 不会/v 也/d 不/d 可能/v 一帆风顺/i ，/w 关键/n 是/v 世界/n 各国/r 人民/n 要/v 进一步/d 团结/a 起来/v ，/w 共同/d 推动/v 早日/d 建立/v 公正/a 合理/a 的/u 国际/n 政治/n 经济/n 新/a 秩序/n 。/w
我们/r 必须/d 进一步/d 深入/ad 学习/v 和/c 掌握/v 党/n 的/u 十五大/j 精神/n ，/w 统揽全局/l ，/w 精心/d 部署/v ，/w 狠抓/v 落实/v ，/w 团结/a 一致/a ，/w 艰苦奋斗/i ，/w 开拓/v 前进/v ，/w 为/p 夺取/v 今年/t 改革/v 开放/v 和/c 社会主义/n 现代化/vn 建设/vn 的/u 新/a 胜利/vn 而/c 奋斗/v 。/w

1.4 Preliminary Statistics

# staForPosDistribution.py
import sys


def add2posDict(pos, pDict):
    if pos in pDict:
        pDict[pos] += 1
    else:
        pDict[pos] = 1


def sta(infile, pDict):
    fdi = open(infile, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        posList = [s.split("/")[1] for s in infs]  # list of tags on this line
        for pos in posList:
            add2posDict(pos, pDict)    # count each tag
            add2posDict("all", pDict)  # count all tokens
    fdi.close()


def out(pDict):
    oList = list(pDict.items())
    oList.sort(key=lambda infs: (infs[1]), reverse=True)  # sort by count, descending
    total = oList[0][1]  # the "all" entry has the largest count, so it sorts first
    for pos, num in oList:
        print("%s\t%.4f" % (pos, num / total))  # tag and its relative frequency


try:
    infile = sys.argv[1]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile\n")
    sys.exit(-1)
pDict = {}
sta(infile, pDict)  # count tag frequencies in the training set
out(pDict)          # print the result

Run the following command to compute the statistics on the largest training set:

python staForPosDistribution.py ./data/train.17

2. Maximum-Probability Model

2.1 Training

For each word, record its total number of occurrences, its most frequent tag, and the probability of that tag.

# trainByMaxProb.py
import sys


def staForWordToPosDict(infile, word2posDict):
    fdi = open(infile, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        for s in infs:
            w_p = s.split("/")
            if len(w_p) == 2:
                word = w_p[0]
                pos = w_p[1]
                if word in word2posDict:
                    if pos in word2posDict[word]:
                        word2posDict[word][pos] += 1
                    else:
                        word2posDict[word][pos] = 1
                else:
                    word2posDict[word] = {pos: 1}
                # nested dictionary {word: {pos: count}}
                # i.e. for each word, how often it carries each tag
    fdi.close()


def getMaxProbPos(posDict):
    total = sum(posDict.values())
    max_num = -1
    max_pos = ""
    for pos in posDict:
        if posDict[pos] > max_num:
            max_num = posDict[pos]
            max_pos = pos
    return max_pos, max_num / total


def out4model(word2posDict, model_file):
    wordNumList = [[word, sum(word2posDict[word].values())] for word in word2posDict]
    # [[word, count]]: each word with its total count over all tags
    wordNumList.sort(key=lambda infs: (infs[1]), reverse=True)  # sort by count, descending
    fdo = open(model_file, "w", encoding='utf-8')
    for word, num in wordNumList:
        pos, prob = getMaxProbPos(word2posDict[word])
        # a word may have several tags; keep the most frequent one and its probability
        if word != "" and pos != "":
            fdo.write("%s\t%d\t%s\t%f\n" % (word, num, pos, prob))
            # columns: word, total count, most frequent tag, probability of that tag
    fdo.close()


try:
    infile = sys.argv[1]
    model_file = sys.argv[2]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile model_file\n")
    sys.exit(-1)
word2posDict = {}
staForWordToPosDict(infile, word2posDict)  # collect statistics from the training file
out4model(word2posDict, model_file)        # write the model file
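
The same statistic can be expressed more compactly with `collections.Counter`. This is only a sketch for illustration: the function name `train_max_prob` and the toy sentences are made up, and it returns a dictionary instead of writing a model file.

```python
from collections import Counter, defaultdict


def train_max_prob(sentences):
    """Return {word: (total_count, best_pos, best_prob)} from tagged sentences.

    Each sentence is a string of space-separated 'word/pos' tokens.
    """
    counts = defaultdict(Counter)  # word -> Counter of its tags
    for sentence in sentences:
        for token in sentence.split():
            parts = token.split("/")
            if len(parts) == 2 and parts[0] and parts[1]:
                counts[parts[0]][parts[1]] += 1
    model = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        best_pos, best_num = tag_counts.most_common(1)[0]
        model[word] = (total, best_pos, best_num / total)
    return model


if __name__ == "__main__":
    # Made-up toy data: 建设 appears once as a verb and twice as a verbal noun.
    toy = ["建设/v 高铁/n", "建设/vn 银行/n", "建设/vn 事业/n"]
    print(train_max_prob(toy)["建设"])
```

Because the model keeps only one tag per word, an ambiguous word such as 建设 always receives its majority tag, regardless of context.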


python trainByMaxProb.py ./data/train.0 ./data/model.MaxProb.0

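Each line of the output model file model.MaxProb.0 holds a word, its total count, its most frequent tag, and the probability of that tag, separated by tabs.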

2.2 Prediction

# predictByMaxProb.py
import sys


def loadModel(model_file, word2posDict):  # load the trained model
    fdi = open(model_file, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        if len(infs) == 4:
            word = infs[0]
            pos = infs[2]
            word2posDict[word] = pos  # the word and its most probable tag
        else:
            sys.stderr.write("format error in " + model_file + "\n")
            sys.stderr.write(line)
            sys.exit(-1)
    fdi.close()


def getWords(infs):
    return [s.split("/")[0] for s in infs]


def predict(infile, word2posDict, outfile):
    fdi = open(infile, 'r', encoding='utf-8')
    fdo = open(outfile, 'w', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        # hide the gold tags: a closed-book exam
        words = getWords(infs)  # keep only the words of the input file
        results = []
        for word in words:
            if word in word2posDict:  # look up the word's most probable tag
                results.append(word + "/" + word2posDict[word])
            else:
                results.append(word + "/unknown")
        fdo.write(" ".join(results) + "\n")  # write to the output file
    fdo.close()
    fdi.close()


try:
    infile = sys.argv[1]
    model_file = sys.argv[2]
    outfile = sys.argv[3]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile model_file outfile\n")
    sys.exit(-1)
word2posDict = {}
loadModel(model_file, word2posDict)     # load the trained model
predict(infile, word2posDict, outfile)  # write the predictions

Run the prediction:

python predictByMaxProb.py ./data/train.0 ./data/model.MaxProb.0 ./data/train.0.MaxProb.predict

Part of the prediction file train.0.MaxProb.predict:

19980101-01-001-001/m 迈向/v 充满/v 希望/v 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/nr ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w
但/c 前进/v 的/u 道路/n 不会/v 也/d 不/d 可能/v 一帆风顺/i ，/w 关键/n 是/v 世界/n 各国/r 人民/n 要/v 进一步/d 团结/v 起来/v ，/w 共同/d 推动/v 早日/d 建立/v 公正/a 合理/a 的/u 国际/n 政治/n 经济/n 新/a 秩序/n 。/w
我们/r 必须/d 进一步/d 深入/v 学习/v 和/c 掌握/v 党/n 的/u 十五大/j 精神/n ，/w 统揽全局/l ，/w 精心/ad 部署/vn ，/w 狠抓/v 落实/v ，/w 团结/v 一致/a ，/w 艰苦奋斗/i ，/w 开拓/v 前进/v ，/w 为/p 夺取/v 今年/t 改革/vn 开放/v 和/c 社会主义/n 现代化/vn 建设/vn 的/u 新/a 胜利/vn 而/c 奋斗/v 。/w

2.3 Evaluation

# resultEval.py
import sys


def getPosList(infs):
    return [s.split("/")[1] for s in infs]


def add2staDict(pos, indx, staDict):
    if pos not in staDict:
        staDict[pos] = [pos, 0, 0, 0]
    staDict[pos][indx] += 1


def add2errDict(mykey, errDict):
    if mykey in errDict:
        errDict[mykey] += 1
    else:
        errDict[mykey] = 1


def sta(label_file, predict_file, staDict, errDict):
    fdi1 = open(label_file, 'r', encoding='utf-8')
    fdi2 = open(predict_file, 'r', encoding='utf-8')
    while True:
        line1 = fdi1.readline()
        line2 = fdi2.readline()
        if line1 == "" and line2 == "":
            break
        elif line1 == "" or line2 == "":
            sys.stderr.write("the number of lines is not equal between %s and %s!\n" % (
                label_file, predict_file))
            sys.exit(-1)
        else:
            labelList = getPosList(line1.strip().split())    # gold tags
            predictList = getPosList(line2.strip().split())  # predicted tags
            if len(labelList) != len(predictList):
                sys.stderr.write("the number of words is not equal between %s and %s!\n" % (
                    label_file, predict_file))
                sys.exit(-1)
            else:
                for i in range(len(labelList)):
                    label = labelList[i]
                    predict = predictList[i]
                    # staDict[pos] = [pos, gold count, predicted count, correct count]
                    add2staDict(label, 1, staDict)
                    add2staDict(predict, 2, staDict)
                    add2staDict("all", 1, staDict)
                    add2staDict("all", 2, staDict)
                    if label == predict:
                        add2staDict(label, 3, staDict)
                        add2staDict("all", 3, staDict)
                    else:
                        add2errDict("%s-->%s" % (label, predict), errDict)  # count each confusion pair
                        add2errDict("all-->all", errDict)
    fdi2.close()
    fdi1.close()


def out(staDict, errDict, outfile):
    staList = list(staDict.values())
    staList.sort(key=lambda infs: (infs[1]), reverse=True)
    errList = list(errDict.items())
    errList.sort(key=lambda infs: (infs[1]), reverse=True)
    fdo = open(outfile, 'w', encoding='utf-8')
    total = staList[0][1]  # the "all" row has the largest gold count, so it sorts first
    for pos, nlabel, npredict, nright in staList:
        # columns: tag, share of all tokens, precision, recall
        fdo.write("pos_%s\t%.4f\t%.4f\t%.4f\n" % (pos,
            nlabel / total,
            nright / (npredict if npredict > 0 else 100),
            nright / (nlabel if nlabel > 0 else 100)))
    if errList:  # with no errors at all there is nothing to report
        total = errList[0][1]
        for errKey, num in errList:
            fdo.write("err_%s\t%.4f\n" % (errKey, num / total))
    fdo.close()


try:
    label_file = sys.argv[1]
    predict_file = sys.argv[2]
    outfile = sys.argv[3]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " label_file predict_file outfile\n")
    sys.exit(-1)
staDict = {}
errDict = {}
sta(label_file, predict_file, staDict, errDict)  # compute the statistics
out(staDict, errDict, outfile)                   # write the evaluation file
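
The three numbers written for each tag are its share of all tokens, its precision (correct / predicted), and its recall (correct / gold). For the `all` row, precision and recall coincide and equal the overall accuracy. The following sketch recomputes precision and recall on a tiny hand-made example; the function name `per_tag_scores` and the toy tag sequences are assumptions for illustration, not part of resultEval.py.

```python
from collections import Counter


def per_tag_scores(gold_tags, pred_tags):
    """Return {tag: (precision, recall)}; key 'all' holds (accuracy, accuracy)."""
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences differ in length")
    n_gold, n_pred, n_right = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_tags, pred_tags):
        n_gold[gold] += 1
        n_pred[pred] += 1
        if gold == pred:
            n_right[gold] += 1
    scores = {}
    for tag in set(n_gold) | set(n_pred):
        # A tag that was never predicted (or never occurs in gold) scores 0.
        precision = n_right[tag] / n_pred[tag] if n_pred[tag] else 0.0
        recall = n_right[tag] / n_gold[tag] if n_gold[tag] else 0.0
        scores[tag] = (precision, recall)
    total = len(gold_tags)
    accuracy = sum(n_right.values()) / total if total else 0.0
    scores["all"] = (accuracy, accuracy)
    return scores


if __name__ == "__main__":
    gold = ["n", "v", "n", "a"]   # made-up gold tags
    pred = ["n", "n", "n", "v"]   # made-up predictions
    for tag, (p, r) in sorted(per_tag_scores(gold, pred).items()):
        print("%s\tprecision=%.4f\trecall=%.4f" % (tag, p, r))
```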

Run the evaluation:

python resultEval.py ./data/train.0 ./data/train.0.MaxProb.predict ./data/train.0.MaxProb.eval

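In the evaluation file train.0.MaxProb.eval, each pos_ line lists a tag, its share of all tokens, its precision, and its recall; each err_ line lists a confusion pair (gold-->predicted) and its count relative to the total number of errors.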

2.4 Visualization

A shell script runs the whole pipeline on all 18 training sets:

echo "Set PYTHON to the interpreter path on this machine"
# A variable is used instead of an alias because bash does not expand
# aliases in non-interactive scripts by default.
PYTHON=/usr/local/bin/python3.7
for ((i=0; i<=17; i++))
do
    # step 1 : maximum-probability model
    # step 1.1 : train the model
    "$PYTHON" trainByMaxProb.py ./data/train.${i} ./data/model.MaxProb.${i}
    # step 1.2 : evaluate on the training set
    "$PYTHON" predictByMaxProb.py ./data/train.${i} ./data/model.MaxProb.${i} ./data/train.${i}.MaxProb.predict
    "$PYTHON" resultEval.py ./data/train.${i} ./data/train.${i}.MaxProb.predict ./data/train.${i}.MaxProb.eval
    # step 1.3 : evaluate on the development set
    "$PYTHON" predictByMaxProb.py ./data/dev.txt ./data/model.MaxProb.${i} ./data/dev.${i}.MaxProb.predict
    "$PYTHON" resultEval.py ./data/dev.txt ./data/dev.${i}.MaxProb.predict ./data/dev.${i}.MaxProb.eval
    # step 1.4 : evaluate on the test set
    "$PYTHON" predictByMaxProb.py ./data/test.txt ./data/model.MaxProb.${i} ./data/test.${i}.MaxProb.predict
    "$PYTHON" resultEval.py ./data/test.txt ./data/test.${i}.MaxProb.predict ./data/test.${i}.MaxProb.eval
done
echo "FINISH !!!"

From every .eval file, read the accuracy in the 3rd or 4th column of the first line (the pos_all row, where precision and recall are both equal to the overall accuracy), and plot accuracy against corpus size.

# -*- coding:utf-8 -*-
# python3.7
# @Time: 2019/12/20 23:03
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: resultView.py
import matplotlib.pyplot as plt

trainEval = []
devEval = []
testEval = []
for i in range(18):
    filename1 = "./data/train." + str(i) + ".MaxProb.eval"
    filename2 = "./data/dev." + str(i) + ".MaxProb.eval"
    filename3 = "./data/test." + str(i) + ".MaxProb.eval"
    # the first line of each file is the pos_all row; its 3rd column is the accuracy
    with open(filename1, 'r', encoding='utf-8') as f1:
        trainEval.append(float(f1.readline().split()[2]))
    with open(filename2, 'r', encoding='utf-8') as f2:
        devEval.append(float(f2.readline().split()[2]))
    with open(filename3, 'r', encoding='utf-8') as f3:
        testEval.append(float(f3.readline().split()[2]))

plt.title("Results for different corpus sizes")
plt.xlabel("Training set index (corpus size)")
plt.ylabel("Accuracy")
plt.plot(trainEval, 'r-', devEval, 'b-', testEval, 'g-')
plt.legend(('train', 'dev', 'test'), loc='upper right')
plt.show()

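The plot shows that, as the training corpus grows, accuracy on the development and test sets keeps rising: quickly at first, then more and more slowly, until it plateaus at around 90%.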

3. Bigram Hidden Markov Model (BiHMM)

For an introduction to HMMs, see my blog post "Hidden Markov Model (HMM) Notes".
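
As a brief reminder of what the bigram HMM computes, here is the objective in standard HMM notation (a summary under the usual assumptions, not a quotation from the linked notes). Let $x_1,\dots,x_T$ be the words and $y_1,\dots,y_T$ the tags, with $y_0 = \text{\_\_start\_\_}$ and $y_{T+1} = \text{\_\_end\_\_}$. The end state emits nothing, which matches the scripts in this section.

```latex
\hat{y}_{1:T} = \arg\max_{y_{1:T}} \left[ \sum_{t=1}^{T} \Big( \log P(y_t \mid y_{t-1}) + \log P(x_t \mid y_t) \Big) + \log P(y_{T+1} \mid y_T) \right]

% Viterbi recursion used for decoding:
\delta_1(y) = \log P(y \mid y_0) + \log P(x_1 \mid y)
\delta_t(y) = \max_{y'} \big[ \delta_{t-1}(y') + \log P(y \mid y') \big] + \log P(x_t \mid y), \qquad t = 2,\dots,T
```

Training estimates the transition probabilities $P(y_t \mid y_{t-1})$ and the emission probabilities $P(x_t \mid y_t)$ from counts; prediction evaluates the recursion and follows the back-pointers from the best final state.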

3.1 Training

# -*- coding: UTF-8 -*-
# trainByBiHMM.py
import math
import sys


def add2transDict(pos1, pos2, transDict):
    if pos1 in transDict:
        if pos2 in transDict[pos1]:
            transDict[pos1][pos2] += 1
        else:
            transDict[pos1][pos2] = 1
    else:
        transDict[pos1] = {pos2: 1}


def add2emitDict(pos, word, emitDict):
    if pos in emitDict:
        if word in emitDict[pos]:
            emitDict[pos][word] += 1
        else:
            emitDict[pos][word] = 1
    else:
        emitDict[pos] = {word: 1}


def sta(infile, transDict, emitDict):
    fdi = open(infile, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        wpList = [["__NONE__", "__start__"]] + [s.split("/") for s in infs] + [["__NONE__", "__end__"]]
        # boundary handling: add a start marker and an end marker
        for i in range(1, len(wpList)):
            pre_pos = wpList[i - 1][1]  # previous tag (hidden state y_t-1)
            cur_pos = wpList[i][1]      # current tag (hidden state y_t)
            word = wpList[i][0]         # current observation (emitted word) x_t
            if word == "" or cur_pos == "" or pre_pos == "":
                continue
            add2transDict(pre_pos, cur_pos, transDict)  # count the transition
            add2emitDict(cur_pos, word, emitDict)       # count the emission
        add2transDict("__end__", "__end__", transDict)
    fdi.close()


def getPosNumList(transDict):
    pnList = []
    for pos in transDict:  # {pre_pos: {cur_pos: count}}
        # if pos == "__start__" or pos == "__end__":
        #     continue
        num = sum(transDict[pos].values())
        pnList.append([pos, num])  # how often this tag occurs as the previous tag
    pnList.sort(key=lambda infs: (infs[1]), reverse=True)
    return pnList


def getTotalWordNum(emitDict):
    total_word_num = 0
    for pos in emitDict:
        total_word_num += sum(list(emitDict[pos].values()))
    return total_word_num


def out4model(transDict, emitDict, model_file):
    pnList = getPosNumList(transDict)

    # state set
    fdo = open(model_file, 'w', encoding='utf-8')
    total = sum([num for pos, num in pnList])  # total number of tag occurrences
    for pos, num in pnList:
        # columns: tag, count, relative frequency
        fdo.write("pos_set\t%s\t%d\t%f\n" % (pos, num, num / total))

    # transition probabilities
    total_word_num = getTotalWordNum(emitDict)  # {cur_pos: {word: count}}
    for pos1, num1 in pnList:  # previous tag and its count
        if pos1 == "__end__":
            continue
        # smoothing_factor = num1/total_word_num  # smoothing scheme 1
        smoothing_factor = 1.0                    # smoothing scheme 2
        tmpList = []
        for pos2, _ in pnList:
            if pos2 == "__start__":
                continue
            if pos2 in transDict[pos1]:
                tmpList.append([pos2, transDict[pos1][pos2] + smoothing_factor])
            else:
                tmpList.append([pos2, smoothing_factor])
        denominator = sum([infs[1] for infs in tmpList])
        for pos2, numerator in tmpList:
            fdo.write("trans_prob\t%s\t%s\t%f\n" % (pos1, pos2, math.log(numerator / denominator)))

    # emission probabilities
    for pos, _ in pnList:
        if pos == "__start__" or pos == "__end__":
            continue
        wnList = list(emitDict[pos].items())
        wnList.sort(key=lambda infs: infs[1], reverse=True)
        num = sum([cnt for _, cnt in wnList])
        # smoothing_factor = num/total_word_num  # smoothing scheme 1
        smoothing_factor = 1.0                   # smoothing scheme 2
        tmpList = []
        for word, cnt in wnList:
            tmpList.append([word, cnt + smoothing_factor])
        tmpList.append(["__NEW__", smoothing_factor])
        # smoothed mass for words never seen with this tag
        denominator = sum([infs[1] for infs in tmpList])
        for word, numerator in tmpList:
            fdo.write("emit_prob\t%s\t%s\t%f\n" % (pos, word, math.log(numerator / denominator)))
    fdo.close()


try:
    infile = sys.argv[1]
    model_file = sys.argv[2]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile model_file\n")
    sys.exit(-1)
transDict = {}  # transition counts
emitDict = {}   # emission counts
sta(infile, transDict, emitDict)
out4model(transDict, emitDict, model_file)

Run the training:

python trainByBiHMM.py ./data/train.0 ./data/model.BiHMM.0

Part of the generated model file model.BiHMM.0:

pos_set	n	77189	0.192527
pos_set	v	59762	0.149060
pos_set	w	54829	0.136756
pos_set	u	24474	0.061044
pos_set	m	19642	0.048992
pos_set	d	15820	0.039459
pos_set	__start__	15432	0.038491
pos_set	__end__	15432	0.038491
pos_set	vn	15115	0.037700
(omitted)
trans_prob	n	n	-1.710944
trans_prob	n	v	-1.924026
trans_prob	n	w	-1.346831
trans_prob	n	u	-2.400427
trans_prob	n	m	-4.080009
trans_prob	n	d	-2.913617
trans_prob	n	__end__	-4.992247
trans_prob	n	vn	-2.887937
(omitted)
emit_prob	n	人	-4.622626
emit_prob	n	经济	-4.715296
emit_prob	n	企业	-4.757801
emit_prob	n	记者	-4.804948
emit_prob	n	国家	-4.840039
emit_prob	n	问题	-4.980944
emit_prob	n	人民	-5.088500
emit_prob	n	全国	-5.099550
emit_prob	Bg	翠	-0.405465
emit_prob	Bg	__NEW__	-12.862278

3.2 Prediction

# -*- coding: UTF-8 -*-
# predictByBiHMM.py
import sys


def add2transDict(pos1, pos2, prob, transDict):
    if pos1 in transDict:
        transDict[pos1][pos2] = prob
    else:
        transDict[pos1] = {pos2: prob}


def add2emitDict(pos, word, prob, emitDict):
    if pos in emitDict:
        emitDict[pos][word] = prob
    else:
        emitDict[pos] = {word: prob}


def loadModel(infile, gPosList, transDict, emitDict):
    fdi = open(infile, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        if infs[0] == "pos_set":
            pos = infs[1]
            if pos != "__start__" and pos != "__end__":
                gPosList.append(pos)
        if infs[0] == "trans_prob":
            pos1 = infs[1]
            pos2 = infs[2]
            prob = float(infs[3])
            add2transDict(pos1, pos2, prob, transDict)
        if infs[0] == "emit_prob":
            pos = infs[1]
            word = infs[2]
            prob = float(infs[3])
            add2emitDict(pos, word, prob, emitDict)
    fdi.close()


def getWords(infs):
    return [s.split("/")[0] for s in infs]  # keep only the words


def getEmitProb(emitDict, pos, word):
    if word in emitDict[pos]:
        return emitDict[pos][word]
    else:
        return emitDict[pos]["__NEW__"]  # smoothed value for unseen words


def predict4one(words, gPosList, transDict, emitDict, results):
    if words == []:
        return
    prePosDictList = []
    for i in range(len(words)):  # iterate over the words, i.e. over time steps
        prePosDict = {}
        for pos in gPosList:     # iterate over the tags, i.e. over states
            if i == 0:           # initial time step
                trans_prob = transDict["__start__"][pos]
                emit_prob = getEmitProb(emitDict, pos, words[i])
                total_prob = trans_prob + emit_prob  # log probabilities: logA + logB = log(AB)
                prePosDict[pos] = [total_prob, "__start__"]
            else:
                emit_prob = getEmitProb(emitDict, pos, words[i])
                max_total_prob = -10000000.0
                max_pre_pos = ""
                for pre_pos in prePosDictList[i - 1]:  # find the best predecessor
                    pre_prob = prePosDictList[i - 1][pre_pos][0]
                    trans_prob = transDict[pre_pos][pos]
                    total_prob = pre_prob + trans_prob + emit_prob
                    if max_pre_pos == "" or total_prob > max_total_prob:
                        max_total_prob = total_prob
                        max_pre_pos = pre_pos
                prePosDict[pos] = [max_total_prob, max_pre_pos]
        prePosDictList.append(prePosDict)
    max_total_prob = -10000000.0
    max_pre_pos = ""
    for pre_pos in prePosDictList[len(prePosDictList) - 1]:  # last column
        pre_prob = prePosDictList[len(prePosDictList) - 1][pre_pos][0]
        trans_prob = transDict[pre_pos]["__end__"]
        total_prob = pre_prob + trans_prob  # the end state emits nothing
        if max_pre_pos == "" or total_prob > max_total_prob:
            max_total_prob = total_prob
            max_pre_pos = pre_pos
    posList = [max_pre_pos]  # best path, built backwards
    indx = len(prePosDictList) - 1
    max_pre_pos = prePosDictList[indx][max_pre_pos][1]
    indx -= 1
    while indx >= 0:
        posList.append(max_pre_pos)
        max_pre_pos = prePosDictList[indx][max_pre_pos][1]  # follow the back-pointers
        indx -= 1
    if len(posList) == len(words):
        posList.reverse()  # the path was recovered back to front, so reverse it
        for i in range(len(posList)):
            results.append(words[i] + "/" + posList[i])  # predicted result
    else:
        sys.stderr.write("error : the number of pos is not equal to the number of words!\n")
        sys.exit(-1)


def predict(infile, gPosList, transDict, emitDict, outfile):
    fdi = open(infile, 'r', encoding='utf-8')
    fdo = open(outfile, "w", encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        # hide the gold tags: a closed-book exam
        words = getWords(infs)
        results = []
        predict4one(words, gPosList, transDict, emitDict, results)
        fdo.write(" ".join(results) + "\n")
    fdo.close()
    fdi.close()


try:
    infile = sys.argv[1]
    model_file = sys.argv[2]
    outfile = sys.argv[3]
except IndexError:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile model_file outfile\n")
    sys.exit(-1)
gPosList = []
transDict = {}
emitDict = {}
loadModel(model_file, gPosList, transDict, emitDict)
predict(infile, gPosList, transDict, emitDict, outfile)

Run the prediction:

python predictByBiHMM.py ./data/train.0 ./data/model.BiHMM.0 ./data/train.0.BiHMM.predict

Part of the generated prediction file train.0.BiHMM.predict:

19980101-01-001-001/m 迈向/v 充满/v 希望/v 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w

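3.3 Evaluation

Run the evaluation:

python resultEval.py ./data/train.0 ./data/train.0.BiHMM.predict ./data/train.0.BiHMM.eval

Part of the evaluation file train.0.BiHMM.eval is shown below (accuracy is about 95%). The columns are the tag, its share of all tokens, precision, and recall: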

pos_all	1.0000	0.9541	0.9541
pos_n	0.2086	0.9790	0.9815
pos_v	0.1615	0.9331	0.8918
pos_w	0.1482	0.9907	0.9999
pos_u	0.0661	0.9901	0.9905
pos_m	0.0531	0.9855	0.9746
pos_d	0.0427	0.9442	0.9530
pos_vn	0.0408	0.7667	0.8178
pos_p	0.0366	0.8887	0.9410
pos_a	0.0298	0.9062	0.8951

3.4 Visualization

A shell script runs everything in batch (the whole run took more than a day):

echo "Set PYTHON to the interpreter path on this machine"
# A variable is used instead of an alias because bash does not expand
# aliases in non-interactive scripts by default.
PYTHON=/usr/local/bin/python3.7
for ((i=0; i<=17; i++))
do
    # step 2 : BiHMM model
    # step 2.1 : train the model
    "$PYTHON" trainByBiHMM.py ./data/train.${i} ./data/model.BiHMM.${i}
    # step 2.2 : evaluate on the training set
    "$PYTHON" predictByBiHMM.py ./data/train.${i} ./data/model.BiHMM.${i} ./data/train.${i}.BiHMM.predict
    "$PYTHON" resultEval.py ./data/train.${i} ./data/train.${i}.BiHMM.predict ./data/train.${i}.BiHMM.eval
    # step 2.3 : evaluate on the development set
    "$PYTHON" predictByBiHMM.py ./data/dev.txt ./data/model.BiHMM.${i} ./data/dev.${i}.BiHMM.predict
    "$PYTHON" resultEval.py ./data/dev.txt ./data/dev.${i}.BiHMM.predict ./data/dev.${i}.BiHMM.eval
    # step 2.4 : evaluate on the test set
    "$PYTHON" predictByBiHMM.py ./data/test.txt ./data/model.BiHMM.${i} ./data/test.${i}.BiHMM.predict
    "$PYTHON" resultEval.py ./data/test.txt ./data/test.${i}.BiHMM.predict ./data/test.${i}.BiHMM.eval
done
echo "FINISH !!!"

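As before, read the accuracy in the 3rd or 4th column of the first line of every .eval file and plot accuracy against corpus size.

Compared with the roughly 90% accuracy of the maximum-probability model, the bigram hidden Markov model (BiHMM) raises accuracy to about 94.5%. Accuracy again improves as the corpus grows, and the rate of improvement again levels off.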

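4. Discussion

With a small amount of data, how does each model (maximum probability, bigram HMM, trigram HMM) perform, and where do the differences come from?

Answer: The maximum-probability model is less accurate than the BiHMM, for two reasons.

1. The maximum-probability model has many parameters (number of words × roughly 40 tags), and these are sparse on a small corpus. The transition table that the BiHMM relies on has only about 40 × 40 entries, so with the same corpus those parameters are estimated far more reliably.

2. The BiHMM uses context when predicting, which makes it more accurate.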

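As the corpus grows, what does each model's performance curve look like? What problem does adding data solve?

Answer: On a small corpus the maximum-probability model has low accuracy and therefore a lot of room to improve, so its accuracy rises quickly as data is added. The BiHMM is already fairly accurate on a small corpus, so its accuracy rises more slowly.

Adding data addresses the problem of insufficient statistics. The more data is counted, the closer the estimates come to the true probability distribution. On a small corpus the statistics are insufficient and the estimated distribution may not match reality; as the corpus grows, the distribution approaches the true one and accuracy improves.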

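What limitation of the models makes the improvement hit a ceiling?

Answer: The models are weak on certain kinds of words. Take numerals (m): even after the machine has been told that 20200112 is a date, it does not recognize a different date such as 20200113 and does not know that it is a date. The same holds for person names, place names, and so on; the machine mispredicts these whenever it meets them.

The other cause is lexical ambiguity. Compare 中国/ 建设/ 高铁 ("China builds high-speed rail") with 中国/ 建设/ 银行 ("China Construction Bank"): 建设 is a verb in the first and a verbal noun in the second, and in both cases it is preceded by the noun 中国. Even if the model were very close to the true distribution, decoding outputs only the single most probable path, for example always predicting 建设 as a verb. This is where the model's bottleneck lies.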

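How should misclassified tags be categorized and handled?

Answer: Collect and analyze the words the model does not recognize at prediction time. For instance, numerals that are dates are almost never recognized, so one option is to write a regular expression for the date format and tag any token of the form XXXX-XX-XX as a date. Similarly, for names, a word that starts with a surname could be grouped with the following few characters and tagged as a person name. There are harder cases, though: a word that starts with a surname is not necessarily a name, for example 陈酒 (aged wine).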
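
The regular-expression idea can be sketched as a fallback for out-of-vocabulary tokens. Everything here is an assumption for illustration: the function name `guess_tag_for_unknown`, the two patterns, and the choice of `t` (the tag the sample corpus uses for time expressions such as １２月/t) for date-like tokens. A real system would need rules tuned to the corpus.

```python
import re

# Hypothetical patterns: XXXX-XX-XX or eight consecutive digits count as a date,
# any other run of digits (optionally with a decimal part) counts as a numeral.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{8}$")
NUMBER_PATTERN = re.compile(r"^\d+(\.\d+)?$")


def guess_tag_for_unknown(word):
    """Return a guessed tag for an out-of-vocabulary word, or None if no rule fires."""
    if DATE_PATTERN.match(word):
        return "t"  # assumed: time expression
    if NUMBER_PATTERN.match(word):
        return "m"  # numeral
    return None


if __name__ == "__main__":
    for w in ["2020-01-13", "20200113", "123", "陈酒"]:
        print(w, "->", guess_tag_for_unknown(w))
```

Rules like these only cover regular surface forms. The 陈酒 case above shows why surname-based rules for person names are much harder to get right.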

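How can decoding be made more efficient?

Answer: Avoid deeply nested for-loops, and wherever possible use existing tools such as numpy to do the work as matrix operations.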
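
As one possible illustration of the numpy suggestion, the sketch below replaces the two inner loops over tags with array operations. It is not the script from section 3.2: the function name `viterbi`, the array layout, and the toy probabilities are assumptions, and the caller is expected to have converted the model's dictionaries into dense arrays of log probabilities.

```python
import numpy as np


def viterbi(log_start, log_trans, log_end, log_emit):
    """Vectorized Viterbi decoding in log space.

    log_start: shape (K,)   log P(tag | __start__)
    log_trans: shape (K, K) log P(tag_j | tag_i); rows index the previous tag
    log_end:   shape (K,)   log P(__end__ | tag)
    log_emit:  shape (T, K) log P(word_t | tag) for each position t
    Returns the best tag index sequence as a list of length T.
    """
    T, K = log_emit.shape
    if T == 0:
        return []
    score = log_start + log_emit[0]
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score of a path ending in tag i, followed by i -> j
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    last = int((score + log_end).argmax())  # the end state emits nothing
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return path


if __name__ == "__main__":
    # Made-up two-tag model, three-word sentence.
    log_start = np.log(np.array([0.6, 0.4]))
    log_trans = np.log(np.array([[0.7, 0.2], [0.3, 0.6]]))
    log_end = np.log(np.array([0.1, 0.1]))
    log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
    print(viterbi(log_start, log_trans, log_end, log_emit))
```

The remaining Python loop runs once per word; the work per word is a single K × K array operation instead of K × K dictionary lookups.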

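Label bias and probability smoothing

Answer: A suitable smoothing algorithm has to be chosen. Events that never occurred in training still need to be given a probability, so that the model stays close to reality.

Brute-force method: add 1 to every count. Drawback: for a tag with very few instances, unseen words receive a large emission probability, which can make a wrong path the most probable one overall and so ruin the prediction.

Example: suppose the tag Rg occurs only once in the text, with the word 斯 (as in 逝者如斯夫). With +1 smoothing, when the candidate tag is Rg and the word is not 斯, the count of 斯 is 1 + 1 = 2 and the count of an unknown word is 0 + 1 = 1, so the unknown word gets an emission probability of 1/3. That is a very large probability, enough to beat all other paths, and the predicted tags all come out as Rg. This is why choosing a suitable smoothing algorithm matters.

Variant of the +1 method: take pos1 (n, noun) --> pos2 (v, verb). When pos1 is a noun, pos2 has about 40 possibilities, but some of these transitions have a count of 0. In this case, instead of 1, add p to the count of every possible pos2, where

$p = \mathrm{num}(pos1) / \mathrm{num}(total)$

An analogy is poverty relief. Giving every province 100 million is the brute-force +1 method; allocating in proportion to each province's share of the poor population, so that provinces with more people in need receive more, is the variant method, which is the better of the two.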
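
The Rg example can be checked numerically. The sketch below mirrors the emission smoothing step of the training script for a single tag; the function name `smoothed_emit_log_probs` and the corpus size of 400000 are assumptions chosen only for illustration.

```python
import math


def smoothed_emit_log_probs(word_counts, smoothing_factor):
    """Return {word: log probability} for one tag, including the '__NEW__' bucket.

    Every observed word count and the unseen-word bucket each receive
    smoothing_factor extra mass before normalization.
    """
    smoothed = {w: c + smoothing_factor for w, c in word_counts.items()}
    smoothed["__NEW__"] = smoothing_factor
    denominator = sum(smoothed.values())
    return {w: math.log(v / denominator) for w, v in smoothed.items()}


if __name__ == "__main__":
    rg_counts = {"斯": 1}      # tag Rg seen once, as in the example above
    total_word_num = 400000    # assumed corpus size, for illustration only
    add_one = smoothed_emit_log_probs(rg_counts, 1.0)
    scaled = smoothed_emit_log_probs(
        rg_counts, sum(rg_counts.values()) / total_word_num)
    print("add-one P(__NEW__ | Rg) =", math.exp(add_one["__NEW__"]))
    print("scaled  P(__NEW__ | Rg) =", math.exp(scaled["__NEW__"]))
```

With add-one smoothing the unseen-word bucket of a tag observed once gets one third of the probability mass, whereas a factor proportional to the tag's frequency keeps that mass small for rare tags.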
