Original author: Susan Li
Changing the world, one article at a time. Sr. Data Scientist, Toronto Canada. Opinion=my own.
http://www.linkedin.com/in/susanli/
Finding similar questions on Quora with word2vec and XGBoost
Translator's note: Quora is an English-language question-and-answer site, roughly the counterpart of Zhihu.
URL: https://www.quora.com/ (accessing it from mainland China requires going over the firewall).
Last week we explored several deduplication techniques, using BOW, TFIDF, and XGBoost to identify similar documents. We found that the traditional TFIDF approach already handles the more obvious cases, which helps explain why Google relied on TFIDF for so long in search to judge how important a word is to a page.
To dig deeper and push further, let's explore some new approaches to the same matching and deduplication problem: first we recast deduplication as a classification problem, and then we solve it.
Data
The goal of this task is to identify whether a pair of Quora questions express the same meaning. Each record in the data contains two questions plus a label, assigned by human experts, indicating whether the two questions mean the same thing. Keep in mind that this labeling is subjective: different experts may disagree about whether a given pair of questions have the same meaning. So the labels should be treated as informative rather than 100% accurate.
import pandas as pd

df = pd.read_csv('quora_train.csv')
df = df.dropna(how="any").reset_index(drop=True)

# Print the first 10 question pairs
a = 0
for i in range(a, a + 10):
    print(df.question1[i])
    print(df.question2[i])
    print()
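To get a quick feel for the class balance before modeling, here is a minimal sketch (it assumes the file carries the standard is_duplicate label column that we use later in this post):

print(df.shape)
print(df.is_duplicate.value_counts())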
Computing word mover's distance (WMD)
Word mover's distance lets us assess the similarity of two documents as a "distance" in a meaningful way, even when the documents share no words, because it computes on word2vec vectors. The core idea is to use the embedded word vectors of each document and measure the minimum distance the words of one document have to "travel" to reach the words of the other, as a measure of how different the two documents are. Let's look at an example. Below is a pair of questions that has been labeled as duplicate:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'

question1 = question1.lower().split()
question2 = question2.lower().split()
question1 = [w for w in question1 if w not in stop_words]
question2 = [w for w in question2 if w not in stop_words]
The code above does two things, lowercasing and stopword removal, as a first pass at cleaning the sample.
We use a word2vec model pre-trained on the Google News corpus, loaded through gensim's word2vec interface.
import gensim
from gensim.models import Word2Vec

model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
Now let's compute the WMD between the two questions. Remember, these two sentences express the same meaning and are labeled as duplicates in the Quora data.
distance = model.wmdistance(question1, question2)
print('distance = %.4f' % distance)
The result: distance = 1.8293
That value looks quite large; we need to normalize it.
Normalizing word2vec vectors
When using WMD, it helps to first normalize the word2vec vectors so that they all have the same (unit) length.
model.init_sims(replace=True)
distance = model.wmdistance(question1, question2)
print('normalized distance = %.4f' % distance)
normalized distance = 0.7589
After normalization, the distance is much smaller.
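Why does normalization shrink the number? init_sims(replace=True) rescales every word vector to unit L2 norm, so the ground distances that WMD sums over become smaller. A minimal sanity check (assuming 'students', taken from the example questions, is in the model's vocabulary):

import numpy as np
vec = model['students']
print(np.linalg.norm(vec))  # ~1.0 after init_sims(replace=True)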
Let's run the same experiment again, this time on a pair of questions that are not duplicates.
question3 = 'Why am I mentally very lonely? How can I solve it?'
question4 = 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?'

question3 = question3.lower().split()
question4 = question4.lower().split()
question3 = [w for w in question3 if w not in stop_words]
question4 = [w for w in question4 if w not in stop_words]
distance = model.wmdistance(question3, question4)
print('distance = %.4f' % distance)
distance = 1.2637
model.init_sims(replace=True)
distance = model.wmdistance(question3, question4)
print('normalized distance = %.4f' % distance)
normalized distance = 1.2637
This time the value does not change after normalization (the vectors were already normalized in the previous step, so calling init_sims again has no effect). WMD considers this pair far less similar than the first one, so the method seems to be working well.
FuzzyWuzzy
In an earlier post, we already looked at fuzzy string matching in Python: https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
Let's take a quick look at what the FuzzyWuzzy toolkit can do for our duplicate-question problem.
from fuzzywuzzy import fuzz
question1 = 'What would a Trump presidency mean for current international master’s students on an F1 visa?'
question2 = 'How will a Trump presidency affect the students presently in US or planning to study in US?'
fuzz.ratio(question1, question2)
53
fuzz.partial_token_set_ratio(question1, question2)
100
question3 = 'Why am I mentally very lonely? How can I
solve it?'
question4 = 'Find the remainder when [math]23^{24}
[/math] is divided by 24,23?'
fuzz.ratio(question3, question4)
28
fuzz.partial_token_set_ratio(question3, question4)
37
Judging from the results above, FuzzyWuzzy also does not consider the second pair to be similar. That is good, because the human labelers reached the same conclusion.
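The feature engineering step below also uses the other scorers in the fuzz module. A minimal sketch of the remaining calls on the duplicate pair:

print(fuzz.partial_ratio(question1, question2))             # ratio of the best matching substring
print(fuzz.token_sort_ratio(question1, question2))          # ratio after sorting the tokens
print(fuzz.token_set_ratio(question1, question2))           # ratio based on the shared token set
print(fuzz.partial_token_sort_ratio(question1, question2))  # partial variant of token_sort_ratio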
Feature engineering
First, let's implement a few functions: WMD, normalized WMD, and a word2vec sentence representation.
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def wmd(q1, q2):
    # Word mover's distance between two questions, stopwords removed
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return model.wmdistance(q1, q2)

def norm_wmd(q1, q2):
    # Same as wmd(), but computed with the normalized word2vec model
    q1 = str(q1).lower().split()
    q2 = str(q2).lower().split()
    q1 = [w for w in q1 if w not in stop_words]
    q2 = [w for w in q2 if w not in stop_words]
    return norm_model.wmdistance(q1, q2)

def sent2vec(s):
    # Sentence vector: sum of word2vec vectors of in-vocabulary tokens, L2-normalized
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except KeyError:
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
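As a quick usage check (a minimal sketch using one of the example questions), sent2vec returns a single 300-dimensional vector with unit L2 norm:

v = sent2vec('What would a Trump presidency mean for students on an F1 visa?')
print(v.shape)                  # (300,)
print(np.sqrt((v ** 2).sum()))  # ~1.0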
We now build a set of new features:
1. Word count of each question
2. Character count of each question
3. Number of words question1 and question2 have in common
4. Number of words that differ between question1 and question2
5. Cosine distance between the two question vectors
6. Cityblock (Manhattan) distance between the two question vectors
7. Jaccard distance
8. Canberra distance
9. Euclidean distance
10. Minkowski distance
11. Bray-Curtis distance
12. Skewness and kurtosis of each question vector
13. Word mover's distance
14. Normalized word mover's distance
All of the distance measures above can be found in scipy.spatial.distance.
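A minimal sketch of the corresponding imports (the distance functions live in scipy.spatial.distance; skewness and kurtosis come from scipy.stats):

from scipy.spatial.distance import cosine, cityblock, jaccard, canberra, euclidean, minkowski, braycurtis
from scipy.stats import skew, kurtosis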
df['len_q1'] = df.question1.apply(lambda x: len(str(x)))
df['len_q2'] = df.question2.apply(lambda x: len(str(x)))
df['diff_len'] = df.len_q1 - df.len_q2
df['len_char_q1'] = df.question1.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df['len_char_q2'] = df.question2.apply(lambda x: len(''.join(set(str(x).replace(' ', '')))))
df['len_word_q1'] = df.question1.apply(lambda x: len(str(x).split()))
df['len_word_q2'] = df.question2.apply(lambda x: len(str(x).split()))
df['common_words'] = df.apply(lambda x: len(set(str(x['question1']).lower().split()).intersection(set(str(x['question2']).lower().split()))), axis=1)
df['fuzz_ratio'] = df.apply(lambda x: fuzz.ratio(str(x['question1']), str(x['question2'])), axis=1)
df['fuzz_partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
df['fuzz_partial_token_set_ratio'] = df.apply(lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
df['fuzz_partial_token_sort_ratio'] = df.apply(lambda x: fuzz.partial_token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
df['fuzz_token_set_ratio'] = df.apply(lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
df['fuzz_token_sort_ratio'] = df.apply(lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
Word2vec model
As mentioned above, we use a word2vec model pre-trained on the Google News corpus. I downloaded it and saved it in the word2Vec_models folder, and we load it with gensim.
model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
df['wmd'] = df.apply(lambda x: wmd(x['question1'], x['question2']), axis=1)
Normalized word2vec model
norm_model = gensim.models.KeyedVectors.load_word2vec_format(
    './word2Vec_models/GoogleNews-vectors-negative300.bin.gz', binary=True)
norm_model.init_sims(replace=True)

df['norm_wmd'] = df.apply(lambda x: norm_wmd(x['question1'], x['question2']), axis=1)
At this point every building block we need is in place; let's compute the sentence vectors and the distances between them.
from tqdm import tqdm_notebook

question1_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question1.values)):
    question1_vectors[i, :] = sent2vec(q)

question2_vectors = np.zeros((df.shape[0], 300))
for i, q in enumerate(tqdm_notebook(df.question2.values)):
    question2_vectors[i, :] = sent2vec(q)

df['cosine_distance'] = [cosine(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]
df['cityblock_distance'] = [cityblock(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]
df['jaccard_distance'] = [jaccard(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]
df['canberra_distance'] = [canberra(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]
df['euclidean_distance'] = [euclidean(x, y) for (x, y) in zip(np.nan_to_num(question1_vectors), np.nan_to_num(question2_vectors))]
df['skew_q1vec'] = [skew(x) for x in np.nan_to_num(question1_vectors)]
df['skew_q2vec'] = [skew(x) for x in np.nan_to_num(question2_vectors)]
df['kur_q1vec'] = [kurtosis(x) for x in np.nan_to_num(question1_vectors)]
df['kur_q2vec'] = [kurtosis(x) for x in np.nan_to_num(question2_vectors)]
Training an XGBoost classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import xgboost as xgb

df.drop(['question1', 'question2'], axis=1, inplace=True)
df = df[pd.notnull(df['cosine_distance'])]
df = df[pd.notnull(df['jaccard_distance'])]

X = df.loc[:, df.columns != 'is_duplicate']
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', eta=0.3, silent=1,
                          subsample=0.8).fit(X_train, y_train.values.ravel())

prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
Using the features we built, the XGBoost classifier reaches an accuracy of 0.77; fusing it with the earlier XGBoost + TFIDF approach can push that to about 0.80.
At the same time, recall improves from 0.67 to 0.73.
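If you want to see which of the hand-built features the model leans on most, XGBoost exposes per-feature importances; a minimal sketch, not part of the original post:

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))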
The Jupyter notebook can be found on Github. Have a great weekend!
Reference:
https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/