Similarity Queries
==================
This section shows how to query a corpus for documents similar to a given document.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Creating the Corpus
First, we need a corpus to work with. This step is the same as in the previous tutorial; if you have already completed it, feel free to skip to the next section.
from collections import defaultdict
from gensim import corpora
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# Remove stop words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]
# Remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
Output
2021-01-28 10:06:04,335 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-28 10:06:04,336 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
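To make the bag-of-words representation concrete, here is a minimal pure-Python sketch of the idea behind a dictionary and `doc2bow`. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy texts are hypothetical, and ids are assigned in sorted order only so the result is deterministic (gensim's own id assignment may differ).

```python
from collections import Counter

def build_vocab(texts):
    """Map every distinct token to an integer id (sorted order, for determinism)."""
    tokens = sorted({token for text in texts for token in text})
    return {token: idx for idx, token in enumerate(tokens)}

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are skipped."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

# Hypothetical toy texts, already tokenized and filtered.
toy_texts = [
    ["human", "interface", "computer"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
]
vocab = build_vocab(toy_texts)

# "unseen" is not in the vocabulary, so it does not appear in the result.
bow = to_bow(["graph", "graph", "human", "unseen"], vocab)
print(vocab)
print(bow)  # [(1, 2), (2, 1)]
```

The point to take away is that a token outside the vocabulary contributes nothing to the vector, which matters when a query contains words the corpus never saw.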
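Similarity Interface
The previous tutorials covered what it means to create a corpus in the vector space model and how to transform it between different vector spaces. A common reason for doing so is to determine the similarity between pairs of documents, or between a specific document and a set of other documents (for example, a user query versus indexed documents). Here we use the corpus to define a two-dimensional LSI space:
from gensim import models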
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
Output
2021-01-28 10:20:02,307 : INFO : using serial LSI version on this node
2021-01-28 10:20:02,308 : INFO : updating model with new documents
2021-01-28 10:20:02,309 : INFO : preparing a new chunk of documents
2021-01-28 10:20:02,309 : INFO : using 100 extra samples and 2 power iterations
2021-01-28 10:20:02,310 : INFO : 1st phase: constructing (12, 102) action matrix
2021-01-28 10:20:02,311 : INFO : orthonormalizing (12, 102) action matrix
2021-01-28 10:20:02,312 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2021-01-28 10:20:02,313 : INFO : computing the final decomposition
2021-01-28 10:20:02,313 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2021-01-28 10:20:02,314 : INFO : processed documents up to #9
2021-01-28 10:20:02,315 : INFO : topic #0(3.341): 0.644*"system" 0.404*"user" 0.301*"eps" 0.265*"time" 0.265*"response" 0.240*"computer" 0.221*"human" 0.206*"survey" 0.198*"interface" 0.036*"graph"
2021-01-28 10:20:02,315 : INFO : topic #1(2.542): 0.623*"graph" 0.490*"trees" 0.451*"minors" 0.274*"survey" -0.167*"system" -0.141*"eps" -0.113*"human" 0.107*"response" 0.107*"time" -0.072*"interface"
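For the purposes of this tutorial, there are only two things you need to know about LSI:
- First, it is just another transformation: it maps vectors from one space to another.
- Second, the benefit of LSI is that it can identify patterns and relationships between terms (in our case, the words in a document) and topics.
Our LSI space is two-dimensional (num_topics = 2), so there are two topics, but this choice is arbitrary. If you are interested, you can read more about LSI here: Latent Semantic Indexing <https://en.wikipedia.org/wiki/Latent_semantic_indexing>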
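Now suppose a user types in the query "Human computer interaction". We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate on only one aspect of possible similarity: the apparent semantic relatedness of the texts (their words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space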
print(vec_lsi)
Output
[(0, 0.46182100453271596), (1, -0.07002766527899937)]
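The similarity scores computed in the rest of this section are cosine similarities. Where the vectors represent probability distributions, different similarity measures <http://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Symmetrised_divergence> may be more appropriate.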
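As a rough illustration of the measure itself, the following pure-Python sketch computes the cosine of the angle between two sparse vectors given as (id, weight) pairs, the same shape as the LSI vector printed earlier. The function name `cosine_similarity` and the comparison vectors are assumptions made for this example, and returning 0.0 for an empty vector is a convention chosen here, not something the source specifies.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors of (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_a * norm_b)

# The query vector printed earlier in this section.
query = [(0, 0.46182100453271596), (1, -0.07002766527899937)]

# Hypothetical comparison vector: the query scaled by 2 (same direction).
doubled = [(idx, 2 * weight) for idx, weight in query]

print(cosine_similarity(query, doubled))            # very close to 1.0
print(cosine_similarity([(0, 1.0)], [(1, 1.0)]))    # orthogonal vectors
print(cosine_similarity([(0, 1.0)], [(0, -3.0)]))   # opposite directions
```

Because only the direction of the vectors matters, scaling a vector does not change its score, and the result always falls between -1 and 1.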
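To prepare for similarity queries, we need to index all the documents that we want to compare against subsequent queries. In our case, they are the same nine documents used for training LSI, converted to the 2-D LSI space. But that is only incidental; we could just as well index a different corpus altogether.
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it
index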
Output:
2021-01-28 10:37:02,431 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2021-01-28 10:37:02,433 : INFO : creating matrix with 9 documents and 2 features
<gensim.similarities.docsim.MatrixSimilarity at 0x7f5e58cc4070>
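Conceptually, an in-memory similarity index of this kind can be thought of as a matrix of unit-length document vectors, so that a query reduces to one dot product per row. The sketch below illustrates that idea in pure Python; the class `BruteForceIndex` and its toy vectors are hypothetical and are not gensim's implementation.

```python
import math

class BruteForceIndex:
    """Hypothetical stand-in for an in-memory index: unit-length rows, dot-product queries."""

    def __init__(self, vectors):
        # Normalize once at build time so each query only needs dot products.
        self.rows = [self._normalize(vector) for vector in vectors]

    @staticmethod
    def _normalize(vector):
        length = math.sqrt(sum(x * x for x in vector))
        return [x / length for x in vector] if length else list(vector)

    def query(self, vector):
        """Return the cosine similarity of `vector` to every indexed row."""
        q = self._normalize(vector)
        return [sum(a * b for a, b in zip(row, q)) for row in self.rows]

# Hypothetical 2-D document vectors (two features, like the 2-D LSI space).
toy_index = BruteForceIndex([[3.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
scores = toy_index.query([5.0, 0.0])
print(scores)
```

Storing every vector densely keeps queries simple and fast, at the cost of memory proportional to the number of documents times the number of features.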
Saving and loading the index:
# Save the index to disk, then load it back
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
Performing Queries
To obtain the similarity of our query document to each of the nine indexed documents:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples
This gives the similarity of the query document to every document in the corpus:
[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.98658866), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.09879464), (8, 0.05004177)]
Cosine similarity lies in the range [-1, 1], and a greater value means a more similar document, so sorting the scores in descending order ranks the documents by relevance to the query:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])
Output
0.9984453 The EPS user interface management system
0.998093 Human machine interface for lab abc computer applications
0.98658866 System and human system engineering testing of EPS
0.93748635 A survey of user opinion of computer system response time
0.90755945 Relation of user perceived response time to error measurement
0.05004177 Graph minors A survey
-0.09879464 Graph minors IV Widths of trees and well quasi ordering
-0.1063926 The intersection graph of paths in trees
-0.12416792 The generation of random binary unordered trees
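When only the best few matches are needed, a full sort is not required. The sketch below uses the standard library's `heapq.nlargest` on the scores printed above; the helper name `top_k` is an assumption for this example, and the scores are copied by hand from the query output rather than recomputed.

```python
import heapq

# Scores copied from the query output above, indexed by document number.
sims = [0.998093, 0.93748635, 0.9984453, 0.98658866, 0.90755945,
        -0.12416792, -0.1063926, -0.09879464, 0.05004177]

def top_k(scores, k):
    """Return the k best (document_number, score) pairs, highest score first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda item: item[1])

best = top_k(sims, 3)
print(best)  # [(2, 0.9984453), (0, 0.998093), (3, 0.98658866)]
```

The three documents returned match the top three lines of the sorted listing above.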