sklearn: Text Topic Analysis with TruncatedSVD

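This post is a brief demo of text topic analysis with sklearn's TruncatedSVD. Topic analysis extracts the key topics of a corpus, that is, how important each term is within each topic and how strongly each document leans toward each topic. From these weights we can then derive the keywords and the most representative documents for each topic. An earlier data analysis post of mine, "一文看评论里的中超风云", used one variety of topic analysis.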

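The variety covered here is LSI (latent semantic indexing, also known as latent semantic analysis, LSA), one of the earlier and simpler topic models. sklearn implements it as TruncatedSVD, which is very convenient to use. On to the code: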

In [1]:

from sklearn.decomposition import TruncatedSVD  # a.k.a. LSA/LSI (latent semantic analysis)
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

Preprocess the text with TF-IDF, turning each document into a vector representation.

In [2]:

# ♪ Until the Day ♪ by JJ Lin 林俊杰
docs = ["In the middle of the night",
        "When our hopes and fears collide",
        "In the midst of all goodbyes",
        "Where all human beings lie",
        "Against another lie"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out().tolist()
print(terms)

['against', 'all', 'and', 'another', 'beings', 'collide', 'fears', 'goodbyes', 'hopes', 'human', 'in', 'lie', 'middle', 'midst', 'night', 'of', 'our', 'the', 'when', 'where']
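
To make the "vector representation" concrete, the following minimal sketch inspects the TF-IDF matrix built from the same five lines. It assumes the default TfidfVectorizer settings (including `norm="l2"`); the name `row_norms` is introduced here only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["In the middle of the night",
        "When our hopes and fears collide",
        "In the midst of all goodbyes",
        "Where all human beings lie",
        "Against another lie"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

# One column per vocabulary term, so the shape is (n_docs, n_terms).
print(X.shape)

# With the default norm="l2", every document vector has unit length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Each row is one document and each column one vocabulary term, which is the (n_docs, n_terms) layout that TruncatedSVD takes as input.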

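Use TruncatedSVD to turn the original feature matrix X of shape (n_docs, n_terms) into a new feature matrix X2 of shape (n_docs, n_topics):

(Since the number of topics is usually much smaller than the vocabulary size, this method can also be used for dimensionality reduction ahead of classification or clustering.)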

In [3]:

n_pick_topics = 3            # number of topics to extract
lsa = TruncatedSVD(n_components=n_pick_topics)
X2 = lsa.fit_transform(X)
X2

Out[3]:

array([[ 8.26629804e-01, -2.46905901e-01, -0.00000000e+00],
       [ 4.66516068e-16,  8.40497045e-16,  1.00000000e+00],
       [ 8.66682085e-01, -9.09029610e-02, -1.11022302e-16],
       [ 2.80099067e-01,  7.28669961e-01, -6.38342104e-16],
       [ 1.03123637e-01,  7.63975842e-01, -4.43944669e-16]])

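X2[i,t] is the weight of document i on topic t, so the higher this value, the more representative document i can be considered of topic t. We use it to pick out the documents that best represent each topic.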

In [4]:

n_pick_docs = 2
# indices of the n_pick_docs highest-scoring documents per topic, best first
topic_docs_id = [X2[:,t].argsort()[:-(n_pick_docs + 1):-1] for t in range(n_pick_topics)]
topic_docs_id

Out[4]:

[array([2, 0], dtype=int64),
 array([4, 3], dtype=int64),
 array([1, 0], dtype=int64)]
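
The slice used with argsort above is compact but easy to misread. As a small illustration, the sketch below applies it to a hand-typed score vector (the first column of X2, rounded to two decimals) and compares it with the more explicit "reverse, then take the first k" form. The names `scores`, `top_k` and `same` are made up for this example.

```python
import numpy as np

# Rounded scores of the five documents on topic 0 (first column of X2).
scores = np.array([0.83, 0.00, 0.87, 0.28, 0.10])
k = 2

# argsort() sorts ascending; the slice [:-(k + 1):-1] walks backwards
# from the last index and stops after k items, i.e. the k largest, best first.
top_k = scores.argsort()[:-(k + 1):-1]

# Equivalent and arguably more readable: reverse the order, then take k.
same = scores.argsort()[::-1][:k]

print(top_k)  # [2 0]
print(same)   # [2 0]
```

Both forms return documents 2 and 0, matching the first entry of Out[4].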

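lsa.components_ is a matrix of shape (n_topics, n_terms) whose (t, j) entry is the weight of term j in topic t. In the same way, we use it to obtain the keywords of each topic: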

In [5]:

n_pick_keywords = 4
# indices of the n_pick_keywords highest-weighted terms per topic, best first
topic_keywords_id = [lsa.components_[t].argsort()[:-(n_pick_keywords + 1):-1] for t in range(n_pick_topics)]
topic_keywords_id

Out[5]:

[array([17, 15, 10,  1], dtype=int64),
 array([11,  3,  0,  4], dtype=int64),
 array([16,  2, 18,  8], dtype=int64)]
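
The document-topic matrix and lsa.components_ are two views of the same decomposition: transforming documents amounts to projecting their TF-IDF vectors onto the topic directions stored in components_. The sketch below checks this on the same five lines. The fixed `random_state=0` and the names `doc_topic` and `projected` are assumptions added for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["In the middle of the night",
        "When our hopes and fears collide",
        "In the midst of all goodbyes",
        "Where all human beings lie",
        "Against another lie"]

X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=3, random_state=0).fit(X)

# Document-topic weights as returned by the fitted model.
doc_topic = lsa.transform(X)

# The same weights computed by hand: project X onto the topic directions.
projected = np.asarray(X @ lsa.components_.T)

print(lsa.components_.shape)              # (n_topics, n_terms)
print(np.allclose(doc_topic, projected))  # True
```

So a term with a large weight in row t of components_ pushes every document containing it toward topic t, which is why the same argsort trick works for both documents and keywords.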

In [6]:

for t in range(n_pick_topics):
    print("topic %d:" % t)
    print("    keywords: %s" % ", ".join(terms[topic_keywords_id[t][j]] for j in range(n_pick_keywords)))
    for i in range(n_pick_docs):
        print("    doc %d" % i)
        print("\t" + docs[topic_docs_id[t][i]])

topic 0:
    keywords: the, of, in, all
    doc 0
	In the midst of all goodbyes
    doc 1
	In the middle of the night
topic 1:
    keywords: lie, another, against, beings
    doc 0
	Against another lie
    doc 1
	Where all human beings lie
topic 2:
    keywords: our, and, when, hopes
    doc 0
	When our hopes and fears collide
    doc 1
	In the middle of the night
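
As noted earlier, the reduced matrix can also serve as input for clustering. The following is only a rough sketch of that idea, not part of the original demo: it assumes KMeans with two clusters, a Normalizer step after the SVD, and fixed random seeds, all chosen purely for illustration. With only five short lines the resulting clusters should not be over-interpreted.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = ["In the middle of the night",
        "When our hopes and fears collide",
        "In the midst of all goodbyes",
        "Where all human beings lie",
        "Against another lie"]

X = TfidfVectorizer().fit_transform(docs)

# Reduce to 3 latent dimensions, then rescale each document to unit length
# so that Euclidean k-means behaves more like cosine-based clustering.
reducer = make_pipeline(TruncatedSVD(n_components=3, random_state=0),
                        Normalizer(copy=False))
X_reduced = reducer.fit_transform(X)

# Cluster the documents in the reduced space (2 clusters is an arbitrary choice).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

for label, doc in zip(labels, docs):
    print(label, doc)
```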
