NLP: Word Embedding (Part 2: fastText)

1. word2vec vs. fastText

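Word-embedding methods such as word2vec (and word-level n-gram models) represent each word with a single vector and ignore the relationships between roots and affixes inside the word.

fastText does take those root and affix relationships into account, so it represents each word with character-level n-grams.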

For example, with n = 3 the word book is represented as: ["<bo", "boo", "ook", "ok>"], where < and > mark the start and end of the word.
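The n-gram extraction can be sketched in a few lines of Python. This is a minimal illustration, not fastText's actual implementation: the function name `char_ngrams` and its parameters `n_min` / `n_max` are made up for this example. It assumes the word is wrapped in the boundary markers `<` and `>` before the n-grams are cut out.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the character n-grams of `word` (hypothetical helper).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams that differ from the same letters inside a word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("book"))  # ['<bo', 'boo', 'ook', 'ok>']
```

Widening the range, for example `char_ngrams("book", 3, 4)`, also yields the 4-grams "<boo", "book" and "ook>".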

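Now take two texts, "肚子饿了我要吃饭" ("I'm hungry, I want a meal") and "肚子饿了我要吃东西" ("I'm hungry, I want something to eat"). With word2vec, the resulting vectors can differ a lot;

with fastText the difference is much smaller, because the two texts share most of their character n-grams, which lets fastText capture how similar the words are.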
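To get a rough feel for why the two texts end up close, one can count how many character n-grams they share. The sketch below is only an illustration under simplifying assumptions: it treats each sentence as one string, uses bigrams (n = 2), and measures overlap with the Jaccard index. fastText itself does not compute a Jaccard score; the point is that shared n-grams contribute the same vectors to both averages.

```python
def char_ngram_set(text, n=2):
    """Set of character n-grams of `text`, with boundary markers (hypothetical helper)."""
    wrapped = f"<{text}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


s1 = "肚子饿了我要吃饭"
s2 = "肚子饿了我要吃东西"

g1, g2 = char_ngram_set(s1), char_ngram_set(s2)
shared = g1 & g2
overlap = jaccard(g1, g2)

print(len(shared), round(overlap, 3))  # 7 0.583
```

Seven of the twelve distinct bigrams are common to both sentences, so most of the vectors being averaged are identical.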

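The network architectures are similar: both have three layers and both work with embedding vectors; in both, the word vectors are the latent vectors learned in the hidden layer.

The optimization methods are similar too: both use softmax, among other techniques.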

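word2vec	fastText
Input: one-hot vector of a word	Input: embedded word vectors and n-gram vectors
Output: one entry per vocabulary term; the term with the highest probability is predicted	Output: one entry per classification label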

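word2vec and fastText also differ in how they use softmax.

In word2vec, the outputs produced by h-softmax (hierarchical softmax) are not used after training, and only the learned word vectors are kept; fastText uses h-softmax at prediction time, walking the nodes of the label tree to find the label with the highest probability.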
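A toy version of hierarchical softmax may make the "label tree" idea more concrete. In the sketch below, the tree, the label names and the random vectors are all invented for illustration; fastText builds a Huffman tree from label frequencies, whereas this example hard-codes a balanced tree over four labels. Each label's probability is the product of sigmoid decisions along its path from the root.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical balanced binary tree with internal nodes 0 (root), 1 and 2.
# Each label maps to its path: (internal node, direction),
# where +1 means "go left" and -1 means "go right".
PATHS = {
    "sports":   [(0, +1), (1, +1)],
    "politics": [(0, +1), (1, -1)],
    "tech":     [(0, -1), (2, +1)],
    "food":     [(0, -1), (2, -1)],
}


def label_prob(h, node_vecs, label):
    """Probability of `label`: product of the branch probabilities on its path."""
    p = 1.0
    for node, direction in PATHS[label]:
        p *= sigmoid(direction * dot(node_vecs[node], h))
    return p


random.seed(0)
dim = 8
h = [random.gauss(0, 1) for _ in range(dim)]  # hidden (document) vector
node_vecs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

probs = {label: label_prob(h, node_vecs, label) for label in PATHS}
best = max(probs, key=probs.get)
print(best, round(sum(probs.values()), 6))
```

Because sigmoid(x) + sigmoid(-x) = 1 at every internal node, the label probabilities sum to 1 without ever computing a full softmax over all labels. This sketch scores every label for clarity; a real implementation can prune branches whose probability is already too low.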

fastText, like CBOW, is a simple neural network: Input Layer, Hidden Layer, Output Layer.

CBOW's input is the context of a single target word; fastText's input is multiple words together with their n-gram features.

CBOW's input is one-hot encoded; fastText's input has already been embedded.

CBOW's output is a single word; fastText's output is the class label of the document.
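The difference between a one-hot input and an embedded input is smaller than it may sound. Multiplying a one-hot vector by the input weight matrix simply selects one row of that matrix, which is the same as looking the embedding up directly. The snippet below checks this with a tiny made-up vocabulary and random weights (all names and sizes are assumptions for illustration).

```python
import math
import random

vocab = ["i", "like", "reading", "books"]  # toy vocabulary
dim = 3

random.seed(0)
# Input weight matrix: one row (embedding) per vocabulary word.
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

idx = vocab.index("books")
one_hot = [1.0 if i == idx else 0.0 for i in range(len(vocab))]

# One-hot row vector times W: every row except `idx` is multiplied by zero.
via_one_hot = [
    sum(one_hot[i] * W[i][j] for i in range(len(vocab)))
    for j in range(dim)
]

# Direct embedding lookup.
via_lookup = W[idx]

same = all(math.isclose(a, b) for a, b in zip(via_one_hot, via_lookup))
print(same)  # True
```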

fastText is in essence a linear multi-class softmax classifier.

From the Input Layer to the Hidden Layer, the features are turned into vectors: the vectors of all words and n-grams are summed and then averaged.

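So the core idea of fastText is: sum and average the vectors of all words and n-grams to obtain a document vector, then feed that document vector into a softmax multi-class classifier. The two main ingredients are character-level n-gram features and hierarchical softmax classification.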
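The forward pass described above (average the feature vectors, then apply a linear layer and softmax) can be sketched as follows. This is a minimal illustration with random weights, not fastText's code: the function names, the sizes, and the feature ids are all assumptions, and a plain softmax is used instead of the hierarchical one to keep the example short.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def predict_proba(feature_ids, E, W):
    # Input -> hidden: look up and average the word / n-gram vectors.
    doc_vec = average([E[i] for i in feature_ids])
    # Hidden -> output: one linear score per label, then softmax.
    scores = [sum(w * x for w, x in zip(row, doc_vec)) for row in W]
    return softmax(scores)


random.seed(0)
n_features, dim, n_labels = 10, 4, 3
E = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_features)]  # embeddings
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_labels)]    # output weights

feature_ids = [1, 4, 7, 9]  # hypothetical ids of the words and n-grams in one document
probs = predict_proba(feature_ids, E, W)
label = probs.index(max(probs))
print(label, [round(p, 3) for p in probs])
```

There is no non-linearity between the averaging step and the output layer, which is why the model counts as a linear classifier.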
