MMD_2a_FindSimilarSets

2018-01-02 17:03:52 浏览数 (1)

概述

application

there are many data-mining problems which can be expressed as finding similar sets, such as:

  • pages with similar words, e.g., for classification by topic
  • recommendation systems, classification by people or movie
  • entity resolution, different informs all point to one person

essential parts

shingling

minhashing

jaccard similarity measure

matrix represent

minihash

总体步骤

例子

性质

签名的相似性

implementation

将对矩阵的转置操作用不同的hash函数代替。

LSH

LSH means locality-sensitive hashing.

概述

  • general idea: generate from the collection of all elements (signatures in our examples) a small list of candidates pairs : pairs of elements whose similarity must be evaluated.
  • for signature matrices: hush columns to many buckets, and make elements of the same bucket candidate pairs.

候选pairs的规则

LSH具体阐述

例子

概率分析

总结

得到更多的signature(但是会有更多的空间占用与计算),可以有更大的b和r,能够获得更step的函数。

LSH Application

entity resolution

fingerprint

similar news articles

0 人点赞