Author | Susan Li
Source | Towards Data Science
Editor | 代码医生团队
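The cold start problem is a well-known and well-researched problem in recommender systems: the system is unable to recommend items to users. It arises in three different situations, namely new users, new products, and new websites.
Content-based filtering is one way to address this problem. When creating recommendations, the system first relies on the metadata of new products, while visitor actions are secondary for a period of time. The system recommends products to users based on the products' categories and descriptions.
Content-based recommender systems can be used in a variety of domains, including recommending web pages, news articles, restaurants, TV shows, and hotels. The advantage of content-based filtering is that it does not suffer from the cold start problem: a newly launched website can start recommending right away, and any newly added product can be recommended immediately.
Suppose we are starting a new online travel agency (OTA). We have signed up thousands of hotels that are willing to sell on the platform, and we are starting to see traffic from site users, but we do not yet have any user history. We will therefore build a content-based recommender system that analyzes hotel descriptions to identify hotels that are of particular interest to a user.
We want to recommend hotels based on the ones a user has already booked or viewed, using cosine similarity. We recommend the hotels that are most similar to those the user previously booked, viewed, or showed interest in. The recommender system depends heavily on defining an appropriate similarity measure. Finally, we select a subset of hotels to display to the user, or determine the order in which to display them.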
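To make the similarity measure concrete, here is a minimal sketch of cosine similarity between two descriptions represented as word-count vectors. It uses only the standard library, splits on whitespace, and runs on made-up descriptions; the function name `cosine_similarity` and the sample texts are assumptions for illustration, not part of the original notebook.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

# Made-up descriptions, for illustration only
airport_a = "free airport shuttle and conference rooms"
airport_b = "airport shuttle with large conference center"
bnb = "historic mansion with homemade breakfast"

print(cosine_similarity(airport_a, airport_b))  # shares three words
print(cosine_similarity(airport_a, bnb))        # shares no words
```

A score near 1 means the two descriptions use words in similar proportions, and a score of 0 means they share no words at all.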
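Data
Public hotel description data is hard to find, so the descriptions were collected from the homepage of each of more than 150 hotels in the Seattle area, including downtown business hotels, boutique hotels and bed and breakfasts, airport business hotels, hotels near the universities, little-known motels, and so on. The data can be found here.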
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Seattle_Hotels.csv
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot
import cufflinks
from IPython.core.interactiveshell import InteractiveShell

# Notebook display settings
pd.options.display.max_columns = 30
InteractiveShell.ast_node_interactivity = 'all'
# Render cufflinks/plotly charts offline, inside the notebook
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='solar')

# Load the hotel names and descriptions
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
print('We have ', len(df), 'hotels in the data')
Let's look at a few hotel names and descriptions.
def print_description(index):
    # Look up the row with this index and print its description and name
    example = df[df.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])

print_description(10)
Figure 1
print_description(100)
Figure 2
EDA
Token (vocabulary) frequency distribution before removing stop words
def get_top_n_words(corpus, n=None):
    # Learn the vocabulary and count every word across all descriptions
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    # Pair each word with its total count, then sort by count, highest first
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df1 = pd.DataFrame(common_words, columns=['desc', 'count'])
df1.groupby('desc').sum()['count'].sort_values().iplot(kind='barh', yTitle='Count', linecolor='black', title='Top 20 words in hotel description before removing stop words')
Figure 3
Token (vocabulary) frequency distribution after removing stop words
def get_top_n_words(corpus, n=None):
    # Same as before, but drop scikit-learn's built-in English stop words
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df2 = pd.DataFrame(common_words, columns=['desc', 'count'])
df2.groupby('desc').sum()['count'].sort_values().iplot(kind='barh', yTitle='Count', linecolor='black', title='Top 20 words in hotel description after removing stop words')
Figure 4
Bigram frequency distribution before removing stop words
def get_top_n_bigram(corpus, n=None):
    # ngram_range=(2, 2) counts pairs of adjacent words only
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in hotel description before removing stop words')
Figure 5
Bigram frequency distribution after removing stop words
def get_top_n_bigram(corpus, n=None):
    # Bigrams only, built after English stop words have been dropped
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['desc'], 20)
df4 = pd.DataFrame(common_words, columns=['desc', 'count'])
df4.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in hotel description after removing stop words')
Figure 6
Trigram frequency distribution before removing stop words
def get_top_n_trigram(corpus, n=None):
    # ngram_range=(3, 3) counts runs of three adjacent words only
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['desc'], 20)
df5 = pd.DataFrame(common_words, columns=['desc', 'count'])
df5.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in hotel description before removing stop words')
Figure 7
Trigram frequency distribution after removing stop words
def get_top_n_trigram(corpus, n=None):
    # Trigrams only, built after English stop words have been dropped
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['desc'], 20)
df6 = pd.DataFrame(common_words, columns=['desc', 'count'])
df6.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in hotel description after removing stop words')
Figure 8
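The six functions above differ only in the n-gram length and in whether stop words are dropped. As a rough standard-library sketch of what that counting does, the hypothetical helper `top_ngrams` below takes both as parameters. It mimics CountVectorizer's defaults (lowercasing, tokens of two or more word characters) and, like CountVectorizer, removes stop words before forming n-grams, so an n-gram can join words that were not adjacent in the original text. The toy corpus and the tiny stop-word set are made up for illustration.

```python
import re
from collections import Counter

# Same token rule as CountVectorizer's default: words of two or more characters
TOKEN_RE = re.compile(r"\b\w\w+\b")

def top_ngrams(corpus, n=None, ngram_size=1, stop_words=frozenset()):
    """Return the n most frequent n-grams of length ngram_size in the corpus."""
    counts = Counter()
    for doc in corpus:
        # Stop words are filtered out first, then n-grams are built
        tokens = [t for t in TOKEN_RE.findall(doc.lower()) if t not in stop_words]
        # Slide a window of ngram_size tokens over the filtered token list
        for i in range(len(tokens) - ngram_size + 1):
            counts[" ".join(tokens[i:i + ngram_size])] += 1
    return counts.most_common(n)

# Made-up descriptions and a tiny stop-word set, for illustration only
toy_corpus = [
    "Steps from the Pike Place Market",
    "Walk to the Pike Place Market and the waterfront",
]
toy_stop_words = {"the", "to", "and", "from"}

print(top_ngrams(toy_corpus, n=3, ngram_size=2))
print(top_ngrams(toy_corpus, n=3, ngram_size=2, stop_words=toy_stop_words))
print(top_ngrams(toy_corpus, n=1, ngram_size=3, stop_words=toy_stop_words))
```

With stop words removed, the first toy description yields the bigram "steps pike", which never appears verbatim in the text. This is worth keeping in mind when reading the "after removing stop words" charts.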
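Seattle's Pike Place Market is widely known, and it is more than just a public farmers' market. It is a historic tourist attraction made up of hundreds of farmers, craftspeople, and small businesses. The hotel industry thrives on location, and tourists look for hotels that are as close as possible to downtown and/or the city's must-see attractions.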
Word count distribution of hotel descriptions
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
desc_lengths = list(df['word_count'])
print("Number of descriptions:", len(desc_lengths),
      "\nAverage word count", np.average(desc_lengths),
      "\nMinimum word count", min(desc_lengths),
      "\nMaximum word count", max(desc_lengths))
df['word_count'].iplot(kind='hist', bins=50, linecolor='black', xTitle='word count', yTitle='count', title='Word Count Distribution in Hotel Description')
Figure 9
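Many hotels make the most of their descriptions: they know how to use captivating copy that appeals to travelers' emotions to drive direct bookings. Their descriptions may be longer than those of other hotels.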
Text preprocessing
# Raw string with escaped brackets so that '[', ']' and '|' are matched literally
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z # _]')
# Requires the NLTK stopwords corpus: nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace the symbols matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete the symbols matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text

df['desc_clean'] = df['desc'].apply(clean_text)
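To see what this cleaning step does to a single description, here is a minimal standalone sketch. It assumes the intended separator pattern escapes the square brackets and the pipe, and it swaps NLTK's stop-word list for a small hand-picked set (`TOY_STOPWORDS`, hypothetical) so that it runs without downloading any corpus. The sample sentence is made up.

```python
import re

# Raw strings with escaped brackets so '[', ']' and '|' are matched literally
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #_]")
# Hypothetical mini stop-word list standing in for NLTK's English list
TOY_STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "from"}

def clean_text_demo(text):
    """Lowercase, turn separators into spaces, drop other symbols and stop words."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # separators become spaces
    text = BAD_SYMBOLS_RE.sub("", text)        # drop remaining punctuation
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

# Made-up description, for illustration only
sample = "Located in the heart of Seattle (downtown), steps from Pike Place Market!"
print(clean_text_demo(sample))
```

Replacing separators with a space before deleting other symbols matters: a string such as "pool/spa" becomes two tokens instead of the single token "poolspa".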
Modeling
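- Build a TF-IDF matrix from the hotel descriptions, including unigrams, bigrams, and trigrams.
- Compute the similarity between all hotels using sklearn's linear_kernel.
- Define a function that takes a hotel name as input and returns the top 10 recommended hotels.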
df.set_index('name', inplace=True)
# Unigrams, bigrams and trigrams; min_df=1 keeps every term that occurs in at least one description
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['desc_clean'])
# TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df.index)

def recommendations(name, cosine_similarities=cosine_similarities):
    recommended_hotels = []
    # getting the index of the hotel that matches the name
    idx = indices[indices == name].index[0]
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    # getting the indexes of the 10 most similar hotels except itself
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the names of the top 10 matching hotels
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])
    return recommended_hotels
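The same pipeline can be sketched on a toy scale with only the standard library, to show why a plain dot product of TF-IDF vectors works as a cosine similarity. The sketch assumes whitespace tokenization and unigrams only, and uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which are TfidfVectorizer's defaults. The helper names (`tfidf_vectors`, `dot`, `recommend`) and the three hotels are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors using the smoothed idf ln((1+n)/(1+df)) + 1."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)

def recommend(name, names, vectors, top_k=10):
    """Names of the top_k most similar items, excluding the query itself."""
    query = names.index(name)
    scores = [(dot(vectors[query], vec), other)
              for other, vec in enumerate(vectors) if other != query]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [names[other] for _, other in scores[:top_k]]

# Made-up hotels, for illustration only
toy_names = ["Airport Inn", "Runway Suites", "Garden B&B"]
toy_descs = [
    "airport shuttle conference rooms",
    "airport shuttle business center",
    "garden breakfast historic mansion",
]
toy_vectors = tfidf_vectors(toy_descs)
print(recommend("Airport Inn", toy_names, toy_vectors, top_k=2))
```

One design choice differs from the function above: the sketch excludes the query hotel by its index rather than by skipping the first sorted entry, so the result stays correct even if another hotel has an identical description and ties for first place.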
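Recommendations
recommendations('Hilton Seattle Airport & Conference Center')
Here is what Google recommends for "Hilton Seattle Airport & Conference Center":
Figure 10
Three of the four hotels recommended by Google are also recommended by our system.
Here is what TripAdvisor recommends for "Hilton Seattle Airport & Conference Center":
Figure 11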
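Let's try a bed and breakfast.
recommendations("The Bacon Mansion Bed and Breakfast")
Here is what Google recommends for "The Bacon Mansion Bed and Breakfast":
Figure 12
Here is what TripAdvisor recommends for "The Bacon Mansion Bed and Breakfast", which was not very impressive.
Figure 13
The Jupyter notebook can be found on GitHub, and there is also an nbviewer version.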
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Seattle Hotels Recommender.ipynb
https://nbviewer.jupyter.org/github/susanli2016/Machine-Learning-with-Python/blob/master/Seattle Hotels Recommender.ipynb