Building a Content-Based Recommender System for Hotels in Seattle

Author | Susan Li

Source | Towards Data Science

Editor | 代码医生团队

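The cold start problem is a well-known and well-researched problem for recommender systems: the system is unable to recommend items to users. It arises in three situations: new users, new products, and new websites.

Content-based filtering is one way to solve this problem. When creating recommendations, the system relies first on the metadata of new products, while visitor actions are secondary for a certain period of time. The system recommends products to users based on each product's category and description.

Content-based recommender systems can be used in a variety of domains, including recommending web pages, news articles, restaurants, television programs, and hotels. The advantage of content-based filtering is that it does not suffer from the cold start problem: a brand-new website can use it from day one, and any new product can be recommended right away.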

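Suppose we are starting a new online travel agency (OTA). Thousands of hotels have signed up to sell on our platform, and we are starting to see traffic from website users, but we do not yet have any user history. We will therefore build a content-based recommender system that analyzes hotel descriptions to identify hotels of particular interest to a user.

We would like to use cosine similarity to recommend hotels based on the hotels a user has already booked or viewed. We recommend the hotels most similar to the ones the user previously booked, viewed, or showed interest in. The recommender system depends heavily on defining an appropriate similarity measure. Finally, we select a subset of hotels to display to the user, or determine the order in which to display them.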
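To make the similarity measure concrete, here is a minimal sketch of cosine similarity on toy word-count vectors. The function name `cosine_similarity`, the three-word vocabulary, and the counts are illustrative assumptions, not part of the original code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the vocabulary ["downtown", "airport", "shuttle"]
hotel_a = [2, 0, 1]
hotel_b = [1, 0, 1]
hotel_c = [0, 3, 1]

print(cosine_similarity(hotel_a, hotel_b))  # close to 1: similar wording
print(cosine_similarity(hotel_a, hotel_c))  # close to 0: little overlap
```

Because the measure depends only on the angle between vectors, a long and a short description that use the same words in the same proportions still score as similar.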

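The Data

Publicly available hotel description data is hard to find, so I collected descriptions for more than 150 hotels in the Seattle area from each hotel's home page. They include downtown business hotels, boutique hotels and bed and breakfasts, airport business hotels, hotels near universities, motels in lesser-known locations, and so on. The data can be found here.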

https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Seattle_Hotels.csv


import re
import random

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot
import cufflinks
from IPython.core.interactiveshell import InteractiveShell

# Note: plotly.plotly moved to the separate chart_studio package in plotly 4.
# It is not used below, so that import is dropped here.

pd.options.display.max_columns = 30
InteractiveShell.ast_node_interactivity = 'all'
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='solar')

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
print('We have ', len(df), 'hotels in the data')

Let's look at a few hotel names and descriptions.


def print_description(index):
    # Print the description and the name of the hotel at the given row index
    example = df[df.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])


print_description(10)

Figure 1

print_description(100)

Figure 2

EDA

Token (vocabulary) frequency distribution before removing stop words


def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df1 = pd.DataFrame(common_words, columns=['desc', 'count'])
df1.groupby('desc').sum()['count'].sort_values().iplot(kind='barh', yTitle='Count', linecolor='black', title='Top 20 words in hotel description before removing stop words')


Figure 3
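For readers who want to see what the counting step amounts to, here is a minimal standard-library sketch of the same idea. The names `top_n_words` and `STOP_WORDS`, the tiny stop-word list, and the two sample descriptions are assumptions made for illustration; the tokenization is deliberately simpler than `CountVectorizer`'s.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real English lists are much longer)
STOP_WORDS = {"the", "in", "of", "and", "a", "to", "is"}

def top_n_words(corpus, n, stop_words=frozenset()):
    """Return the n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for doc in corpus:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

corpus = [
    "The hotel is in the heart of downtown Seattle",
    "A boutique hotel close to the Seattle waterfront",
]
print(top_n_words(corpus, 3))              # dominated by "the"
print(top_n_words(corpus, 3, STOP_WORDS))  # content words surface instead
```

Even on two sentences, the stop word "the" tops the raw ranking, which is why the distribution is worth comparing before and after stop-word removal.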

Token (vocabulary) frequency distribution after removing stop words


def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['desc'], 20)
df2 = pd.DataFrame(common_words, columns=['desc', 'count'])
df2.groupby('desc').sum()['count'].sort_values().iplot(kind='barh', yTitle='Count', linecolor='black', title='Top 20 words in hotel description after removing stop words')


Figure 4

Bigram frequency distribution before removing stop words


def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in hotel description before removing stop words')


Figure 5
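The `ngram_range` argument controls how many consecutive tokens are joined into one feature. As a rough illustration of that idea, the sketch below builds n-grams by hand with the standard library; `ngrams`, `top_ngrams`, and the sample sentences are hypothetical names and data, and whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(corpus, n, k):
    """Return the k most frequent n-grams across all documents."""
    counts = Counter()
    for doc in corpus:
        counts.update(ngrams(doc.lower().split(), n))
    return counts.most_common(k)

corpus = [
    "pike place market is minutes away",
    "walk to pike place market and the space needle",
]
print(ngrams("pike place market".split(), 2))
print(top_ngrams(corpus, 2, 2))  # bigrams
print(top_ngrams(corpus, 3, 1))  # trigrams
```

Setting `n` to 2 or 3 corresponds to `ngram_range=(2, 2)` and `ngram_range=(3, 3)` in the functions used in this section.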

Bigram frequency distribution after removing stop words


def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['desc'], 20)
df4 = pd.DataFrame(common_words, columns=['desc', 'count'])
df4.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in hotel description after removing stop words')


Figure 6

Trigram frequency distribution before removing stop words


def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['desc'], 20)
df5 = pd.DataFrame(common_words, columns=['desc', 'count'])
df5.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in hotel description before removing stop words')


Figure 7

Trigram frequency distribution after removing stop words


def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['desc'], 20)
df6 = pd.DataFrame(common_words, columns=['desc', 'count'])
df6.groupby('desc').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 trigrams in hotel description after removing stop words')


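Figure 8

Everyone knows Pike Place Market in Seattle, and it is more than a public farmers' market. It is a historic tourist attraction made up of hundreds of farmers, craftspeople, and small businesses. The hotel industry thrives on location, and tourists look for hotels that are as close as possible to downtown and/or the city's must-visit attractions.

Word count distribution of hotel descriptions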


df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
desc_lengths = list(df['word_count'])
print("Number of descriptions:", len(desc_lengths),
      "\nAverage word count", np.average(desc_lengths),
      "\nMinimum word count", min(desc_lengths),
      "\nMaximum word count", max(desc_lengths))

df['word_count'].iplot(kind='hist', bins=50, linecolor='black', xTitle='word count', yTitle='count', title='Word Count Distribution in Hotel Description')


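Figure 9

Many hotels make the most of their descriptions, using captivating copy that appeals to travelers' emotions in order to drive direct bookings. Their descriptions may be longer than those of other hotels.

Text Preprocessing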


# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once beforehand
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string

        return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace every symbol matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete every character matched by BAD_SYMBOLS_RE
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stop words
    return text

df['desc_clean'] = df['desc'].apply(clean_text)

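To see what each cleaning step does to a single description, here is a self-contained sketch that mirrors the preprocessing on one sample string. The small `STOPWORDS` set and the sample sentence are assumptions for illustration; the article itself uses NLTK's English stop-word list.

```python
import re

# Symbols replaced by a space, and characters removed outright
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z #_]")
# Small illustrative stop-word set (assumption; NLTK's list is much longer)
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "from"}

def clean_text(text):
    text = text.lower()                         # 1. lowercase
    text = REPLACE_BY_SPACE_RE.sub(" ", text)   # 2. separators become spaces
    text = BAD_SYMBOLS_RE.sub("", text)         # 3. remaining punctuation is dropped
    return " ".join(w for w in text.split() if w not in STOPWORDS)  # 4. stop words removed

sample = "Located in Downtown Seattle, the hotel (est. 1924) is 5 min. from Pike Place!"
print(clean_text(sample))
# located downtown seattle hotel est 1924 5 min pike place
```

Replacing separators with a space before deleting other symbols keeps words such as "Seattle,the" from being glued together.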

Modeling

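Create a TF-IDF matrix of unigrams, bigrams, and trigrams for each hotel.
Compute the similarity between all hotels using sklearn's linear_kernel. Because TfidfVectorizer L2-normalizes each row by default, this dot product equals the cosine similarity.
Define a function that takes a hotel name as input and returns the top 10 recommended hotels.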


df.set_index('name', inplace=True)
# min_df=1 (the default) keeps every term; the original min_df=0 has the same
# effect but may be rejected by newer scikit-learn versions
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['desc_clean'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

indices = pd.Series(df.index)

def recommendations(name, cosine_similarities=cosine_similarities):
    recommended_hotels = []

    # get the index of the hotel that matches the name
    idx = indices[indices == name].index[0]

    # create a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)

    # get the indexes of the 10 most similar hotels, skipping the first entry (the hotel itself)
    top_10_indexes = list(score_series.iloc[1:11].index)

    # populate the list with the names of the top 10 matching hotels
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])

    return recommended_hotels


Recommendations

recommendations('Hilton Seattle Airport & Conference Center')
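The whole pipeline (TF-IDF weighting, dot-product similarity, top-k lookup) can also be sketched on a toy corpus using only the standard library. Everything here is an assumption made for illustration: the hotel names and descriptions are invented, the functions `tfidf_vectors`, `dot`, and `recommend` are hypothetical, only unigrams are used, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` is modeled on scikit-learn's documented default rather than taken from the original article.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def dot(u, v):
    """Sparse dot product; equals cosine similarity for unit-length vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def recommend(name, names, docs, k=2):
    """Return the k hotels whose descriptions are most similar to `name`."""
    vectors = tfidf_vectors(docs)
    idx = names.index(name)
    scores = [(dot(vectors[idx], vec), other)
              for other, vec in zip(names, vectors) if other != name]
    return [other for _, other in sorted(scores, reverse=True)[:k]]

# Hypothetical hotels and descriptions, for illustration only
names = ["Airport Inn", "Runway Suites", "Harbor Boutique", "Campus Lodge"]
docs = [
    "free airport shuttle near airport terminal",
    "airport shuttle and parking near terminal",
    "boutique rooms with harbor view downtown",
    "budget rooms near university campus",
]
print(recommend("Airport Inn", names, docs))
# ['Runway Suites', 'Campus Lodge']
```

The sketch excludes the query hotel by name rather than by skipping the first sorted entry, which avoids relying on the hotel always ranking first against itself.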