Author | Rohit Agrawal
Source | Medium
Editor | 代码医生团队
Text classification is a classic problem that natural language processing (NLP) aims to solve. It involves analyzing the content of raw text and deciding which category it belongs to. It has a wide range of applications, such as sentiment analysis, topic labeling, spam detection, and intent detection.
Today we take on a fairly simple task: given a video's title and description, classify it into one of several classes using different techniques (Naive Bayes, Support Vector Machines, AdaBoost, and LSTM) and analyze their performance. The classes were chosen to be (but are not limited to):
- Travel blogs
- Science and technology
- Food
- Manufacturing
- History
- Art and music
Collecting Data
When working on a custom machine learning problem like this one, collecting the data yourself is very useful, if not simply satisfying. For this problem, we need metadata about videos belonging to the different categories. You are welcome to collect the data manually and build your own dataset; here the YouTube Data API v3 is used. It is built by Google itself and lets a purpose-written piece of code interact with YouTube. Go to the Google Developer Console, create a sample project, and get started. This route was chosen because thousands of samples are needed, which would be hard to gather with other techniques.
Note: the YouTube API, like any other API offered by Google, works on a quota system. Depending on your plan, each email account is given a set quota per day/month. On the free plan, only about 2,000 requests could be made to YouTube, which posed some problems, but this was overcome by using multiple email accounts; a minimal sketch of rotating across several API keys is shown below.
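As a rough illustration only, the sketch below shows one way such key rotation could be wired up. The key list and the helper function are hypothetical and not part of the original project; only the googleapiclient calls mirror the collection code that follows.

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Hypothetical list of API keys, one per Google account (illustrative only)
api_keys = ["KEY_FROM_ACCOUNT_1", "KEY_FROM_ACCOUNT_2", "KEY_FROM_ACCOUNT_3"]

def search_with_fallback(query, page_token=None):
    # Try each key in turn; skip to the next one when its quota is exhausted (HTTP 403)
    for key in api_keys:
        youtube_api = build('youtube', 'v3', developerKey=key)
        try:
            req = youtube_api.search().list(q=query, part='snippet', type='video',
                                            maxResults=50, pageToken=page_token)
            return req.execute()
        except HttpError as e:
            if e.resp.status == 403:
                continue
            raise
    raise RuntimeError("All API keys have exhausted their quota")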
The API documentation is quite straightforward, and after using more than 8 email accounts to make up for the required quota, the following data was collected and stored in a .csv file. If you wish to use this dataset for your own project, you can download it here:
https://github.com/agrawal-rohit/Text-Classification-Analysis/blob/master/Collected_data_raw.csv
Raw collected data
from apiclient.discovery import build
import pandas as pd
# Data to be stored
category = []
no_of_samples = 1700
# Gathering Data using the Youtube API
api_key = "AIzaSyAS9eTgOEnOJ2GlJbbqm_0bR1onuRQjTHE"
youtube_api = build('youtube','v3', developerKey = api_key)
# Travel Data
tvl_titles = []
tvl_descriptions = []
tvl_ids = []
req = youtube_api.search().list(q='travel vlogs', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(tvl_titles)<no_of_samples):
    for i in range(len(res['items'])):
        tvl_titles.append(res['items'][i]['snippet']['title'])
        tvl_descriptions.append(res['items'][i]['snippet']['description'])
        tvl_ids.append(res['items'][i]['id']['videoId'])
        category.append('travel')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
        req = youtube_api.search().list(q='travelling', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    else:
        break
# Science Data
science_titles = []
science_descriptions = []
science_ids = []
next_page_token = None
req = youtube_api.search().list(q='robotics', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(science_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='robotics', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        science_titles.append(res['items'][i]['snippet']['title'])
        science_descriptions.append(res['items'][i]['snippet']['description'])
        science_ids.append(res['items'][i]['id']['videoId'])
        category.append('science and technology')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
# Food Data
food_titles = []
food_descriptions = []
food_ids = []
next_page_token = None
req = youtube_api.search().list(q='delicious food', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(food_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='delicious food', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        food_titles.append(res['items'][i]['snippet']['title'])
        food_descriptions.append(res['items'][i]['snippet']['description'])
        food_ids.append(res['items'][i]['id']['videoId'])
        category.append('food')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
# Manufacturing Data
manufacturing_titles = []
manufacturing_descriptions = []
manufacturing_ids = []
next_page_token = None
req = youtube_api.search().list(q='3d printing', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(manufacturing_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='3d printing', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        manufacturing_titles.append(res['items'][i]['snippet']['title'])
        manufacturing_descriptions.append(res['items'][i]['snippet']['description'])
        manufacturing_ids.append(res['items'][i]['id']['videoId'])
        category.append('manufacturing')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
# History Data
history_titles = []
history_descriptions = []
history_ids = []
next_page_token = None
req = youtube_api.search().list(q='archaeology', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(history_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='archaeology', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        history_titles.append(res['items'][i]['snippet']['title'])
        history_descriptions.append(res['items'][i]['snippet']['description'])
        history_ids.append(res['items'][i]['id']['videoId'])
        category.append('history')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
# Art and Music Data
art_titles = []
art_descriptions = []
art_ids = []
next_page_token = None
req = youtube_api.search().list(q='painting', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(art_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='painting', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        art_titles.append(res['items'][i]['snippet']['title'])
        art_descriptions.append(res['items'][i]['snippet']['description'])
        art_ids.append(res['items'][i]['id']['videoId'])
        category.append('art and music')
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
# Construct Dataset
final_titles = tvl_titles + science_titles + food_titles + manufacturing_titles + history_titles + art_titles
final_descriptions = tvl_descriptions + science_descriptions + food_descriptions + manufacturing_descriptions + history_descriptions + art_descriptions
final_ids = tvl_ids + science_ids + food_ids + manufacturing_ids + history_ids + art_ids
data = pd.DataFrame({'Video Id': final_ids, 'Title': final_titles, 'Description': final_descriptions, 'Category': category})
data.to_csv('Collected_data_raw.csv')
Note: feel free to explore a technique called web scraping, which is used to extract data directly from websites. Python has a lovely library called BeautifulSoup for exactly this purpose. However, when scraping data from YouTube search results, it only returned 25 results for a single search query; a minimal, purely illustrative sketch of the approach follows.
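For reference, a typical BeautifulSoup snippet looks like the sketch below. The URL and the tags being extracted are placeholders and not part of the original project.

import requests
from bs4 import BeautifulSoup

# Illustrative only: download a page and collect the text of its links
url = "https://example.com/some-page"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
link_texts = [a.get_text(strip=True) for a in soup.find_all("a")]
print(link_texts[:25])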
Data Cleaning and Preprocessing
The first step of the preprocessing process is to handle missing data. Since the missing values are supposed to be text, there is no meaningful way to impute them, so the only option is to remove those samples. Fortunately, only 334 of the 9,999 samples have missing values, so this should not affect model performance during training.
# Missing Values
num_missing_desc = data.isnull().sum()[2]    # Number of rows with missing descriptions
print('Number of missing values: ' + str(num_missing_desc))
data = data.dropna()
The "Video Id" column is not really useful for predictive analysis, so it will not be part of the final training set and needs no preprocessing.
The two important columns here are Title and Description, but they are still unprocessed raw text. To remove the noise, a very common text-cleaning pipeline is applied to both columns. It consists of the following steps:
- Convert to lowercase: this step is performed because capitalization does not change the semantic importance of a word. For example, "Travel" and "travel" should be treated as the same word.
- Remove numeric values and punctuation: numbers and the special characters used in punctuation ($, !, etc.) do not help in determining the correct class.
- Remove extra white space: so that each word is separated by a single space; otherwise problems may arise during tokenization.
- Tokenize into words: this refers to splitting a text string into a list of "tokens", where each token is a word. For example, the sentence "I have huge biceps" becomes ['I', 'have', 'huge', 'biceps'].
- Remove non-alphabetic tokens and stop words: "stop words" are words such as and, the, a, and so on, which matter when learning how to construct sentences but are useless for predictive analytics.
- Lemmatization: lemmatization is a very elegant technique that reduces similar words to their base meaning. For example, the words "flying" and "flew" are both reduced to their simplest form, "fly".
The dataset after text cleaning
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Change to lowercase
data['Title'] = data['Title'].map(lambda x: x.lower())
data['Description'] = data['Description'].map(lambda x: x.lower())
# Remove numbers
data['Title'] = data['Title'].map(lambda x: re.sub(r'\d+', '', x))
data['Description'] = data['Description'].map(lambda x: re.sub(r'\d+', '', x))
# Remove punctuation
data['Title'] = data['Title'].map(lambda x: x.translate(x.maketrans('', '', string.punctuation)))
data['Description'] = data['Description'].map(lambda x: x.translate(x.maketrans('', '', string.punctuation)))
# Remove white spaces
data['Title'] = data['Title'].map(lambda x: x.strip())
data['Description'] = data['Description'].map(lambda x: x.strip())
# Tokenize into words
data['Title'] = data['Title'].map(lambda x: word_tokenize(x))
data['Description'] = data['Description'].map(lambda x: word_tokenize(x))
# Remove non-alphabetic tokens
data['Title'] = data['Title'].map(lambda x: [word for word in x if word.isalpha()])
data['Description'] = data['Description'].map(lambda x: [word for word in x if word.isalpha()])
# Filter out stop words
stop_words = set(stopwords.words('english'))
data['Title'] = data['Title'].map(lambda x: [w for w in x if not w in stop_words])
data['Description'] = data['Description'].map(lambda x: [w for w in x if not w in stop_words])
# Word lemmatization
lem = WordNetLemmatizer()
data['Title'] = data['Title'].map(lambda x: [lem.lemmatize(word, "v") for word in x])
data['Description'] = data['Description'].map(lambda x: [lem.lemmatize(word, "v") for word in x])
# Turn lists back to strings
data['Title'] = data['Title'].map(lambda x: ' '.join(x))
data['Description'] = data['Description'].map(lambda x: ' '.join(x))
"Now that the text is clean, pop open a bottle of champagne and celebrate!"
Even though today's computers can solve world problems and play hyper-realistic video games, they are still machines that do not understand our language. So the text data cannot be fed to a machine learning model, no matter how clean it is. It needs to be converted into numeric features so that the computer can build a mathematical model as the solution. This makes up the remaining data preprocessing steps.
The Category column after label encoding
Since the output variable ('Category') is also categorical in nature, each class needs to be encoded as a number. This is called label encoding.
Finally, let us focus on the main piece of information for each sample: the raw text data. To extract features from the text and represent them in a numeric format, a very common approach is to vectorize it. The Scikit-learn library provides the 'TfidfVectorizer' for this purpose. TF-IDF (Term Frequency - Inverse Document Frequency) computes the frequency of each word within and across documents in order to identify how important each word is.
# Encode classes
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data.Category)
data.Category = le.transform(data.Category)
data.head(5)
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_title = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
tfidf_desc = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
labels = data.Category
features_title = tfidf_title.fit_transform(data.Title).toarray()
features_description = tfidf_desc.fit_transform(data.Description).toarray()
print('Title Features Shape: ' + str(features_title.shape))
print('Description Features Shape: ' + str(features_description.shape))
Data Analysis and Feature Exploration
As an additional step, the distribution of classes was plotted to check for an imbalanced number of samples; a minimal sketch of how to do this is shown below.
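The distribution plot itself is not reproduced here; the sketch below is one way it could be generated, assuming the label-encoded DataFrame and the LabelEncoder from the previous step.

import matplotlib.pyplot as plt

# Count samples per class and map the encoded ids back to readable class names
class_counts = data['Category'].value_counts()
class_counts.index = le.inverse_transform(class_counts.index)
class_counts.plot(kind='bar', title='Class distribution')
plt.show()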
In addition, to check whether the features extracted with TF-IDF vectorization are meaningful, the most correlated unigrams and bigrams for each class were found using both the title and the description features.
# Best 5 keywords for each class using Title Features
from sklearn.feature_selection import chi2
import numpy as np
N = 5
for current_class in list(le.classes_):
    current_class_id = le.transform([current_class])[0]
    features_chi2 = chi2(features_title, labels == current_class_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf_title.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(current_class))
    print("Most correlated unigrams:")
    print('-' * 30)
    print('. {}'.format('\n. '.join(unigrams[-N:])))
    print("Most correlated bigrams:")
    print('-' * 30)
    print('. {}'.format('\n. '.join(bigrams[-N:])))
    print("\n")

# Best 5 keywords for each class using Description Features
N = 5
for current_class in list(le.classes_):
    current_class_id = le.transform([current_class])[0]
    features_chi2 = chi2(features_description, labels == current_class_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf_desc.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(current_class))
    print("Most correlated unigrams:")
    print('-' * 30)
    print('. {}'.format('\n. '.join(unigrams[-N:])))
    print("Most correlated bigrams:")
    print('-' * 30)
    print('. {}'.format('\n. '.join(bigrams[-N:])))
    print("\n")
# USING TITLE FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. paint
. official
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. musical theatre
. work theatre
. official music
. music video
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. healthy snack
. snack amp
. taste test
. kid try
. street food
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. archaeology
. history
. anthropology
Most correlated bigrams:
------------------------------
. history channel
. rap battle
. epic rap
. battle history
. archaeological discoveries
# 'manufacturing':
Most correlated unigrams:
------------------------------
. business
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. manufacture plant
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
# 'science and technology':
Most correlated unigrams:
------------------------------
. compute
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. science amp
. amp technology
. primitive technology
. computer science
. science technology
# 'travel':
Most correlated unigrams:
------------------------------
. blogger
. vlog
. travellers
. blog
. travel
Most correlated bigrams:
------------------------------
. viewfinder travel
. travel blogger
. tip travel
. travel vlog
. travel blog
# USING DESCRIPTION FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. official
. paint
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. click listen
. production connexion
. official music
. music video
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. special offer
. hiho special
. come play
. sponsor series
. street food
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. history
. archaeology
. anthropology
Most correlated bigrams:
------------------------------
. episode epic
. epic rap
. battle history
. rap battle
. archaeological discoveries
# 'manufacturing':
Most correlated unigrams:
------------------------------
. factory
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. process make
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
# 'science and technology':
Most correlated unigrams:
------------------------------
. quantum
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. quantum computers
. primitive technology
. quantum compute
. computer science
. science technology
# 'travel':
Most correlated unigrams:
------------------------------
. vlog
. travellers
. trip
. blog
. travel
Most correlated bigrams:
------------------------------
. tip travel
. start travel
. expedia viewfinder
. travel blogger
. travel blog
Modeling and Training
The four models to be analyzed are:
- Naive Bayes classifier
- Support Vector Machine
- AdaBoost classifier
- LSTM
The dataset is split into training and test sets with an 8:2 ratio. The title and description features are computed independently and then concatenated to build the final feature matrix, which is used to train the classifiers (all except the LSTM).
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model
from sklearn.ensemble import AdaBoostClassifier

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:3], data['Category'], random_state = 0)
X_train_title_features = tfidf_title.transform(X_train['Title']).toarray()
X_train_desc_features = tfidf_desc.transform(X_train['Description']).toarray()
features = np.concatenate([X_train_title_features, X_train_desc_features], axis=1)
# Naive Bayes
nb = MultinomialNB().fit(features, y_train)
# SVM
svm = linear_model.SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3).fit(features, y_train)
# AdaBoost
adaboost = AdaBoostClassifier(n_estimators=40, algorithm="SAMME").fit(features, y_train)
Using an LSTM requires data preprocessing steps quite different from those discussed above. The process is as follows:
- Combine the title and description of each sample into a single sentence.
- Tokenize the combined sentences into padded sequences: convert each sentence into a list of tokens, assign each token a numeric id, then make every sequence the same length by padding the shorter ones and truncating the longer ones.
- One-hot encode the 'Category' variable.
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 20000
# Max number of words in each combined title + description.
MAX_SEQUENCE_LENGTH = 50
# This is fixed.
EMBEDDING_DIM = 100
# Combining titles and descriptions into a single sentence
titles = data['Title'].values
descriptions = data['Description'].values
data_for_lstms = []
for i in range(len(titles)):
    temp_list = [titles[i], descriptions[i]]
    data_for_lstms.append(' '.join(temp_list))
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(data_for_lstms)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
# Convert the data to padded sequences
X = tokenizer.texts_to_sequences(data_for_lstms)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
# One-hot Encode labels
Y = pd.get_dummies(data['Category']).values
print('Shape of label tensor:', Y.shape)
# Splitting into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, random_state = 42)
# Define LSTM Model
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
# Training LSTM Model
epochs = 5
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1)
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show();
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.legend()
plt.show();
The learning curves of the LSTM are shown below:
LSTM loss curve
LSTM accuracy curve
Analyzing Performance
Below are the precision-recall curves for all the different classifiers. For the other metrics, check out the complete code here (a short sketch of how one such curve can be computed follows the link):
https://github.com/agrawal-rohit/Text-Classification-Analysis/blob/master/Text Classification Analysis.ipynb
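As a hedged illustration rather than the notebook's exact evaluation code, a one-vs-rest precision-recall curve for a single class could be computed as follows. It assumes the X_test/y_test split and the fitted nb model from the training section above (i.e. before the LSTM code reuses the X_test name), and builds the test features the same way as the training features.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
import numpy as np

# Build the test feature matrix exactly like the training one
X_test_title_features = tfidf_title.transform(X_test['Title']).toarray()
X_test_desc_features = tfidf_desc.transform(X_test['Description']).toarray()
test_features = np.concatenate([X_test_title_features, X_test_desc_features], axis=1)

# One-vs-rest precision-recall curve for class id 0 using Naive Bayes probabilities
probs = nb.predict_proba(test_features)[:, 0]
precision, recall, _ = precision_recall_curve(y_test == 0, probs)
plt.plot(recall, precision, label='Naive Bayes (class 0)')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()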
The ranking of the classifiers observed in this project was:
LSTM > SVM > Naive Bayes > AdaBoost
LSTMs have shown excellent performance on many natural language processing tasks. The multiple "gates" inside an LSTM allow it to learn long-term dependencies in a sequence; the standard gate equations are reproduced below for reference.
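For reference, the standard LSTM cell computes a forget gate, an input gate, and an output gate at every time step, and the gated, additive update of the cell state is what makes long-range information easy to preserve:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$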
SVMs are very robust classifiers and do their best to capture interactions between the extracted features, but the interactions they learn are not on par with those of the LSTM. The Naive Bayes classifier, on the other hand, treats the features as independent, so it performs slightly worse than the SVM because it does not account for any interactions between the features.
The AdaBoost classifier is quite sensitive to the choice of hyperparameters, and since the default model was used here, it was not running with optimal parameters, which may be the reason for its poor performance; a sketch of how these hyperparameters could be tuned is given below.
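The sketch below shows one way such a hyperparameter search could be run. The parameter grid is only an example, not the grid used in this project, and it assumes the features and y_train arrays from the training section above.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the real search space would need experimentation
param_grid = {
    'n_estimators': [40, 100, 200],
    'learning_rate': [0.1, 0.5, 1.0],
}
search = GridSearchCV(AdaBoostClassifier(algorithm="SAMME"), param_grid, cv=3, scoring='accuracy')
search.fit(features, y_train)
print(search.best_params_, search.best_score_)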
The complete code can be found on GitHub:
https://github.com/agrawal-rohit