
Artificial Intelligence: Solving Problems

Updated 2020-09-23 17:00:08


In this section, we will solve some related problems.

Category Prediction

In a set of documents, not only the words themselves but also their category matters: which category of text a particular word falls into. For example, we may want to predict whether a given sentence belongs to a category such as email, news, sports or computers. In the following example, we use tf-idf to build feature vectors and find the category of a document, using data from sklearn's 20 newsgroups dataset.

Import the necessary packages -

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We use five different categories: religion, autos, hockey, electronics and space.

category_map = {'talk.religion.misc': 'Religion', 'rec.autos': 'Autos',
   'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics', 'sci.space': 'Space'}

Create the training set -

training_data = fetch_20newsgroups(subset = 'train',
   categories = category_map.keys(), shuffle = True, random_state = 5)

Build a count vectorizer and extract the term counts -

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

Create the tf-idf transformer as follows -

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

Now, define the test data -

input_data = [
   'Discovery was a space shuttle',
   'Hindu, Christian, Sikh all are religions',
   'We must have to drive safely',
   'Puck is a disk made of rubber',
   'Television, Microwave, Refrigrated all uses electricity'
]

Train a Multinomial Naive Bayes classifier on the training data -

classifier = MultinomialNB().fit(train_tfidf, training_data.target)

Transform the input data using the count vectorizer -

input_tc = vectorizer_count.transform(input_data)

Now, transform the vectorized data with the tf-idf transformer -

input_tfidf = tfidf.transform(input_tc)

Predict the output categories -

predictions = classifier.predict(input_tfidf)

Print the predicted category for each input sentence -

for sent, category in zip(input_data, predictions):
   print('\nInput Data:', sent, '\nCategory:',
      category_map[training_data.target_names[category]])

The category predictor generates the following output -

Dimensions of training data: (2755, 39297)


Input Data: Discovery was a space shuttle
Category: Space


Input Data: Hindu, Christian, Sikh all are religions
Category: Religion


Input Data: We must have to drive safely
Category: Autos


Input Data: Puck is a disk made of rubber
Category: Hockey


Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics
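As a side note, the three objects used above (count vectorizer, tf-idf transformer, classifier) can also be chained with scikit-learn's Pipeline, which wraps the transform steps and the classifier in a single fit/predict object. A minimal sketch, with toy sentences and labels of my own standing in for the newsgroup posts:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training texts and labels, standing in for the 20 newsgroups data
texts = ['the shuttle orbited the earth', 'the rocket reached space',
         'the car engine stalled', 'drive your autos safely']
labels = ['Space', 'Space', 'Autos', 'Autos']

model = Pipeline([('count', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('classifier', MultinomialNB())])
model.fit(texts, labels)     # counting, weighting and fitting in one call

print(model.predict(['the space shuttle launched']))   # ['Space']
```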

Gender Finder

In this problem statement, a classifier is trained to find the gender (male or female) of a given name. We use a heuristic to construct the feature vector and train the classifier, using the labeled name data shipped with NLTK. Following is the Python code to build a gender finder -

Import the necessary packages -

import random

from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

Now we need to extract the last N letters of the input word; these letters will act as features -

def extract_features(word, N = 2):
   last_n_letters = word[-N:]
   return {'feature': last_n_letters.lower()}
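A quick sanity check of the heuristic (the definition is repeated so the snippet runs on its own; the names are just examples): the feature is nothing more than the last N letters, lower-cased.

```python
def extract_features(word, N = 2):
   last_n_letters = word[-N:]
   return {'feature': last_n_letters.lower()}

print(extract_features('Gaurav'))          # {'feature': 'av'}
print(extract_features('Swati', N = 3))    # {'feature': 'ati'}
```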


if __name__=='__main__':

Create the training data using the labeled names (male and female) available in NLTK -

   male_list = [(name, 'male') for name in names.words('male.txt')]
   female_list = [(name, 'female') for name in names.words('female.txt')]
   data = (male_list + female_list)

   random.seed(5)
   random.shuffle(data)

Now, create the test data as follows -

   namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']

Define the number of samples used for training and testing -

   train_sample = int(0.8 * len(data))

Now, iterate over different suffix lengths so that the accuracies can be compared -

   for i in range(1, 6):
      print('\nNumber of end letters:', i)
      features = [(extract_features(n, i), gender) for (n, gender) in data]
      train_data, test_data = features[:train_sample], features[train_sample:]
      classifier = NaiveBayesClassifier.train(train_data)

The accuracy of the classifier can be computed as follows -

      accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
      print('Accuracy = ' + str(accuracy_classifier) + '%')

Now, predict the output -

      for name in namesInput:
         print(name, '==>', classifier.classify(extract_features(name, i)))

The above program generates the following output -

Number of end letters: 1
Accuracy = 74.7%
Rajesh ==> female
Gaurav ==> male
Swati ==> female
Shubha ==> female


Number of end letters: 2
Accuracy = 78.79%
Rajesh ==> male
Gaurav ==> male
Swati ==> female
Shubha ==> female


Number of end letters: 3
Accuracy = 77.22%
Rajesh ==> male
Gaurav ==> female
Swati ==> female
Shubha ==> female


Number of end letters: 4
Accuracy = 69.98%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female


Number of end letters: 5
Accuracy = 64.63%
Rajesh ==> female
Gaurav ==> female
Swati ==> female
Shubha ==> female

As can be seen in the output above, accuracy peaks when two end letters are used and decreases as the number of end letters grows.
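The same suffix heuristic can be exercised without downloading the NLTK names corpus. The sketch below (the toy names and their labels are my own, purely illustrative) trains NLTK's NaiveBayesClassifier on a handful of hand-labeled names:

```python
from nltk import NaiveBayesClassifier

def extract_features(word, N = 2):
   return {'feature': word[-N:].lower()}

# Tiny hand-labeled training set (illustrative only)
data = [('Rajesh', 'male'), ('Mahesh', 'male'), ('Gaurav', 'male'),
        ('Swati', 'female'), ('Aarti', 'female'), ('Shubha', 'female')]
features = [(extract_features(name), gender) for (name, gender) in data]
classifier = NaiveBayesClassifier.train(features)

# 'Ramesh' ends in 'sh', which the toy data associates with male names
print(classifier.classify(extract_features('Ramesh')))
```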