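Python Deep Learning Essential Notes 3: Solving Multi-class Classification Problems with Keras

WeChat official account: 机器学习杂货店. Author: Peter. Editor: Peter.

This series is a running set of study notes on the key points of the book Deep Learning with Python (《Python深度学习》), shared purely as learning notes.

This is the third post: how to use Keras to solve the multi-class classification problem from Deep Learning with Python.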

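A multi-class problem differs from a binary one in two respects:

The last layer uses the softmax activation, which outputs a probability for every class; the position of the largest probability is the predicted class.
The loss function is categorical cross-entropy: categorical_crossentropy for one-hot (0/1) labels, and sparse_categorical_crossentropy for integer labels.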
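
The first point can be illustrated without Keras. The following is a minimal NumPy sketch, not the library's implementation; the scores in `logits` are made-up numbers for two samples and four classes.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw scores for 2 samples and 4 classes (illustrative values only)
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 0.5, 3.0, 0.0]])

probs = softmax(logits)                       # each row is a probability distribution
predicted_class = np.argmax(probs, axis=-1)   # position of the largest probability
print(probs.sum(axis=-1), predicted_class)
```

Each row of `probs` sums to 1, and `np.argmax` returns the class index that would be reported as the prediction.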

Environment: Python 3.9.13, Keras 2.12.0, TensorFlow 2.12.0

Loading the data

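The Reuters dataset is a widely used machine learning dataset. It consists of newswire text provided by the Reuters news agency and is used mainly for text classification. The version used here has 46 topic classes.

Each newswire comes with its text content and a topic label. In the Keras version loaded below, the text is stored as a sequence of word indices and the label as a single integer.
What makes the dataset challenging is the variety of the text: newswires differ in subject, style and wording, so a classifier has to recognize the same topic expressed in many different ways.
The dataset is widely used to evaluate and compare text classification algorithms, including support vector machines, decision trees, random forests and neural networks, and it has become one of the classic datasets in machine learning.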

In 1:

import numpy as np
np.random.seed(1234)

import warnings 
warnings.filterwarnings("ignore")

Training set and labels

In 2:

from keras.datasets import reuters

In 3:

# Keep only the 10,000 most frequent words in the data

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

Inspecting the data

In 4:

train_data[:2]

Out4:

array([list([1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]),
       list([1, 3267, 699, 3434, 2295, 56, 2, 7511, 9, 56, 3906, 1073, 81, 5, 1198, 57, 366, 737, 132, 20, 4093, 7, 2, 49, 2295, 2, 1037, 3267, 699, 3434, 8, 7, 10, 241, 16, 855, 129, 231, 783, 5, 4, 587, 2295, 2, 2, 775, 7, 48, 34, 191, 44, 35, 1795, 505, 17, 12])],
      dtype=object)

In 5:

len(train_data), len(test_data)

Out5:

(8982, 2246)

Inspecting the labels: there are 46 classes in total

In 6:

train_labels[:20]

Out6:

array([ 3,  4,  3,  4,  4,  4,  4,  3,  3, 16,  3,  3,  4,  4, 19,  8, 16,
        3,  3, 21], dtype=int64)

In 7:

test_labels[:20]

Out7:

array([ 3, 10,  1,  4,  4,  3,  3,  3,  3,  3,  5,  4,  1,  3,  1, 11, 23,
        3, 19,  3], dtype=int64)

Converting between words and indices:

In 8:

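word_index = reuters.get_word_index()  # maps word -> integer index

# Invert the mapping: integer index -> word
reverse_word_index = dict([value, key] for (key, value) in word_index.items())
# Subtract 3 because indices 0, 1 and 2 are reserved for padding, start of sequence and unknown word
decoded_review = ' '.join([reverse_word_index.get(i-3, "?") for i in train_data[0]])
decoded_review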

Out8:

'? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'
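
The leading question marks in the output come from the index offset. The sketch below reproduces the lookup on a made-up three-word vocabulary so it runs without downloading anything; `toy_word_index` and `encoded` are hypothetical and only mimic the format of the real data, assuming the default offset of 3 used by `reuters.load_data`.

```python
# Hypothetical miniature vocabulary in the same shape as get_word_index(): word -> rank
toy_word_index = {"the": 1, "of": 2, "said": 3}
toy_reverse = {index: word for word, index in toy_word_index.items()}

# Encoded sample: 1 marks the start of the sequence, 2 marks an unknown word,
# and every real word is stored as its rank + 3
encoded = [1, 6, 4, 2, 5]

# Undo the offset; anything that is not a real word falls back to "?"
decoded = " ".join(toy_reverse.get(i - 3, "?") for i in encoded)
print(decoded)
```

The start marker and the unknown-word marker have no entry in the reversed dictionary after the offset is removed, so both are printed as "?".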

Vectorizing the data

The vectorization process:

In 9:

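# The same vectorization function as before

import numpy as np

def vectorize(seq, dim=10000):
    """
    seq: list of input sequences (each a list of word indices)
    dim: width of each output vector, 10000 here
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape (len(seq), dim)
    for i, s in enumerate(seq):
        results[i,s] = 1.   # set the positions of the words that occur to 1; the rest stay 0
    return results

In 10:

# Vectorize both datasets

x_train = vectorize(train_data)
x_test = vectorize(test_data)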
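
To see what this encoding produces, here is a small self-contained sketch on two made-up sequences with a vocabulary of only 6 words. The helper name `multi_hot` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def multi_hot(sequences, dim):
    """Turn lists of word indices into 0/1 vectors of width dim."""
    results = np.zeros((len(sequences), dim))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0   # fancy indexing sets every listed position at once
    return results

# Two hypothetical "documents"; index 3 appears twice in the first one
toy_sequences = [[1, 3, 3, 5], [0, 2]]
encoded = multi_hot(toy_sequences, dim=6)
print(encoded)
```

Note that a word occurring several times still yields a single 1, so this representation records which words are present but not how often or in what order.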

Vectorizing the labels

Label vectorization, method 1: a hand-written one-hot encoding function

In 11:

# 1. Manual implementation

def to_one_hot(labels, dimension=46):  # 46 = number of classes
    results = np.zeros((len(labels), dimension))  # all-zero matrix, np.zeros((m, n))
    for i, label in enumerate(labels):
        results[i, label] = 1.  # set the column of this sample's class to 1 (as a float)
    return results

# Call the function defined above
one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

Label vectorization, method 2: using the Keras built-in function

In 12:

# Keras built-in method
from keras.utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
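
As a sanity check on what one-hot encoding means, the sketch below builds the same kind of matrix with plain NumPy and converts it back to integer labels. It is an illustration, not the Keras implementation; `toy_labels` is a made-up sample of class ids.

```python
import numpy as np

num_classes = 46                        # number of classes in the Reuters labels
toy_labels = np.array([3, 4, 16, 45])   # hypothetical integer labels

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes)[toy_labels]

# argmax along each row recovers the original integer label
recovered = np.argmax(one_hot, axis=1)
print(one_hot.shape, recovered)
```

When `num_classes` is not passed, `to_categorical` infers the width from the largest label it sees, so passing `num_classes=46` explicitly is a reasonable precaution if a split might not contain the highest class id.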

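Handling integer labels (with sparse_categorical_crossentropy)

If we do not want to convert the class labels (46 possible values) to one-hot form, we can keep them as integers and use the sparse loss: sparse_categorical_crossentropy.

Usage is otherwise the same: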

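y_train = np.array(train_labels)
y_test = np.array(test_labels)

# model is the network defined later in this note
model.compile(
    optimizer='rmsprop',  # optimizer
    loss='sparse_categorical_crossentropy',  # sparse categorical cross-entropy loss
    metrics=['accuracy']   # evaluation metric
    )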
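
The two losses compute the same quantity and differ only in the label format they expect. A minimal NumPy sketch of that equivalence follows, assuming made-up predicted probabilities for two samples and three classes; it mirrors the mathematical definition rather than the Keras code.

```python
import numpy as np

# Hypothetical predicted probabilities (rows sum to 1) and integer labels
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
int_labels = np.array([0, 2])
one_hot_labels = np.eye(3)[int_labels]

# Categorical cross-entropy: -sum(y_true * log(p)) for each sample
cce = -np.sum(one_hot_labels * np.log(probs), axis=1)

# Sparse version: pick the probability of the true class directly
scce = -np.log(probs[np.arange(len(int_labels)), int_labels])

print(cce, scce)
```

Both arrays hold the negative log of the probability assigned to the true class, so the choice between the two losses is a matter of how the labels are stored.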

Training set and validation set

In 13:

#  Hold out the first 1,000 samples as the validation set

x_val = x_train[:1000]
part_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
part_y_train = one_hot_train_labels[1000:]
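
The slicing above is a simple hold-out split. The sketch below repeats it on a stand-in array so the resulting sizes can be checked; `features` is a placeholder for the real `x_train`, and the sample counts are the ones reported earlier in this note.

```python
import numpy as np

n_samples, n_val = 8982, 1000                    # training size and hold-out size
features = np.arange(n_samples).reshape(-1, 1)   # stand-in for x_train

val_part = features[:n_val]      # first 1,000 rows go to validation
train_part = features[n_val:]    # the remaining rows are used for fitting

print(len(val_part), len(train_part))
```

Because the two slices do not overlap, the validation samples are never seen during fitting.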

Building the network

In 14:

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],)))  # x_train.shape[1] = 10000
model.add(layers.Dense(64,activation="relu"))
model.add(layers.Dense(46, activation="softmax"))    # 46 is the number of output classes
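
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the unit counts of the three Dense layers
layer_sizes = [10000, 64, 64, 46]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, because it connects the 10,000-dimensional input to 64 units.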

Compared with the binary classification problem, there are 3 points to note:

The input to the first layer of the network