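Deep Learning Notes, Part 6 (Activation Functions, Loss Functions, Optimizers)

Activation functions, loss functions, and optimizers are also important building blocks of deep learning, and different choices suit different application scenarios. So far I only have some feel for which loss function fits which scenario; the others remain to be explored.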
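To make the idea of "different choices for different scenarios" concrete, here is a minimal NumPy sketch that pairs an output activation with a loss for three common task types: sigmoid with binary cross-entropy, softmax with categorical cross-entropy, and a linear output with mean squared error. The function names, the `eps` clipping constant, and all input numbers are made up for illustration; this is not how Keras implements these functions internally.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn each row of raw scores into a probability distribution."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; y_prob holds probabilities of the positive class."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))))

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean categorical cross-entropy; rows of y_true are one-hot."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

def mean_squared_error(y_true, y_pred):
    """Mean squared error for regression targets."""
    return float(np.mean((y_true - y_pred) ** 2))

# Binary classification: sigmoid output + binary cross-entropy
logits = np.array([2.0, -1.0])   # made-up raw scores for two samples
labels = np.array([1.0, 0.0])
print("binary:", binary_crossentropy(labels, sigmoid(logits)))

# Multi-class classification: softmax output + categorical cross-entropy
scores = np.array([[1.0, 3.0, 0.5]])   # made-up raw scores for one sample, three classes
one_hot = np.array([[0.0, 1.0, 0.0]])
print("multi-class:", categorical_crossentropy(one_hot, softmax(scores)))

# Regression: linear (identity) output + mean squared error
print("regression:", mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```

The pairing matters because cross-entropy expects probabilities, which is exactly what sigmoid and softmax produce, while mean squared error works directly on unbounded real-valued outputs.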

Code example

from keras.datasets import imdb

def printshape(x):
    #print('values=',x)
    print('#----------------')
    print('#shape=',x.shape)
    print('#ndim=',x.ndim)
    print('#dtype=',x.dtype)

# Load the IMDB dataset, keeping only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(path="D:/Python36/Coding/PycharmProjects/ttt/imdb.npz", num_words=10000)

printshape(train_data)
#----------------
#shape= (25000,)
#ndim= 1
#dtype= object
printshape(train_labels)
#----------------
#shape= (25000,)
#ndim= 1
#dtype= int64
max([max(sequence) for sequence in train_data]) #9999

word_index = imdb.get_word_index(path="D:/Python36/Coding/PycharmProjects/ttt/imdb_word_index.json")
#44864, 'gussied': 65111, "bullock's": 32066, "'delivery'": 65112
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
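#65111: 'gussied', 32066: "bullock's", 65112: "'delivery'"
# Indices are offset by 3 because 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown".
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])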
#train_data[0] = [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
#decoded_review= ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

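test_sentence = 'this film was just brilliant casting'
# Add 3 to re-apply the index offset; this assumes every word is present in word_index (get() returns None otherwise).
test_index = ' '.join([str(word_index.get(i) + 3) for i in test_sentence.split(' ')])
#test_index = 14 22 16 43 530 973
# Prepare the data
# Lists of integers cannot be fed to the network directly. There are two ways to turn them into tensors:
# Either pad the lists to the same length, convert them into an integer tensor of shape (samples, word_indices), and use a first layer that can handle such integer tensors (the Embedding layer).
# Or one-hot encode the lists into vectors of 0s and 1s. For example, the sequence [3, 5] becomes a 10,000-dimensional vector that is 1 at indices 3 and 5 and 0 everywhere else. The first layer can then be a Dense layer, which handles floating-point vector data. This is the approach used here.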
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the positions of the words that occur in this review to 1
    return results

# Vectorize the training data
x_train = vectorize_sequences(train_data)
# Vectorize the test data
x_test = vectorize_sequences(test_data)
print(x_train)
#[[0. 1. 1. ... 0. 0. 0.]
# ...
# [0. 1. 1. ... 0. 0. 0.]]
printshape(x_train)
#----------------
#shape= (25000, 10000)
#ndim= 2
#dtype= float64
printshape(x_test)
#----------------
#shape= (25000, 10000)
#ndim= 2
#dtype= float64

# Vectorize the labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# Build the network
# For a stack of Dense layers like this, two key architecture decisions must be made:
#   how many layers to use;
#   how many hidden units each layer has.
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
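# Activation functions
#    softmax: commonly used for multi-class classification; it generalizes logistic regression to more than two classes.
#    softplus: softplus(x) = log(1 + e^x); approximates the activation of biological neurons, a relatively recent addition.
#    relu: approximates the activation of biological neurons, a relatively recent addition.
#    tanh: hyperbolic tangent activation, also very common.
#    sigmoid: S-shaped activation, one of the most widely used.
#    hard_sigmoid: a piecewise-linear approximation of the sigmoid.
#    linear: the identity activation, the simplest one.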

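# Loss functions
#  For common problems such as classification, regression, and sequence prediction, a few simple guidelines help pick the right loss:
#    for binary classification, use binary crossentropy;
#    for multi-class classification, use categorical crossentropy;
#    for regression, use mean squared error;
#    for sequence learning, use connectionist temporal classification (CTC);
#    only for truly new research problems do you need to design your own objective function.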

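# Optimizers
#Batch gradient descent
#  Drawbacks: every single update computes the gradient over the entire dataset, so it is very slow, awkward for very large datasets, and cannot update the model online as new data arrives.
#  Advantages: converges to the global minimum for convex functions and to a local minimum for non-convex functions.
#Stochastic gradient descent
#  Drawbacks: because SGD updates so frequently, the cost function fluctuates heavily.
#  Advantages: BGD uses all the data for one gradient computation and recomputes gradients for similar samples, which is redundant. SGD performs one update per sample, so it avoids that redundancy, is faster, and can learn from newly added samples.
#Mini-batch gradient descent
#  Drawbacks: if the learning rate is too small, convergence is very slow; if it is too large, the loss keeps oscillating around the minimum or even diverges.
#        The same learning rate is applied to all parameters; for sparse data we would prefer larger updates for rarely occurring features.
#        For non-convex functions we must also avoid getting trapped in local minima or saddle points. Around a saddle point the error is nearly constant and the gradient is close to 0 in every dimension, so SGD easily gets stuck there.
#  Advantages: MBGD computes each update from a small batch of n samples. This reduces the variance of the parameter updates, so convergence is more stable, and it can exploit the highly optimized matrix operations in deep learning libraries for efficient gradient computation.
#Momentum
#  Drawbacks: the ball rolls down the hill blindly following the slope. It would adapt better with some foresight, for example slowing down before the slope goes up again.
#  Advantages: a ball rolling downhill without resistance keeps gaining momentum, and it slows down when it meets resistance.
#        The added momentum term speeds up updates in dimensions whose gradient direction stays the same and slows them down in dimensions whose gradient direction changes, which accelerates convergence and reduces oscillation.
#Nesterov accelerated gradient
#  Drawbacks: (none noted)
#  Advantages: NAG first makes a big jump in the direction of the previously accumulated gradient, then measures the gradient at that look-ahead position and makes a correction. This anticipatory update keeps us from moving too fast. NAG improves RNN performance on many tasks.
#Adagrad
#  Drawbacks: the sum of squared gradients in the denominator keeps accumulating, so the learning rate shrinks and eventually becomes extremely small.
#  Advantages: makes larger updates for infrequent parameters and smaller updates for frequent ones, so it performs well on sparse data and improves the robustness of SGD.
#Adadelta
#  Advantages: an improvement on Adagrad; the term G in the denominator is replaced by a decaying average of past squared gradients.
#RMSprop
#  Advantages: an adaptive learning rate method proposed by Geoff Hinton. Both RMSprop and Adadelta were designed to fix Adagrad's rapidly diminishing learning rate.
#  Drawbacks: (none noted)
#Adam
#  Advantages: another method that computes an adaptive learning rate for each parameter. Besides storing an exponentially decaying average of past squared gradients vt, as Adadelta and RMSprop do, it also keeps an exponentially decaying average of past gradients mt, similar to momentum. In practice Adam often compares favorably with other adaptive methods.
# Summary:
# If the data is sparse, use an adaptive method: Adagrad, Adadelta, RMSprop, or Adam.
# RMSprop, Adadelta, and Adam behave similarly in many cases.
# Adam is essentially RMSprop plus bias correction and momentum.
# As gradients become sparser, Adam tends to do slightly better than RMSprop.
# Overall, Adam is usually the best default choice.
# Many papers use plain SGD without momentum. SGD can reach a minimum, but it takes longer than the other algorithms and may get stuck at saddle points.

# Compile the model
from keras import optimizers
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])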
# Set aside a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
# Train the model
#model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(partial_x_train,partial_y_train,epochs=20,batch_size=512,validation_data=(x_val, y_val))
# Evaluate the model on the test data; results is [loss, accuracy]
results = model.evaluate(x_test, y_test)
#results= [0.7800159653329849, 0.851]
# Use predict to get the probability that each review is positive.
model.predict(x_test)
#model.predict(x_test)= [[0.00587359] [0.9999999 ] [0.9153487 ] ... [0.00243772] [0.004872  ] [0.7773562 ]]

# model.fit() returns a History object. Its history member is a dictionary containing everything recorded during training.
history_dict = history.history
#history_dict= {
#'val_loss': [0.38002283658981323, 0.30036407799720766, 0.30828651866912843,...],
#'val_acc': [0.8691999996185302, 0.8902000004768371, 0.8717000001907349,...],
#'loss': [0.5088913717746735, 0.3007549472649892, 0.21801055087248483, ...],
#'acc': [0.7813333331743876, 0.9050000000317892, 0.9285333332697551, ...]}
history_dict.keys()
#history_dict.keys()=['val_acc', 'acc', 'val_loss', 'loss']

# Plot training and validation loss
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

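# Plot training and validation accuracy
plt.clf()
# Older Keras versions name these keys 'acc'/'val_acc' (as in the output above); Keras 2.3+ uses 'accuracy'/'val_accuracy' when metrics=['accuracy'].
acc = history_dict.get('acc', history_dict.get('accuracy'))
val_acc = history_dict.get('val_acc', history_dict.get('val_accuracy'))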
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
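The update rules behind several of the optimizers described in the comments above can be sketched in plain Python. This is only a rough illustration on a made-up one-dimensional objective f(w) = w**2; the function names, hyperparameter defaults, and the placement of `eps` are assumptions chosen for readability and are not the Keras implementations.

```python
import math

def grad(w):
    """Gradient of the toy objective f(w) = w**2 (chosen only for illustration)."""
    return 2.0 * w

def sgd_step(w, lr=0.1):
    # Plain gradient descent: step against the gradient.
    return w - lr * grad(w)

def momentum_step(w, velocity, lr=0.1, gamma=0.9):
    # Keep a decaying sum of past steps so consistent directions accelerate.
    velocity = gamma * velocity + lr * grad(w)
    return w - velocity, velocity

def adagrad_step(w, g_sum, lr=0.1, eps=1e-8):
    # Divide by the root of ALL accumulated squared gradients, so the step keeps shrinking.
    g = grad(w)
    g_sum = g_sum + g * g
    return w - lr * g / (math.sqrt(g_sum) + eps), g_sum

def rmsprop_step(w, avg_sq, lr=0.1, rho=0.9, eps=1e-8):
    # Like Adagrad, but with a decaying average of squared gradients instead of a running sum.
    g = grad(w)
    avg_sq = rho * avg_sq + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(avg_sq) + eps), avg_sq

def adam_step(w, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of the gradient (m) and squared gradient (v), with bias correction.
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # t is the 1-based step count
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Run every optimizer for 50 steps from the same starting point.
w_sgd = w_mom = w_ada = w_rms = w_adam = 1.0
vel = g_sum = avg_sq = m = v = 0.0
for t in range(1, 51):
    w_sgd = sgd_step(w_sgd)
    w_mom, vel = momentum_step(w_mom, vel)
    w_ada, g_sum = adagrad_step(w_ada, g_sum)
    w_rms, avg_sq = rmsprop_step(w_rms, avg_sq)
    w_adam, m, v = adam_step(w_adam, m, v, t)
print("sgd=%g momentum=%g adagrad=%g rmsprop=%g adam=%g" % (w_sgd, w_mom, w_ada, w_rms, w_adam))
```

Note how the first Adagrad and Adam steps both move by roughly `lr` regardless of the gradient's magnitude, which is the "per-parameter adaptive learning rate" behaviour the notes refer to.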

Next, check how different values of batch_size affect training and validation accuracy, and training and validation loss.
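Before running the comparison, it helps to see how many weight updates each batch size implies. The short sketch below does that arithmetic for the 15,000 training samples left after the 10,000-sample validation split; `steps_per_epoch` is a hypothetical helper written for this note, not a Keras function.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates in one epoch; the last batch may be smaller than batch_size."""
    return math.ceil(num_samples / batch_size)

num_train = 25000 - 10000  # samples remaining after holding out the first 10,000 for validation
num_epochs = 20
for batch_size in [32, 64, 128, 512]:
    steps = steps_per_epoch(num_train, batch_size)
    print("batch_size=%d: %d updates per epoch, %d in total" % (batch_size, steps, steps * num_epochs))
```

With the number of epochs fixed, a batch size of 32 performs roughly 16 times as many updates as a batch size of 512, so differences between the curves reflect the update count as well as the noisier gradients of small batches.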

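units = [32,64,128,512]  # batch sizes to compare
colors = ['red','blue','green','black']
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
for i in range(len(units)):
    unit=units[i]
    # Build a fresh model for each batch size; reusing the already trained model would continue from the previous run's weights.
    model = models.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(partial_x_train,partial_y_train,epochs=20,batch_size=unit,validation_data=(x_val, y_val))
    history_dict = history.history
    print("history_dict =", history_dict)
    # Key names depend on the Keras version: 'acc' in older releases, 'accuracy' in Keras 2.3+.
    acc = history_dict.get('acc', history_dict.get('accuracy'))
    val_acc = history_dict.get('val_acc', history_dict.get('val_accuracy'))
    epochs = range(1, len(acc) + 1)
    # Dots for training accuracy, a solid line for validation accuracy, one color per batch size
    ax.plot(epochs, acc, 'o', label='Training acc, batch_size=%s' %unit, color=colors[i])
    ax.plot(epochs, val_acc, '-', label='Validation acc, batch_size=%s' %unit, color=colors[i])
ax.legend(loc='best')
ax.set_title('Training and validation accuracy')
ax.set_xlabel('Epochs')
ax.set_ylabel('Accuracy')
plt.show()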
