The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features; it is also known as the "independent feature model". This article demonstrates a naive Bayes classifier implemented with TensorFlow, and then uses the TensorFlow Probability (TFP) library to implement a variant whose Gaussian distribution parameters are trainable.
1. Importing and inspecting the data
The Iris dataset contains 50 samples from each of three species: Iris setosa, Iris virginica, and Iris versicolor. Four features were measured on each sample: the length and width of the sepals and petals. Our goal is to build a naive Bayes classifier that predicts the correct class from the sepal length and sepal width features alone (so only 2 of the 4 features).
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
# Load the dataset
iris = datasets.load_iris()
# Use only the first two features: sepal length and width
data = iris.data[:, :2]
targets = iris.target
# Randomly shuffle the data and make train and test splits
x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)
# Plot the training data
labels = {0: 'Setosa', 1: 'Versicolour', 2: 'Virginica'}
label_colours = ['blue', 'red', 'green']
def plot_data(x, y, labels, colours):
    for y_class in np.unique(y):
        index = np.where(y == y_class)
        plt.scatter(x[index, 0], x[index, 1],
                    label=labels[y_class], c=colours[y_class])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()
2. Theoretical background of the Bayes classifier

The basic equation of the Bayes classifier is Bayes' rule:

$$P(Y = y_k \mid X_1, \ldots, X_d) = \frac{P(X_1, \ldots, X_d \mid Y = y_k)\, P(Y = y_k)}{\sum_{j=1}^{K} P(X_1, \ldots, X_d \mid Y = y_j)\, P(Y = y_j)}$$

where d is the number of feature dimensions, K is the number of classes, P(Y) is the prior distribution over classes, and P(X | Y) is the class-conditional distribution of the inputs.

The naive Bayes classifier assumes that the data features X_i are conditionally independent given the class Y (the "naive" assumption). In that case the class-conditional distribution factorizes as

$$P(X_1, \ldots, X_d \mid Y = y_k) = \prod_{i=1}^{d} P(X_i \mid Y = y_k)$$

With the class prior and the class-conditional distributions in hand, the naive Bayes classifier model reduces to

$$P(Y = y_k \mid X) = \frac{P(Y = y_k) \prod_{i=1}^{d} P(X_i \mid Y = y_k)}{\sum_{j=1}^{K} P(Y = y_j) \prod_{i=1}^{d} P(X_i \mid Y = y_j)}$$
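To make the factorization concrete, here is a minimal numeric sketch (an addition, not from the original article; it assumes scipy is available and uses made-up priors and Gaussian parameters) that evaluates the naive Bayes posterior for a single two-feature input:

import numpy as np
from scipy.stats import norm

# Hypothetical parameters: 3 classes, 2 features
priors = np.array([0.3, 0.3, 0.4])                            # P(Y = y_k)
mu = np.array([[5.0, 3.4], [5.9, 2.8], [6.5, 2.9]])           # mu_ik
sigma = np.array([[0.35, 0.38], [0.53, 0.32], [0.58, 0.28]])  # sigma_ik
x = np.array([5.1, 3.5])  # one test point

# Naive assumption: class-conditional = product of per-feature Gaussians
class_cond = norm.pdf(x, loc=mu, scale=sigma).prod(axis=1)    # shape (3,)

# Bayes' rule: posterior is proportional to prior times class-conditional
unnormalized = priors * class_cond
posterior = unnormalized / unnormalized.sum()
print(posterior, posterior.argmax())  # the first class dominates for this point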
3. Implementation with the TensorFlow math API

3.1 Prior distribution

The class prior is estimated by maximum likelihood, which here is simply the empirical class frequencies in the training set.
def get_prior(y):
    classes = np.unique(y)
    counts = np.zeros(len(classes))
    for c_k in classes:
        # Count samples in class c_k (assumes labels are 0..K-1)
        counts[c_k] = np.sum(np.where(y == c_k, 1, 0))
    probs = counts / np.sum(counts)  # relative class frequencies
    print('The class priors are {}'.format(probs))
    priors = tfd.Categorical(probs=probs)
    return priors
prior = get_prior(y_train)
# ==> The class priors are [0.31666667 0.325 0.35833333]
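As a quick check (an added sketch, not in the original), the same prior probabilities can be computed in one line, since the labels are the integers 0, 1, 2:

# Class counts normalized to relative frequencies
probs_check = np.bincount(y_train) / len(y_train)
print(probs_check)  # should match prior.probs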
Let's visualize the prior distribution:
labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()
3.2 Class-conditional distributions
We define the class-conditional distributions P(X_i | Y = y_k) for i = 0, 1 and k = 0, 1, 2. In our model we assume each of them to be a univariate Gaussian,

$$P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik})$$

with mean parameter μ_ik and standard deviation parameter σ_ik, for twelve parameters in total. We will again use maximum likelihood to estimate these parameters; in this case the estimates are computed as

$$\hat{\mu}_{ik} = \frac{\sum_{n=1}^{N} x_i^{(n)}\, \delta(y^{(n)} = y_k)}{\sum_{n=1}^{N} \delta(y^{(n)} = y_k)}, \qquad \hat{\sigma}_{ik}^2 = \frac{\sum_{n=1}^{N} \big(x_i^{(n)} - \hat{\mu}_{ik}\big)^2\, \delta(y^{(n)} = y_k)}{\sum_{n=1}^{N} \delta(y^{(n)} = y_k)}$$

where δ(y^(n) = y_k) is 1 when the n-th training label equals y_k and 0 otherwise.
def get_class_conditionals(x, y):
    def delta(a, b):
        # Indicator function: 1 if a == b, else 0
        return 1. if a == b else 0.
    classes = np.unique(y)
    n_sample = len(y)
    mu = np.zeros((len(classes), 2))
    sigma = np.zeros((len(classes), 2))
    for k in range(len(classes)):  # loop over every class
        for i in range(2):  # loop over both features
            dyk = [delta(y[r], k) for r in range(n_sample)]
            xin = x[:, i]
            mu_ki = np.dot(xin, dyk) / np.sum(dyk)
            sigma_squared_ki = np.dot((xin - mu_ki)**2, dyk) / np.sum(dyk)
            mu[k][i] = mu_ki
            sigma[k][i] = np.sqrt(sigma_squared_ki)
    # Independent features => one diagonal-covariance Gaussian per class
    dist = tfd.MultivariateNormalDiag(loc=tf.cast(mu, tf.float32),
                                      scale_diag=tf.cast(sigma, tf.float32))
    return dist
class_conditionals = get_class_conditionals(x_train, y_train)
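The loops above follow the δ-notation of the formulas literally. As a sanity check (an addition, reusing the variables already defined), a vectorized version computes the same maximum-likelihood estimates with boolean masks; note that np.std with its default ddof=0 gives exactly the maximum-likelihood (population) standard deviation:

# Vectorized MLE check: per-class feature means and population std devs
for k in np.unique(y_train):
    mask = (y_train == k)
    print(k, x_train[mask].mean(axis=0), x_train[mask].std(axis=0))
# These should match class_conditionals.mean() and class_conditionals.stddev()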
We can visualize the class-conditional densities with contour plots. The contours of each distribution correspond to a Gaussian with a diagonal covariance matrix, since the model assumes the features are independent given the class.
def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()
3.3 Prediction
Given the prior and the class-conditional distributions, we can predict the class of an unseen test input from its posterior probabilities:

$$P(Y = y_k \mid x) = \frac{P(Y = y_k) \prod_{i} P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(x_i \mid Y = y_j)}$$

Using argmax, the class with the largest posterior probability is taken as the prediction.
def predict_class(prior, class_conditionals, x):
    def predict_fn(myx):
        # Class-conditional likelihoods P(x | Y = y_k), one per class
        class_probs = class_conditionals.prob(tf.cast(myx, dtype=tf.float32))
        prior_probs = tf.cast(prior.probs, dtype=tf.float32)
        class_times_prior_probs = class_probs * prior_probs
        Q = tf.reduce_sum(class_times_prior_probs)  # normalization constant
        P = tf.math.divide(class_times_prior_probs, Q)  # posterior probabilities
        Y = tf.cast(tf.argmax(P), dtype=tf.float32)  # most probable class
        return Y
    x = tf.cast(x, tf.float32)
    y = tf.cast(tf.map_fn(predict_fn, x), tf.int32)
    return np.array(y)
predictions = predict_class(prior, class_conditionals, x_test)
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
# ==> Test accuracy: 0.9000
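As a cross-check (an addition, not in the original article), scikit-learn's GaussianNB implements the same model, so it should reach a similar test accuracy on these two features:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(x_train, y_train)
print("sklearn GaussianNB accuracy: {:.4f}".format(gnb.score(x_test, y_test)))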
Let's plot the model's decision regions:
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max),
lambda x: predict_class(prior, class_conditionals, x),
1, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
num_points=500)
plt.title("Training set with decision regions")
plt.show()
4. TFP trainable Gaussian distribution

Within each class, the independent features are modeled as Gaussians. Besides computing the mean and standard deviation parameters with the closed-form statistics above, we can also learn them from the data with TFP. We use tfd.MultivariateNormalDiag() to create a trainable Gaussian distribution for the dataset's features. The distribution's parameters are estimated by minimizing the negative log-likelihood, which is equivalent to maximizing the log-likelihood.
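The training loop below relies on the fact that a TFP distribution built from tf.Variables exposes them through its trainable_variables property, so gradients of the negative log-likelihood can flow into the parameters. A minimal sketch of that mechanism (an addition, with toy values):

# A trainable diagonal Gaussian: loc and scale_diag are tf.Variables
loc = tf.Variable([0., 0.])
scale = tf.Variable([1., 1.])
d = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale)
print(d.trainable_variables)  # the two variables show up here
with tf.GradientTape() as tape:
    loss = -d.log_prob([[0.5, -0.5]])  # NLL of a single toy point
print(tape.gradient(loss, d.trainable_variables))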
4.1 Building a custom training loop
def learn_parameters(x, y, mus, scales, optimiser, epochs):
    @tf.function
    def nll(dist, x_train, y_train):
        log_probs = dist.log_prob(x_train)  # shape (N, n_classes)
        L = len(tf.unique(y_train)[0])
        # One-hot mask: each sample only contributes to its own class's NLL
        y_train = tf.one_hot(indices=y_train, depth=L)
        return -tf.reduce_mean(log_probs * y_train)

    @tf.function
    def get_loss_and_grads(dist, x_train, y_train):
        with tf.GradientTape() as tape:
            tape.watch(dist.trainable_variables)
            loss = nll(dist, x_train, y_train)
            grads = tape.gradient(loss, dist.trainable_variables)
        return loss, grads

    nll_loss = []
    mu_values = []
    scales_values = []
    x = tf.cast(np.expand_dims(x, axis=1), tf.float32)
    dist = tfd.MultivariateNormalDiag(loc=mus, scale_diag=scales)
    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(dist, x, y)
        optimiser.apply_gradients(zip(grads, dist.trainable_variables))
        nll_loss.append(loss)
        mu_values.append(mus.numpy())
        scales_values.append(scales.numpy())
    nll_loss, mu_values, scales_values = (
        np.array(nll_loss), np.array(mu_values), np.array(scales_values))
    return nll_loss, mu_values, scales_values, dist
4.2 Training the model
# Assign initial values for the model's parameters
mus = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
scales = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
opt = tf.keras.optimizers.Adam(learning_rate=0.005)
epochs = 10000
nlls, mu_arr, scales_arr, class_conditionals = learn_parameters(
    x_train, y_train, mus, scales, opt, epochs)
After training is complete, we plot the loss against the epoch to make sure our optimization converged. For the same reason, we also plot the values of the model parameters.
# Plot the loss and the convergence of the mean and scale parameters
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].plot(nlls)
ax[0].set_title("Loss vs. epoch")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Negative log-likelihood")
for k in [0, 1, 2]:
    ax[1].plot(mu_arr[:, k, 0])
    ax[1].plot(mu_arr[:, k, 1])
ax[1].set_title("ML estimates for model's\nmeans vs. epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Means")
for k in [0, 1, 2]:
    ax[2].plot(scales_arr[:, k, 0])
    ax[2].plot(scales_arr[:, k, 1])
ax[2].set_title("ML estimates for model's\nscales vs. epoch")
ax[2].set_xlabel("Epoch")
ax[2].set_ylabel("Scales")
plt.show()
Indeed, our training loop reached convergence, and even though all of the model parameters started from the same initial values, each converged to its own optimal estimate. The following code plots the three bivariate Gaussian distributions, one per class.
x0_range = x_train[:, 0].min(), x_train[:, 0].max()
x1_range = x_train[:, 1].min(), x_train[:, 1].max()
X0, X1 = get_meshgrid(x0_range, x1_range, num_points=300)  # defined in section 3.2
X_v = np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, axis=1)
Z = class_conditionals.prob(X_v)
Z = np.array(Z).T.reshape(3, *X0.shape)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated
for k in [0, 1, 2]:
    ax.plot_surface(X0, X1, Z[k, :, :], alpha=.8)
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Sepal width (cm)')
ax.set_zlabel('Probability')
ax.elev = 35
ax.azim = 45
plt.show()
print("Class conditional means:")
print(class_conditionals.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals.stddev().numpy())
# Class conditional means:
# [[5.002439 3.42439 ]
# [5.89 2.78 ]
# [6.5179486 2.9307692]]
# Class conditional standard deviations:
# [[0.34675348 0.37855995]
# [0.5309426 0.31796226]
# [0.5808357 0.28025055]]
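For comparison (an added check, reusing get_class_conditionals from section 3.2), the closed-form maximum-likelihood estimates should be close to the values learned by gradient descent:

# Closed-form MLE from section 3.2 vs. the gradient-trained values above
cc_mle = get_class_conditionals(x_train, y_train)
print(cc_mle.loc.numpy())       # compare with the trained means
print(cc_mle.stddev().numpy())  # compare with the trained standard deviations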
5. Closing remarks

Classical machine-learning frameworks such as TF output deterministic values, which is somewhat problematic because uncertainty is pervasive in the real world. It comes in two forms: aleatoric uncertainty, inherent in the data itself (measurement error or noise in the dataset labels), and epistemic uncertainty, the model's uncertainty about what it has learned. Models need to express uncertainty; deep learning models need "to know what they don't know". Probabilistic programming, or probabilistic modeling, is one solution, and for probabilistic programming the TensorFlow Probability (TFP) library is a good choice.
Appendix & References

Probabilistic Deep Learning with TensorFlow 2, Imperial College London
TensorFlow Probability