The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features; it is also known as the "independent feature model". This article demonstrates a naive Bayes classifier implemented with TensorFlow, and then uses the TensorFlow Probability (TFP) library to implement a variant whose Gaussian distribution parameters are trainable.
1. Importing and inspecting the data
The Iris dataset contains 50 samples from each of three species: Iris setosa, Iris virginica, and Iris versicolor. Four features were measured on each sample: the length and width of the sepals and petals. Our goal is to build a naive Bayes classifier that predicts the correct class from the sepal length and sepal width features alone (so only 2 of the 4 features).
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
# Load the dataset
iris = datasets.load_iris()
# Use only the first two features: sepal length and width
data = iris.data[:, :2]
targets = iris.target
# Randomly shuffle the data and make train and test splits
x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)
# Plot the training data
labels = {0: 'Setosa', 1: 'Versicolour', 2: 'Virginica'}
label_colours = ['blue', 'red', 'green']
def plot_data(x, y, labels, colours):
    for y_class in np.unique(y):
        index = np.where(y == y_class)
        plt.scatter(x[index, 0], x[index, 1],
                    label=labels[y_class], c=colours[y_class])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()
2. Theoretical background of the Bayes classifier

The basic equation of the Bayes classifier is Bayes' rule:

$$P(Y = y_k \mid X_1, \ldots, X_d) = \frac{P(X_1, \ldots, X_d \mid Y = y_k)\, P(Y = y_k)}{\sum_{j=1}^{K} P(X_1, \ldots, X_d \mid Y = y_j)\, P(Y = y_j)}$$

where d is the number of feature dimensions, K is the number of classes, P(Y) is the prior distribution over classes, and P(X | Y) is the class-conditional distribution of the inputs.

The naive Bayes classifier assumes that the data features X_i are conditionally independent given the class Y (the "naive" assumption). In that case the class-conditional distribution factorizes as

$$P(X_1, \ldots, X_d \mid Y = y_k) = \prod_{i=1}^{d} P(X_i \mid Y = y_k)$$

With the class prior and the class-conditional distributions in hand, the naive Bayes classifier model reduces to

$$P(Y = y_k \mid X) = \frac{P(Y = y_k) \prod_{i=1}^{d} P(X_i \mid Y = y_k)}{\sum_{j=1}^{K} P(Y = y_j) \prod_{i=1}^{d} P(X_i \mid Y = y_j)}$$
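To make the factorization concrete, here is a minimal numeric sketch (an addition, not from the original article; it assumes scipy is available and uses made-up priors and Gaussian parameters) that evaluates the naive Bayes posterior for a single two-feature input:

import numpy as np
from scipy.stats import norm

# Hypothetical parameters: 3 classes, 2 features
priors = np.array([0.3, 0.3, 0.4])                            # P(Y = y_k)
mu = np.array([[5.0, 3.4], [5.9, 2.8], [6.5, 2.9]])           # mu_ik
sigma = np.array([[0.35, 0.38], [0.53, 0.32], [0.58, 0.28]])  # sigma_ik
x = np.array([5.1, 3.5])  # one test point

# Naive assumption: class-conditional = product of per-feature Gaussians
class_cond = norm.pdf(x, loc=mu, scale=sigma).prod(axis=1)    # shape (3,)

# Bayes' rule: posterior is proportional to prior times class-conditional
unnormalized = priors * class_cond
posterior = unnormalized / unnormalized.sum()
print(posterior, posterior.argmax())  # the first class dominates for this point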
3. Implementation with the TensorFlow math API

3.1 Prior distribution

The class prior is estimated by maximum likelihood, which here is simply the empirical class frequencies in the training set.
def get_prior(y):
    classes = np.unique(y)
    counts = np.zeros(len(classes))
    for c_k in classes:
        # Count samples in class c_k (assumes labels are 0..K-1)
        counts[c_k] = np.sum(np.where(y == c_k, 1, 0))
    probs = counts / np.sum(counts)  # relative class frequencies
    print('The class priors are {}'.format(probs))
    priors = tfd.Categorical(probs=probs)
    return priors
prior = get_prior(y_train)
# ==> The class priors are [0.31666667 0.325 0.35833333]
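As a quick check (an added sketch, not in the original), the same prior probabilities can be computed in one line, since the labels are the integers 0, 1, 2:

# Class counts normalized to relative frequencies
probs_check = np.bincount(y_train) / len(y_train)
print(probs_check)  # should match prior.probs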
Let's visualize the prior distribution:
labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()
3.2 Class-conditional distributions
We define the class-conditional distributions P(X_i | Y = y_k) for i = 0, 1 and k = 0, 1, 2. In our model we assume each of them to be a univariate Gaussian,

$$P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik})$$

with mean parameter μ_ik and standard deviation parameter σ_ik, for twelve parameters in total. We will again use maximum likelihood to estimate these parameters; in this case the estimates are computed as

$$\hat{\mu}_{ik} = \frac{\sum_{n=1}^{N} x_i^{(n)}\, \delta(y^{(n)} = y_k)}{\sum_{n=1}^{N} \delta(y^{(n)} = y_k)}, \qquad \hat{\sigma}_{ik}^2 = \frac{\sum_{n=1}^{N} \big(x_i^{(n)} - \hat{\mu}_{ik}\big)^2\, \delta(y^{(n)} = y_k)}{\sum_{n=1}^{N} \delta(y^{(n)} = y_k)}$$

where δ(y^(n) = y_k) is 1 when the n-th training label equals y_k and 0 otherwise.
def get_class_conditionals(x, y):
    def delta(a, b):
        # Indicator function: 1 if a == b, else 0
        return 1. if a == b else 0.
    classes = np.unique(y)
    n_sample = len(y)
    mu = np.zeros((len(classes), 2))
    sigma = np.zeros((len(classes), 2))
    for k in range(len(classes)):  # loop over every class
        for i in range(2):  # loop over both features
            dyk = [delta(y[r], k) for r in range(n_sample)]
            xin = x[:, i]
            mu_ki = np.dot(xin, dyk) / np.sum(dyk)
            sigma_squared_ki = np.dot((xin - mu_ki)**2, dyk) / np.sum(dyk)
            mu[k][i] = mu_ki
            sigma[k][i] = np.sqrt(sigma_squared_ki)
    # Independent features => one diagonal-covariance Gaussian per class
    dist = tfd.MultivariateNormalDiag(loc=tf.cast(mu, tf.float32),
                                      scale_diag=tf.cast(sigma, tf.float32))
    return dist
class_conditionals = get_class_conditionals(x_train, y_train)
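The loops above follow the δ-notation of the formulas literally. As a sanity check (an addition, reusing the variables already defined), a vectorized version computes the same maximum-likelihood estimates with boolean masks; note that np.std with its default ddof=0 gives exactly the maximum-likelihood (population) standard deviation:

# Vectorized MLE check: per-class feature means and population std devs
for k in np.unique(y_train):
    mask = (y_train == k)
    print(k, x_train[mask].mean(axis=0), x_train[mask].std(axis=0))
# These should match class_conditionals.mean() and class_conditionals.stddev()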
We can visualize the class-conditional densities with contour plots. The contours of each distribution correspond to a Gaussian with a diagonal covariance matrix, since the model assumes the features are independent given the class.
def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()
3.3 Prediction
Given the prior and the class-conditional distributions, we can predict the class of an unseen test input from its posterior probabilities:

$$P(Y = y_k \mid x) = \frac{P(Y = y_k) \prod_{i} P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(x_i \mid Y = y_j)}$$

Using argmax, the class with the largest posterior probability is taken as the prediction.
def predict_class(prior, class_conditionals, x):
    def predict_fn(myx):
        # Class-conditional likelihoods P(x | Y = y_k), one per class
        class_probs = class_conditionals.prob(tf.cast(myx, dtype=tf.float32))
        prior_probs = tf.cast(prior.probs, dtype=tf.float32)
        class_times_prior_probs = class_probs * prior_probs
        Q = tf.reduce_sum(class_times_prior_probs)  # normalization constant
        P = tf.math.divide(class_times_prior_probs, Q)  # posterior probabilities
        Y = tf.cast(tf.argmax(P), dtype=tf.float32)  # most probable class
        return Y
    x = tf.cast(x, tf.float32)
    y = tf.cast(tf.map_fn(predict_fn, x), tf.int32)
    return np.array(y)
predictions = predict_class(prior, class_conditionals, x_test)
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
# ==> Test accuracy: 0.9000
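As a cross-check (an addition, not in the original article), scikit-learn's GaussianNB implements the same model, so it should reach a similar test accuracy on these two features:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(x_train, y_train)
print("sklearn GaussianNB accuracy: {:.4f}".format(gnb.score(x_test, y_test)))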
Let's plot the model's decision regions:
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max),
lambda x: predict_class(prior, class_conditionals, x),
1, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
num_points=500)
plt.title("Training set with decision regions")
plt.show()
4. TFP trainable Gaussian distribution

Within each class, the independent features are modeled as Gaussians. Besides computing the mean and standard deviation parameters with the closed-form statistics above, we can also learn them from the data with TFP. We use tfd.MultivariateNormalDiag() to create a trainable Gaussian distribution for the dataset's features. The distribution's parameters are estimated by minimizing the negative log-likelihood, which is equivalent to maximizing the log-likelihood.
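The training loop below relies on the fact that a TFP distribution built from tf.Variables exposes them through its trainable_variables property, so gradients of the negative log-likelihood can flow into the parameters. A minimal sketch of that mechanism (an addition, with toy values):

# A trainable diagonal Gaussian: loc and scale_diag are tf.Variables
loc = tf.Variable([0., 0.])
scale = tf.Variable([1., 1.])
d = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale)
print(d.trainable_variables)  # the two variables show up here
with tf.GradientTape() as tape:
    loss = -d.log_prob([[0.5, -0.5]])  # NLL of a single toy point
print(tape.gradient(loss, d.trainable_variables))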
4.1 Building a custom training loop
def learn_parameters(x, y, mus, scales, optimiser, epochs):
    @tf.function
    def nll(dist, x_train, y_train):
        log_probs = dist.log_prob(x_train)  # shape (N, n_classes)
        L = len(tf.unique(y_train)[0])
        # One-hot mask: each sample only contributes to its own class's NLL
        y_train = tf.one_hot(indices=y_train, depth=L)
        return -tf.reduce_mean(log_probs * y_train)

    @tf.function
    def get_loss_and_grads(dist, x_train, y_train):
        with tf.GradientTape() as tape:
            tape.watch(dist.trainable_variables)
            loss = nll(dist, x_train, y_train)
            grads = tape.gradient(loss, dist.trainable_variables)
        return loss, grads

    nll_loss = []
    mu_values = []
    scales_values = []
    x = tf.cast(np.expand_dims(x, axis=1), tf.float32)
    dist = tfd.MultivariateNormalDiag(loc=mus, scale_diag=scales)
    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(dist, x, y)
        optimiser.apply_gradients(zip(grads, dist.trainable_variables))
        nll_loss.append(loss)
        mu_values.append(mus.numpy())
        scales_values.append(scales.numpy())
    nll_loss, mu_values, scales_values = (
        np.array(nll_loss), np.array(mu_values), np.array(scales_values))
    return nll_loss, mu_values, scales_values, dist
4.2 Training the model
# Assign initial values for the model's parameters
mus = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
scales = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
opt = tf.keras.optimizers.Adam(learning_rate=0.005)
epochs = 10000
nlls, mu_arr, scales_arr, class_conditionals = learn_parameters(
    x_train, y_train, mus, scales, opt, epochs)
After training is complete, we plot the loss against the epoch to make sure our optimization converged. For the same reason, we also plot the values of the model parameters.
# Plot the loss and the convergence of the mean and scale parameters
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].plot(nlls)
ax[0].set_title("Loss vs. epoch")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Negative log-likelihood")
for k in [0, 1, 2]:
    ax[1].plot(mu_arr[:, k, 0])
    ax[1].plot(mu_arr[:, k, 1])
ax[1].set_title("ML estimates for model's\nmeans vs. epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Means")
for k in [0, 1, 2]:
    ax[2].plot(scales_arr[:, k, 0])
    ax[2].plot(scales_arr[:, k, 1])
ax[2].set_title("ML estimates for model's\nscales vs. epoch")
ax[2].set_xlabel("Epoch")
ax[2].set_ylabel("Scales")
plt.show()
Indeed, our training loop reached convergence, and even though all of the model parameters started from the same initial values, each converged to its own optimal estimate. The following code plots the three bivariate Gaussian distributions, one per class.
x0_range = x_train[:, 0].min(), x_train[:, 0].max()
x1_range = x_train[:, 1].min(), x_train[:, 1].max()
X0, X1 = get_meshgrid(x0_range, x1_range, num_points=300)  # defined in section 3.2
X_v = np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, axis=1)
Z = class_conditionals.prob(X_v)
Z = np.array(Z).T.reshape(3, *X0.shape)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated
for k in [0, 1, 2]:
    ax.plot_surface(X0, X1, Z[k, :, :], alpha=.8)
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Sepal width (cm)')
ax.set_zlabel('Probability')
ax.elev = 35
ax.azim = 45
plt.show()
print("Class conditional means:")
print(class_conditionals.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals.stddev().numpy())
# Class conditional means:
# [[5.002439 3.42439 ]
# [5.89 2.78 ]
# [6.5179486 2.9307692]]
# Class conditional standard deviations:
# [[0.34675348 0.37855995]
# [0.5309426 0.31796226]
# [0.5808357 0.28025055]]
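For comparison (an added check, reusing get_class_conditionals from section 3.2), the closed-form maximum-likelihood estimates should be close to the values learned by gradient descent:

# Closed-form MLE from section 3.2 vs. the gradient-trained values above
cc_mle = get_class_conditionals(x_train, y_train)
print(cc_mle.loc.numpy())       # compare with the trained means
print(cc_mle.stddev().numpy())  # compare with the trained standard deviations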
5. Closing remarks

Classical machine-learning frameworks such as TF output deterministic values, which is somewhat problematic because uncertainty is pervasive in the real world. It comes in two forms: aleatoric uncertainty, inherent in the data itself (measurement error or noise in the dataset labels), and epistemic uncertainty, the model's uncertainty about what it has learned. Models need to express uncertainty; deep learning models need "to know what they don't know". Probabilistic programming, or probabilistic modeling, is one solution, and for probabilistic programming the TensorFlow Probability (TFP) library is a good choice.
Appendix & References

Probabilistic Deep Learning with TensorFlow 2, Imperial College London
TensorFlow Probability