The Logistic Regression Model
Model assumption: the data follow a Bernoulli distribution.
y=\sigma(f(\boldsymbol{x}))=\sigma\left(\boldsymbol{w}^{T} \boldsymbol{x}\right)=\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}
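As a minimal NumPy sketch of this prediction function (the names `w`, `x`, and `predict_proba` are illustrative, not from the original text):

import numpy as np

def predict_proba(w, x):
    # sigma(w^T x) = 1 / (1 + exp(-w^T x))
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))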
Loss Function
P(y | \boldsymbol{x})=\left\{\begin{array}{c}{p, \quad y=1} \\ {1-p, \quad y=0}\end{array}\right.
Equivalently:
P\left(y_{i} | \boldsymbol{x}_{i}\right)=p^{y_{i}}(1-p)^{1-y_{i}}
If we collect a data set of N samples, the total probability is:
P=\prod_{n=1}^{N} p^{y_{n}}(1-p)^{1-y_{n}}
Maximizing this probability is the same as minimizing the negative log loss:
\begin{array}{l}{L=\sum_{n=1}^{N} \ln \left(p^{y_{n}}(1-p)^{1-y_{n}}\right)} \\ {=\sum_{n=1}^{N}\left(y_{n} \ln (p)+\left(1-y_{n}\right) \ln (1-p)\right)}\end{array}
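In code, this objective can be sketched as follows, assuming labels y in {0, 1} and a design matrix X with one sample per row (the function and variable names are illustrative):

import numpy as np

def negative_log_likelihood(w, X, y):
    # p_n = sigma(w^T x_n) for every row x_n of X
    p = 1.0 / (1.0 + np.exp(-X @ w))
    # L = sum_n [ y_n ln(p_n) + (1 - y_n) ln(1 - p_n) ]; return -L so that
    # minimizing this function maximizes the likelihood
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))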
Gradient Computation
The formula for p in the expression above is:
p=\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}
The formula for 1-p:
1-p=\frac{e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}
The derivative of p with respect to \boldsymbol{w} is:
p^{\prime}=p(1-p) \boldsymbol{x}
The derivative of 1-p is:
(1-p)^{\prime}=-p(1-p) \boldsymbol{x}
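Both identities follow from the chain rule; for example, differentiating p with respect to \boldsymbol{w}:
p^{\prime}=\frac{\partial}{\partial \boldsymbol{w}} \frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}=\frac{e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}{\left(1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}\right)^{2}} \boldsymbol{x}=p(1-p) \boldsymbol{x}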
So the gradient of the loss function is:
\begin{aligned} \nabla F(\boldsymbol{w}) &=\nabla\left(\sum_{n=1}^{N}\left(y_{n} \ln (p)+\left(1-y_{n}\right) \ln (1-p)\right)\right) \\ &=\sum_{n=1}^{N}\left(y_{n} \ln ^{\prime}(p)+\left(1-y_{n}\right) \ln ^{\prime}(1-p)\right) \\ &=\sum_{n=1}^{N}\left(y_{n} \frac{1}{p} p^{\prime}+\left(1-y_{n}\right) \frac{1}{1-p}(1-p)^{\prime}\right) \\ &=\sum_{n=1}^{N}\left(y_{n}(1-p) \boldsymbol{x}_{n}-\left(1-y_{n}\right) p \boldsymbol{x}_{n}\right) \\ &=\sum_{n=1}^{N}\left(y_{n}-p\right) \boldsymbol{x}_{n} \end{aligned}
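This gradient translates directly into a short gradient-ascent sketch (labels assumed in {0, 1}; the function names, the fixed learning rate, and the step count are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_gradient(w, X, y):
    # sum_n (y_n - p_n) x_n, with p_n = sigma(w^T x_n)
    return X.T @ (y - sigmoid(X @ w))

def fit(X, y, lr=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Ascend the log-likelihood (equivalently, descend the negative log loss).
        w += lr * log_likelihood_gradient(w, X, y)
    return w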
Decision Boundary of Logistic Regression
The decision boundary of logistic regression is given by:
\frac{1}{1+e^{-\boldsymbol{w}^{T} \boldsymbol{x}}}=0.5
Simplifying this equation gives:
e^{-\boldsymbol{w}^{T} \boldsymbol{x}}=1=e^{0}
That is, the boundary is the hyperplane below, so the decision boundary is linear:
-\boldsymbol{w}^{T} \boldsymbol{x}=0
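Equivalently, predicting class 1 whenever the probability exceeds 0.5 amounts to checking the sign of \boldsymbol{w}^{T}\boldsymbol{x}; a minimal sketch (the name `predict` is illustrative):

import numpy as np

def predict(w, x):
    # sigma(w^T x) >= 0.5 exactly when w^T x >= 0
    return 1 if np.dot(w, x) >= 0 else 0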
Code
Logistic regression with L2-norm regularization
import numpy as np
from scipy.optimize import fmin_bfgs
from pylab import *  # plot, subplot, ylabel, ylim, title, show


def sigmoid(x):
    # The logistic sigmoid is used throughout but was not defined in the
    # original snippet; the standard form 1 / (1 + exp(-x)) is assumed.
    return 1.0 / (1.0 + np.exp(-x))


class LogisticRegression():
    """ A simple logistic regression model with L2 regularization (zero-mean
    Gaussian priors on parameters). Labels are assumed to be in {-1, +1}. """

    def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,
                 alpha=.1, synthetic=False):
        # Set L2 regularization strength
        self.alpha = alpha
        # Set the data.
        self.set_data(x_train, y_train, x_test, y_test)
        # Initialize parameters to zero, for lack of a better choice.
        self.betas = np.zeros(self.x_train.shape[1])

    def negative_lik(self, betas):
        return -1 * self.lik(betas)

    def lik(self, betas):
        """ Likelihood of the data under the current settings of parameters. """
        # Data likelihood
        l = 0
        for i in range(self.n):
            l += np.log(sigmoid(self.y_train[i] *
                                np.dot(betas, self.x_train[i, :])))
        # Prior likelihood
        for k in range(1, self.x_train.shape[1]):
            l -= (self.alpha / 2.0) * betas[k]**2
        return l

    def train(self):
        """ Define the gradient and hand it off to a scipy gradient-based
        optimizer. """
        # Define the derivative of the likelihood with respect to beta_k.
        # Need to multiply by -1 because we will be minimizing.
        dB_k = lambda B, k: (k > 0) * self.alpha * B[k] - np.sum([
            self.y_train[i] * self.x_train[i, k] *
            sigmoid(-self.y_train[i] * np.dot(B, self.x_train[i, :]))
            for i in range(self.n)])

        # The full gradient is just an array of componentwise derivatives
        dB = lambda B: np.array([dB_k(B, k)
                                 for k in range(self.x_train.shape[1])])

        # Optimize
        self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

    def set_data(self, x_train, y_train, x_test, y_test):
        """ Take data that's already been generated. """
        self.x_train = x_train
        self.y_train = y_train
        self.x_test = x_test
        self.y_test = y_test
        self.n = y_train.shape[0]

    def training_reconstruction(self):
        p_y1 = np.zeros(self.n)
        for i in range(self.n):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i, :]))
        return p_y1

    def test_predictions(self):
        n_test = self.x_test.shape[0]
        p_y1 = np.zeros(n_test)
        for i in range(n_test):
            p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i, :]))
        return p_y1

    def plot_training_reconstruction(self):
        # Map the {-1, +1} labels to {0, 1} so they share an axis with p(y=1).
        plot(np.arange(self.n), .5 + .5 * self.y_train, 'bo')
        plot(np.arange(self.n), self.training_reconstruction(), 'rx')
        ylim([-.1, 1.1])

    def plot_test_predictions(self):
        n_test = self.y_test.shape[0]
        plot(np.arange(n_test), .5 + .5 * self.y_test, 'yo')
        plot(np.arange(n_test), self.test_predictions(), 'rx')
        ylim([-.1, 1.1])


if __name__ == "__main__":
    # SyntheticClassifierData is assumed to be defined elsewhere; it is not
    # part of this snippet.
    # Create a 20-dimensional data set with 25 points -- this will be
    # susceptible to overfitting.
    data = SyntheticClassifierData(25, 20)

    # Run for a variety of regularization strengths
    alphas = [0, .001, .01, .1]
    for j, a in enumerate(alphas):
        # Create a new learner, but use the same data for each run
        lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,
                                x_test=data.X_test, y_test=data.Y_test, alpha=a)

        print("Initial likelihood:")
        print(lr.lik(lr.betas))

        # Train the model
        lr.train()

        # Display execution info
        print("Final betas:")
        print(lr.betas)
        print("Final lik:")
        print(lr.lik(lr.betas))

        # Plot the results
        subplot(len(alphas), 2, 2*j + 1)
        lr.plot_training_reconstruction()
        ylabel("Alpha=%s" % a)
        if j == 0:
            title("Training set reconstructions")
        subplot(len(alphas), 2, 2*j + 2)
        lr.plot_test_predictions()
        if j == 0:
            title("Test set predictions")

    show()
Why Logistic Regression Uses Log Loss Instead of Squared Loss
For logistic regression, the log loss discussed here is the same thing as maximum likelihood. Squared loss is not used because, when the sigmoid output is taken as the probability of the positive class, pairing it with a squared loss produces a non-convex objective that is hard to optimize and prone to local optima. With maximum likelihood, the objective is the log-likelihood, which is a convex function of the unknown parameters with continuous higher-order derivatives, so its global optimum is easy to find.
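A quick numeric illustration of this point, using one toy sample and illustrative values (a sketch, not part of the original argument): the squared loss composed with the sigmoid violates convexity, while the log loss does not.

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x, y = 2.0, 1.0                        # a single toy sample
w = np.linspace(-6, 6, 401)            # a grid of parameter values
sq_loss = (sigmoid(w * x) - y) ** 2
log_loss = -(y * np.log(sigmoid(w * x)) + (1 - y) * np.log(1 - sigmoid(w * x)))

# A convex function has non-negative second differences everywhere.
print(np.all(np.diff(sq_loss, 2) >= -1e-12))   # False: squared loss is non-convex
print(np.all(np.diff(log_loss, 2) >= -1e-12))  # True: log loss is convex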
Softmax
h_{\theta}(x)=\left[\begin{array}{c}{P(y=1 | x ; \theta)} \\ {P(y=2 | x ; \theta)} \\ {\vdots} \\ {P(y=K | x ; \theta)}\end{array}\right]=\frac{1}{\sum_{j=1}^{K} \exp \left(\theta_{j}^{T} x\right)}\left[\begin{array}{c}{\exp \left(\theta_{1}^{T} x\right)} \\ {\exp \left(\theta_{2}^{T} x\right)} \\ {\vdots} \\ {\exp \left(\theta_{K}^{T} x\right)}\end{array}\right]
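A minimal NumPy sketch of this hypothesis (the parameter matrix `Theta`, holding one row theta_k per class, and the function names are illustrative):

import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def predict_proba(Theta, x):
    # h_theta(x): a length-K vector with P(y = k | x; theta) in entry k
    return softmax(Theta @ x)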
Cost function:
J(\theta)=-\left[\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbf{1}\left\{y^{(i)}=k\right\} \ln \frac{\exp \left(\theta_{k}^{T} x_{i}\right)}{\sum_{j=1}^{K} \exp \left(\theta_{j}^{T} x_{i}\right)}\right]
Here, \mathbf{1}\{x\} is the indicator function, which takes the value 1 when x is true and 0 otherwise.
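The cost can be sketched the same way, with the indicator realized by indexing the log-probability of each sample's true class (illustrative names; labels assumed to be integers in {0, ..., K-1}):

import numpy as np

def softmax_cost(Theta, X, y):
    # Theta: K x d parameters, X: n x d samples, y: integer class labels
    logits = X @ Theta.T                          # n x K matrix of theta_k^T x_i
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The indicator 1{y_i = k} selects the true-class log-probability of sample i
    return -np.sum(log_probs[np.arange(len(y)), y])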