1. Algorithm Requirements
Given a student's scores on two exams, predict the admission result — a two-feature fitting problem.
- Exercise 1: linear fit, predicting admission
- Exercise 2: nonlinear fit, predicting admission with regularized logistic regression

Working through these exercises shows that the choice of fitting method should start from the characteristics of the input data. The general steps for solving an engineering problem with machine learning:
- Plot the input features and inspect their distribution (feasible only up to three dimensions; beyond that they cannot be plotted directly)
- Design the hypothesis, i.e. the estimating function
- Derive the cost function $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big]$; logistic regression builds on the sigmoid function
- Derive the gradient (the partial derivatives) used for gradient descent
- Train on the data to solve for the $\theta$ vector
- Using the fitted $\theta$, plot the result, check for overfitting or underfitting, and tune accordingly (a minimal sketch of these steps follows)
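A minimal sketch of this workflow on synthetic data (illustrative only; the synthetic data and variable names are assumptions, and the exercises below apply the same steps to the real data sets):
import numpy as np
import scipy.optimize as opt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

# synthetic two-feature data following a noisy linear decision rule
rng = np.random.default_rng(0)
X = np.c_[np.ones(80), rng.normal(size=(80, 2))]  # intercept + 2 features
y = (X[:, 1] + X[:, 2] + rng.normal(size=80) > 0).astype(float)
res = opt.minimize(fun=cost, x0=np.zeros(3), args=(X, y),
                   method='Newton-CG', jac=gradient)
print(res.x)  # the fitted theta vector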
2. Linear Logistic Regression
Prepare the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
# import tensorflow as tf
from sklearn.metrics import classification_report  # evaluation report
# load the sample data
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  # first five rows
data.describe()
# plot the sample distribution
sns.set(context="notebook", style="darkgrid",
        palette=sns.color_palette("RdBu", 2))  # 'RdBu' is a palette name; 2 picks two colors
sns.palplot(sns.color_palette("RdBu", n_colors=5))  # show the palette strip
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
height=6,
fit_reg=False,
scatter_kws={"s": 50}
)
plt.show()  # inspect the data
(Figures: the sample data and its summary statistics; the RdBu palette with n_colors=5; the sample scatter plot.)
Define the helper functions that extract X and y
def get_X(df):  # read the features
    # use concat to add an intercept column without side effects
    # (not efficient for big datasets, though)
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # m rows, 1 column of ones
    data = pd.concat([ones, df], axis=1)  # merge column-wise (axis picks rows vs columns)
    return data.iloc[:, :-1].values  # this returns an ndarray, not a matrix
def get_y(df):  # read the label
    # assume the last column is the target
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df
def normalize_feature(df):
    # applies the function along axis 0 (each column) of the DataFrame;
    # the lambda receives one column at a time
    return df.apply(lambda column: (column - column.mean()) / column.std())  # feature scaling
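# Illustrative use of the scaling helper above (an assumption for demonstration;
# it is not needed in the rest of this exercise):
# normalize_feature(data[['exam1', 'exam2']]).head()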
X = get_X(data)
print(X.shape)
y = get_y(data)
print(y.shape)
# (100, 3) (100,)
Logistic regression is built on the sigmoid function
g denotes the logistic function, an S-shaped (sigmoid) curve:
$$g(z) = \frac{1}{1 + e^{-z}}$$
Composing it with the linear model gives the hypothesis of logistic regression:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Implementing sigmoid:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(np.arange(-10, 10, step=0.01),
sigmoid(np.arange(-10, 10, step=0.01)))
ax.set_ylim((-0.1,1.1))
ax.set_xlabel('z', fontsize=18)
ax.set_ylabel('g(z)', fontsize=18)
ax.set_title('sigmoid function', fontsize=18)
plt.show()
(Figure: the sigmoid function.)
Define the cost function
- choose $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big]$ as the cost function
Initialize theta
theta = np.zeros(3)  # X is (m, n), so theta is (n,)
def cost(theta, X, y):
    ''' cost fn is -l(theta) for you to minimize'''
    # X @ theta is equivalent to X.dot(theta)
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
cost(theta, X, y)
# initial cost: 0.6931471805599453 — exactly ln 2, since theta = 0 makes every prediction sigmoid(0) = 0.5
Define the gradient (for gradient descent)
- this is batch gradient descent
- in vectorized form: $\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} X^T\big(g(X\theta) - y\big)$
def gradient(theta, X, y):
    # a single batch-gradient evaluation
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
gradient(theta, X, y)
# call it once to inspect the initial gradient
# array([ -0.1 , -12.00921659, -11.26284221])
Fit the parameters
- here we use scipy.optimize.minimize to search for the parameters
import scipy.optimize as opt
res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
# fun - the objective (cost) function; x0 - the initial parameters; args - the training data;
# method - the optimization algorithm (Newton-CG here); jac - the function that computes the gradient
print(res)
fun: 0.20349770426553998
jac: array([-2.85342794e-06, -3.50853296e-05, -1.62061639e-04])
message: 'Optimization terminated successfully.'
nfev: 71
nhev: 0
nit: 27
njev: 178
status: 0
success: True
x: array([-25.16557602, 0.20626565, 0.20150593])
Predict and validate on the training set
In real projects you should not predict and validate on the training set; choosing cross-validation and test data properly is a topic of its own. Since this is an exercise, we keep things simple for learning purposes.
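For reference, a held-out split might look like the sketch below (assuming scikit-learn's train_test_split; this step is not part of the original exercise and res_holdout is a hypothetical name):
from sklearn.model_selection import train_test_split

# hypothetical held-out evaluation, for illustration only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
res_holdout = opt.minimize(fun=cost, x0=np.zeros(X.shape[1]), args=(X_train, y_train),
                           method='Newton-CG', jac=gradient)
# fit on the training split only, then score res_holdout.x on X_test / y_test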
def predict(x, theta):
    prob = sigmoid(x @ theta)
    # sigmoid is monotonic, so prob >= 0.5 is equivalent to x @ theta >= 0
    return (prob >= 0.5).astype(int)
final_theta = res.x
y_pred = predict(X, final_theta)
print(classification_report(y, y_pred))
precision recall f1-score support
0 0.87 0.85 0.86 40
1 0.90 0.92 0.91 60
accuracy 0.89 100
macro avg 0.89 0.88 0.88 100
weighted avg 0.89 0.89 0.89 100
Find the decision boundary
http://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier
$X\theta = 0$ (this is the line)
In effect we solve $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ for $x_2$, i.e. $x_2 = -(\theta_0 + \theta_1 x_1)/\theta_2$, hence:
print(res.x)  # this is the final theta
# [-25.16557602   0.20626565   0.20150593]
coef = -(res.x / res.x[2])  # rescale so the exam2 coefficient is -1
print(coef)
# [124.8875223   -1.02362075  -1.        ]
# use theta to generate points on the fitted boundary line
x = np.arange(130, step=0.1)
y = coef[0] + coef[1] * x
data.describe()  # find the range of x and y
sns.set(context="notebook", style="ticks", font_scale=1.5)
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
height=6,
fit_reg=False,
scatter_kws={"s": 25}
)
plt.plot(x, y, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')
plt.show()
(Figure: the decision boundary.)
3. Nonlinear Logistic Regression
The basic flow mirrors the linear case; shared steps are not repeated.
Load the sample data
df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
df.head()
sns.set(context="notebook", style="ticks", font_scale=1.5)
sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
           height=6,
fit_reg=False,
scatter_kws={"s": 50}
)
plt.title('Regularized Logistic Regression')
plt.show()
The data for exercise 2 is visibly not linearly separable, so a more expressive polynomial fit is needed.
Feature mapping (polynomial expansion)
The code logic:
for i in 0..power:
    for p in 0..i:
        output x^(i-p) * y^p
This expands the original two feature columns into n columns
def feature_mapping(x, y, power, as_ndarray=False):
    # return the mapped features as an ndarray or a dataframe
    # equivalent loop version (ranges are inclusive):
    # data = {}
    # for i in np.arange(power + 1):
    #     for p in np.arange(i + 1):
    #         data["f{}{}".format(i - p, p)] = np.power(x, i - p) * np.power(y, p)
    # the dict comprehension below builds the same mapping; the two trailing
    # for-clauses mirror the nested loops
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
            for i in np.arange(power + 1)
            for p in np.arange(i + 1)
            }
    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)
x1 = np.array(df.test1)
x2 = np.array(df.test2)
data = feature_mapping(x1, x2, power=6)
print(data.shape)
data.head()
The expanded dataset:
Regularized cost
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\big[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)
y = get_y(df)
print(y.shape)
(118, 28)
(118,)
def regularized_cost(theta, X, y, l=1):
    # don't penalize theta_0, the intercept term
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()
    return cost(theta, X, y) + regularized_term
# the regularized cost function
regularized_cost(theta, X, y, l=1)
# 0.6931471805599454
# the same as the unregularized cost, because theta is initialized to zeros
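A quick sanity check of that claim, continuing the session above:
# with theta = 0 the penalty term vanishes, so both costs must agree
assert np.isclose(cost(theta, X, y), regularized_cost(theta, X, y))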
Regularized gradient
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad \text{for } j \ge 1$$
def regularized_gradient(theta, X, y, l=1):
    # still, leave theta_0 alone
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n
    # prepend a 0 so no offset lands on theta_0 (regularization skips the intercept)
    regularized_term = np.concatenate([np.array([0]), regularized_theta])
    return gradient(theta, X, y) + regularized_term
regularized_gradient(theta, X, y)
Fit the parameters
import scipy.optimize as opt
print('init cost = {}'.format(regularized_cost(theta, X, y)))
res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
res
init cost = 0.6931471805599454
fun: 0.5290027297128722
jac: array([ 4.64317436e-08, 1.04331373e-08, -3.61802419e-08, -3.44397841e-08,
2.46408233e-08, 1.12246256e-08, -5.13021764e-09, -1.11582358e-08,
1.23402171e-09, 2.62428040e-08, -3.94916642e-08, 5.21264511e-09,
-8.98645551e-09, 1.04022216e-08, 3.24793498e-08, -5.20025250e-09,
-5.87238749e-09, 7.46797548e-10, -4.52162093e-09, -1.47871366e-09,
1.80405423e-08, -1.62958045e-08, 7.26901438e-10, -5.53694209e-09,
3.57089897e-09, -3.35135784e-09, 5.51302048e-09, 2.59989451e-08])
message: 'Optimization terminated successfully.'
nfev: 7
nhev: 0
nit: 6
njev: 55
status: 0
success: True
x: array([ 1.27274054, 0.62527229, 1.18108684, -2.01996217, -0.91742229,
-1.43166588, 0.1240061 , -0.36553467, -0.35724013, -0.1751284 ,
-1.45815894, -0.050989 , -0.61555564, -0.27470555, -1.192815 ,
-0.24218818, -0.20600633, -0.04473079, -0.27778484, -0.29537856,
-0.45635706, -1.04320283, 0.02777156, -0.29243164, 0.01556705,
-0.3273799 , -0.14388646, -0.92465161])
Predict
final_theta = res.x
y_pred = predict(X, final_theta)
print(classification_report(y, y_pred))
precision recall f1-score support
0 0.90 0.75 0.82 60
1 0.78 0.91 0.84 58
accuracy 0.83 118
macro avg 0.84 0.83 0.83 118
weighted avg 0.84 0.83 0.83 118
Draw the decision boundary with different values of $\lambda$ (the regularization weight, a constant)
We look for all x satisfying $X\theta \approx 0$
- instead of solving the polynomial equation, just create a coordinate grid of (x, y) that is dense enough, find all points where $X\theta$ is close enough to 0, and plot them
def draw_boundary(power, l):
    # power: polynomial power for the mapped features
    # l: the lambda constant
    density = 1000
    threshold = 2 * 10**-3
    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshold)
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    sns.lmplot(x='test1', y='test2', hue='accepted', data=df, height=6, fit_reg=False, scatter_kws={"s": 100})
    plt.scatter(x, y, c='red', s=10)
    plt.title('Decision boundary')
    plt.show()
def feature_mapped_logistic_regression(power, l):
    # for drawing purposes only; not a well-generalized logistic regression
    # power: int — raise x1, x2 to this polynomial power
    # l: int — lambda constant for the regularization term
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)
    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x
    return final_theta
def find_decision_boundary(density, power, theta, threshold):
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)
    coordinates = [(x, y) for x in t1 for y in t2]
    # zip(a, b) packs pairs; zip(*c) unpacks them again
    x_cord, y_cord = zip(*coordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a dataframe
    inner_product = mapped_cord.values @ theta
    # keep the grid points whose decision value is close enough to zero
    decision = mapped_cord[np.abs(inner_product) < threshold]
    return decision.f10, decision.f01  # the x1, x2 columns of the boundary points
Try a few different $\lambda$ values
draw_boundary(power=6, l=1)
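A few more illustrative calls (the specific l values here are assumptions; a larger l means a stronger penalty and a smoother boundary):
draw_boundary(power=6, l=0)    # no regularization: the boundary chases noise and overfits
draw_boundary(power=6, l=100)  # heavy regularization: the boundary is too rigid and underfits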
(Figures, for different λ: slightly underfitting | overfitting | underfitting | about right.)
References:
- https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes
- matplotlib.pyplot style beautification
- matplotlib beautification guide
- Seaborn 0.9 Chinese documentation
- LogisticRegression parameters explained: penalty & C
- Coursera-ML-AndrewNg assignments