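These are study notes on the book scikit-learn机器学习(第2版).
Logistic regression is commonly used for classification tasks.
1. Binary classification with logistic regression
From 《统计学习方法》: the logistic regression model (Logistic Regression, LR)
Definition: let X be a continuous random variable. X follows the logistic distribution if it has the following distribution function and density function, where μ is the location parameter and γ > 0 is the shape parameter: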
$$F(x) = P(X \leq x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$
$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^2}$$
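In logistic regression, an instance is predicted as the positive class when its predicted probability is >= the threshold, and as the negative class otherwise.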
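As a minimal sketch of the definitions above, the following evaluates the logistic distribution function and density and applies a decision threshold. The function names and the default threshold of 0.5 are assumptions made for illustration; they are not taken from the book.

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) of the logistic distribution."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density function f(x) = F'(x) of the logistic distribution."""
    z = math.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

def predict_label(probability, threshold=0.5):
    """Return 1 (positive class) if probability >= threshold, else 0."""
    return 1 if probability >= threshold else 0

# At x = mu the distribution function is exactly one half
print(logistic_cdf(0.0))   # 0.5
print(logistic_pdf(0.0))   # 0.25
print(predict_label(logistic_cdf(2.0)))   # 1, since F(2) > 0.5
print(predict_label(logistic_cdf(-2.0)))  # 0, since F(-2) < 0.5
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which typically trades recall for precision.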
2. Spam filtering
Extract TF-IDF features from the messages and classify them with logistic regression.
import pandas as pd
# The file is tab-separated: column 0 holds the label, column 1 the message text
data = pd.read_csv("SMSSpamCollection", delimiter='\t', header=None)
data

data[data[0]=='ham'][0].count() # 4825 ham (legitimate) messages
data[data[0]=='spam'][0].count() # 747 spam messages
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
X = data[1].values
y = data[0].values
# Encode the labels as integers: 'ham' -> 0, 'spam' -> 1 (the result is a column vector)
lb = LabelBinarizer()
y = lb.fit_transform(y)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, random_state=520)
# Fit TF-IDF on the training messages only, then reuse its vocabulary for the test messages
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
# ravel() flattens the column vector into the 1-D shape that fit() expects
classifier.fit(X_train, y_train.ravel())
pred = classifier.predict(X_test)
for i, pred_i in enumerate(pred[:5]):
    print("Predicted: %s, message: %s, actual: %s" % (pred_i, X_test_raw[i], y_test[i]))

Predicted: 0, message: Aww that's the first time u said u missed me without asking if I missed u first. You DO love me! :), actual: [0]
Predicted: 0, message: Poor girl can't go one day lmao, actual: [0]
Predicted: 0, message: Also remember the beads don't come off. Ever., actual: [0]
Predicted: 0, message: I see the letter B on my car, actual: [0]
Predicted: 0, message: My love ! How come it took you so long to leave for Zaher's? I got your words on ym and was happy to see them but was sad you had left. I miss you, actual: [0]
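2.1 Performance metrics
Confusion matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
# Store the result under a new name so the confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, pred)
# Font setting for CJK text; it can be dropped when every label is ASCII
plt.rcParams["font.sans-serif"] = 'SimHei'
plt.matshow(cm)
plt.title("Confusion matrix")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.colorbar()
plt.show()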
2.2 Accuracy
# 5-fold cross-validated accuracy on the training set
scores = cross_val_score(classifier, X_train, y_train.ravel(), cv=5)
print('Accuracies: %s' % scores)
print('Mean accuracy: %s' % np.mean(scores))

Accuracies: [0.94976077 0.95933014 0.96650718 0.95215311 0.95688623]
Mean accuracy: 0.9569274847434318
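Accuracy is not a very suitable performance metric here: it cannot tell the two kinds of error apart, that is, whether a positive was predicted as negative (false negative) or a negative was predicted as positive (false positive).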
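To make this limitation concrete, here is a small sketch with made-up labels (not from the SMS dataset): two hypothetical classifiers reach the same accuracy while making opposite kinds of mistakes. The helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def error_counts(y_true, y_pred):
    """Return (false positives, false negatives) for 0/1 labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn

# Made-up data: 1 = spam, 0 = ham
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two spam messages
pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # flags two ham messages as spam

for name, pred in (("A", pred_a), ("B", pred_b)):
    fp, fn = error_counts(y_true, pred)
    print(name, accuracy(y_true, pred), "FP:", fp, "FN:", fn)
```

Both classifiers score 0.8, yet for a spam filter the second kind of error (legitimate mail flagged as spam) is usually the more costly one.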
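2.3 Precision and recall
See also [Hands On ML] 3. Classification (MNIST handwritten digit prediction)
Looking at precision or recall alone is not meaningful; the two need to be read together.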
from sklearn.metrics import precision_score, recall_score, f1_score
precisions = precision_score(y_test, pred)
print('Precision: %s' % precisions)
recalls = recall_score(y_test, pred)
print('Recall: %s' % recalls)

Precision: 0.9852941176470589
Nearly every message predicted as spam really is spam.
Recall: 0.6979166666666666
About 30% of the spam messages were predicted as non-spam.
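The two numbers above follow directly from confusion-matrix counts. The sketch below uses hypothetical counts (TP = 134, FP = 2, FN = 58), chosen only because they reproduce the reported ratios; the post does not print the actual counts, so treat them as an assumption.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that is truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, the fraction that was found."""
    return tp / (tp + fn)

# Hypothetical counts, consistent with the reported precision and recall
tp, fp, fn = 134, 2, 58

p = precision(tp, fp)
r = recall(tp, fn)
print('Precision: %s' % p)
print('Recall: %s' % r)
print('Share of spam missed: %.3f' % (1 - r))
```

Precision only looks at the messages flagged as spam, while recall only looks at the messages that really are spam, which is why a filter can be precise and still miss a large share of spam.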
2.4 F1 score
The F1 score is the harmonic mean of precision and recall, so it balances the two.
f1s = f1_score(y_test, pred)
print('F1 score: %s' % f1s)
# F1 score: 0.8170731707317074
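As a quick check, the reported F1 score can be recomputed from the precision and recall values printed earlier using the harmonic-mean formula F1 = 2PR / (P + R). The helper name is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the output reported in this post
precision = 0.9852941176470589
recall = 0.6979166666666666

f1 = f1_from(precision, recall)
arithmetic_mean = (precision + recall) / 2
print('F1 score: %.6f' % f1)
print('Arithmetic mean: %.6f' % arithmetic_mean)
```

The harmonic mean sits below the arithmetic mean because it is pulled toward the smaller of the two values, so a low recall cannot be hidden behind a high precision.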
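2.5 ROC and AUC
- The closer a classifier's AUC is to 1 the better; a random classifier has an AUC of 0.5
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# Use the predicted probability of the positive class; hard 0/1 predictions
# would give a curve with a single interior point
y_score = classifier.predict_proba(X_test)[:, 1]
false_positive_rate, recall, thresholds = roc_curve(y_test, y_score)
# Store the result under a new name so the roc_auc_score function is not shadowed
auc = roc_auc_score(y_test, y_score)
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()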
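One way to build intuition for AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes that directly on made-up scores; it is an illustrative pairwise implementation, not how scikit-learn computes the value.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_true = [1, 1, 1, 0, 0, 0]
perfect_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every positive above every negative
good_scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.1]     # one positive ranked below one negative
constant_scores = [0.5] * 6                       # carries no ranking information

print(pairwise_auc(y_true, perfect_scores))
print(pairwise_auc(y_true, good_scores))
print(pairwise_auc(y_true, constant_scores))
```

This view also shows why AUC needs continuous scores: with hard 0/1 predictions most pairs become ties.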
3. Hyperparameter tuning with grid search
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75), # key format: <step name>__<parameter name>
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    df = pd.read_csv('./SMSSpamCollection', delimiter='\t', header=None)
    X = df[1].values
    y = df[0].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # Evaluate the refitted best model on the held-out test set
    predictions = grid_search.predict(X_test)
    print('Accuracy: %s' % accuracy_score(y_test, predictions))
    print('Precision: %s' % precision_score(y_test, predictions))
    print('Recall: %s' % recall_score(y_test, predictions))
Best score: 0.985
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.5
    vect__max_features: 5000
    vect__ngram_range: (1, 2)
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.9791816223977028
Precision: 1.0
Recall: 0.8605769230769231
After tuning the parameters, recall improved (from about 0.70 to about 0.86).
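Grid search is exhaustive, so its cost grows multiplicatively with each parameter added. The sketch below counts the candidates in the parameter grid from this section (copied here so the block is self-contained) and shows how a key such as vect__max_df splits into a pipeline step name and a parameter name.

```python
from itertools import product

# Parameter grid copied from the grid search in this section
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# Every combination of values is one candidate configuration
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
n_candidates = len(candidates)

cv = 3  # number of folds used in the grid search above
total_fits = n_candidates * cv  # cross-validation fits, not counting the final refit
print(n_candidates, total_fits)

# Each key is "<step name>__<parameter name>"
step, _, param = 'vect__max_df'.partition('__')
print(step, param)
```

With several hundred candidates, trimming even one parameter from the grid cuts the running time noticeably.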
4. Multi-class classification
Predicting the sentiment of movie reviews
data = pd.read_csv("./chapter5_movie_train.csv", header=0, delimiter='\t')
data

data['Sentiment'].describe()

count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
On average the sentiment is fairly neutral (the mean is about 2.06 on a 0 to 4 scale).
# Fraction of examples in each sentiment class
data["Sentiment"].value_counts()/data["Sentiment"].count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
About 51% of the examples have neutral sentiment (class 2).
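Because one class dominates, a useful reference point is the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below derives that baseline from the class proportions shown above; the variable names are illustrative.

```python
# Class proportions copied from the value_counts output above
class_share = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}

# Always predicting the most frequent class is right exactly that often
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

print('Majority class: %s' % majority_class)
print('Baseline accuracy: %.3f' % baseline_accuracy)
```

Any trained model should be judged against this figure of roughly 0.51 rather than against zero.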
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df = pd.read_csv('./chapter5_movie_train.csv', header=0, delimiter='\t')
X, y = df['Phrase'], df['Sentiment'].values
# Hold out half of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)
print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s: %r' % (param_name, best_parameters[param_name]))

Best score: 0.619
Best parameters set:
    clf__C: 10
    vect__max_df: 0.25
    vect__ngram_range: (1, 2)
    vect__use_idf: False
- Performance metrics
predictions = grid_search.predict(X_test)
print('Accuracy: %s' % accuracy_score(y_test, predictions))
print('Confusion Matrix:')
print(confusion_matrix(y_test, predictions))
print('Classification Report:')
print(classification_report(y_test, predictions))

Accuracy: 0.6292323465333846
Confusion Matrix:
[[ 1013  1742   682   106    11]
 [  794  5914  6275   637    49]
 [  196  3207 32397  3686   222]
 [   28   488  6513  8131  1299]
 [    1    59   548  2388  1644]]
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.29      0.36      3554
           1       0.52      0.43      0.47     13669
           2       0.70      0.82      0.75     39708
           3       0.54      0.49      0.52     16459
           4       0.51      0.35      0.42      4640

    accuracy                           0.63     78030
   macro avg       0.55      0.48      0.50     78030
weighted avg       0.61      0.63      0.62     78030
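The figures in the classification report can be recovered from the confusion matrix alone: rows are true classes and columns are predicted classes, so recall divides each diagonal entry by its row sum and precision divides it by its column sum. The following sketch recomputes them from the matrix printed above.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = [
    [1013, 1742, 682, 106, 11],
    [794, 5914, 6275, 637, 49],
    [196, 3207, 32397, 3686, 222],
    [28, 488, 6513, 8131, 1299],
    [1, 59, 548, 2388, 1644],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified examples (the diagonal) over all examples
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Recall per class: diagonal entry over its row sum (all examples of that true class)
recalls = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision per class: diagonal entry over its column sum (all predictions of that class)
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n_classes)) for i in range(n_classes)]

# Macro average: unweighted mean over classes
macro_precision = sum(precisions) / n_classes
macro_recall = sum(recalls) / n_classes

print('Accuracy: %s' % accuracy)
print('Recall per class:', [round(r, 2) for r in recalls])
print('Precision per class:', [round(p, 2) for p in precisions])
print('Macro avg: %.2f %.2f' % (macro_precision, macro_recall))
```

The macro averages are noticeably lower than the accuracy because the minority classes 0 and 4, which the model handles worst, count as much as the dominant class 2.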
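5. Multi-label classification
- An instance can be assigned more than one label
Problem transformation:
- Turn each distinct combination of labels on an instance (say L1 and L2) into a single new class (L1 and L2), and so on. Drawback: this produces a large number of classes, and the model can only predict combinations that appear in the training data, so many possible combinations may not be covered
- Train one binary classifier per label (is this instance L1? is it L2?). Drawback: this ignores the relationships between labels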
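The two transformations above can be sketched on a toy set of label assignments. They are commonly called the label powerset and binary relevance approaches; those names, the toy data, and the "+" separator are assumptions added here for illustration.

```python
# Toy label sets for four instances (made up for illustration)
label_sets = [{"L1", "L2"}, {"L1"}, {"L2"}, {"L1", "L2"}]

# Transformation 1 (label powerset): each distinct combination becomes one class
powerset_targets = ["+".join(sorted(labels)) for labels in label_sets]
powerset_classes = sorted(set(powerset_targets))

# Transformation 2 (binary relevance): one 0/1 target vector per label
all_labels = sorted(set().union(*label_sets))
binary_targets = {
    label: [int(label in labels) for labels in label_sets]
    for label in all_labels
}

print(powerset_targets)
print(powerset_classes)
print(binary_targets)
```

With only two labels the powerset already needs three classes here; with k labels it can need up to 2^k, which is the drawback noted in the first bullet.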
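5.1 Performance metrics for multi-label classification
- Hamming loss: the average fraction of incorrectly predicted labels; 0 is best
- Jaccard similarity coefficient: the size of the intersection of the predicted and true labels divided by the size of their union; 1 is best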
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score
# help(jaccard_score)
# Rows are instances, columns are labels; y_true is shared by all six calls
y_true = np.array([[0.0, 1.0], [1.0, 1.0]])
print(hamming_loss(y_true, np.array([[0.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(y_true, np.array([[1.0, 1.0], [1.0, 1.0]])))
print(hamming_loss(y_true, np.array([[1.0, 1.0], [0.0, 1.0]])))
# average=None returns one Jaccard score per label (column)
print(jaccard_score(y_true, np.array([[0.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(y_true, np.array([[1.0, 1.0], [1.0, 1.0]]), average=None))
print(jaccard_score(y_true, np.array([[1.0, 1.0], [0.0, 1.0]]), average=None))

0.0
0.25
0.5
[1. 1.]
[0.5 1. ]
[0. 1.]
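To see where these numbers come from, the sketch below reimplements both metrics in plain Python following the definitions in 5.1 and applies them to the same three predictions. It is an illustrative reimplementation, not scikit-learn's code; returning 0.0 for an empty union is an assumption made here.

```python
def hamming_loss_py(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        1
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
        if t != p
    )
    return wrong / total

def jaccard_per_label(y_true, y_pred):
    """Intersection over union, computed separately for each label (column)."""
    scores = []
    for j in range(len(y_true[0])):
        pairs = [(true_row[j], pred_row[j]) for true_row, pred_row in zip(y_true, y_pred)]
        intersection = sum(1 for t, p in pairs if t == 1 and p == 1)
        union = sum(1 for t, p in pairs if t == 1 or p == 1)
        # Assumption: score an empty union as 0.0
        scores.append(intersection / union if union else 0.0)
    return scores

y_true = [[0, 1], [1, 1]]
predictions = [
    [[0, 1], [1, 1]],  # identical to y_true
    [[1, 1], [1, 1]],  # one wrong slot out of four
    [[1, 1], [0, 1]],  # first label wrong for both instances
]
for y_pred in predictions:
    print(hamming_loss_py(y_true, y_pred), jaccard_per_label(y_true, y_pred))
```

In the third case the second label is predicted perfectly (score 1) while the first label has no overlap at all (score 0), which a single averaged number would hide.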