Weighted fusion of multiple models is a common way to improve machine learning performance.
But how should each model's weight be determined?
One approach is to learn the weights with a linear or logistic regression model; this is usually called a stacking ensemble. However, such a scheme optimizes a differentiable loss, so it cannot directly optimize non-differentiable evaluation metrics such as AUC or accuracy.
Since optuna is a powerful tuning tool for non-differentiable problems, we can use it to search for the fusion weights and directly optimize non-differentiable metrics such as AUC or accuracy. Given enough search trials, its result is usually more competitive than a stacking ensemble.
Reply with the keyword optuna in the official account backend to get the Bilibili video walkthrough and the notebook source code.
A code example follows:
import numpy as np
from copy import deepcopy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# 1. Prepare the data
data, target = make_classification(n_samples=2000, n_features=20,
                                   n_informative=12, n_redundant=4, n_repeated=0,
                                   n_classes=2, n_clusters_per_class=4)
x_train, x_test, y_train, y_test = train_test_split(data, target)
# 2. Train 3 base models
tree = DecisionTreeClassifier()
mlp = MLPClassifier()
svc = SVC(probability=True)
mlp.fit(x_train, y_train)
tree.fit(x_train, y_train)
svc.fit(x_train, y_train);
# 3. Evaluate each single model
def get_test_auc(model):
    probs = model.predict_proba(x_test)[:, 1]
    val_auc = roc_auc_score(y_test, probs)
    return val_auc

print("mlp_score:", get_test_auc(mlp))
print("tree_score:", get_test_auc(tree))
print("svc_score:", get_test_auc(svc))
mlp_score: 0.9188172387295083
tree_score: 0.7185578893442623
svc_score: 0.923828125
The best of the three models is svc, with a test-set AUC of 0.9238.
# 4. Stacking ensemble
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

stacking = StackingClassifier(
    estimators=[('mlp', mlp), ('tree', tree), ('svc', svc)],
    final_estimator=LogisticRegression())
stacking.fit(x_train, y_train)
print("stacking_score:", get_test_auc(stacking))
stacking_score: 0.9304879610655739
As we can see, the stacking fusion improves the test-set AUC by 0.67 percentage points over the best single model (the SVM), reaching 0.9305.
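As an aside, the weights that stacking learned can be read off the fitted logistic-regression meta-learner. A minimal sketch, assuming scikit-learn's standard fitted attributes (final_estimator_, coef_, intercept_):

# Inspect the meta-learner fitted by StackingClassifier: its coefficients play a role
# analogous to fusion weights (in log-odds space) over the base models' predicted probabilities
meta = stacking.final_estimator_            # fitted LogisticRegression
print("stacking weights:", meta.coef_)      # one coefficient per base-model probability column
print("stacking bias:", meta.intercept_)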
# 5. Get out-of-fold CV predictions
# To make full use of the training data, follow a stacking-like procedure and
# collect each model's predictions on the training set with 5-fold CV
def get_cv_preds(model, x_train, y_train):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    cv_preds = np.zeros(len(y_train))
    for idx, (train_idx, valid_idx) in enumerate(cv.split(x_train, y_train)):
        xtrain_i, xvalid_i = x_train[train_idx], x_train[valid_idx]
        ytrain_i, yvalid_i = y_train[train_idx], y_train[valid_idx]
        model_idx = deepcopy(model)
        model_idx.fit(xtrain_i, ytrain_i)
        probs_valid_idx = model_idx.predict_proba(xvalid_i)[:, 1]
        cv_preds[valid_idx] = probs_valid_idx
    return cv_preds

preds_cv = {name: get_cv_preds(eval(name), x_train, y_train)
            for name in ['mlp', 'tree', 'svc']}
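As a quick sanity check, each base model's out-of-fold AUC on the training set can be computed from these CV predictions; a minimal sketch:

# Out-of-fold AUC of each base model on the training set
for name in ['mlp', 'tree', 'svc']:
    print(name, "cv_auc:", roc_auc_score(y_train, preds_cv[name]))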
# Remove any previous study database so that create_study(load_if_exists=False) starts fresh
!rm optuna.db
# 6. Search fusion weights with optuna
import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    # Each model gets an integer weight; only the ratios matter since we normalize by the sum
    weights = {name: trial.suggest_int(name, 1, 100) for name in ['mlp', 'tree', 'svc']}
    probs = sum([weights[name] * preds_cv[name] for name in ['mlp', 'tree', 'svc']]) / sum(
        [weights[name] for name in ['mlp', 'tree', 'svc']])
    cv_auc = roc_auc_score(y_train, probs)
    trial.report(cv_auc, 0)
    return cv_auc

storage_name = "sqlite:///optuna.db"
study = optuna.create_study(
    direction="maximize",
    study_name="optuna_ensemble", storage=storage_name, load_if_exists=False
)
study.optimize(objective, n_trials=900, timeout=600)

best_params = study.best_params
best_value = study.best_value
print("\n\nbest_value = " + str(best_value))
print("best_params:")
print(best_params)
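Because the objective divides by the sum of the weights, only their ratios matter; the searched integers can be normalized into fractional fusion weights for readability, e.g.:

# Convert the searched integer weights into normalized fusion weights that sum to 1
total = sum(best_params.values())
norm_weights = {name: w / total for name, w in best_params.items()}
print("normalized weights:", norm_weights)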
# 7. Evaluate the optuna-weighted fusion on the test set
preds_test = {name: eval(name).predict_proba(x_test)[:, 1] for name in ['mlp', 'tree', 'svc']}

def test_score(weights):
    probs = sum([weights[name] * preds_test[name] for name in ['mlp', 'tree', 'svc']]) / sum(
        [weights[name] for name in ['mlp', 'tree', 'svc']])
    test_auc = roc_auc_score(y_test, probs)
    return test_auc

print('optuna_ensemble_score:',
      test_score(best_params))
optuna_ensemble_score: 0.9320248463114754
Nice. The optuna-based fusion improves the test-set AUC by 0.82 percentage points over the SVM, reaching 0.9320, which makes it very competitive with the stacking approach.
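Since the study is persisted in sqlite:///optuna.db, it can be reloaded later to inspect the trials or to continue the search; a minimal sketch using optuna.load_study:

import optuna

# Reload the persisted study from the sqlite storage created above
study = optuna.load_study(study_name="optuna_ensemble", storage="sqlite:///optuna.db")
print(len(study.trials), "trials so far, best cv_auc:", study.best_value)

# Continue the search with additional trials if desired
# study.optimize(objective, n_trials=300)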