Ensemble learning combines the predictions of multiple classifiers into a final decision in order to obtain better classification or regression performance. A single classifier usually suits only a particular kind of data and rarely yields the best possible model; averaging the predictions of several different algorithms can give a better model than any single classifier alone (a minimal sketch of this idea follows the list below). Bagging, boosting, and random forests are the three most widely used families of ensemble learning algorithms.
- Bagging: a voting-style algorithm. Bootstrap sampling first produces different training datasets, a base classifier is trained on each, and the classifiers are then combined into a single, usually better, model.
- Boosting: similar to bagging, except that the classifiers are built sequentially; each new round depends on the results of the previous ones, learning on top of the earlier misclassifications and compensating for them.
- Random forest: a classifier made up of many decision trees whose result is decided by voting. Each tree is grown on a random subset of the features and casts a vote; the class with the most votes is the classification output, while for regression the average of the tree predictions is returned.
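A minimal sketch of that averaging idea (my own illustration, not from the book, and assuming the e1071 package for naiveBayes): two different learners are trained on iris, their class-probability matrices are averaged, and the class with the highest averaged probability wins.
library(rpart)
library(e1071)

set.seed(1)
idx   <- sample(nrow(iris), 100)                   # simple train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

p.tree <- predict(rpart(Species ~ ., data = train), test, type = "prob")
p.nb   <- predict(naiveBayes(Species ~ ., data = train), test, type = "raw")

p.avg <- (p.tree + p.nb) / 2                       # average the two probability matrices
pred  <- colnames(p.avg)[max.col(p.avg)]           # class with the highest averaged probability
mean(pred == test$Species)                         # accuracy of the averaged ensemble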
8.2 Classifying data with the bagging method
The adabag package supports both bagging and boosting; its bagging is Breiman's algorithm (Breiman first proposed the idea of bootstrap-aggregated classifiers).
install.packages("adabag")
library(adabag)
# use the iris data in place of the churn dataset here
data(iris)
churnTrain <- iris
ind <- sample(2,nrow(churnTrain),replace = TRUE,
prob = c(0.7,0.3))
trainset <- churnTrain[ind==1,]
testset <- churnTrain[ind==2,]
set.seed(2)
churn.bagging <- bagging(Species~., data = trainset, mfinal = 10) # mfinal = 10: number of bagging iterations (trees)
churn.bagging$importance
Petal.Length Petal.Width Sepal.Length Sepal.Width
75.53879 24.46121 0.00000 0.00000
churn.predbagging <- predict.bagging(churn.bagging, newdata = testset)
churn.predbagging$confusion
Observed Class
Predicted Class setosa versicolor virginica
setosa 18 0 0
versicolor 0 19 0
virginica 0 2 15
churn.predbagging$error
[1] 0.03703704
Bagging is short for bootstrap aggregating. It is stable, accurate, powerful, and easy to implement, and is commonly used for classification and regression. The algorithm is defined as follows: given a dataset of size n, draw m new datasets D_i by bootstrap sampling, fit one model on each of the m samples, and then combine the m models into the final model. Its main drawback is that the combined model is hard to interpret. A hand-rolled sketch of this definition is shown below; after it comes an alternative implementation from the ipred package.
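To make the definition above concrete, here is a minimal hand-rolled sketch (an illustration only, not the adabag or ipred implementation): m bootstrap samples of the iris trainset, one rpart tree per sample, and a majority vote over the trees.
library(rpart)

m <- 25                                                      # number of bootstrap datasets D_i
votes <- sapply(1:m, function(i) {
  Di  <- trainset[sample(nrow(trainset), replace = TRUE), ]  # bootstrap sample of size n
  fit <- rpart(Species ~ ., data = Di)                       # one base model per sample
  as.character(predict(fit, testset, type = "class"))        # that model's votes on the test set
})
# aggregate: for each test row, the class with the most votes wins
bag.pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bag.pred != testset$Species)                            # test-set misclassification rate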
The ipred package implements the same functionality; in my tests it ran much faster than the adabag run above (which still had not finished after half an hour), presumably because it does not perform cross-validation.
library(ipred)  # note: once ipred is loaded, its bagging() masks adabag::bagging
# the output below comes from the churn data, so trainset/testset are the churn splits again
churn.bagging <- bagging(churn~., data = trainset, coob=TRUE)
churn.bagging
Bagging classification trees with 25 bootstrap replications
Call: bagging.data.frame(formula = churn ~ ., data = trainset, coob = TRUE)
Out-of-bag estimate of misclassification error: 0.0606
# misclassification rate
mean(predict(churn.bagging)!=trainset$churn)
[1] 0.06115418
# prediction
churn.prediction <- predict(churn.bagging, newdata = testset, type = "class")
prediction.table <- table(churn.prediction, testset$churn)
prediction.table
churn.prediction yes no
yes 170 16
no 57 1274
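As a small follow-up (not part of the original recipe), the overall test error can be read directly off the confusion table above:
# overall test-set misclassification rate from the table above
1 - sum(diag(prediction.table)) / sum(prediction.table)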
8.3 Performing cross-validation with the bagging method
Cross-validation is used here to assess the robustness of the classification model.
# cv
churn.baggingcv <- bagging.cv(Species~., v = 10, data = trainset,
                              mfinal = 10)
# With the churn formula the same call fails:
# Error in bagging.cv(churn ~ ., v = 10, data = trainset, mfinal = 10) :
#   v should be in [2, n]
# a problem with the original dataset, so iris is used here instead
churn.baggingcv$confusion
Observed Class
Predicted Class setosa versicolor virginica
setosa 18 0 0
versicolor 0 19 0
virginica 0 2 15
# misclassification rate
churn.predbagging$error
[1] 0.03703704
The churn dataset throws an error here, so the simpler iris dataset is used as a workaround; a nice side effect is that it saves time.
8.4 Classifying data with the boosting method
adabag implements both the AdaBoost.M1 and SAMME algorithms.
# boosting
set.seed(2)
churn.boost <- boosting(Species~., data = trainset, mfinal = 3,
coeflearn = "Freund", boos = FALSE,
control=rpart.control(maxdepth=3))
churn.boost.pred <- predict.boosting(churn.boost, newdata = testset)
churn.boost.pred$confusion
Observed Class
Predicted Class setosa versicolor virginica
setosa 18 0 0
versicolor 0 20 1
virginica 0 1 14
churn.boost.pred$error
[1] 0.03703704
The idea behind boosting is to start from weak classifiers (such as a single decision tree) and improve them step by step, by re-weighting the observations, until they form a strong classifier. Both bagging and boosting are ensemble methods; the difference is that bagging combines independently built models, whereas boosting learns iteratively. In the call above, mfinal is the number of iterations, coeflearn is the weight-update scheme, boos controls whether each tree is trained on a weighted bootstrap sample of the observations, and control passes the rpart settings for the individual trees. A minimal sketch of the re-weighting idea follows, and after it an extension using caret and the ada package.
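To make the re-weighting concrete, here is a minimal sketch of a single AdaBoost-style round on a two-class subset of iris. It is an illustration only, not adabag's implementation; the weak learner is a depth-1 rpart stump and the update formula is the classic AdaBoost one.
library(rpart)

set.seed(1)
toy <- iris[iris$Species != "setosa", ]                # binary problem: versicolor vs virginica
toy$Species <- factor(toy$Species)
w <- rep(1 / nrow(toy), nrow(toy))                     # start with uniform observation weights

stump <- rpart(Species ~ ., data = toy, weights = w,
               control = rpart.control(maxdepth = 1))  # weak learner: a single split
pred  <- predict(stump, toy, type = "class")

err   <- sum(w * (pred != toy$Species))                # weighted training error
alpha <- 0.5 * log((1 - err) / err)                    # weight of this weak learner
w <- w * exp(alpha * (pred != toy$Species))            # up-weight the misclassified observations
w <- w / sum(w)                                        # renormalise; the next stump trains on these weights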
Extension: the caret package can tune an ada boosting model with repeated cross-validation.
install.packages(c("mboost","ada"))
library(mboost)
library(pROC)
library(caret)
install.packages("MLmetrics")
set.seed(2)
ctrl <- trainControl(method = "repeatedcv", repeats = 1,
classProbs = TRUE,
summaryFunction = twoClassSummary)
ada.train <- train(churn~.,data = trainset, method = "ada",
metric = "ROC", trControl = ctrl)
# iris throws an error here, so this switches back to the churn dataset
nu maxdepth iter ROC Sens Spec ROCSD SensSD SpecSD
1 0.1 1 50 0.8600045 0.9090204 0.010719176 0.03719839 0.05786791 0.007708342
...
plot(ada.train)
ada.predict <- predict(ada.train, testset, "prob")
ada.predict.result <- ifelse(ada.predict[, "yes"] > 0.5, "yes", "no")
table(testset$churn, ada.predict.result)
ada.predict.result
no yes
yes 71 143
no 1301 6
One of the few figures in this chapter: the resampling profile drawn by plot(ada.train).
8.5 Performing cross-validation with the boosting method
churn.boostingcv <- boosting.cv(Species~., v=10, data = trainset,
mfinal = 5, control = rpart.control(cp=0.01))
churn.boostingcv$confusion
Observed Class
Predicted Class setosa versicolor virginica
setosa 32 0 0
versicolor 0 26 3
virginica 0 3 32
churn.boostingcv$error
[1] 0.0625
8.6 Classifying data with the gradient boosting method
Gradient boosting also combines weak classifiers into an ensemble, but each new base learner is chosen to be maximally correlated with the negative gradient of the loss function. It can be used for both regression and classification and adapts well to many different datasets.
# gradient boosting
install.packages("gbm")
library(gbm)
# gbm's bernoulli loss expects a 0/1 response, so convert churn first
trainset$churn <- ifelse(trainset$churn =="yes", 1,0)
set.seed(2)
churn.gbm <- gbm(formula = churn ~ ., distribution = "bernoulli", data = trainset,
n.trees = 1000, interaction.depth = 7, shrinkage = 0.01,
cv.folds = 3) # shrinkage: step-size reduction, i.e. the learning rate; interaction.depth: maximum depth of each tree
summary(churn.gbm)
var rel.inf
total_day_minutes total_day_minutes 29.8623601
total_eve_minutes total_eve_minutes 14.6407627
number_customer_service_calls number_customer_service_calls 12.5827527
total_intl_minutes total_intl_minutes 9.6529151
...
# use cross-validation to determine the best number of iterations
churn.iter <- gbm.perf(churn.gbm, method = "cv")
# the predictions are log-odds values under the Bernoulli loss function
churn.predict <- predict(churn.gbm, testset, n.trees = churn.iter)
str(churn.predict)
num [1:1521] -3.56 -3.36 -2.99 -3.82 -3.52 ...
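Because those predictions are on the log-odds scale, they can be mapped to churn probabilities with plogis(); this is a convenience step of mine, not part of the original recipe.
churn.prob <- plogis(churn.predict)  # log-odds -> probability, i.e. exp(x) / (1 + exp(x))
head(churn.prob)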
# ROC curve: find the cutoff that gives the best accuracy
churn.roc <- roc(testset$churn, churn.predict)
plot(churn.roc)
# coords() returns the best cutoff
coords(churn.roc, "best")
threshold specificity sensitivity
1 -0.7369319 0.8738318 0.9869931
churn.predict.class <- ifelse(churn.predict >c(coords(churn.roc,"best")["threshold"]),
"yes","no")
table(testset$churn, churn.predict.class)
churn.predict.class
no yes
yes 27 187
no 1290 17
The idea of the algorithm is as follows: at each stage the variance of the residuals is computed for every candidate split of the data, and the best split is chosen on that basis; the selected model then takes the residuals left by the previous stage as its new learning target, shrinking them further. This is gradient descent: each step moves along the direction of the negative gradient of the loss so that the remaining variance is minimised. A hand-rolled sketch of this residual-fitting loop follows, and after it an extension using the mboost package.
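The following sketch (an illustration only, not gbm's code) shows the residual-fitting loop on a toy regression problem: each small tree is fit to the residuals of the current model and its prediction is added with a small learning rate.
library(rpart)

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)
d <- data.frame(x = x, y = y)

f   <- rep(mean(d$y), nrow(d))   # start from a constant model
eta <- 0.1                       # shrinkage / learning rate

for (m in 1:50) {
  d$res <- d$y - f                                 # residuals = negative gradient of the squared loss
  tree  <- rpart(res ~ x, data = d,
                 control = rpart.control(maxdepth = 2))
  f     <- f + eta * predict(tree, d)              # take a small step along the new base learner
}
mean((d$y - f)^2)                # training MSE after 50 boosting steps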
Extension: model-based boosting with mboost.
library(mboost)
# mboost only accepts numeric predictors, so drop or convert the non-numeric columns. (This also seems to be the source of the earlier error: the yes/no conversion only worked after wrapping "yes" in c(), which makes little sense, but it gets the job done.)
trainset$churn <- ifelse(trainset$churn ==c("yes"),1,0)
trainset$voice_mail_plan = NULL
trainset$international_plan = NULL
churn.mboost <- mboost(churn ~., data = trainset, control = boost_control(mstop = 10))
summary(churn.mboost)
Model-based Boosting
Call:
mboost(formula = churn ~ ., data = trainset, control = boost_control(mstop = 10))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 10
Step size: 0.1
Offset: 0.1417074
Number of baselearners: 14
Selection frequencies:
bbs(total_day_minutes) bbs(number_customer_service_calls)
0.6 0.4
par(mfrow=c(1,2))
plot(churn.mboost)
Partial contributions of the important attributes (the mboost plots).
8.7 Calculating the margins of a classifier
boost.margins <- margins(churn.boost, trainset)
boost.pred.margins <- margins(churn.boost.pred, testset)
plot(sort(boost.margins[[1]]),
(1:length(boost.margins[[1]]))/length(boost.margins[[1]]),
type = 'l', xlim = c(-1,1),
main = "Boosting:Magrin cumulative distribution graph",
xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(boost.pred.margins[[1]]),
(1:length(boost.pred.margins[[1]]))/length(boost.pred.margins[[1]]),
type = "l", col="green")
abline(v=0, col='red', lty=2)
Margin cumulative distribution graph of the boosting classifier.
# percentage of negative margins (misclassified observations) in the training and test sets
boosting.training.margin <- table(boost.margins[[1]]>0)
boosting.negative.training <- as.numeric(boosting.training.margin[1])/boosting.training.margin[2]
boosting.negative.training
TRUE
0.0212766
# compute the margins of the bagging classifier
bagging.margins = margins(churn.bagging, trainset)
bagging.pred.margins <- margins(churn.predbagging,testset)
plot(sort(bagging.margins[[1]]),
(1:length(bagging.margins[[1]]))/length(bagging.margins[[1]]),
type = "l", xlim = c(-1,1),
main = "Bagging:Magrin cumulative distribution graph",
xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(bagging.pred.margins[[1]]),
(1:length(bagging.pred.margins[[1]]))/length(bagging.pred.margins[[1]]),
type = "l", col="green")
abline(v=0, col='red', lty=2)
# the same percentage for the bagging classifier
bagging.training.margin <- table(bagging.margins[[1]]>0)
bagging.negative.training <- as.numeric(bagging.training.margin[1])/bagging.training.margin[2]
bagging.negative.training
The margin measures how certain a classifier is about a prediction; it is computed from the number of votes for the correct class and the largest number of votes for any incorrect class. Correctly classified observations build positive margins and misclassified observations produce negative margins; a margin close to 1 means the correct classification is highly reliable, while observations the classifier is unsure about only get small margins. The margins function computes the margins of AdaBoost.M1, AdaBoost-SAMME, and bagging classifiers and returns a margin vector; plotting its cumulative distribution shows how the margins are spread. If every observation were classified correctly, the curve would be a vertical line at margin 1. Typically the proportion of negative margins (misclassified cases) in the training set is close to the proportion in the test set. A toy worked example of the definition follows.
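A toy worked example of the margin definition (my own illustration, not adabag's internals): suppose 25 bagged trees vote on one observation whose true class is "yes".
votes      <- c(yes = 21, no = 4)        # votes of 25 trees for one observation
true.class <- "yes"
(votes[true.class] - max(votes[names(votes) != true.class])) / sum(votes)
# (21 - 4) / 25 = 0.68: a confidently correct classification; a misclassified
# observation would come out with a negative margin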
8.8 Calculating the error evolution of the ensemble method
# error evolution
boosting.evol.train <- errorevol(churn.boost, trainset)
boosting.evol.test <- errorevol(churn.boost, testset)
plot(boosting.evol.test$error, type = "l", ylim = c(0,1),
main = "Boosting error versus number of trees",
xlab = "Iteration", ylab = "Error", col='red',
lwd=2)
lines(boosting.evol.train$error, cex = .5, col='blue',lty=2,
lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
lty = 1:2, lwd=2)
The adabag package provides the errorevol function so that the error of an ensemble classifier can be estimated as a function of the number of iterations.
# bagging
# error evolution
bagging.evol.train <- errorevol(churn.bagging, trainset)
bagging.evol.test <- errorevol(churn.bagging, testset)
plot(bagging.evol.test$error, type = "l", ylim = c(0,1),
main = "Bagging error versus number of trees",
xlab = "Iteration", ylab = "Error", col='red',
lwd=2)
lines(bagging.evol.train$error, cex = .5, col='blue',lty=2,
lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
lty = 1:2, lwd=2)
The plots show how the classification error changes after each iteration; if the error curves reveal overfitting, predict.bagging and predict.boosting can be used to prune the ensemble, as sketched below.
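A hedged sketch of that pruning: as far as I can tell, adabag's predict.bagging and predict.boosting accept a newmfinal argument that re-evaluates the ensemble using only its first newmfinal trees, so an over-grown ensemble can be cut back after inspecting the error-evolution plots (newmfinal must not exceed the mfinal used for fitting).
# evaluate the boosting ensemble with only 2 of its mfinal = 3 trees
churn.boost.pruned <- predict.boosting(churn.boost, newdata = testset, newmfinal = 2)
churn.boost.pruned$error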
8.9 Classifying data with random forests
Training grows many decision trees; each tree produces a prediction for a given input, and a voting mechanism picks the most frequent class as the final result.
# random forest
install.packages("randomForest")
library(randomForest)
# note: this assumes trainset is the original churn training set with churn as a factor
# (re-create it if it was modified in the gbm/mboost sections above)
churn.rf <- randomForest(churn ~., data = trainset, importance=T) # importance = T: assess the importance of the predictors
churn.rf
Call:
randomForest(formula = churn ~ ., data = trainset, importance = T)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 4.31%
Confusion matrix:
yes no class.error
yes 363 130 0.263691684
no 20 2966 0.006697924
# class prediction
churn.prediction <- predict(churn.rf, testset)
table(churn.prediction, testset$churn)
churn.prediction yes no
yes 167 7
no 47 1300
plot(churn.rf)
importance(churn.rf)
yes no MeanDecreaseAccuracy
international_plan 93.0223581 72.5504101 95.5848053
voice_mail_plan 22.5321109 18.2760474 22.8558091
number_vmail_messages 23.0980210 17.5029154 22.6011108
total_day_minutes 33.4914749 33.8653396 43.0515228
varImpPlot(churn.rf)
margins.rf <- margin(churn.rf, trainset)
plot(margins.rf)
hist(margins.rf, main = "Margins of Random Forest for churn dataset")
boxplot(margins.rf~ trainset$churn, main = "Margins of Random Forest for churn dataset by class")
A random forest combines many weak learners (decision trees) into one strong learner. The procedure is very similar to bagging: each tree is grown on a bootstrap sample, and at each split only a random subset of predictors is searched for the one giving the best split. For regression the final output is the average (or weighted average) of all tree predictions; for classification the most frequent predicted class is taken as the final prediction. The algorithm has two key parameters: ntree, the number of trees, and mtry, the number of features considered when looking for the best split; bagging effectively uses only the former, and if mtry equals the number of features in the training data the random forest reduces to bagging. Its biggest advantages are that it is easy to compute, efficient, and fairly tolerant of missing or unbalanced data; its main drawbacks are that it cannot predict beyond the range of the training data and that noisy data can make it overfit. A short illustration of the two parameters follows; after it comes an extension, since the cforest function in the party package also implements random forests.
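A small illustration of those two parameters (my own sketch, again assuming trainset is the original churn training set): with mtry equal to the number of predictors, randomForest behaves like plain bagging of trees, while a default-style mtry for classification is roughly the square root of the number of predictors.
p <- ncol(trainset) - 1                                          # number of predictor columns
churn.rf.bag <- randomForest(churn ~ ., data = trainset,
                             ntree = 500, mtry = p)               # mtry = all predictors: bagging
churn.rf.def <- randomForest(churn ~ ., data = trainset,
                             ntree = 500, mtry = floor(sqrt(p)))  # default-style mtry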
# extension: conditional inference random forests via party::cforest
install.packages("party")
library(party)
churn.cforest <- cforest(churn~., data = trainset,
controls = cforest_unbiased(ntree=1000,mtry=5))
churn.forest.prediction <- predict(churn.cforest, testset, OOB=TRUE, type = "response")
table(churn.forest.prediction, trainset$churn) # odd: a test-set prediction tabulated against trainset$churn, flagging it with a question mark (testset$churn is presumably what was intended)
churn.forest.prediction yes no
yes 348 21
no 145 2965
8.10 Estimating the prediction errors of different classifiers
The errorest function performs 10-fold cross-validation for several classification algorithms, to check whether the ensemble classifiers really do outperform a single decision tree.
# ipred errorest
library(ipred)
churn.bagging <- errorest(churn ~., data = trainset, model = bagging);churn.bagging
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = bagging)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.052
library(ada)
churn.mboosting <- errorest(churn ~., data = trainset, model = ada);churn.mboosting
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = ada)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.048
churn.rf <- errorest(churn~., data = trainset, model = randomForest);churn.rf
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = randomForest)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.0454
library(rpart)
# a prediction wrapper is needed so that errorest gets class labels back from rpart
churn.predict <- function(object, newdata) predict(object, newdata = newdata, type = "class")
churn.tree <- errorest(churn~., data = trainset, model = rpart, predict=churn.predict);churn.tree
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = rpart,
predict = churn.predict)
10-fold cross-validation estimator of misclassification error
Misclassification error: 0.0606
randomForest gives the lowest misclassification rate and the best performance, the single tree performs worst, and the ensemble learners all beat the single tree; the ada package provides the boosting classifier used here.
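As a convenience step of mine (assuming each errorest result stores its estimate in the error component), the four estimates can be gathered into a single data frame for a side-by-side comparison:
data.frame(method = c("bagging", "ada boosting", "randomForest", "rpart"),
           error  = c(churn.bagging$error, churn.mboosting$error,
                      churn.rf$error, churn.tree$error))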