Chapter 8 Ensemble Learning: Notes


Ensemble learning combines the predictions of several classifiers into a final decision in order to obtain better classification or regression performance. A single classifier usually suits only one particular type of data, so it is hard to guarantee the best model from one learner; averaging (or voting over) the predictions of different algorithms can give a better model than any single classifier. Bagging, boosting and random forest are the three most widely used families of ensemble learning algorithms (a minimal voting sketch follows the list below).

  • bagging: a voting-style algorithm; bootstrap sampling first produces different training sets, a base classifier is fitted on each, and the classifiers are finally combined into a model that is usually better than any single one.
  • Boosting: similar to bagging, except that the classifiers are built sequentially; each round depends on the results of the previous classifiers and learns from their misclassifications, compensating for earlier errors.
  • Random forest: a classifier made up of many decision trees whose outputs are combined by voting; each tree produces its own class prediction, the class with the most votes across the trees is taken as the classification result, and the average of the trees' outputs is used as the regression output.
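
As a minimal illustration of the voting idea (a hypothetical sketch, not code from the book): fit a few independent trees on bootstrap samples and combine their class predictions by majority vote.

Code (R):
# Minimal majority-vote sketch (illustrative only): three rpart trees on
# bootstrap samples of iris, combined by voting.
library(rpart)
set.seed(1)
votes <- sapply(1:3, function(i) {
  idx <- sample(nrow(iris), replace = TRUE)          # bootstrap sample
  fit <- rpart(Species ~ ., data = iris[idx, ])      # one base classifier
  as.character(predict(fit, iris, type = "class"))   # its class predictions
})
# majority vote per observation
final <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(final == iris$Species)                          # accuracy of the ensemble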

8.2 Classifying data with the bagging method

The adabag package supports both bagging and boosting; its bagging implementation is Breiman's bagging algorithm, one of the earliest proposals of an ensemble classifier.

Code (R):
install.packages("adabag")
library(adabag)
# the churn data was very slow here, so the iris data set is used instead

data(iris)
churnTrain <- iris
ind <- sample(2,nrow(churnTrain),replace = TRUE,
              prob = c(0.7,0.3))
trainset <- churnTrain[ind==1,]
testset <- churnTrain[ind==2,]
set.seed(2)
churn.bagging <- bagging(Species~., data = trainset, mfinal = 10) # mfinal = 10 iterations
churn.bagging$importance
Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
    75.53879     24.46121      0.00000      0.00000 
# predict on the test set before building the confusion matrix
churn.predbagging <- predict.bagging(churn.bagging, newdata = testset)
churn.predbagging$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         19         0
     virginica       0          2        15
churn.predbagging$error
[1] 0.03703704

Bagging is short for bootstrap aggregating. It is stable, accurate, powerful and easy to implement, and is commonly used for both classification and regression. The algorithm is defined as follows: given a data set of size n, draw m new data sets D_i by bootstrap sampling, fit one model on each of the m samples, and then combine the m models into the final model; a hand-rolled sketch of exactly this recipe is shown below. Its drawback is that the result is hard to interpret. As an extension, the ipred package implements the same functionality; in my test it ran extremely fast, while the adabag run above still had not finished after half an hour, presumably because ipred does not perform the cross-validation.
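
A hand-rolled sketch of the definition above (bootstrap m data sets D_i, fit one tree per D_i, combine by voting), assuming trainset and testset are the iris splits created earlier; this is only an illustration, not the adabag or ipred implementation.

Code (R):
# Bagging by hand, following the definition above (illustrative sketch)
library(rpart)
set.seed(2)
m <- 10
models <- lapply(1:m, function(i) {
  Di <- trainset[sample(nrow(trainset), replace = TRUE), ]  # bootstrap sample D_i
  rpart(Species ~ ., data = Di)                             # one base model per D_i
})
# combine the m models by majority vote on the test set
pred.mat <- sapply(models, function(mod)
  as.character(predict(mod, testset, type = "class")))
bag.pred <- apply(pred.mat, 1, function(v) names(which.max(table(v))))
table(bag.pred, testset$Species)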

Code (R):
# extension: ipred's bagging (note: this block goes back to the original churn
# training split, with churn as a factor, not the iris split above)
library(ipred)   # ipred's bagging masks adabag's bagging once loaded
churn.bagging <- bagging(churn~., data = trainset, coob=TRUE)
churn.bagging
Bagging classification trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = churn ~ ., data = trainset, coob = TRUE)

Out-of-bag estimate of misclassification error:  0.0606 
# misclassification rate on the training data
mean(predict(churn.bagging)!=trainset$churn)
[1] 0.06115418
# prediction on the test set
churn.prediction <- predict(churn.bagging, newdata = testset, type = "class")
prediction.table <- table(churn.prediction, testset$churn)
prediction.table
churn.prediction  yes   no
            yes  170   16
            no    57 1274

8.3 Cross-validating the bagging method

This evaluates the robustness of the classification model.

Code (R):
# cross-validation
churn.baggingcv <- bagging.cv(Species~., v = 10, data = trainset,
                              mfinal = 10)
# On the churn data this raised:
# Error in bagging.cv(churn ~ ., v = 10, data = trainset, mfinal = 10) : 
#   v should be in [2, n]
# so the iris split is used here instead
churn.baggingcv$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         19         0
     virginica       0          2        15
# misclassification rate
churn.baggingcv$error
[1] 0.03703704

The churn data set triggers the error above, so the simpler iris data set is used as a workaround, which also saves a lot of time.

8.4 Classifying data with the boosting method

The adabag package implements both the AdaBoost.M1 and SAMME algorithms.

Code (R):
# boosting (on the iris training split)
set.seed(2)
churn.boost <- boosting(Species~., data = trainset, mfinal = 3,
                        coeflearn = "Freund", boos = FALSE,
                        control=rpart.control(maxdepth=3))
churn.boost.pred <- predict.boosting(churn.boost, newdata = testset)
churn.boost.pred$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         18          0         0
     versicolor      0         20         1
     virginica       0          1        14
churn.boost.pred$error
[1] 0.03703704

The idea of boosting is to take weak classifiers (such as a single decision tree) and improve them step by step by reweighting the observations, until the combination becomes a strong classifier. Both bagging and boosting are ensemble methods; the difference is that bagging combines independently built models, while boosting learns iteratively. In the call above, mfinal is the number of iterations, coeflearn is the weight-update coefficient, boos controls whether observations are drawn using their weights, and rpart.control configures the single-tree weak learner. A simplified sketch of one reweighting round follows; after it, as an extension, the mboost and ada packages are tried.
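
To make the reweighting idea concrete, here is a rough sketch of a single AdaBoost.M1-style round with the Freund coefficient alpha = ln((1 - err) / err), assuming trainset is the iris split used in the boosting call above; it is a simplified illustration, not adabag's internal code.

Code (R):
# One simplified AdaBoost-style round (illustration, not adabag internals)
library(rpart)
w <- rep(1 / nrow(trainset), nrow(trainset))            # start from uniform weights
stump <- rpart(Species ~ ., data = trainset, weights = w,
               control = rpart.control(maxdepth = 1))   # weak learner
miss  <- predict(stump, trainset, type = "class") != trainset$Species
err   <- sum(w[miss]) / sum(w)                          # weighted error
alpha <- log((1 - err) / err)                           # Freund update coefficient
w     <- w * exp(alpha * miss)                          # up-weight the misclassified cases
w     <- w / sum(w)                                     # renormalise for the next round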

Code (R):
install.packages(c("mboost","ada"))
library(mboost)
library(pROC)
library(caret)
library(ada)
install.packages("MLmetrics")
set.seed(2)
ctrl <- trainControl(method = "repeatedcv", repeats = 1,
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary)
# iris fails here, so this switches back to the churn training split
ada.train <- train(churn~.,data = trainset, method = "ada",
                   metric = "ROC", trControl = ctrl)
ada.train
  nu maxdepth iter       ROC      Sens        Spec      ROCSD     SensSD      SpecSD
1 0.1        1   50 0.8600045 0.9090204 0.010719176 0.03719839 0.05786791 0.007708342
...
plot(ada.train)
ada.predict <- predict(ada.train, testset, "prob")
ada.predict.result <- ifelse(ada.predict[1]>0.5, "yes", "no")
table(testset$churn, ada.predict.result)
     ada.predict.result
        no  yes
  yes   71  143
  no  1301    6

One of the few figures in this chapter: the plot of the ada tuning results.

8.5 Cross-validating the boosting method

Code (R):
churn.boostingcv <- boosting.cv(Species~., v=10, data = trainset,
            mfinal = 5, control = rpart.control(cp=0.01))
churn.boostingcv$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         32          0         0
     versicolor      0         26         3
     virginica       0          3        32
churn.boostingcv$error
[1] 0.0625

8.6 Classifying data with the gradient boosting method

Gradient boosting also combines weak classifiers; at each step the new base learner is chosen to be maximally correlated with the negative gradient of the loss function. It can be used for both regression and classification and adapts well to many kinds of data.

Code (R):
# gradient boosting
install.packages("gbm")
library(gbm)
# the Bernoulli response must be coded 0/1, so convert it
trainset$churn <- ifelse(trainset$churn =="yes", 1,0)
set.seed(2)
# shrinkage: step-size reduction, i.e. the learning rate;
# interaction.depth: maximum depth of each decision tree
churn.gbm <- gbm(formula = churn ~ ., distribution = "bernoulli", data = trainset,
                 n.trees = 1000, interaction.depth = 7, shrinkage = 0.01,
                 cv.folds = 3)
summary(churn.gbm)
                                                       var    rel.inf
total_day_minutes                         total_day_minutes 29.8623601
total_eve_minutes                         total_eve_minutes 14.6407627
number_customer_service_calls number_customer_service_calls 12.5827527
total_intl_minutes                       total_intl_minutes  9.6529151
...
# cross-validation to pick the best number of iterations
churn.iter <- gbm.perf(churn.gbm, method = "cv")
# predictions are on the log-odds scale of the Bernoulli loss
churn.predict <- predict(churn.gbm, testset, n.trees = churn.iter)
str(churn.predict)
num [1:1521] -3.56 -3.36 -2.99 -3.82 -3.52 ...
# ROC curve, to find the cutoff that maximises accuracy
churn.roc <- roc(testset$churn, churn.predict)
plot(churn.roc)
# coords returns the best cutoff
coords(churn.roc, "best")
   threshold specificity sensitivity
1 -0.7369319   0.8738318   0.9869931
churn.predict.class <- ifelse(churn.predict >c(coords(churn.roc,"best")["threshold"]), 
                              "yes","no")
table(testset$churn, churn.predict.class)
     churn.predict.class
        no  yes
  yes   27  187
  no  1290   17

The idea of the algorithm is as follows: at each stage the residual error of the current fit is computed and the best split with respect to it is determined; the next model is then fitted to the residuals left by the previous stage, shrinking them further. This is gradient descent: the model moves along the negative gradient so that the remaining error keeps decreasing (a hand-rolled sketch of this residual-fitting loop follows). Extension: the mboost package, shown in the block after the sketch below.
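
A minimal sketch of the residual-fitting loop for squared-error loss (each round fits a small tree to the current residuals, i.e. the negative gradient, and takes a shrunken step); it assumes trainset is the churn training split with churn already coded 0/1 as above, and is only an illustration, not gbm's implementation.

Code (R):
# Gradient boosting by hand for squared-error loss (illustrative sketch)
library(rpart)
set.seed(2)
yhat <- rep(mean(trainset$churn), nrow(trainset))   # F0: constant starting prediction
shrinkage <- 0.01                                   # step size, as in the gbm call above
for (i in 1:100) {
  dat   <- trainset
  dat$churn <- NULL
  dat$r <- trainset$churn - yhat                    # residuals = negative gradient of squared loss
  fit   <- rpart(r ~ ., data = dat,
                 control = rpart.control(maxdepth = 3))
  yhat  <- yhat + shrinkage * predict(fit, dat)     # small step along the fitted gradient
}
mean((trainset$churn - yhat)^2)                     # training MSE after 100 rounds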

Code (R):
library(mboost)
# mboost only supports numeric predictors, so drop or convert the non-numeric ones.
# (The earlier errors apparently came from this yes/no conversion; wrapping "yes" in c()
# made it work here, which is odd, but it gets the job done.)
trainset$churn <- ifelse(trainset$churn ==c("yes"),1,0)
trainset$voice_mail_plan = NULL
trainset$international_plan = NULL
churn.mboost <- mboost(churn ~., data = trainset, control = boost_control(mstop = 10))
summary(churn.mboost)
  Model-based Boosting

Call:
mboost(formula = churn ~ ., data = trainset, control = boost_control(mstop = 10))


  Squared Error (Regression) 

Loss function: (y - f)^2 
 

Number of boosting iterations: mstop = 10 
Step size:  0.1 
Offset:  0.1417074 
Number of baselearners:  14 

Selection frequencies:
            bbs(total_day_minutes) bbs(number_customer_service_calls) 
                               0.6                                0.4 
par(mfrow=c(1,2))
plot(churn.mboost)

Partial contributions of the important attributes.

8.7 Computing the margins of a classifier

Code (R):
boost.margins <- margins(churn.boost, trainset)
boost.pred.margins <- margins(churn.boost.pred, testset)
plot(sort(boost.margins[[1]]), 
     (1:length(boost.margins[[1]]))/length(boost.margins[[1]]),
     type = 'l', xlim = c(-1,1), 
     main = "Boosting:Magrin cumulative distribution graph",
     xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(boost.pred.margins[[1]]), 
      (1:length(boost.pred.margins[[1]]))/length(boost.pred.margins[[1]]),
      type = "l", col="green")
abline(v=0, col='red', lty=2)

Cumulative margin distribution of the boosting classifier.

Code (R):
# percentage of negative margins (misclassified cases) in the training and test sets
boosting.training.margin <- table(boost.margins[[1]]>0)
boosting.negative.training <- as.numeric(boosting.training.margin[1])/boosting.training.margin[2]
boosting.negative.training
     TRUE 
0.0212766 
Code (R):
# compute the margins of the bagging classifier
bagging.margins = margins(churn.bagging, trainset)
bagging.pred.margins <- margins(churn.predbagging,testset)
plot(sort(bagging.margins[[1]]),
     (1:length(bagging.margins[[1]]))/length(bagging.margins[[1]]),
     type = "l", xlim = c(-1,1), 
     main = "Bagging:Magrin cumulative distribution graph",
     xlab = "margin", ylab = "% observations", col= 'blue')
lines(sort(bagging.pred.margins[[1]]), 
      (1:length(bagging.pred.margins[[1]]))/length(bagging.pred.margins[[1]]),
      type = "l", col="green")
abline(v=0, col='red', lty=2)
# the same percentage for the bagging classifier
bagging.training.margin <- table(bagging.margins[[1]]>0)
bagging.negative.training <- as.numeric(bagging.training.margin[1])/bagging.training.margin[2]
bagging.negative.training

The margin is a measure of how certain a classifier is; it is computed from the number of votes for the correct class and the largest number of votes for any wrong class. Correctly classified samples build positive margins, misclassified samples negative ones; a margin close to 1 means the sample is classified correctly with very high confidence, while samples whose classification is uncertain have only small margins (a hand-computed sketch follows). The margins function computes the margins of AdaBoost.M1, AdaBoost-SAMME and bagging classifiers and returns a vector of margins, whose cumulative distribution can be plotted to show how the margins are spread out. If every observation were classified correctly, the plot would be a vertical line at margin 1. Usually the fraction of negative margins (misclassified samples) on the training set is close to that on the test set.
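
A hand-computed version of the margin for a simple voting ensemble (the vote share of the true class minus the largest vote share of any other class); vote.margin is a made-up helper for illustration, not adabag's margins function.

Code (R):
# Margin of a voting ensemble, computed by hand (illustrative sketch)
vote.margin <- function(votes, truth) {
  # votes: character matrix with one column per base classifier; truth: factor of true labels
  sapply(seq_along(truth), function(i) {
    p <- table(factor(votes[i, ], levels = levels(truth))) / ncol(votes)
    p[as.character(truth[i])] - max(p[names(p) != as.character(truth[i])])
  })
}
# e.g. with the pred.mat matrix from the hand-rolled bagging sketch earlier:
# m <- vote.margin(pred.mat, testset$Species); mean(m < 0)   # share of negative margins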

8.8 Computing the error evolution of the ensemble method

Code (R):
# error evolution of the boosting ensemble
boosting.evol.train <- errorevol(churn.boost, trainset)
boosting.evol.test <- errorevol(churn.boost, testset)
plot(boosting.evol.test$error, type = "l", ylim = c(0,1),
     main = "Boosting error versus number of trees",
     xlab = "Iteration", ylab = "Error", col='red',
     lwd=2)
lines(boosting.evol.train$error, cex = .5, col='blue',lty=2,
      lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
       lty = 1:2, lwd=2)

The adabag package provides the errorevol function so that the error of an ensemble can be estimated as a function of the number of iterations.

Code (R):
# bagging
# error evolution of the bagging ensemble
bagging.evol.train <- errorevol(churn.bagging, trainset)
bagging.evol.test <- errorevol(churn.bagging, testset)
plot(bagging.evol.test$error, type = "l", ylim = c(0,1),
     main = "Bagging error versus number of trees",
     xlab = "Iteration", ylab = "Error", col='red',
     lwd=2)
lines(bagging.evol.train$error, cex = .5, col='blue',lty=2,
      lwd=2)
legend('topright', c('test','train'), col = c('red', 'blue'),
       lty = 1:2, lwd=2)

The figures show how the classification error changes after each iteration; based on these curves the ensembles can be pruned using predict.bagging and predict.boosting (a short example follows).
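
For example, predict.boosting (and predict.bagging) accept a newmfinal argument that uses only the first newmfinal trees of the ensemble, so a smaller ensemble suggested by the error curves can be evaluated directly; the value 2 below (out of the mfinal = 3 trees fitted in the boosting section, whose churn.boost and testset objects are assumed here) is an arbitrary illustration.

Code (R):
# use only the first 2 of the mfinal = 3 boosting trees when predicting (arbitrary choice)
pruned.pred <- predict.boosting(churn.boost, newdata = testset, newmfinal = 2)
pruned.pred$error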

8.9 Classifying data with random forest

During training many decision trees are grown; each tree produces a prediction for the input, and the class receiving the most votes over all trees is taken as the final prediction.

Code (R):
# random forest (this block again uses the churn training split with churn as a factor)
install.packages("randomForest")
library(randomForest)
churn.rf <- randomForest(churn ~., data = trainset, importance=T) # importance = T assesses predictor importance
churn.rf
Call:
 randomForest(formula = churn ~ ., data = trainset, importance = T) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 4.31%
Confusion matrix:
    yes   no class.error
yes 363  130 0.263691684
no   20 2966 0.006697924
# class predictions on the test set
churn.prediction <- predict(churn.rf, testset)
table(churn.prediction, testset$churn)
churn.prediction  yes   no
             yes  167    7
             no    47 1300
plot(churn.rf)
importance(churn.rf)
                                     yes         no MeanDecreaseAccuracy
international_plan            93.0223581 72.5504101           95.5848053
voice_mail_plan               22.5321109 18.2760474           22.8558091
number_vmail_messages         23.0980210 17.5029154           22.6011108
total_day_minutes             33.4914749 33.8653396           43.0515228
varImpPlot(churn.rf)
margins.rf <- margin(churn.rf, trainset)
plot(margins.rf)
hist(margins.rf, main = "Margins of Random Forest for churn dataset")
boxplot(margins.rf~ trainset$churn, main = "Margins of Random Forest for churn dataset by class")

Random forest combines many weak learners (decision trees) into one strong learner. The procedure is very similar to bagging: first draw bootstrap samples, then at each split search for the predictor that gives the best split, but only within a random subset of the predictors. For regression, the final output is the (weighted) average of all tree predictions; for classification, it is the majority class over the trees. The algorithm has two main parameters: ntree, the number of trees, and mtry, the number of features considered when looking for the best split; bagging uses only the former, and if mtry equals the number of features in the training set, random forest reduces to bagging (a short sketch of this follows). The main advantages are that it is easy to compute, efficient, and fairly tolerant of missing or imbalanced data; the main drawbacks are that it cannot predict beyond the range of the training data and that noisy data can make it overfit. Extension: the cforest function in the party package also implements a random-forest algorithm (see the code block after the sketch below).
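
To illustrate the mtry remark above: setting mtry to the number of predictors makes randomForest consider every feature at each split, so it degenerates into plain bagging of trees. This assumes trainset is the churn training split with churn as a factor; n.features and churn.bag.rf are names made up for this sketch.

Code (R):
# randomForest reduced to bagging: consider all predictors at every split
library(randomForest)
n.features <- ncol(trainset) - 1                       # every column except the response
churn.bag.rf <- randomForest(churn ~ ., data = trainset,
                             ntree = 500, mtry = n.features)
churn.bag.rf                                           # compare with the mtry = 4 forest above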

Code (R):
# extension: conditional-inference random forest from the party package
install.packages("party")
library(party)
churn.cforest <- cforest(churn~., data = trainset, 
                         controls = cforest_unbiased(ntree=1000,mtry=5))
churn.forest.prediction <- predict(churn.cforest, testset, OOB=TRUE, type = "response")
# the table below has one entry per *training* observation, which suggests that with
# OOB = TRUE the out-of-bag predictions on the training data are returned; that is
# presumably why it is tabulated against trainset$churn (flagged as odd in the notes)
table(churn.forest.prediction, trainset$churn)
                       
churn.forest.prediction  yes   no
                    yes  348   21
                    no   145 2965

8.10 Estimating the prediction error of different classifiers

Run 10-fold cross-validation on several classification algorithms with the errorest function to check whether the ensemble classifiers really outperform a single decision tree.

Code (R):
# ipred errorest
library(ipred)
churn.bagging <- errorest(churn ~., data = trainset, model = bagging);churn.bagging
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = bagging)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.052 
library(ada)
churn.mboosting <- errorest(churn ~., data = trainset, model = ada);churn.mboosting
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = ada)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.048 
churn.rf <- errorest(churn~., data = trainset, model = randomForest);churn.rf
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = randomForest)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.0454 
# errorest needs a prediction function that returns class labels for rpart
churn.predict <- function(object, newdata) predict(object, newdata = newdata, type = "class")
churn.tree <- errorest(churn~., data = trainset, model = rpart, predict=churn.predict);churn.tree
Call:
errorest.data.frame(formula = churn ~ ., data = trainset, model = rpart, 
    predict = churn.predict)

  10-fold cross-validation estimator of misclassification error 

Misclassification error:  0.0606 

randomForest has the lowest misclassification rate and the best performance, while the single tree performs worst, so the ensemble learners do beat a single tree. The ada package provides the boosting classifier used here.
