The decision tree model relies on plenty of heuristics (impurity measures and the like), but it handles categorical data naturally and often performs quite well. Random forests build on decision trees: randomly sample the data set, randomly select attributes (selection: picking attributes directly), randomly construct attributes (extraction: combining features into new ones), and then train simple decision trees on the result. After training a great many trees, aggregate these models and you get a random forest. Random forests classify remarkably well, presumably because there is strength in numbers. Another hallmark is randomness: the whole model is built with randomness at every step, so the final classification results are random too, yet the performance is reliably good (it does not fluctuate much).
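To make this concrete, below is a minimal sketch of the "sample, train, aggregate" loop, with rpart standing in for the base tree (the random attribute selection is omitted); the data frame df and its factor outcome y are hypothetical placeholders:

library(rpart)
set.seed(1)
# Train many trees, each on a bootstrap sample of the rows
trees = lapply(1:100, function(i) {
  boot = df[sample(nrow(df), replace = TRUE), ]
  rpart(y ~ ., data = boot, method = "class")
})
# Aggregate: every tree votes, and the majority class wins
votes = sapply(trees, function(t) as.character(predict(t, newdata = df, type = "class")))
majority = apply(votes, 1, function(v) names(which.max(table(v))))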

Basic decision trees

Now let's look at how to use decision trees and random forests in R. The decision tree we build here is a CART model, and we first need a couple of packages.

install.packages("rpart")
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

Then the usual process of building a CART tree, visualizing it, and making predictions:

#"class" indicates classification, minibucket para controls the size of tree
StevensTree = rpart(Reverse ~ ., data = Train, method="class", minbucket=25)
#plot the tree
prp(StevensTree)
#make predictions
PredictCART = predict(StevensTree, newdata = Test, type = "class")
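A quick sanity check of how the tree does on the test set (assuming Test$Reverse holds the true outcome, as in the model above):

# Confusion matrix and accuracy on the test set
confMat = table(Test$Reverse, PredictCART)
confMat
sum(diag(confMat)) / nrow(Test)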

The ROC curve for the decision tree model:

library(ROCR)

PredictROC = predict(StevensTree, newdata = Test)
PredictROC

# Note the first argument: without type = "class", predict returns a matrix of class probabilities, and column 2 is the predicted probability of outcome 1
pred = prediction(PredictROC[,2], Test$Reverse)
perf = performance(pred, "tpr", "fpr")
plot(perf)
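ROCR can also report the area under the curve directly; a one-line sketch:

# The AUC lives in the y.values slot of the performance object
as.numeric(performance(pred, "auc")@y.values)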

Error costs and multiclass classification

This came up as early as the Pattern Recognition course: different prediction errors carry different costs! Take predicting whether a patient has cancer: if the patient actually has cancer but the prediction says otherwise, the consequences are disastrous; the reverse mistake imposes some psychological burden, but the outcome is far less severe. So we should build this into the model. First, look at the following penalty matrix:

         Predicted bucket
Actual   1  2  3  4  5
  1      0  1  2  3  4
  2      2  0  1  2  3
  3      4  2  0  1  2
  4      6  4  2  0  1
  5      8  6  4  2  0

Note that this is a multiclass problem. That points to another clear advantage of decision trees: they handle multiclass classification naturally, with no extra overhead.

In R, the rpart function can also take the penalty matrix into account:

# Note: this example uses a different data set (the Claims data)
#The penalty matrix
PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow=TRUE, nrow=5) 
# CART model with a loss (penalty) matrix; cp controls model complexity
ClaimsTree = rpart(bucket2009 ~ ., data=ClaimsTrain, method="class", cp=0.00005, parms=list(loss=PenaltyMatrix))
#Predictions
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")

# Calculate the penalty error; note that as.matrix turns a table into a real matrix
sum(as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
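For comparison, a sketch of a naive baseline: assuming bucket 1 is the most frequent class in this data, always predicting it incurs the penalties in the first column of the matrix:

# Baseline: predict bucket 1 for everyone; actual bucket i then costs PenaltyMatrix[i, 1]
sum(table(ClaimsTest$bucket2009) * PenaltyMatrix[, 1]) / nrow(ClaimsTest)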

Cross-validation in R

CV should be familiar by now; with decision trees we use it to pick the best value of the model-complexity parameter (cp). Note that the example below fits a regression tree, since MEDV is a continuous outcome. Straight to the sample code:

# Load libraries for cross-validation, maybe need to install them
library(caret)
library(e1071)

# Number of folds
tr.control = trainControl(method = "cv", number = 10)

# Candidate cp values (controlling model complexity)
cp.grid = expand.grid( .cp = (0:10)*0.001)

# Cross-validation
tr = train(MEDV ~ ., data = train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)

# Extract tree
best.tree = tr$finalModel
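The tuned tree can then be used like any rpart model. A sketch assuming a held-out data frame test with the same columns; since MEDV is continuous, the sum of squared errors is a natural error measure:

# Predict with the best tree and compute SSE on the held-out set
best.tree.pred = predict(best.tree, newdata = test)
sum((best.tree.pred - test$MEDV)^2)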

Random Forest

Finally, a few words on random forests. Apart from needing its own package, everything works much like CART and logistic regression. Among the parameters you can specify the number of trees and control the size of each individual tree.

# Install randomForest package
install.packages("randomForest")
library(randomForest)

# Convert the outcome to a factor so randomForest does classification rather than regression
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)

# Build the model: ntree sets the number of trees, nodesize the minimum size of terminal nodes
StevensForest = randomForest(Reverse ~ ., data = Train, ntree=200, nodesize=25 )

# Make predictions
PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
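The confusion matrix above gives the accuracy directly, and varImpPlot (from randomForest, which records Gini importance by default for classification) shows which variables the forest relies on:

# Accuracy on the test set
sum(PredictForest == Test$Reverse) / nrow(Test)
# Variable importance plot
varImpPlot(StevensForest)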