We load post-election survey data from the 2004 British Election Survey. The data is available here.
# clear workspace
rm(list=ls())
library(foreign)
bes <- read.dta("http://philippbroniecki.github.io/ML2017.io/data/bes.dta")
bes <- na.omit(bes)
# clean in_school
table(bes$in_school)
##
## -0.405100243979883 -0.286622836951644 -0.0932005119161492
## 1 1 1
## -0.08278915151733 0 0.0403350016659423
## 1 4120 1
## 0.123419680101826 0.247478125358543 1
## 1 1 34
bes$in_school <- ifelse (bes$in_school != 1, 0, bes$in_school)
table(bes$in_school)
##
## 0 1
## 4127 34
# data manipulation
categorical <- c("Turnout", "Vote2001", "Gender", "PartyID", "Telephone", "edu15",
"edu16", "edu17", "edu18", "edu19plus", "in_school", "in_uni")
# declare factor variables
bes[, categorical] <- lapply(bes[, categorical], factor)
We use trees to classify respondents into voters and non-voters.
library(tree)
## Warning: package 'tree' was built under R version 3.4.1
# build classification tree (- in formula language means except)
t1 <- tree( Turnout ~ . -CivicDutyScores, data = bes)
summary(t1)
##
## Classification tree:
## tree(formula = Turnout ~ . - CivicDutyScores, data = bes)
## Variables actually used in tree construction:
## [1] "CivicDutyIndex" "Vote2001" "polinfoindex"
## Number of terminal nodes: 6
## Residual mean deviance: 0.8434 = 3504 / 4155
## Misclassification error rate: 0.1769 = 736 / 4161
We can plot the tree with the standard plot function. At every split a condition is printed: observations for which the condition is true go down the left branch, and those for which it is false go down the right branch.
# plot tree object
plot(t1)
text(t1, pretty = 0)
# examine the tree object
t1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 4161 4763.0 1 ( 0.25931 0.74069 )
## 2) CivicDutyIndex < 19.5 3066 2446.0 1 ( 0.13666 0.86334 )
## 4) Vote2001: 0 243 333.4 1 ( 0.44033 0.55967 ) *
## 5) Vote2001: 1 2823 1963.0 1 ( 0.11052 0.88948 )
## 10) CivicDutyIndex < 16.5 1748 950.8 1 ( 0.07723 0.92277 ) *
## 11) CivicDutyIndex > 16.5 1075 961.7 1 ( 0.16465 0.83535 ) *
## 3) CivicDutyIndex > 19.5 1095 1471.0 0 ( 0.60274 0.39726 )
## 6) Vote2001: 0 429 391.4 0 ( 0.82984 0.17016 ) *
## 7) Vote2001: 1 666 918.2 1 ( 0.45646 0.54354 )
## 14) polinfoindex < 5.5 356 483.4 0 ( 0.58427 0.41573 ) *
## 15) polinfoindex > 5.5 310 383.7 1 ( 0.30968 0.69032 ) *
Now we use the validation set approach for classification. We split our data and re-grow the tree on the training data.
# initialize random number generator
set.seed(2)
# training/test split
train <- sample(nrow(bes), size = as.integer(nrow(bes)*.66))
bes.test <- bes[ -train, ]
turnout.test <- ifelse( bes$Turnout[-train] == "1", yes = 1, no = 0)
# grow tree on training data
t2 <- tree( Turnout ~ . , data = bes, subset = train)
We predict outcomes using the predict() function.
# predict outcomes
t2.pred <- predict(t2, newdata = bes.test, type = "class")
# confusion matrix
table(prediction = t2.pred, truth = turnout.test)
## truth
## prediction 0 1
## 0 193 73
## 1 168 981
# percent correctly classified
mean( t2.pred == turnout.test )
## [1] 0.829682
We correctly classify \(83\%\) of the observations. In classification models the Brier score is often used as a measure of model quality. We estimate it as the average of the squared differences between predicted probabilities and true outcomes; it is thus similar to the MSE.
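Formally, with predicted probabilities \(\hat{p}_i\) and observed outcomes \(y_i \in \{0, 1\}\) for the \(n\) test observations, the Brier score is
\[ \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{p}_i - y_i \right)^2 . \]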
# Brier score
t2.pred <- predict(t2, newdata = bes.test, type = "vector")
head(t2.pred)
## 0 1
## 9 0.0403071 0.9596929
## 11 0.1722365 0.8277635
## 16 0.1722365 0.8277635
## 18 0.1722365 0.8277635
## 21 0.0403071 0.9596929
## 25 0.0403071 0.9596929
t2.pred <- t2.pred[,2]
# brier score
mean( (t2.pred - turnout.test)^2 )
## [1] 0.1292594
We turn to cost-complexity pruning to see if we can simplify the tree and thus decrease variance without increasing bias. We use k-fold cross-validation to determine the best size of the tree.
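Cost-complexity pruning trades off fit against tree size: for a tuning parameter \(\alpha \geq 0\) it selects the subtree \(T\) of the full tree that minimizes
\[ R(T) + \alpha \, |T| , \]
where \(R(T)\) is the training error of subtree \(T\) (here the number of misclassifications, since we set FUN = prune.misclass) and \(|T|\) is its number of terminal nodes. Larger values of \(\alpha\) penalize larger trees; cross-validation then picks the tree size whose \(\alpha\) yields the lowest estimated test error.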
set.seed(3)
cv.t2 <- cv.tree(t2, FUN = prune.misclass)
# illustrate
par(mfrow = c(1, 2))
plot(cv.t2$size, cv.t2$dev, type = "b")
plot(cv.t2$k, cv.t2$dev, type = "b")
We can prune the tree to four terminal nodes.
# prune the tree (pick the smallest tree that does not substantially increase the error)
prune.t2 <- prune.misclass(t2, best = 4)
par(mfrow = c(1,1))
plot(prune.t2)
text(prune.t2, pretty = 0)
# predict outcomes
t2.pred <- predict(prune.t2, newdata = bes.test, type = "class")
# did we lose predictive power?
table(prediction = t2.pred, truth = turnout.test)
## truth
## prediction 0 1
## 0 193 73
## 1 168 981
mean( t2.pred == turnout.test )
## [1] 0.829682
# Brier score
t2.pred <- predict(t2, newdata = bes.test, type = "vector")[,2]
mean( (t2.pred - turnout.test)^2 )
## [1] 0.1292594
We still correctly classify \(83\%\) of the observations, and the Brier score remains the same.
In the previous plots we saw that we would do worse if we pruned the tree back to fewer than four terminal nodes. We examine what happens if we overdo it.
# using "wrong" value for pruning (where the error rate does increase)
prune.t2 <- prune.misclass(t2, best = 2)
plot(prune.t2)
text(prune.t2, pretty = 0)
# predict outcomes based on a tree that is too small
t2.pred <- predict(prune.t2, newdata = bes.test, type = "class")
table(prediction = t2.pred, truth = turnout.test)
## truth
## prediction 0 1
## 0 252 185
## 1 109 869
# our predictive power decreased
mean( t2.pred == turnout.test )
## [1] 0.7922261
# brier score
t2.pred <- predict(prune.t2, newdata = bes.test, type = "vector")[,2]
mean( (t2.pred - turnout.test)^2 )
## [1] 0.1439949
We see that our test error increases.
We now predict the continuous variable Income. The plot of a regression tree looks similar; however, the terminal nodes display the mean value of the dependent variable for the observations in that group rather than class labels.
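In a regression tree the prediction for an observation that falls into terminal-node region \(R_m\) is simply the mean of the training outcomes in that region, and splits are chosen to reduce the residual sum of squares:
\[ \hat{y}_{R_m} = \frac{1}{N_m} \sum_{i:\, x_i \in R_m} y_i , \qquad \text{RSS} = \sum_{m} \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 . \]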
# grow a regression tree
set.seed(123)
reg.t1 <- tree(Income ~ ., data = bes, subset = train)
summary(reg.t1)
##
## Regression tree:
## tree(formula = Income ~ ., data = bes, subset = train)
## Variables actually used in tree construction:
## [1] "edu19plus" "Age" "Telephone"
## Number of terminal nodes: 6
## Residual mean deviance: 3.659 = 10030 / 2740
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.8830 -1.1430 0.1174 0.0000 1.2990 9.3210
# plot regression tree
plot(reg.t1)
text(reg.t1, pretty = 0)
# examine the regression tree object
reg.t1
We estimate the test error of our tree.
# MSE
mean( (bes.test$Income - predict(reg.t1, newdata = bes.test))^2)
## [1] 4.083786
We apply pruning again to get a smaller, more interpretable tree.
# cross-validation (to determine the tree size for pruning)
cv.reg.t1 <- cv.tree(reg.t1)
plot(cv.reg.t1)
plot(cv.reg.t1$size, cv.reg.t1$dev, type = "b")
This time, pruning the tree will increase the error. We choose four terminal nodes as a smaller tree size that does not increase the RSS by much.
# pruning
prune.reg.t1 <- prune.tree(reg.t1, best = 4)
plot(prune.reg.t1)
text(prune.reg.t1, pretty = 0)
We can predict the outcome based on our pruned-back tree. The predictions take only four distinct values because the tree has four terminal nodes. We can illustrate the groups and their variance and estimate the MSE of our prediction.
# predict outcomes
yhat <- predict(prune.reg.t1, newdata = bes.test)
plot(yhat, bes.test$Income)
abline(0, 1)
# MSE
mean((yhat - bes.test$Income)^2)
## [1] 4.200642
We increased our MSE by:
(4.200642 - 4.083786) / 4.083786
## [1] 0.02861462
\(3\%\).
We now apply bagging and random forests to improve our prediction. Bagging is the idea that the high variance of a single bushy tree can be reduced by bootstrapping the sample and averaging over trees that were grown on the bootstrap samples.
Note: Bagging gives us an estimate of the test error for free, because each tree is fit on a bootstrap sample that leaves out some observations. The reported out-of-bag (OOB) MSE is therefore an estimate of the test error. We also estimate the test error separately on our test set; this is one particular test set, so that estimate may vary.
In the run below, the OOB MSE may be the better estimate of the test error: it is lower than our test-set estimate.
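To make the bagging idea concrete, here is a minimal manual sketch using tree() before we switch to randomForest(). The number of bootstrap samples (B = 25), the seed, and the object names are arbitrary choices for illustration; randomForest() does the same thing internally, with many more trees and with out-of-bag bookkeeping.
# manual bagging sketch (assumes bes, train and bes.test from above)
library(tree)
set.seed(111)
B <- 25                                        # number of bootstrap samples
preds <- matrix(NA, nrow = nrow(bes.test), ncol = B)
for (b in 1:B) {
  boot.idx <- sample(train, size = length(train), replace = TRUE)  # bootstrap the training rows
  fit <- tree(Income ~ ., data = bes[boot.idx, ])                  # grow a tree on the bootstrap sample
  preds[, b] <- predict(fit, newdata = bes.test)                   # predict on the test set
}
yhat.manual <- rowMeans(preds)                 # average predictions over the B trees
mean((yhat.manual - bes.test$Income)^2)        # test MSE of the manual bagger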
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
bag1 <- randomForest(Income ~ . , mtry = 19, data = bes, subset = train, importance = TRUE)
bag1
##
## Call:
## randomForest(formula = Income ~ ., data = bes, mtry = 19, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 19
##
## Mean of squared residuals: 3.545845
## % Var explained: 38.59
# predict outcome, illustrate, MSE
yhat.bag <- predict(bag1, newdata = bes.test)
plot(yhat.bag, bes.test$Income)
abline(0, 1) # line of 1:1 perfect prediction
mean( (yhat.bag - bes.test$Income)^2 )
## [1] 3.836317
# reduction of error
(3.836317 - 4.200642) / 3.836317
## [1] -0.09496739
We reduce the error rate by \(9\%\) by using bagging.
We examine what happens when we reduce the number of trees we grow. The default is 500.
# decrease the number of trees (defaults to 500)
bag2 <- randomForest(Income ~ ., mtry = 19, data = bes, subset = train, ntree = 25, importance = TRUE)
# predict outcome
yhat.bag <- predict(bag2, newdata = bes.test)
mean( (yhat.bag - bes.test$Income)^2 )
## [1] 4.001648
The result is that our test MSE increases again.
We now apply random forests. The trick is to decorrelate the trees by randomly considering only a subset of the variables at each split, which reduces the variance further. The number of variables considered at each split, mtry, is a tuning parameter.
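To see why decorrelation helps: the average of \(B\) identically distributed predictions, each with variance \(\sigma^2\) and pairwise correlation \(\rho\), has variance
\[ \rho \, \sigma^2 + \frac{1 - \rho}{B} \, \sigma^2 , \]
so with many trees the first term dominates. Restricting the candidate variables at each split lowers \(\rho\) and thereby lowers this floor.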
## Random Forest: not trying all vars at each split decorrelates the trees
set.seed(123)
# we try to find the optimal tuning parameter for the number of variables to use at each split
oob.error <- NA
val.set.error <- NA
for ( idx in 1:10){
rf1 <- randomForest(Income ~ ., mtry = idx, data = bes, subset = train, importance = TRUE)
# record out of bag error
oob.error[idx] <- rf1$mse[length(rf1$mse)]
cat(paste("\n", "Use ", idx, " variables at each split", sep=""))
# record validation set error
val.set.error[idx] <- mean( (predict(rf1, newdata = bes.test) - bes.test$Income)^2 )
}
##
## Use 1 variables at each split
## Use 2 variables at each split
## Use 3 variables at each split
## Use 4 variables at each split
## Use 5 variables at each split
## Use 6 variables at each split
## Use 7 variables at each split
## Use 8 variables at each split
## Use 9 variables at each split
## Use 10 variables at each split
# check optimal values for mtry
matplot( 1:idx, cbind(oob.error, val.set.error), pch = 19, col = c("red", "blue"),
type = "b", ylab = "MSE", frame.plot = FALSE)
legend("topright", legend = c("OOB", "Val. Set"), pch = 19, col = c("red", "blue"),
bty = "n")
We use 4 as the optimal value for mtry.
rf1 <- randomForest(Income ~ ., mtry = 4, data = bes, subset = train, importance = TRUE)
# predict outcomes
yhat.rf <- predict(rf1, newdata = bes.test)
mean( (yhat.rf - bes.test$Income)^2 )
## [1] 3.738068
# reduction in error
(3.738068 - 3.836317) / 3.836317
## [1] -0.02561024
We reduced the error rate by another \(3\%\) by decorrelating the trees. We can examine variable importance as well. The %IncMSE column reports how much the out-of-bag MSE increases, on average across trees, when the values of a predictor are randomly permuted. IncNodePurity is the total reduction in node impurity from splits on that predictor, averaged over all trees; node impurity is measured by the RSS for regression trees and by the Gini index for classification trees.
# which variables help explain the outcome
importance(rf1)
## %IncMSE IncNodePurity
## cs_id 2.06598543 1191.62963
## Turnout 10.63267089 187.08688
## Vote2001 4.74392742 134.49639
## Age 61.67167351 2726.49512
## Gender 7.75324398 270.15383
## PartyID -1.30764676 186.00643
## Influence 8.25220908 622.72510
## Attention 13.48430187 734.16131
## Telephone 41.55335004 420.28959
## LeftrightSelf 3.84799215 701.93854
## CivicDutyIndex 19.09482995 765.92377
## polinfoindex 12.62283324 761.74783
## edu15 22.02752039 856.04335
## edu16 11.78494180 286.72268
## edu17 8.09626481 109.68566
## edu18 8.90440198 102.36531
## edu19plus 53.12112919 2236.08056
## in_school 6.07771141 77.57576
## in_uni -0.02058015 51.77373
## CivicDutyScores 18.09776217 1145.64391
# importance plot
varImpPlot(rf1)
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1234)
We run gradient boosting. The first tuning parameter is the number of trees. It is directly related to the second tuning parameter, the learning rate (shrinkage): the smaller the learning rate, the more trees we need. The third tuning parameter, interaction.depth, determines how bushy each tree is; common choices are 1, 2, 4, and 8. With an interaction depth of 1 each tree is a stump; increasing it to 2 allows bivariate interactions with two splits, and so on. A final parameter related to tree complexity is the minimum number of observations in a terminal node, which defaults to 10.
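For the gaussian (squared-error) loss used here, each of the \(B\) small trees \(\hat{f}^{\,b}\) is fit to the current residuals and added to the model scaled by the learning rate \(\lambda\) (\(B\) corresponds to n.trees and \(\lambda\) to shrinkage in the gbm() call below):
\[ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \, \hat{f}^{\,b}(x) , \qquad r_i \leftarrow r_i - \lambda \, \hat{f}^{\,b}(x_i) , \]
repeated for \(b = 1, \dots, B\).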
# gradient boosting
gb1 <- gbm(Income ~ ., data = bes[train, ],
distribution = "gaussian",
n.trees = 5000,
interaction.depth = 4,
shrinkage = 0.001)
summary(gb1)
## var rel.inf
## edu19plus edu19plus 41.4726490
## Age Age 28.6747586
## Telephone Telephone 7.0882505
## polinfoindex polinfoindex 5.3053648
## Gender Gender 2.5211165
## edu15 edu15 2.4673348
## cs_id cs_id 2.1982736
## LeftrightSelf LeftrightSelf 1.7370627
## Turnout Turnout 1.4784065
## in_school in_school 1.2178698
## CivicDutyScores CivicDutyScores 1.1787687
## Attention Attention 1.1600579
## edu16 edu16 0.9406745
## Influence Influence 0.9211903
## CivicDutyIndex CivicDutyIndex 0.6809589
## Vote2001 Vote2001 0.3880845
## edu17 edu17 0.2379545
## edu18 edu18 0.1226292
## PartyID PartyID 0.1054473
## in_uni in_uni 0.1031473
# partial dependence plots
plot(gb1, i = "edu19plus")
plot(gb1, i = "Age")
# predict outcome
yhat.gb <- predict(gb1, newdata = bes.test, n.trees = 5000)
mean( (yhat.gb - bes.test$Income)^2 )
## [1] 3.720392
# reduction in error (gbm test MSE vs. random forest test MSE)
(3.720392 - 3.738068) / 3.738068
## [1] -0.004728646
We reduce the error rate by roughly another half percent.
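The number of boosting iterations is itself a tuning parameter. One way to choose it is to cross-validate within gbm() via the cv.folds argument and pick the iteration that minimizes the cross-validated error with gbm.perf(). A minimal sketch (the fold count of 5 is an arbitrary choice):
# cross-validate the number of boosting iterations (illustrative sketch)
gb.cv <- gbm(Income ~ ., data = bes[train, ],
             distribution = "gaussian",
             n.trees = 5000,
             interaction.depth = 4,
             shrinkage = 0.001,
             cv.folds = 5)
best.iter <- gbm.perf(gb.cv, method = "cv")    # CV-optimal number of trees
yhat.cv <- predict(gb.cv, newdata = bes.test, n.trees = best.iter)
mean((yhat.cv - bes.test$Income)^2)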
Exercises:
1. We applied random forests for varying mtry. In addition, vary the number of trees and create a plot that displays both the test error and the OOB error.
2. Looking at the Carseats data from the ISLR package, we seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
   - Apply bagging and use the importance() function to determine which variables are most important.
   - Apply random forests and use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
3. This problem involves the OJ dataset, which is part of the ISLR package.
   - Fit a tree to the training data with Purchase as the response and the other variables except Buy as predictors. Use the summary() function to produce summary statistics about the tree and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
4. We now use boosting to predict Salary in the Hitters dataset, which is part of the ISLR package.
5. This question uses the Caravan dataset, part of the ISLR package.
   - Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees and a shrinkage value of 0.01. Which predictors appear to be the most important?
6. Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?