Classification Hangover


1. Libraries

Load the libraries

library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

2. Load data

Load the data

party <- read.csv("hangover_5_v2.csv", header = TRUE)
head(party, 10)
##    ID Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1   1   Fri             1                2   703    1        0
## 2   2   Sat             0                8   287    0        1
## 3   3   Wed             0                3   346    1        0
## 4   4   Sat             0                1   312    0        1
## 5   5   Mon             1                5   919    0        1
## 6   6   Mon             0                5   926    1        0
## 7   7   Fri             1                3   193    1        0
## 8   8   Tue             0               10   710    1        1
## 9   9   Wed             1                5    47    0        0
## 10 10   Thu             1                8   280    1        0
str(party)
## 'data.frame':    2000 obs. of  7 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Night           : chr  "Fri" "Sat" "Wed" "Sat" ...
##  $ Special_Event   : int  1 0 0 0 1 0 1 0 1 1 ...
##  $ Number_of_Drinks: int  2 8 3 1 5 5 3 10 5 8 ...
##  $ Spent           : int  703 287 346 312 919 926 193 710 47 280 ...
##  $ Chow            : int  1 0 1 0 0 1 1 1 0 1 ...
##  $ Hangover        : int  0 1 0 1 1 0 0 1 0 0 ...
names(party)
## [1] "ID"               "Night"            "Special_Event"    "Number_of_Drinks"
## [5] "Spent"            "Chow"             "Hangover"
nrow(party)
## [1] 2000

Remove variables that are unnecessary for this model (here, the ID column).

party <- party[ , -c(1)]
names(party)
## [1] "Night"            "Special_Event"    "Number_of_Drinks" "Spent"           
## [5] "Chow"             "Hangover"

Recode the target variable's values from 1/0 to Yes/No.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
party$Hangover <- recode(party$Hangover,
                         "1" = "Yes",
                         "0" = "No")
head(party)
##   Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1   Fri             1                2   703    1       No
## 2   Sat             0                8   287    0      Yes
## 3   Wed             0                3   346    1       No
## 4   Sat             0                1   312    0      Yes
## 5   Mon             1                5   919    0      Yes
## 6   Mon             0                5   926    1       No

Set Hangover as a factor. caret's confusionMatrix() expects factor inputs, so this step matters later.

party$Hangover <- as.factor(party$Hangover)
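
As a quick sanity check, inspect the factor levels and the class balance after recoding.

levels(party$Hangover)
table(party$Hangover)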

3. Training-validation split

Set the seed using our favourite number :-)

set.seed(666)

Create the indices for the split. This samples row indices that will divide the data into training and validation sets.

train_index <- sample(1:nrow(party), 0.6 * nrow(party))
valid_index <- setdiff(1:nrow(party), train_index)

Using the indices, create the training and validation sets. This is similar in principle to subsetting a data frame by row.

train_df <- party[train_index, ]
valid_df <- party[valid_index, ]

It is a good habit to check after splitting

nrow(train_df)
## [1] 1200
nrow(valid_df)
## [1] 800
head(train_df)
##      Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1598   Sun             0                2   196    0      Yes
## 638    Wed             0                5   754    1       No
## 608    Tue             0                3    71    1      Yes
## 907    Wed             1                1   685    0       No
## 1147   Sat             0                7   850    0      Yes
## 1564   Sun             1                9    21    0      Yes
head(valid_df)
##    Night Special_Event Number_of_Drinks Spent Chow Hangover
## 2    Sat             0                8   287    0      Yes
## 5    Mon             1                5   919    0      Yes
## 6    Mon             0                5   926    1       No
## 10   Thu             1                8   280    1       No
## 11   Tue             0               10   532    0      Yes
## 12   Wed             1               10    68    0      Yes
str(train_df)
## 'data.frame':    1200 obs. of  6 variables:
##  $ Night           : chr  "Sun" "Wed" "Tue" "Wed" ...
##  $ Special_Event   : int  0 0 0 1 0 1 0 0 0 0 ...
##  $ Number_of_Drinks: int  2 5 3 1 7 9 7 5 6 10 ...
##  $ Spent           : int  196 754 71 685 850 21 461 618 593 904 ...
##  $ Chow            : int  0 1 1 0 0 0 1 1 1 1 ...
##  $ Hangover        : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
str(valid_df)
## 'data.frame':    800 obs. of  6 variables:
##  $ Night           : chr  "Sat" "Mon" "Mon" "Thu" ...
##  $ Special_Event   : int  0 1 0 1 0 1 1 1 1 1 ...
##  $ Number_of_Drinks: int  8 5 5 8 10 10 6 9 1 5 ...
##  $ Spent           : int  287 919 926 280 532 68 573 489 890 14 ...
##  $ Chow            : int  0 0 1 1 0 0 1 0 0 1 ...
##  $ Hangover        : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 1 1 ...
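
As an aside, caret's createDataPartition() draws a stratified sample, which keeps the Yes/No proportions similar in the two sets. A minimal sketch (strat_index, train_strat and valid_strat are our own names):

# Stratified alternative to sample(); preserves the class balance in both sets.
strat_index <- createDataPartition(party$Hangover, p = 0.6, list = FALSE)
train_strat <- party[strat_index, ]
valid_strat <- party[-strat_index, ]
prop.table(table(train_strat$Hangover))
prop.table(table(valid_strat$Hangover))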

4. Classification tree

4.1 The tree

Build the tree on the training set. This is the model-training step.

names(train_df)
## [1] "Night"            "Special_Event"    "Number_of_Drinks" "Spent"           
## [5] "Chow"             "Hangover"
class_tr <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks + 
                    Spent + Chow,
                  data = train_df, method = "class",
                  maxdepth = 30)

Plot the tree. Try different settings to tweak the format.

prp(class_tr, cex = 0.8, tweak = 1)

4.2 Alternate specifications

Like any model, we can adjust the specifications to build different trees. Some may be better than others. In this case, we are adjusting the minbucket and maxdepth parameters.

class_tr_v2 <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks + 
                    Spent + Chow,
                    data = train_df, method = "class", minbucket = 2,
                    maxdepth = 10)

prp(class_tr_v2, cex = 0.8, tweak = 1)

A nicer plot.

rpart.plot(class_tr_v2, type = 4)

class_tr_v3 <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks + 
                    Spent + Chow,
                    data = train_df, method = "class", minbucket = 3,
                    maxdepth = 3)

prp(class_tr_v3, cex = 0.8, tweak = 1)

A nicer plot.

rpart.plot(class_tr_v3, type = 4)

4.3 Complexity parameter

Use cp to prune the tree.

printcp(class_tr)
## 
## Classification tree:
## rpart(formula = Hangover ~ Night + Special_Event + Number_of_Drinks + 
##     Spent + Chow, data = train_df, method = "class", maxdepth = 30)
## 
## Variables actually used in tree construction:
## [1] Chow             Night            Number_of_Drinks Special_Event   
## 
## Root node error: 522/1200 = 0.435
## 
## n= 1200 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.181992      0   1.00000 1.00000 0.032899
## 2 0.053640      1   0.81801 0.81801 0.031772
## 3 0.032567      2   0.76437 0.76820 0.031303
## 4 0.026820      4   0.69923 0.76628 0.031283
## 5 0.021073      5   0.67241 0.73563 0.030956
## 6 0.010000      6   0.65134 0.74713 0.031082

Select tree number 5. Any cp greater than 0.010000 and at most 0.021073 prunes back to the 5-split tree; 0.011407 is one such value.

class_tr_cp <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks +
                       Spent + Chow,
                     data = train_df, method = "class",
                     maxdepth = 30, cp = 0.011407)


prp(class_tr_cp, cex = 0.8, tweak = 1)

Alternate plot.

rpart.plot(class_tr_cp, type = 4)
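
Instead of refitting, the fitted tree can also be pruned directly at the chosen cp; this should yield the same 5-split tree as class_tr_cp (class_tr_pruned and cp_best are our own names):

# Prune the existing tree at the chosen cp instead of refitting.
class_tr_pruned <- prune(class_tr, cp = 0.011407)
prp(class_tr_pruned, cex = 0.8, tweak = 1)

# The lowest-xerror cp can also be pulled from the cptable programmatically.
cp_best <- class_tr$cptable[which.min(class_tr$cptable[, "xerror"]), "CP"]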

5. Model Evaluation

5.1 Confusion matrix

For all classification models, we have to generate a confusion matrix to see how good (or lousy) our model is.

For the training set.

class_tr_train_predict <- predict(class_tr, train_df,
                                  type = "class")

t(t(head(class_tr_train_predict,10)))
##      [,1]
## 1598 Yes 
## 638  Yes 
## 608  No  
## 907  No  
## 1147 Yes 
## 1564 Yes 
## 654  Yes 
## 1925 Yes 
## 873  Yes 
## 652  Yes 
## Levels: No Yes
confusionMatrix(class_tr_train_predict, train_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  281  99
##        Yes 241 579
##                                          
##                Accuracy : 0.7167         
##                  95% CI : (0.6902, 0.742)
##     No Information Rate : 0.565          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.405          
##                                          
##  Mcnemar's Test P-Value : 2.06e-14       
##                                          
##             Sensitivity : 0.8540         
##             Specificity : 0.5383         
##          Pos Pred Value : 0.7061         
##          Neg Pred Value : 0.7395         
##              Prevalence : 0.5650         
##          Detection Rate : 0.4825         
##    Detection Prevalence : 0.6833         
##       Balanced Accuracy : 0.6961         
##                                          
##        'Positive' Class : Yes            
## 
con_mat_train_1 <- confusionMatrix(class_tr_train_predict, train_df$Hangover, 
                                 positive = "Yes")

sensitivity_train <- con_mat_train_1$byClass[1]

precision_train <- con_mat_train_1$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_train <- unname(f1_train)

paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.773030707610147"

Then we use the model to predict the outcomes for the validation set.

This will tell us how good (or lousy) our predictions are.

class_tr_valid_predict <- predict(class_tr, valid_df,
                                  type = "class")

t(t(head(class_tr_valid_predict,10)))
##    [,1]
## 2  Yes 
## 5  Yes 
## 6  Yes 
## 10 Yes 
## 11 Yes 
## 12 Yes 
## 14 Yes 
## 15 Yes 
## 19 No  
## 22 Yes 
## Levels: No Yes
confusionMatrix(class_tr_valid_predict, valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  158  77
##        Yes 171 394
##                                           
##                Accuracy : 0.69            
##                  95% CI : (0.6567, 0.7219)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 2.091e-09       
##                                           
##                   Kappa : 0.331           
##                                           
##  Mcnemar's Test P-Value : 3.516e-09       
##                                           
##             Sensitivity : 0.8365          
##             Specificity : 0.4802          
##          Pos Pred Value : 0.6973          
##          Neg Pred Value : 0.6723          
##              Prevalence : 0.5887          
##          Detection Rate : 0.4925          
##    Detection Prevalence : 0.7063          
##       Balanced Accuracy : 0.6584          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_valid_1 <- confusionMatrix(class_tr_valid_predict, valid_df$Hangover, 
                                 positive = "Yes")

sensitivity_valid <- con_mat_valid_1$byClass[1]

precision_valid <- con_mat_valid_1$byClass[3]

f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_valid <- unname(f1_valid)

paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.760617760617761"

Compute the probabilities.

class_tr_valid_predict_prob <- predict(class_tr, valid_df,
                                  type = "prob")

head(class_tr_valid_predict_prob)
##           No       Yes
## 2  0.2312634 0.7687366
## 5  0.4179104 0.5820896
## 6  0.3820225 0.6179775
## 10 0.2312634 0.7687366
## 11 0.2312634 0.7687366
## 12 0.2312634 0.7687366

For a different cutoff…

confusionMatrix(as.factor(ifelse(class_tr_valid_predict_prob[,2] > 0.75,
                                 "Yes", "No")),
                          valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  249 228
##        Yes  80 243
##                                           
##                Accuracy : 0.615           
##                  95% CI : (0.5803, 0.6489)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 0.07002         
##                                           
##                   Kappa : 0.2554          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.5159          
##             Specificity : 0.7568          
##          Pos Pred Value : 0.7523          
##          Neg Pred Value : 0.5220          
##              Prevalence : 0.5887          
##          Detection Rate : 0.3038          
##    Detection Prevalence : 0.4037          
##       Balanced Accuracy : 0.6364          
##                                           
##        'Positive' Class : Yes             
## 
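
To see how the trade-off moves with the threshold, it can help to scan a few cutoffs in a loop; a small sketch (the cutoff values are arbitrary):

# Scan several cutoffs and report accuracy and sensitivity for each.
for (cut in c(0.25, 0.5, 0.75)) {
  pred <- factor(ifelse(class_tr_valid_predict_prob[, "Yes"] > cut, "Yes", "No"),
                 levels = c("No", "Yes"))
  cm <- confusionMatrix(pred, valid_df$Hangover, positive = "Yes")
  print(round(c(cutoff = cut, cm$overall["Accuracy"], cm$byClass["Sensitivity"]), 3))
}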

5.2 ROC

library(ROSE)
## Loaded ROSE 0.0-4
ROSE::roc.curve(valid_df$Hangover, class_tr_valid_predict)

## Area under the curve (AUC): 0.658
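
The curve above is built from hard class labels; feeding the predicted probabilities computed earlier gives a smoother, more informative ROC:

ROSE::roc.curve(valid_df$Hangover, class_tr_valid_predict_prob[, "Yes"])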

5.3 Cumulative plots

We can perform further evaluation with cumulative gains and lift charts.

Load the library.

library(modelplotr)
## Package modelplotr loaded! Happy model plotting!

Prepare scores for evaluation.

scores_and_ntiles <- prepare_scores_and_ntiles(datasets = 
                                                 list("valid_df"),
                                               dataset_labels = 
                                                 list("Validation data"),
                                               models = 
                                                 list("class_tr"),
                                               model_labels = 
                                                 list("CART"),
                                               target_column = "Hangover",
                                               ntiles = 10)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the modelplotr package.
##   Please report the issue at <https://github.com/jurrr/modelplotr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## ... scoring caret model "class_tr" on dataset "valid_df".
## Data preparation step 1 succeeded! Dataframe created.

Set the select_targetclass argument to the preferred class. In this case, we select “Yes”.

plot_input <- plotting_scope(prepared_input = scores_and_ntiles,
                             select_targetclass = "Yes")
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the modelplotr package.
##   Please report the issue at <https://github.com/jurrr/modelplotr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
## 
## No comparison specified, default values are used. 
## 
## Single evaluation line will be plotted: Target value "Yes" plotted for dataset "Validation data" and model "CART.
## "
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".

Cumulative gains.

plot_cumgains(data = plot_input)

Cumulative lift.

plot_cumlift(data = plot_input)

Response plot.

plot_response(data = plot_input)

Cumulative response plot.

plot_cumresponse(data = plot_input)

6. Predict new record

The purpose of a classification model is to predict outcomes for new records. Here we predict whether a new customer will experience a hangover. Import the new record.

New record

new_customer <- read.csv("hangover_new_data_R_v2.csv", header = TRUE)
new_customer
##   Night Special_Event Number_of_Drinks Spent Chow
## 1   Fri             0                6   666    1

Hangover for the new record.

Based on our classification model.

hangover_class_tr <- predict(class_tr, newdata = new_customer)
hangover_class_tr
##          No       Yes
## 1 0.3820225 0.6179775
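
By default, predict() on an rpart classification tree returns class probabilities; asking for type = "class" gives the label at the 0.5 cutoff:

predict(class_tr, newdata = new_customer, type = "class")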

6.1 Predicting more than one new record

Sometimes we need to predict the outcomes of several new records.

New records

new_customer_2 <- read.csv("hangover_new_data_R_v3.csv", header = TRUE)
new_customer_2
##   Night Special_Event Number_of_Drinks Spent Chow
## 1   Sat             0                2   333    1
## 2   Mon             1                8   888    1
## 3   Wed             0                6   222    0

Hangover for the new records.

Based on our classification model.

hangover_class_tr_2 <- predict(class_tr, newdata = new_customer_2,
                               type = "class")
hangover_class_tr_2
##   1   2   3 
## Yes Yes  No 
## Levels: No Yes
hangover_class_tr_2 <- predict(class_tr, newdata = new_customer_2,
                               type = "prob")
hangover_class_tr_2
##          No       Yes
## 1 0.3425926 0.6574074
## 2 0.2312634 0.7687366
## 3 0.6213592 0.3786408

Only one visit did not end up with a hangover :-)

7. k-fold cross-validation with caret

Use caret to perform 10-fold cross-validation, repeated 3 times.

caret_control_k10 <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3)

Use caret to train the decision tree with 10-fold cross-validation repeated 3 times, trying 15 candidate values of rpart's cp tuning parameter.

This returns the best model trained on all the training data!

class_tr_k10 <- train(Hangover ~ Night + Special_Event + Number_of_Drinks +
                        Spent + Chow,
                      data = train_df,
                      method = "rpart",
                      trControl = caret_control_k10,
                      tuneLength = 15)
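
tuneLength lets caret choose the candidate cp values itself; to control them explicitly, pass a tuneGrid instead. A sketch (the grid values are our own choice):

# Explicit cp grid instead of tuneLength.
cp_grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
class_tr_k10_grid <- train(Hangover ~ Night + Special_Event + Number_of_Drinks +
                             Spent + Chow,
                           data = train_df,
                           method = "rpart",
                           trControl = caret_control_k10,
                           tuneGrid = cp_grid)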

Display the results of the cross validation run.

class_tr_k10
## CART 
## 
## 1200 samples
##    5 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1080, 1079, 1080, 1080, 1080, 1080, ... 
## Resampling results across tuning parameters:
## 
##   cp            Accuracy   Kappa    
##   0.0009578544  0.6586425  0.2955548
##   0.0014367816  0.6641796  0.3065570
##   0.0019157088  0.6661287  0.3099902
##   0.0022988506  0.6791822  0.3344577
##   0.0025542784  0.6802980  0.3355712
##   0.0028735632  0.6850042  0.3436703
##   0.0038314176  0.6908307  0.3530247
##   0.0057471264  0.6913749  0.3506167
##   0.0076628352  0.6861015  0.3385386
##   0.0086206897  0.6841548  0.3367659
##   0.0090996169  0.6841548  0.3367659
##   0.0181992337  0.6674786  0.3185716
##   0.0268199234  0.6544597  0.2856925
##   0.0536398467  0.6422444  0.2744946
##   0.1819923372  0.5985852  0.1599264
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005747126.
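
The winning cp and the refitted tree are stored on the train object; a quick sketch to inspect them (note that caret's formula interface expands Night into dummy variables, so the split labels differ from the earlier plots):

class_tr_k10$bestTune
rpart.plot(class_tr_k10$finalModel, type = 4)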

7.1 Confusion matrix

Training set.

class_tr_train_predict_k10 <- predict(class_tr_k10, train_df,
                                      type = "raw")


confusionMatrix(class_tr_train_predict_k10, train_df$Hangover, 
               positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  280  81
##        Yes 242 597
##                                           
##                Accuracy : 0.7308          
##                  95% CI : (0.7048, 0.7558)
##     No Information Rate : 0.565           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4323          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8805          
##             Specificity : 0.5364          
##          Pos Pred Value : 0.7116          
##          Neg Pred Value : 0.7756          
##              Prevalence : 0.5650          
##          Detection Rate : 0.4975          
##    Detection Prevalence : 0.6992          
##       Balanced Accuracy : 0.7085          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_train_cv <- confusionMatrix(class_tr_train_predict_k10, train_df$Hangover, 
               positive = "Yes")

sensitivity_train <- con_mat_train_cv$byClass[1]

precision_train <- con_mat_train_cv$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_train <- unname(f1_train)

paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.787079762689519"
Validation set.

class_tr_valid_predict_k10 <- predict(class_tr_k10, valid_df,
                                      type = "raw")

confusionMatrix(class_tr_valid_predict_k10, valid_df$Hangover, 
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  156  76
##        Yes 173 395
##                                           
##                Accuracy : 0.6888          
##                  95% CI : (0.6554, 0.7207)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 3.271e-09       
##                                           
##                   Kappa : 0.3274          
##                                           
##  Mcnemar's Test P-Value : 1.174e-09       
##                                           
##             Sensitivity : 0.8386          
##             Specificity : 0.4742          
##          Pos Pred Value : 0.6954          
##          Neg Pred Value : 0.6724          
##              Prevalence : 0.5887          
##          Detection Rate : 0.4938          
##    Detection Prevalence : 0.7100          
##       Balanced Accuracy : 0.6564          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_valid_cv <- confusionMatrix(class_tr_valid_predict_k10, valid_df$Hangover, 
                positive = "Yes")

sensitivity_valid <- con_mat_valid_cv$byClass[1]

precision_valid <- con_mat_valid_cv$byClass[3]

f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_valid <- unname(f1_valid)

paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.760346487006737"

7.2 ROC

ROSE::roc.curve(valid_df$Hangover, class_tr_valid_predict_k10)

## Area under the curve (AUC): 0.684

7.3 Cumulative plots

Prepare scores.

library(modelplotr)

scores_and_ntiles_best <- prepare_scores_and_ntiles(datasets = 
                                                 list("valid_df"),
                                               dataset_labels = 
                                                 list("Validation data"),
                                               models = 
                                                 list("class_tr_k10"),
                                               model_labels = 
                                                 list("CART"),
                                               target_column = "Hangover",
                                               ntiles = 10)
## ... scoring caret model "class_tr_k10" on dataset "valid_df".
## Data preparation step 1 succeeded! Dataframe created.

Set the select_targetclass argument to the preferred class.

plot_input_best <- plotting_scope(prepared_input = scores_and_ntiles_best,
                             select_targetclass = "Yes")
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
## 
## No comparison specified, default values are used. 
## 
## Single evaluation line will be plotted: Target value "Yes" plotted for dataset "Validation data" and model "CART.
## "
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".

Cumulative gains for the cv model.

plot_cumgains(data = plot_input_best)

Cumulative lift for the cv model.

plot_cumlift(data = plot_input_best)

Response plot for the cv model.

plot_response(data = plot_input_best)

Cumulative response plot for the cv model.

plot_cumresponse(data = plot_input_best)

8. Predict new record with cv model

hangover_class_tr_k10 <- predict(class_tr_k10, newdata = new_customer)
hangover_class_tr_k10
## [1] Yes
## Levels: No Yes
hangover_class_tr_k10_2 <- predict(class_tr_k10, newdata = new_customer_2)
hangover_class_tr_k10_2
## [1] Yes Yes No 
## Levels: No Yes

Compute the probabilities.

hangover_class_tr_k10_prob <- predict(class_tr_k10, newdata = new_customer,
                                   type = "prob")
hangover_class_tr_k10_prob
##          No       Yes
## 1 0.3820225 0.6179775
hangover_class_tr_k10_prob_2 <- predict(class_tr_k10, newdata = new_customer_2,
                                   type = "prob")
hangover_class_tr_k10_prob_2
##          No       Yes
## 1 0.3425926 0.6574074
## 2 0.2312634 0.7687366
## 3 0.7333333 0.2666667

9. Improved trees

9.1 Random forest

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
class_tr_rf <- randomForest(Hangover ~ Night + Special_Event +
                              Number_of_Drinks +
                              Spent + Chow, 
                            data = train_df, ntree = 200,
                            nodesize = 5, importance = TRUE)
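
ntree and nodesize are only two of the knobs; mtry, the number of predictors tried at each split, can be searched with randomForest's tuneRF(). A sketch, converting Night to a factor first because the x/y interface does not accept character columns (rf_x is our own name):

# Tune mtry on the training predictors.
rf_x <- train_df[, setdiff(names(train_df), "Hangover")]
rf_x$Night <- as.factor(rf_x$Night)
tuneRF(rf_x, train_df$Hangover, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)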

The confusion matrix for the random forest.

class_tr_rf_pred_train <- predict(class_tr_rf, train_df)
confusionMatrix(class_tr_rf_pred_train, train_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  416  68
##        Yes 106 610
##                                           
##                Accuracy : 0.855           
##                  95% CI : (0.8338, 0.8744)
##     No Information Rate : 0.565           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7025          
##                                           
##  Mcnemar's Test P-Value : 0.005032        
##                                           
##             Sensitivity : 0.8997          
##             Specificity : 0.7969          
##          Pos Pred Value : 0.8520          
##          Neg Pred Value : 0.8595          
##              Prevalence : 0.5650          
##          Detection Rate : 0.5083          
##    Detection Prevalence : 0.5967          
##       Balanced Accuracy : 0.8483          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_train_rf <- confusionMatrix(class_tr_rf_pred_train, 
                                    train_df$Hangover, positive = "Yes")

sensitivity_train <- con_mat_train_rf$byClass[1]

precision_train <- con_mat_train_rf$byClass[3]

f1_train <- 2/((1/sensitivity_train) + (1/precision_train))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_train <- unname(f1_train)

paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.875179340028694"
class_tr_rf_pred_valid <- predict(class_tr_rf, valid_df)
confusionMatrix(class_tr_rf_pred_valid, valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  200 122
##        Yes 129 349
##                                           
##                Accuracy : 0.6862          
##                  95% CI : (0.6528, 0.7183)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 7.866e-09       
##                                           
##                   Kappa : 0.35            
##                                           
##  Mcnemar's Test P-Value : 0.7049          
##                                           
##             Sensitivity : 0.7410          
##             Specificity : 0.6079          
##          Pos Pred Value : 0.7301          
##          Neg Pred Value : 0.6211          
##              Prevalence : 0.5887          
##          Detection Rate : 0.4363          
##    Detection Prevalence : 0.5975          
##       Balanced Accuracy : 0.6744          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_valid_rf <- confusionMatrix(class_tr_rf_pred_valid, 
                                    valid_df$Hangover, positive = "Yes")

sensitivity_valid <- con_mat_valid_rf$byClass[1]

precision_valid <- con_mat_valid_rf$byClass[3]

f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))

# Use this to avoid awkward naming. It's just the way it works.
# f1_valid <- unname(f1_valid)

paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.735511064278188"

Variable importance in the random forest.

varImpPlot(class_tr_rf, type = 1)

varImpPlot(class_tr_rf, type = 2)

ROC for the random forest.

ROSE::roc.curve(valid_df$Hangover, class_tr_rf_pred_valid)

## Area under the curve (AUC): 0.674

Predict using the random forest.

class_tr_rf_pred_new <- predict(class_tr_rf, new_customer)
class_tr_rf_pred_new
##   1 
## Yes 
## Levels: No Yes
class_tr_rf_pred_new_2 <- predict(class_tr_rf, new_customer_2)
class_tr_rf_pred_new_2
##   1   2   3 
## Yes Yes  No 
## Levels: No Yes

The probabilities.

class_tr_rf_pred_new_prob <- predict(class_tr_rf, new_customer, type = "prob")
class_tr_rf_pred_new_prob
##      No   Yes
## 1 0.145 0.855
## attr(,"class")
## [1] "matrix" "array"  "votes"
class_tr_rf_pred_new_prob_2 <- predict(class_tr_rf, new_customer_2, type = "prob")
class_tr_rf_pred_new_prob_2
##      No   Yes
## 1 0.235 0.765
## 2 0.210 0.790
## 3 0.560 0.440
## attr(,"class")
## [1] "matrix" "array"  "votes"

Optimal cutoff. Use the ROC curve to find the probability threshold that maximises Youden's J (sensitivity + specificity - 1).

class_tr_rf_train_predict_prob <- predict(class_tr_rf, train_df,
                                  type = "prob")
class_tr_rf_train_predict_prob_df <- as.data.frame(class_tr_rf_train_predict_prob)
head(class_tr_rf_train_predict_prob_df[2])
##        Yes
## 1598 0.605
## 638  0.395
## 608  0.460
## 907  0.100
## 1147 0.905
## 1564 0.850
class_tr_rf_valid_predict_prob <- predict(class_tr_rf, valid_df,
                                  type = "prob")
class_tr_rf_valid_predict_prob_df <- as.data.frame(class_tr_rf_valid_predict_prob)
head(class_tr_rf_valid_predict_prob_df[2])
##      Yes
## 2  0.975
## 5  0.830
## 6  0.450
## 10 0.640
## 11 0.990
## 12 0.770
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_valid <- roc(valid_df$Hangover,
                 class_tr_rf_valid_predict_prob_df[,2],
                 levels = c("No", "Yes"),
                 direction = "<")

opt_cutoff_best <- coords(roc_valid, "best", 
                          ret = "threshold", 
                          best.method = "youden")[[1]]

paste0("Optimal cutoff for validation set (Youden's J): ", opt_cutoff_best)
## [1] "Optimal cutoff for validation set (Youden's J): 0.4625"

Confusion matrix with optimal cutoff.

confusionMatrix(as.factor(ifelse(class_tr_rf_train_predict_prob_df[,2] > 0.4625,
                                 "Yes", "No")),
                          train_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  394  61
##        Yes 128 617
##                                           
##                Accuracy : 0.8425          
##                  95% CI : (0.8206, 0.8627)
##     No Information Rate : 0.565           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6748          
##                                           
##  Mcnemar's Test P-Value : 1.58e-06        
##                                           
##             Sensitivity : 0.9100          
##             Specificity : 0.7548          
##          Pos Pred Value : 0.8282          
##          Neg Pred Value : 0.8659          
##              Prevalence : 0.5650          
##          Detection Rate : 0.5142          
##    Detection Prevalence : 0.6208          
##       Balanced Accuracy : 0.8324          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_train_rf_2 <- confusionMatrix(as.factor(ifelse(
  class_tr_rf_train_predict_prob_df[,2] > 0.4625,"Yes", "No")),
                          train_df$Hangover, positive = "Yes")

sensitivity_train <- con_mat_train_rf_2$byClass[1]

precision_train <- con_mat_train_rf_2$byClass[3]

f1_train <- 2/((1/sensitivity_train) + (1/precision_train))


paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.86718200983837"
confusionMatrix(as.factor(ifelse(class_tr_rf_valid_predict_prob_df[,2] > 0.4625,
                                 "Yes", "No")),
                          valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  190 105
##        Yes 139 366
##                                           
##                Accuracy : 0.695           
##                  95% CI : (0.6618, 0.7268)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 3.299e-10       
##                                           
##                   Kappa : 0.3602          
##                                           
##  Mcnemar's Test P-Value : 0.03463         
##                                           
##             Sensitivity : 0.7771          
##             Specificity : 0.5775          
##          Pos Pred Value : 0.7248          
##          Neg Pred Value : 0.6441          
##              Prevalence : 0.5887          
##          Detection Rate : 0.4575          
##    Detection Prevalence : 0.6312          
##       Balanced Accuracy : 0.6773          
##                                           
##        'Positive' Class : Yes             
## 
con_mat_valid_rf_2 <- confusionMatrix(as.factor(ifelse(
  class_tr_rf_valid_predict_prob_df[,2] > 0.4625,"Yes", "No")),
                          valid_df$Hangover, positive = "Yes")

sensitivity_valid <- con_mat_valid_rf_2$byClass[1]

precision_valid <- con_mat_valid_rf_2$byClass[3]

f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))

# Optional: drop the name inherited from byClass so paste() prints only the value.
# f1_valid <- unname(f1_valid)

paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.75"

Apply the optimal cutoff to the new-record predictions.

class_tr_rf_pred_new_prob_2_df <- as.data.frame(class_tr_rf_pred_new_prob_2)
class_tr_rf_pred_new_prob_2_df
##      No   Yes
## 1 0.235 0.765
## 2 0.210 0.790
## 3 0.560 0.440
class_tr_rf_pred_new_prob_2_df$Prediction <- ifelse(class_tr_rf_pred_new_prob_2_df$Yes > 0.4625, "Yes", "No")
class_tr_rf_pred_new_prob_2_df
##      No   Yes Prediction
## 1 0.235 0.765        Yes
## 2 0.210 0.790        Yes
## 3 0.560 0.440         No

9.2 Boosted tree

library(adabag)
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
class_tr_boost <- boosting(Hangover ~ Night + Special_Event +
                             Number_of_Drinks +
                             Spent + Chow, 
                           data = train_df)
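
boosting() runs 100 iterations by default (the mfinal argument); fewer iterations train faster at some cost in fit. A sketch (class_tr_boost_small is our own name):

class_tr_boost_small <- boosting(Hangover ~ Night + Special_Event +
                                   Number_of_Drinks + Spent + Chow,
                                 data = train_df, mfinal = 50)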

Prediction using the boosted tree.

class_tr_boost_pred_train <- predict(class_tr_boost, train_df)
class_tr_boost_pred_valid <- predict(class_tr_boost, valid_df)

Predicted classes.

head(class_tr_boost_pred_train$class)
## [1] "Yes" "No"  "No"  "No"  "Yes" "Yes"
head(class_tr_boost_pred_valid$class)
## [1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes"

Confusion matrix from the boosted tree.

class_tr_boost_pred_train$confusion
##                Observed Class
## Predicted Class  No Yes
##             No  387  77
##             Yes 135 601
class_tr_boost_pred_valid$confusion
##                Observed Class
## Predicted Class  No Yes
##             No  179 118
##             Yes 150 353

Check data types.

str(valid_df$Hangover)
##  Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 1 1 ...
str(class_tr_boost_pred_valid$class)
##  chr [1:800] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No" "Yes" "No" "Yes" "No" ...

Convert the predicted classes to factors, then generate the confusion matrices with statistics.

class_tr_boost_pred_train$class <- as.factor(class_tr_boost_pred_train$class)

confusionMatrix(class_tr_boost_pred_train$class, train_df$Hangover, 
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  387  77
##        Yes 135 601
##                                           
##                Accuracy : 0.8233          
##                  95% CI : (0.8006, 0.8445)
##     No Information Rate : 0.565           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6359          
##                                           
##  Mcnemar's Test P-Value : 9.049e-05       
##                                           
##             Sensitivity : 0.8864          
##             Specificity : 0.7414          
##          Pos Pred Value : 0.8166          
##          Neg Pred Value : 0.8341          
##              Prevalence : 0.5650          
##          Detection Rate : 0.5008          
##    Detection Prevalence : 0.6133          
##       Balanced Accuracy : 0.8139          
##                                           
##        'Positive' Class : Yes             
## 
class_tr_boost_pred_valid$class <- as.factor(class_tr_boost_pred_valid$class)

confusionMatrix(class_tr_boost_pred_valid$class, valid_df$Hangover, 
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  179 118
##        Yes 150 353
##                                           
##                Accuracy : 0.665           
##                  95% CI : (0.6311, 0.6977)
##     No Information Rate : 0.5888          
##     P-Value [Acc > NIR] : 5.547e-06       
##                                           
##                   Kappa : 0.2979          
##                                           
##  Mcnemar's Test P-Value : 0.05827         
##                                           
##             Sensitivity : 0.7495          
##             Specificity : 0.5441          
##          Pos Pred Value : 0.7018          
##          Neg Pred Value : 0.6027          
##              Prevalence : 0.5887          
##          Detection Rate : 0.4412          
##    Detection Prevalence : 0.6288          
##       Balanced Accuracy : 0.6468          
##                                           
##        'Positive' Class : Yes             
## 

ROC for the boosted tree.

ROSE::roc.curve(valid_df$Hangover, class_tr_boost_pred_valid$class)

## Area under the curve (AUC): 0.647
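
As before, this ROC uses hard labels; the $prob matrix gives a probability-based curve (assuming its second column corresponds to "Yes", matching the factor levels):

ROSE::roc.curve(valid_df$Hangover, class_tr_boost_pred_valid$prob[, 2])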

Predict using the boosted tree.

class_tr_boost_pred_new <- predict(class_tr_boost, new_customer)
class_tr_boost_pred_new$prob
##           [,1]      [,2]
## [1,] 0.3732615 0.6267385
class_tr_boost_pred_new$class
## [1] "Yes"
class_tr_boost_pred_new_2 <- predict(class_tr_boost, new_customer_2)
class_tr_boost_pred_new_2$prob
##           [,1]      [,2]
## [1,] 0.3459454 0.6540546
## [2,] 0.3196193 0.6803807
## [3,] 0.5708293 0.4291707
class_tr_boost_pred_new_2$class
## [1] "Yes" "Yes" "No"

And someone is hungover :-)