Classification Hangover
1. Libraries
Load the libraries
## Loading required package: ggplot2
## Loading required package: lattice
2. Load data
Load the data
## ID Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1 1 Fri 1 2 703 1 0
## 2 2 Sat 0 8 287 0 1
## 3 3 Wed 0 3 346 1 0
## 4 4 Sat 0 1 312 0 1
## 5 5 Mon 1 5 919 0 1
## 6 6 Mon 0 5 926 1 0
## 7 7 Fri 1 3 193 1 0
## 8 8 Tue 0 10 710 1 1
## 9 9 Wed 1 5 47 0 0
## 10 10 Thu 1 8 280 1 0
## 'data.frame': 2000 obs. of 7 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Night : chr "Fri" "Sat" "Wed" "Sat" ...
## $ Special_Event : int 1 0 0 0 1 0 1 0 1 1 ...
## $ Number_of_Drinks: int 2 8 3 1 5 5 3 10 5 8 ...
## $ Spent : int 703 287 346 312 919 926 193 710 47 280 ...
## $ Chow : int 1 0 1 0 0 1 1 1 0 1 ...
## $ Hangover : int 0 1 0 1 1 0 0 1 0 0 ...
## [1] "ID" "Night" "Special_Event" "Number_of_Drinks"
## [5] "Spent" "Chow" "Hangover"
## [1] 2000
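The import chunk itself is not shown. It presumably reads the CSV into a data frame named `party` (the name used in the split code later) and then inspects it with head(), str(), names(), and nrow(). Since the file is not available here, this sketch builds a synthetic stand-in with the same columns; the file name and the generated values are assumptions:

```r
# party <- read.csv("classification_hangover.csv")  # hypothetical file name
# Synthetic stand-in with the same structure, for illustration only:
set.seed(1)
n <- 2000
party <- data.frame(
  ID = 1:n,
  Night = sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                 n, replace = TRUE),
  Special_Event = sample(0:1, n, replace = TRUE),
  Number_of_Drinks = sample(1:10, n, replace = TRUE),
  Spent = sample(10:999, n, replace = TRUE),
  Chow = sample(0:1, n, replace = TRUE),
  Hangover = sample(0:1, n, replace = TRUE)
)
head(party, 10)  # first 10 rows
str(party)       # structure: 2000 obs. of 7 variables
names(party)     # variable names
nrow(party)      # number of records
```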
Remove unnecessary variables for this model
## [1] "Night" "Special_Event" "Number_of_Drinks" "Spent"
## [5] "Chow" "Hangover"
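The drop step is presumably a one-line removal of the ID column, which carries no predictive information. A sketch, seeding a toy frame with the first rows from the head() output above:

```r
# Toy frame built from the first five rows shown earlier
party <- data.frame(ID = 1:5,
                    Night = c("Fri", "Sat", "Wed", "Sat", "Mon"),
                    Special_Event = c(1, 0, 0, 0, 1),
                    Number_of_Drinks = c(2, 8, 3, 1, 5),
                    Spent = c(703, 287, 346, 312, 919),
                    Chow = c(1, 0, 1, 0, 0),
                    Hangover = c(0, 1, 0, 1, 1))
# Drop the ID column (or: subset(party, select = -ID))
party <- party[ , -1]
names(party)
```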
Recode the target variable from 0/1 to No/Yes.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1 Fri 1 2 703 1 No
## 2 Sat 0 8 287 0 Yes
## 3 Wed 0 3 346 1 No
## 4 Sat 0 1 312 0 Yes
## 5 Mon 1 5 919 0 Yes
## 6 Mon 0 5 926 1 No
Set Hangover as a factor. This is needed for confusionMatrix() to work later.
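Neither the recode chunk nor the factor-conversion chunk is shown. Given the dplyr attach message above, the recode was likely done with mutate(); a base-R equivalent, sketched on a toy column:

```r
party <- data.frame(Hangover = c(0, 1, 0, 1, 1, 0))  # toy target column
# Recode 0/1 to "No"/"Yes" (the document attaches dplyr; base R shown here)
party$Hangover <- ifelse(party$Hangover == 1, "Yes", "No")
# Set as a factor so confusionMatrix() treats it as a class label
party$Hangover <- factor(party$Hangover, levels = c("No", "Yes"))
str(party$Hangover)
```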
3. Training-validation split
Set the seed using our favourite number :-)
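The seed value itself is not shown; 42 below is only a placeholder for the "favourite number":

```r
set.seed(42)  # placeholder; the actual seed is not shown in the document
```

Setting the seed makes the random split below reproducible across runs.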
Create the indices for the split. This samples row indices to divide the data into training and validation sets.
train_index <- sample(1:nrow(party), 0.6 * nrow(party))
valid_index <- setdiff(1:nrow(party), train_index)
Using the indices, create the training and validation sets. This is similar in principle to subsetting a data frame by row.
It is a good habit to check after splitting
## [1] 1200
## [1] 800
## Night Special_Event Number_of_Drinks Spent Chow Hangover
## 1598 Sun 0 2 196 0 Yes
## 638 Wed 0 5 754 1 No
## 608 Tue 0 3 71 1 Yes
## 907 Wed 1 1 685 0 No
## 1147 Sat 0 7 850 0 Yes
## 1564 Sun 1 9 21 0 Yes
## Night Special_Event Number_of_Drinks Spent Chow Hangover
## 2 Sat 0 8 287 0 Yes
## 5 Mon 1 5 919 0 Yes
## 6 Mon 0 5 926 1 No
## 10 Thu 1 8 280 1 No
## 11 Tue 0 10 532 0 Yes
## 12 Wed 1 10 68 0 Yes
## 'data.frame': 1200 obs. of 6 variables:
## $ Night : chr "Sun" "Wed" "Tue" "Wed" ...
## $ Special_Event : int 0 0 0 1 0 1 0 0 0 0 ...
## $ Number_of_Drinks: int 2 5 3 1 7 9 7 5 6 10 ...
## $ Spent : int 196 754 71 685 850 21 461 618 593 904 ...
## $ Chow : int 0 1 1 0 0 0 1 1 1 1 ...
## $ Hangover : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 2 2 2 ...
## 'data.frame': 800 obs. of 6 variables:
## $ Night : chr "Sat" "Mon" "Mon" "Thu" ...
## $ Special_Event : int 0 1 0 1 0 1 1 1 1 1 ...
## $ Number_of_Drinks: int 8 5 5 8 10 10 6 9 1 5 ...
## $ Spent : int 287 919 926 280 532 68 573 489 890 14 ...
## $ Chow : int 0 0 1 1 0 0 1 0 0 1 ...
## $ Hangover : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 1 1 ...
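The subsetting chunk is not shown; it is presumably plain row indexing with the two index vectors, followed by the dimension checks. A self-contained sketch on a toy frame:

```r
# Toy frame standing in for `party` (the real data has 2000 rows)
party <- data.frame(x = rnorm(100),
                    Hangover = factor(sample(c("No", "Yes"), 100,
                                             replace = TRUE)))
set.seed(42)  # placeholder seed
train_index <- sample(1:nrow(party), 0.6 * nrow(party))
valid_index <- setdiff(1:nrow(party), train_index)
# Row subsetting with the index vectors
train_df <- party[train_index, ]
valid_df <- party[valid_index, ]
# Check the split: rows should add up to the original count
nrow(train_df)
nrow(valid_df)
```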
4. Classification tree
4.1 The tree
Build the tree on the training set. This amounts to training the model.
## [1] "Night" "Special_Event" "Number_of_Drinks" "Spent"
## [5] "Chow" "Hangover"
class_tr <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks +
Spent + Chow,
data = train_df, method = "class",
maxdepth = 30)
Plot the tree. Try different settings to tweak the format.
4.2 Alternate specifications
Like any model, we can adjust the specifications to build different trees. Some may be better than others. In this case, we are adjusting the minbucket and maxdepth parameters.
class_tr_v2 <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks +
Spent + Chow,
data = train_df, method = "class", minbucket = 2,
maxdepth = 10)
prp(class_tr_v2, cex = 0.8, tweak = 1)
A nicer plot.
class_tr_v3 <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks +
Spent + Chow,
data = train_df, method = "class", minbucket = 3,
maxdepth = 3)
prp(class_tr_v3, cex = 0.8, tweak = 1)
A nicer plot.
4.3 Complexity parameter
Use cp to prune the tree.
##
## Classification tree:
## rpart(formula = Hangover ~ Night + Special_Event + Number_of_Drinks +
## Spent + Chow, data = train_df, method = "class", maxdepth = 30)
##
## Variables actually used in tree construction:
## [1] Chow Night Number_of_Drinks Special_Event
##
## Root node error: 522/1200 = 0.435
##
## n= 1200
##
## CP nsplit rel error xerror xstd
## 1 0.181992 0 1.00000 1.00000 0.032899
## 2 0.053640 1 0.81801 0.81801 0.031772
## 3 0.032567 2 0.76437 0.76820 0.031303
## 4 0.026820 4 0.69923 0.76628 0.031283
## 5 0.021073 5 0.67241 0.73563 0.030956
## 6 0.010000 6 0.65134 0.74713 0.031082
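The table above is the output of printcp(); the chunk itself is not shown. A sketch on synthetic data (printcp() and plotcp() are in rpart, which ships with R; the toy data are an assumption):

```r
library(rpart)

# Toy data with one informative predictor
set.seed(42)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- factor(ifelse(toy$x1 + rnorm(300) > 0, "Yes", "No"))

fit <- rpart(y ~ x1 + x2, data = toy, method = "class")
printcp(fit)  # the cp table above comes from a call like this
plotcp(fit)   # visualise cross-validated error against cp
```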
Select tree number 5, which has the lowest cross-validated error (xerror = 0.73563). Any cp value between 0.010000 and 0.021073 reproduces this tree; 0.011407 is used below.
class_tr_cp <- rpart(Hangover ~ Night + Special_Event + Number_of_Drinks +
Spent + Chow,
data = train_df, method = "class",
maxdepth = 30, cp = 0.011407)
prp(class_tr_cp, cex = 0.8, tweak = 1)
Alternate plot.
5. Model Evaluation
5.1 Confusion matrix
For all classification models, we have to generate a confusion matrix to see how good (or lousy) our model is.
For training set.
class_tr_train_predict <- predict(class_tr, train_df,
type = "class")
t(t(head(class_tr_train_predict,10)))
## [,1]
## 1598 Yes
## 638 Yes
## 608 No
## 907 No
## 1147 Yes
## 1564 Yes
## 654 Yes
## 1925 Yes
## 873 Yes
## 652 Yes
## Levels: No Yes
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 281 99
## Yes 241 579
##
## Accuracy : 0.7167
## 95% CI : (0.6902, 0.742)
## No Information Rate : 0.565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.405
##
## Mcnemar's Test P-Value : 2.06e-14
##
## Sensitivity : 0.8540
## Specificity : 0.5383
## Pos Pred Value : 0.7061
## Neg Pred Value : 0.7395
## Prevalence : 0.5650
## Detection Rate : 0.4825
## Detection Prevalence : 0.6833
## Balanced Accuracy : 0.6961
##
## 'Positive' Class : Yes
##
con_mat_train_1 <- confusionMatrix(class_tr_train_predict, train_df$Hangover,
positive = "Yes")
sensitivity_train <- con_mat_train_1$byClass[1]
precision_train <- con_mat_train_1$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))
# Use this to avoid awkward naming. It's just the way it works.
# f1_train <- unname(f1_train)
paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.773030707610147"
Then we use the model to predict the outcome for the validation set.
This will tell us how good (or lousy) our predictions are.
class_tr_valid_predict <- predict(class_tr, valid_df,
type = "class")
t(t(head(class_tr_valid_predict,10)))
## [,1]
## 2 Yes
## 5 Yes
## 6 Yes
## 10 Yes
## 11 Yes
## 12 Yes
## 14 Yes
## 15 Yes
## 19 No
## 22 Yes
## Levels: No Yes
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 158 77
## Yes 171 394
##
## Accuracy : 0.69
## 95% CI : (0.6567, 0.7219)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 2.091e-09
##
## Kappa : 0.331
##
## Mcnemar's Test P-Value : 3.516e-09
##
## Sensitivity : 0.8365
## Specificity : 0.4802
## Pos Pred Value : 0.6973
## Neg Pred Value : 0.6723
## Prevalence : 0.5887
## Detection Rate : 0.4925
## Detection Prevalence : 0.7063
## Balanced Accuracy : 0.6584
##
## 'Positive' Class : Yes
##
con_mat_valid_1 <- confusionMatrix(class_tr_valid_predict, valid_df$Hangover,
positive = "Yes")
sensitivity_valid <- con_mat_valid_1$byClass[1]
precision_valid <- con_mat_valid_1$byClass[3]
f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))
# Use this to avoid awkward naming. It's just the way it works.
# f1_valid <- unname(f1_valid)
paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.760617760617761"
Compute the probabilities.
class_tr_valid_predict_prob <- predict(class_tr, valid_df,
type = "prob")
head(class_tr_valid_predict_prob)
## No Yes
## 2 0.2312634 0.7687366
## 5 0.4179104 0.5820896
## 6 0.3820225 0.6179775
## 10 0.2312634 0.7687366
## 11 0.2312634 0.7687366
## 12 0.2312634 0.7687366
For a different cutoff…
confusionMatrix(as.factor(ifelse(class_tr_valid_predict_prob[,2] > 0.75,
"Yes", "No")),
valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 249 228
## Yes 80 243
##
## Accuracy : 0.615
## 95% CI : (0.5803, 0.6489)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 0.07002
##
## Kappa : 0.2554
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.5159
## Specificity : 0.7568
## Pos Pred Value : 0.7523
## Neg Pred Value : 0.5220
## Prevalence : 0.5887
## Detection Rate : 0.3038
## Detection Prevalence : 0.4037
## Balanced Accuracy : 0.6364
##
## 'Positive' Class : Yes
##
5.2 ROC
## Loaded ROSE 0.0-4
## Area under the curve (AUC): 0.658
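The AUC above presumably comes from ROSE's roc.curve() (note the ROSE load message). As a package-free illustration of what that number means, the AUC equals the probability that a randomly chosen "Yes" record is scored above a randomly chosen "No" record (ties counted as half); the helper below is an assumption for illustration, not the document's code:

```r
# AUC = P(score of a random "Yes" > score of a random "No"), ties count half
auc_manual <- function(labels, scores) {
  pos <- scores[labels == "Yes"]
  neg <- scores[labels == "No"]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Toy check: perfectly separated scores give AUC = 1
auc_manual(c("No", "No", "Yes", "Yes"), c(0.1, 0.2, 0.8, 0.9))
```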
5.3 Cumulative plots
We can perform further evaluation with cumulative gains and lift charts.
Load the library.
## Package modelplotr loaded! Happy model plotting!
Prepare scores for evaluation.
scores_and_ntiles <- prepare_scores_and_ntiles(datasets =
list("valid_df"),
dataset_labels =
list("Validation data"),
models =
list("class_tr"),
model_labels =
list("CART"),
target_column = "Hangover",
ntiles = 10)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the modelplotr package.
## Please report the issue at <https://github.com/jurrr/modelplotr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## ... scoring caret model "class_tr" on dataset "valid_df".
## Data preparation step 1 succeeded! Dataframe created.
Specify the select_targetclass argument for the preferred class. In this case, we select “Yes”.
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the modelplotr package.
## Please report the issue at <https://github.com/jurrr/modelplotr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
##
## No comparison specified, default values are used.
##
## Single evaluation line will be plotted: Target value "Yes" plotted for dataset "Validation data" and model "CART.
## "
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".
Cumulative gains.
Cumulative lift.
Response plot.
Cumulative response plot.
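The four plots above presumably come from modelplotr's plotting functions applied to a plotting_scope() result, mirroring the calls shown for section 7.3. A sketch, assuming scores_and_ntiles from the preparation chunk above:

```r
library(modelplotr)

# Scope the evaluation to the "Yes" class (assumes scores_and_ntiles exists)
plot_input <- plotting_scope(prepared_input = scores_and_ntiles,
                             select_targetclass = "Yes")

plot_cumgains(data = plot_input)     # cumulative gains
plot_cumlift(data = plot_input)      # cumulative lift
plot_response(data = plot_input)     # response plot
plot_cumresponse(data = plot_input)  # cumulative response plot
```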
6. Predict new record
The purpose of a classification model is to predict the outcome for new records. We will predict whether a new customer will experience a hangover. Import the new record.
## Night Special_Event Number_of_Drinks Spent Chow
## 1 Fri 0 6 666 1
Hangover for the new record.
Based on our classification model.
## No Yes
## 1 0.3820225 0.6179775
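The chunks that import and score new_customer are not shown. A self-contained sketch of the same predict() pattern, using a toy rpart model (the toy data and the single predictor are assumptions):

```r
library(rpart)

# Toy training frame; the document trains on train_df with the party columns
set.seed(42)
toy <- data.frame(Number_of_Drinks = sample(1:10, 200, replace = TRUE))
toy$Hangover <- factor(ifelse(toy$Number_of_Drinks + rnorm(200, sd = 2) > 5,
                              "Yes", "No"))
fit <- rpart(Hangover ~ Number_of_Drinks, data = toy, method = "class")

# The new record presumably comes from a CSV or is typed in directly
new_customer <- data.frame(Night = "Fri", Special_Event = 0,
                           Number_of_Drinks = 6, Spent = 666, Chow = 1)

predict(fit, newdata = new_customer, type = "prob")   # class probabilities
predict(fit, newdata = new_customer, type = "class")  # predicted class
```

Columns not used in the formula (Night, Spent, etc.) are simply ignored by predict().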
6.1 Predicting more than 1 new record
Sometimes we need to predict the outcomes of several new records.
## Night Special_Event Number_of_Drinks Spent Chow
## 1 Sat 0 2 333 1
## 2 Mon 1 8 888 1
## 3 Wed 0 6 222 0
Hangover for the new records.
Based on our classification model.
hangover_class_tr_2 <- predict(class_tr, newdata = new_customer_2,
type = "class")
hangover_class_tr_2
## 1 2 3
## Yes Yes No
## Levels: No Yes
hangover_class_tr_2 <- predict(class_tr, newdata = new_customer_2,
type = "prob")
hangover_class_tr_2
## No Yes
## 1 0.3425926 0.6574074
## 2 0.2312634 0.7687366
## 3 0.6213592 0.3786408
Only 1 visit did not end up with a hangover :-)
7. k-fold cross-validation with caret
Set up the control object for 10-fold cross-validation repeated 3 times.
Then use caret to train the decision tree with that control, tuning over 15 values of the cp parameter for rpart.
This returns the best model trained on all the training data!
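The train() call below references caret_control_k10, whose definition is not shown; it is presumably a trainControl object for repeated k-fold cross-validation:

```r
library(caret)

# Repeated k-fold CV: 10 folds, repeated 3 times
caret_control_k10 <- trainControl(method = "repeatedcv",
                                  number = 10,
                                  repeats = 3)
```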
class_tr_k10 <- train(Hangover ~ Night + Special_Event + Number_of_Drinks +
Spent + Chow,
data = train_df,
method = "rpart",
trControl = caret_control_k10,
tuneLength = 15)
Display the results of the cross validation run.
## CART
##
## 1200 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1080, 1079, 1080, 1080, 1080, 1080, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0009578544 0.6586425 0.2955548
## 0.0014367816 0.6641796 0.3065570
## 0.0019157088 0.6661287 0.3099902
## 0.0022988506 0.6791822 0.3344577
## 0.0025542784 0.6802980 0.3355712
## 0.0028735632 0.6850042 0.3436703
## 0.0038314176 0.6908307 0.3530247
## 0.0057471264 0.6913749 0.3506167
## 0.0076628352 0.6861015 0.3385386
## 0.0086206897 0.6841548 0.3367659
## 0.0090996169 0.6841548 0.3367659
## 0.0181992337 0.6674786 0.3185716
## 0.0268199234 0.6544597 0.2856925
## 0.0536398467 0.6422444 0.2744946
## 0.1819923372 0.5985852 0.1599264
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.005747126.
7.1 Confusion matrix
Training set.
class_tr_train_predict_k10 <- predict(class_tr_k10, train_df,
type = "raw")
confusionMatrix(class_tr_train_predict_k10, train_df$Hangover,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 280 81
## Yes 242 597
##
## Accuracy : 0.7308
## 95% CI : (0.7048, 0.7558)
## No Information Rate : 0.565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4323
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8805
## Specificity : 0.5364
## Pos Pred Value : 0.7116
## Neg Pred Value : 0.7756
## Prevalence : 0.5650
## Detection Rate : 0.4975
## Detection Prevalence : 0.6992
## Balanced Accuracy : 0.7085
##
## 'Positive' Class : Yes
##
con_mat_train_cv <- confusionMatrix(class_tr_train_predict_k10, train_df$Hangover,
positive = "Yes")
sensitivity_train <- con_mat_train_cv$byClass[1]
precision_train <- con_mat_train_cv$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))
# Use this to avoid awkward naming. It's just the way it works.
# f1_train <- unname(f1_train)
paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.787079762689519"
class_tr_valid_predict_k10 <- predict(class_tr_k10, valid_df,
type = "raw")
confusionMatrix(class_tr_valid_predict_k10, valid_df$Hangover,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 156 76
## Yes 173 395
##
## Accuracy : 0.6888
## 95% CI : (0.6554, 0.7207)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 3.271e-09
##
## Kappa : 0.3274
##
## Mcnemar's Test P-Value : 1.174e-09
##
## Sensitivity : 0.8386
## Specificity : 0.4742
## Pos Pred Value : 0.6954
## Neg Pred Value : 0.6724
## Prevalence : 0.5887
## Detection Rate : 0.4938
## Detection Prevalence : 0.7100
## Balanced Accuracy : 0.6564
##
## 'Positive' Class : Yes
##
con_mat_valid_cv <- confusionMatrix(class_tr_valid_predict_k10, valid_df$Hangover,
positive = "Yes")
sensitivity_valid <- con_mat_valid_cv$byClass[1]
precision_valid <- con_mat_valid_cv$byClass[3]
f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))
# Use this to avoid awkward naming. It's just the way it works.
# f1_valid <- unname(f1_valid)
paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.760346487006737"
7.2 ROC
## Area under the curve (AUC): 0.684
7.3 Cumulative plots
Prepare scores.
library(modelplotr)
scores_and_ntiles_best <- prepare_scores_and_ntiles(datasets =
list("valid_df"),
dataset_labels =
list("Validation data"),
models =
list("class_tr_k10"),
model_labels =
list("CART"),
target_column = "Hangover",
ntiles = 10)
## ... scoring caret model "class_tr_k10" on dataset "valid_df".
## Data preparation step 1 succeeded! Dataframe created.
Specify the select_targetclass argument for the preferred class.
plot_input_best <- plotting_scope(prepared_input = scores_and_ntiles_best,
select_targetclass = "Yes")
## Data preparation step 2 succeeded! Dataframe created.
## "prepared_input" aggregated...
## Data preparation step 3 succeeded! Dataframe created.
##
## No comparison specified, default values are used.
##
## Single evaluation line will be plotted: Target value "Yes" plotted for dataset "Validation data" and model "CART.
## "
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".
Cumulative gains for the cv model.
Cumulative lift for the cv model.
Response plot for the cv model.
Cumulative response plot for the cv model.
8. Predict new record with cv model
## [1] Yes
## Levels: No Yes
## [1] Yes Yes No
## Levels: No Yes
Compute the probabilities.
hangover_class_tr_k10_prob <- predict(class_tr_k10, newdata = new_customer,
type = "prob")
hangover_class_tr_k10_prob
## No Yes
## 1 0.3820225 0.6179775
hangover_class_tr_k10_prob_2 <- predict(class_tr_k10, newdata = new_customer_2,
type = "prob")
hangover_class_tr_k10_prob_2
## No Yes
## 1 0.3425926 0.6574074
## 2 0.2312634 0.7687366
## 3 0.7333333 0.2666667
9. Improved trees
9.1 Random forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
class_tr_rf <- randomForest(Hangover ~ Night + Special_Event +
Number_of_Drinks +
Spent + Chow,
data = train_df, ntree = 200,
nodesize = 5, importance = TRUE)
The confusion matrix for the random forest.
class_tr_rf_pred_train <- predict(class_tr_rf, train_df)
confusionMatrix(class_tr_rf_pred_train, train_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 416 68
## Yes 106 610
##
## Accuracy : 0.855
## 95% CI : (0.8338, 0.8744)
## No Information Rate : 0.565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7025
##
## Mcnemar's Test P-Value : 0.005032
##
## Sensitivity : 0.8997
## Specificity : 0.7969
## Pos Pred Value : 0.8520
## Neg Pred Value : 0.8595
## Prevalence : 0.5650
## Detection Rate : 0.5083
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.8483
##
## 'Positive' Class : Yes
##
con_mat_train_rf <- confusionMatrix(class_tr_rf_pred_train,
                                    train_df$Hangover, positive = "Yes")
sensitivity_train <- con_mat_train_rf$byClass[1]
precision_train <- con_mat_train_rf$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))
# Use this to avoid awkward naming. It's just the way it works.
# f1_train <- unname(f1_train)
paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.875179340028694"
class_tr_rf_pred_valid <- predict(class_tr_rf, valid_df)
confusionMatrix(class_tr_rf_pred_valid, valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 200 122
## Yes 129 349
##
## Accuracy : 0.6862
## 95% CI : (0.6528, 0.7183)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 7.866e-09
##
## Kappa : 0.35
##
## Mcnemar's Test P-Value : 0.7049
##
## Sensitivity : 0.7410
## Specificity : 0.6079
## Pos Pred Value : 0.7301
## Neg Pred Value : 0.6211
## Prevalence : 0.5887
## Detection Rate : 0.4363
## Detection Prevalence : 0.5975
## Balanced Accuracy : 0.6744
##
## 'Positive' Class : Yes
##
con_mat_valid_rf <- confusionMatrix(class_tr_rf_pred_valid,
valid_df$Hangover, positive = "Yes")
sensitivity_valid <- con_mat_valid_rf$byClass[1]
precision_valid <- con_mat_valid_rf$byClass[3]
f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))
# Use this to avoid awkward naming. It's just the way it works.
# f1_valid <- unname(f1_valid)
paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.735511064278188"
Variable importance in the random forest.
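The importance chunk is not shown; with randomForest it is presumably importance() and varImpPlot(). A self-contained sketch on toy data (the toy data are an assumption):

```r
library(randomForest)

set.seed(42)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- factor(ifelse(toy$x1 + rnorm(200) > 0, "Yes", "No"))

rf <- randomForest(y ~ x1 + x2, data = toy,
                   ntree = 100, importance = TRUE)
importance(rf)  # numeric importance measures per predictor
varImpPlot(rf)  # the plot above is presumably produced like this
```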
ROC for the random forest.
## Area under the curve (AUC): 0.674
Predict using the random forest.
## 1
## Yes
## Levels: No Yes
## 1 2 3
## Yes Yes No
## Levels: No Yes
The probabilities
class_tr_rf_pred_new_prob <- predict(class_tr_rf, new_customer, type = "prob")
class_tr_rf_pred_new_prob
## No Yes
## 1 0.145 0.855
## attr(,"class")
## [1] "matrix" "array" "votes"
class_tr_rf_pred_new_prob_2 <- predict(class_tr_rf, new_customer_2, type = "prob")
class_tr_rf_pred_new_prob_2
## No Yes
## 1 0.235 0.765
## 2 0.210 0.790
## 3 0.560 0.440
## attr(,"class")
## [1] "matrix" "array" "votes"
Optimal cutoff
class_tr_rf_train_predict_prob <- predict(class_tr_rf, train_df,
type = "prob")
class_tr_rf_train_predict_prob_df <- as.data.frame(class_tr_rf_train_predict_prob)
head(class_tr_rf_train_predict_prob_df[2])
## Yes
## 1598 0.605
## 638 0.395
## 608 0.460
## 907 0.100
## 1147 0.905
## 1564 0.850
class_tr_rf_valid_predict_prob <- predict(class_tr_rf, valid_df,
type = "prob")
class_tr_rf_valid_predict_prob_df <- as.data.frame(class_tr_rf_valid_predict_prob)
head(class_tr_rf_valid_predict_prob_df[2])
## Yes
## 2 0.975
## 5 0.830
## 6 0.450
## 10 0.640
## 11 0.990
## 12 0.770
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_valid <- roc(valid_df$Hangover,
class_tr_rf_valid_predict_prob_df[,2],
levels = c("No", "Yes"),
direction = "<")
opt_cutoff_best <- coords(roc_valid, "best",
ret = "threshold",
best.method = "youden")[[1]]
paste0("Optimal cutoff for validation set (Youden's J): ", opt_cutoff_best)
## [1] "Optimal cutoff for validation set (Youden's J): 0.4625"
Confusion matrix with optimal cutoff.
confusionMatrix(as.factor(ifelse(class_tr_rf_train_predict_prob_df[,2] > 0.4625,
"Yes", "No")),
train_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 394 61
## Yes 128 617
##
## Accuracy : 0.8425
## 95% CI : (0.8206, 0.8627)
## No Information Rate : 0.565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6748
##
## Mcnemar's Test P-Value : 1.58e-06
##
## Sensitivity : 0.9100
## Specificity : 0.7548
## Pos Pred Value : 0.8282
## Neg Pred Value : 0.8659
## Prevalence : 0.5650
## Detection Rate : 0.5142
## Detection Prevalence : 0.6208
## Balanced Accuracy : 0.8324
##
## 'Positive' Class : Yes
##
con_mat_train_rf_2 <- confusionMatrix(as.factor(ifelse(
class_tr_rf_train_predict_prob_df[,2] > 0.4625,"Yes", "No")),
train_df$Hangover, positive = "Yes")
sensitivity_train <- con_mat_train_rf_2$byClass[1]
precision_train <- con_mat_train_rf_2$byClass[3]
f1_train <- 2/((1/sensitivity_train) + (1/precision_train))
paste("The F1 score for training is", f1_train)
## [1] "The F1 score for training is 0.86718200983837"
confusionMatrix(as.factor(ifelse(class_tr_rf_valid_predict_prob_df[,2] > 0.4625,
"Yes", "No")),
valid_df$Hangover, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 190 105
## Yes 139 366
##
## Accuracy : 0.695
## 95% CI : (0.6618, 0.7268)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 3.299e-10
##
## Kappa : 0.3602
##
## Mcnemar's Test P-Value : 0.03463
##
## Sensitivity : 0.7771
## Specificity : 0.5775
## Pos Pred Value : 0.7248
## Neg Pred Value : 0.6441
## Prevalence : 0.5887
## Detection Rate : 0.4575
## Detection Prevalence : 0.6312
## Balanced Accuracy : 0.6773
##
## 'Positive' Class : Yes
##
con_mat_valid_rf_2 <- confusionMatrix(as.factor(ifelse(
class_tr_rf_valid_predict_prob_df[,2] > 0.4625,"Yes", "No")),
valid_df$Hangover, positive = "Yes")
sensitivity_valid <- con_mat_valid_rf_2$byClass[1]
precision_valid <- con_mat_valid_rf_2$byClass[3]
f1_valid <- 2/((1/sensitivity_valid) + (1/precision_valid))
# Use this to avoid awkward naming. It's just the way it works.
# f1_valid <- unname(f1_valid)
paste("The F1 score for validation is", f1_valid)
## [1] "The F1 score for validation is 0.75"
Apply the optimal cutoff (rather than the default 0.5) when classifying the new records.
class_tr_rf_pred_new_prob_2_df <- as.data.frame(class_tr_rf_pred_new_prob_2)
class_tr_rf_pred_new_prob_2_df
## No Yes
## 1 0.235 0.765
## 2 0.210 0.790
## 3 0.560 0.440
class_tr_rf_pred_new_prob_2_df$Prediction <- ifelse(class_tr_rf_pred_new_prob_2_df$Yes > 0.4625, "Yes", "No")
class_tr_rf_pred_new_prob_2_df
## No Yes Prediction
## 1 0.235 0.765 Yes
## 2 0.210 0.790 Yes
## 3 0.560 0.440 No
9.2 Boosted tree
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
class_tr_boost <- boosting(Hangover ~ Night + Special_Event +
Number_of_Drinks +
Spent + Chow,
data = train_df)
Prediction using the boosted tree.
Predicted classes.
## [1] "Yes" "No" "No" "No" "Yes" "Yes"
## [1] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes"
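The prediction chunks are not shown. With adabag's predict.boosting() they presumably look like this (the object names are taken from the factor-conversion code below; the sketch assumes class_tr_boost, train_df, and valid_df from earlier chunks):

```r
# Score the boosted tree on both sets; predict.boosting returns a list
# with $class (character predictions), $prob, and $confusion
class_tr_boost_pred_train <- predict(class_tr_boost, newdata = train_df)
class_tr_boost_pred_valid <- predict(class_tr_boost, newdata = valid_df)

head(class_tr_boost_pred_train$class)  # predicted classes, training set
head(class_tr_boost_pred_valid$class)  # predicted classes, validation set
```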
Confusion matrix from the boosted tree.
## Observed Class
## Predicted Class No Yes
## No 387 77
## Yes 135 601
## Observed Class
## Predicted Class No Yes
## No 179 118
## Yes 150 353
Check data types.
## Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 1 2 1 1 ...
## chr [1:800] "Yes" "Yes" "Yes" "Yes" "Yes" "Yes" "No" "Yes" "No" "Yes" "No" ...
Convert the predicted classes to factors, then generate the confusion matrix with statistics.
class_tr_boost_pred_train$class <- as.factor(class_tr_boost_pred_train$class)
confusionMatrix(class_tr_boost_pred_train$class, train_df$Hangover,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 387 77
## Yes 135 601
##
## Accuracy : 0.8233
## 95% CI : (0.8006, 0.8445)
## No Information Rate : 0.565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6359
##
## Mcnemar's Test P-Value : 9.049e-05
##
## Sensitivity : 0.8864
## Specificity : 0.7414
## Pos Pred Value : 0.8166
## Neg Pred Value : 0.8341
## Prevalence : 0.5650
## Detection Rate : 0.5008
## Detection Prevalence : 0.6133
## Balanced Accuracy : 0.8139
##
## 'Positive' Class : Yes
##
class_tr_boost_pred_valid$class <- as.factor(class_tr_boost_pred_valid$class)
confusionMatrix(class_tr_boost_pred_valid$class, valid_df$Hangover,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 179 118
## Yes 150 353
##
## Accuracy : 0.665
## 95% CI : (0.6311, 0.6977)
## No Information Rate : 0.5888
## P-Value [Acc > NIR] : 5.547e-06
##
## Kappa : 0.2979
##
## Mcnemar's Test P-Value : 0.05827
##
## Sensitivity : 0.7495
## Specificity : 0.5441
## Pos Pred Value : 0.7018
## Neg Pred Value : 0.6027
## Prevalence : 0.5887
## Detection Rate : 0.4412
## Detection Prevalence : 0.6288
## Balanced Accuracy : 0.6468
##
## 'Positive' Class : Yes
##
ROC for the boosted tree.
## Area under the curve (AUC): 0.647
Predict using the boosted tree.
## [,1] [,2]
## [1,] 0.3732615 0.6267385
## [1] "Yes"
## [,1] [,2]
## [1,] 0.3459454 0.6540546
## [2,] 0.3196193 0.6803807
## [3,] 0.5708293 0.4291707
## [1] "Yes" "Yes" "No"
And someone is hungover :-)