Directions

Every rose has its thorn

Just like every night has its dawn

And every cowboy sings a sad, sad song

Every rose has its thorn

Data for demo

1. Load data

pokemon <- read.csv("pokemon_2.csv", header = TRUE)

Explore the pokemon (i.e. data) :-)

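Notice that the train variable (i.e. whether to train the pokemon or not) is highly unbalanced: only 80 of the 801 pokemon (about 10%) have train = 1, as the table() output at the end of this block shows.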

head(pokemon)
##                     abilities against_bug against_dark against_dragon
## 1 ['Overgrow', 'Chlorophyll']        1.00            1              1
## 2 ['Overgrow', 'Chlorophyll']        1.00            1              1
## 3 ['Overgrow', 'Chlorophyll']        1.00            1              1
## 4    ['Blaze', 'Solar Power']        0.50            1              1
## 5    ['Blaze', 'Solar Power']        0.50            1              1
## 6    ['Blaze', 'Solar Power']        0.25            1              1
##   against_electric against_fairy against_fight against_fire against_flying
## 1              0.5           0.5           0.5          2.0              2
## 2              0.5           0.5           0.5          2.0              2
## 3              0.5           0.5           0.5          2.0              2
## 4              1.0           0.5           1.0          0.5              1
## 5              1.0           0.5           1.0          0.5              1
## 6              2.0           0.5           0.5          0.5              1
##   against_ghost against_grass against_ground against_ice against_normal
## 1             1          0.25              1         2.0              1
## 2             1          0.25              1         2.0              1
## 3             1          0.25              1         2.0              1
## 4             1          0.50              2         0.5              1
## 5             1          0.50              2         0.5              1
## 6             1          0.25              0         1.0              1
##   against_poison against_psychic against_rock against_steel against_water
## 1              1               2            1           1.0           0.5
## 2              1               2            1           1.0           0.5
## 3              1               2            1           1.0           0.5
## 4              1               1            2           0.5           2.0
## 5              1               1            2           0.5           2.0
## 6              1               1            4           0.5           2.0
##   attack base_egg_steps base_happiness base_total capture_rate   classfication
## 1     49           5120             70        318           45   Seed Pokí©mon
## 2     62           5120             70        405           45   Seed Pokí©mon
## 3    100           5120             70        625           45   Seed Pokí©mon
## 4     52           5120             70        309           45 Lizard Pokí©mon
## 5     64           5120             70        405           45  Flame Pokí©mon
## 6    104           5120             70        634           45  Flame Pokí©mon
##   defence experience_growth height_m hp              japanese_name       name
## 1      49           1059860      0.7 45 FushigidaneÜ\200´Ü‰‡Ü‰¬Ü\200óÜ\200\215  Bulbasaur
## 2      63           1059860      1.0 60  FushigisouÜ\200´Ü‰‡Ü‰¬Ü‰_܉_    Ivysaur
## 3     123           1059860      2.0 80 FushigibanaÜ\200´Ü‰‡Ü‰¬Ü\200\220Ü\200_   Venusaur
## 4      43           1059860      0.6 39       HitokageÜ\200ÍÜ\200šÜ‰‚܉_ Charmander
## 5      58           1059860      1.1 58        LizardoÜ\200È܉_Ü\200_Ü\200Š Charmeleon
## 6      78           1059860      1.7 78    LizardonÜ\200È܉_Ü\200_Ü\200ŠÜ\200_  Charizard
##   percentage_male pokedex_number sp_attack sp_defence speed type1  type2
## 1            88.1              1        65         65    45 grass poison
## 2            88.1              2        80         80    60 grass poison
## 3            88.1              3       122        120    80 grass poison
## 4            88.1              4        60         50    65  fire       
## 5            88.1              5        80         65    80  fire       
## 6            88.1              6       159        115   100  fire flying
##   weight_kg generation is_legendary train
## 1       6.9          1            0     1
## 2      13.0          1            0     1
## 3     100.0          1            0     0
## 4       8.5          1            0     0
## 5      19.0          1            0     0
## 6      90.5          1            0     0
str(pokemon)
## 'data.frame':    801 obs. of  42 variables:
##  $ abilities        : chr  "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
##  $ against_bug      : num  1 1 1 0.5 0.5 0.25 1 1 1 1 ...
##  $ against_dark     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_dragon   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_electric : num  0.5 0.5 0.5 1 1 2 2 2 2 1 ...
##  $ against_fairy    : num  0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
##  $ against_fight    : num  0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
##  $ against_fire     : num  2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
##  $ against_flying   : num  2 2 2 1 1 1 1 1 1 2 ...
##  $ against_ghost    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_grass    : num  0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
##  $ against_ground   : num  1 1 1 2 2 0 1 1 1 0.5 ...
##  $ against_ice      : num  2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
##  $ against_normal   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_poison   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ against_psychic  : num  2 2 2 1 1 1 1 1 1 1 ...
##  $ against_rock     : num  1 1 1 2 2 4 1 1 1 2 ...
##  $ against_steel    : num  1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
##  $ against_water    : num  0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
##  $ attack           : int  49 62 100 52 64 104 48 63 103 30 ...
##  $ base_egg_steps   : int  5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
##  $ base_happiness   : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ base_total       : int  318 405 625 309 405 634 314 405 630 195 ...
##  $ capture_rate     : chr  "45" "45" "45" "45" ...
##  $ classfication    : chr  "Seed Pokí©mon" "Seed Pokí©mon" "Seed Pokí©mon" "Lizard Pokí©mon" ...
##  $ defence          : int  49 63 123 43 58 78 65 80 120 35 ...
##  $ experience_growth: int  1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
##  $ height_m         : num  0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
##  $ hp               : int  45 60 80 39 58 78 44 59 79 45 ...
##  $ japanese_name    : chr  "FushigidaneÜ\200´Ü‰‡Ü‰¬Ü\200óÜ\200\215" "FushigisouÜ\200´Ü‰‡Ü‰¬Ü‰_܉_" "FushigibanaÜ\200´Ü‰‡Ü‰¬Ü\200\220Ü\200_" "HitokageÜ\200ÍÜ\200šÜ‰‚܉_" ...
##  $ name             : chr  "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
##  $ percentage_male  : num  88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
##  $ pokedex_number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sp_attack        : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defence       : int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed            : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1            : chr  "grass" "grass" "grass" "fire" ...
##  $ type2            : chr  "poison" "poison" "poison" "" ...
##  $ weight_kg        : num  6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
##  $ generation       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_legendary     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ train            : int  1 1 0 0 0 0 0 0 0 0 ...
summary(pokemon)
##   abilities          against_bug      against_dark   against_dragon  
##  Length:801         Min.   :0.2500   Min.   :0.250   Min.   :0.0000  
##  Class :character   1st Qu.:0.5000   1st Qu.:1.000   1st Qu.:1.0000  
##  Mode  :character   Median :1.0000   Median :1.000   Median :1.0000  
##                     Mean   :0.9963   Mean   :1.057   Mean   :0.9688  
##                     3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
##                     Max.   :4.0000   Max.   :4.000   Max.   :2.0000  
##                                                                      
##  against_electric against_fairy   against_fight    against_fire  
##  Min.   :0.000    Min.   :0.250   Min.   :0.000   Min.   :0.250  
##  1st Qu.:0.500    1st Qu.:1.000   1st Qu.:0.500   1st Qu.:0.500  
##  Median :1.000    Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.074    Mean   :1.069   Mean   :1.066   Mean   :1.135  
##  3rd Qu.:1.000    3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :4.000    Max.   :4.000   Max.   :4.000   Max.   :4.000  
##                                                                  
##  against_flying  against_ghost   against_grass   against_ground 
##  Min.   :0.250   Min.   :0.000   Min.   :0.250   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.500   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.193   Mean   :0.985   Mean   :1.034   Mean   :1.098  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##                                                                 
##   against_ice    against_normal  against_poison   against_psychic
##  Min.   :0.250   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.500   1st Qu.:1.000   1st Qu.:0.5000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.0000   Median :1.000  
##  Mean   :1.208   Mean   :0.887   Mean   :0.9753   Mean   :1.005  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :4.000   Max.   :1.000   Max.   :4.0000   Max.   :4.000  
##                                                                  
##   against_rock  against_steel    against_water       attack      
##  Min.   :0.25   Min.   :0.2500   Min.   :0.250   Min.   :  5.00  
##  1st Qu.:1.00   1st Qu.:0.5000   1st Qu.:0.500   1st Qu.: 55.00  
##  Median :1.00   Median :1.0000   Median :1.000   Median : 75.00  
##  Mean   :1.25   Mean   :0.9835   Mean   :1.058   Mean   : 77.86  
##  3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:100.00  
##  Max.   :4.00   Max.   :4.0000   Max.   :4.000   Max.   :185.00  
##                                                                  
##  base_egg_steps  base_happiness     base_total    capture_rate      
##  Min.   : 1280   Min.   :  0.00   Min.   :180.0   Length:801        
##  1st Qu.: 5120   1st Qu.: 70.00   1st Qu.:320.0   Class :character  
##  Median : 5120   Median : 70.00   Median :435.0   Mode  :character  
##  Mean   : 7191   Mean   : 65.36   Mean   :428.4                     
##  3rd Qu.: 6400   3rd Qu.: 70.00   3rd Qu.:505.0                     
##  Max.   :30720   Max.   :140.00   Max.   :780.0                     
##                                                                     
##  classfication         defence       experience_growth    height_m     
##  Length:801         Min.   :  5.00   Min.   : 600000   Min.   : 0.100  
##  Class :character   1st Qu.: 50.00   1st Qu.:1000000   1st Qu.: 0.600  
##  Mode  :character   Median : 70.00   Median :1000000   Median : 1.000  
##                     Mean   : 73.01   Mean   :1054996   Mean   : 1.164  
##                     3rd Qu.: 90.00   3rd Qu.:1059860   3rd Qu.: 1.500  
##                     Max.   :230.00   Max.   :1640000   Max.   :14.500  
##                                                        NA's   :20      
##        hp         japanese_name          name           percentage_male 
##  Min.   :  1.00   Length:801         Length:801         Min.   :  0.00  
##  1st Qu.: 50.00   Class :character   Class :character   1st Qu.: 50.00  
##  Median : 65.00   Mode  :character   Mode  :character   Median : 50.00  
##  Mean   : 68.96                                         Mean   : 55.16  
##  3rd Qu.: 80.00                                         3rd Qu.: 50.00  
##  Max.   :255.00                                         Max.   :100.00  
##                                                         NA's   :98      
##  pokedex_number   sp_attack        sp_defence         speed       
##  Min.   :  1    Min.   : 10.00   Min.   : 20.00   Min.   :  5.00  
##  1st Qu.:201    1st Qu.: 45.00   1st Qu.: 50.00   1st Qu.: 45.00  
##  Median :401    Median : 65.00   Median : 66.00   Median : 65.00  
##  Mean   :401    Mean   : 71.31   Mean   : 70.91   Mean   : 66.33  
##  3rd Qu.:601    3rd Qu.: 91.00   3rd Qu.: 90.00   3rd Qu.: 85.00  
##  Max.   :801    Max.   :194.00   Max.   :230.00   Max.   :180.00  
##                                                                   
##     type1              type2             weight_kg        generation  
##  Length:801         Length:801         Min.   :  0.10   Min.   :1.00  
##  Class :character   Class :character   1st Qu.:  9.00   1st Qu.:2.00  
##  Mode  :character   Mode  :character   Median : 27.30   Median :4.00  
##                                        Mean   : 61.38   Mean   :3.69  
##                                        3rd Qu.: 64.80   3rd Qu.:5.00  
##                                        Max.   :999.90   Max.   :7.00  
##                                        NA's   :20                     
##   is_legendary         train        
##  Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000  
##  Mean   :0.08739   Mean   :0.09988  
##  3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000  
## 
nrow(pokemon)
## [1] 801
table(pokemon$train)
## 
##   0   1 
## 721  80
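
To put a number on the imbalance, the class counts can be turned into proportions. The sketch below uses a stand-in vector (train_flag, a hypothetical name) with the same counts as the table above, so it runs without the CSV file. With the real data you would pass pokemon$train instead.

```r
# Stand-in for pokemon$train: 721 zeros and 80 ones, matching the table above
train_flag <- c(rep(0, 721), rep(1, 80))

# Counts per class, then the share of each class
class_counts <- table(train_flag)
class_props <- prop.table(class_counts)

print(round(class_props, 3))
```

The minority class makes up roughly a tenth of the rows, which is why the balancing step later on is needed.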

1.1 Filter for only selected variables

Check the data.

names(pokemon)
##  [1] "abilities"         "against_bug"       "against_dark"     
##  [4] "against_dragon"    "against_electric"  "against_fairy"    
##  [7] "against_fight"     "against_fire"      "against_flying"   
## [10] "against_ghost"     "against_grass"     "against_ground"   
## [13] "against_ice"       "against_normal"    "against_poison"   
## [16] "against_psychic"   "against_rock"      "against_steel"    
## [19] "against_water"     "attack"            "base_egg_steps"   
## [22] "base_happiness"    "base_total"        "capture_rate"     
## [25] "classfication"     "defence"           "experience_growth"
## [28] "height_m"          "hp"                "japanese_name"    
## [31] "name"              "percentage_male"   "pokedex_number"   
## [34] "sp_attack"         "sp_defence"        "speed"            
## [37] "type1"             "type2"             "weight_kg"        
## [40] "generation"        "is_legendary"      "train"
pokemon_df <- pokemon[, c(34:37, 42)]
head(pokemon_df)
##   sp_attack sp_defence speed type1 train
## 1        65         65    45 grass     1
## 2        80         80    60 grass     1
## 3       122        120    80 grass     0
## 4        60         50    65  fire     0
## 5        80         65    80  fire     0
## 6       159        115   100  fire     0
nrow(pokemon_df)
## [1] 801
names(pokemon_df)
## [1] "sp_attack"  "sp_defence" "speed"      "type1"      "train"
table(pokemon_df$train)
## 
##   0   1 
## 721  80
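
Selecting columns by position, as above, breaks silently if the column order in the CSV ever changes. A possible alternative is to select by name. This is a minimal sketch on a made-up two-row data frame (toy and keep_cols are hypothetical names); only the column names are taken from the real data.

```r
# Made-up stand-in for the pokemon data frame; only the column names are real
toy <- data.frame(
  name       = c("Bulbasaur", "Charmander"),
  sp_attack  = c(65, 60),
  sp_defence = c(65, 50),
  speed      = c(45, 65),
  type1      = c("grass", "fire"),
  train      = c(1, 0)
)

# Keep the predictors and the outcome by name rather than by index
keep_cols <- c("sp_attack", "sp_defence", "speed", "type1", "train")
toy_df <- toy[, keep_cols]

print(names(toy_df))
```

On the real data the equivalent call would be pokemon[, keep_cols], which should give the same five columns as c(34:37, 42).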

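Convert the predictors sp_attack, sp_defence, and speed to numeric and the outcome train to a factor, as required by the ROSE package used later. (type1 is converted to a factor in the balancing step.)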

str(pokemon_df)
## 'data.frame':    801 obs. of  5 variables:
##  $ sp_attack : int  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defence: int  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed     : int  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1     : chr  "grass" "grass" "grass" "fire" ...
##  $ train     : int  1 1 0 0 0 0 0 0 0 0 ...
pokemon_df[, c(1:3)] <- lapply(pokemon_df[, c(1:3)], as.numeric)

pokemon_df$train <- as.factor(pokemon_df$train)
str(pokemon_df)
## 'data.frame':    801 obs. of  5 variables:
##  $ sp_attack : num  65 80 122 60 80 159 50 65 135 20 ...
##  $ sp_defence: num  65 80 120 50 65 115 64 80 115 20 ...
##  $ speed     : num  45 60 80 65 80 100 43 58 78 45 ...
##  $ type1     : chr  "grass" "grass" "grass" "fire" ...
##  $ train     : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
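
Here type1 stays a character column and is only converted to a factor in the training set later on. One option, shown as a sketch and not as the approach this spell takes, is to convert it before splitting, so that the training and validation sets carry the same factor levels. The data frame and the names toy, train_part, and valid_part are made up for illustration.

```r
# Made-up stand-in data; column names mirror pokemon_df
toy <- data.frame(
  type1 = c("grass", "fire", "water", "fire"),
  train = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Convert both columns before any splitting
toy$type1 <- as.factor(toy$type1)
toy$train <- as.factor(toy$train)

# Row subsets of a factor column keep the full set of levels
train_part <- toy[c(1, 2), ]
valid_part <- toy[c(3, 4), ]

print(identical(levels(train_part$type1), levels(valid_part$type1)))
```

Sharing levels matters later when a model fitted on one subset is used to predict on the other.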

2. Training validation split

Our favourite seed :-) Or use another.

set.seed(666)
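
The seed only matters for reproducibility: resetting it before a random draw gives the same draw again. A small sketch with made-up variable names:

```r
# Same seed, same draw
set.seed(666)
first_draw <- sample(1:10, 3)

set.seed(666)
second_draw <- sample(1:10, 3)

print(identical(first_draw, second_draw))
```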

Create the indices for the split. This samples 60% of the row indices for the training set; the remaining rows form the validation set.

# 0.6 * 801 = 480.6; floor() makes the truncation to 480 rows explicit
train_index <- sample(seq_len(nrow(pokemon_df)), floor(0.6 * nrow(pokemon_df)))
valid_index <- setdiff(seq_len(nrow(pokemon_df)), train_index)

Using the indices, create the training and validation sets. This works the same way as subsetting a data frame by row.

train_df <- pokemon_df[train_index, ]
valid_df <- pokemon_df[valid_index, ]

It is a good habit to check after splitting.

The outcome variable in the training set is still highly unbalanced: 43 of the 480 training rows (about 9%) have train = 1.

nrow(train_df)
## [1] 480
nrow(valid_df)
## [1] 321
table(train_df$train)
## 
##   0   1 
## 437  43
names(train_df)
## [1] "sp_attack"  "sp_defence" "speed"      "type1"      "train"
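
The split above is a simple random one, so the share of train = 1 rows in each subset is left to chance. A possible alternative, not the method used in this spell, is a stratified split that samples 60% within each class. The sketch below uses a made-up data frame (toy) with the same class counts as pokemon_df; all names in it are hypothetical.

```r
set.seed(666)

# Stand-in outcome with the same class counts as pokemon_df$train (721 vs 80)
toy <- data.frame(id = 1:801,
                  train = factor(c(rep(0, 721), rep(1, 80))))

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$train)

# Sample 60% of the indices within each class, then combine them
strat_train_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), floor(0.6 * length(idx)))]
}), use.names = FALSE)
strat_valid_index <- setdiff(seq_len(nrow(toy)), strat_train_index)

print(table(toy$train[strat_train_index]))
```

Both subsets then keep roughly the original class proportions, which is mostly useful for the validation set, since the training set is rebalanced in the next step anyway.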

3. Weighted sampling

Now, balance the training set.

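This can be done using the ROSE (Random Over-Sampling Examples) package, which generates a synthetic, more balanced sample from the original data.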

library(ROSE)
## Warning: package 'ROSE' was built under R version 4.0.5
## Loaded ROSE 0.0-4

Create the balanced training data frame.

We’ll use train as the outcome variable, with sp_attack, sp_defence, speed, and type1 as predictors. type1 is converted to a factor first, since the package expects categorical predictors to be factors.

After resampling, the outcome classes are roughly balanced (231 vs 249 rows, as the final table shows).

In the classification models, use this balanced training set instead of the original.

names(train_df)
## [1] "sp_attack"  "sp_defence" "speed"      "type1"      "train"
str(train_df)
## 'data.frame':    480 obs. of  5 variables:
##  $ sp_attack : num  55 90 95 55 40 90 74 97 35 85 ...
##  $ sp_defence: num  65 72 60 80 60 70 75 81 45 95 ...
##  $ speed     : num  45 108 55 105 42 73 64 68 90 60 ...
##  $ type1     : chr  "psychic" "steel" "ghost" "bug" ...
##  $ train     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
train_df$type1 <- as.factor(train_df$type1)
train_df_balanced <- ROSE(train ~ sp_attack + sp_defence + speed + type1, 
                          data = train_df, seed = 666)$data

table(train_df_balanced$train)
## 
##   0   1 
## 231 249
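
For comparison, ROSE returns synthetic rows, which is why the class counts above differ from the original 437 and 43. A simpler technique, plain random oversampling, just repeats existing minority rows. The base R sketch below illustrates that simpler technique only; it is not what ROSE() does internally. The data frame and its speed values are made up, and all names are hypothetical.

```r
set.seed(666)

# Stand-in training set with the same class counts as train_df (437 vs 43)
toy_train <- data.frame(
  speed = round(runif(480, 5, 180)),
  train = factor(c(rep(0, 437), rep(1, 43)))
)

minority_rows <- which(toy_train$train == "1")
majority_rows <- which(toy_train$train == "0")

# Draw minority rows with replacement until both classes are the same size
resampled_minority <- minority_rows[
  sample.int(length(minority_rows), length(majority_rows), replace = TRUE)
]
toy_balanced <- toy_train[c(majority_rows, resampled_minority), ]

print(table(toy_balanced$train))
```

Because this only duplicates rows, it adds no new information about the minority class; generating synthetic examples, as ROSE does, is one way around that.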

To see how the modelling is done, check the more advanced spell.