Every rose has its thorn
Just like every night has its dawn
And every cowboy sings a sad, sad song
Every rose has its thorn
pokemon <- read.csv("pokemon_2.csv", header = TRUE)
Explore the pokemon (i.e. data) :-)
Notice that the train variable (i.e. whether to train the pokemon or not) is highly unbalanced: only 80 of the 801 rows (about 10%) have train = 1.
head(pokemon)
## abilities against_bug against_dark against_dragon
## 1 ['Overgrow', 'Chlorophyll'] 1.00 1 1
## 2 ['Overgrow', 'Chlorophyll'] 1.00 1 1
## 3 ['Overgrow', 'Chlorophyll'] 1.00 1 1
## 4 ['Blaze', 'Solar Power'] 0.50 1 1
## 5 ['Blaze', 'Solar Power'] 0.50 1 1
## 6 ['Blaze', 'Solar Power'] 0.25 1 1
## against_electric against_fairy against_fight against_fire against_flying
## 1 0.5 0.5 0.5 2.0 2
## 2 0.5 0.5 0.5 2.0 2
## 3 0.5 0.5 0.5 2.0 2
## 4 1.0 0.5 1.0 0.5 1
## 5 1.0 0.5 1.0 0.5 1
## 6 2.0 0.5 0.5 0.5 1
## against_ghost against_grass against_ground against_ice against_normal
## 1 1 0.25 1 2.0 1
## 2 1 0.25 1 2.0 1
## 3 1 0.25 1 2.0 1
## 4 1 0.50 2 0.5 1
## 5 1 0.50 2 0.5 1
## 6 1 0.25 0 1.0 1
## against_poison against_psychic against_rock against_steel against_water
## 1 1 2 1 1.0 0.5
## 2 1 2 1 1.0 0.5
## 3 1 2 1 1.0 0.5
## 4 1 1 2 0.5 2.0
## 5 1 1 2 0.5 2.0
## 6 1 1 4 0.5 2.0
## attack base_egg_steps base_happiness base_total capture_rate classfication
## 1 49 5120 70 318 45 Seed Pokémon
## 2 62 5120 70 405 45 Seed Pokémon
## 3 100 5120 70 625 45 Seed Pokémon
## 4 52 5120 70 309 45 Lizard Pokémon
## 5 64 5120 70 405 45 Flame Pokémon
## 6 104 5120 70 634 45 Flame Pokémon
## defence experience_growth height_m hp japanese_name name
## 1 49 1059860 0.7 45 FushigidaneÜ\200´Ü‰‡Ü‰¬Ü\200óÜ\200\215 Bulbasaur
## 2 63 1059860 1.0 60 FushigisouÜ\200´Ü‰‡Ü‰¬Ü‰_܉_ Ivysaur
## 3 123 1059860 2.0 80 FushigibanaÜ\200´Ü‰‡Ü‰¬Ü\200\220Ü\200_ Venusaur
## 4 43 1059860 0.6 39 HitokageÜ\200ÍÜ\200šÜ‰‚܉_ Charmander
## 5 58 1059860 1.1 58 LizardoÜ\200È܉_Ü\200_Ü\200Š Charmeleon
## 6 78 1059860 1.7 78 LizardonÜ\200È܉_Ü\200_Ü\200ŠÜ\200_ Charizard
## percentage_male pokedex_number sp_attack sp_defence speed type1 type2
## 1 88.1 1 65 65 45 grass poison
## 2 88.1 2 80 80 60 grass poison
## 3 88.1 3 122 120 80 grass poison
## 4 88.1 4 60 50 65 fire
## 5 88.1 5 80 65 80 fire
## 6 88.1 6 159 115 100 fire flying
## weight_kg generation is_legendary train
## 1 6.9 1 0 1
## 2 13.0 1 0 1
## 3 100.0 1 0 0
## 4 8.5 1 0 0
## 5 19.0 1 0 0
## 6 90.5 1 0 0
str(pokemon)
## 'data.frame': 801 obs. of 42 variables:
## $ abilities : chr "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Overgrow', 'Chlorophyll']" "['Blaze', 'Solar Power']" ...
## $ against_bug : num 1 1 1 0.5 0.5 0.25 1 1 1 1 ...
## $ against_dark : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_dragon : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_electric : num 0.5 0.5 0.5 1 1 2 2 2 2 1 ...
## $ against_fairy : num 0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 ...
## $ against_fight : num 0.5 0.5 0.5 1 1 0.5 1 1 1 0.5 ...
## $ against_fire : num 2 2 2 0.5 0.5 0.5 0.5 0.5 0.5 2 ...
## $ against_flying : num 2 2 2 1 1 1 1 1 1 2 ...
## $ against_ghost : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_grass : num 0.25 0.25 0.25 0.5 0.5 0.25 2 2 2 0.5 ...
## $ against_ground : num 1 1 1 2 2 0 1 1 1 0.5 ...
## $ against_ice : num 2 2 2 0.5 0.5 1 0.5 0.5 0.5 1 ...
## $ against_normal : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_poison : num 1 1 1 1 1 1 1 1 1 1 ...
## $ against_psychic : num 2 2 2 1 1 1 1 1 1 1 ...
## $ against_rock : num 1 1 1 2 2 4 1 1 1 2 ...
## $ against_steel : num 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 1 ...
## $ against_water : num 0.5 0.5 0.5 2 2 2 0.5 0.5 0.5 1 ...
## $ attack : int 49 62 100 52 64 104 48 63 103 30 ...
## $ base_egg_steps : int 5120 5120 5120 5120 5120 5120 5120 5120 5120 3840 ...
## $ base_happiness : int 70 70 70 70 70 70 70 70 70 70 ...
## $ base_total : int 318 405 625 309 405 634 314 405 630 195 ...
## $ capture_rate : chr "45" "45" "45" "45" ...
## $ classfication : chr "Seed Pokémon" "Seed Pokémon" "Seed Pokémon" "Lizard Pokémon" ...
## $ defence : int 49 63 123 43 58 78 65 80 120 35 ...
## $ experience_growth: int 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1059860 1000000 ...
## $ height_m : num 0.7 1 2 0.6 1.1 1.7 0.5 1 1.6 0.3 ...
## $ hp : int 45 60 80 39 58 78 44 59 79 45 ...
## $ japanese_name : chr "FushigidaneÜ\200´Ü‰‡Ü‰¬Ü\200óÜ\200\215" "FushigisouÜ\200´Ü‰‡Ü‰¬Ü‰_܉_" "FushigibanaÜ\200´Ü‰‡Ü‰¬Ü\200\220Ü\200_" "HitokageÜ\200ÍÜ\200šÜ‰‚܉_" ...
## $ name : chr "Bulbasaur" "Ivysaur" "Venusaur" "Charmander" ...
## $ percentage_male : num 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 88.1 50 ...
## $ pokedex_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sp_attack : int 65 80 122 60 80 159 50 65 135 20 ...
## $ sp_defence : int 65 80 120 50 65 115 64 80 115 20 ...
## $ speed : int 45 60 80 65 80 100 43 58 78 45 ...
## $ type1 : chr "grass" "grass" "grass" "fire" ...
## $ type2 : chr "poison" "poison" "poison" "" ...
## $ weight_kg : num 6.9 13 100 8.5 19 90.5 9 22.5 85.5 2.9 ...
## $ generation : int 1 1 1 1 1 1 1 1 1 1 ...
## $ is_legendary : int 0 0 0 0 0 0 0 0 0 0 ...
## $ train : int 1 1 0 0 0 0 0 0 0 0 ...
summary(pokemon)
## abilities against_bug against_dark against_dragon
## Length:801 Min. :0.2500 Min. :0.250 Min. :0.0000
## Class :character 1st Qu.:0.5000 1st Qu.:1.000 1st Qu.:1.0000
## Mode :character Median :1.0000 Median :1.000 Median :1.0000
## Mean :0.9963 Mean :1.057 Mean :0.9688
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :4.0000 Max. :4.000 Max. :2.0000
##
## against_electric against_fairy against_fight against_fire
## Min. :0.000 Min. :0.250 Min. :0.000 Min. :0.250
## 1st Qu.:0.500 1st Qu.:1.000 1st Qu.:0.500 1st Qu.:0.500
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.074 Mean :1.069 Mean :1.066 Mean :1.135
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
##
## against_flying against_ghost against_grass against_ground
## Min. :0.250 Min. :0.000 Min. :0.250 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.500 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.193 Mean :0.985 Mean :1.034 Mean :1.098
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
##
## against_ice against_normal against_poison against_psychic
## Min. :0.250 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.500 1st Qu.:1.000 1st Qu.:0.5000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.0000 Median :1.000
## Mean :1.208 Mean :0.887 Mean :0.9753 Mean :1.005
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :4.000 Max. :1.000 Max. :4.0000 Max. :4.000
##
## against_rock against_steel against_water attack
## Min. :0.25 Min. :0.2500 Min. :0.250 Min. : 5.00
## 1st Qu.:1.00 1st Qu.:0.5000 1st Qu.:0.500 1st Qu.: 55.00
## Median :1.00 Median :1.0000 Median :1.000 Median : 75.00
## Mean :1.25 Mean :0.9835 Mean :1.058 Mean : 77.86
## 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:100.00
## Max. :4.00 Max. :4.0000 Max. :4.000 Max. :185.00
##
## base_egg_steps base_happiness base_total capture_rate
## Min. : 1280 Min. : 0.00 Min. :180.0 Length:801
## 1st Qu.: 5120 1st Qu.: 70.00 1st Qu.:320.0 Class :character
## Median : 5120 Median : 70.00 Median :435.0 Mode :character
## Mean : 7191 Mean : 65.36 Mean :428.4
## 3rd Qu.: 6400 3rd Qu.: 70.00 3rd Qu.:505.0
## Max. :30720 Max. :140.00 Max. :780.0
##
## classfication defence experience_growth height_m
## Length:801 Min. : 5.00 Min. : 600000 Min. : 0.100
## Class :character 1st Qu.: 50.00 1st Qu.:1000000 1st Qu.: 0.600
## Mode :character Median : 70.00 Median :1000000 Median : 1.000
## Mean : 73.01 Mean :1054996 Mean : 1.164
## 3rd Qu.: 90.00 3rd Qu.:1059860 3rd Qu.: 1.500
## Max. :230.00 Max. :1640000 Max. :14.500
## NA's :20
## hp japanese_name name percentage_male
## Min. : 1.00 Length:801 Length:801 Min. : 0.00
## 1st Qu.: 50.00 Class :character Class :character 1st Qu.: 50.00
## Median : 65.00 Mode :character Mode :character Median : 50.00
## Mean : 68.96 Mean : 55.16
## 3rd Qu.: 80.00 3rd Qu.: 50.00
## Max. :255.00 Max. :100.00
## NA's :98
## pokedex_number sp_attack sp_defence speed
## Min. : 1 Min. : 10.00 Min. : 20.00 Min. : 5.00
## 1st Qu.:201 1st Qu.: 45.00 1st Qu.: 50.00 1st Qu.: 45.00
## Median :401 Median : 65.00 Median : 66.00 Median : 65.00
## Mean :401 Mean : 71.31 Mean : 70.91 Mean : 66.33
## 3rd Qu.:601 3rd Qu.: 91.00 3rd Qu.: 90.00 3rd Qu.: 85.00
## Max. :801 Max. :194.00 Max. :230.00 Max. :180.00
##
## type1 type2 weight_kg generation
## Length:801 Length:801 Min. : 0.10 Min. :1.00
## Class :character Class :character 1st Qu.: 9.00 1st Qu.:2.00
## Mode :character Mode :character Median : 27.30 Median :4.00
## Mean : 61.38 Mean :3.69
## 3rd Qu.: 64.80 3rd Qu.:5.00
## Max. :999.90 Max. :7.00
## NA's :20
## is_legendary train
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.08739 Mean :0.09988
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
##
nrow(pokemon)
## [1] 801
table(pokemon$train)
##
## 0 1
## 721 80
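The counts above can also be read as proportions. This is only a minimal sketch: it rebuilds the outcome vector from the two counts instead of reading the CSV, and the name train_toy is made up for illustration.

```r
# Hypothetical stand-in for pokemon$train, rebuilt from the counts shown above
train_toy <- c(rep(0, 721), rep(1, 80))

# Class counts, then the share of each class
counts <- table(train_toy)
props <- prop.table(counts)

# Roughly 90% zeros and 10% ones
round(props, 3)
```

With the real data, the same idea would be prop.table(table(pokemon$train)).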
Check the column names, then keep only the predictors (columns 34 to 37) and the outcome (column 42).
names(pokemon)
## [1] "abilities" "against_bug" "against_dark"
## [4] "against_dragon" "against_electric" "against_fairy"
## [7] "against_fight" "against_fire" "against_flying"
## [10] "against_ghost" "against_grass" "against_ground"
## [13] "against_ice" "against_normal" "against_poison"
## [16] "against_psychic" "against_rock" "against_steel"
## [19] "against_water" "attack" "base_egg_steps"
## [22] "base_happiness" "base_total" "capture_rate"
## [25] "classfication" "defence" "experience_growth"
## [28] "height_m" "hp" "japanese_name"
## [31] "name" "percentage_male" "pokedex_number"
## [34] "sp_attack" "sp_defence" "speed"
## [37] "type1" "type2" "weight_kg"
## [40] "generation" "is_legendary" "train"
pokemon_df <- pokemon[, c(34:37, 42)]
head(pokemon_df)
## sp_attack sp_defence speed type1 train
## 1 65 65 45 grass 1
## 2 80 80 60 grass 1
## 3 122 120 80 grass 0
## 4 60 50 65 fire 0
## 5 80 65 80 fire 0
## 6 159 115 100 fire 0
nrow(pokemon_df)
## [1] 801
names(pokemon_df)
## [1] "sp_attack" "sp_defence" "speed" "type1" "train"
table(pokemon_df$train)
##
## 0 1
## 721 80
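Selecting columns by position works, but it silently picks the wrong columns if the file layout ever changes. A possible alternative is to select by name. The sketch below uses a tiny toy data frame (pokemon_toy, a made-up name holding the first three rows shown earlier) so it runs without the CSV.

```r
# Toy stand-in for the full pokemon data frame (first three rows, a few columns)
pokemon_toy <- data.frame(
  name       = c("Bulbasaur", "Ivysaur", "Venusaur"),
  sp_attack  = c(65, 80, 122),
  sp_defence = c(65, 80, 120),
  speed      = c(45, 60, 80),
  type1      = c("grass", "grass", "grass"),
  train      = c(1, 1, 0)
)

# Columns to keep: the four predictors plus the outcome
keep <- c("sp_attack", "sp_defence", "speed", "type1", "train")

# Fail early if any expected column is missing
stopifnot(all(keep %in% names(pokemon_toy)))

pokemon_toy_df <- pokemon_toy[, keep]
names(pokemon_toy_df)
```

On the real data, pokemon[, keep] should give the same five columns as pokemon[, c(34:37, 42)].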
Convert to numeric and factor as required by the package.
str(pokemon_df)
## 'data.frame': 801 obs. of 5 variables:
## $ sp_attack : int 65 80 122 60 80 159 50 65 135 20 ...
## $ sp_defence: int 65 80 120 50 65 115 64 80 115 20 ...
## $ speed : int 45 60 80 65 80 100 43 58 78 45 ...
## $ type1 : chr "grass" "grass" "grass" "fire" ...
## $ train : int 1 1 0 0 0 0 0 0 0 0 ...
pokemon_df[, c(1:3)] <- lapply(pokemon_df[, c(1:3)], as.numeric)
pokemon_df$train <- as.factor(pokemon_df$train)
str(pokemon_df)
## 'data.frame': 801 obs. of 5 variables:
## $ sp_attack : num 65 80 122 60 80 159 50 65 135 20 ...
## $ sp_defence: num 65 80 120 50 65 115 64 80 115 20 ...
## $ speed : num 45 60 80 65 80 100 43 58 78 45 ...
## $ type1 : chr "grass" "grass" "grass" "fire" ...
## $ train : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
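Note that type1 is still a character column at this point. One possible variation, sketched here on a toy data frame with made-up names (toy_df, toy_train, toy_valid), is to convert it to a factor before splitting, so that the training and validation sets carry the same set of levels even if a type happens to be missing from one of them.

```r
# Toy data: six rows, two types
toy_df <- data.frame(
  speed = c(45, 60, 80, 65, 80, 100),
  type1 = c("grass", "grass", "grass", "fire", "fire", "fire"),
  stringsAsFactors = FALSE
)

# Convert before splitting
toy_df$type1 <- as.factor(toy_df$type1)

# A deliberately lopsided split: the first subset contains only grass
toy_train <- toy_df[1:3, ]
toy_valid <- toy_df[4:6, ]

# Both subsets still know about both levels
levels(toy_train$type1)
levels(toy_valid$type1)
```

Row subsetting keeps unused factor levels, which is what makes this work.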
Our favourite seed :-) Or use another.
set.seed(666)
Training-validation split.
Create the indices for the split. This samples the row indices to split the data into training and validation.
train_index <- sample(1:nrow(pokemon_df), 0.6 * nrow(pokemon_df))
valid_index <- setdiff(1:nrow(pokemon_df), train_index)
Using the indices, create the training and validation sets. This is similar in principle to splitting a data frame by row.
train_df <- pokemon_df[train_index, ]
valid_df <- pokemon_df[valid_index, ]
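To see what sample() and setdiff() are doing, here is the same split on just 10 row numbers. This is a sketch with made-up names (toy_train_index, toy_valid_index); it uses seq_len() and floor() to make the rounding explicit, which is a stylistic choice and not something the original code requires.

```r
set.seed(666)

n <- 10  # pretend the data frame has 10 rows

# Pick 60% of the row numbers at random, without replacement
toy_train_index <- sample(seq_len(n), floor(0.6 * n))

# Everything not picked goes to validation
toy_valid_index <- setdiff(seq_len(n), toy_train_index)

length(toy_train_index)  # 6
length(toy_valid_index)  # 4

# The two sets never overlap and together cover every row
intersect(toy_train_index, toy_valid_index)
sort(c(toy_train_index, toy_valid_index))
```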
It is a good habit to check after splitting.
The outcome variable in the training set is still highly unbalanced.
nrow(train_df)
## [1] 480
nrow(valid_df)
## [1] 321
table(train_df$train)
##
## 0 1
## 437 43
names(train_df)
## [1] "sp_attack" "sp_defence" "speed" "type1" "train"
Now, balance the training set.
This can be done using the ROSE package.
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.0.5
## Loaded ROSE 0.0-4
Create the balanced training df.
We’ll use train as the outcome variable and our model is based on predictors sp_attack, sp_defence, speed, and type1. Since type1 is still a character column, convert it to a factor first.
After balancing, the two classes of the outcome variable are roughly equal in size (231 vs 249 below).
In the classification models, use this balanced training set instead of the original.
library(ROSE)
names(train_df)
## [1] "sp_attack" "sp_defence" "speed" "type1" "train"
str(train_df)
## 'data.frame': 480 obs. of 5 variables:
## $ sp_attack : num 55 90 95 55 40 90 74 97 35 85 ...
## $ sp_defence: num 65 72 60 80 60 70 75 81 45 95 ...
## $ speed : num 45 108 55 105 42 73 64 68 90 60 ...
## $ type1 : chr "psychic" "steel" "ghost" "bug" ...
## $ train : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
train_df$type1 <- as.factor(train_df$type1)
train_df_balanced <- ROSE(train ~ sp_attack + sp_defence + speed + type1,
data = train_df, seed = 666)$data
table(train_df_balanced$train)
##
## 0 1
## 231 249
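ROSE balances the classes by generating new synthetic rows, not by copying existing ones. To get a feel for what "balancing" means, here is a much simpler and different technique in base R: plain random oversampling, which just repeats minority rows. It is only an illustrative sketch on made-up toy data (toy, toy_balanced), not what ROSE() does internally.

```r
set.seed(666)

# Toy unbalanced data: 18 rows of class 0 and 2 rows of class 1 (made-up values)
toy <- data.frame(
  speed = c(seq(40, 125, by = 5), 90, 110),
  train = factor(c(rep(0, 18), 1, 1))
)
table(toy$train)

# Row numbers of each class
minority_rows <- which(toy$train == "1")
majority_rows <- which(toy$train == "0")

# Draw minority rows with replacement until both classes have the same count
extra_rows <- sample(minority_rows, length(majority_rows), replace = TRUE)

# Keep every majority row and add the resampled minority rows
toy_balanced <- toy[c(majority_rows, extra_rows), ]
table(toy_balanced$train)
```

Here every class 1 row in toy_balanced is an exact copy of one of the two originals, whereas ROSE draws new points around the existing ones, which is why its class counts (231 and 249) are close but not identical.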
To see how the modelling is done, check the more advanced spell.