podracer <- read.csv("podracer.csv", header = TRUE)
head(podracer, 10)
## Id Model Price Age_08_04 Mfg_Month
## 1 1 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13500 23 10
## 2 2 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13750 23 10
## 3 3 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13950 24 9
## 4 4 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 14950 26 7
## 5 5 Podracer 2.0 D4D HATCHB SOL 2/3-Doors 13750 30 3
## 6 6 Podracer 2.0 D4D HATCHB SOL 2/3-Doors 12950 32 1
## 7 7 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 16900 27 6
## 8 8 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 18600 30 3
## 9 9 Podracer 1800 T SPORT VVT I 2/3-Doors 21500 27 6
## 10 10 Podracer 1.9 D HATCHB TERRA 2/3-Doors 12950 23 10
## Mfg_Year KM Fuel_Type HP Met_Color Color Automatic CC Doors Cylinders
## 1 2002 46986 Diesel 90 1 Blue 0 2000 3 4
## 2 2002 72937 Diesel 90 1 Silver 0 2000 3 4
## 3 2002 41711 Diesel 90 1 Blue 0 2000 3 4
## 4 2002 48000 Diesel 90 0 Black 0 2000 3 4
## 5 2002 38500 Diesel 90 0 Black 0 2000 3 4
## 6 2002 61000 Diesel 90 0 White 0 2000 3 4
## 7 2002 94612 Diesel 90 1 Grey 0 2000 3 4
## 8 2002 75889 Diesel 90 1 Grey 0 2000 3 4
## 9 2002 19700 Petrol 192 0 Red 0 1800 3 4
## 10 2002 71138 Diesel 69 0 Blue 0 1900 3 4
## Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
## 1 5 210 1165 0 1 3
## 2 5 210 1165 0 1 3
## 3 5 210 1165 1 1 3
## 4 5 210 1165 1 1 3
## 5 5 210 1170 1 1 3
## 6 5 210 1170 0 1 3
## 7 5 210 1245 0 1 3
## 8 5 210 1245 1 1 3
## 9 5 100 1185 0 1 3
## 10 5 185 1105 0 1 3
## ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player
## 1 1 1 1 0 0 1 0
## 2 1 1 1 1 0 1 1
## 3 1 1 1 0 0 1 0
## 4 1 1 1 0 0 1 0
## 5 1 1 1 1 0 1 0
## 6 1 1 1 1 0 1 0
## 7 1 1 1 1 0 1 0
## 8 1 1 1 1 0 1 1
## 9 1 1 0 1 0 0 0
## 10 1 1 1 1 0 1 0
## Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model
## 1 1 1 1 0 0 0
## 2 1 0 1 0 0 0
## 3 0 0 1 0 0 0
## 4 0 0 1 0 0 0
## 5 1 1 1 0 1 0
## 6 1 1 1 0 1 0
## 7 1 1 1 0 0 1
## 8 1 1 1 0 0 0
## 9 1 1 1 1 0 0
## 10 0 0 1 0 0 0
## Backseat_Divider Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## 7 1 0 0 0 0
## 8 1 0 0 0 0
## 9 0 1 1 0 0
## 10 1 0 0 0 0
nrow(podracer)
## [1] 1436
names(podracer)
## [1] "Id" "Model" "Price"
## [4] "Age_08_04" "Mfg_Month" "Mfg_Year"
## [7] "KM" "Fuel_Type" "HP"
## [10] "Met_Color" "Color" "Automatic"
## [13] "CC" "Doors" "Cylinders"
## [16] "Gears" "Quarterly_Tax" "Weight"
## [19] "Mfr_Guarantee" "BOVAG_Guarantee" "Guarantee_Period"
## [22] "ABS" "Airbag_1" "Airbag_2"
## [25] "Airco" "Automatic_airco" "Boardcomputer"
## [28] "CD_Player" "Central_Lock" "Powered_Windows"
## [31] "Power_Steering" "Radio" "Mistlamps"
## [34] "Sport_Model" "Backseat_Divider" "Metallic_Rim"
## [37] "Radio_cassette" "Parking_Assistant" "Tow_Bar"
str(podracer)
## 'data.frame': 1436 obs. of 39 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Model : Factor w/ 319 levels "Podracer","Podracer ! 1.6-16v vvt-i sol airco sedan 4/5-Doors",..: 276 276 276 276 275 275 269 269 257 246 ...
## $ Price : int 13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
## $ Age_08_04 : int 23 23 24 26 30 32 27 30 27 23 ...
## $ Mfg_Month : int 10 10 9 7 3 1 6 3 6 10 ...
## $ Mfg_Year : int 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
## $ KM : int 46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
## $ Fuel_Type : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
## $ HP : int 90 90 90 90 90 90 90 90 192 69 ...
## $ Met_Color : int 1 1 1 0 0 0 1 1 0 0 ...
## $ Color : Factor w/ 10 levels "Beige","Black",..: 3 7 3 2 2 9 5 5 6 3 ...
## $ Automatic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CC : int 2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
## $ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Cylinders : int 4 4 4 4 4 4 4 4 4 4 ...
## $ Gears : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Quarterly_Tax : int 210 210 210 210 210 210 210 210 100 185 ...
## $ Weight : int 1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
## $ Mfr_Guarantee : int 0 0 1 1 1 0 0 1 0 0 ...
## $ BOVAG_Guarantee : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Guarantee_Period : int 3 3 3 3 3 3 3 3 3 3 ...
## $ ABS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Airbag_1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Airbag_2 : int 1 1 1 1 1 1 1 1 0 1 ...
## $ Airco : int 0 1 0 0 1 1 1 1 1 1 ...
## $ Automatic_airco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Boardcomputer : int 1 1 1 1 1 1 1 1 0 1 ...
## $ CD_Player : int 0 1 0 0 0 0 0 1 0 0 ...
## $ Central_Lock : int 1 1 0 0 1 1 1 1 1 0 ...
## $ Powered_Windows : int 1 0 0 0 1 1 1 1 1 0 ...
## $ Power_Steering : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Radio : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Mistlamps : int 0 0 0 0 1 1 0 0 0 0 ...
## $ Sport_Model : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Backseat_Divider : int 1 1 1 1 1 1 1 1 0 1 ...
## $ Metallic_Rim : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Radio_cassette : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Parking_Assistant: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Tow_Bar : int 0 0 0 0 0 0 0 0 0 0 ...
podracer2 <- podracer[,c(4, 7:9, 12, 14, 17, 19, 21, 25:26, 28, 30, 34, 39, 3)]
head(podracer2)
## Age_08_04 KM Fuel_Type HP Automatic Doors Quarterly_Tax Mfr_Guarantee
## 1 23 46986 Diesel 90 0 3 210 0
## 2 23 72937 Diesel 90 0 3 210 0
## 3 24 41711 Diesel 90 0 3 210 1
## 4 26 48000 Diesel 90 0 3 210 1
## 5 30 38500 Diesel 90 0 3 210 1
## 6 32 61000 Diesel 90 0 3 210 0
## Guarantee_Period Airco Automatic_airco CD_Player Powered_Windows Sport_Model
## 1 3 0 0 0 1 0
## 2 3 1 0 1 0 0
## 3 3 0 0 0 0 0
## 4 3 0 0 0 0 0
## 5 3 1 0 0 1 0
## 6 3 1 0 0 1 0
## Tow_Bar Price
## 1 0 13500
## 2 0 13750
## 3 0 13950
## 4 0 14950
## 5 0 13750
## 6 0 12950
nrow(podracer2)
## [1] 1436
names(podracer2)
## [1] "Age_08_04" "KM" "Fuel_Type" "HP"
## [5] "Automatic" "Doors" "Quarterly_Tax" "Mfr_Guarantee"
## [9] "Guarantee_Period" "Airco" "Automatic_airco" "CD_Player"
## [13] "Powered_Windows" "Sport_Model" "Tow_Bar" "Price"
str(podracer2)
## 'data.frame': 1436 obs. of 16 variables:
## $ Age_08_04 : int 23 23 24 26 30 32 27 30 27 23 ...
## $ KM : int 46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
## $ Fuel_Type : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
## $ HP : int 90 90 90 90 90 90 90 90 192 69 ...
## $ Automatic : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Doors : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Quarterly_Tax : int 210 210 210 210 210 210 210 210 100 185 ...
## $ Mfr_Guarantee : int 0 0 1 1 1 0 0 1 0 0 ...
## $ Guarantee_Period: int 3 3 3 3 3 3 3 3 3 3 ...
## $ Airco : int 0 1 0 0 1 1 1 1 1 1 ...
## $ Automatic_airco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CD_Player : int 0 1 0 0 0 0 0 1 0 0 ...
## $ Powered_Windows : int 1 0 0 0 1 1 1 1 1 0 ...
## $ Sport_Model : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Tow_Bar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Price : int 13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
table(podracer2$Fuel_Type)
##
## CNG Diesel Petrol
## 17 155 1264
podracer2$Fuel_Type_CNG <- ifelse(podracer2$Fuel_Type == "CNG", 1 , 0)
podracer2$Fuel_Type_Diesel <- ifelse(podracer2$Fuel_Type == "Diesel",
1, 0)
names(podracer2)
## [1] "Age_08_04" "KM" "Fuel_Type" "HP"
## [5] "Automatic" "Doors" "Quarterly_Tax" "Mfr_Guarantee"
## [9] "Guarantee_Period" "Airco" "Automatic_airco" "CD_Player"
## [13] "Powered_Windows" "Sport_Model" "Tow_Bar" "Price"
## [17] "Fuel_Type_CNG" "Fuel_Type_Diesel"
podracer3 <- podracer2[, -c(3)]
names(podracer3)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price" "Fuel_Type_CNG"
## [17] "Fuel_Type_Diesel"
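Three fuel types need only two indicator columns: a car with both indicators at 0 is Petrol, which acts as the baseline. As a minimal sketch, the same coding can be produced with base R's model.matrix(). The fuel vector below is made up for illustration and is not taken from podracer.csv.

```r
# Toy factor with the same three levels as Fuel_Type (values are illustrative)
fuel <- factor(c("Diesel", "Petrol", "CNG", "Petrol"),
               levels = c("CNG", "Diesel", "Petrol"))

# Make Petrol the reference level so that CNG and Diesel get indicator columns
fuel <- relevel(fuel, ref = "Petrol")

# model.matrix() adds an intercept column first; drop it to keep the indicators
dummies <- model.matrix(~ fuel)[, -1, drop = FALSE]
colnames(dummies) <- c("Fuel_Type_CNG", "Fuel_Type_Diesel")
dummies
```

Rows for Petrol cars carry 0 in both columns, which matches the ifelse() coding used above.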
Using our favourite seed :-)
set.seed(666)
train_index <- sample(1:nrow(podracer3), 0.6 * nrow(podracer3))
valid_index <- setdiff(1:nrow(podracer3), train_index)
train_df <- podracer3[train_index, ]
valid_df <- podracer3[valid_index, ]
nrow(train_df)
## [1] 861
nrow(valid_df)
## [1] 575
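A quick sanity check on a split like this is that the two index sets are disjoint and together cover every row. The sketch below repeats the split on row numbers only (n is set to the 1436 rows reported above) and uses floor() to make explicit the truncation of 0.6 * n to 861.

```r
set.seed(666)
n <- 1436                                  # number of rows reported by nrow()

# 60% of the rows for training, truncated to a whole number
n_train <- floor(0.6 * n)
train_index <- sample(seq_len(n), n_train)
valid_index <- setdiff(seq_len(n), train_index)

# The two sets should not overlap and should cover all rows between them
length(intersect(train_index, valid_index)) == 0
length(train_index) + length(valid_index) == n
```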
Normalise the data using the training set.
summary(podracer3[1:15])
## Age_08_04 KM HP Automatic
## Min. : 1.00 Min. : 1 Min. : 69.0 Min. :0.00000
## 1st Qu.:44.00 1st Qu.: 43000 1st Qu.: 90.0 1st Qu.:0.00000
## Median :61.00 Median : 63390 Median :110.0 Median :0.00000
## Mean :55.95 Mean : 68533 Mean :101.5 Mean :0.05571
## 3rd Qu.:70.00 3rd Qu.: 87021 3rd Qu.:110.0 3rd Qu.:0.00000
## Max. :80.00 Max. :243000 Max. :192.0 Max. :1.00000
## Doors Quarterly_Tax Mfr_Guarantee Guarantee_Period
## Min. :2.000 Min. : 19.00 Min. :0.0000 Min. : 3.000
## 1st Qu.:3.000 1st Qu.: 69.00 1st Qu.:0.0000 1st Qu.: 3.000
## Median :4.000 Median : 85.00 Median :0.0000 Median : 3.000
## Mean :4.033 Mean : 87.12 Mean :0.4095 Mean : 3.815
## 3rd Qu.:5.000 3rd Qu.: 85.00 3rd Qu.:1.0000 3rd Qu.: 3.000
## Max. :5.000 Max. :283.00 Max. :1.0000 Max. :36.000
## Airco Automatic_airco CD_Player Powered_Windows
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :0.00000 Median :0.0000 Median :1.000
## Mean :0.5084 Mean :0.05641 Mean :0.2187 Mean :0.562
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000
## Sport_Model Tow_Bar Price
## Min. :0.0000 Min. :0.0000 Min. : 4350
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 8450
## Median :0.0000 Median :0.0000 Median : 9900
## Mean :0.3001 Mean :0.2779 Mean :10731
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:11950
## Max. :1.0000 Max. :1.0000 Max. :32500
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Estimate the normalisation parameters (minimum and maximum of each column) from the training set.
train_df_norm <- preProcess(train_df[1:15], method = c("range"))
train_df_norm
## Created from 861 samples and 15 variables
##
## Pre-processing:
## - ignored (0)
## - re-scaling to [0, 1] (15)
Transform the training set using the fitted normalisation.
train_df_transform <- predict(train_df_norm, train_df[1:15])
summary(train_df_transform)
## Age_08_04 KM HP Automatic
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5263 1st Qu.:0.1850 1st Qu.:0.1707 1st Qu.:0.0000
## Median :0.7500 Median :0.2642 Median :0.3333 Median :0.0000
## Mean :0.6911 Mean :0.2885 Mean :0.2643 Mean :0.0511
## 3rd Qu.:0.8684 3rd Qu.:0.3668 3rd Qu.:0.3333 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Doors Quarterly_Tax Mfr_Guarantee Guarantee_Period
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.3333 1st Qu.:0.1894 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.6667 Median :0.2500 Median :0.0000 Median :0.00000
## Mean :0.6756 Mean :0.2623 Mean :0.3972 Mean :0.02678
## 3rd Qu.:1.0000 3rd Qu.:0.2500 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Airco Automatic_airco CD_Player Powered_Windows
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.000 Median :1.0000
## Mean :0.4925 Mean :0.04181 Mean :0.216 Mean :0.5563
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.000 Max. :1.0000
## Sport_Model Tow_Bar Price
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.1441
## Median :0.0000 Median :0.0000 Median :0.1922
## Mean :0.2904 Mean :0.2962 Mean :0.2201
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.2667
## Max. :1.0000 Max. :1.0000 Max. :1.0000
names(train_df_transform)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price"
Transform the validation set using the normalisation fitted on the training set.
valid_df_transform <- predict(train_df_norm, valid_df[1:15])
summary(valid_df_transform)
## Age_08_04 KM HP Automatic
## Min. :-0.03947 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.51316 1st Qu.:0.1630 1st Qu.:0.1382 1st Qu.:0.00000
## Median : 0.73684 Median :0.2549 Median :0.3333 Median :0.00000
## Mean : 0.67217 Mean :0.2723 Mean :0.2642 Mean :0.06261
## 3rd Qu.: 0.86842 3rd Qu.:0.3459 3rd Qu.:0.3333 3rd Qu.:0.00000
## Max. : 1.00000 Max. :0.8961 Max. :1.0000 Max. :1.00000
## Doors Quarterly_Tax Mfr_Guarantee Guarantee_Period
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.3333 1st Qu.:0.1894 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.6667 Median :0.2500 Median :0.0000 Median :0.00000
## Mean :0.6812 Mean :0.2517 Mean :0.4278 Mean :0.02161
## 3rd Qu.:1.0000 3rd Qu.:0.2500 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :0.8144 Max. :1.0000 Max. :1.00000
## Airco Automatic_airco CD_Player Powered_Windows
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :0.0000 Median :1.0000
## Mean :0.5322 Mean :0.07826 Mean :0.2226 Mean :0.5704
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Sport_Model Tow_Bar Price
## Min. :0.0000 Min. :0.0000 Min. :-0.001779
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.144128
## Median :0.0000 Median :0.0000 Median : 0.197509
## Mean :0.3148 Mean :0.2504 Mean : 0.233138
## 3rd Qu.:1.0000 3rd Qu.:0.5000 3rd Qu.: 0.268683
## Max. :1.0000 Max. :1.0000 Max. : 0.732740
names(valid_df_transform)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price"
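Range normalisation maps each column to (x - min) / (max - min), with min and max taken from the training set. A validation value below the training minimum then lands below 0, which is consistent with the Age_08_04 minimum of -0.03947 in the validation summary above. A minimal base R sketch of the same arithmetic, with made-up ages and a hypothetical helper scale01():

```r
# Ages in months, chosen for illustration (not rows from the data set)
train_age <- c(4, 44, 61, 80)
valid_age <- c(1, 43, 80)

# Minimum and maximum come from the training values only
lo <- min(train_age)
hi <- max(train_age)
scale01 <- function(x) (x - lo) / (hi - lo)

scaled_train <- scale01(train_age)   # spans exactly 0 to 1
scaled_valid <- scale01(valid_age)   # first value is (1 - 4) / 76, below 0

range(scaled_train)
scaled_valid
```

Values slightly outside [0, 1] in the validation set are expected with this approach and do not indicate a problem with the transformation.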
Create full training and validation sets with dummy variables.
train_df_2 <- cbind(train_df_transform, train_df[16:17])
names(train_df_2)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price" "Fuel_Type_CNG"
## [17] "Fuel_Type_Diesel"
str(train_df_2)
## 'data.frame': 861 obs. of 17 variables:
## $ Age_08_04 : num 0.724 0.763 0.842 0.921 0.789 ...
## $ KM : num 0.501 0.755 0.242 0.419 0.473 ...
## $ HP : num 0.1382 0.0244 0.3333 0.3333 0.3333 ...
## $ Automatic : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Doors : num 1 1 0.333 0.333 1 ...
## $ Quarterly_Tax : num 0.189 0.629 0.25 0.189 0.25 ...
## $ Mfr_Guarantee : num 0 0 1 0 1 1 0 0 0 0 ...
## $ Guarantee_Period: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Airco : num 0 1 1 1 1 0 1 0 1 0 ...
## $ Automatic_airco : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CD_Player : num 0 0 0 0 0 0 1 0 1 0 ...
## $ Powered_Windows : num 0 1 1 0 1 1 1 0 1 1 ...
## $ Sport_Model : num 1 0 0 1 0 0 0 0 1 0 ...
## $ Tow_Bar : num 0 0 1 0 0 1 1 0 0 0 ...
## $ Price : num 0.128 0.11 0.19 0.089 0.198 ...
## $ Fuel_Type_CNG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fuel_Type_Diesel: num 0 1 0 0 0 0 0 1 0 0 ...
valid_df_2 <- cbind(valid_df_transform, valid_df[16:17])
names(valid_df_2)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price" "Fuel_Type_CNG"
## [17] "Fuel_Type_Diesel"
str(valid_df_2)
## 'data.frame': 575 obs. of 17 variables:
## $ Age_08_04 : num 0.25 0.263 0.368 0.25 0.276 ...
## $ KM : num 0.3 0.172 0.251 0.293 0.129 ...
## $ HP : num 0.171 0.171 0.171 0 1 ...
## $ Automatic : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Doors : num 0.333 0.333 0.333 0.333 0.333 ...
## $ Quarterly_Tax : num 0.723 0.723 0.723 0.629 0.307 ...
## $ Mfr_Guarantee : num 0 1 0 0 1 1 1 0 0 0 ...
## $ Guarantee_Period: num 0 0 0 0 0.273 ...
## $ Airco : num 1 0 1 1 1 1 1 1 1 1 ...
## $ Automatic_airco : num 0 0 0 0 1 1 1 1 1 1 ...
## $ CD_Player : num 1 0 0 0 1 0 1 1 1 0 ...
## $ Powered_Windows : num 0 0 1 0 1 1 1 1 1 1 ...
## $ Sport_Model : num 0 0 0 0 0 1 1 0 0 1 ...
## $ Tow_Bar : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Price : num 0.333 0.34 0.304 0.304 0.589 ...
## $ Fuel_Type_CNG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fuel_Type_Diesel: num 1 1 1 1 0 0 0 0 0 0 ...
library(neuralnet)
names(train_df_2)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price" "Fuel_Type_CNG"
## [17] "Fuel_Type_Diesel"
train_df_2_nn1 <- neuralnet(Price ~ ., data = train_df_2,
                            linear.output = TRUE, hidden = 2)
train_df_2_nn1$weights
## [[1]]
## [[1]][[1]]
## [,1] [,2]
## [1,] 1.15901800 -1.08563353
## [2,] 1.82816650 -2.02230384
## [3,] 0.86336022 -1.21240231
## [4,] -0.58612529 0.94654369
## [5,] -0.03594813 0.17996578
## [6,] -0.06898549 0.05446269
## [7,] -0.91984206 1.88943957
## [8,] -0.26167277 -0.12012124
## [9,] -1.57051151 -0.43253884
## [10,] -0.35223374 -0.13300483
## [11,] -0.49500291 0.08573496
## [12,] -0.10767580 -0.02481818
## [13,] -0.69788941 -0.39766470
## [14,] -0.13204115 -0.08298837
## [15,] 0.01340312 -0.02912053
## [16,] 0.31749721 -0.79525038
## [17,] 0.57706746 -0.19400431
##
## [[1]][[2]]
## [,1]
## [1,] 0.4727631
## [2,] -0.4302306
## [3,] 0.9083219
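Each weight matrix has one row for the bias plus one row per incoming unit: 17 x 2 for the 16 inputs feeding two hidden nodes, then 3 x 1 for the two hidden nodes feeding the output. As a rough sketch of how such matrices turn inputs into a prediction, assuming the logistic activation that neuralnet uses by default and an untransformed output (linear.output = TRUE), here is a forward pass with only two inputs and small made-up weights. The names sigmoid() and forward() are hypothetical helpers.

```r
# Logistic activation applied in the hidden layer
sigmoid <- function(z) 1 / (1 + exp(-z))

# One hidden layer; a column of 1s is prepended at each step for the bias row
forward <- function(x, w_hidden, w_out) {
  h <- sigmoid(cbind(1, x) %*% w_hidden)   # hidden-node outputs
  cbind(1, h) %*% w_out                    # linear output, no activation
}

x <- matrix(c(0.2, 0.8), nrow = 1)         # one record with two inputs

# Rows: bias, input 1, input 2; columns: hidden node 1, hidden node 2
w_hidden <- matrix(c(0.1, 0.5, -0.3,
                     -0.2, 0.4, 0.6), nrow = 3)

# Rows: bias, hidden node 1, hidden node 2
w_out <- matrix(c(0.05, 1.0, -1.0), ncol = 1)

forward(x, w_hidden, w_out)
```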
plot(train_df_2_nn1, rep = "best")
names(train_df_2)
## [1] "Age_08_04" "KM" "HP" "Automatic"
## [5] "Doors" "Quarterly_Tax" "Mfr_Guarantee" "Guarantee_Period"
## [9] "Airco" "Automatic_airco" "CD_Player" "Powered_Windows"
## [13] "Sport_Model" "Tow_Bar" "Price" "Fuel_Type_CNG"
## [17] "Fuel_Type_Diesel"
train_pred <- compute(train_df_2_nn1,
train_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
"Doors", "Quarterly_Tax", "Mfr_Guarantee",
"Guarantee_Period", "Airco", "Automatic_airco",
"CD_Player", "Powered_Windows", "Sport_Model",
"Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale).
head(train_pred$net.result, 10)
## [,1]
## 638 0.13388562
## 608 0.13360861
## 907 0.19005452
## 1147 0.12464834
## 654 0.18233738
## 873 0.16353465
## 652 0.15771844
## 1074 0.07737048
## 131 0.42074060
## 1125 0.11793014
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
accuracy(unlist(train_pred), train_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf Inf
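The result of compute() is a list that holds the layer inputs and outputs (neurons) as well as the predictions (net.result), so unlist() on the whole object yields far more values than there are records. The reported ME of -0.7799 equals the mean normalised training Price minus 1 (0.2201 - 1), which is what one would get if the first values scored were a constant column of 1s. The sketch below mimics that situation with a hand-built list. Its layout is an assumption for illustration, not real neuralnet output, and rmse() is a hypothetical helper.

```r
# Observed values on the normalised scale (made up)
actual <- c(0.10, 0.20, 0.40, 0.30)

# Mock of a compute()-style result: inputs with a leading column of 1s,
# followed by a one-column matrix of predictions (assumed layout)
inputs <- cbind(1, c(0.5, 0.7, 0.2, 0.9))
mock_pred <- list(neurons = list(inputs),
                  net.result = matrix(c(0.12, 0.18, 0.35, 0.33), ncol = 1))

# Flattening the whole list puts the column of 1s first
flat <- unlist(mock_pred)
head(flat, length(actual))

rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

rmse(flat[seq_along(actual)], actual)           # scores the constant 1s
rmse(as.vector(mock_pred$net.result), actual)   # scores the predictions
```

With the real objects, the analogous call would pass as.vector(train_pred$net.result) as the first argument of accuracy().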
Predicted values on the training set, scaled back to the original units.
train_pred_original <- (train_pred$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
First 10 predicted values for Price (i.e. original scale).
head(train_pred_original, 10)
## [,1]
## 638 8118.880
## 608 8111.082
## 907 9700.035
## 1147 7858.851
## 654 9482.797
## 873 8953.500
## 652 8789.774
## 1074 6527.979
## 131 16193.848
## 1125 7669.733
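Because the preProcess() object was fitted on train_df, the exact inverse of the scaling uses the minimum and maximum of the training prices. The rescaling above takes them from podracer3, and the validation summary shows a normalised Price below 0, so the full-data minimum lies below the training minimum and the two ranges differ slightly. A minimal sketch with made-up prices shows the effect; unscale() is a hypothetical helper.

```r
# Illustrative prices only; the first vector stands in for the training set
train_price <- c(5000, 9000, 12000, 30000)
all_price   <- c(train_price, 4000)        # full data has a lower minimum

# Range scaling as fitted on the training values
lo <- min(train_price)
hi <- max(train_price)
scaled <- (train_price - lo) / (hi - lo)

# Inverse of range scaling for a given minimum and maximum
unscale <- function(z, lo, hi) z * (hi - lo) + lo

back_train <- unscale(scaled, lo, hi)                           # exact inverse
back_full  <- unscale(scaled, min(all_price), max(all_price))   # shifted

back_train
back_full
```

With the objects in this analysis, the training range would come from min(train_df$Price) and max(train_df$Price).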
valid_pred <- compute(train_df_2_nn1,
valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
"Doors", "Quarterly_Tax", "Mfr_Guarantee",
"Guarantee_Period", "Airco", "Automatic_airco",
"CD_Player", "Powered_Windows", "Sport_Model",
"Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale) on validation set.
head(valid_pred$net.result, 10)
## [,1]
## 2 0.4312241
## 3 0.4568309
## 6 0.3696080
## 10 0.3540338
## 11 0.6421790
## 12 0.6291678
## 14 0.5969761
## 17 0.5918448
## 19 0.3928083
## 22 0.4684290
accuracy(unlist(valid_pred), valid_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653
Predicted values on the validation set, scaled back to the original units.
valid_pred_original <- (valid_pred$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
The reported RMSE on the validation set is 0.779.
First 10 predicted values for Price (i.e. original scale).
head(valid_pred_original, 10)
## [,1]
## 2 16488.96
## 3 17209.79
## 6 14754.47
## 10 14316.05
## 11 22427.34
## 12 22061.07
## 14 21154.88
## 17 21010.43
## 19 15407.55
## 22 17536.28
train_df_2_nn2 <- neuralnet(Price ~ ., data = train_df_2,
                            linear.output = TRUE, hidden = 5)
train_df_2_nn2$weights
## [[1]]
## [[1]][[1]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.18957167 -0.89466610 1.0824754 2.42050649 -0.33127514
## [2,] 1.96054540 1.37433577 0.8948274 4.40041547 -3.10673667
## [3,] 1.13302729 1.82584835 0.4578356 0.42179024 -0.24636756
## [4,] 0.48787670 0.08808572 0.4508432 -2.68437920 1.35055091
## [5,] -0.39559804 -0.19464965 0.2720596 -0.66374101 -0.88375286
## [6,] 0.23118535 0.61259820 0.9671399 -0.14550667 0.42122786
## [7,] 0.71214886 -0.08396825 2.2444979 -1.98723018 1.29039226
## [8,] -1.42525933 1.32430778 0.8110131 0.55885189 0.42205369
## [9,] 0.07348997 -2.29756202 -0.2703548 -1.31054609 -1.73576189
## [10,] -0.24503662 0.18832419 0.5185287 0.05152996 -0.01311173
## [11,] -1.03896939 -0.58202995 -0.7346107 0.07856487 1.03093049
## [12,] -0.26664599 -0.49121719 -0.3720243 0.02349889 -0.34906382
## [13,] 0.12697065 -0.54179088 -0.8226099 0.43912863 0.63586422
## [14,] 0.08105145 -0.15883869 -0.2611583 -0.48826068 0.02000547
## [15,] 0.53209299 0.58647185 0.1358807 -0.63806182 0.89027493
## [16,] -1.43004532 33.76601654 1.2256482 0.66878117 -0.64258411
## [17,] -0.48939593 -0.28235172 -1.0951958 -0.53796605 -0.69528845
##
## [[1]][[2]]
## [,1]
## [1,] 0.4068726
## [2,] -0.1984329
## [3,] -0.3438218
## [4,] 0.7441377
## [5,] -0.5608212
## [6,] 0.1876774
plot(train_df_2_nn2, rep = "best")
Predictions on training set (normalised scale), neural network 2.
train_pred_2 <- compute(train_df_2_nn2,
train_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
"Doors", "Quarterly_Tax", "Mfr_Guarantee",
"Guarantee_Period", "Airco", "Automatic_airco",
"CD_Player", "Powered_Windows", "Sport_Model",
"Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale) neural network 2.
head(train_pred_2$net.result, 10)
## [,1]
## 638 0.13243667
## 608 0.13448536
## 907 0.19276265
## 1147 0.12628239
## 654 0.18202365
## 873 0.17223813
## 652 0.15082087
## 1074 0.03960327
## 131 0.43177621
## 1125 0.11421734
Accuracy for training set, neural network 2.
accuracy(unlist(train_pred_2), train_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf Inf
Predicted values on training set in original scale, neural network 2.
Scale back to original.
train_pred_original_2 <- (train_pred_2$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
First 10 predicted values for Price (i.e. original scale).
head(train_pred_original_2, 10)
## [,1]
## 638 8078.092
## 608 8135.763
## 907 9776.268
## 1147 7904.849
## 654 9473.966
## 873 9198.503
## 652 8595.608
## 1074 5464.832
## 131 16504.500
## 1125 7565.218
Check predictions on validation set, neural network 2.
valid_pred_2 <- compute(train_df_2_nn2,
valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
"Doors", "Quarterly_Tax", "Mfr_Guarantee",
"Guarantee_Period", "Airco", "Automatic_airco",
"CD_Player", "Powered_Windows", "Sport_Model",
"Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale) on validation set, neural network 2.
head(valid_pred_2$net.result, 10)
## [,1]
## 2 0.4632419
## 3 0.4150965
## 6 0.3725922
## 10 0.3717272
## 11 0.6679051
## 12 0.6343647
## 14 0.5968551
## 17 0.5702664
## 19 0.3181462
## 22 0.4930135
Accuracy for validation set.
accuracy(unlist(valid_pred_2), valid_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653
Predicted values on the validation set, scaled back to the original units, neural network 2.
valid_pred_original_2 <- (valid_pred_2$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
First 10 predicted values for Price (i.e. original scale).
head(valid_pred_original_2, 10)
## [,1]
## 2 17390.26
## 3 16034.97
## 6 14838.47
## 10 14814.12
## 11 23151.53
## 12 22207.36
## 14 21151.47
## 17 20403.00
## 19 13305.82
## 22 18228.33
train_df_2_nn3 <- neuralnet(Price ~ ., data = train_df_2,
                            linear.output = TRUE, hidden = c(5, 5))
train_df_2_nn3$weights
## [[1]]
## [[1]][[1]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3.0312157 0.729781413 1.73296058 0.526279836 1.36754542
## [2,] 2.2746104 -0.673971374 -1.28546167 1.865098445 1.92287126
## [3,] 1.1811063 0.346708205 -0.42711216 0.230995172 0.81082054
## [4,] 1.4600034 -0.666639146 -0.11534620 -0.488379522 -1.46673567
## [5,] 0.4963332 -0.148531969 -0.30864712 0.913119881 -0.43280048
## [6,] 0.6037612 0.922629396 0.41989183 -0.817610361 -0.36119700
## [7,] -2.1785460 0.295010338 0.34242470 0.619580337 -1.12396894
## [8,] -0.3401346 -2.693842026 -0.96621894 -0.526085172 0.04697226
## [9,] -0.4016648 -0.702544216 1.09377040 -4.047313727 -0.23776733
## [10,] -0.5794138 0.001617147 -0.14640835 1.259149845 0.07297571
## [11,] 0.4966499 0.276719928 0.61804940 1.384689420 -0.19907089
## [12,] -0.1784131 0.897779012 0.38114986 0.472464309 0.16125459
## [13,] 0.0844264 -0.044855992 0.11172067 -0.363590483 -0.08471410
## [14,] -0.4507591 0.832312013 0.55384361 -1.399723115 0.08354863
## [15,] -0.3923960 -0.120520575 -0.20651297 0.701537274 0.08108022
## [16,] -0.8827099 -0.287324053 0.65404356 -9.826423125 2.64001952
## [17,] 0.3280078 0.093677822 -0.07680684 0.002000548 0.29415752
##
## [[1]][[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.2529114 -1.0387237 -0.8954089 1.9735991 2.0651737
## [2,] -1.8123231 2.0692834 1.8043833 -0.3872393 0.3150387
## [3,] 1.3408346 -1.4238964 -1.0127844 -1.6857784 -0.2976973
## [4,] -0.5366908 -0.5939110 -0.8086246 0.3699647 -1.5691488
## [5,] 0.2024926 -0.1763935 -0.3596893 0.3285640 0.9199554
## [6,] 0.5325236 1.6950461 -0.2003131 -0.8837031 0.1232894
##
## [[1]][[3]]
## [,1]
## [1,] 0.7841711
## [2,] 1.1555838
## [3,] -1.1197686
## [4,] -0.5145833
## [5,] 2.4615883
## [6,] -1.5177305
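The size of each fitted network can be read off the dimensions of the weight matrices. A small helper (count_weights() is hypothetical, not part of neuralnet) adds one bias row per layer and counts the weights for the three architectures fitted here, each with 16 inputs and one output:

```r
# Number of weights in a fully connected network, including one bias per unit
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  # Each layer contributes (units in previous layer + 1 bias) * units in layer
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

count_weights(16, 2)         # 17 x 2 plus 3 x 1 = 37
count_weights(16, 5)         # 17 x 5 plus 6 x 1 = 91
count_weights(16, c(5, 5))   # 17 x 5 plus 6 x 5 plus 6 x 1 = 121
```

The third network therefore has more than three times as many weights as the first, all estimated from the same 861 training records.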
plot(train_df_2_nn3, rep = "best")
Predictions on training set (normalised scale), neural network 3.
train_pred_3 <- compute(train_df_2_nn3,
train_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
"Doors", "Quarterly_Tax", "Mfr_Guarantee",
"Guarantee_Period", "Airco", "Automatic_airco",
"CD_Player", "Powered_Windows", "Sport_Model",
"Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale) neural network 3.
head(train_pred_3$net.result, 10)
## [,1]
## 638 0.13055802
## 608 0.13309814
## 907 0.18083207
## 1147 0.12413856
## 654 0.18842596
## 873 0.16762104
## 652 0.13408202
## 1074 0.05540572
## 131 0.41769986
## 1125 0.12202005
Accuracy for training set, neural network 3.
accuracy(unlist(train_pred_3), train_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf Inf
Predicted values on training set in original scale, neural network 3.
Scale back to original.
train_pred_original_3 <- (train_pred_3$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
First 10 predicted values for Price (i.e. original scale).
head(train_pred_original_3, 10)
## [,1]
## 638 8025.208
## 608 8096.713
## 907 9440.423
## 1147 7844.500
## 654 9654.191
## 873 9068.532
## 652 8124.409
## 1074 5909.671
## 131 16108.251
## 1125 7784.864
Check predictions on validation set, neural network 3.
# Pass the 16 predictors only; Price is the response and is not an input
valid_pred_3 <- compute(train_df_2_nn3,
  valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic",
                 "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                 "Guarantee_Period", "Airco", "Automatic_airco",
                 "CD_Player", "Powered_Windows", "Sport_Model",
                 "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])
Check predicted values (normalised scale) on validation set, neural network 3.
head(valid_pred_3$net.result, 10)
## [,1]
## 2 0.4250526
## 3 0.3800759
## 6 0.3717043
## 10 0.3667990
## 11 0.6164454
## 12 0.6113172
## 14 0.5826810
## 17 0.5977831
## 19 0.3853424
## 22 0.4978584
Accuracy for validation set.
accuracy(unlist(valid_pred_3), valid_df_2$Price)
## ME RMSE MAE MPE MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653
Predicted values on the validation set, scaled back to the original units, neural network 3.
valid_pred_original_3 <- (valid_pred_3$net.result * (max(podracer3$Price) -
min(podracer3$Price))) + min(podracer3$Price)
First 10 predicted values for Price (i.e. original scale), neural network 3.
head(valid_pred_original_3, 10)
## [,1]
## 2 16315.23
## 3 15049.14
## 6 14813.47
## 10 14675.39
## 11 21702.94
## 12 21558.58
## 14 20752.47
## 17 21177.59
## 19 15197.39
## 22 18364.71
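The reported accuracy measures (RMSE 0.790 on the training set, 0.779 on the validation set) did not change when the number of layers and nodes increased. They agree to every digit across the three networks, which points to the accuracy() calls rather than to the models: each call was given unlist() of the whole compute() result instead of net.result, and the reported ME equals the mean normalised Price minus 1 (0.2201 - 1 = -0.7799 on the training set), as if a constant 1 had been scored. These figures therefore cannot rank the three networks. In general, a more complex model does not necessarily yield better results, and keeping the number of nodes small makes the network easier to interpret.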