Directions

Neural networks to predict the price of podracers.

1. Load Data

podracer <- read.csv("podracer.csv", header = TRUE)

head(podracer, 10)
##    Id                                   Model Price Age_08_04 Mfg_Month
## 1   1 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13500        23        10
## 2   2 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13750        23        10
## 3   3 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13950        24         9
## 4   4 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 14950        26         7
## 5   5   Podracer 2.0 D4D HATCHB SOL 2/3-Doors 13750        30         3
## 6   6   Podracer 2.0 D4D HATCHB SOL 2/3-Doors 12950        32         1
## 7   7 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 16900        27         6
## 8   8 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 18600        30         3
## 9   9   Podracer 1800 T SPORT VVT I 2/3-Doors 21500        27         6
## 10 10   Podracer 1.9 D HATCHB TERRA 2/3-Doors 12950        23        10
##    Mfg_Year    KM Fuel_Type  HP Met_Color  Color Automatic   CC Doors Cylinders
## 1      2002 46986    Diesel  90         1   Blue         0 2000     3         4
## 2      2002 72937    Diesel  90         1 Silver         0 2000     3         4
## 3      2002 41711    Diesel  90         1   Blue         0 2000     3         4
## 4      2002 48000    Diesel  90         0  Black         0 2000     3         4
## 5      2002 38500    Diesel  90         0  Black         0 2000     3         4
## 6      2002 61000    Diesel  90         0  White         0 2000     3         4
## 7      2002 94612    Diesel  90         1   Grey         0 2000     3         4
## 8      2002 75889    Diesel  90         1   Grey         0 2000     3         4
## 9      2002 19700    Petrol 192         0    Red         0 1800     3         4
## 10     2002 71138    Diesel  69         0   Blue         0 1900     3         4
##    Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
## 1      5           210   1165             0               1                3
## 2      5           210   1165             0               1                3
## 3      5           210   1165             1               1                3
## 4      5           210   1165             1               1                3
## 5      5           210   1170             1               1                3
## 6      5           210   1170             0               1                3
## 7      5           210   1245             0               1                3
## 8      5           210   1245             1               1                3
## 9      5           100   1185             0               1                3
## 10     5           185   1105             0               1                3
##    ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player
## 1    1        1        1     0               0             1         0
## 2    1        1        1     1               0             1         1
## 3    1        1        1     0               0             1         0
## 4    1        1        1     0               0             1         0
## 5    1        1        1     1               0             1         0
## 6    1        1        1     1               0             1         0
## 7    1        1        1     1               0             1         0
## 8    1        1        1     1               0             1         1
## 9    1        1        0     1               0             0         0
## 10   1        1        1     1               0             1         0
##    Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model
## 1             1               1              1     0         0           0
## 2             1               0              1     0         0           0
## 3             0               0              1     0         0           0
## 4             0               0              1     0         0           0
## 5             1               1              1     0         1           0
## 6             1               1              1     0         1           0
## 7             1               1              1     0         0           1
## 8             1               1              1     0         0           0
## 9             1               1              1     1         0           0
## 10            0               0              1     0         0           0
##    Backseat_Divider Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1                 1            0              0                 0       0
## 2                 1            0              0                 0       0
## 3                 1            0              0                 0       0
## 4                 1            0              0                 0       0
## 5                 1            0              0                 0       0
## 6                 1            0              0                 0       0
## 7                 1            0              0                 0       0
## 8                 1            0              0                 0       0
## 9                 0            1              1                 0       0
## 10                1            0              0                 0       0
nrow(podracer)
## [1] 1436
names(podracer)
##  [1] "Id"                "Model"             "Price"            
##  [4] "Age_08_04"         "Mfg_Month"         "Mfg_Year"         
##  [7] "KM"                "Fuel_Type"         "HP"               
## [10] "Met_Color"         "Color"             "Automatic"        
## [13] "CC"                "Doors"             "Cylinders"        
## [16] "Gears"             "Quarterly_Tax"     "Weight"           
## [19] "Mfr_Guarantee"     "BOVAG_Guarantee"   "Guarantee_Period" 
## [22] "ABS"               "Airbag_1"          "Airbag_2"         
## [25] "Airco"             "Automatic_airco"   "Boardcomputer"    
## [28] "CD_Player"         "Central_Lock"      "Powered_Windows"  
## [31] "Power_Steering"    "Radio"             "Mistlamps"        
## [34] "Sport_Model"       "Backseat_Divider"  "Metallic_Rim"     
## [37] "Radio_cassette"    "Parking_Assistant" "Tow_Bar"
str(podracer)
## 'data.frame':    1436 obs. of  39 variables:
##  $ Id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Model            : Factor w/ 319 levels "Podracer","Podracer ! 1.6-16v vvt-i sol airco sedan 4/5-Doors",..: 276 276 276 276 275 275 269 269 257 246 ...
##  $ Price            : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
##  $ Age_08_04        : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ Mfg_Month        : int  10 10 9 7 3 1 6 3 6 10 ...
##  $ Mfg_Year         : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ KM               : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type        : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ HP               : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Met_Color        : int  1 1 1 0 0 0 1 1 0 0 ...
##  $ Color            : Factor w/ 10 levels "Beige","Black",..: 3 7 3 2 2 9 5 5 6 3 ...
##  $ Automatic        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CC               : int  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
##  $ Doors            : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Cylinders        : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Gears            : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Quarterly_Tax    : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Weight           : int  1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
##  $ Mfr_Guarantee    : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ BOVAG_Guarantee  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Guarantee_Period : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ ABS              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_1         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_2         : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Airco            : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Boardcomputer    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ CD_Player        : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Central_Lock     : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ Powered_Windows  : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Power_Steering   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Radio            : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Mistlamps        : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ Sport_Model      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Backseat_Divider : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Metallic_Rim     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Radio_cassette   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Parking_Assistant: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tow_Bar          : int  0 0 0 0 0 0 0 0 0 0 ...

2. PreProcessing

2.1 Filter for required variables

podracer2 <- podracer[,c(4, 7:9, 12, 14, 17, 19, 21, 25:26, 28, 30, 34, 39, 3)]
head(podracer2)
##   Age_08_04    KM Fuel_Type HP Automatic Doors Quarterly_Tax Mfr_Guarantee
## 1        23 46986    Diesel 90         0     3           210             0
## 2        23 72937    Diesel 90         0     3           210             0
## 3        24 41711    Diesel 90         0     3           210             1
## 4        26 48000    Diesel 90         0     3           210             1
## 5        30 38500    Diesel 90         0     3           210             1
## 6        32 61000    Diesel 90         0     3           210             0
##   Guarantee_Period Airco Automatic_airco CD_Player Powered_Windows Sport_Model
## 1                3     0               0         0               1           0
## 2                3     1               0         1               0           0
## 3                3     0               0         0               0           0
## 4                3     0               0         0               0           0
## 5                3     1               0         0               1           0
## 6                3     1               0         0               1           0
##   Tow_Bar Price
## 1       0 13500
## 2       0 13750
## 3       0 13950
## 4       0 14950
## 5       0 13750
## 6       0 12950
nrow(podracer2)
## [1] 1436
names(podracer2)
##  [1] "Age_08_04"        "KM"               "Fuel_Type"        "HP"              
##  [5] "Automatic"        "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"   
##  [9] "Guarantee_Period" "Airco"            "Automatic_airco"  "CD_Player"       
## [13] "Powered_Windows"  "Sport_Model"      "Tow_Bar"          "Price"
str(podracer2)
## 'data.frame':    1436 obs. of  16 variables:
##  $ Age_08_04       : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ KM              : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type       : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ HP              : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Automatic       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Doors           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Quarterly_Tax   : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Mfr_Guarantee   : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ Guarantee_Period: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Airco           : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CD_Player       : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Powered_Windows : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Sport_Model     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Tow_Bar         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Price           : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
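
Selecting columns by position works, but it breaks silently if the column order of the CSV ever changes. A minimal sketch of the same idea using column names, on a small stand-in data frame (`toy` is hypothetical; only the column names come from the data above):

```r
# Hypothetical miniature of the podracer data, for illustration only
toy <- data.frame(
  Id        = 1:3,
  Price     = c(13500, 13750, 13950),
  Age_08_04 = c(23, 23, 24),
  KM        = c(46986, 72937, 41711),
  Color     = c("Blue", "Silver", "Blue")
)

# Predictors first, outcome last, mirroring the layout of podracer2
keep <- c("Age_08_04", "KM", "Price")
toy2 <- toy[, keep]

names(toy2)
```

On the real data, `keep` would list all sixteen names shown by `names(podracer2)`.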

2.2 Create dummy variables

table(podracer2$Fuel_Type)
## 
##    CNG Diesel Petrol 
##     17    155   1264
podracer2$Fuel_Type_CNG <- ifelse(podracer2$Fuel_Type == "CNG", 1 , 0)
podracer2$Fuel_Type_Diesel <- ifelse(podracer2$Fuel_Type == "Diesel",
                                   1, 0)

names(podracer2)
##  [1] "Age_08_04"        "KM"               "Fuel_Type"        "HP"              
##  [5] "Automatic"        "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"   
##  [9] "Guarantee_Period" "Airco"            "Automatic_airco"  "CD_Player"       
## [13] "Powered_Windows"  "Sport_Model"      "Tow_Bar"          "Price"           
## [17] "Fuel_Type_CNG"    "Fuel_Type_Diesel"
podracer3 <- podracer2[, -c(3)]
names(podracer3)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"
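
Fuel_Type has three levels, so two indicator columns are enough: a row with 0 in both is the remaining level (Petrol), which acts as the baseline. A minimal sketch on a made-up vector (the values in `fuel` are invented for illustration):

```r
fuel <- c("Diesel", "Petrol", "CNG", "Petrol")   # made-up sample

dummies <- data.frame(
  Fuel_Type_CNG    = ifelse(fuel == "CNG", 1, 0),
  Fuel_Type_Diesel = ifelse(fuel == "Diesel", 1, 0)
)

# Rows where both indicators are 0 belong to the baseline level (Petrol)
baseline <- rowSums(dummies) == 0
fuel[baseline]
```

Adding a third indicator for Petrol would be redundant, because it is fully determined by the other two.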

3. Training Validation Split

Using our favourite seed :-)

set.seed(666)

train_index <- sample(1:nrow(podracer3), 0.6 * nrow(podracer3))
valid_index <- setdiff(1:nrow(podracer3), train_index)

train_df <- podracer3[train_index, ]
valid_df <- podracer3[valid_index, ]
nrow(train_df)
## [1] 861
nrow(valid_df)
## [1] 575
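
The requested training size, 0.6 * 1436 = 861.6, is not a whole number; the row counts above (861 and 575) are consistent with it being truncated. A small sketch that makes the truncation explicit with `floor()` and checks that the two index sets do not overlap (it uses only the row count, not the data itself):

```r
n <- 1436                       # number of rows reported by nrow(podracer)
set.seed(666)

n_train     <- floor(0.6 * n)   # floor() makes the truncation explicit
train_index <- sample(seq_len(n), n_train)
valid_index <- setdiff(seq_len(n), train_index)

length(train_index)
length(valid_index)
length(intersect(train_index, valid_index))   # 0 means no row is in both sets
```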

4. Normalise

Normalise the data using training set

summary(podracer3[1:15])
##    Age_08_04           KM               HP          Automatic      
##  Min.   : 1.00   Min.   :     1   Min.   : 69.0   Min.   :0.00000  
##  1st Qu.:44.00   1st Qu.: 43000   1st Qu.: 90.0   1st Qu.:0.00000  
##  Median :61.00   Median : 63390   Median :110.0   Median :0.00000  
##  Mean   :55.95   Mean   : 68533   Mean   :101.5   Mean   :0.05571  
##  3rd Qu.:70.00   3rd Qu.: 87021   3rd Qu.:110.0   3rd Qu.:0.00000  
##  Max.   :80.00   Max.   :243000   Max.   :192.0   Max.   :1.00000  
##      Doors       Quarterly_Tax    Mfr_Guarantee    Guarantee_Period
##  Min.   :2.000   Min.   : 19.00   Min.   :0.0000   Min.   : 3.000  
##  1st Qu.:3.000   1st Qu.: 69.00   1st Qu.:0.0000   1st Qu.: 3.000  
##  Median :4.000   Median : 85.00   Median :0.0000   Median : 3.000  
##  Mean   :4.033   Mean   : 87.12   Mean   :0.4095   Mean   : 3.815  
##  3rd Qu.:5.000   3rd Qu.: 85.00   3rd Qu.:1.0000   3rd Qu.: 3.000  
##  Max.   :5.000   Max.   :283.00   Max.   :1.0000   Max.   :36.000  
##      Airco        Automatic_airco     CD_Player      Powered_Windows
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.0000   Median :0.00000   Median :0.0000   Median :1.000  
##  Mean   :0.5084   Mean   :0.05641   Mean   :0.2187   Mean   :0.562  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000  
##   Sport_Model        Tow_Bar           Price      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 4350  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 8450  
##  Median :0.0000   Median :0.0000   Median : 9900  
##  Mean   :0.3001   Mean   :0.2779   Mean   :10731  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11950  
##  Max.   :1.0000   Max.   :1.0000   Max.   :32500
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

Estimate the scaling parameters (the minimum and maximum of each column) from the training set.

train_df_norm <- preProcess(train_df[1:15], method = c("range"))
train_df_norm
## Created from 861 samples and 15 variables
## 
## Pre-processing:
##   - ignored (0)
##   - re-scaling to [0, 1] (15)

Transform the training set using the scaling parameters estimated from the training set.

train_df_transform <- predict(train_df_norm, train_df[1:15])
summary(train_df_transform)
##    Age_08_04            KM               HP           Automatic     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5263   1st Qu.:0.1850   1st Qu.:0.1707   1st Qu.:0.0000  
##  Median :0.7500   Median :0.2642   Median :0.3333   Median :0.0000  
##  Mean   :0.6911   Mean   :0.2885   Mean   :0.2643   Mean   :0.0511  
##  3rd Qu.:0.8684   3rd Qu.:0.3668   3rd Qu.:0.3333   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Doors        Quarterly_Tax    Mfr_Guarantee    Guarantee_Period 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.3333   1st Qu.:0.1894   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6667   Median :0.2500   Median :0.0000   Median :0.00000  
##  Mean   :0.6756   Mean   :0.2623   Mean   :0.3972   Mean   :0.02678  
##  3rd Qu.:1.0000   3rd Qu.:0.2500   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      Airco        Automatic_airco     CD_Player     Powered_Windows 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.000   Median :1.0000  
##  Mean   :0.4925   Mean   :0.04181   Mean   :0.216   Mean   :0.5563  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000   Max.   :1.0000  
##   Sport_Model        Tow_Bar           Price       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.1441  
##  Median :0.0000   Median :0.0000   Median :0.1922  
##  Mean   :0.2904   Mean   :0.2962   Mean   :0.2201  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.2667  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
names(train_df_transform)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"

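Transform the validation set using the same training-set scaling parameters. Validation values can therefore fall slightly outside [0, 1], as the minimum of Age_08_04 below shows.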

valid_df_transform <- predict(train_df_norm, valid_df[1:15])
summary(valid_df_transform)
##    Age_08_04              KM               HP           Automatic      
##  Min.   :-0.03947   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.51316   1st Qu.:0.1630   1st Qu.:0.1382   1st Qu.:0.00000  
##  Median : 0.73684   Median :0.2549   Median :0.3333   Median :0.00000  
##  Mean   : 0.67217   Mean   :0.2723   Mean   :0.2642   Mean   :0.06261  
##  3rd Qu.: 0.86842   3rd Qu.:0.3459   3rd Qu.:0.3333   3rd Qu.:0.00000  
##  Max.   : 1.00000   Max.   :0.8961   Max.   :1.0000   Max.   :1.00000  
##      Doors        Quarterly_Tax    Mfr_Guarantee    Guarantee_Period 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.3333   1st Qu.:0.1894   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6667   Median :0.2500   Median :0.0000   Median :0.00000  
##  Mean   :0.6812   Mean   :0.2517   Mean   :0.4278   Mean   :0.02161  
##  3rd Qu.:1.0000   3rd Qu.:0.2500   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :0.8144   Max.   :1.0000   Max.   :1.00000  
##      Airco        Automatic_airco     CD_Player      Powered_Windows 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00000   Median :0.0000   Median :1.0000  
##  Mean   :0.5322   Mean   :0.07826   Mean   :0.2226   Mean   :0.5704  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   Sport_Model        Tow_Bar           Price          
##  Min.   :0.0000   Min.   :0.0000   Min.   :-0.001779  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.144128  
##  Median :0.0000   Median :0.0000   Median : 0.197509  
##  Mean   :0.3148   Mean   :0.2504   Mean   : 0.233138  
##  3rd Qu.:1.0000   3rd Qu.:0.5000   3rd Qu.: 0.268683  
##  Max.   :1.0000   Max.   :1.0000   Max.   : 0.732740
names(valid_df_transform)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"
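
Range scaling maps each value x to (x - min) / (max - min), with min and max taken from the training data only. The sketch below approximates that behaviour in base R on made-up numbers; `fit_range` and `apply_range` are hypothetical helper names, not caret functions.

```r
# Hypothetical helpers; base R only
fit_range   <- function(x) c(min = min(x), max = max(x))
apply_range <- function(x, p) (x - p[["min"]]) / (p[["max"]] - p[["min"]])

train_age <- c(4, 23, 61, 80)   # made-up training values
valid_age <- c(1, 44, 80)       # made-up validation values

p <- fit_range(train_age)       # parameters come from the training set only

apply_range(train_age, p)       # spans exactly [0, 1]
apply_range(valid_age, p)       # first value is negative: 1 is below the training minimum
```

Reusing the training parameters keeps information about the validation set out of the preprocessing step, at the cost of a few values outside the unit interval.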

Create full training and validation sets with dummy variables.

train_df_2 <- cbind(train_df_transform, train_df[16:17])
names(train_df_2)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"
str(train_df_2)
## 'data.frame':    861 obs. of  17 variables:
##  $ Age_08_04       : num  0.724 0.763 0.842 0.921 0.789 ...
##  $ KM              : num  0.501 0.755 0.242 0.419 0.473 ...
##  $ HP              : num  0.1382 0.0244 0.3333 0.3333 0.3333 ...
##  $ Automatic       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Doors           : num  1 1 0.333 0.333 1 ...
##  $ Quarterly_Tax   : num  0.189 0.629 0.25 0.189 0.25 ...
##  $ Mfr_Guarantee   : num  0 0 1 0 1 1 0 0 0 0 ...
##  $ Guarantee_Period: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Airco           : num  0 1 1 1 1 0 1 0 1 0 ...
##  $ Automatic_airco : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CD_Player       : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ Powered_Windows : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ Sport_Model     : num  1 0 0 1 0 0 0 0 1 0 ...
##  $ Tow_Bar         : num  0 0 1 0 0 1 1 0 0 0 ...
##  $ Price           : num  0.128 0.11 0.19 0.089 0.198 ...
##  $ Fuel_Type_CNG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fuel_Type_Diesel: num  0 1 0 0 0 0 0 1 0 0 ...
valid_df_2 <- cbind(valid_df_transform, valid_df[16:17])
names(valid_df_2)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"
str(valid_df_2)
## 'data.frame':    575 obs. of  17 variables:
##  $ Age_08_04       : num  0.25 0.263 0.368 0.25 0.276 ...
##  $ KM              : num  0.3 0.172 0.251 0.293 0.129 ...
##  $ HP              : num  0.171 0.171 0.171 0 1 ...
##  $ Automatic       : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Doors           : num  0.333 0.333 0.333 0.333 0.333 ...
##  $ Quarterly_Tax   : num  0.723 0.723 0.723 0.629 0.307 ...
##  $ Mfr_Guarantee   : num  0 1 0 0 1 1 1 0 0 0 ...
##  $ Guarantee_Period: num  0 0 0 0 0.273 ...
##  $ Airco           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Automatic_airco : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ CD_Player       : num  1 0 0 0 1 0 1 1 1 0 ...
##  $ Powered_Windows : num  0 0 1 0 1 1 1 1 1 1 ...
##  $ Sport_Model     : num  0 0 0 0 0 1 1 0 0 1 ...
##  $ Tow_Bar         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Price           : num  0.333 0.34 0.304 0.304 0.589 ...
##  $ Fuel_Type_CNG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fuel_Type_Diesel: num  1 1 1 1 0 0 0 0 0 0 ...

5. Fit a Neural Network

library(neuralnet)
names(train_df_2)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"
train_df_2_nn1 <- neuralnet(Price ~ ., data = train_df_2,
                            linear.output = TRUE, hidden = 2)

train_df_2_nn1$weights
## [[1]]
## [[1]][[1]]
##              [,1]        [,2]
##  [1,]  1.15901800 -1.08563353
##  [2,]  1.82816650 -2.02230384
##  [3,]  0.86336022 -1.21240231
##  [4,] -0.58612529  0.94654369
##  [5,] -0.03594813  0.17996578
##  [6,] -0.06898549  0.05446269
##  [7,] -0.91984206  1.88943957
##  [8,] -0.26167277 -0.12012124
##  [9,] -1.57051151 -0.43253884
## [10,] -0.35223374 -0.13300483
## [11,] -0.49500291  0.08573496
## [12,] -0.10767580 -0.02481818
## [13,] -0.69788941 -0.39766470
## [14,] -0.13204115 -0.08298837
## [15,]  0.01340312 -0.02912053
## [16,]  0.31749721 -0.79525038
## [17,]  0.57706746 -0.19400431
## 
## [[1]][[2]]
##            [,1]
## [1,]  0.4727631
## [2,] -0.4302306
## [3,]  0.9083219
plot(train_df_2_nn1, rep = "best")
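
The first weight matrix above has 17 rows (an intercept plus 16 predictors) and one column per hidden node; the second has 3 rows (an intercept plus 2 hidden nodes). The sketch below shows how a prediction could be assembled from such matrices, assuming the logistic activation that neuralnet uses by default in the hidden layer and no activation on the output (linear.output). The function `forward` and the toy weights are hypothetical.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical helper: forward pass through one hidden layer
forward <- function(x, w_hidden, w_out) {
  h <- logistic(cbind(1, x) %*% w_hidden)   # prepend intercept, then hidden activations
  drop(cbind(1, h) %*% w_out)               # linear output layer, no activation
}

# Toy network: 2 inputs, 2 hidden nodes (weights invented for illustration)
w_hidden <- matrix(c(0,  1, -1,
                     0, -1,  1), nrow = 3)    # rows: intercept, x1, x2
w_out    <- matrix(c(0.5, 1, -1), nrow = 3)   # rows: intercept, h1, h2

x <- matrix(c(0, 0,
              1, 0), nrow = 2, byrow = TRUE)  # two observations
forward(x, w_hidden, w_out)
```

Applied to the fitted model, the two matrices would be `train_df_2_nn1$weights[[1]][[1]]` and `train_df_2_nn1$weights[[1]][[2]]`, with the predictor columns supplied in the same order the model saw them.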

names(train_df_2)
##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"

5.1 Predictions on training set (normalised scale)

train_pred <- compute(train_df_2_nn1, 
                      train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale).

head(train_pred$net.result, 10)
##            [,1]
## 638  0.13388562
## 608  0.13360861
## 907  0.19005452
## 1147 0.12464834
## 654  0.18233738
## 873  0.16353465
## 652  0.15771844
## 1074 0.07737048
## 131  0.42074060
## 1125 0.11793014

5.2 Accuracy for training set

library(forecast)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff
accuracy(unlist(train_pred), train_df_2$Price)
##                  ME      RMSE       MAE  MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf  Inf
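
The ME of -0.7799 is almost exactly the mean normalised training price (0.2201) minus 1, which hints that the values being scored are a column of ones rather than the network's predictions. One plausible cause, stated here as an assumption: `unlist()` flattens everything that `compute()` returns, and the first values it yields are the intercept column of the input layer. The sketch below reproduces the effect with a stand-in list (`fake_pred` is hypothetical and only mimics the shape of a `compute()` result).

```r
set.seed(1)
actual <- runif(5)                                           # stand-in for normalised prices
fake_pred <- list(
  neurons    = list(cbind(1, matrix(runif(10), nrow = 5))),  # intercept + 2 inputs
  net.result = matrix(actual + 0.01, ncol = 1)               # near-perfect predictions
)

flat <- unlist(fake_pred)
length(flat)      # more values than there are observations
head(flat, 5)     # the intercept column comes first

rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

rmse(flat[seq_along(actual)], actual)           # scores the ones, not the predictions
rmse(as.vector(fake_pred$net.result), actual)   # scores the predictions only
```

On the objects above, the corresponding call would be `accuracy(as.vector(train_pred$net.result), train_df_2$Price)`. The same reasoning applies to the validation set and to the second network, whose reported accuracy is identical to the first.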

Predicted values on training set in original scale.

Scale back to original.

train_pred_original <- (train_pred$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price) 

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original, 10)
##           [,1]
## 638   8118.880
## 608   8111.082
## 907   9700.035
## 1147  7858.851
## 654   9482.797
## 873   8953.500
## 652   8789.774
## 1074  6527.979
## 131  16193.848
## 1125  7669.733
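
Because the scaling parameters were estimated on `train_df`, the inverse transform arguably needs the training set's minimum and maximum of Price rather than those of the full `podracer3`. The slightly negative validation minimum for Price in the normalisation step suggests the two minima differ. A sketch of the round trip with made-up prices (`unscale` is a hypothetical helper):

```r
# Hypothetical helper: invert min-max scaling given the fitted minimum and maximum
unscale <- function(z, lo, hi) z * (hi - lo) + lo

train_price <- c(4400, 9900, 32500)   # made-up training prices
lo <- min(train_price)
hi <- max(train_price)

z <- (train_price - lo) / (hi - lo)   # forward transform, as in the normalisation step
unscale(z, lo, hi)                    # recovers the original prices

# A different minimum (for example one taken from the full data) shifts the result
unscale(z, 4350, hi)
```

On the objects above that would read `unscale(train_pred$net.result, min(train_df$Price), max(train_df$Price))`.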

5.3 Predictions on validation set

valid_pred <- compute(train_df_2_nn1, 
                      valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set.

head(valid_pred$net.result, 10)
##         [,1]
## 2  0.4312241
## 3  0.4568309
## 6  0.3696080
## 10 0.3540338
## 11 0.6421790
## 12 0.6291678
## 14 0.5969761
## 17 0.5918448
## 19 0.3928083
## 22 0.4684290

5.4 Accuracy for validation set

accuracy(unlist(valid_pred), valid_df_2$Price)
##                  ME     RMSE       MAE       MPE     MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653

Predicted values on validation set in original scale.

Scale back to original.

valid_pred_original <- (valid_pred$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price) 

The RMSE on the validation set, as reported by accuracy() above, is 0.779 (normalised scale).

First 10 predicted values for Price (i.e. original scale).

head(valid_pred_original, 10)
##        [,1]
## 2  16488.96
## 3  17209.79
## 6  14754.47
## 10 14316.05
## 11 22427.34
## 12 22061.07
## 14 21154.88
## 17 21010.43
## 19 15407.55
## 22 17536.28

6. Different specifications

6.1 Neural network with 1 hidden layer, 5 nodes

train_df_2_nn2 <- neuralnet(Price ~ ., data = train_df_2,
                            linear.output = TRUE, hidden = 5)

train_df_2_nn2$weights
## [[1]]
## [[1]][[1]]
##              [,1]        [,2]       [,3]        [,4]        [,5]
##  [1,] -0.18957167 -0.89466610  1.0824754  2.42050649 -0.33127514
##  [2,]  1.96054540  1.37433577  0.8948274  4.40041547 -3.10673667
##  [3,]  1.13302729  1.82584835  0.4578356  0.42179024 -0.24636756
##  [4,]  0.48787670  0.08808572  0.4508432 -2.68437920  1.35055091
##  [5,] -0.39559804 -0.19464965  0.2720596 -0.66374101 -0.88375286
##  [6,]  0.23118535  0.61259820  0.9671399 -0.14550667  0.42122786
##  [7,]  0.71214886 -0.08396825  2.2444979 -1.98723018  1.29039226
##  [8,] -1.42525933  1.32430778  0.8110131  0.55885189  0.42205369
##  [9,]  0.07348997 -2.29756202 -0.2703548 -1.31054609 -1.73576189
## [10,] -0.24503662  0.18832419  0.5185287  0.05152996 -0.01311173
## [11,] -1.03896939 -0.58202995 -0.7346107  0.07856487  1.03093049
## [12,] -0.26664599 -0.49121719 -0.3720243  0.02349889 -0.34906382
## [13,]  0.12697065 -0.54179088 -0.8226099  0.43912863  0.63586422
## [14,]  0.08105145 -0.15883869 -0.2611583 -0.48826068  0.02000547
## [15,]  0.53209299  0.58647185  0.1358807 -0.63806182  0.89027493
## [16,] -1.43004532 33.76601654  1.2256482  0.66878117 -0.64258411
## [17,] -0.48939593 -0.28235172 -1.0951958 -0.53796605 -0.69528845
## 
## [[1]][[2]]
##            [,1]
## [1,]  0.4068726
## [2,] -0.1984329
## [3,] -0.3438218
## [4,]  0.7441377
## [5,] -0.5608212
## [6,]  0.1876774
plot(train_df_2_nn2, rep = "best")
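
The shapes of the weight matrices follow from the architecture: each layer has one row per incoming node plus an intercept, and one column per outgoing node. With 16 predictors and 5 hidden nodes that gives a 17 × 5 and a 6 × 1 matrix, as printed above. A small sketch that counts the weights for a given specification (`count_weights` is a hypothetical helper, not part of neuralnet):

```r
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  # (incoming nodes + intercept) * outgoing nodes, summed over consecutive layers
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

count_weights(16, 2)         # 17*2 + 3*1
count_weights(16, 5)         # 17*5 + 6*1
count_weights(16, c(5, 5))   # 17*5 + 6*5 + 6*1
```

More weights give the network more flexibility, which is why comparing training and validation accuracy across specifications matters.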

Predictions on training set (normalised scale), neural network 2.

train_pred_2 <- compute(train_df_2_nn2, 
                      train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) neural network 2.

head(train_pred_2$net.result, 10)
##            [,1]
## 638  0.13243667
## 608  0.13448536
## 907  0.19276265
## 1147 0.12628239
## 654  0.18202365
## 873  0.17223813
## 652  0.15082087
## 1074 0.03960327
## 131  0.43177621
## 1125 0.11421734

Accuracy for training set, neural network 2.

accuracy(unlist(train_pred_2), train_df_2$Price)
##                  ME      RMSE       MAE  MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf  Inf

Predicted values on training set in original scale, neural network 2.

Scale back to original.

train_pred_original_2 <- (train_pred_2$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price) 

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original_2, 10)
##           [,1]
## 638   8078.092
## 608   8135.763
## 907   9776.268
## 1147  7904.849
## 654   9473.966
## 873   9198.503
## 652   8595.608
## 1074  5464.832
## 131  16504.500
## 1125  7565.218

Check predictions on validation set, neural network 2.

valid_pred_2 <- compute(train_df_2_nn2, 
                      valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set, neural network 2.

head(valid_pred_2$net.result, 10)
##         [,1]
## 2  0.4632419
## 3  0.4150965
## 6  0.3725922
## 10 0.3717272
## 11 0.6679051
## 12 0.6343647
## 14 0.5968551
## 17 0.5702664
## 19 0.3181462
## 22 0.4930135
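
The same 16 column names are typed out for every `compute()` call. One possible way to avoid the repetition, assuming the data frame holds only `Price` plus the predictors, is to derive the predictor names once. The sketch uses a toy data frame (`toy_df`, hypothetical values) in place of `train_df_2`.

```r
# Toy data frame standing in for train_df_2 (hypothetical values).
toy_df <- data.frame(
  Price     = c(0.1, 0.5, 0.9),
  Age_08_04 = c(0.2, 0.4, 0.6),
  KM        = c(0.3, 0.1, 0.7)
)

# Every column except the response is treated as a predictor.
predictors <- setdiff(names(toy_df), "Price")
predictors

# Subset once; drop = FALSE keeps a data frame even with one predictor.
toy_x <- toy_df[, predictors, drop = FALSE]
names(toy_x)
```

Reusing a single `predictors` vector also guards against one call accidentally listing a different set of columns from the others.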

Accuracy for validation set, neural network 2.

accuracy(unlist(valid_pred_2), valid_df_2$Price)
##                  ME     RMSE       MAE       MPE     MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653

Predicted values on validation set in original scale, neural network 2.

Scale back to original, neural network 2.

valid_pred_original_2 <- (valid_pred_2$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price) 

First 10 predicted values for Price (i.e. original scale), neural network 2.

head(valid_pred_original_2, 10)
##        [,1]
## 2  17390.26
## 3  16034.97
## 6  14838.47
## 10 14814.12
## 11 23151.53
## 12 22207.36
## 14 21151.47
## 17 20403.00
## 19 13305.82
## 22 18228.33

6.2 Neural network with 2 hidden layers, 5 nodes each

train_df_2_nn3 <- neuralnet(Price ~ ., data = train_df_2, 
                            linear.output = TRUE, hidden = c(5, 5))

train_df_2_nn3$weights
## [[1]]
## [[1]][[1]]
##             [,1]         [,2]        [,3]         [,4]        [,5]
##  [1,]  3.0312157  0.729781413  1.73296058  0.526279836  1.36754542
##  [2,]  2.2746104 -0.673971374 -1.28546167  1.865098445  1.92287126
##  [3,]  1.1811063  0.346708205 -0.42711216  0.230995172  0.81082054
##  [4,]  1.4600034 -0.666639146 -0.11534620 -0.488379522 -1.46673567
##  [5,]  0.4963332 -0.148531969 -0.30864712  0.913119881 -0.43280048
##  [6,]  0.6037612  0.922629396  0.41989183 -0.817610361 -0.36119700
##  [7,] -2.1785460  0.295010338  0.34242470  0.619580337 -1.12396894
##  [8,] -0.3401346 -2.693842026 -0.96621894 -0.526085172  0.04697226
##  [9,] -0.4016648 -0.702544216  1.09377040 -4.047313727 -0.23776733
## [10,] -0.5794138  0.001617147 -0.14640835  1.259149845  0.07297571
## [11,]  0.4966499  0.276719928  0.61804940  1.384689420 -0.19907089
## [12,] -0.1784131  0.897779012  0.38114986  0.472464309  0.16125459
## [13,]  0.0844264 -0.044855992  0.11172067 -0.363590483 -0.08471410
## [14,] -0.4507591  0.832312013  0.55384361 -1.399723115  0.08354863
## [15,] -0.3923960 -0.120520575 -0.20651297  0.701537274  0.08108022
## [16,] -0.8827099 -0.287324053  0.65404356 -9.826423125  2.64001952
## [17,]  0.3280078  0.093677822 -0.07680684  0.002000548  0.29415752
## 
## [[1]][[2]]
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.2529114 -1.0387237 -0.8954089  1.9735991  2.0651737
## [2,] -1.8123231  2.0692834  1.8043833 -0.3872393  0.3150387
## [3,]  1.3408346 -1.4238964 -1.0127844 -1.6857784 -0.2976973
## [4,] -0.5366908 -0.5939110 -0.8086246  0.3699647 -1.5691488
## [5,]  0.2024926 -0.1763935 -0.3596893  0.3285640  0.9199554
## [6,]  0.5325236  1.6950461 -0.2003131 -0.8837031  0.1232894
## 
## [[1]][[3]]
##            [,1]
## [1,]  0.7841711
## [2,]  1.1555838
## [3,] -1.1197686
## [4,] -0.5145833
## [5,]  2.4615883
## [6,] -1.5177305
plot(train_df_2_nn3, rep = "best")
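
The three weight matrices printed above have shapes 17 x 5, 6 x 5 and 6 x 1: each layer has one extra row for its bias term. The sketch below recomputes those shapes and the total number of weights from the layer sizes. `layer_sizes` is written out by hand here (16 inputs, two hidden layers of 5, one output) rather than read from the fitted object.

```r
# Layer sizes: 16 inputs, two hidden layers of 5 nodes, 1 output.
layer_sizes <- c(16, 5, 5, 1)

# Each weight matrix has (nodes in previous layer + 1 bias) rows
# and (nodes in next layer) columns.
n_in  <- head(layer_sizes, -1) + 1
n_out <- tail(layer_sizes, -1)

shapes <- cbind(rows = n_in, cols = n_out)
shapes

# Total number of weights the network has to estimate.
total_weights <- sum(n_in * n_out)
total_weights
```

The count grows quickly with each added layer or node, which is part of why larger networks are harder to interpret.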

Predictions on training set (normalised scale), neural network 3.

train_pred_3 <- compute(train_df_2_nn3, 
                        train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                       "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                       "Guarantee_Period", "Airco", "Automatic_airco",
                                       "CD_Player", "Powered_Windows", "Sport_Model", 
                                       "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) neural network 3.

head(train_pred_3$net.result, 10)
##            [,1]
## 638  0.13055802
## 608  0.13309814
## 907  0.18083207
## 1147 0.12413856
## 654  0.18842596
## 873  0.16762104
## 652  0.13408202
## 1074 0.05540572
## 131  0.41769986
## 1125 0.12202005

Accuracy for training set, neural network 3.

accuracy(unlist(train_pred_3), train_df_2$Price)
##                  ME      RMSE       MAE  MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf  Inf
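
The `-Inf` and `Inf` entries for MPE and MAPE are what percentage errors produce when an actual value is zero, which can happen on a min-max normalised scale when the lowest-priced row is in the set. A minimal sketch with made-up numbers follows, computing the percentage error by hand as 100 * (actual - predicted) / actual. The bounds `lo` and `hi` are hypothetical.

```r
# Made-up normalised values; the first actual value is exactly 0.
actual_norm <- c(0.00, 0.25, 0.50)
pred_norm   <- c(0.05, 0.20, 0.55)

# Percentage error divides by the actual value, so a 0 gives -Inf or Inf.
pct_error_norm <- 100 * (actual_norm - pred_norm) / actual_norm
mape_norm <- mean(abs(pct_error_norm))
mape_norm

# On the original scale (hypothetical bounds) no actual value is 0.
lo <- 1000
hi <- 5000
actual_orig <- actual_norm * (hi - lo) + lo
pred_orig   <- pred_norm * (hi - lo) + lo
mape_orig <- mean(abs(100 * (actual_orig - pred_orig) / actual_orig))
mape_orig
```

Percentage-based measures are therefore easier to read when computed on the original price scale.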

Predicted values on training set in original scale, neural network 3.

Scale back to original.

train_pred_original_3 <- (train_pred_3$net.result * (max(podracer3$Price) - 
                                                       min(podracer3$Price))) + min(podracer3$Price) 

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original_3, 10)
##           [,1]
## 638   8025.208
## 608   8096.713
## 907   9440.423
## 1147  7844.500
## 654   9654.191
## 873   9068.532
## 652   8124.409
## 1074  5909.671
## 131  16108.251
## 1125  7784.864

Check predictions on validation set, neural network 3.

valid_pred_3 <- compute(train_df_2_nn3, 
                        valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                       "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                       "Guarantee_Period", "Airco", "Automatic_airco",
                                       "CD_Player", "Powered_Windows", "Sport_Model", 
                                       "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set, neural network 3.

head(valid_pred_3$net.result, 10)
##         [,1]
## 2  0.4250526
## 3  0.3800759
## 6  0.3717043
## 10 0.3667990
## 11 0.6164454
## 12 0.6113172
## 14 0.5826810
## 17 0.5977831
## 19 0.3853424
## 22 0.4978584

Accuracy for validation set, neural network 3.

accuracy(unlist(valid_pred_3), valid_df_2$Price)
##                  ME     RMSE       MAE       MPE     MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653

Predicted values on validation set in original scale, neural network 3.

Scale back to original.

valid_pred_original_3 <- (valid_pred_3$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price) 

First 10 predicted values for Price (i.e. original scale), neural network 3.

head(valid_pred_original_3, 10)
##        [,1]
## 2  16315.23
## 3  15049.14
## 6  14813.47
## 10 14675.39
## 11 21702.94
## 12 21558.58
## 14 20752.47
## 17 21177.59
## 19 15197.39
## 22 18364.71

The accuracy measures (RMSE) did not change when the number of layers and nodes increased. A more complex model does not necessarily yield better results. In this case, keeping the number of nodes small makes the model easier to interpret.