Directions

Fit neural networks to predict the price of podracers.

1. Load Data

podracer <- read.csv("podracer.csv", header = TRUE)

head(podracer, 10)

##    Id                                   Model Price Age_08_04 Mfg_Month
## 1   1 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13500        23        10
## 2   2 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13750        23        10
## 3   3 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 13950        24         9
## 4   4 Podracer 2.0 D4D HATCHB TERRA 2/3-Doors 14950        26         7
## 5   5   Podracer 2.0 D4D HATCHB SOL 2/3-Doors 13750        30         3
## 6   6   Podracer 2.0 D4D HATCHB SOL 2/3-Doors 12950        32         1
## 7   7 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 16900        27         6
## 8   8 Podracer 2.0 D4D 90 3DR TERRA 2/3-Doors 18600        30         3
## 9   9   Podracer 1800 T SPORT VVT I 2/3-Doors 21500        27         6
## 10 10   Podracer 1.9 D HATCHB TERRA 2/3-Doors 12950        23        10
##    Mfg_Year    KM Fuel_Type  HP Met_Color  Color Automatic   CC Doors Cylinders
## 1      2002 46986    Diesel  90         1   Blue         0 2000     3         4
## 2      2002 72937    Diesel  90         1 Silver         0 2000     3         4
## 3      2002 41711    Diesel  90         1   Blue         0 2000     3         4
## 4      2002 48000    Diesel  90         0  Black         0 2000     3         4
## 5      2002 38500    Diesel  90         0  Black         0 2000     3         4
## 6      2002 61000    Diesel  90         0  White         0 2000     3         4
## 7      2002 94612    Diesel  90         1   Grey         0 2000     3         4
## 8      2002 75889    Diesel  90         1   Grey         0 2000     3         4
## 9      2002 19700    Petrol 192         0    Red         0 1800     3         4
## 10     2002 71138    Diesel  69         0   Blue         0 1900     3         4
##    Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
## 1      5           210   1165             0               1                3
## 2      5           210   1165             0               1                3
## 3      5           210   1165             1               1                3
## 4      5           210   1165             1               1                3
## 5      5           210   1170             1               1                3
## 6      5           210   1170             0               1                3
## 7      5           210   1245             0               1                3
## 8      5           210   1245             1               1                3
## 9      5           100   1185             0               1                3
## 10     5           185   1105             0               1                3
##    ABS Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player
## 1    1        1        1     0               0             1         0
## 2    1        1        1     1               0             1         1
## 3    1        1        1     0               0             1         0
## 4    1        1        1     0               0             1         0
## 5    1        1        1     1               0             1         0
## 6    1        1        1     1               0             1         0
## 7    1        1        1     1               0             1         0
## 8    1        1        1     1               0             1         1
## 9    1        1        0     1               0             0         0
## 10   1        1        1     1               0             1         0
##    Central_Lock Powered_Windows Power_Steering Radio Mistlamps Sport_Model
## 1             1               1              1     0         0           0
## 2             1               0              1     0         0           0
## 3             0               0              1     0         0           0
## 4             0               0              1     0         0           0
## 5             1               1              1     0         1           0
## 6             1               1              1     0         1           0
## 7             1               1              1     0         0           1
## 8             1               1              1     0         0           0
## 9             1               1              1     1         0           0
## 10            0               0              1     0         0           0
##    Backseat_Divider Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1                 1            0              0                 0       0
## 2                 1            0              0                 0       0
## 3                 1            0              0                 0       0
## 4                 1            0              0                 0       0
## 5                 1            0              0                 0       0
## 6                 1            0              0                 0       0
## 7                 1            0              0                 0       0
## 8                 1            0              0                 0       0
## 9                 0            1              1                 0       0
## 10                1            0              0                 0       0

nrow(podracer)

## [1] 1436

names(podracer)

##  [1] "Id"                "Model"             "Price"            
##  [4] "Age_08_04"         "Mfg_Month"         "Mfg_Year"         
##  [7] "KM"                "Fuel_Type"         "HP"               
## [10] "Met_Color"         "Color"             "Automatic"        
## [13] "CC"                "Doors"             "Cylinders"        
## [16] "Gears"             "Quarterly_Tax"     "Weight"           
## [19] "Mfr_Guarantee"     "BOVAG_Guarantee"   "Guarantee_Period" 
## [22] "ABS"               "Airbag_1"          "Airbag_2"         
## [25] "Airco"             "Automatic_airco"   "Boardcomputer"    
## [28] "CD_Player"         "Central_Lock"      "Powered_Windows"  
## [31] "Power_Steering"    "Radio"             "Mistlamps"        
## [34] "Sport_Model"       "Backseat_Divider"  "Metallic_Rim"     
## [37] "Radio_cassette"    "Parking_Assistant" "Tow_Bar"

str(podracer)

## 'data.frame':    1436 obs. of  39 variables:
##  $ Id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Model            : Factor w/ 319 levels "Podracer","Podracer ! 1.6-16v vvt-i sol airco sedan 4/5-Doors",..: 276 276 276 276 275 275 269 269 257 246 ...
##  $ Price            : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
##  $ Age_08_04        : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ Mfg_Month        : int  10 10 9 7 3 1 6 3 6 10 ...
##  $ Mfg_Year         : int  2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 ...
##  $ KM               : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type        : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ HP               : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Met_Color        : int  1 1 1 0 0 0 1 1 0 0 ...
##  $ Color            : Factor w/ 10 levels "Beige","Black",..: 3 7 3 2 2 9 5 5 6 3 ...
##  $ Automatic        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CC               : int  2000 2000 2000 2000 2000 2000 2000 2000 1800 1900 ...
##  $ Doors            : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Cylinders        : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ Gears            : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Quarterly_Tax    : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Weight           : int  1165 1165 1165 1165 1170 1170 1245 1245 1185 1105 ...
##  $ Mfr_Guarantee    : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ BOVAG_Guarantee  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Guarantee_Period : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ ABS              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_1         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Airbag_2         : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Airco            : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Boardcomputer    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ CD_Player        : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Central_Lock     : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ Powered_Windows  : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Power_Steering   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Radio            : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Mistlamps        : int  0 0 0 0 1 1 0 0 0 0 ...
##  $ Sport_Model      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Backseat_Divider : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ Metallic_Rim     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Radio_cassette   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ Parking_Assistant: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tow_Bar          : int  0 0 0 0 0 0 0 0 0 0 ...

2. Preprocessing

2.1 Filter for required variables

podracer2 <- podracer[,c(4, 7:9, 12, 14, 17, 19, 21, 25:26, 28, 30, 34, 39, 3)]
head(podracer2)

##   Age_08_04    KM Fuel_Type HP Automatic Doors Quarterly_Tax Mfr_Guarantee
## 1        23 46986    Diesel 90         0     3           210             0
## 2        23 72937    Diesel 90         0     3           210             0
## 3        24 41711    Diesel 90         0     3           210             1
## 4        26 48000    Diesel 90         0     3           210             1
## 5        30 38500    Diesel 90         0     3           210             1
## 6        32 61000    Diesel 90         0     3           210             0
##   Guarantee_Period Airco Automatic_airco CD_Player Powered_Windows Sport_Model
## 1                3     0               0         0               1           0
## 2                3     1               0         1               0           0
## 3                3     0               0         0               0           0
## 4                3     0               0         0               0           0
## 5                3     1               0         0               1           0
## 6                3     1               0         0               1           0
##   Tow_Bar Price
## 1       0 13500
## 2       0 13750
## 3       0 13950
## 4       0 14950
## 5       0 13750
## 6       0 12950

nrow(podracer2)

## [1] 1436

names(podracer2)

##  [1] "Age_08_04"        "KM"               "Fuel_Type"        "HP"              
##  [5] "Automatic"        "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"   
##  [9] "Guarantee_Period" "Airco"            "Automatic_airco"  "CD_Player"       
## [13] "Powered_Windows"  "Sport_Model"      "Tow_Bar"          "Price"

str(podracer2)

## 'data.frame':    1436 obs. of  16 variables:
##  $ Age_08_04       : int  23 23 24 26 30 32 27 30 27 23 ...
##  $ KM              : int  46986 72937 41711 48000 38500 61000 94612 75889 19700 71138 ...
##  $ Fuel_Type       : Factor w/ 3 levels "CNG","Diesel",..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ HP              : int  90 90 90 90 90 90 90 90 192 69 ...
##  $ Automatic       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Doors           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Quarterly_Tax   : int  210 210 210 210 210 210 210 210 100 185 ...
##  $ Mfr_Guarantee   : int  0 0 1 1 1 0 0 1 0 0 ...
##  $ Guarantee_Period: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Airco           : int  0 1 0 0 1 1 1 1 1 1 ...
##  $ Automatic_airco : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CD_Player       : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Powered_Windows : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ Sport_Model     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Tow_Bar         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Price           : int  13500 13750 13950 14950 13750 12950 16900 18600 21500 12950 ...
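
Positional indices break silently if the column order of the CSV ever changes. As a rough sketch, the same 16 columns can be selected by name instead. The one-row data frame `toy` below is a hypothetical stand-in that only mimics the column names printed by names(podracer); it is not the real data.

```r
# Column names as printed by names(podracer) above.
all_names <- c("Id", "Model", "Price", "Age_08_04", "Mfg_Month", "Mfg_Year",
               "KM", "Fuel_Type", "HP", "Met_Color", "Color", "Automatic",
               "CC", "Doors", "Cylinders", "Gears", "Quarterly_Tax", "Weight",
               "Mfr_Guarantee", "BOVAG_Guarantee", "Guarantee_Period",
               "ABS", "Airbag_1", "Airbag_2", "Airco", "Automatic_airco",
               "Boardcomputer", "CD_Player", "Central_Lock", "Powered_Windows",
               "Power_Steering", "Radio", "Mistlamps", "Sport_Model",
               "Backseat_Divider", "Metallic_Rim", "Radio_cassette",
               "Parking_Assistant", "Tow_Bar")

# Hypothetical one-row stand-in for podracer (values are placeholders).
toy <- as.data.frame(setNames(as.list(seq_along(all_names)), all_names))

# Predictors first, target (Price) last, as in podracer2.
keep <- c("Age_08_04", "KM", "Fuel_Type", "HP", "Automatic", "Doors",
          "Quarterly_Tax", "Mfr_Guarantee", "Guarantee_Period", "Airco",
          "Automatic_airco", "CD_Player", "Powered_Windows", "Sport_Model",
          "Tow_Bar", "Price")

by_name  <- toy[, keep]
by_index <- toy[, c(4, 7:9, 12, 14, 17, 19, 21, 25:26, 28, 30, 34, 39, 3)]

# Both selections pick the same columns in the same order.
identical(names(by_name), names(by_index))
```

With the real data this would read podracer[, keep].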

2.2 Create dummy variables

table(podracer2$Fuel_Type)

## 
##    CNG Diesel Petrol 
##     17    155   1264

podracer2$Fuel_Type_CNG <- ifelse(podracer2$Fuel_Type == "CNG", 1 , 0)
podracer2$Fuel_Type_Diesel <- ifelse(podracer2$Fuel_Type == "Diesel",
                                   1, 0)

names(podracer2)

##  [1] "Age_08_04"        "KM"               "Fuel_Type"        "HP"              
##  [5] "Automatic"        "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"   
##  [9] "Guarantee_Period" "Airco"            "Automatic_airco"  "CD_Player"       
## [13] "Powered_Windows"  "Sport_Model"      "Tow_Bar"          "Price"           
## [17] "Fuel_Type_CNG"    "Fuel_Type_Diesel"

podracer3 <- podracer2[, -c(3)]
names(podracer3)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"
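
The two ifelse() calls encode a three-level factor with Petrol as the implicit reference level. A possible alternative, sketched here on a hypothetical four-row data frame, is to let model.matrix() build the same 0/1 columns. The column names it produces (Fuel_TypeCNG, Fuel_TypeDiesel) differ from the ones used in this document.

```r
# Hypothetical stand-in for podracer2$Fuel_Type.
toy <- data.frame(Fuel_Type = factor(c("CNG", "Diesel", "Petrol", "Petrol")))

# Make Petrol the reference level so it gets no dummy column.
toy$Fuel_Type <- relevel(toy$Fuel_Type, ref = "Petrol")

# model.matrix() expands the factor; drop the intercept column.
dummies <- model.matrix(~ Fuel_Type, data = toy)[, -1, drop = FALSE]
colnames(dummies)
dummies
```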

3. Training Validation Split

Using our favourite seed :-)

set.seed(666)

train_index <- sample(1:nrow(podracer3), 0.6 * nrow(podracer3))
valid_index <- setdiff(1:nrow(podracer3), train_index)

train_df <- podracer3[train_index, ]
valid_df <- podracer3[valid_index, ]
nrow(train_df)

## [1] 861

nrow(valid_df)

## [1] 575
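
The split sizes follow from the arithmetic: 0.6 * 1436 = 861.6, and sample() truncates a fractional size, leaving 861 training rows and 575 validation rows. The sketch below makes the truncation explicit with floor() and checks that the two index sets do not overlap. It uses only the row count, not the data.

```r
n <- 1436                       # nrow(podracer3) as reported above
set.seed(666)

# floor() makes the truncation of 861.6 to 861 explicit.
train_index <- sample(seq_len(n), floor(0.6 * n))
valid_index <- setdiff(seq_len(n), train_index)

length(train_index)
length(valid_index)
length(intersect(train_index, valid_index))   # 0 means no overlap
```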

4. Normalise

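Normalise the data using ranges learned from the training set only, so that no information from the validation set leaks into the scaling.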

summary(podracer3[1:15])

##    Age_08_04           KM               HP          Automatic      
##  Min.   : 1.00   Min.   :     1   Min.   : 69.0   Min.   :0.00000  
##  1st Qu.:44.00   1st Qu.: 43000   1st Qu.: 90.0   1st Qu.:0.00000  
##  Median :61.00   Median : 63390   Median :110.0   Median :0.00000  
##  Mean   :55.95   Mean   : 68533   Mean   :101.5   Mean   :0.05571  
##  3rd Qu.:70.00   3rd Qu.: 87021   3rd Qu.:110.0   3rd Qu.:0.00000  
##  Max.   :80.00   Max.   :243000   Max.   :192.0   Max.   :1.00000  
##      Doors       Quarterly_Tax    Mfr_Guarantee    Guarantee_Period
##  Min.   :2.000   Min.   : 19.00   Min.   :0.0000   Min.   : 3.000  
##  1st Qu.:3.000   1st Qu.: 69.00   1st Qu.:0.0000   1st Qu.: 3.000  
##  Median :4.000   Median : 85.00   Median :0.0000   Median : 3.000  
##  Mean   :4.033   Mean   : 87.12   Mean   :0.4095   Mean   : 3.815  
##  3rd Qu.:5.000   3rd Qu.: 85.00   3rd Qu.:1.0000   3rd Qu.: 3.000  
##  Max.   :5.000   Max.   :283.00   Max.   :1.0000   Max.   :36.000  
##      Airco        Automatic_airco     CD_Player      Powered_Windows
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.0000   Median :0.00000   Median :0.0000   Median :1.000  
##  Mean   :0.5084   Mean   :0.05641   Mean   :0.2187   Mean   :0.562  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000  
##   Sport_Model        Tow_Bar           Price      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 4350  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 8450  
##  Median :0.0000   Median :0.0000   Median : 9900  
##  Mean   :0.3001   Mean   :0.2779   Mean   :10731  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:11950  
##  Max.   :1.0000   Max.   :1.0000   Max.   :32500

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

Estimate the min-max scaling parameters from the training set.

train_df_norm <- preProcess(train_df[1:15], method = c("range"))
train_df_norm

## Created from 861 samples and 15 variables
## 
## Pre-processing:
##   - ignored (0)
##   - re-scaling to [0, 1] (15)

Apply the fitted scaling to the training set.

train_df_transform <- predict(train_df_norm, train_df[1:15])
summary(train_df_transform)

##    Age_08_04            KM               HP           Automatic     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5263   1st Qu.:0.1850   1st Qu.:0.1707   1st Qu.:0.0000  
##  Median :0.7500   Median :0.2642   Median :0.3333   Median :0.0000  
##  Mean   :0.6911   Mean   :0.2885   Mean   :0.2643   Mean   :0.0511  
##  3rd Qu.:0.8684   3rd Qu.:0.3668   3rd Qu.:0.3333   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Doors        Quarterly_Tax    Mfr_Guarantee    Guarantee_Period 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.3333   1st Qu.:0.1894   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6667   Median :0.2500   Median :0.0000   Median :0.00000  
##  Mean   :0.6756   Mean   :0.2623   Mean   :0.3972   Mean   :0.02678  
##  3rd Qu.:1.0000   3rd Qu.:0.2500   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      Airco        Automatic_airco     CD_Player     Powered_Windows 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.000   Median :1.0000  
##  Mean   :0.4925   Mean   :0.04181   Mean   :0.216   Mean   :0.5563  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000   Max.   :1.0000  
##   Sport_Model        Tow_Bar           Price       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.1441  
##  Median :0.0000   Median :0.0000   Median :0.1922  
##  Mean   :0.2904   Mean   :0.2962   Mean   :0.2201  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.2667  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

names(train_df_transform)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"

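Apply the same training-set scaling to the validation set. Values can fall slightly outside [0, 1] (here the minimum of Age_08_04 is -0.039 and the minimum of Price is -0.0018) because some validation rows lie outside the range seen in training.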

valid_df_transform <- predict(train_df_norm, valid_df[1:15])
summary(valid_df_transform)

##    Age_08_04              KM               HP           Automatic      
##  Min.   :-0.03947   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 0.51316   1st Qu.:0.1630   1st Qu.:0.1382   1st Qu.:0.00000  
##  Median : 0.73684   Median :0.2549   Median :0.3333   Median :0.00000  
##  Mean   : 0.67217   Mean   :0.2723   Mean   :0.2642   Mean   :0.06261  
##  3rd Qu.: 0.86842   3rd Qu.:0.3459   3rd Qu.:0.3333   3rd Qu.:0.00000  
##  Max.   : 1.00000   Max.   :0.8961   Max.   :1.0000   Max.   :1.00000  
##      Doors        Quarterly_Tax    Mfr_Guarantee    Guarantee_Period 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.3333   1st Qu.:0.1894   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6667   Median :0.2500   Median :0.0000   Median :0.00000  
##  Mean   :0.6812   Mean   :0.2517   Mean   :0.4278   Mean   :0.02161  
##  3rd Qu.:1.0000   3rd Qu.:0.2500   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :0.8144   Max.   :1.0000   Max.   :1.00000  
##      Airco        Automatic_airco     CD_Player      Powered_Windows 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00000   Median :0.0000   Median :1.0000  
##  Mean   :0.5322   Mean   :0.07826   Mean   :0.2226   Mean   :0.5704  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   Sport_Model        Tow_Bar           Price          
##  Min.   :0.0000   Min.   :0.0000   Min.   :-0.001779  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 0.144128  
##  Median :0.0000   Median :0.0000   Median : 0.197509  
##  Mean   :0.3148   Mean   :0.2504   Mean   : 0.233138  
##  3rd Qu.:1.0000   3rd Qu.:0.5000   3rd Qu.: 0.268683  
##  Max.   :1.0000   Max.   :1.0000   Max.   : 0.732740

names(valid_df_transform)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"
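
What method = "range" does can be written out by hand: each column is shifted by the training minimum and divided by the training range. The sketch below uses made-up price vectors. The training minimum of 4400 is an assumption chosen for illustration, while 4350 and 32500 are the overall minimum and maximum of Price from the summary above.

```r
# Hypothetical prices; only the training vector defines the scaling.
train_price <- c(4400, 9900, 32500)
valid_price <- c(4350, 10000)

rng <- range(train_price)                      # min and max from training only
scale01 <- function(x) (x - rng[1]) / (rng[2] - rng[1])

scale01(train_price)   # spans exactly 0 to 1
scale01(valid_price)   # first value is negative: 4350 is below the training minimum
```

A validation value below the training minimum maps to a negative number, which is the pattern visible in the validation summary.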

Create full training and validation sets with dummy variables.

train_df_2 <- cbind(train_df_transform, train_df[16:17])
names(train_df_2)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"

str(train_df_2)

## 'data.frame':    861 obs. of  17 variables:
##  $ Age_08_04       : num  0.724 0.763 0.842 0.921 0.789 ...
##  $ KM              : num  0.501 0.755 0.242 0.419 0.473 ...
##  $ HP              : num  0.1382 0.0244 0.3333 0.3333 0.3333 ...
##  $ Automatic       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Doors           : num  1 1 0.333 0.333 1 ...
##  $ Quarterly_Tax   : num  0.189 0.629 0.25 0.189 0.25 ...
##  $ Mfr_Guarantee   : num  0 0 1 0 1 1 0 0 0 0 ...
##  $ Guarantee_Period: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Airco           : num  0 1 1 1 1 0 1 0 1 0 ...
##  $ Automatic_airco : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CD_Player       : num  0 0 0 0 0 0 1 0 1 0 ...
##  $ Powered_Windows : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ Sport_Model     : num  1 0 0 1 0 0 0 0 1 0 ...
##  $ Tow_Bar         : num  0 0 1 0 0 1 1 0 0 0 ...
##  $ Price           : num  0.128 0.11 0.19 0.089 0.198 ...
##  $ Fuel_Type_CNG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fuel_Type_Diesel: num  0 1 0 0 0 0 0 1 0 0 ...

valid_df_2 <- cbind(valid_df_transform, valid_df[16:17])
names(valid_df_2)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"

str(valid_df_2)

## 'data.frame':    575 obs. of  17 variables:
##  $ Age_08_04       : num  0.25 0.263 0.368 0.25 0.276 ...
##  $ KM              : num  0.3 0.172 0.251 0.293 0.129 ...
##  $ HP              : num  0.171 0.171 0.171 0 1 ...
##  $ Automatic       : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Doors           : num  0.333 0.333 0.333 0.333 0.333 ...
##  $ Quarterly_Tax   : num  0.723 0.723 0.723 0.629 0.307 ...
##  $ Mfr_Guarantee   : num  0 1 0 0 1 1 1 0 0 0 ...
##  $ Guarantee_Period: num  0 0 0 0 0.273 ...
##  $ Airco           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Automatic_airco : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ CD_Player       : num  1 0 0 0 1 0 1 1 1 0 ...
##  $ Powered_Windows : num  0 0 1 0 1 1 1 1 1 1 ...
##  $ Sport_Model     : num  0 0 0 0 0 1 1 0 0 1 ...
##  $ Tow_Bar         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Price           : num  0.333 0.34 0.304 0.304 0.589 ...
##  $ Fuel_Type_CNG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fuel_Type_Diesel: num  1 1 1 1 0 0 0 0 0 0 ...

5. Fit a Neural Network

library(neuralnet)
names(train_df_2)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"

train_df_2_nn1 <- neuralnet(Price ~ ., data = train_df_2, 
                         linear.output = T, hidden = 2)

train_df_2_nn1$weights

## [[1]]
## [[1]][[1]]
##              [,1]        [,2]
##  [1,]  1.15901800 -1.08563353
##  [2,]  1.82816650 -2.02230384
##  [3,]  0.86336022 -1.21240231
##  [4,] -0.58612529  0.94654369
##  [5,] -0.03594813  0.17996578
##  [6,] -0.06898549  0.05446269
##  [7,] -0.91984206  1.88943957
##  [8,] -0.26167277 -0.12012124
##  [9,] -1.57051151 -0.43253884
## [10,] -0.35223374 -0.13300483
## [11,] -0.49500291  0.08573496
## [12,] -0.10767580 -0.02481818
## [13,] -0.69788941 -0.39766470
## [14,] -0.13204115 -0.08298837
## [15,]  0.01340312 -0.02912053
## [16,]  0.31749721 -0.79525038
## [17,]  0.57706746 -0.19400431
## 
## [[1]][[2]]
##            [,1]
## [1,]  0.4727631
## [2,] -0.4302306
## [3,]  0.9083219
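
The shapes of the two weight matrices describe the network: 17 rows by 2 columns is a bias plus 16 inputs feeding 2 hidden nodes, and 3 rows by 1 column is a bias plus 2 hidden nodes feeding the output. As a minimal sketch, assuming neuralnet's default logistic activation in the hidden layer and no activation on the output (linear.output = T), a forward pass looks like this. The small matrices W1 and W2 are hypothetical, not the fitted weights.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# x: one row per observation; W1, W2: weight matrices with the bias in row 1.
forward <- function(x, W1, W2) {
  h <- sigmoid(cbind(1, x) %*% W1)   # hidden layer, logistic activation
  cbind(1, h) %*% W2                 # linear output layer
}

# Hypothetical weights: bias + 2 inputs -> 2 hidden nodes -> 1 output.
W1 <- matrix(c(0.1, 0.5, -0.3,
               0.2, -0.4, 0.6), nrow = 3)
W2 <- matrix(c(0.05, 0.7, -0.2), nrow = 3)

x <- matrix(c(0, 0), nrow = 1)       # one observation with both inputs at 0
out <- forward(x, W1, W2)
out
```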

plot(train_df_2_nn1, rep = "best")

names(train_df_2)

##  [1] "Age_08_04"        "KM"               "HP"               "Automatic"       
##  [5] "Doors"            "Quarterly_Tax"    "Mfr_Guarantee"    "Guarantee_Period"
##  [9] "Airco"            "Automatic_airco"  "CD_Player"        "Powered_Windows" 
## [13] "Sport_Model"      "Tow_Bar"          "Price"            "Fuel_Type_CNG"   
## [17] "Fuel_Type_Diesel"

5.1 Predictions on training set (normalised scale)

train_pred <- compute(train_df_2_nn1, 
                      train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale).

head(train_pred$net.result, 10)

##            [,1]
## 638  0.13388562
## 608  0.13360861
## 907  0.19005452
## 1147 0.12464834
## 654  0.18233738
## 873  0.16353465
## 652  0.15771844
## 1074 0.07737048
## 131  0.42074060
## 1125 0.11793014

5.2 Accuracy for training set

library(forecast)

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff

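accuracy(as.vector(train_pred$net.result), train_df_2$Price)

Only net.result is passed to accuracy(). Calling unlist() on the whole compute() result also flattens the neurons element, which here starts with a constant column of 1s, so the error is measured against 1 instead of the predictions: ME = mean(Price) - 1 = 0.2201 - 1 = -0.78 and RMSE = 0.79, whatever network was fitted.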
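
A toy list with the same shape as a compute() result shows what unlist() does to it. The list below is hand-built for illustration and assumes, as the ME of mean(Price) - 1 suggests, that the first stored matrix begins with a bias column of 1s. It is not produced by neuralnet.

```r
# Hand-built stand-in: neurons holds the inputs with a leading column of 1s,
# net.result holds the actual predictions.
res <- list(
  neurons    = list(cbind(1, c(0.2, 0.4))),
  net.result = matrix(c(0.25, 0.35), ncol = 1)
)

flat <- unlist(res)
unname(head(flat, 2))        # the leading 1s, not predictions
length(flat)                 # longer than the number of observations

preds <- as.vector(res$net.result)
preds                        # the values an accuracy measure should see
```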

Predicted values on training set in original scale.

Scale back to original.

train_pred_original <- (train_pred$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original, 10)

##           [,1]
## 638   8118.880
## 608   8111.082
## 907   9700.035
## 1147  7858.851
## 654   9482.797
## 873   8953.500
## 652   8789.774
## 1074  6527.979
## 131  16193.848
## 1125  7669.733
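
The back-transformation is the inverse of min-max scaling. The hypothetical helper below (unscale is not part of caret or neuralnet) reproduces the first value above from the overall Price range of 4350 to 32500. Because the scaling was learned from train_df, the matching inverse would strictly use min(train_df$Price) and max(train_df$Price). The two versions differ only when the validation set holds a more extreme price, which the negative validation minimum of Price suggests is the case here.

```r
# Inverse of (x - lo) / (hi - lo); lo and hi must be the values used to scale.
unscale <- function(x, lo, hi) x * (hi - lo) + lo

# First normalised training prediction, with the range used in the document.
first_pred <- unscale(0.13388562, lo = 4350, hi = 32500)
round(first_pred, 2)
```

With the real objects, unscale(train_pred$net.result, min(train_df$Price), max(train_df$Price)) would apply the training-set range.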

5.3 Predictions on validation set

valid_pred <- compute(train_df_2_nn1, 
                      valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set.

head(valid_pred$net.result, 10)

##         [,1]
## 2  0.4312241
## 3  0.4568309
## 6  0.3696080
## 10 0.3540338
## 11 0.6421790
## 12 0.6291678
## 14 0.5969761
## 17 0.5918448
## 19 0.3928083
## 22 0.4684290

5.4 Accuracy for validation set

accuracy(as.vector(valid_pred$net.result), valid_df_2$Price)

Only net.result is passed to accuracy(). Calling unlist() on the whole compute() result compares Price against a leading column of 1s instead of the predictions, giving ME = mean(Price) - 1 = 0.2331 - 1 = -0.77 and RMSE = 0.78 on the validation set whatever network was fitted, so those figures do not measure fit.

Predicted values on validation set in original scale.

Scale back to original.

valid_pred_original <- (valid_pred$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale).

head(valid_pred_original, 10)

##        [,1]
## 2  16488.96
## 3  17209.79
## 6  14754.47
## 10 14316.05
## 11 22427.34
## 12 22061.07
## 14 21154.88
## 17 21010.43
## 19 15407.55
## 22 17536.28

6. Different specifications

6.1 Neural network with 1 hidden layer, 5 nodes

train_df_2_nn2 <- neuralnet(Price ~ ., data = train_df_2, 
                            linear.output = T, hidden = 5)

train_df_2_nn2$weights

## [[1]]
## [[1]][[1]]
##              [,1]        [,2]       [,3]        [,4]        [,5]
##  [1,] -0.18957167 -0.89466610  1.0824754  2.42050649 -0.33127514
##  [2,]  1.96054540  1.37433577  0.8948274  4.40041547 -3.10673667
##  [3,]  1.13302729  1.82584835  0.4578356  0.42179024 -0.24636756
##  [4,]  0.48787670  0.08808572  0.4508432 -2.68437920  1.35055091
##  [5,] -0.39559804 -0.19464965  0.2720596 -0.66374101 -0.88375286
##  [6,]  0.23118535  0.61259820  0.9671399 -0.14550667  0.42122786
##  [7,]  0.71214886 -0.08396825  2.2444979 -1.98723018  1.29039226
##  [8,] -1.42525933  1.32430778  0.8110131  0.55885189  0.42205369
##  [9,]  0.07348997 -2.29756202 -0.2703548 -1.31054609 -1.73576189
## [10,] -0.24503662  0.18832419  0.5185287  0.05152996 -0.01311173
## [11,] -1.03896939 -0.58202995 -0.7346107  0.07856487  1.03093049
## [12,] -0.26664599 -0.49121719 -0.3720243  0.02349889 -0.34906382
## [13,]  0.12697065 -0.54179088 -0.8226099  0.43912863  0.63586422
## [14,]  0.08105145 -0.15883869 -0.2611583 -0.48826068  0.02000547
## [15,]  0.53209299  0.58647185  0.1358807 -0.63806182  0.89027493
## [16,] -1.43004532 33.76601654  1.2256482  0.66878117 -0.64258411
## [17,] -0.48939593 -0.28235172 -1.0951958 -0.53796605 -0.69528845
## 
## [[1]][[2]]
##            [,1]
## [1,]  0.4068726
## [2,] -0.1984329
## [3,] -0.3438218
## [4,]  0.7441377
## [5,] -0.5608212
## [6,]  0.1876774
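
The size of each specification can be read off the weight matrices: every layer contributes (inputs + 1 bias) times (nodes in the next layer) weights. The helper below is a hypothetical convenience for that count, assuming 16 predictors and one output as in this document.

```r
# Number of weights in a fully connected network, biases included.
n_weights <- function(n_in, hidden, n_out = 1) {
  sizes <- c(n_in, hidden, n_out)
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

n_weights(16, 2)         # hidden = 2:       17 * 2 + 3 * 1
n_weights(16, 5)         # hidden = 5:       17 * 5 + 6 * 1
n_weights(16, c(5, 5))   # hidden = c(5, 5): 17 * 5 + 6 * 5 + 6 * 1
```

More weights give the network more flexibility on 861 training rows, which is why each specification should be compared on the validation set.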

plot(train_df_2_nn2, rep = "best")

Predictions on training set (normalised scale), neural network 2.

train_pred_2 <- compute(train_df_2_nn2, 
                      train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) neural network 2.

head(train_pred_2$net.result, 10)

##            [,1]
## 638  0.13243667
## 608  0.13448536
## 907  0.19276265
## 1147 0.12628239
## 654  0.18202365
## 873  0.17223813
## 652  0.15082087
## 1074 0.03960327
## 131  0.43177621
## 1125 0.11421734

Accuracy for training set, neural network 2.

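accuracy(as.vector(train_pred_2$net.result), train_df_2$Price)

Only net.result is passed to accuracy(). unlist(train_pred_2) would compare Price against the leading column of 1s in the compute() result and return ME = -0.78 and RMSE = 0.79 regardless of the model, which is not a measure of fit.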

Predicted values on training set in original scale, neural network 2.

Scale back to original.

train_pred_original_2 <- (train_pred_2$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original_2, 10)

##           [,1]
## 638   8078.092
## 608   8135.763
## 907   9776.268
## 1147  7904.849
## 654   9473.966
## 873   9198.503
## 652   8595.608
## 1074  5464.832
## 131  16504.500
## 1125  7565.218

Check predictions on validation set, neural network 2.

valid_pred_2 <- compute(train_df_2_nn2, 
                      valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                     "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                     "Guarantee_Period", "Airco", "Automatic_airco",
                                     "CD_Player", "Powered_Windows", "Sport_Model", 
                                     "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set, neural network 2.

head(valid_pred_2$net.result, 10)

##         [,1]
## 2  0.4632419
## 3  0.4150965
## 6  0.3725922
## 10 0.3717272
## 11 0.6679051
## 12 0.6343647
## 14 0.5968551
## 17 0.5702664
## 19 0.3181462
## 22 0.4930135

Accuracy for validation set, neural network 2.

accuracy(unlist(valid_pred_2), valid_df_2$Price)

##                  ME     RMSE       MAE       MPE     MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653
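MPE and MAPE divide each error by the actual value, so they become unstable on a min-max-normalised target: the lowest price maps to 0 and other low prices map to values near 0. A sketch with made-up numbers illustrates the effect and why percentage errors are easier to read on the original scale. All values and the helper name mape are hypothetical.

```r
mape <- function(predicted, observed) {
  mean(abs((observed - predicted) / observed)) * 100
}

# Made-up prices and predictions on the original scale
price <- c(5000, 8000, 12000, 20000)
pred  <- c(5500, 7600, 12600, 19000)

# Min-max normalise both with the bounds of the actual prices
lo <- min(price)
hi <- max(price)
price_scaled <- (price - lo) / (hi - lo)
pred_scaled  <- (pred - lo) / (hi - lo)

mape_original <- mape(pred, price)                # finite, in percent of price
mape_scaled   <- mape(pred_scaled, price_scaled)  # Inf: the minimum price scales to 0
```

The same predictions give a readable percentage on the original scale and an infinite one on the normalised scale, so RMSE and MAE are the safer figures to compare while the target is normalised.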

Predicted values on validation set in original scale, neural network 2.

Scale back to original, neural network 2.

valid_pred_original_2 <- (valid_pred_2$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale).

head(valid_pred_original_2, 10)

##        [,1]
## 2  17390.26
## 3  16034.97
## 6  14838.47
## 10 14814.12
## 11 23151.53
## 12 22207.36
## 14 21151.47
## 17 20403.00
## 19 13305.82
## 22 18228.33

6.2 Neural network with 2 layers, 5 nodes each

train_df_2_nn3 <- neuralnet(Price ~ ., data = train_df_2, 
                            linear.output = TRUE, hidden = c(5, 5))

train_df_2_nn3$weights

## [[1]]
## [[1]][[1]]
##             [,1]         [,2]        [,3]         [,4]        [,5]
##  [1,]  3.0312157  0.729781413  1.73296058  0.526279836  1.36754542
##  [2,]  2.2746104 -0.673971374 -1.28546167  1.865098445  1.92287126
##  [3,]  1.1811063  0.346708205 -0.42711216  0.230995172  0.81082054
##  [4,]  1.4600034 -0.666639146 -0.11534620 -0.488379522 -1.46673567
##  [5,]  0.4963332 -0.148531969 -0.30864712  0.913119881 -0.43280048
##  [6,]  0.6037612  0.922629396  0.41989183 -0.817610361 -0.36119700
##  [7,] -2.1785460  0.295010338  0.34242470  0.619580337 -1.12396894
##  [8,] -0.3401346 -2.693842026 -0.96621894 -0.526085172  0.04697226
##  [9,] -0.4016648 -0.702544216  1.09377040 -4.047313727 -0.23776733
## [10,] -0.5794138  0.001617147 -0.14640835  1.259149845  0.07297571
## [11,]  0.4966499  0.276719928  0.61804940  1.384689420 -0.19907089
## [12,] -0.1784131  0.897779012  0.38114986  0.472464309  0.16125459
## [13,]  0.0844264 -0.044855992  0.11172067 -0.363590483 -0.08471410
## [14,] -0.4507591  0.832312013  0.55384361 -1.399723115  0.08354863
## [15,] -0.3923960 -0.120520575 -0.20651297  0.701537274  0.08108022
## [16,] -0.8827099 -0.287324053  0.65404356 -9.826423125  2.64001952
## [17,]  0.3280078  0.093677822 -0.07680684  0.002000548  0.29415752
## 
## [[1]][[2]]
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.2529114 -1.0387237 -0.8954089  1.9735991  2.0651737
## [2,] -1.8123231  2.0692834  1.8043833 -0.3872393  0.3150387
## [3,]  1.3408346 -1.4238964 -1.0127844 -1.6857784 -0.2976973
## [4,] -0.5366908 -0.5939110 -0.8086246  0.3699647 -1.5691488
## [5,]  0.2024926 -0.1763935 -0.3596893  0.3285640  0.9199554
## [6,]  0.5325236  1.6950461 -0.2003131 -0.8837031  0.1232894
## 
## [[1]][[3]]
##            [,1]
## [1,]  0.7841711
## [2,]  1.1555838
## [3,] -1.1197686
## [4,] -0.5145833
## [5,]  2.4615883
## [6,] -1.5177305
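Each matrix in weights holds the connections into one layer, with the first row acting as the bias term: 17 x 5 for the 16 inputs plus bias into the first hidden layer, 6 x 5 between the two hidden layers, and 6 x 1 into the output. As a sketch, the shapes and the total number of weights can be derived from the layer sizes alone. The helper count_weights is a hypothetical name and assumes a fully connected network with a single output.

```r
# Shapes of the weight matrices for a fully connected network,
# given the number of inputs, the hidden layer sizes and the number of outputs.
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  rows <- head(sizes, -1) + 1   # +1 for the bias (intercept) row
  cols <- tail(sizes, -1)
  list(shapes = cbind(rows, cols), total = sum(rows * cols))
}

nn3_weights <- count_weights(n_inputs = 16, hidden = c(5, 5))
nn3_weights$shapes
nn3_weights$total
```

For the single-layer network of section 6.1, count_weights(16, 5) gives shapes 17 x 5 and 6 x 1, i.e. 91 weights against 121 here.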

plot(train_df_2_nn3, rep = "best")

Predictions on training set (normalised scale), neural network 3.

train_pred_3 <- compute(train_df_2_nn3, 
                        train_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                       "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                       "Guarantee_Period", "Airco", "Automatic_airco",
                                       "CD_Player", "Powered_Windows", "Sport_Model", 
                                       "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale), neural network 3.

head(train_pred_3$net.result, 10)

##            [,1]
## 638  0.13055802
## 608  0.13309814
## 907  0.18083207
## 1147 0.12413856
## 654  0.18842596
## 873  0.16762104
## 652  0.13408202
## 1074 0.05540572
## 131  0.41769986
## 1125 0.12202005

Accuracy for training set, neural network 3.

accuracy(unlist(train_pred_3), train_df_2$Price)

##                  ME      RMSE       MAE  MPE MAPE
## Test set -0.7799406 0.7896206 0.7799406 -Inf  Inf

Predicted values on training set in original scale, neural network 3.

Scale back to original.

train_pred_original_3 <- (train_pred_3$net.result * (max(podracer3$Price) - 
                                                       min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale).

head(train_pred_original_3, 10)

##           [,1]
## 638   8025.208
## 608   8096.713
## 907   9440.423
## 1147  7844.500
## 654   9654.191
## 873   9068.532
## 652   8124.409
## 1074  5909.671
## 131  16108.251
## 1125  7784.864

Check predictions on validation set, neural network 3.

valid_pred_3 <- compute(train_df_2_nn3, 
                        valid_df_2[, c("Age_08_04", "KM", "HP", "Automatic", 
                                       "Doors", "Quarterly_Tax", "Mfr_Guarantee",
                                       "Guarantee_Period", "Airco", "Automatic_airco",
                                       "CD_Player", "Powered_Windows", "Sport_Model", 
                                       "Tow_Bar", "Fuel_Type_CNG", "Fuel_Type_Diesel")])

Check predicted values (normalised scale) on validation set, neural network 3.

head(valid_pred_3$net.result, 10)

##         [,1]
## 2  0.4250526
## 3  0.3800759
## 6  0.3717043
## 10 0.3667990
## 11 0.6164454
## 12 0.6113172
## 14 0.5826810
## 17 0.5977831
## 19 0.3853424
## 22 0.4978584

Accuracy for validation set, neural network 3.

accuracy(unlist(valid_pred_3), valid_df_2$Price)

##                  ME     RMSE       MAE       MPE     MAPE
## Test set -0.7668622 0.778975 0.7668622 -470.1392 665.9653

Predicted values on validation set in original scale, neural network 3.

Scale back to original.

valid_pred_original_3 <- (valid_pred_3$net.result * (max(podracer3$Price) - 
                                                   min(podracer3$Price))) + min(podracer3$Price)

First 10 predicted values for Price (i.e. original scale), neural network 3.

head(valid_pred_original_3, 10)

##        [,1]
## 2  16315.23
## 3  15049.14
## 6  14813.47
## 10 14675.39
## 11 21702.94
## 12 21558.58
## 14 20752.47
## 17 21177.59
## 19 15197.39
## 22 18364.71

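The accuracy figures (RMSE) did not change when the number of layers and nodes increased. A more complex model does not necessarily yield better results. In this case, keeping the number of nodes small also makes the network easier to interpret. Note, however, that the figures are identical to every printed digit even though the predictions differ between networks (for row 638, 8078.092 for network 2 against 8025.208 for network 3). This suggests the accuracy() calls are not comparing each network's net.result with Price, because unlist() flattens the entire list returned by compute() and not only the predictions, so the comparison should be treated with caution.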

Podracing

master yeoda

a long, long time ago in a galaxy far, far away
