Forces of the northern lights! Comparing means using ANOVA.

Data for demo

1 ANOVA

1.1 Load data

insurance <- read.csv("insurance_survey.csv", header = TRUE)
head(insurance)
##   ï..Age Gender       Education Marital.Status Years.Employed Satisfaction.
## 1     36      F    Some college       Divorced              4             4
## 2     55      F    Some college       Divorced              2             1
## 3     61      M Graduate degree        Widowed             26             3
## 4     65      F    Some college        Married              9             4
## 5     53      F Graduate degree        Married              6             4
## 6     50      F Graduate degree        Married             10             5
##   Premium_Deductible
## 1                  N
## 2                  N
## 3                  N
## 4                  N
## 5                  N
## 6                  N

1.2 Rename fields

Rename the fields so they’re easier to read (the ï..Age artifact in the first column name comes from a UTF-8 byte-order mark in the CSV).

names(insurance)[1:7] <- c("Age", "Gender", "Education", "Marital_Status",
                           "Years_Employed", "Satisfaction", "Premium_Deductible")
head(insurance)
##   Age Gender       Education Marital_Status Years_Employed Satisfaction
## 1  36      F    Some college       Divorced              4            4
## 2  55      F    Some college       Divorced              2            1
## 3  61      M Graduate degree        Widowed             26            3
## 4  65      F    Some college        Married              9            4
## 5  53      F Graduate degree        Married              6            4
## 6  50      F Graduate degree        Married             10            5
##   Premium_Deductible
## 1                  N
## 2                  N
## 3                  N
## 4                  N
## 5                  N
## 6                  N

1.3 One-way ANOVA

H0: The mean satisfaction scores across education levels are equal.
H1: Not all mean satisfaction scores across education levels are equal.

anova <- aov(Satisfaction ~ Education, data = insurance)

summary(anova)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Education    2  7.879   3.939   3.925 0.0356 *
## Residuals   21 21.079   1.004                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
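The F value in the table is just the ratio of the two mean squares (each Sum Sq divided by its Df); a quick arithmetic check:

```r
ms_education <- 7.879 / 2     # Mean Sq for Education
ms_residuals <- 21.079 / 21   # Mean Sq for Residuals
ms_education / ms_residuals   # ~ 3.925, matching the F value above
```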

The p-value (0.0356) is below 0.05, so we reject H0. Use a post hoc test (Tukey’s HSD) to determine which group means differ.

TukeyHSD(anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Satisfaction ~ Education, data = insurance)
## 
## $Education
##                                        diff        lwr         upr     p adj
## Graduate degree-College graduate  1.0555556 -0.1715336  2.28264475 0.1003252
## Some college-College graduate    -0.3015873 -1.5742334  0.97105876 0.8230559
## Some college-Graduate degree     -1.3571429 -2.6641246 -0.05016107 0.0409193

1.4 Assumptions

ANOVA assumes normally distributed errors and equal variances across groups. Test for normality with the Shapiro-Wilk test.

shapiro.test(insurance$Satisfaction)
## 
##  Shapiro-Wilk normality test
## 
## data:  insurance$Satisfaction
## W = 0.88598, p-value = 0.01099
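The p-value above (0.011) rejects normality at the 5% level, which is one motivation for the Kruskal-Wallis test in section 2. Strictly speaking, the normality assumption concerns the model residuals rather than the raw response, and ANOVA also assumes equal variances across groups. A sketch of both checks in base R, reusing the `anova` fit from above:

```r
# Normality of the model residuals (rather than the raw response)
shapiro.test(residuals(anova))

# Homogeneity of variance across education groups (Bartlett's test)
bartlett.test(Satisfaction ~ Education, data = insurance)
```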

1.5 Two-way ANOVA

When there is more than one variable categorising the data, we can use a two-way ANOVA.

head(insurance)
##   Age Gender       Education Marital_Status Years_Employed Satisfaction
## 1  36      F    Some college       Divorced              4            4
## 2  55      F    Some college       Divorced              2            1
## 3  61      M Graduate degree        Widowed             26            3
## 4  65      F    Some college        Married              9            4
## 5  53      F Graduate degree        Married              6            4
## 6  50      F Graduate degree        Married             10            5
##   Premium_Deductible
## 1                  N
## 2                  N
## 3                  N
## 4                  N
## 5                  N
## 6                  N
anova_2 <- aov(Satisfaction ~ Education + Marital_Status, 
               data = insurance)

In this case, adding marital status lowers the residual sum of squares (21.08 → 18.38), but the term itself is not significant (p ≈ 0.47), so it does not meaningfully improve the model.

summary(anova)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Education    2  7.879   3.939   3.925 0.0356 *
## Residuals   21 21.079   1.004                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova_2)
##                Df Sum Sq Mean Sq F value Pr(>F)  
## Education       2  7.879   3.939   3.858 0.0403 *
## Marital_Status  3  2.697   0.899   0.880 0.4697  
## Residuals      18 18.382   1.021                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Use a post hoc test to determine the mean differences.

TukeyHSD(anova_2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Satisfaction ~ Education + Marital_Status, data = insurance)
## 
## $Education
##                                        diff        lwr         upr     p adj
## Graduate degree-College graduate  1.0555556 -0.1976587  2.30876980 0.1078072
## Some college-College graduate    -0.3015873 -1.6013283  0.99815372 0.8260097
## Some college-Graduate degree     -1.3571429 -2.6919506 -0.02233509 0.0459262
## 
## $Marital_Status
##                        diff       lwr      upr     p adj
## Married-Divorced  0.1409897 -1.312049 1.594029 0.9925368
## Single-Divorced   0.5634921 -2.565223 3.692207 0.9558787
## Widowed-Divorced -1.4365079 -4.565223 1.692207 0.5760233
## Single-Married    0.4225023 -2.516414 3.361418 0.9766473
## Widowed-Married  -1.5774977 -4.516414 1.361418 0.4483875
## Widowed-Single   -2.0000000 -6.039154 2.039154 0.5156365

1.6 ANOVA with interaction variables

# Education * Marital_Status expands to
# Education + Marital_Status + Education:Marital_Status
anova_3 <- aov(Satisfaction ~ Education * Marital_Status, 
               data = insurance)


summary(anova_3)
##                          Df Sum Sq Mean Sq F value Pr(>F)  
## Education                 2  7.879   3.939   3.660 0.0477 *
## Marital_Status            3  2.697   0.899   0.835 0.4929  
## Education:Marital_Status  1  0.084   0.084   0.078 0.7830  
## Residuals                17 18.298   1.076                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

1.7 Best fit model

The model with the lowest AIC (Akaike information criterion) gives the best trade-off between fit and complexity. aictab() from AICcmodavg reports AICc, the small-sample-corrected version.

library(AICcmodavg)

models <- list(anova, anova_2, anova_3)
models_names <- c("one_way", "two_way", "interaction")

aictab(models, modnames = models_names)
## 
## Model selection based on AICc:
## 
##             K  AICc Delta_AICc AICcWt Cum.Wt     LL
## one_way     4 75.10       0.00   0.98   0.98 -32.50
## two_way     7 82.71       7.61   0.02   1.00 -30.85
## interaction 8 87.20      12.10   0.00   1.00 -30.80
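As a sanity check, AICc is AIC plus a small-sample correction: AICc = -2·LL + 2K + 2K(K + 1)/(n - K - 1). With n = 24 observations (residual Df 21 + Education Df 2 + 1), the one_way row reproduces:

```r
ll <- -32.50  # log-likelihood of the one-way model (from the table)
k  <- 4       # number of estimated parameters (K in the table)
n  <- 24      # observations
-2 * ll + 2 * k + 2 * k * (k + 1) / (n - k - 1)  # ~ 75.10
```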

2 Kruskal-Wallis

When the normality assumption is violated (as the Shapiro-Wilk test above suggested) or samples are small, the non-parametric Kruskal-Wallis rank sum test can be used in place of a one-way ANOVA. It compares groups on ranks rather than means.

kw <- kruskal.test(Satisfaction ~ Education, data = insurance)
kw
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Satisfaction by Education
## Kruskal-Wallis chi-squared = 6.543, df = 2, p-value = 0.03795

Use a post hoc test to determine which pairs of groups differ. Dunn’s test does this, and with kw = TRUE it also reports the Kruskal-Wallis statistic in one step.

library(dunn.test)
dunn.test(insurance$Satisfaction, insurance$Education, kw = TRUE,
          table = TRUE, alpha = 0.05)
##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 6.543, df = 2, p-value = 0.04
## 
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |   College    Graduate
## ---------+----------------------
## Graduate |  -2.150138
##          |    0.0158*
##          |
## Some col |   0.259146   2.271044
##          |     0.3978    0.0116*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2
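By default dunn.test applies no multiple-comparison adjustment (and its p-values are one-sided, hence the “Reject Ho if p <= alpha/2” note above). Its method argument applies a correction instead; for example, a Bonferroni adjustment on the same data:

```r
dunn.test(insurance$Satisfaction, insurance$Education, kw = TRUE,
          method = "bonferroni")
```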