Forces of the northern lights! Comparing means using ANOVA.
insurance <- read.csv("insurance_survey.csv", header = TRUE)
head(insurance)
## ï..Age Gender Education Marital.Status Years.Employed Satisfaction.
## 1 36 F Some college Divorced 4 4
## 2 55 F Some college Divorced 2 1
## 3 61 M Graduate degree Widowed 26 3
## 4 65 F Some college Married 9 4
## 5 53 F Graduate degree Married 6 4
## 6 50 F Graduate degree Married 10 5
## Premium_Deductible
## 1 N
## 2 N
## 3 N
## 4 N
## 5 N
## 6 N
Rename the fields so they're easier to read (the first column name was imported with an encoding artifact, ï..Age).
names(insurance)[1:7] <- c("Age", "Gender", "Education", "Marital_Status",
"Years_Employed", "Satisfaction", "Premium_Deductible")
head(insurance)
## Age Gender Education Marital_Status Years_Employed Satisfaction
## 1 36 F Some college Divorced 4 4
## 2 55 F Some college Divorced 2 1
## 3 61 M Graduate degree Widowed 26 3
## 4 65 F Some college Married 9 4
## 5 53 F Graduate degree Married 6 4
## 6 50 F Graduate degree Married 10 5
## Premium_Deductible
## 1 N
## 2 N
## 3 N
## 4 N
## 5 N
## 6 N
H0: Mean satisfaction is the same for every education level. H1: At least one education level has a different mean satisfaction.
anova <- aov(Satisfaction ~ Education, data = insurance) # note: this name masks base R's anova()
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 2 7.879 3.939 3.925 0.0356 *
## Residuals 21 21.079 1.004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
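Before moving to the post hoc comparisons, a quick boxplot can show how satisfaction varies across education levels. A minimal sketch, assuming the insurance data frame loaded above:

```r
# Visualise satisfaction by education level
# (assumes the insurance data frame created above is in the workspace)
boxplot(Satisfaction ~ Education, data = insurance,
        xlab = "Education", ylab = "Satisfaction",
        main = "Satisfaction by education level")
```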
Use a post hoc test (Tukey's HSD) to determine which group means differ.
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Satisfaction ~ Education, data = insurance)
##
## $Education
## diff lwr upr p adj
## Graduate degree-College graduate 1.0555556 -0.1715336 2.28264475 0.1003252
## Some college-College graduate -0.3015873 -1.5742334 0.97105876 0.8230559
## Some college-Graduate degree -1.3571429 -2.6641246 -0.05016107 0.0409193
Test the satisfaction scores for normality with the Shapiro-Wilk test.
shapiro.test(insurance$Satisfaction)
##
## Shapiro-Wilk normality test
##
## data: insurance$Satisfaction
## W = 0.88598, p-value = 0.01099
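The test above rejects normality of the raw scores (p ≈ 0.011). Strictly, ANOVA's normality assumption applies to the model residuals rather than the response itself, so it is worth checking those as well. A sketch, assuming the anova object fitted above:

```r
# Normality assumption applies to the model residuals
shapiro.test(residuals(anova))

# Q-Q plot of the residuals for a visual check
qqnorm(residuals(anova))
qqline(residuals(anova))
```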
When more than one categorical variable is available to group the data, we can use a two-way ANOVA.
head(insurance)
## Age Gender Education Marital_Status Years_Employed Satisfaction
## 1 36 F Some college Divorced 4 4
## 2 55 F Some college Divorced 2 1
## 3 61 M Graduate degree Widowed 26 3
## 4 65 F Some college Married 9 4
## 5 53 F Graduate degree Married 6 4
## 6 50 F Graduate degree Married 10 5
## Premium_Deductible
## 1 N
## 2 N
## 3 N
## 4 N
## 5 N
## 6 N
anova_2 <- aov(Satisfaction ~ Education + Marital_Status,
data = insurance)
In this case, adding marital status lowers the residual sum of squares, so the model fits the data slightly better; however, marital status itself is not significant.
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 2 7.879 3.939 3.925 0.0356 *
## Residuals 21 21.079 1.004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova_2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 2 7.879 3.939 3.858 0.0403 *
## Marital_Status 3 2.697 0.899 0.880 0.4697
## Residuals 18 18.382 1.021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use a post hoc test to determine which group means differ.
TukeyHSD(anova_2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Satisfaction ~ Education + Marital_Status, data = insurance)
##
## $Education
## diff lwr upr p adj
## Graduate degree-College graduate 1.0555556 -0.1976587 2.30876980 0.1078072
## Some college-College graduate -0.3015873 -1.6013283 0.99815372 0.8260097
## Some college-Graduate degree -1.3571429 -2.6919506 -0.02233509 0.0459262
##
## $Marital_Status
## diff lwr upr p adj
## Married-Divorced 0.1409897 -1.312049 1.594029 0.9925368
## Single-Divorced 0.5634921 -2.565223 3.692207 0.9558787
## Widowed-Divorced -1.4365079 -4.565223 1.692207 0.5760233
## Single-Married 0.4225023 -2.516414 3.361418 0.9766473
## Widowed-Married -1.5774977 -4.516414 1.361418 0.4483875
## Widowed-Single -2.0000000 -6.039154 2.039154 0.5156365
anova_3 <- aov(Satisfaction ~ Education * Marital_Status,
data = insurance)
summary(anova_3)
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 2 7.879 3.939 3.660 0.0477 *
## Marital_Status 3 2.697 0.899 0.835 0.4929
## Education:Marital_Status 1 0.084 0.084 0.078 0.7830
## Residuals 17 18.298 1.076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
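The interaction term is far from significant here. An interaction plot can show whether the education effect looks similar across marital statuses; roughly parallel lines suggest little interaction. A minimal sketch, assuming the insurance data frame from above:

```r
# Mean satisfaction per education level, one trace line per marital status
interaction.plot(insurance$Education, insurance$Marital_Status,
                 insurance$Satisfaction,
                 xlab = "Education", trace.label = "Marital status",
                 ylab = "Mean satisfaction")
```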
The model with the lowest AIC (Akaike information criterion) is the preferred one; aictab() below reports AICc, a small-sample correction of AIC.
library(AICcmodavg)
models <- list(anova, anova_2, anova_3)
models_names <- c("one_way", "two_way", "interaction")
aictab(models, modnames = models_names)
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## one_way 4 75.10 0.00 0.98 0.98 -32.50
## two_way 7 82.71 7.61 0.02 1.00 -30.85
## interaction 8 87.20 12.10 0.00 1.00 -30.80
For small samples, or when the normality assumption is doubtful (as the Shapiro-Wilk test suggested here), the nonparametric Kruskal-Wallis test can be used to test for differences between groups.
kw <- kruskal.test(Satisfaction ~ Education, data = insurance)
kw
##
## Kruskal-Wallis rank sum test
##
## data: Satisfaction by Education
## Kruskal-Wallis chi-squared = 6.543, df = 2, p-value = 0.03795
Use Dunn's test as the post hoc test; with kw = TRUE it also reports the Kruskal-Wallis statistic, so both can be obtained in one step.
library(dunn.test)
dunn.test(insurance$Satisfaction, insurance$Education, kw = TRUE,
table = TRUE, alpha = 0.05)
## Kruskal-Wallis rank sum test
##
## data: x and group
## Kruskal-Wallis chi-squared = 6.543, df = 2, p-value = 0.04
##
##
## Comparison of x by group
## (No adjustment)
## Col Mean-|
## Row Mean | College Graduate
## ---------+----------------------
## Graduate | -2.150138
## | 0.0158*
## |
## Some col | 0.259146 2.271044
## | 0.3978 0.0116*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
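The output above uses unadjusted p-values. With several pairwise comparisons, adjusting the p-values guards against inflated false positives; dunn.test supports this through its method argument. A sketch using the Bonferroni correction (other adjustments such as "bh" are also available):

```r
library(dunn.test)  # assumes the package is installed

# Dunn's test with Bonferroni-adjusted p-values
dunn.test(insurance$Satisfaction, insurance$Education,
          method = "bonferroni", kw = TRUE, alpha = 0.05)
```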