Download this R markdown as: R, Rmd.
If you use your personal computer, an additional package, mffSM, is needed. The package is not available from CRAN but can be downloaded from this website. Download the package and use the following command for the installation:
install.packages("mffSM_1.1.zip", repos = NULL)
# or, without the need to download the file first:
install.packages("https://www2.karlin.mff.cuni.cz/~komarek/vyuka/2021_22/nmsa407/R/mffSM_1.2.tar.gz", repos = NULL)
Libraries are loaded in the following way. When in doubt, use help() or ? to show the documentation:
library("mffSM")
library("colorspace")
help(package="mffSM")
In this lab session we will work with a data set on cars that were on the U.S. market in 2004:
data(Cars2004nh, package = "mffSM")
Quick overview and summary of the data:
dim(Cars2004nh)
## [1] 425 20
str(Cars2004nh)
## 'data.frame': 425 obs. of 20 variables:
## $ vname : chr "Chevrolet.Aveo.4dr" "Chevrolet.Aveo.LS.4dr.hatch" "Chevrolet.Cavalier.2dr" "Chevrolet.Cavalier.4dr" ...
## $ type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ drive : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price.retail: int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ price.dealer: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ price : num 11328 12194 14154 14347 15871 ...
## $ cons.city : num 8.4 8.4 9 9 9 8.1 8.1 9 8.7 9 ...
## $ cons.highway: num 6.9 6.9 6.4 6.4 6.4 6.5 6.5 7.1 6.5 7.1 ...
## $ consumption : num 7.65 7.65 7.7 7.7 7.7 7.3 7.3 8.05 7.6 8.05 ...
## $ engine.size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncylinder : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepower : int 103 103 140 140 140 132 132 130 110 130 ...
## $ weight : int 1075 1065 1187 1214 1187 1171 1191 1185 1182 1182 ...
## $ iweight : num 0.00093 0.000939 0.000842 0.000824 0.000842 ...
## $ lweight : num 6.98 6.97 7.08 7.1 7.08 ...
## $ wheel.base : int 249 249 264 264 264 267 267 262 262 262 ...
## $ length : int 424 389 465 465 465 442 442 427 427 427 ...
## $ width : int 168 168 175 173 175 170 170 170 170 170 ...
## $ ftype : Factor w/ 6 levels "personal","wagon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fdrive : Factor w/ 3 levels "front","rear",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Cars2004nh)
## vname type drive price.retail price.dealer price cons.city cons.highway consumption
## 1 Chevrolet.Aveo.4dr 1 1 11690 10965 11327.5 8.4 6.9 7.65
## 2 Chevrolet.Aveo.LS.4dr.hatch 1 1 12585 11802 12193.5 8.4 6.9 7.65
## 3 Chevrolet.Cavalier.2dr 1 1 14610 13697 14153.5 9.0 6.4 7.70
## 4 Chevrolet.Cavalier.4dr 1 1 14810 13884 14347.0 9.0 6.4 7.70
## 5 Chevrolet.Cavalier.LS.2dr 1 1 16385 15357 15871.0 9.0 6.4 7.70
## 6 Dodge.Neon.SE.4dr 1 1 13670 12849 13259.5 8.1 6.5 7.30
## engine.size ncylinder horsepower weight iweight lweight wheel.base length width ftype fdrive
## 1 1.6 4 103 1075 0.0009302326 6.980076 249 424 168 personal front
## 2 1.6 4 103 1065 0.0009389671 6.970730 249 389 168 personal front
## 3 2.2 4 140 1187 0.0008424600 7.079184 264 465 175 personal front
## 4 2.2 4 140 1214 0.0008237232 7.101676 264 465 173 personal front
## 5 2.2 4 140 1187 0.0008424600 7.079184 264 465 175 personal front
## 6 2.0 4 132 1171 0.0008539710 7.065613 267 442 170 personal front
summary(Cars2004nh)
## vname type drive price.retail price.dealer price
## Length:425 Min. :1.000 Min. :1.000 Min. : 10280 Min. : 9875 Min. : 10078
## Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 20370 1st Qu.: 18973 1st Qu.: 19601
## Mode :character Median :1.000 Median :1.000 Median : 27905 Median : 25672 Median : 26656
## Mean :2.219 Mean :1.692 Mean : 32866 Mean : 30096 Mean : 31481
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.: 39235 3rd Qu.: 35777 3rd Qu.: 37514
## Max. :6.000 Max. :3.000 Max. :192465 Max. :173560 Max. :183013
##
## cons.city cons.highway consumption engine.size ncylinder horsepower
## Min. : 6.20 Min. : 5.100 Min. : 5.65 Min. :1.300 Min. :-1.000 Min. :100.0
## 1st Qu.:11.20 1st Qu.: 8.100 1st Qu.: 9.65 1st Qu.:2.400 1st Qu.: 4.000 1st Qu.:165.0
## Median :12.40 Median : 9.000 Median :10.70 Median :3.000 Median : 6.000 Median :210.0
## Mean :12.36 Mean : 9.142 Mean :10.75 Mean :3.208 Mean : 5.791 Mean :216.8
## 3rd Qu.:13.80 3rd Qu.: 9.800 3rd Qu.:11.65 3rd Qu.:3.900 3rd Qu.: 6.000 3rd Qu.:255.0
## Max. :23.50 Max. :19.600 Max. :21.55 Max. :8.300 Max. :12.000 Max. :500.0
## NA's :14 NA's :14 NA's :14
## weight iweight lweight wheel.base length width
## Min. : 923 Min. :0.0003067 Min. :6.828 Min. :226.0 Min. :363.0 Min. :163.0
## 1st Qu.:1412 1st Qu.:0.0005542 1st Qu.:7.253 1st Qu.:262.0 1st Qu.:450.0 1st Qu.:175.0
## Median :1577 Median :0.0006341 Median :7.363 Median :272.0 Median :472.0 Median :180.0
## Mean :1626 Mean :0.0006412 Mean :7.373 Mean :274.9 Mean :470.6 Mean :181.1
## 3rd Qu.:1804 3rd Qu.:0.0007082 3rd Qu.:7.498 3rd Qu.:284.0 3rd Qu.:490.0 3rd Qu.:185.0
## Max. :3261 Max. :0.0010834 Max. :8.090 Max. :366.0 Max. :577.0 Max. :206.0
## NA's :2 NA's :2 NA's :2 NA's :2 NA's :26 NA's :28
## ftype fdrive
## personal:242 front:223
## wagon : 30 rear :110
## SUV : 60 4x4 : 92
## pickup : 24
## sport : 49
## minivan : 20
##
Exclude cars where the consumption is missing:
Cars2004nh <- Cars2004nh[!is.na(Cars2004nh$consumption),]
Create a working subsample:
set.seed(20250218)
idx <- c(sample((1:nrow(Cars2004nh))[Cars2004nh$ftype=="personal"], size=100),
sample((1:nrow(Cars2004nh))[Cars2004nh$ftype!="personal"], size=120))
Cars2004nh <- Cars2004nh[idx,]
Next, we derive some additional variables. First, a binary factor fdrive2, which distinguishes whether the drive is front or other (rear or 4x4):
Cars2004nh <- transform(Cars2004nh,
fdrive2 = factor(1*(fdrive == "front") +
2*(fdrive != "front"),
labels = c("front", "other")))
with(Cars2004nh, table(fdrive, fdrive2))
## fdrive2
## fdrive front other
## front 106 0
## rear 0 64
## 4x4 0 50
Next, we derive a binary indicator cons10 of having a consumption <= 10 l/100 km and then create a factor fcons10 with labels No (0) and Yes (1):
Cars2004nh <- transform(Cars2004nh,
cons10 = 1 * (consumption <= 10), # 0 or 1
fcons10 = factor(1 * (consumption <= 10),
labels = c("No", "Yes")))
summary(Cars2004nh)
## vname type drive price.retail price.dealer price
## Length:220 Min. :1.000 Min. :1.000 Min. : 10280 Min. : 9875 Min. : 10078
## Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 20334 1st Qu.: 18839 1st Qu.: 19572
## Mode :character Median :2.000 Median :2.000 Median : 27270 Median : 25142 Median : 26147
## Mean :2.618 Mean :1.745 Mean : 32819 Mean : 30012 Mean : 31416
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.: 40151 3rd Qu.: 35905 3rd Qu.: 38015
## Max. :6.000 Max. :3.000 Max. :126670 Max. :117854 Max. :122262
##
## cons.city cons.highway consumption engine.size ncylinder horsepower
## Min. : 6.70 Min. : 5.50 Min. : 6.10 Min. :1.300 Min. :-1.000 Min. :103.0
## 1st Qu.:11.20 1st Qu.: 8.10 1st Qu.: 9.65 1st Qu.:2.400 1st Qu.: 4.000 1st Qu.:165.0
## Median :12.40 Median : 9.00 Median :10.70 Median :3.200 Median : 6.000 Median :210.0
## Mean :12.48 Mean : 9.28 Mean :10.88 Mean :3.205 Mean : 5.764 Mean :214.7
## 3rd Qu.:13.80 3rd Qu.: 9.80 3rd Qu.:11.80 3rd Qu.:3.900 3rd Qu.: 6.000 3rd Qu.:250.0
## Max. :19.60 Max. :16.80 Max. :17.45 Max. :6.000 Max. :12.000 Max. :493.0
##
## weight iweight lweight wheel.base length width
## Min. : 923 Min. :0.0003595 Min. :6.828 Min. :226.0 Min. :363.0 Min. :165.0
## 1st Qu.:1384 1st Qu.:0.0005524 1st Qu.:7.233 1st Qu.:259.0 1st Qu.:443.5 1st Qu.:175.0
## Median :1584 Median :0.0006313 Median :7.368 Median :272.0 Median :466.0 Median :180.0
## Mean :1625 Mean :0.0006428 Mean :7.371 Mean :274.5 Mean :466.7 Mean :181.1
## 3rd Qu.:1810 3rd Qu.:0.0007224 3rd Qu.:7.501 3rd Qu.:287.0 3rd Qu.:493.0 3rd Qu.:185.0
## Max. :2782 Max. :0.0010834 Max. :7.931 Max. :366.0 Max. :556.0 Max. :203.0
## NA's :16 NA's :18
## ftype fdrive fdrive2 cons10 fcons10
## personal:100 front:106 front:106 Min. :0.00 No :143
## wagon : 17 rear : 64 other:114 1st Qu.:0.00 Yes: 77
## SUV : 36 4x4 : 50 Median :0.00
## pickup : 16 Mean :0.35
## sport : 36 3rd Qu.:1.00
## minivan : 15 Max. :1.00
##
In this section, we review the methods for two-sample problems, i.e., exploring the relationship between a continuous variable and a binary variable. In particular, we are going to explore the following question:
Does the consumption depend on the drive while distinguishing only two categories: front and other?
We start with an exploration phase. The most straightforward visualization tool for this problem is a boxplot, the default plot for numeric ~ factor:
plot(consumption ~ fdrive2, data = Cars2004nh)
With some effort you can make this plot more compelling:
COL <- c("yellowgreen", "goldenrod")
names(COL) <- levels(Cars2004nh$fdrive2)
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(consumption ~ fdrive2, data = Cars2004nh, col = COL,
xlab = "Drive of the vehicle", ylab = "Consumption [l/100 km]",
main = "Consumption vs. Front drive",
cex.main = 1.0,
xaxt = "n")
axis(1, at = 1:nlevels(Cars2004nh$fdrive2), labels = c("Front drive", "Other"))
Boxplots of consumption for two categories of drive.
Think about other ways of displaying the two distributions (one possibility is sketched below, after the numerical summaries). Here are the sample characteristics:
with(Cars2004nh, tapply(consumption, fdrive2, summary))
## $front
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.100 8.450 9.875 9.863 10.700 15.600
##
## $other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.30 10.70 11.45 11.82 12.45 17.45
with(Cars2004nh, tapply(consumption, fdrive2, sd, na.rm = TRUE))
## front other
## 1.848568 1.960938
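As mentioned above, other displays are possible. A minimal sketch of one option, overlaid kernel density estimates for the two drive groups (the colours match COL defined earlier):
dFront <- density(subset(Cars2004nh, fdrive2 == "front")$consumption)
dOther <- density(subset(Cars2004nh, fdrive2 == "other")$consumption)
par(mfrow = c(1, 1), mar = c(4, 4, 1.5, 0.5))
plot(dFront, col = "yellowgreen", lwd = 2, main = "Consumption by drive",
     xlab = "Consumption [l/100 km]", ylim = range(0, dFront$y, dOther$y))
lines(dOther, col = "goldenrod", lwd = 2)
legend("topright", legend = c("front", "other"), col = c("yellowgreen", "goldenrod"), lwd = 2)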
X = subset(Cars2004nh, fdrive2 == "front")$consumption
Y = subset(Cars2004nh, fdrive2 == "other")$consumption
For each of the following tests, recall its assumptions and what exactly is being tested:
Standard two-sample t-test (assuming equal variances and normality)
t.test(consumption ~ fdrive2, data = Cars2004nh, var.equal = TRUE)
##
## Two Sample t-test
##
## data: consumption by fdrive2
## t = -7.62, df = 218, p-value = 7.68e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
## -2.468658 -1.454050
## sample estimates:
## mean in group front mean in group other
## 9.863208 11.824561
mean(X) - mean(Y)
## [1] -1.961354
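As a check, the pooled two-sample t statistic can be reconstructed by hand from X and Y using the standard formula (a minimal sketch; it should reproduce the t value, degrees of freedom, and p-value above):
nX <- length(X); nY <- length(Y)
s2p <- ((nX - 1) * var(X) + (nY - 1) * var(Y)) / (nX + nY - 2)  # pooled variance estimate
(t_stat <- (mean(X) - mean(Y)) / sqrt(s2p * (1/nX + 1/nY)))     # test statistic
2 * pt(-abs(t_stat), df = nX + nY - 2)                          # two-sided p-value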
Welch two-sample t-test (not assuming equal variances)
t.test(consumption ~ fdrive2, data = Cars2004nh)
##
## Welch Two Sample t-test
##
## data: consumption by fdrive2
## t = -7.6364, df = 217.96, p-value = 6.949e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
## -2.467568 -1.455140
## sample estimates:
## mean in group front mean in group other
## 9.863208 11.824561
Wilcoxon two-sample rank test
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: consumption by fdrive2
## W = 2548, p-value = 1.273e-13
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -2.300047 -1.350055
## sample estimates:
## difference in location
## -1.799991
median(X) - median(Y)
## [1] -1.575
XYdiff = outer(X, Y, FUN="-")   # all pairwise differences X_i - Y_j
median(XYdiff)                  # Hodges-Lehmann estimate of the shift (cf. wilcox.test output above)
## [1] -1.8
Kolmogorov-Smirnov test
ks.test(consumption ~ fdrive2, data = Cars2004nh)
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: consumption by fdrive2
## D = 0.52185, p-value = 2.035e-13
## alternative hypothesis: two-sided
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(ecdf(X), main="ECDF")
plot(ecdf(Y), col=2, add=TRUE)
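The Kolmogorov-Smirnov statistic D is the maximum absolute difference between the two empirical distribution functions; it can be checked directly (a minimal sketch):
xy <- sort(c(X, Y))                    # pooled sample points
max(abs(ecdf(X)(xy) - ecdf(Y)(xy)))    # should reproduce D from ks.test above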
What if there are more than two groups?
Does the consumption depend on the drive (distinguishing front/rear/4x4)?
Let’s start with boxplots:
plot(consumption ~ fdrive, data = Cars2004nh)
with(Cars2004nh, tapply(consumption, fdrive, summary))
## $front
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.100 8.450 9.875 9.863 10.700 15.600
##
## $rear
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.30 10.55 11.25 11.25 11.80 15.25
##
## $`4x4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.95 10.70 11.60 12.56 14.79 17.45
with(Cars2004nh, tapply(consumption, fdrive, sd, na.rm = TRUE))
## front rear 4x4
## 1.848568 1.360218 2.344725
For each of the following tests, recall its assumptions and what exactly is being tested:
Standard ANOVA
summary(aov(consumption ~ fdrive, data = Cars2004nh))
## Df Sum Sq Mean Sq F value Pr(>F)
## fdrive 2 259.9 129.93 37.86 7.88e-15 ***
## Residuals 217 744.8 3.43
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA without assuming equal variances
oneway.test(consumption ~ fdrive, data = Cars2004nh)
##
## One-way analysis of means (not assuming equal variances)
##
## data: consumption and fdrive
## F = 30.988, num df = 2.00, denom df = 113.11, p-value = 1.856e-11
Kruskal-Wallis Rank Sum Test
kruskal.test(consumption ~ fdrive, data = Cars2004nh)
##
## Kruskal-Wallis rank sum test
##
## data: consumption by fdrive
## Kruskal-Wallis chi-squared = 58.147, df = 2, p-value = 2.363e-13
Alternatively, the same test statistic as in the standard ANOVA above can be obtained in several other ways:
summary(m0 <- lm(consumption ~ fdrive, data=Cars2004nh))
##
## Call:
## lm(formula = consumption ~ fdrive, data = Cars2004nh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9477 -1.3130 -0.0632 0.9118 5.7368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.8632 0.1799 54.814 < 2e-16 ***
## fdriverear 1.3844 0.2933 4.721 4.21e-06 ***
## fdrive4x4 2.6998 0.3178 8.494 3.18e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.853 on 217 degrees of freedom
## Multiple R-squared: 0.2587, Adjusted R-squared: 0.2518
## F-statistic: 37.86 on 2 and 217 DF, p-value: 7.876e-15
anova(m0)
## Analysis of Variance Table
##
## Response: consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## fdrive 2 259.87 129.933 37.858 7.876e-15 ***
## Residuals 217 744.76 3.432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
oneway.test(consumption ~ fdrive, data = Cars2004nh, var.equal=TRUE)
##
## One-way analysis of means
##
## data: consumption and fdrive
## F = 37.858, num df = 2, denom df = 217, p-value = 7.876e-15
Follow these steps to compute the sum of squares and degrees of freedom needed for ANOVA with equal variances:
X = subset(Cars2004nh, fdrive == "front")$consumption
Y = subset(Cars2004nh, fdrive == "rear")$consumption
Z = subset(Cars2004nh, fdrive == "4x4")$consumption
nX = length(X)
nY = length(Y)
nZ = length(Z)
mX = mean(X)
mY = mean(Y)
mZ = mean(Z)
m = mean(c(X,Y,Z))
nX*(m - mX)^2 + nY*(m - mY)^2 + nZ*(m - mZ)^2   # between-group sum of squares
## [1] 259.8652
sum((X - mX)^2) + sum((Y - mY)^2) + sum((Z - mZ)^2)   # within-group (residual) sum of squares
## [1] 744.7577
sum((c(X, Y, Z) - m)^2)   # total sum of squares
## [1] 1004.623
nX + nY + nZ   # total number of observations
## [1] 220
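From these quantities the F statistic and its p-value can be assembled by hand (a minimal sketch; compare with the ANOVA table above):
SSB <- nX*(m - mX)^2 + nY*(m - mY)^2 + nZ*(m - mZ)^2        # between groups, df = 3 - 1
SSW <- sum((X - mX)^2) + sum((Y - mY)^2) + sum((Z - mZ)^2)  # within groups, df = n - 3
n <- nX + nY + nZ
(Fstat <- (SSB / 2) / (SSW / (n - 3)))                      # F statistic (should match 37.858)
pf(Fstat, df1 = 2, df2 = n - 3, lower.tail = FALSE)         # p-value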
Now we review the basic methods for describing the relationship between two continuous variables. Why is a boxplot not a reasonable choice here? Use a scatterplot instead, the default plot for a pair of numeric variables.
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
Histograms to inspect the shape of marginal distributions:
### Histograms
par(mfrow = c(1, 2), mar = c(4,4,0.5,0.5))
hist(Cars2004nh$consumption, prob = TRUE, main = "", col = "seagreen",
xlab = "Consumption [l/100 km]")
hist(Cars2004nh$length, prob = TRUE, main = "", col = "seagreen",
xlab = "Length [cm]")
with(Cars2004nh, cor(consumption, length))
## [1] NA
with(Cars2004nh, cor(consumption, length, use = "complete.obs"))
## [1] 0.4379756
with(Cars2004nh, cor(consumption, length, method = "spearman", use = "complete.obs"))
## [1] 0.4440134
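Recall that Spearman's coefficient is simply the Pearson correlation of the ranks, computed on complete cases (a minimal check, which should reproduce the value above):
cc <- with(Cars2004nh, complete.cases(consumption, length))       # rows with both values observed
with(Cars2004nh, cor(rank(consumption[cc]), rank(length[cc])))    # Pearson correlation of ranks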
For each of the following tests, recall its assumptions and what exactly is being tested:
Test based on Pearson’s correlation coefficient
with(Cars2004nh, cor.test(consumption, length))
##
## Pearson's product-moment correlation
##
## data: consumption and length
## t = 6.9242, df = 202, p-value = 5.713e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3198482 0.5426954
## sample estimates:
## cor
## 0.4379756
Test based on Spearman’s correlation coefficient
with(Cars2004nh, cor.test(consumption, length, method = "spearman"))
##
## Spearman's rank correlation rho
##
## data: consumption and length
## S = 786671, p-value = 2.892e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4440134
How would you assess the relationship between the dependent variable and independent covariates?
Here are some graphical tools to examine the (non)linearity of the relationship:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
grid <- quantile(Cars2004nh$length, seq(0, 1, length = 20), na.rm = T)
xGrid <- NULL
yMean <- NULL
for(i in 2:length(grid)){
tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length < grid[i]]
lines(rep(mean(tempData, na.rm = T), 2) ~ c(grid[i - 1], grid[i]), col = "red")
points((grid[i - 1] + grid[i])/2, mean(tempData, na.rm = T),
pch = 23, bg = "red", cex = 1.2)
xGrid <- c(xGrid, (grid[i - 1] + grid[i])/2)
yMean <- c(yMean, mean(tempData, na.rm = T))
lines(rep(grid[i], 2), c(min(Cars2004nh$consumption), max(Cars2004nh$consumption)), lty = 3)
}
### which interpolates as:
lines(yMean ~ xGrid, col = "darkred", lwd = 3)
Fewer groups, without a for loop:
grid <- quantile(Cars2004nh$length, seq(0, 1, length = 11), na.rm = T)
Cars2004nh$flength <- cut(Cars2004nh$length, breaks = grid, right = FALSE)
xGrid <- (grid[-1] + grid[-length(grid)]) / 2
yMean <- tapply(Cars2004nh$consumption, Cars2004nh$flength, mean)
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, type = "n")
segments(x0 = grid, y0 = min(Cars2004nh$consumption), y1 = max(Cars2004nh$consumption), lty = 3)
points(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
segments(x0 = grid[-length(grid)], x1 = grid[-1], y0 = yMean, col = "red")
lines(x = xGrid, y = yMean, col = "darkred", lwd = 3)
points(x = xGrid, y = yMean, pch = 23, bg = "red", cex = 1.2)
A category for every value of length that appears in the data:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
grid <- as.numeric(names(table(Cars2004nh$length)))
xGrid <- NULL
yMean <- NULL
for (i in 2:(length(grid) - 1)){
tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length <= grid[i + 1]]
points(grid[i], mean(tempData, na.rm = T), pch = 23, bg = "red", cex = 1.2)
xGrid <- c(xGrid, grid[i])
yMean <- c(yMean, mean(tempData, na.rm = T))
}
### which now interpolates as:
lines(yMean ~ xGrid, col = "darkred", lwd = 3)
Local trend by a lowess curve:
par(mfrow = c(1,3), mar = c(4,4,1.5,0.5))
for(f in c(1/10, 1/5, 1/2)){
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange",
pch = 23, main = paste0("f = ", f))
tempData <- Cars2004nh[!is.na(Cars2004nh$length),]
lines(lowess(tempData$consumption ~ tempData$length, f = f), col = "red3", lwd = 3)
}
How should a straight line be fitted to the data?
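One possible answer, anticipating linear regression, is the ordinary least-squares line fitted by lm (a minimal sketch):
par(mfrow = c(1, 1), mar = c(4, 4, 0.5, 0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
abline(lm(consumption ~ length, data = Cars2004nh), col = "blue4", lwd = 2)  # least-squares line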
Contingency tables are used to capture the relationship between two categorical variables.
Does the fact that the car is economical (in the U.S. sense, consumption <= 10) depend on the drive (distinguishing only front and other)?
The default plot for factor ~ factor is the spineplot, which compares the proportions of the first factor within each category of the second factor separately:
par(mfrow = c(1,1), mar = c(4,4,1,2.5))
plot(fcons10 ~ fdrive2, data = Cars2004nh, col = diverge_hcl(2))
Contingency table:
(TAB2 <- with(Cars2004nh, table(fdrive2, fcons10)))
## fcons10
## fdrive2 No Yes
## front 46 60
## other 97 17
Proportion table (proportions within each row):
round(prop.table(TAB2, margin = 1), 2)
## fcons10
## fdrive2 No Yes
## front 0.43 0.57
## other 0.85 0.15
For each of the following tests, recall its assumptions and what exactly is being tested:
Chi-square test with continuity correction (more conservative)
chisq.test(TAB2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TAB2
## X-squared = 40.154, df = 1, p-value = 2.347e-10
Chi-square test without continuity correction
(chi = chisq.test(TAB2, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: TAB2
## X-squared = 41.966, df = 1, p-value = 9.286e-11
How to obtain the test results manually:
# Expected counts under the null hypothesis of independence
print(chi$expected)
## fcons10
## fdrive2 No Yes
## front 68.9 37.1
## other 74.1 39.9
n = sum(TAB2) # number of observations
R = rowSums(TAB2)/n # observed proportion in rows
C = colSums(TAB2)/n # observed proportion in columns
R%*%t(C)*n # expected counts under independence
## No Yes
## [1,] 68.9 37.1
## [2,] 74.1 39.9
(chi$observed - chi$expected)/sqrt(chi$expected) # residuals
## fcons10
## fdrive2 No Yes
## front -2.758836 3.759660
## other 2.660274 -3.625342
chi$residuals
## fcons10
## fdrive2 No Yes
## front -2.758836 3.759660
## other 2.660274 -3.625342
(ts <- sum(chi$residuals^2)) # test statistic
## [1] 41.96638
pchisq(ts, df=1, lower.tail=FALSE) # p-value
## [1] 9.285619e-11
Test based on the difference between two proportions. Notice the equivalence to chisq.test. What can you learn from the output?
(x <- TAB2[, "Yes"])
## front other
## 60 17
(n <- apply(TAB2, 1, sum))
## front other
## 106 114
prop.test(x, n)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: x out of n
## X-squared = 40.154, df = 1, p-value = 2.347e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.2930180 0.5408118
## sample estimates:
## prop 1 prop 2
## 0.5660377 0.1491228
prop.test(x, n, correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 41.966, df = 1, p-value = 9.286e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.3021210 0.5317089
## sample estimates:
## prop 1 prop 2
## 0.5660377 0.1491228
Reconstruction of the test statistic:
p2 = x/n # proportion estimates in groups
p = sum(x) / sum(n) # common proportion estimate
(ts <- (diff(p2) / sqrt(p*(1-p)*sum(1/n)))^2) # test statistic
## other
## 41.96638
pchisq(ts, df=1, lower.tail=FALSE)
## other
## 9.285619e-11
What is tested here?
prop.test(x, n, alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: x out of n
## X-squared = 40.154, df = 1, p-value = 1.174e-10
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.3114739 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.5660377 0.1491228
The remaining tasks revisit the two-sample problem, the more-sample problem, and contingency tables for the pairs (fcons10, fdrive) and (fdrive, ftype).
For practical purposes a one-sided alternative can be tested instead. Which alternative makes more sense? What exactly do we test here? Reconstruct the test statistics and the corresponding p-values.
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2)
##
## Welch Two Sample t-test
##
## data: consumption by fdrive2
## t = 0.15047, df = 217.96, p-value = 0.5597
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
## -Inf -1.537082
## sample estimates:
## mean in group front mean in group other
## 9.863208 11.824561
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: consumption by fdrive2
## t = 0.15014, df = 218, p-value = 0.5596
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
## -Inf -1.536167
## sample estimates:
## mean in group front mean in group other
## 9.863208 11.824561
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less",
mu = -2, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: consumption by fdrive2
## W = 6410, p-value = 0.7827
## alternative hypothesis: true location shift is less than -2
## 95 percent confidence interval:
## -Inf -1.449992
## sample estimates:
## difference in location
## -1.799991
Does the fact that the car is economical (in the U.S. sense, consumption <= 10) depend on the drive (while distinguishing front, rear or 4x4)?
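The table TAB3 is not constructed in the code shown above; judging from the expected counts that follow, it is presumably the contingency table of fdrive (rows) against fcons10 (columns). A minimal reconstruction:
(TAB3 <- with(Cars2004nh, table(fdrive, fcons10)))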
chisq.test(TAB3)
##
## Pearson's Chi-squared test
##
## data: TAB3
## X-squared = 42.298, df = 2, p-value = 6.532e-10
print(chisq.test(TAB3)$expected)
## fcons10
## fdrive No Yes
## front 68.9 37.1
## rear 41.6 22.4
## 4x4 32.5 17.5
print(chisq.test(TAB3)$residuals)
## fcons10
## fdrive No Yes
## front -2.758836 3.759660
## rear 1.767495 -2.408690
## 4x4 2.017233 -2.749026
print(chisq.test(TAB3)$stdres)
## fcons10
## fdrive No Yes
## front -6.478146 6.478146
## rear 3.547915 -3.547915
## 4x4 3.878904 -3.878904
Does the drive (front/rear/4x4) depend on the type of the car (personal/wagon/SUV/pickup/sport/minivan)?
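Similarly, TAB4 is presumably the contingency table of ftype (rows) against fdrive (columns):
(TAB4 <- with(Cars2004nh, table(ftype, fdrive)))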
chisq.test(TAB4)
##
## Pearson's Chi-squared test
##
## data: TAB4
## X-squared = 119.77, df = 10, p-value < 2.2e-16
print(chisq.test(TAB4)$expected)
## fdrive
## ftype front rear 4x4
## personal 48.181818 29.090909 22.727273
## wagon 8.190909 4.945455 3.863636
## SUV 17.345455 10.472727 8.181818
## pickup 7.709091 4.654545 3.636364
## sport 17.345455 10.472727 8.181818
## minivan 7.227273 4.363636 3.409091