Download this R Markdown as: R, Rmd.
If you work on your personal computer, the additional package mffSM is needed. It is not available from CRAN but can be downloaded from this website. Download the package and install it with the following command:
install.packages("mffSM_1.1.zip", repos = NULL)
# or install directly from the URL, without the need to download first
install.packages("https://www2.karlin.mff.cuni.cz/~komarek/vyuka/2021_22/nmsa407/R/mffSM_1.2.tar.gz", repos = NULL)
Libraries are loaded as follows. When in doubt, use help() or ? to show the documentation:
library("mffSM")
library("colorspace")
help(package="mffSM")
In this lab session we will work with a data set on vehicles that were on the U.S. market in 2004:
data(Cars2004nh, package = "mffSM")
Quick overview and summary of the data:
dim(Cars2004nh)
## [1] 425 20
str(Cars2004nh)
## 'data.frame': 425 obs. of 20 variables:
## $ vname : chr "Chevrolet.Aveo.4dr" "Chevrolet.Aveo.LS.4dr.hatch" "Chevrolet.Cavalier.2dr" "Chevrolet.Cavalier.4dr" ...
## $ type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ drive : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price.retail: int 11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
## $ price.dealer: int 10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
## $ price : num 11328 12194 14154 14347 15871 ...
## $ cons.city : num 8.4 8.4 9 9 9 8.1 8.1 9 8.7 9 ...
## $ cons.highway: num 6.9 6.9 6.4 6.4 6.4 6.5 6.5 7.1 6.5 7.1 ...
## $ consumption : num 7.65 7.65 7.7 7.7 7.7 7.3 7.3 8.05 7.6 8.05 ...
## $ engine.size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
## $ ncylinder : int 4 4 4 4 4 4 4 4 4 4 ...
## $ horsepower : int 103 103 140 140 140 132 132 130 110 130 ...
## $ weight : int 1075 1065 1187 1214 1187 1171 1191 1185 1182 1182 ...
## $ iweight : num 0.00093 0.000939 0.000842 0.000824 0.000842 ...
## $ lweight : num 6.98 6.97 7.08 7.1 7.08 ...
## $ wheel.base : int 249 249 264 264 264 267 267 262 262 262 ...
## $ length : int 424 389 465 465 465 442 442 427 427 427 ...
## $ width : int 168 168 175 173 175 170 170 170 170 170 ...
## $ ftype : Factor w/ 6 levels "personal","wagon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fdrive : Factor w/ 3 levels "front","rear",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Cars2004nh)
## vname type drive price.retail price.dealer price cons.city cons.highway consumption
## 1 Chevrolet.Aveo.4dr 1 1 11690 10965 11328 8.4 6.9 7.65
## 2 Chevrolet.Aveo.LS.4dr.hatch 1 1 12585 11802 12194 8.4 6.9 7.65
## 3 Chevrolet.Cavalier.2dr 1 1 14610 13697 14154 9.0 6.4 7.70
## 4 Chevrolet.Cavalier.4dr 1 1 14810 13884 14347 9.0 6.4 7.70
## 5 Chevrolet.Cavalier.LS.2dr 1 1 16385 15357 15871 9.0 6.4 7.70
## 6 Dodge.Neon.SE.4dr 1 1 13670 12849 13260 8.1 6.5 7.30
## engine.size ncylinder horsepower weight iweight lweight wheel.base length width ftype fdrive
## 1 1.6 4 103 1075 0.0009302 6.980 249 424 168 personal front
## 2 1.6 4 103 1065 0.0009390 6.971 249 389 168 personal front
## 3 2.2 4 140 1187 0.0008425 7.079 264 465 175 personal front
## 4 2.2 4 140 1214 0.0008237 7.102 264 465 173 personal front
## 5 2.2 4 140 1187 0.0008425 7.079 264 465 175 personal front
## 6 2.0 4 132 1171 0.0008540 7.066 267 442 170 personal front
summary(Cars2004nh)
## vname type drive price.retail price.dealer price
## Length:425 Min. :1.00 Min. :1.00 Min. : 10280 Min. : 9875 Min. : 10078
## Class :character 1st Qu.:1.00 1st Qu.:1.00 1st Qu.: 20370 1st Qu.: 18973 1st Qu.: 19600
## Mode :character Median :1.00 Median :1.00 Median : 27905 Median : 25672 Median : 26656
## Mean :2.22 Mean :1.69 Mean : 32866 Mean : 30096 Mean : 31481
## 3rd Qu.:3.00 3rd Qu.:2.00 3rd Qu.: 39235 3rd Qu.: 35777 3rd Qu.: 37514
## Max. :6.00 Max. :3.00 Max. :192465 Max. :173560 Max. :183012
##
## cons.city cons.highway consumption engine.size ncylinder horsepower weight
## Min. : 6.2 Min. : 5.10 Min. : 5.65 Min. :1.30 Min. :-1.00 Min. :100 Min. : 923
## 1st Qu.:11.2 1st Qu.: 8.10 1st Qu.: 9.65 1st Qu.:2.40 1st Qu.: 4.00 1st Qu.:165 1st Qu.:1412
## Median :12.4 Median : 9.00 Median :10.70 Median :3.00 Median : 6.00 Median :210 Median :1577
## Mean :12.4 Mean : 9.14 Mean :10.75 Mean :3.21 Mean : 5.79 Mean :217 Mean :1626
## 3rd Qu.:13.8 3rd Qu.: 9.80 3rd Qu.:11.65 3rd Qu.:3.90 3rd Qu.: 6.00 3rd Qu.:255 3rd Qu.:1804
## Max. :23.5 Max. :19.60 Max. :21.55 Max. :8.30 Max. :12.00 Max. :500 Max. :3261
## NA's :14 NA's :14 NA's :14 NA's :2
## iweight lweight wheel.base length width ftype fdrive
## Min. :0.0003 Min. :6.83 Min. :226 Min. :363 Min. :163 personal:242 front:223
## 1st Qu.:0.0006 1st Qu.:7.25 1st Qu.:262 1st Qu.:450 1st Qu.:175 wagon : 30 rear :110
## Median :0.0006 Median :7.36 Median :272 Median :472 Median :180 SUV : 60 4x4 : 92
## Mean :0.0006 Mean :7.37 Mean :275 Mean :471 Mean :181 pickup : 24
## 3rd Qu.:0.0007 3rd Qu.:7.50 3rd Qu.:284 3rd Qu.:490 3rd Qu.:185 sport : 49
## Max. :0.0011 Max. :8.09 Max. :366 Max. :577 Max. :206 minivan : 20
## NA's :2 NA's :2 NA's :2 NA's :26 NA's :28
Exclude cars where the consumption is missing:
Cars2004nh <- Cars2004nh[!is.na(Cars2004nh$consumption),]
Create a working subsample:
set.seed(20250218)
idx <- c(sample((1:nrow(Cars2004nh))[Cars2004nh$ftype=="personal"], size=100),
sample((1:nrow(Cars2004nh))[Cars2004nh$ftype!="personal"], size=120))
Cars2004nh <- Cars2004nh[idx,]
Next, we derive some additional variables. First, a binary factor fdrive2, which distinguishes whether the drive is front or other (rear or 4x4):
Cars2004nh <- transform(Cars2004nh,
                        fdrive2 = factor(1*(fdrive == "front") + 2*(fdrive != "front"),
                                         labels = c("front", "other")))
with(Cars2004nh, table(fdrive, fdrive2))
## fdrive2
## fdrive front other
## front 106 0
## rear 0 64
## 4x4 0 50
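Equivalently (just a sketch, with fdrive2b as an illustrative name), the same factor can be obtained with ifelse:
fdrive2b <- with(Cars2004nh,
                 factor(ifelse(fdrive == "front", "front", "other"),
                        levels = c("front", "other")))
table(Cars2004nh$fdrive2, fdrive2b)   # agrees with fdrive2 created above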
Next, we derive the binary indicator cons10 of the car having consumption <= 10 l/100 km, and then create the corresponding factor fcons10 with labels No (0) and Yes (1):
Cars2004nh <- transform(Cars2004nh,
                        cons10  = 1 * (consumption <= 10),   # 0 or 1
                        fcons10 = factor(1 * (consumption <= 10),
                                         labels = c("No", "Yes")))
summary(Cars2004nh)
## vname type drive price.retail price.dealer price
## Length:220 Min. :1.00 Min. :1.00 Min. : 10280 Min. : 9875 Min. : 10078
## Class :character 1st Qu.:1.00 1st Qu.:1.00 1st Qu.: 20334 1st Qu.: 18839 1st Qu.: 19572
## Mode :character Median :2.00 Median :2.00 Median : 27270 Median : 25142 Median : 26147
## Mean :2.62 Mean :1.75 Mean : 32819 Mean : 30012 Mean : 31416
## 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.: 40151 3rd Qu.: 35904 3rd Qu.: 38015
## Max. :6.00 Max. :3.00 Max. :126670 Max. :117854 Max. :122262
##
## cons.city cons.highway consumption engine.size ncylinder horsepower weight
## Min. : 6.7 Min. : 5.50 Min. : 6.10 Min. :1.30 Min. :-1.00 Min. :103 Min. : 923
## 1st Qu.:11.2 1st Qu.: 8.10 1st Qu.: 9.65 1st Qu.:2.40 1st Qu.: 4.00 1st Qu.:165 1st Qu.:1384
## Median :12.4 Median : 9.00 Median :10.70 Median :3.20 Median : 6.00 Median :210 Median :1584
## Mean :12.5 Mean : 9.28 Mean :10.88 Mean :3.21 Mean : 5.76 Mean :215 Mean :1625
## 3rd Qu.:13.8 3rd Qu.: 9.80 3rd Qu.:11.80 3rd Qu.:3.90 3rd Qu.: 6.00 3rd Qu.:250 3rd Qu.:1810
## Max. :19.6 Max. :16.80 Max. :17.45 Max. :6.00 Max. :12.00 Max. :493 Max. :2782
##
## iweight lweight wheel.base length width ftype fdrive
## Min. :0.000360 Min. :6.83 Min. :226 Min. :363 Min. :165 personal:100 front:106
## 1st Qu.:0.000552 1st Qu.:7.23 1st Qu.:259 1st Qu.:444 1st Qu.:175 wagon : 17 rear : 64
## Median :0.000631 Median :7.37 Median :272 Median :466 Median :180 SUV : 36 4x4 : 50
## Mean :0.000643 Mean :7.37 Mean :274 Mean :467 Mean :181 pickup : 16
## 3rd Qu.:0.000722 3rd Qu.:7.50 3rd Qu.:287 3rd Qu.:493 3rd Qu.:185 sport : 36
## Max. :0.001083 Max. :7.93 Max. :366 Max. :556 Max. :203 minivan : 15
## NA's :16 NA's :18
## fdrive2 cons10 fcons10
## front:106 Min. :0.00 No :143
## other:114 1st Qu.:0.00 Yes: 77
## Median :0.00
## Mean :0.35
## 3rd Qu.:1.00
## Max. :1.00
##
In this section, we review the methods for two-sample problems, i.e., exploring the relationship between a continuous variable and a binary variable. In particular, we are going to explore the following question:
Does the consumption depend on the drive when only two categories, front and other, are distinguished?
We start with an exploratory phase. The most straightforward visualization tool for this problem is a boxplot, which is the default plot for numeric ~ factor:
plot(consumption ~ fdrive2, data = Cars2004nh)
With some effort you can make this plot more compelling:
COL <- c("yellowgreen", "goldenrod")
names(COL) <- levels(Cars2004nh$fdrive2)
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(consumption ~ fdrive2, data = Cars2004nh, col = COL,
xlab = "Drive of the vehicle", ylab = "Consumption [l/100 km]",
main = "Consumption vs. Front drive",
cex.main = 1.0,
xaxt = "n")
axis(1, at = 1:nlevels(Cars2004nh$fdrive2), labels = c("Front drive", "Other"))
Boxplots of consumption for two categories of drive.
Think about other ways of displaying the two distributions (one alternative is sketched below). Here are the sample characteristics:
with(Cars2004nh, tapply(consumption, fdrive2, summary))
## $front
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.10 8.45 9.88 9.86 10.70 15.60
##
## $other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.3 10.7 11.4 11.8 12.4 17.4
with(Cars2004nh, tapply(consumption, fdrive2, sd, na.rm = TRUE))
## front other
## 1.849 1.961
X = subset(Cars2004nh, fdrive2 == "front")$consumption
Y = subset(Cars2004nh, fdrive2 == "other")$consumption
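As one alternative to the boxplot, here is a minimal sketch of overlaid kernel density estimates for the two groups, reusing the colours COL defined earlier:
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
dX <- density(X)   # front-drive cars
dY <- density(Y)   # other cars
plot(dX, col = COL["front"], lwd = 2, ylim = range(0, dX$y, dY$y),
     main = "Consumption by drive", xlab = "Consumption [l/100 km]")
lines(dY, col = COL["other"], lwd = 2)
legend("topright", legend = names(COL), col = COL, lwd = 2)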
For each of the following tests, refresh the null and alternative hypotheses, the required assumptions, and how to interpret the output:
Standard two-sample t-test (assuming equal variances and normality)
t.test(consumption ~ fdrive2, data = Cars2004nh, var.equal = TRUE)
##
## Two Sample t-test
##
## data: consumption by fdrive2
## t = -7.6, df = 218, p-value = 8e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
## -2.469 -1.454
## sample estimates:
## mean in group front mean in group other
## 9.863 11.825
mean(X) - mean(Y)
## [1] -1.961
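To refresh how the statistic is built, a minimal sketch reconstructing the pooled two-sample t-statistic, its degrees of freedom and two-sided p-value (it should reproduce the t.test output above):
nX <- length(X); nY <- length(Y)
sp2 <- ((nX - 1) * var(X) + (nY - 1) * var(Y)) / (nX + nY - 2)   # pooled variance estimate
(tstat <- (mean(X) - mean(Y)) / sqrt(sp2 * (1/nX + 1/nY)))       # t statistic
2 * pt(-abs(tstat), df = nX + nY - 2)                            # two-sided p-value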
Welch two-sample t-test (not assuming equal variances)
t.test(consumption ~ fdrive2, data = Cars2004nh)
##
## Welch Two Sample t-test
##
## data: consumption by fdrive2
## t = -7.6, df = 218, p-value = 7e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
## -2.468 -1.455
## sample estimates:
## mean in group front mean in group other
## 9.863 11.825
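Similarly, a sketch of the Welch statistic with the Welch-Satterthwaite degrees of freedom (compare with the output above):
vX <- var(X); vY <- var(Y)
nX <- length(X); nY <- length(Y)
se <- sqrt(vX/nX + vY/nY)                                       # unpooled standard error
(tstat <- (mean(X) - mean(Y)) / se)
(df <- se^4 / ((vX/nX)^2 / (nX - 1) + (vY/nY)^2 / (nY - 1)))    # Welch-Satterthwaite df
2 * pt(-abs(tstat), df)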
Wilcoxon two-sample rank test
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: consumption by fdrive2
## W = 2548, p-value = 1e-13
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -2.30 -1.35
## sample estimates:
## difference in location
## -1.8
median(X) - median(Y)
## [1] -1.575
XYdiff = outer(X, Y, FUN = "-")   # all pairwise differences X_i - Y_j
median(XYdiff)                    # the Hodges-Lehmann estimate reported by wilcox.test
## [1] -1.8
Kolmogorov-Smirnov test
ks.test(consumption ~ fdrive2, data = Cars2004nh)
##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: consumption by fdrive2
## D = 0.52, p-value = 2e-13
## alternative hypothesis: two-sided
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(ecdf(X), main="ECDF")
plot(ecdf(Y), col=2, add=TRUE)
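The Kolmogorov-Smirnov statistic D is the largest vertical distance between the two empirical distribution functions; a minimal sketch reconstructing it from the ECDFs plotted above:
Fx <- ecdf(X)
Fy <- ecdf(Y)
tgrid <- sort(c(X, Y))                    # jumps can only occur at observed values
(D <- max(abs(Fx(tgrid) - Fy(tgrid))))    # should match the D reported by ks.test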
What if there are more than two groups?
Does the consumption depend on the drive (distinguishing front/rear/4x4)?
Let’s start with boxplots:
plot(consumption ~ fdrive, data = Cars2004nh)
with(Cars2004nh, tapply(consumption, fdrive, summary))
## $front
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.10 8.45 9.88 9.86 10.70 15.60
##
## $rear
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.3 10.6 11.2 11.2 11.8 15.2
##
## $`4x4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.95 10.70 11.60 12.56 14.79 17.45
with(Cars2004nh, tapply(consumption, fdrive, sd, na.rm = TRUE))
## front rear 4x4
## 1.849 1.360 2.345
For each of the following tests, refresh the null and alternative hypotheses, the required assumptions, and how to interpret the output:
Standard ANOVA
summary(aov(consumption ~ fdrive, data = Cars2004nh))
## Df Sum Sq Mean Sq F value Pr(>F)
## fdrive 2 260 129.9 37.9 7.9e-15 ***
## Residuals 217 745 3.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA without assuming equal variances
oneway.test(consumption ~ fdrive, data = Cars2004nh)
##
## One-way analysis of means (not assuming equal variances)
##
## data: consumption and fdrive
## F = 31, num df = 2, denom df = 113, p-value = 2e-11
Kruskal-Wallis Rank Sum Test
kruskal.test(consumption ~ fdrive, data = Cars2004nh)
##
## Kruskal-Wallis rank sum test
##
## data: consumption by fdrive
## Kruskal-Wallis chi-squared = 58, df = 2, p-value = 2e-13
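A sketch of the Kruskal-Wallis statistic computed from the pooled ranks, including the correction for ties that kruskal.test applies:
r  <- rank(Cars2004nh$consumption)                 # ranks in the pooled sample
N  <- length(r)
Ri <- tapply(r, Cars2004nh$fdrive, sum)            # rank sums per group
ni <- tapply(r, Cars2004nh$fdrive, length)
H  <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)
ties <- table(Cars2004nh$consumption)
H  <- H / (1 - sum(ties^3 - ties) / (N^3 - N))     # tie correction
H
pchisq(H, df = nlevels(Cars2004nh$fdrive) - 1, lower.tail = FALSE)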
Alternatively, the same test statistic as in the standard ANOVA above can be obtained from a linear model fit:
summary(m0 <- lm(consumption ~ fdrive, data=Cars2004nh))
##
## Call:
## lm(formula = consumption ~ fdrive, data = Cars2004nh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.948 -1.313 -0.063 0.912 5.737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.863 0.180 54.81 < 2e-16 ***
## fdriverear 1.384 0.293 4.72 4.2e-06 ***
## fdrive4x4 2.700 0.318 8.49 3.2e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.85 on 217 degrees of freedom
## Multiple R-squared: 0.259, Adjusted R-squared: 0.252
## F-statistic: 37.9 on 2 and 217 DF, p-value: 7.88e-15
anova(m0)
## Analysis of Variance Table
##
## Response: consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## fdrive 2 260 129.9 37.9 7.9e-15 ***
## Residuals 217 745 3.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
oneway.test(consumption ~ fdrive, data = Cars2004nh, var.equal=TRUE)
##
## One-way analysis of means
##
## data: consumption and fdrive
## F = 38, num df = 2, denom df = 217, p-value = 8e-15
Follow these steps to compute the sums of squares and degrees of freedom needed for the equal-variance ANOVA:
X = subset(Cars2004nh, fdrive == "front")$consumption
Y = subset(Cars2004nh, fdrive == "rear")$consumption
Z = subset(Cars2004nh, fdrive == "4x4")$consumption
nX = length(X)
nY = length(Y)
nZ = length(Z)
mX = mean(X)
mY = mean(Y)
mZ = mean(Z)
m = mean(c(X,Y,Z))
nX*(m - mX)^2 + nY*(m - mY)^2 + nZ*(m - mZ)^2   # between-group sum of squares
## [1] 259.9
sum((X - mX)^2) + sum((Y - mY)^2) + sum((Z - mZ)^2)   # within-group (residual) sum of squares
## [1] 744.8
sum((c(X, Y, Z) - m)^2)   # total sum of squares
## [1] 1005
nX + nY + nZ   # total number of observations
## [1] 220
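Completing the computation (a sketch): the F statistic is the ratio of the between and within mean squares, compared with an F distribution:
SSB <- nX*(m - mX)^2 + nY*(m - mY)^2 + nZ*(m - mZ)^2        # between-group SS
SSW <- sum((X - mX)^2) + sum((Y - mY)^2) + sum((Z - mZ)^2)  # within-group SS
dfB <- 3 - 1
dfW <- nX + nY + nZ - 3
(Fstat <- (SSB / dfB) / (SSW / dfW))                        # compare with the ANOVA table above
pf(Fstat, dfB, dfW, lower.tail = FALSE)                     # p-value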
Now we review the basic methods for describing the relationship between two continuous variables. Why is a boxplot not a reasonable choice here? Use a scatterplot instead, which is the default plot for a pair of numeric variables.
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
Histograms to inspect the shape of marginal distributions:
### Histograms
par(mfrow = c(1, 2), mar = c(4,4,0.5,0.5))
hist(Cars2004nh$consumption, prob = TRUE, main = "", col = "seagreen",
xlab = "Consumption [l/100 km]")
hist(Cars2004nh$length, prob = TRUE, main = "", col = "seagreen",
xlab = "Length [cm]")
with(Cars2004nh, cor(consumption, length))   # NA because length has missing values
## [1] NA
with(Cars2004nh, cor(consumption, length, use = "complete.obs"))
## [1] 0.438
with(Cars2004nh, cor(consumption, length, method = "spearman", use = "complete.obs"))
## [1] 0.444
For each of the following tests, refresh the null and alternative hypotheses, the required assumptions, and how to interpret the output:
Test based on Pearson’s correlation coefficient
with(Cars2004nh, cor.test(consumption, length))
##
## Pearson's product-moment correlation
##
## data: consumption and length
## t = 6.9, df = 202, p-value = 6e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3198 0.5427
## sample estimates:
## cor
## 0.438
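A sketch reconstructing the statistic of the Pearson correlation test, t = r * sqrt(n - 2) / sqrt(1 - r^2) on n - 2 degrees of freedom, using complete pairs only:
ok <- with(Cars2004nh, complete.cases(consumption, length))
r  <- with(Cars2004nh[ok, ], cor(consumption, length))
n  <- sum(ok)                                # number of complete pairs
(tstat <- r * sqrt(n - 2) / sqrt(1 - r^2))   # compare with cor.test above
2 * pt(-abs(tstat), df = n - 2)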
Test based on Spearman’s correlation coefficient
with(Cars2004nh, cor.test(consumption, length, method = "spearman"))
##
## Spearman's rank correlation rho
##
## data: consumption and length
## S = 786671, p-value = 3e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.444
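Spearman's rho is simply the Pearson correlation of the ranks; a minimal sketch on the complete pairs:
ok <- with(Cars2004nh, complete.cases(consumption, length))
with(Cars2004nh[ok, ], cor(rank(consumption), rank(length)))   # equals the rho above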
How would you assess the relationship between the dependent variable and independent covariates?
Here are some graphical tools to examine the (non)linearity of the relationship:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
grid <- quantile(Cars2004nh$length, seq(0, 1, length = 20), na.rm = T)
xGrid <- NULL
yMean <- NULL
for(i in 2:length(grid)){
tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length < grid[i]]
lines(rep(mean(tempData, na.rm = T), 2) ~ c(grid[i - 1], grid[i]), col = "red")
points((grid[i - 1] + grid[i])/2, mean(tempData, na.rm = T),
pch = 23, bg = "red", cex = 1.2)
xGrid <- c(xGrid, (grid[i - 1] + grid[i])/2)
yMean <- c(yMean, mean(tempData, na.rm = T))
lines(rep(grid[i], 2), c(min(Cars2004nh$consumption), max(Cars2004nh$consumption)), lty = 3)
}
### which interpolates as:
lines(yMean ~ xGrid, col = "darkred", lwd = 3)
Fewer groups, without a for loop:
grid <- quantile(Cars2004nh$length, seq(0, 1, length = 11), na.rm = T)
Cars2004nh$flength <- cut(Cars2004nh$length, breaks = grid, right = FALSE)
xGrid <- (grid[-1] + grid[-length(grid)]) / 2
yMean <- tapply(Cars2004nh$consumption, Cars2004nh$flength, mean)
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, type = "n")
segments(x0 = grid, y0 = min(Cars2004nh$consumption), y1 = max(Cars2004nh$consumption), lty = 3)
points(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
segments(x0 = grid[-length(grid)], x1 = grid[-1], y0 = yMean, col = "red")
lines(x = xGrid, y = yMean, col = "darkred", lwd = 3)
points(x = xGrid, y = yMean, pch = 23, bg = "red", cex = 1.2)
A category for every value of length that appears in the data:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
grid <- as.numeric(names(table(Cars2004nh$length)))
xGrid <- NULL
yMean <- NULL
for (i in 2:(length(grid) - 1)){
tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length <= grid[i + 1]]
points(grid[i], mean(tempData, na.rm = T), pch = 23, bg = "red", cex = 1.2)
xGrid <- c(xGrid, grid[i])
yMean <- c(yMean, mean(tempData, na.rm = T))
}
### which now interpolates as:
lines(yMean ~ xGrid, col = "darkred", lwd = 3)
Local trend estimated by a lowess curve:
par(mfrow = c(1,3), mar = c(4,4,1.5,0.5))
for(f in c(1/10, 1/5, 1/2)){
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange",
pch = 23, main = paste0("f = ", f))
tempData <- Cars2004nh[!is.na(Cars2004nh$length),]
lines(lowess(tempData$consumption ~ tempData$length, f = f), col = "red3", lwd = 3)
}
How should a straight line be fitted to the data?
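One natural answer, previewed here as a sketch, is the ordinary least-squares line obtained with lm:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
fit <- lm(consumption ~ length, data = Cars2004nh)   # rows with missing length are dropped
abline(fit, col = "darkblue", lwd = 2)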
We now turn to contingency tables, which capture the relationship between two categorical variables.
Does the fact that the car is economical (in the U.S. sense, consumption <= 10) depend on the drive (distinguishing only front and other)?
The default plot for factor ~ factor is the spineplot, which compares the proportions of the first factor within each category of the second factor:
par(mfrow = c(1,1), mar = c(4,4,1,2.5))
plot(fcons10 ~ fdrive2, data = Cars2004nh, col = diverge_hcl(2))
Contingency table:
(TAB2 <- with(Cars2004nh, table(fdrive2, fcons10)))
## fcons10
## fdrive2 No Yes
## front 46 60
## other 97 17
Proportion table:
round(prop.table(TAB2, margin = 1), 2)
## fcons10
## fdrive2 No Yes
## front 0.43 0.57
## other 0.85 0.15
For each of the following tests, refresh the null and alternative hypotheses, the required assumptions, and how to interpret the output:
Chi-square test with continuity correction (more conservative)
chisq.test(TAB2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: TAB2
## X-squared = 40, df = 1, p-value = 2e-10
Chi-square test without continuity correction
(chi = chisq.test(TAB2, correct = FALSE))
##
## Pearson's Chi-squared test
##
## data: TAB2
## X-squared = 42, df = 1, p-value = 9e-11
How to obtain the test results manually:
# Expected counts under the null hypothesis of independence
print(chi$expected)
## fcons10
## fdrive2 No Yes
## front 68.9 37.1
## other 74.1 39.9
n = sum(TAB2) # number of observations
R = rowSums(TAB2)/n # observed proportion in rows
C = colSums(TAB2)/n # observed proportion in columns
R%*%t(C)*n # expected counts under independence
## No Yes
## [1,] 68.9 37.1
## [2,] 74.1 39.9
(chi$observed - chi$expected)/sqrt(chi$expected) # Pearson residuals
## fcons10
## fdrive2 No Yes
## front -2.759 3.760
## other 2.660 -3.625
chi$residuals
## fcons10
## fdrive2 No Yes
## front -2.759 3.760
## other 2.660 -3.625
(ts <- sum(chi$residuals^2)) # test statistic
## [1] 41.97
pchisq(ts, df=1, lower.tail=FALSE) # p-value
## [1] 9.286e-11
Test based on the difference between two proportions. Notice the equivalence with chisq.test. What can you learn from the output?
(x <- TAB2[, "Yes"])
## front other
## 60 17
(n <- apply(TAB2, 1, sum))
## front other
## 106 114
prop.test(x, n)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: x out of n
## X-squared = 40, df = 1, p-value = 2e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.2930 0.5408
## sample estimates:
## prop 1 prop 2
## 0.5660 0.1491
prop.test(x, n, correct = FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 42, df = 1, p-value = 9e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.3021 0.5317
## sample estimates:
## prop 1 prop 2
## 0.5660 0.1491
Reconstruction of the test statistic:
p2 = x/n # proportion estimates in groups
p = sum(x) / sum(n) # common proportion estimate
(ts <- (diff(p2) / sqrt(p*(1-p)*sum(1/n)))^2) # test statistic
## other
## 41.97
pchisq(ts, df=1, lower.tail=FALSE)
## other
## 9.286e-11
What is tested here?
prop.test(x, n, alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: x out of n
## X-squared = 40, df = 1, p-value = 1e-10
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.3115 1.0000
## sample estimates:
## prop 1 prop 2
## 0.5660 0.1491
The remaining tasks cover the two-sample problem, the more-sample problem, and contingency tables for the pairs (fcons10, fdrive) and (fdrive, ftype).

Two-sample problem: for practical purposes a one-sided alternative can be tested instead. Which alternative makes more sense? What exactly do we test here? Reconstruct the test statistics and the corresponding p-values.
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2)
##
## Welch Two Sample t-test
##
## data: consumption by fdrive2
## t = 0.15, df = 218, p-value = 0.6
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
## -Inf -1.537
## sample estimates:
## mean in group front mean in group other
## 9.863 11.825
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: consumption by fdrive2
## t = 0.15, df = 218, p-value = 0.6
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
## -Inf -1.536
## sample estimates:
## mean in group front mean in group other
## 9.863 11.825
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less",
mu = -2, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: consumption by fdrive2
## W = 6410, p-value = 0.8
## alternative hypothesis: true location shift is less than -2
## 95 percent confidence interval:
## -Inf -1.45
## sample estimates:
## difference in location
## -1.8
Does the fact that the car is economical (in the U.S. sense, consumption <= 10) depend on the drive (distinguishing front, rear, and 4x4)?
chisq.test(TAB3)
##
## Pearson's Chi-squared test
##
## data: TAB3
## X-squared = 42, df = 2, p-value = 7e-10
print(chisq.test(TAB3)$expected)
## fcons10
## fdrive No Yes
## front 68.9 37.1
## rear 41.6 22.4
## 4x4 32.5 17.5
print(chisq.test(TAB3)$residuals)
## fcons10
## fdrive No Yes
## front -2.759 3.760
## rear 1.767 -2.409
## 4x4 2.017 -2.749
print(chisq.test(TAB3)$stdres)
## fcons10
## fdrive No Yes
## front -6.478 6.478
## rear 3.548 -3.548
## 4x4 3.879 -3.879
Does the drive (front/rear/4x4) depend on the type of the car (personal/wagon/SUV/pickup/sport/minivan)?
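Again, TAB4 is not created in the code above; judging from the output, it is the table of ftype by fdrive, e.g.:
(TAB4 <- with(Cars2004nh, table(ftype, fdrive)))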
chisq.test(TAB4)
##
## Pearson's Chi-squared test
##
## data: TAB4
## X-squared = 120, df = 10, p-value <2e-16
print(chisq.test(TAB4)$expected)
## fdrive
## ftype front rear 4x4
## personal 48.182 29.091 22.727
## wagon 8.191 4.945 3.864
## SUV 17.345 10.473 8.182
## pickup 7.709 4.655 3.636
## sport 17.345 10.473 8.182
## minivan 7.227 4.364 3.409