Download this R markdown as: R, Rmd.

Loading the data and libraries

If you use your personal computer an additional package mffSM is needed. The package is not available from CRAN but can be downloaded from this website. Download the package and use the following command for the installation:

install.packages("mffSM_1.1.zip", repos = NULL)
# or without necessity to download first 
install.packages("https://www2.karlin.mff.cuni.cz/~komarek/vyuka/2021_22/nmsa407/R/mffSM_1.2.tar.gz", repos = NULL)

Libraries can be accessed the following way. When in doubt, use help() or ? to show the documentation:

library("mffSM")
library("colorspace")
help(package="mffSM")

This lab session we will work with the data set on car vehicles that were on the market in the U.S. in 2004:

data(Cars2004nh, package = "mffSM")

Quick overview and summary of the data:

dim(Cars2004nh)
## [1] 425  20
str(Cars2004nh)
## 'data.frame':    425 obs. of  20 variables:
##  $ vname       : chr  "Chevrolet.Aveo.4dr" "Chevrolet.Aveo.LS.4dr.hatch" "Chevrolet.Cavalier.2dr" "Chevrolet.Cavalier.4dr" ...
##  $ type        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ drive       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price.retail: int  11690 12585 14610 14810 16385 13670 15040 13270 13730 15460 ...
##  $ price.dealer: int  10965 11802 13697 13884 15357 12849 14086 12482 12906 14496 ...
##  $ price       : num  11328 12194 14154 14347 15871 ...
##  $ cons.city   : num  8.4 8.4 9 9 9 8.1 8.1 9 8.7 9 ...
##  $ cons.highway: num  6.9 6.9 6.4 6.4 6.4 6.5 6.5 7.1 6.5 7.1 ...
##  $ consumption : num  7.65 7.65 7.7 7.7 7.7 7.3 7.3 8.05 7.6 8.05 ...
##  $ engine.size : num  1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
##  $ ncylinder   : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ horsepower  : int  103 103 140 140 140 132 132 130 110 130 ...
##  $ weight      : int  1075 1065 1187 1214 1187 1171 1191 1185 1182 1182 ...
##  $ iweight     : num  0.00093 0.000939 0.000842 0.000824 0.000842 ...
##  $ lweight     : num  6.98 6.97 7.08 7.1 7.08 ...
##  $ wheel.base  : int  249 249 264 264 264 267 267 262 262 262 ...
##  $ length      : int  424 389 465 465 465 442 442 427 427 427 ...
##  $ width       : int  168 168 175 173 175 170 170 170 170 170 ...
##  $ ftype       : Factor w/ 6 levels "personal","wagon",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ fdrive      : Factor w/ 3 levels "front","rear",..: 1 1 1 1 1 1 1 1 1 1 ...
head(Cars2004nh)
##                         vname type drive price.retail price.dealer   price cons.city cons.highway consumption
## 1          Chevrolet.Aveo.4dr    1     1        11690        10965 11327.5       8.4          6.9        7.65
## 2 Chevrolet.Aveo.LS.4dr.hatch    1     1        12585        11802 12193.5       8.4          6.9        7.65
## 3      Chevrolet.Cavalier.2dr    1     1        14610        13697 14153.5       9.0          6.4        7.70
## 4      Chevrolet.Cavalier.4dr    1     1        14810        13884 14347.0       9.0          6.4        7.70
## 5   Chevrolet.Cavalier.LS.2dr    1     1        16385        15357 15871.0       9.0          6.4        7.70
## 6           Dodge.Neon.SE.4dr    1     1        13670        12849 13259.5       8.1          6.5        7.30
##   engine.size ncylinder horsepower weight      iweight  lweight wheel.base length width    ftype fdrive
## 1         1.6         4        103   1075 0.0009302326 6.980076        249    424   168 personal  front
## 2         1.6         4        103   1065 0.0009389671 6.970730        249    389   168 personal  front
## 3         2.2         4        140   1187 0.0008424600 7.079184        264    465   175 personal  front
## 4         2.2         4        140   1214 0.0008237232 7.101676        264    465   173 personal  front
## 5         2.2         4        140   1187 0.0008424600 7.079184        264    465   175 personal  front
## 6         2.0         4        132   1171 0.0008539710 7.065613        267    442   170 personal  front
summary(Cars2004nh)
##     vname                type           drive        price.retail     price.dealer        price       
##  Length:425         Min.   :1.000   Min.   :1.000   Min.   : 10280   Min.   :  9875   Min.   : 10078  
##  Class :character   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 20370   1st Qu.: 18973   1st Qu.: 19601  
##  Mode  :character   Median :1.000   Median :1.000   Median : 27905   Median : 25672   Median : 26656  
##                     Mean   :2.219   Mean   :1.692   Mean   : 32866   Mean   : 30096   Mean   : 31481  
##                     3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.: 39235   3rd Qu.: 35777   3rd Qu.: 37514  
##                     Max.   :6.000   Max.   :3.000   Max.   :192465   Max.   :173560   Max.   :183013  
##                                                                                                       
##    cons.city      cons.highway     consumption     engine.size      ncylinder        horsepower   
##  Min.   : 6.20   Min.   : 5.100   Min.   : 5.65   Min.   :1.300   Min.   :-1.000   Min.   :100.0  
##  1st Qu.:11.20   1st Qu.: 8.100   1st Qu.: 9.65   1st Qu.:2.400   1st Qu.: 4.000   1st Qu.:165.0  
##  Median :12.40   Median : 9.000   Median :10.70   Median :3.000   Median : 6.000   Median :210.0  
##  Mean   :12.36   Mean   : 9.142   Mean   :10.75   Mean   :3.208   Mean   : 5.791   Mean   :216.8  
##  3rd Qu.:13.80   3rd Qu.: 9.800   3rd Qu.:11.65   3rd Qu.:3.900   3rd Qu.: 6.000   3rd Qu.:255.0  
##  Max.   :23.50   Max.   :19.600   Max.   :21.55   Max.   :8.300   Max.   :12.000   Max.   :500.0  
##  NA's   :14      NA's   :14       NA's   :14                                                      
##      weight        iweight             lweight        wheel.base        length          width      
##  Min.   : 923   Min.   :0.0003067   Min.   :6.828   Min.   :226.0   Min.   :363.0   Min.   :163.0  
##  1st Qu.:1412   1st Qu.:0.0005542   1st Qu.:7.253   1st Qu.:262.0   1st Qu.:450.0   1st Qu.:175.0  
##  Median :1577   Median :0.0006341   Median :7.363   Median :272.0   Median :472.0   Median :180.0  
##  Mean   :1626   Mean   :0.0006412   Mean   :7.373   Mean   :274.9   Mean   :470.6   Mean   :181.1  
##  3rd Qu.:1804   3rd Qu.:0.0007082   3rd Qu.:7.498   3rd Qu.:284.0   3rd Qu.:490.0   3rd Qu.:185.0  
##  Max.   :3261   Max.   :0.0010834   Max.   :8.090   Max.   :366.0   Max.   :577.0   Max.   :206.0  
##  NA's   :2      NA's   :2           NA's   :2       NA's   :2       NA's   :26      NA's   :28     
##       ftype       fdrive   
##  personal:242   front:223  
##  wagon   : 30   rear :110  
##  SUV     : 60   4x4  : 92  
##  pickup  : 24              
##  sport   : 49              
##  minivan : 20              
## 

Data subsampling and preprocessing

Exclude cars where the consumption is missing:

Cars2004nh <- Cars2004nh[!is.na(Cars2004nh$consumption),] 

Create a working subsample:

set.seed(20250218)
idx <- c(sample((1:nrow(Cars2004nh))[Cars2004nh$ftype=="personal"], size=100), 
         sample((1:nrow(Cars2004nh))[Cars2004nh$ftype!="personal"], size=120))
Cars2004nh <- Cars2004nh[idx,]

Next, we derive some additional variables. First, binary factor fdrive2, which distinguishes whether the drive is front or other (rear or 4x4):

Cars2004nh <- transform(Cars2004nh, 
                        fdrive2 = factor(1*(fdrive == "front") + 
                                           2*(fdrive != "front"), 
                                         labels = c("front", "other")))
with(Cars2004nh, table(fdrive, fdrive2))
##        fdrive2
## fdrive  front other
##   front   106     0
##   rear      0    64
##   4x4       0    50

Next, we derive binary indicator cons10 of having a consumption <= 10 l/100 km and then create a factor with No (0) and Yes (1) labels:

Cars2004nh <- transform(Cars2004nh, 
                        cons10  = 1 * (consumption <= 10), # 0 or 1
                        fcons10 = factor(1 * (consumption <= 10), 
                                         labels = c("No", "Yes")))

summary(Cars2004nh)
##     vname                type           drive        price.retail     price.dealer        price       
##  Length:220         Min.   :1.000   Min.   :1.000   Min.   : 10280   Min.   :  9875   Min.   : 10078  
##  Class :character   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 20334   1st Qu.: 18839   1st Qu.: 19572  
##  Mode  :character   Median :2.000   Median :2.000   Median : 27270   Median : 25142   Median : 26147  
##                     Mean   :2.618   Mean   :1.745   Mean   : 32819   Mean   : 30012   Mean   : 31416  
##                     3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.: 40151   3rd Qu.: 35905   3rd Qu.: 38015  
##                     Max.   :6.000   Max.   :3.000   Max.   :126670   Max.   :117854   Max.   :122262  
##                                                                                                       
##    cons.city      cons.highway    consumption     engine.size      ncylinder        horsepower   
##  Min.   : 6.70   Min.   : 5.50   Min.   : 6.10   Min.   :1.300   Min.   :-1.000   Min.   :103.0  
##  1st Qu.:11.20   1st Qu.: 8.10   1st Qu.: 9.65   1st Qu.:2.400   1st Qu.: 4.000   1st Qu.:165.0  
##  Median :12.40   Median : 9.00   Median :10.70   Median :3.200   Median : 6.000   Median :210.0  
##  Mean   :12.48   Mean   : 9.28   Mean   :10.88   Mean   :3.205   Mean   : 5.764   Mean   :214.7  
##  3rd Qu.:13.80   3rd Qu.: 9.80   3rd Qu.:11.80   3rd Qu.:3.900   3rd Qu.: 6.000   3rd Qu.:250.0  
##  Max.   :19.60   Max.   :16.80   Max.   :17.45   Max.   :6.000   Max.   :12.000   Max.   :493.0  
##                                                                                                  
##      weight        iweight             lweight        wheel.base        length          width      
##  Min.   : 923   Min.   :0.0003595   Min.   :6.828   Min.   :226.0   Min.   :363.0   Min.   :165.0  
##  1st Qu.:1384   1st Qu.:0.0005524   1st Qu.:7.233   1st Qu.:259.0   1st Qu.:443.5   1st Qu.:175.0  
##  Median :1584   Median :0.0006313   Median :7.368   Median :272.0   Median :466.0   Median :180.0  
##  Mean   :1625   Mean   :0.0006428   Mean   :7.371   Mean   :274.5   Mean   :466.7   Mean   :181.1  
##  3rd Qu.:1810   3rd Qu.:0.0007224   3rd Qu.:7.501   3rd Qu.:287.0   3rd Qu.:493.0   3rd Qu.:185.0  
##  Max.   :2782   Max.   :0.0010834   Max.   :7.931   Max.   :366.0   Max.   :556.0   Max.   :203.0  
##                                                                     NA's   :16      NA's   :18     
##       ftype       fdrive     fdrive2        cons10     fcons10  
##  personal:100   front:106   front:106   Min.   :0.00   No :143  
##  wagon   : 17   rear : 64   other:114   1st Qu.:0.00   Yes: 77  
##  SUV     : 36   4x4  : 50               Median :0.00            
##  pickup  : 16                           Mean   :0.35            
##  sport   : 36                           3rd Qu.:1.00            
##  minivan : 15                           Max.   :1.00            
## 

Two sample problem

In this section, we review the methods for two-sample problems, i.e. exploring the relationship between continuous variable and binary variable. In particular, we are going to explore the following question:

Does the consumption depend on the drive while distinguishing only two categories: front and other?

Exploration

We start with an exploration phase. The most straightforward visualization tool for this problem is boxplot, the default plot for numeric ~ factor:

plot(consumption ~ fdrive2, data = Cars2004nh)

With some effort you can make this plot more compelling:

COL <- c("yellowgreen", "goldenrod")
names(COL) <- levels(Cars2004nh$fdrive2)

par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(consumption ~ fdrive2, data = Cars2004nh, col = COL,
     xlab = "Drive of the vehicle", ylab = "Consumption [l/100 km]",
     main = "Consumption vs. Front drive",
     cex.main = 1.0,
     xaxt = "n")
axis(1, at = 1:nlevels(Cars2004nh$fdrive2), labels = c("Front drive", "Other"))
Boxplots of consumption for two categories of drive.

Boxplots of consumption for two categories of drive.

Think about other ways of displaying the two different distributions. Here are the sample characteristics:

with(Cars2004nh, tapply(consumption, fdrive2, summary))
## $front
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.100   8.450   9.875   9.863  10.700  15.600 
## 
## $other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.30   10.70   11.45   11.82   12.45   17.45
with(Cars2004nh, tapply(consumption, fdrive2, sd, na.rm = TRUE))
##    front    other 
## 1.848568 1.960938
X = subset(Cars2004nh, fdrive2 == "front")$consumption
Y = subset(Cars2004nh, fdrive2 == "other")$consumption

Formal statistical test(s)

For each test refresh the following:

  • What is the underlying model in each test?
  • What are the assumptions?

Standard two-sample t-test (assuming equal variances and normality)

t.test(consumption ~ fdrive2, data = Cars2004nh, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  consumption by fdrive2
## t = -7.62, df = 218, p-value = 7.68e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
##  -2.468658 -1.454050
## sample estimates:
## mean in group front mean in group other 
##            9.863208           11.824561
mean(X) - mean(Y)
## [1] -1.961354

Welch two-sample t-test (not assuming equal variances)

t.test(consumption ~ fdrive2, data = Cars2004nh)  
## 
##  Welch Two Sample t-test
## 
## data:  consumption by fdrive2
## t = -7.6364, df = 217.96, p-value = 6.949e-13
## alternative hypothesis: true difference in means between group front and group other is not equal to 0
## 95 percent confidence interval:
##  -2.467568 -1.455140
## sample estimates:
## mean in group front mean in group other 
##            9.863208           11.824561

Wilcoxon two-sample rank test

wilcox.test(consumption ~ fdrive2, data = Cars2004nh, conf.int = TRUE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  consumption by fdrive2
## W = 2548, p-value = 1.273e-13
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -2.300047 -1.350055
## sample estimates:
## difference in location 
##              -1.799991
median(X) - median(Y)
## [1] -1.575
XYdiff = outer(X, Y, FUN="-")
median(XYdiff)
## [1] -1.8

Kolmogorov-Smirnov test

ks.test(consumption ~ fdrive2, data = Cars2004nh)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  consumption by fdrive2
## D = 0.52185, p-value = 2.035e-13
## alternative hypothesis: two-sided
par(mfrow = c(1,1), mar = c(4,4,1.5,0.5))
plot(ecdf(X), main="ECDF")
plot(ecdf(Y), col=2, add=TRUE)

More sample problem

What if there are more than two groups?

Does the consumption depend on the drive (distinguishing front/rear/4x4)?

Exploration

Let’s start with boxplots:

plot(consumption ~ fdrive, data = Cars2004nh)

with(Cars2004nh, tapply(consumption, fdrive, summary))
## $front
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.100   8.450   9.875   9.863  10.700  15.600 
## 
## $rear
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.30   10.55   11.25   11.25   11.80   15.25 
## 
## $`4x4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.95   10.70   11.60   12.56   14.79   17.45
with(Cars2004nh, tapply(consumption, fdrive, sd, na.rm = TRUE))
##    front     rear      4x4 
## 1.848568 1.360218 2.344725

Formal statistical test(s)

For each test refresh the following:

  • What is the underlying model in each test?
  • What are the assumptions?

Standard ANOVA

summary(aov(consumption ~ fdrive, data = Cars2004nh))  
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## fdrive        2  259.9  129.93   37.86 7.88e-15 ***
## Residuals   217  744.8    3.43                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA without assuming equal variances

oneway.test(consumption ~ fdrive, data = Cars2004nh)   
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  consumption and fdrive
## F = 30.988, num df = 2.00, denom df = 113.11, p-value = 1.856e-11

Kruskal-Wallis Rank Sum Test

kruskal.test(consumption ~ fdrive, data = Cars2004nh)  
## 
##  Kruskal-Wallis rank sum test
## 
## data:  consumption by fdrive
## Kruskal-Wallis chi-squared = 58.147, df = 2, p-value = 2.363e-13

Alternatively: the same test statistics all in the standard ANOVA above

summary(m0 <- lm(consumption ~ fdrive, data=Cars2004nh))
## 
## Call:
## lm(formula = consumption ~ fdrive, data = Cars2004nh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9477 -1.3130 -0.0632  0.9118  5.7368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.8632     0.1799  54.814  < 2e-16 ***
## fdriverear    1.3844     0.2933   4.721 4.21e-06 ***
## fdrive4x4     2.6998     0.3178   8.494 3.18e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.853 on 217 degrees of freedom
## Multiple R-squared:  0.2587, Adjusted R-squared:  0.2518 
## F-statistic: 37.86 on 2 and 217 DF,  p-value: 7.876e-15
anova(m0)
## Analysis of Variance Table
## 
## Response: consumption
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## fdrive      2 259.87 129.933  37.858 7.876e-15 ***
## Residuals 217 744.76   3.432                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
oneway.test(consumption ~ fdrive, data = Cars2004nh, var.equal=TRUE)   
## 
##  One-way analysis of means
## 
## data:  consumption and fdrive
## F = 37.858, num df = 2, denom df = 217, p-value = 7.876e-15

Follow these steps to compute the sum of squares and degrees of freedom needed for ANOVA with equal variances:

X = subset(Cars2004nh, fdrive == "front")$consumption
Y = subset(Cars2004nh, fdrive == "rear")$consumption
Z = subset(Cars2004nh, fdrive == "4x4")$consumption
nX = length(X)
nY = length(Y)
nZ = length(Z)
mX = mean(X)
mY = mean(Y)
mZ = mean(Z)
m = mean(c(X,Y,Z))
nX*(m - mX)^2+nY*(m-mY)^2+nZ*(m-mZ)^2
## [1] 259.8652
sum((X - mX)^2)+sum((Y-mY)^2)+sum((Z-mZ)^2)
## [1] 744.7577
sum((c(X,Y,Z)-m)^2)
## [1] 1004.623
nX+nY+nZ
## [1] 220

Conditional mean / Regression

Now we review the basic methods for describing the relationship between two continuous variables.

Exploration

Why are boxplots not reasonable choice? Use scatterplot instead, default for plot between numeric and numeric variable.

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)

Histograms to inspect the shape of marginal distributions:

### Histograms 
par(mfrow = c(1, 2), mar = c(4,4,0.5,0.5))
hist(Cars2004nh$consumption, prob = TRUE, main = "", col = "seagreen", 
     xlab = "Consumption [l/100 km]")
hist(Cars2004nh$length, prob = TRUE, main = "", col = "seagreen", 
     xlab = "Length [cm]")

with(Cars2004nh, cor(consumption, length))
## [1] NA
with(Cars2004nh, cor(consumption, length, use = "complete.obs"))
## [1] 0.4379756
with(Cars2004nh, cor(consumption, length, method = "spearman", use = "complete.obs"))
## [1] 0.4440134

Formal statistical test(s)

For each test refresh the following:

  • What is the underlying model in each test?
  • What are the assumptions?

Test based on Pearson’s correlation coefficient

with(Cars2004nh, cor.test(consumption, length))
## 
##  Pearson's product-moment correlation
## 
## data:  consumption and length
## t = 6.9242, df = 202, p-value = 5.713e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3198482 0.5426954
## sample estimates:
##       cor 
## 0.4379756

Test based on Spearman’s correlation coefficient

with(Cars2004nh, cor.test(consumption, length, method = "spearman"))
## 
##  Spearman's rank correlation rho
## 
## data:  consumption and length
## S = 786671, p-value = 2.892e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4440134

How would you assess the relationship between the dependent variable and independent covariates?

Here are some graphical tools to examine the (non)linearity of the relationship:

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
grid <- quantile(Cars2004nh$length, seq(0, 1, length = 20), na.rm = T)

xGrid <- NULL 
yMean <- NULL 
for(i in 2:length(grid)){
  tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length < grid[i]]
  lines(rep(mean(tempData, na.rm = T), 2) ~ c(grid[i - 1], grid[i]), col = "red")
  points((grid[i - 1] + grid[i])/2, mean(tempData, na.rm = T), 
         pch = 23, bg = "red", cex = 1.2)
  
  xGrid <- c(xGrid, (grid[i - 1] + grid[i])/2)
  yMean <- c(yMean, mean(tempData, na.rm = T))
  
  lines(rep(grid[i], 2), c(min(Cars2004nh$consumption), max(Cars2004nh$consumption)), lty = 3)
}
### which interpolates as: 
lines(yMean ~ xGrid, col = "darkred", lwd = 3)

Less groups without for cycle:

grid <- quantile(Cars2004nh$length, seq(0, 1, length = 11), na.rm = T)
Cars2004nh$flength <- cut(Cars2004nh$length, breaks = grid, right = FALSE)
xGrid <- (grid[-1] + grid[-length(grid)]) / 2
yMean <- tapply(Cars2004nh$consumption, Cars2004nh$flength, mean)

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, type = "n")
segments(x0 = grid, y0 = min(Cars2004nh$consumption), y1 = max(Cars2004nh$consumption), lty = 3)
points(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)
segments(x0 = grid[-length(grid)], x1 = grid[-1], y0 = yMean, col = "red")
lines(x = xGrid, y = yMean, col = "darkred", lwd = 3)
points(x = xGrid, y = yMean, pch = 23, bg = "red", cex = 1.2)

Category for every value of length that appears in the data:

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", pch = 23)

grid <- as.numeric(names(table(Cars2004nh$length)))
xGrid <- NULL 
yMean <- NULL 
for (i in 2:(length(grid) - 1)){
  tempData <- Cars2004nh$consumption[Cars2004nh$length >= grid[i - 1] & Cars2004nh$length <= grid[i + 1]]
  points(grid[i], mean(tempData, na.rm = T), pch = 23, bg = "red", cex = 1.2)
  
  xGrid <- c(xGrid, grid[i])
  yMean <- c(yMean, mean(tempData, na.rm = T))
}

### which now interpolates as: 
lines(yMean ~ xGrid, col = "darkred", lwd = 3)

Local trend by lowess curve:

par(mfrow = c(1,3), mar = c(4,4,1.5,0.5))
for(f in c(1/10, 1/5, 1/2)){
  plot(consumption ~ length, data = Cars2004nh, col = "red3", bg = "orange", 
       pch = 23, main = paste0("f = ", f))
  tempData <- Cars2004nh[!is.na(Cars2004nh$length),]
  lines(lowess(tempData$consumption ~ tempData$length, f = f), col = "red3", lwd = 3)
}

How a straight line should be fitted into the data?

Contingency table

We use them to capture the relationship between two categorical variables.

Does the fact that the car is economical (in the U.S. sense, consumption <= 10) depend on the drive (distinguishing only front and other)?

Exploration

The default plot for factor ~ factor is the spineplot that compares the proportions of the first factor within each category of the second factor separately:

par(mfrow = c(1,1), mar = c(4,4,1,2.5))
plot(fcons10 ~ fdrive2, data = Cars2004nh, col = diverge_hcl(2))

Contingency table:

(TAB2 <- with(Cars2004nh, table(fdrive2, fcons10)))
##        fcons10
## fdrive2 No Yes
##   front 46  60
##   other 97  17

Proportion table:

round(prop.table(TAB2, margin = 1), 2)
##        fcons10
## fdrive2   No  Yes
##   front 0.43 0.57
##   other 0.85 0.15

Formal statistical test(s)

For each test refresh the following:

  • What is the underlying model in each test?
  • What are the assumptions?

Chi-square test using continuity correction to be more conservative

chisq.test(TAB2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TAB2
## X-squared = 40.154, df = 1, p-value = 2.347e-10

Chi-square test without continuity correction

(chi = chisq.test(TAB2, correct = FALSE))
## 
##  Pearson's Chi-squared test
## 
## data:  TAB2
## X-squared = 41.966, df = 1, p-value = 9.286e-11

How to obtain the test results manually:

# Expected observed counts under the null hypothesis 
print(chi$expected)  
##        fcons10
## fdrive2   No  Yes
##   front 68.9 37.1
##   other 74.1 39.9
n = sum(TAB2)         # number of observations
R = rowSums(TAB2)/n   # observed proportion in rows
C = colSums(TAB2)/n   # observed proportion in columns
R%*%t(C)*n            # expected counts under independence
##        No  Yes
## [1,] 68.9 37.1
## [2,] 74.1 39.9
(chi$observed - chi$expected)/sqrt(chi$expected)  # residuals
##        fcons10
## fdrive2        No       Yes
##   front -2.758836  3.759660
##   other  2.660274 -3.625342
chi$residuals
##        fcons10
## fdrive2        No       Yes
##   front -2.758836  3.759660
##   other  2.660274 -3.625342
(ts<-sum(chi$res^2))    # test statistic
## [1] 41.96638
pchisq(ts, df=1, lower.tail=FALSE) # p-value
## [1] 9.285619e-11

Test based on difference between proportions. Notice the equivalency to chisq.test. What can you learn from the output?

(x <- TAB2[, "Yes"])
## front other 
##    60    17
(n <- apply(TAB2, 1, sum))
## front other 
##   106   114
prop.test(x, n)
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  x out of n
## X-squared = 40.154, df = 1, p-value = 2.347e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.2930180 0.5408118
## sample estimates:
##    prop 1    prop 2 
## 0.5660377 0.1491228
prop.test(x, n, correct = FALSE)
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 41.966, df = 1, p-value = 9.286e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.3021210 0.5317089
## sample estimates:
##    prop 1    prop 2 
## 0.5660377 0.1491228

Reconstruction of the test statistic:

p2 = x/n              # proportion estimates in groups
p = sum(x) / sum(n)   # common proportion estimate
(ts <- (diff(p2) / sqrt(p*(1-p)*sum(1/n)))^2) # test statistic
##    other 
## 41.96638
pchisq(ts, df=1, lower.tail=FALSE)
##        other 
## 9.285619e-11

What is tested here?

prop.test(x, n, alternative = "greater")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  x out of n
## X-squared = 40.154, df = 1, p-value = 1.174e-10
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.3114739 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.5660377 0.1491228

Individual work

Two sample problem

  • Verify the assumptions for each two-sample test.
  • For instance, a formal test for equal variances in two groups or visual assessment of normality of the two samples.
  • How would you visualize the sample means and medians which are tested in the two-sample problems?

More sample problem

  • Instead of using three group specific means, consider a reformulation by using the conditional mean given the groups. What is the analogy with the regression formulation?

Contingency tables

  • Try also for categorical variables with more than 2 groups.
  • Explore the following pairs: (fcons10, fdrive) and (fdrive,ftype).

For practical purposes one sided alternative can be tested instead. Which alternative makes more sense? What exactly do we test here? Reconstruct the test statistics and the corresponding p-values.

t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2)
## 
##  Welch Two Sample t-test
## 
## data:  consumption by fdrive2
## t = 0.15047, df = 217.96, p-value = 0.5597
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
##       -Inf -1.537082
## sample estimates:
## mean in group front mean in group other 
##            9.863208           11.824561
t.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", mu = -2, 
       var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  consumption by fdrive2
## t = 0.15014, df = 218, p-value = 0.5596
## alternative hypothesis: true difference in means between group front and group other is less than -2
## 95 percent confidence interval:
##       -Inf -1.536167
## sample estimates:
## mean in group front mean in group other 
##            9.863208           11.824561
wilcox.test(consumption ~ fdrive2, data = Cars2004nh, alternative = "less", 
            mu = -2, conf.int = TRUE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  consumption by fdrive2
## W = 6410, p-value = 0.7827
## alternative hypothesis: true location shift is less than -2
## 95 percent confidence interval:
##       -Inf -1.449992
## sample estimates:
## difference in location 
##              -1.799991

Does the fact that the car is economical (in the U.S. sense, consumption<=10) depend on the drive (while distinguishing front, rear or 4x4)?

  • What are we testing by the following test?
  • What is the model behind this test?
  • Does the model seem to be good enough to be useful?
  • Which artefacts of the model might be relatively safely ignored and why?
  • What is the conclusion?
chisq.test(TAB3)
## 
##  Pearson's Chi-squared test
## 
## data:  TAB3
## X-squared = 42.298, df = 2, p-value = 6.532e-10
print(chisq.test(TAB3)$expected)
##        fcons10
## fdrive    No  Yes
##   front 68.9 37.1
##   rear  41.6 22.4
##   4x4   32.5 17.5
print(chisq.test(TAB3)$residuals)
##        fcons10
## fdrive         No       Yes
##   front -2.758836  3.759660
##   rear   1.767495 -2.408690
##   4x4    2.017233 -2.749026
print(chisq.test(TAB3)$stdres)
##        fcons10
## fdrive         No       Yes
##   front -6.478146  6.478146
##   rear   3.547915 -3.547915
##   4x4    3.878904 -3.878904

Does the drive (front/rear/4x4) depend on the type of the car (personal/wagon/SUV/pickup/sport/minivan)?

  • What are we testing by the following test?
  • What is the model behind this test?
  • Does the model seem to be good enough to be useful?
  • Which artefacts of the model might be relatively safely ignored and why?
  • What is the conclusion?
  • Why did we get the warning message? What can be done about it?
chisq.test(TAB4)
## 
##  Pearson's Chi-squared test
## 
## data:  TAB4
## X-squared = 119.77, df = 10, p-value < 2.2e-16
print(chisq.test(TAB4)$expected)
##           fdrive
## ftype          front      rear       4x4
##   personal 48.181818 29.090909 22.727273
##   wagon     8.190909  4.945455  3.863636
##   SUV      17.345455 10.472727  8.181818
##   pickup    7.709091  4.654545  3.636364
##   sport    17.345455 10.472727  8.181818
##   minivan   7.227273  4.363636  3.409091