Download this R markdown as: R, Rmd.

Outline of this lab session:

Loading the data and libraries

Firstly, some necessary R libraries and the working dataset:

library("colorspace")    # colors defined by function
library("mffSM")         # data and plotLM()
library("splines")       # B-splines modelling
library("MASS")          # function stdres for standardized residuals

We are going to use the dataset Dris from the previous exercise:

data(Dris, package = "mffSM")
# help("Dris")
head(Dris)
##   yield   N  P   K Ca Mg
## 1  5.47 470 47 320 47 11
## 2  5.63 530 48 357 60 16
## 3  5.63 530 48 310 63 16
## 4  4.84 482 47 357 47 13
## 5  4.84 506 48 294 52 15
## 6  4.21 500 45 283 61 14
summary(Dris)
##      yield             N               P               K               Ca              Mg       
##  Min.   :1.920   Min.   :268.0   Min.   :32.00   Min.   :200.0   Min.   :29.00   Min.   : 8.00  
##  1st Qu.:4.040   1st Qu.:427.0   1st Qu.:43.00   1st Qu.:338.0   1st Qu.:41.75   1st Qu.:10.00  
##  Median :4.840   Median :473.0   Median :49.00   Median :377.0   Median :49.00   Median :11.00  
##  Mean   :4.864   Mean   :469.9   Mean   :48.65   Mean   :375.4   Mean   :51.45   Mean   :11.64  
##  3rd Qu.:5.558   3rd Qu.:518.0   3rd Qu.:54.00   3rd Qu.:407.0   3rd Qu.:59.00   3rd Qu.:13.00  
##  Max.   :8.792   Max.   :657.0   Max.   :72.00   Max.   :580.0   Max.   :95.00   Max.   :22.00

1. Basic model diagnostics

We will use the following model at the beginning:

aLogMg <- lm(yield ~ log2(Mg), data = Dris)
summary(aLogMg) 
## 
## Call:
## lm(formula = yield ~ log2(Mg), data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1194 -0.7412 -0.0741  0.7451  3.9841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4851     0.7790   1.907   0.0574 .  
## log2(Mg)      0.9605     0.2208   4.349 1.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared:  0.04915,    Adjusted R-squared:  0.04655 
## F-statistic: 18.92 on 1 and 366 DF,  p-value: 1.772e-05

We will closely inspect the model residuals. There are two types of residuals:

Here are the three most important nicely formatted graphical diagnostic tools:

par(mar = c(4,4,3,0.5))
plotLM(aLogMg)                # function from 'mffSM' 

Individual work

How can we manually reconstruct the plots above?

par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 1)

### manual computation
plot(fitted(aLogMg), residuals(aLogMg),
     lwd=2, col="blue4", bg="skyblue", pch=21)
abline(h=0, col = "grey", lty = 3)
lines(lowess(fitted(aLogMg), residuals(aLogMg)), col="red", lwd=3)

par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 2, ask = FALSE)

### standardized residuals
H = qr.Q(aLogMg[["qr"]]) %*% t(qr.Q(aLogMg[["qr"]]))
M = diag(nrow(H))-H
stdres = residuals(aLogMg)/sqrt(deviance(aLogMg)/aLogMg$df.residual*diag(M))   
qq = qqnorm(stdres, plot.it=FALSE)
# all.equal(stdres(aLogMg), stdres)   # --> TRUE (computed by MASS function)
# all.equal(qq$y, stdres)             # --> TRUE (will be on y-axis)
plot(qq$x, qq$y, lwd=2, col="blue4", bg="skyblue", pch=21)
qqline(stdres, lwd=2, lty=3)

par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 3, ask = FALSE)  

### manual computation
plot(fitted(aLogMg), sqrt(abs(stdres)),
     lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(fitted(aLogMg), sqrt(abs(stdres))), col="red", lwd=3)

Note, that there are actually infinitely many possibilities how to check that \(\mathsf{E}\,(u_i| \boldsymbol{X}_i) = 0\). A scatterplot of fitted values vs. residuals is only one of them. Other common possibility is to plot the covariate values vs. the residuals.

par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(log2(Dris$Mg), residuals(aLogMg), 
     xlab=expression(log[2](Mg)), ylab="Residuals", 
     main="Residuals vs. covariate in model", 
     pch = 21, col = "blue4", bg = "skyblue")
lines(lowess(residuals(aLogMg)~log2(Dris$Mg)), col="red")

plot(log2(Dris$Ca), residuals(aLogMg), 
     xlab=expression(log[2](Ca)), ylab="Residuals", 
     main="Residuals vs. covariate not in model", 
     pch = 21, col = "blue4", bg = "skyblue")
lines(lowess(residuals(aLogMg)~log2(Dris$Ca)), col="red")

Also check the following and explain the results:

sum(aLogMg$residuals)
## [1] -5.522666e-14
sum(aLogMg$residuals*log2(Dris$Mg))
## [1] -2.58342e-13

Compare the plots for model aLogMg with the interaction model from previous exercise:

par(mar = c(4,4,3,0.5))
plotLM(lm(yield ~ log2(Mg) + log2(Ca) + I(log2(Mg) * log2(Ca)), data = Dris))

2. Polynomials and splines as regressors

For some example of possible model improvements, we will consider a different dataset. We will fit a series of different models that we will mutually compare.

A dataset providing a series of (independent) measurements of head acceleration in a simulated motorcycle accident (used to test crash helmets) is given in the datafile Motorcycle (within the R package mffSM).

data(Motorcycle, package = "mffSM")
# help(Motorcycle)
head(Motorcycle)
##   time haccel
## 1  2.4    0.0
## 2  2.6   -1.3
## 3  3.2   -2.7
## 4  3.6    0.0
## 5  4.0   -2.7
## 6  6.2   -2.7
dim(Motorcycle)
## [1] 133   2
summary(Motorcycle)
##       time           haccel       
##  Min.   : 2.40   Min.   :-134.00  
##  1st Qu.:15.80   1st Qu.: -54.90  
##  Median :24.00   Median : -13.30  
##  Mean   :25.55   Mean   : -25.55  
##  3rd Qu.:35.20   3rd Qu.:   0.00  
##  Max.   :65.60   Max.   :  75.00

We are primarily interested in how the head acceleration (covariate haccel) depends on the time after the accident (given in miliseconds – the time covariate).

par(mfrow = c(1,1), mar = c(4,4,1,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
     xlab = "Time [ms]", ylab = "Head acceleration [g]")

fit0 = lm(haccel~time, data=Motorcycle)
abline(fit0)

Using the diagnostic plots, we can clearly see that the model denoted as fit0 is hardly appropriate:

par(mar = c(4,4,3,0.5))
plotLM(fit0)

As far as there is a quite flexible shape observed in the plot we need some more flexible model to capture the structure of the underlying conditional expectation of the head acceleration given the time. We will use polynomials (of different degrees) for fitting the model (while still staying within the linear regression framework).

fit6 <- lm(haccel~poly(time, 6, raw=TRUE), data=Motorcycle)  
fit12 <- lm(haccel~poly(time, 12, raw=TRUE), data=Motorcycle)

The model can be visualized as follows:

xgrid <- seq(0, 70, length = 500)

par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
     main = "Fitted polynomial curves",
     xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(fit6, newdata=data.frame(time=xgrid)), lwd=2, col="blue")
lines(xgrid, predict(fit12, newdata=data.frame(time=xgrid)), lwd=2, col="darkgreen")
legend(37,-70, lty=rep(1, 2), col=c("blue", "darkgreen"), 
       legend=paste("p=", c(6,12),sep=""), lwd=c(2,2), cex=1.2)

Classical polynomials are, however, not computationally very effective. Especially for higher order of the polynomial approximation the solution tends to be unstable. To avoid numerical problems you can use orthogonal polynomials with raw=FALSE, which stabilizes the magnitude (and standard errors) of estimated coefficients:

summary(fit6)
## 
## Call:
## lm(formula = haccel ~ poly(time, 6, raw = TRUE), data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.039 -23.622  -0.497  24.032  75.521 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -2.069e+02  4.349e+01  -4.757 5.29e-06 ***
## poly(time, 6, raw = TRUE)1  9.658e+01  1.558e+01   6.199 7.47e-09 ***
## poly(time, 6, raw = TRUE)2 -1.302e+01  1.865e+00  -6.982 1.49e-10 ***
## poly(time, 6, raw = TRUE)3  7.160e-01  1.012e-01   7.078 9.04e-11 ***
## poly(time, 6, raw = TRUE)4 -1.870e-02  2.732e-03  -6.846 2.97e-10 ***
## poly(time, 6, raw = TRUE)5  2.321e-04  3.572e-05   6.496 1.73e-09 ***
## poly(time, 6, raw = TRUE)6 -1.102e-06  1.799e-07  -6.128 1.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.09 on 126 degrees of freedom
## Multiple R-squared:  0.5524, Adjusted R-squared:  0.5311 
## F-statistic: 25.91 on 6 and 126 DF,  p-value: < 2.2e-16
summary(lm(haccel~poly(time, 6, raw=FALSE), data=Motorcycle))
## 
## Call:
## lm(formula = haccel ~ poly(time, 6, raw = FALSE), data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.039 -23.622  -0.497  24.032  75.521 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -25.546      2.869  -8.903 5.01e-15 ***
## poly(time, 6, raw = FALSE)1  160.611     33.091   4.854 3.52e-06 ***
## poly(time, 6, raw = FALSE)2   99.015     33.091   2.992  0.00333 ** 
## poly(time, 6, raw = FALSE)3 -244.969     33.091  -7.403 1.67e-11 ***
## poly(time, 6, raw = FALSE)4   64.683     33.091   1.955  0.05283 .  
## poly(time, 6, raw = FALSE)5  171.291     33.091   5.176 8.68e-07 ***
## poly(time, 6, raw = FALSE)6 -202.776     33.091  -6.128 1.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.09 on 126 degrees of freedom
## Multiple R-squared:  0.5524, Adjusted R-squared:  0.5311 
## F-statistic: 25.91 on 6 and 126 DF,  p-value: < 2.2e-16

Another possible solution how to introduce some reasonable stability into the model where some higher order polynomials are used, are splines.

### Choice of knots 
knots <- c(0, 11, 12, 13, 20, 30, 32, 34, 40, 50, 70) 
(inner <- knots[-c(1, length(knots))])  # inner knots 
## [1] 11 12 13 20 30 32 34 40 50
(bound <- knots[c(1, length(knots))])   # boundary knots 
## [1]  0 70
### B-splines of degree 3
DEGREE <- 3 # piecewise cubic 
Bgrid <- bs(xgrid, knots = inner, Boundary.knots = bound, degree = DEGREE, 
            intercept = TRUE)
dim(Bgrid)   # 13-dimensional linear space of piecewise polynomials
## [1] 500  13

The B-spline basis is a set of the following functions:

COL <- rainbow_hcl(ncol(Bgrid), c = 80)
par(mfrow = c(1,1), mar = c(4,4,1,1))
plot(xgrid, Bgrid[,1], type = "l", col = COL[1], lwd = 2, xlab = "x", 
     ylab = "B(x)", xlim = range(knots), ylim = range(Bgrid, na.rm = TRUE))
for (j in 2:ncol(Bgrid)) lines(xgrid, Bgrid[,j], col = COL[j], lwd = 2)
points(knots, rep(0, length(knots)), 
       pch = 21, bg = "blue", col = "skyblue", cex = 1.5)

Thus, we can use B-splines to fit the model.

More details about the B-splines can be found, for instance, here.

m1 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
                     degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)

This is the fitted curve:

par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
     main = "Fitted B-spline curve",
     xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(m1, newdata=data.frame(time=xgrid)), lwd=2, col="blue")

names(m1$coefficients) <- paste("B", 1:ncol(Bgrid))
summary(m1)
## 
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
##     degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.857 -11.740  -0.268  10.356  55.249 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## B 1   -11.614     83.252  -0.140    0.889    
## B 2    12.450     81.282   0.153    0.879    
## B 3   -13.988     45.345  -0.308    0.758    
## B 4     2.984     18.894   0.158    0.875    
## B 5     6.115     12.139   0.504    0.615    
## B 6  -237.283     18.850 -12.588  < 2e-16 ***
## B 7    17.346     14.094   1.231    0.221    
## B 8    53.248     12.635   4.214 4.88e-05 ***
## B 9     5.102     13.051   0.391    0.697    
## B 10   12.609     17.146   0.735    0.464    
## B 11  -34.949     38.161  -0.916    0.362    
## B 12   58.255     65.045   0.896    0.372    
## B 13  -93.068     75.924  -1.226    0.223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared:  0.8447, Adjusted R-squared:  0.8278 
## F-statistic: 50.19 on 13 and 120 DF,  p-value: < 2.2e-16

Since the intercept is included within the B-spline basis (intercept = TRUE in bs), we omitted it in the formula with -1. However, this use comes with some issues deeply rooted within the implementation. For example, neither Multiple R-squared nor the F-statistic quantity in the output correspond to correct numbers. Simply put, summary.lm thinks the intercept is not included despite still being a linear combination of given columns.

Individual work

Can you explain why the values reported for \(R^2\) and \(R_{adj}^2\) are not appropriate? What is the right meaning of these numbers? Can you manually reconstruct them?

The \(R^2\) value from the model is

summary(m1)$r.squared
## [1] 0.8446597

and the manual calculations will give

1 - deviance(m1)/sum(Motorcycle$haccel^2)
## [1] 0.8446597
sum(fitted(m1)^2)/sum(Motorcycle$haccel^2)
## [1] 0.8446597

Interpretation of the F-test from the summary output above is a test of a submodel

summary(m1)$fstatistic
##     value     numdf     dendf 
##  50.19211  13.00000 120.00000
# the usual F-test (correct)
m2 <- lm(haccel ~ 1, data = Motorcycle)
anova(m2, m1)
## Analysis of Variance Table
## 
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = TRUE) - 1
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    132 308223                                 
## 2    120  61362 12    246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the wrong F-test (in summary m1)
(m0 <- lm(haccel ~ -1, data=Motorcycle))
## 
## Call:
## lm(formula = haccel ~ -1, data = Motorcycle)
## 
## No coefficients
anova(m0 ,m1)
## Analysis of Variance Table
## 
## Model 1: haccel ~ -1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = TRUE) - 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    133 395017                                  
## 2    120  61362 13    333655 50.192 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Nevertheless, as the intercept parameter is in the linear span of the columns of \(\mathbb{X}\), it makes sense to calculate the (usual) multiple \(R^2\) value. This can be done ‘manually’ as follows:

var(fitted(m1)) / var(Motorcycle$haccel)
## [1] 0.8009163
cor(Motorcycle$haccel, fitted(m1))^2
## [1] 0.8009163
1 - deviance(m1)/deviance(m2)
## [1] 0.8009163

Alternatively, one can use B-splines (the basis without intercept) with an intercept in the model to get some meaningful numbers in the output:

m0 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
                     degree = DEGREE, intercept = FALSE), data = Motorcycle)
names(m0$coefficients) <- c("(Intercept)", paste("B", 1:12))
summary(m0)
## 
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound, 
##     degree = DEGREE, intercept = FALSE), data = Motorcycle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.857 -11.740  -0.268  10.356  55.249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -11.614     83.252  -0.140   0.8893  
## B 1           24.064    161.970   0.149   0.8821  
## B 2           -2.374     57.835  -0.041   0.9673  
## B 3           14.598     93.327   0.156   0.8760  
## B 4           17.729     80.496   0.220   0.8261  
## B 5         -225.669     88.698  -2.544   0.0122 *
## B 6           28.960     82.780   0.350   0.7271  
## B 7           64.862     84.773   0.765   0.4457  
## B 8           16.716     83.997   0.199   0.8426  
## B 9           24.223     85.172   0.284   0.7766  
## B 10         -23.335     91.337  -0.255   0.7988  
## B 11          69.869    105.918   0.660   0.5107  
## B 12         -81.454    112.465  -0.724   0.4703  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared:  0.8009, Adjusted R-squared:  0.781 
## F-statistic: 40.23 on 12 and 120 DF,  p-value: < 2.2e-16
anova(m2,m0)
## Analysis of Variance Table
## 
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE, 
##     intercept = FALSE)
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    132 308223                                 
## 2    120  61362 12    246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about the interpretation of the classical diagnostic plots for the model fitted with B-splines?

par(mar = c(4,4,3,0.5))
plotLM(m1)

It seems there is some issue with homoscedasticity assumption. Let’s change the x-axis of the third Scale-Location plot to determine how the variability changes:

par(mfrow = c(1,2), mar = c(4,4,2,1))
plot(abs(fitted(m1)), sqrt(abs(stdres(m1))),
     main = "Abs(fitted)",
     lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(abs(fitted(m1)), sqrt(abs(stdres(m1)))), col="red", lwd=3)

plot(Motorcycle$time, sqrt(abs(stdres(m1))),
     main = "Time",
     lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(Motorcycle$time, sqrt(abs(stdres(m1)))), col="red", lwd=3)

How to adjust the inference for heteroscedastic models will be shown later.