First, we load some necessary R libraries and the working dataset:
library("colorspace") # colors defined by function
library("mffSM") # data and plotLM()
library("splines") # B-splines modelling
library("MASS") # function stdres for standardized residuals
We are going to use the dataset Dris from the previous exercise:
data(Dris, package = "mffSM")
# help("Dris")
head(Dris)
## yield N P K Ca Mg
## 1 5.47 470 47 320 47 11
## 2 5.63 530 48 357 60 16
## 3 5.63 530 48 310 63 16
## 4 4.84 482 47 357 47 13
## 5 4.84 506 48 294 52 15
## 6 4.21 500 45 283 61 14
summary(Dris)
## yield N P K Ca Mg
## Min. :1.920 Min. :268.0 Min. :32.00 Min. :200.0 Min. :29.00 Min. : 8.00
## 1st Qu.:4.040 1st Qu.:427.0 1st Qu.:43.00 1st Qu.:338.0 1st Qu.:41.75 1st Qu.:10.00
## Median :4.840 Median :473.0 Median :49.00 Median :377.0 Median :49.00 Median :11.00
## Mean :4.864 Mean :469.9 Mean :48.65 Mean :375.4 Mean :51.45 Mean :11.64
## 3rd Qu.:5.558 3rd Qu.:518.0 3rd Qu.:54.00 3rd Qu.:407.0 3rd Qu.:59.00 3rd Qu.:13.00
## Max. :8.792 Max. :657.0 Max. :72.00 Max. :580.0 Max. :95.00 Max. :22.00
We begin with the following simple model:
aLogMg <- lm(yield ~ log2(Mg), data = Dris)
summary(aLogMg)
##
## Call:
## lm(formula = yield ~ log2(Mg), data = Dris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1194 -0.7412 -0.0741 0.7451 3.9841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4851 0.7790 1.907 0.0574 .
## log2(Mg) 0.9605 0.2208 4.349 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared: 0.04915, Adjusted R-squared: 0.04655
## F-statistic: 18.92 on 1 and 366 DF, p-value: 1.772e-05
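Note the interpretation implied by the log2 scale: doubling the magnesium concentration is associated with an estimated increase of about 0.96 in the expected yield.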
We will closely inspect the model residuals. There are two types of residuals that we will work with: the raw residuals \(u_i = Y_i - \widehat{Y}_i\) and the standardized residuals \(u_i\big/\sqrt{s^2\,(1 - h_{ii})}\), where \(h_{ii}\) denotes the \(i\)-th diagonal element of the hat matrix.
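Both can be extracted directly in R (a minimal sketch; stdres() comes from the MASS package, and we also reconstruct it manually below):
u  <- residuals(aLogMg) # raw residuals
us <- stdres(aLogMg)    # standardized residuals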
Here are the three most important graphical diagnostic tools, nicely formatted by plotLM():
par(mar = c(4,4,3,0.5))
plotLM(aLogMg) # function from 'mffSM'
How can we manually reconstruct the plots above?
par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 1)
### manual computation
plot(fitted(aLogMg), residuals(aLogMg),
lwd=2, col="blue4", bg="skyblue", pch=21)
abline(h=0, col = "grey", lty = 3)
lines(lowess(fitted(aLogMg), residuals(aLogMg)), col="red", lwd=3)
par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 2, ask = FALSE)
### standardized residuals
H <- qr.Q(aLogMg[["qr"]]) %*% t(qr.Q(aLogMg[["qr"]])) # hat matrix H = X (X'X)^{-1} X' via the QR decomposition
M <- diag(nrow(H)) - H                                # M = I - H, projection onto the residual space
stdres <- residuals(aLogMg)/sqrt(deviance(aLogMg)/aLogMg$df.residual * diag(M)) # u_i / sqrt(s^2 (1 - h_ii))
qq = qqnorm(stdres, plot.it=FALSE)
# all.equal(stdres(aLogMg), stdres) # --> TRUE (computed by MASS function)
# all.equal(qq$y, stdres) # --> TRUE (will be on y-axis)
plot(qq$x, qq$y, lwd=2, col="blue4", bg="skyblue", pch=21)
qqline(stdres, lwd=2, lty=3)
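As a side check: the diagonal of the hat matrix H contains the leverages \(h_{ii}\), which R also computes directly via hatvalues():
all.equal(hatvalues(aLogMg), diag(H), check.attributes = FALSE) # --> TRUE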
par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(aLogMg, which = 3, ask = FALSE)
### manual computation
plot(fitted(aLogMg), sqrt(abs(stdres)),
lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(fitted(aLogMg), sqrt(abs(stdres))), col="red", lwd=3)
Note that there are actually infinitely many possibilities for checking that \(\mathsf{E}\,(u_i| \boldsymbol{X}_i) = 0\). A scatterplot of fitted values vs. residuals is only one of them. Another common possibility is to plot the covariate values against the residuals.
par(mfrow = c(1,2), mar = c(4,4,2,0.5))
plot(log2(Dris$Mg), residuals(aLogMg),
xlab=expression(log[2](Mg)), ylab="Residuals",
main="Residuals vs. covariate in model",
pch = 21, col = "blue4", bg = "skyblue")
lines(lowess(residuals(aLogMg)~log2(Dris$Mg)), col="red")
plot(log2(Dris$Ca), residuals(aLogMg),
xlab=expression(log[2](Ca)), ylab="Residuals",
main="Residuals vs. covariate not in model",
pch = 21, col = "blue4", bg = "skyblue")
lines(lowess(residuals(aLogMg)~log2(Dris$Ca)), col="red")
Also check the following and explain the results:
sum(aLogMg$residuals)
## [1] -5.522666e-14
sum(aLogMg$residuals*log2(Dris$Mg))
## [1] -2.58342e-13
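Both sums are numerically zero. This is a consequence of the normal equations \(\mathbb{X}^\top \boldsymbol{u} = \boldsymbol{0}\) (with \(\boldsymbol{u}\) the vector of residuals): the residuals are orthogonal to every column of the model matrix, including the intercept column. All columns can be checked at once:
crossprod(model.matrix(aLogMg), residuals(aLogMg)) # all entries numerically zero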
Compare the plots for model aLogMg with the interaction model from the previous exercise:
par(mar = c(4,4,3,0.5))
plotLM(lm(yield ~ log2(Mg) + log2(Ca) + I(log2(Mg) * log2(Ca)), data = Dris))
As an example of possible model improvements, we will consider a different dataset and fit a series of models that we will then compare with each other.
A dataset providing a series of (independent) measurements of head acceleration in a simulated motorcycle accident (used to test crash helmets) is given in the datafile Motorcycle (within the R package mffSM).
data(Motorcycle, package = "mffSM")
# help(Motorcycle)
head(Motorcycle)
## time haccel
## 1 2.4 0.0
## 2 2.6 -1.3
## 3 3.2 -2.7
## 4 3.6 0.0
## 5 4.0 -2.7
## 6 6.2 -2.7
dim(Motorcycle)
## [1] 133 2
summary(Motorcycle)
## time haccel
## Min. : 2.40 Min. :-134.00
## 1st Qu.:15.80 1st Qu.: -54.90
## Median :24.00 Median : -13.30
## Mean :25.55 Mean : -25.55
## 3rd Qu.:35.20 3rd Qu.: 0.00
## Max. :65.60 Max. : 75.00
We are primarily interested in how the head acceleration (the response haccel) depends on the time after the accident (the covariate time, given in milliseconds).
par(mfrow = c(1,1), mar = c(4,4,1,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
xlab = "Time [ms]", ylab = "Head acceleration [g]")
fit0 = lm(haccel~time, data=Motorcycle)
abline(fit0)
Using the diagnostic plots, we can clearly see that the model denoted as fit0 is hardly appropriate:
par(mar = c(4,4,3,0.5))
plotLM(fit0)
Since the plot reveals a clearly nonlinear shape, we need a more flexible model to capture the structure of the underlying conditional expectation of the head acceleration given the time. We will use polynomials (of different degrees) for fitting the model (while still staying within the linear regression framework).
fit6 <- lm(haccel~poly(time, 6, raw=TRUE), data=Motorcycle)
fit12 <- lm(haccel~poly(time, 12, raw=TRUE), data=Motorcycle)
The fitted models can be visualized as follows:
xgrid <- seq(0, 70, length = 500)
par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
main = "Fitted polynomial curves",
xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(fit6, newdata=data.frame(time=xgrid)), lwd=2, col="blue")
lines(xgrid, predict(fit12, newdata=data.frame(time=xgrid)), lwd=2, col="darkgreen")
legend(37,-70, lty=rep(1, 2), col=c("blue", "darkgreen"),
legend=paste("p=", c(6,12),sep=""), lwd=c(2,2), cex=1.2)
Classical (raw) polynomials are, however, not very effective computationally: especially for higher degrees of the polynomial approximation, the estimation tends to be numerically unstable. To avoid such numerical problems you can use orthogonal polynomials with raw=FALSE, which stabilizes the magnitude (and the standard errors) of the estimated coefficients:
summary(fit6)
##
## Call:
## lm(formula = haccel ~ poly(time, 6, raw = TRUE), data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.039 -23.622 -0.497 24.032 75.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.069e+02 4.349e+01 -4.757 5.29e-06 ***
## poly(time, 6, raw = TRUE)1 9.658e+01 1.558e+01 6.199 7.47e-09 ***
## poly(time, 6, raw = TRUE)2 -1.302e+01 1.865e+00 -6.982 1.49e-10 ***
## poly(time, 6, raw = TRUE)3 7.160e-01 1.012e-01 7.078 9.04e-11 ***
## poly(time, 6, raw = TRUE)4 -1.870e-02 2.732e-03 -6.846 2.97e-10 ***
## poly(time, 6, raw = TRUE)5 2.321e-04 3.572e-05 6.496 1.73e-09 ***
## poly(time, 6, raw = TRUE)6 -1.102e-06 1.799e-07 -6.128 1.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.09 on 126 degrees of freedom
## Multiple R-squared: 0.5524, Adjusted R-squared: 0.5311
## F-statistic: 25.91 on 6 and 126 DF, p-value: < 2.2e-16
summary(lm(haccel~poly(time, 6, raw=FALSE), data=Motorcycle))
##
## Call:
## lm(formula = haccel ~ poly(time, 6, raw = FALSE), data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.039 -23.622 -0.497 24.032 75.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.546 2.869 -8.903 5.01e-15 ***
## poly(time, 6, raw = FALSE)1 160.611 33.091 4.854 3.52e-06 ***
## poly(time, 6, raw = FALSE)2 99.015 33.091 2.992 0.00333 **
## poly(time, 6, raw = FALSE)3 -244.969 33.091 -7.403 1.67e-11 ***
## poly(time, 6, raw = FALSE)4 64.683 33.091 1.955 0.05283 .
## poly(time, 6, raw = FALSE)5 171.291 33.091 5.176 8.68e-07 ***
## poly(time, 6, raw = FALSE)6 -202.776 33.091 -6.128 1.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.09 on 126 degrees of freedom
## Multiple R-squared: 0.5524, Adjusted R-squared: 0.5311
## F-statistic: 25.91 on 6 and 126 DF, p-value: < 2.2e-16
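The raw and the orthogonal polynomial bases span the same column space, so both fits are identical up to numerical error; only the parametrization of the coefficients differs:
fit6o <- lm(haccel ~ poly(time, 6), data = Motorcycle) # raw = FALSE is the default
all.equal(fitted(fit6), fitted(fit6o)) # --> TRUE (up to numerical tolerance)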
Another possible way to introduce some reasonable stability into a model that uses higher-order polynomials is to use splines.
### Choice of knots
knots <- c(0, 11, 12, 13, 20, 30, 32, 34, 40, 50, 70)
(inner <- knots[-c(1, length(knots))]) # inner knots
## [1] 11 12 13 20 30 32 34 40 50
(bound <- knots[c(1, length(knots))]) # boundary knots
## [1] 0 70
### B-splines of degree 3
DEGREE <- 3 # piecewise cubic
Bgrid <- bs(xgrid, knots = inner, Boundary.knots = bound, degree = DEGREE,
intercept = TRUE)
dim(Bgrid) # 13-dimensional linear space of piecewise polynomials
## [1] 500 13
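The dimension matches the general formula for a B-spline basis with intercept: (number of inner knots) + degree + 1 = 9 + 3 + 1 = 13.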
The B-spline basis consists of the following functions:
COL <- rainbow_hcl(ncol(Bgrid), c = 80)
par(mfrow = c(1,1), mar = c(4,4,1,1))
plot(xgrid, Bgrid[,1], type = "l", col = COL[1], lwd = 2, xlab = "x",
ylab = "B(x)", xlim = range(knots), ylim = range(Bgrid, na.rm = TRUE))
for (j in 2:ncol(Bgrid)) lines(xgrid, Bgrid[,j], col = COL[j], lwd = 2)
points(knots, rep(0, length(knots)),
pch = 21, bg = "blue", col = "skyblue", cex = 1.5)
Thus, we can use the B-spline basis to fit the model.
m1 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound,
degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
This is the fitted curve:
par(mfrow = c(1,1), mar = c(4,4,2,1))
plot(haccel ~ time, data = Motorcycle, pch = 21, bg = "orange", col = "red3",
main = "Fitted B-spline curve",
xlab = "Time [ms]", ylab = "Head acceleration [g]")
lines(xgrid, predict(m1, newdata=data.frame(time=xgrid)), lwd=2, col="blue")
names(m1$coefficients) <- paste("B", 1:ncol(Bgrid))
summary(m1)
##
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound,
## degree = DEGREE, intercept = TRUE) - 1, data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.857 -11.740 -0.268 10.356 55.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## B 1 -11.614 83.252 -0.140 0.889
## B 2 12.450 81.282 0.153 0.879
## B 3 -13.988 45.345 -0.308 0.758
## B 4 2.984 18.894 0.158 0.875
## B 5 6.115 12.139 0.504 0.615
## B 6 -237.283 18.850 -12.588 < 2e-16 ***
## B 7 17.346 14.094 1.231 0.221
## B 8 53.248 12.635 4.214 4.88e-05 ***
## B 9 5.102 13.051 0.391 0.697
## B 10 12.609 17.146 0.735 0.464
## B 11 -34.949 38.161 -0.916 0.362
## B 12 58.255 65.045 0.896 0.372
## B 13 -93.068 75.924 -1.226 0.223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8278
## F-statistic: 50.19 on 13 and 120 DF, p-value: < 2.2e-16
Since the intercept is included within the B-spline basis (intercept = TRUE in bs), we omitted it from the model formula with -1. This use, however, comes with some issues deeply rooted in the implementation. For example, neither the Multiple R-squared nor the F-statistic in the output corresponds to the usual quantities. Simply put, summary.lm assumes that no intercept is included, even though the intercept is still a linear combination of the given columns.
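This can be verified from the terms object of the fitted model:
attr(terms(m1), "intercept") # 0 --> summary.lm uses the uncentered total sum of squares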
Can you explain why the values reported for \(R^2\) and \(R_{adj}^2\) are not appropriate? What is the right meaning of these numbers? Can you manually reconstruct them?
The \(R^2\) value from the model is
summary(m1)$r.squared
## [1] 0.8446597
and a manual calculation reproduces it:
1 - deviance(m1)/sum(Motorcycle$haccel^2)
## [1] 0.8446597
sum(fitted(m1)^2)/sum(Motorcycle$haccel^2)
## [1] 0.8446597
The F-test from the summary output above corresponds to a test of a submodel; since summary.lm believes there is no intercept, the submodel is the empty model without any columns at all, not the usual intercept-only model:
summary(m1)$fstatistic
## value numdf dendf
## 50.19211 13.00000 120.00000
# the usual F-test (correct)
m2 <- lm(haccel ~ 1, data = Motorcycle)
anova(m2, m1)
## Analysis of Variance Table
##
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = TRUE) - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132 308223
## 2 120 61362 12 246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# the wrong F-test (in summary m1)
(m0 <- lm(haccel ~ -1, data=Motorcycle))
##
## Call:
## lm(formula = haccel ~ -1, data = Motorcycle)
##
## No coefficients
anova(m0 ,m1)
## Analysis of Variance Table
##
## Model 1: haccel ~ -1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = TRUE) - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 133 395017
## 2 120 61362 13 333655 50.192 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nevertheless, as the intercept parameter is in the linear span of the columns of \(\mathbb{X}\), it makes sense to calculate the (usual) multiple \(R^2\) value. This can be done ‘manually’ as follows:
var(fitted(m1)) / var(Motorcycle$haccel)
## [1] 0.8009163
cor(Motorcycle$haccel, fitted(m1))^2
## [1] 0.8009163
1 - deviance(m1)/deviance(m2)
## [1] 0.8009163
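The usual adjusted \(R^2\) follows analogously; with \(n = 133\) observations and 13 estimated coefficients it reproduces the Adjusted R-squared of the reparametrized model shown below:
n <- nobs(m1)
p <- length(coef(m1))
1 - (deviance(m1)/(n - p)) / (deviance(m2)/(n - 1)) # ~ 0.781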
Alternatively, one can use the B-spline basis without the intercept (intercept = FALSE) together with an explicit intercept in the model to get meaningful numbers in the output:
m0 <- lm(haccel ~ bs(time, knots = inner, Boundary.knots = bound,
degree = DEGREE, intercept = FALSE), data = Motorcycle)
names(m0$coefficients) <- c("(Intercept)", paste("B", 1:12))
summary(m0)
##
## Call:
## lm(formula = haccel ~ bs(time, knots = inner, Boundary.knots = bound,
## degree = DEGREE, intercept = FALSE), data = Motorcycle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.857 -11.740 -0.268 10.356 55.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.614 83.252 -0.140 0.8893
## B 1 24.064 161.970 0.149 0.8821
## B 2 -2.374 57.835 -0.041 0.9673
## B 3 14.598 93.327 0.156 0.8760
## B 4 17.729 80.496 0.220 0.8261
## B 5 -225.669 88.698 -2.544 0.0122 *
## B 6 28.960 82.780 0.350 0.7271
## B 7 64.862 84.773 0.765 0.4457
## B 8 16.716 83.997 0.199 0.8426
## B 9 24.223 85.172 0.284 0.7766
## B 10 -23.335 91.337 -0.255 0.7988
## B 11 69.869 105.918 0.660 0.5107
## B 12 -81.454 112.465 -0.724 0.4703
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.61 on 120 degrees of freedom
## Multiple R-squared: 0.8009, Adjusted R-squared: 0.781
## F-statistic: 40.23 on 12 and 120 DF, p-value: < 2.2e-16
anova(m2,m0)
## Analysis of Variance Table
##
## Model 1: haccel ~ 1
## Model 2: haccel ~ bs(time, knots = inner, Boundary.knots = bound, degree = DEGREE,
## intercept = FALSE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 132 308223
## 2 120 61362 12 246861 40.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
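Since the explicit intercept together with the intercept-free basis spans the same 13-dimensional space as before, the fits m0 and m1 coincide; only the coefficients are parametrized differently:
all.equal(fitted(m0), fitted(m1)) # --> TRUE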
What about the interpretation of the classical diagnostic plots for the model fitted with B-splines?
par(mar = c(4,4,3,0.5))
plotLM(m1)
It seems there is an issue with the homoscedasticity assumption. Let us change the x-axis of the third (Scale-Location) plot to see how the variability changes with the fitted values and with time:
par(mfrow = c(1,2), mar = c(4,4,2,1))
plot(abs(fitted(m1)), sqrt(abs(stdres(m1))),
main = "Abs(fitted)",
lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(abs(fitted(m1)), sqrt(abs(stdres(m1)))), col="red", lwd=3)
plot(Motorcycle$time, sqrt(abs(stdres(m1))),
main = "Time",
lwd=2, col="blue4", bg="skyblue", pch=21)
lines(lowess(Motorcycle$time, sqrt(abs(stdres(m1)))), col="red", lwd=3)
How to adjust the inference for heteroscedastic models will be shown later.