Download this R markdown as: R, Rmd.
Outline of this lab session:
We will consider a simple linear regression model - modelling the
amount of yield
given the information about the
concentration of magnesium Mg
. For some model improvements
we will use the logarithm transformation of the independent covariate
(Mg
).
library("mffSM")
library("splines") # B-splines modelling
library("MASS") # function stdres for standardized residuals
library("car") # confidenceEllipse() function
### Working data
data(Dris, package = "mffSM")
head(Dris)
## yield N P K Ca Mg
## 1 5.47 470 47 320 47 11
## 2 5.63 530 48 357 60 16
## 3 5.63 530 48 310 63 16
## 4 4.84 482 47 357 47 13
## 5 4.84 506 48 294 52 15
## 6 4.21 500 45 283 61 14
summary(Dris)
## yield N P K Ca Mg
## Min. :1.920 Min. :268.0 Min. :32.00 Min. :200.0 Min. :29.00 Min. : 8.00
## 1st Qu.:4.040 1st Qu.:427.0 1st Qu.:43.00 1st Qu.:338.0 1st Qu.:41.75 1st Qu.:10.00
## Median :4.840 Median :473.0 Median :49.00 Median :377.0 Median :49.00 Median :11.00
## Mean :4.864 Mean :469.9 Mean :48.65 Mean :375.4 Mean :51.45 Mean :11.64
## 3rd Qu.:5.558 3rd Qu.:518.0 3rd Qu.:54.00 3rd Qu.:407.0 3rd Qu.:59.00 3rd Qu.:13.00
## Max. :8.792 Max. :657.0 Max. :72.00 Max. :580.0 Max. :95.00 Max. :22.00
### a simple linear model fitted in the last session
aLogMg <- lm(yield ~ log2(Mg), data = Dris)
summary(aLogMg)
##
## Call:
## lm(formula = yield ~ log2(Mg), data = Dris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1194 -0.7412 -0.0741 0.7451 3.9841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4851 0.7790 1.907 0.0574 .
## log2(Mg) 0.9605 0.2208 4.349 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared: 0.04915, Adjusted R-squared: 0.04655
## F-statistic: 18.92 on 1 and 366 DF, p-value: 1.772e-05
We will use the model above and we will construct three different types of the confidence intervals/bounds for the model:
Firstly, use the help session in R to find out, what is the purpose
of the predict
command? Try the following:
help(predict)
help(predict.lm)
What is the difference between the outputs from “predict” and “fitted”? A few values are shown in the output below.
data.frame(Predict = predict(aLogMg), Fitted = fitted(aLogMg))[1:10,]
## Predict Fitted
## 1 4.807918 4.807918
## 2 5.327136 5.327136
## 3 5.327136 5.327136
## 4 5.039407 5.039407
## 5 5.237704 5.237704
## 6 5.142100 5.142100
## 7 5.142100 5.142100
## 8 5.327136 5.327136
## 9 5.327136 5.327136
## 10 5.142100 5.142100
all.equal(predict(aLogMg), fitted(aLogMg))
## [1] TRUE
Now, we will use the same command, however, for some new value of \(X\) (concentration of magnesium)
predict(aLogMg, newdata = data.frame(Mg = 10))
## 1
## 4.675846
(point.fit <- aLogMg$coef[1] + aLogMg$coef[2]*log2(10))
## (Intercept)
## 4.675846
It can be even obtained with the corresponding standard error
predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)
## $fit
## 1
## 4.675846
##
## $se.fit
## [1] 0.07063962
##
## $df
## [1] 366
##
## $residual.scale
## [1] 1.069745
The standard error can be also reconstructed manually
lvec <- c(1, log2(10))
(se.fit <- c(sqrt(t(lvec) %*% vcov(aLogMg) %*% lvec)))
## [1] 0.07063962
all.equal(predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)$se.fit,
se.fit)
## [1] TRUE
# Recall - residual degrees of freedom, i.e. $df
aLogMg$df.residual
## [1] 366
# sqrt(MS_e) = estimate of sigma, i.e. $residual scale
sqrt(deviance(aLogMg)/aLogMg$df.residual)
## [1] 1.069745
summary(aLogMg)$sigma
## [1] 1.069745
Alternatively, the output can be also obtained from the
LSest()
function in the R library mffSM
:
LSest(aLogMg, L = c(1, log2(10)))
## Estimate Std. Error t value P value Lower Upper
## 1 4.675846 0.07063962 66.19296 < 2.22e-16 4.536935 4.814756
When we are interested in the estimate of the conditional mean \(E[Y | \boldsymbol{X} = \boldsymbol{x}]\) we can either provide the point estimate or the interval estimate instead. For the interval estimate, we need the confidence level and the estimate of the variance parameter.
It can be given automatically as follows;
predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 4.675846 4.536935 4.814756
Or, alternatively, we can reconstruct everything manually:
xt_beta <- as.numeric(crossprod(c(1, log2(10)), coef(aLogMg)))
xt_beta + c(-1, 1) * se.fit * qt(0.975, df = aLogMg[["df.residual"]])
## [1] 4.536935 4.814756
We can calculate a sequence of confidence intervals for \(E[Y | \boldsymbol{X} = \boldsymbol{x}]\) for different values of \(\boldsymbol{x}\)
predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 4.675846 4.536935 4.814756
## 2 4.807918 4.695321 4.920516
## 3 4.928491 4.815074 5.041908
## 4 5.039407 4.904194 5.174620
## 5 5.142100 4.975415 5.308785
## 6 5.237704 5.036446 5.438962
## 7 5.327136 5.090943 5.563328
## 8 5.411144 5.140735 5.681553
## 9 5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168
predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", level = 0.95, se.fit = TRUE)
## $fit
## fit lwr upr
## 1 4.675846 4.536935 4.814756
## 2 4.807918 4.695321 4.920516
## 3 4.928491 4.815074 5.041908
## 4 5.039407 4.904194 5.174620
## 5 5.142100 4.975415 5.308785
## 6 5.237704 5.036446 5.438962
## 7 5.327136 5.090943 5.563328
## 8 5.411144 5.140735 5.681553
## 9 5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168
##
## $se.fit
## 1 2 3 4 5 6 7 8 9 10
## 0.07063962 0.05725882 0.05767573 0.06875935 0.08476368 0.10234493 0.12011023 0.13751003 0.15433283 0.17050848
## 11
## 0.18602860
##
## $df
## [1] 366
##
## $residual.scale
## [1] 1.069745
A denser grid of x-values to obtain the pointwise confidence band AROUND the regression line:
x.Mg <- seq(min(Dris[, "Mg"]) - 1, max(Dris[, "Mg"]) + 1, length = 101)
ciEY.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "confidence",
level = 0.95)
Observations together with the estimated regression line and a sequence of confidence intervals for \(E[Y|\boldsymbol{X} = \boldsymbol{x}]\) are visualized as a band – confidence region (band) AROUND the regression line.
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(yield ~ log2(Mg), data = Dris,
ylab = "Yield", xlab = "Log2(Mg)",
pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red4")
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### Where is the band narrowest?
abline(v = mean(log2(Dris$Mg)))
abline(h = mean(Dris$yield))
Everything can be also plotted with respect to the original scale of the magnesium concentration:
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(yield ~ Mg, data = Dris,
ylab = "Yield", xlab = "Mg",
pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red4")
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
### What about now?
abline(v = 2^mean(log2(Dris$Mg)))
abline(h = mean(Dris$yield))
The second option provides a confidence band for the whole regression line simultaneously. Note, that there is different interpretation of the point-wise band constructed above (where each interval can be interpreted only individually, for some some given value of the explanatory vector \(\boldsymbol{X} = \boldsymbol{x}\)).
Some theoretical details towards the band FOR the whole regression line simultaneously were given during the lecture, nevertheless, the code provided below can be considered as purely illustrative. It is only important to understand the difference between
ciEY2.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), se.fit = TRUE)
forEY.aLogMg <- data.frame(
fit = ciEY2.aLogMg[["fit"]],
lwr = ciEY2.aLogMg[["fit"]] - ciEY2.aLogMg[["se.fit"]] * sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2),
upr = ciEY2.aLogMg[["fit"]] + ciEY2.aLogMg[["se.fit"]] * sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2))
head(forEY.aLogMg)
## fit lwr upr
## 1 4.181597 3.772108 4.591086
## 2 4.212914 3.820056 4.605772
## 3 4.243538 3.866851 4.620226
## 4 4.273501 3.912532 4.634470
## 5 4.302829 3.957135 4.648524
## 6 4.331550 4.000692 4.662407
The computation of the confidence band is based on the confidence ellipse for the regression parameters (\(\beta_0\), \(\beta_1\))
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
confidenceEllipse(aLogMg)
The regression lines can be visualized on the logarithm scale or the original scale of the logarithm concentration:
par(mfrow = c(1,2), mar = c(4,4,2.5,0.5))
## Log scale
plot(yield ~ log2(Mg), data = Dris,
ylab = "Yield", xlab = "Log2(Mg)", main = "Log scale",
pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
# Band AROUND
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Band FOR
lines(log2(x.Mg), forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topright", legend = c("For", "Around"), bty = "n",
col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)
## Original scale
plot(yield ~ Mg, data = Dris,
ylab = "Yield", xlab = "Mg", main = "Original scale",
pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
# Band AROUND
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Band FOR
lines(x.Mg, forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topright", legend = c("For", "Around"), bty = "n",
col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)
Finally, we will compare both confidence bands constructed above with one more confidence band – the prediction band (which will be again interpreted in a point-wise manner as the first confidence band around the regression line).
The idea is to construct an interval which covers with probability \(1 - \alpha\) the NEW value of the response \(Y\) obtained for \(\boldsymbol{X}\) being set to \(\boldsymbol{x}\).
Compare the following two commands:
predict(aLogMg, newdata = data.frame(Mg = 10), interval = "prediction")
## fit lwr upr
## 1 4.675846 2.567646 6.784046
predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence")
## fit lwr upr
## 1 4.675846 4.536935 4.814756
Manual calculation of the prediction interval can follow the lines below:
s = summary(aLogMg)$sigma
se.fit2 = sqrt(s^2 + (se.fit)^2)
xt_beta + c(-1, 1) * se.fit2 * qt(0.975, df = aLogMg[["df.residual"]])
## [1] 2.567646 6.784046
Prediction band - a sequence of prediction intervals for a dense grid of x-values:
pred.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "prediction", level = 0.95)
head(pred.aLogMg)
## fit lwr upr
## 1 4.181597 2.052618 6.310576
## 2 4.212914 2.085942 6.339886
## 3 4.243538 2.118440 6.368637
## 4 4.273501 2.150150 6.396852
## 5 4.302829 2.181106 6.424553
## 6 4.331550 2.211341 6.451759
Finally, we can visualize the result – using the logarithm scale or the original scale of the magnesium concentration:
par(mfrow = c(1,2), mar = c(4,4,2.5,0.5))
## Log scale
plot(yield ~ log2(Mg), data = Dris,
ylab = "Yield", xlab = "log2(Mg)", main = "Log scale",
pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
# Band AROUND the regression line
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Prediction band
lines(log2(x.Mg), pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topleft", legend = c("Prediction", "Around the regression line"), bty = "n",
col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)
## Original scale
plot(yield ~ Mg, data = Dris,
ylab = "Yield", xlab = "Mg", main = "Original scale",
pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
# Band AROUND the regression line
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Prediction band
lines(x.Mg, pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("bottomright", legend = c("Prediction", "Around the regression line"), bty = "n",
col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)