Download this R markdown as: R, Rmd.

Outline of this lab session:

Loading the data and libraries

We will consider a simple linear regression model - modelling the amount of yield given the information about the concentration of magnesium Mg. For some model improvements we will use the logarithm transformation of the independent covariate (Mg).

library("mffSM")
library("splines")       # B-splines modelling
library("MASS")          # function stdres for standardized residuals
library("car")           # confidenceEllipse() function

### Working data 
data(Dris, package = "mffSM")
head(Dris)
##   yield   N  P   K Ca Mg
## 1  5.47 470 47 320 47 11
## 2  5.63 530 48 357 60 16
## 3  5.63 530 48 310 63 16
## 4  4.84 482 47 357 47 13
## 5  4.84 506 48 294 52 15
## 6  4.21 500 45 283 61 14
summary(Dris)
##      yield             N               P               K               Ca              Mg       
##  Min.   :1.920   Min.   :268.0   Min.   :32.00   Min.   :200.0   Min.   :29.00   Min.   : 8.00  
##  1st Qu.:4.040   1st Qu.:427.0   1st Qu.:43.00   1st Qu.:338.0   1st Qu.:41.75   1st Qu.:10.00  
##  Median :4.840   Median :473.0   Median :49.00   Median :377.0   Median :49.00   Median :11.00  
##  Mean   :4.864   Mean   :469.9   Mean   :48.65   Mean   :375.4   Mean   :51.45   Mean   :11.64  
##  3rd Qu.:5.558   3rd Qu.:518.0   3rd Qu.:54.00   3rd Qu.:407.0   3rd Qu.:59.00   3rd Qu.:13.00  
##  Max.   :8.792   Max.   :657.0   Max.   :72.00   Max.   :580.0   Max.   :95.00   Max.   :22.00
### a simple linear model fitted in the last session
aLogMg <- lm(yield ~ log2(Mg), data = Dris)
summary(aLogMg) 
## 
## Call:
## lm(formula = yield ~ log2(Mg), data = Dris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1194 -0.7412 -0.0741  0.7451  3.9841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4851     0.7790   1.907   0.0574 .  
## log2(Mg)      0.9605     0.2208   4.349 1.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 366 degrees of freedom
## Multiple R-squared:  0.04915,    Adjusted R-squared:  0.04655 
## F-statistic: 18.92 on 1 and 366 DF,  p-value: 1.772e-05

We will use the model above and we will construct three different types of the confidence intervals/bounds for the model:

1. Pointwise confidence intervals (region/band) around the regression line

Firstly, use the help session in R to find out, what is the purpose of the predict command? Try the following:

help(predict)
help(predict.lm)

What is the difference between the outputs from “predict” and “fitted”? A few values are shown in the output below.

data.frame(Predict = predict(aLogMg), Fitted = fitted(aLogMg))[1:10,]
##     Predict   Fitted
## 1  4.807918 4.807918
## 2  5.327136 5.327136
## 3  5.327136 5.327136
## 4  5.039407 5.039407
## 5  5.237704 5.237704
## 6  5.142100 5.142100
## 7  5.142100 5.142100
## 8  5.327136 5.327136
## 9  5.327136 5.327136
## 10 5.142100 5.142100
all.equal(predict(aLogMg), fitted(aLogMg))
## [1] TRUE

Now, we will use the same command, however, for some new value of \(X\) (concentration of magnesium)

predict(aLogMg, newdata = data.frame(Mg = 10))
##        1 
## 4.675846
(point.fit <- aLogMg$coef[1] + aLogMg$coef[2]*log2(10))
## (Intercept) 
##    4.675846

It can be even obtained with the corresponding standard error

predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)
## $fit
##        1 
## 4.675846 
## 
## $se.fit
## [1] 0.07063962
## 
## $df
## [1] 366
## 
## $residual.scale
## [1] 1.069745

The standard error can be also reconstructed manually

lvec <- c(1, log2(10)) 
(se.fit <- c(sqrt(t(lvec) %*% vcov(aLogMg) %*% lvec)))
## [1] 0.07063962
all.equal(predict(aLogMg, newdata = data.frame(Mg = 10), se.fit = TRUE)$se.fit, 
          se.fit)
## [1] TRUE
# Recall - residual degrees of freedom, i.e. $df
aLogMg$df.residual
## [1] 366
# sqrt(MS_e) = estimate of sigma, i.e. $residual scale
sqrt(deviance(aLogMg)/aLogMg$df.residual)
## [1] 1.069745
summary(aLogMg)$sigma
## [1] 1.069745

Alternatively, the output can be also obtained from the LSest() function in the R library mffSM:

LSest(aLogMg, L = c(1, log2(10)))
##   Estimate Std. Error  t value    P value    Lower    Upper
## 1 4.675846 0.07063962 66.19296 < 2.22e-16 4.536935 4.814756

When we are interested in the estimate of the conditional mean \(E[Y | \boldsymbol{X} = \boldsymbol{x}]\) we can either provide the point estimate or the interval estimate instead. For the interval estimate, we need the confidence level and the estimate of the variance parameter.

It can be given automatically as follows;

predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 1 4.675846 4.536935 4.814756

Or, alternatively, we can reconstruct everything manually:

xt_beta <- as.numeric(crossprod(c(1, log2(10)), coef(aLogMg))) 
xt_beta + c(-1, 1) * se.fit * qt(0.975, df = aLogMg[["df.residual"]])
## [1] 4.536935 4.814756

We can calculate a sequence of confidence intervals for \(E[Y | \boldsymbol{X} = \boldsymbol{x}]\) for different values of \(\boldsymbol{x}\)

predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", level = 0.95)
##         fit      lwr      upr
## 1  4.675846 4.536935 4.814756
## 2  4.807918 4.695321 4.920516
## 3  4.928491 4.815074 5.041908
## 4  5.039407 4.904194 5.174620
## 5  5.142100 4.975415 5.308785
## 6  5.237704 5.036446 5.438962
## 7  5.327136 5.090943 5.563328
## 8  5.411144 5.140735 5.681553
## 9  5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168
predict(aLogMg, newdata = data.frame(Mg = 10:20), interval = "confidence", level = 0.95, se.fit = TRUE)
## $fit
##         fit      lwr      upr
## 1  4.675846 4.536935 4.814756
## 2  4.807918 4.695321 4.920516
## 3  4.928491 4.815074 5.041908
## 4  5.039407 4.904194 5.174620
## 5  5.142100 4.975415 5.308785
## 6  5.237704 5.036446 5.438962
## 7  5.327136 5.090943 5.563328
## 8  5.411144 5.140735 5.681553
## 9  5.490349 5.186859 5.793839
## 10 5.565271 5.229971 5.900570
## 11 5.636348 5.270529 6.002168
## 
## $se.fit
##          1          2          3          4          5          6          7          8          9         10 
## 0.07063962 0.05725882 0.05767573 0.06875935 0.08476368 0.10234493 0.12011023 0.13751003 0.15433283 0.17050848 
##         11 
## 0.18602860 
## 
## $df
## [1] 366
## 
## $residual.scale
## [1] 1.069745

A denser grid of x-values to obtain the pointwise confidence band AROUND the regression line:

x.Mg <- seq(min(Dris[, "Mg"]) - 1, max(Dris[, "Mg"]) + 1, length = 101)
ciEY.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "confidence", 
                       level = 0.95)

Observations together with the estimated regression line and a sequence of confidence intervals for \(E[Y|\boldsymbol{X} = \boldsymbol{x}]\) are visualized as a band – confidence region (band) AROUND the regression line.

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(yield ~ log2(Mg), data = Dris, 
     ylab = "Yield", xlab = "Log2(Mg)", 
     pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red4")
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)

### Where is the band narrowest? 
abline(v = mean(log2(Dris$Mg)))
abline(h = mean(Dris$yield))

Everything can be also plotted with respect to the original scale of the magnesium concentration:

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(yield ~ Mg, data = Dris, 
     ylab = "Yield", xlab = "Mg", 
     pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red4")
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)

### What about now?
abline(v = 2^mean(log2(Dris$Mg)))
abline(h = mean(Dris$yield))

2. Confidence region (band) FOR the (whole) regression line simultaneously

The second option provides a confidence band for the whole regression line simultaneously. Note, that there is different interpretation of the point-wise band constructed above (where each interval can be interpreted only individually, for some some given value of the explanatory vector \(\boldsymbol{X} = \boldsymbol{x}\)).

Some theoretical details towards the band FOR the whole regression line simultaneously were given during the lecture, nevertheless, the code provided below can be considered as purely illustrative. It is only important to understand the difference between

ciEY2.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), se.fit = TRUE)
forEY.aLogMg <- data.frame(
  fit = ciEY2.aLogMg[["fit"]],
  lwr = ciEY2.aLogMg[["fit"]] - ciEY2.aLogMg[["se.fit"]] * sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2),
  upr = ciEY2.aLogMg[["fit"]] + ciEY2.aLogMg[["se.fit"]] * sqrt(qf(0.95, 2, ciEY2.aLogMg[["df"]])*2))
head(forEY.aLogMg)
##        fit      lwr      upr
## 1 4.181597 3.772108 4.591086
## 2 4.212914 3.820056 4.605772
## 3 4.243538 3.866851 4.620226
## 4 4.273501 3.912532 4.634470
## 5 4.302829 3.957135 4.648524
## 6 4.331550 4.000692 4.662407

The computation of the confidence band is based on the confidence ellipse for the regression parameters (\(\beta_0\), \(\beta_1\))

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
confidenceEllipse(aLogMg)

The regression lines can be visualized on the logarithm scale or the original scale of the logarithm concentration:

par(mfrow = c(1,2), mar = c(4,4,2.5,0.5))
## Log scale
plot(yield ~ log2(Mg), data = Dris, 
     ylab = "Yield", xlab = "Log2(Mg)", main = "Log scale",
     pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
# Band AROUND
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Band FOR
lines(log2(x.Mg), forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topright", legend = c("For", "Around"), bty = "n",
       col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

## Original scale
plot(yield ~ Mg, data = Dris, 
     ylab = "Yield", xlab = "Mg", main = "Original scale",
     pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "darkgreen")
# Band AROUND
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Band FOR
lines(x.Mg, forEY.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, forEY.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topright", legend = c("For", "Around"), bty = "n",
       col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

3. Prediction interval (for some new given value of \(\boldsymbol{X}\))

Finally, we will compare both confidence bands constructed above with one more confidence band – the prediction band (which will be again interpreted in a point-wise manner as the first confidence band around the regression line).

The idea is to construct an interval which covers with probability \(1 - \alpha\) the NEW value of the response \(Y\) obtained for \(\boldsymbol{X}\) being set to \(\boldsymbol{x}\).

Compare the following two commands:

predict(aLogMg, newdata = data.frame(Mg = 10), interval = "prediction")    
##        fit      lwr      upr
## 1 4.675846 2.567646 6.784046
predict(aLogMg, newdata = data.frame(Mg = 10), interval = "confidence") 
##        fit      lwr      upr
## 1 4.675846 4.536935 4.814756

Manual calculation of the prediction interval can follow the lines below:

s = summary(aLogMg)$sigma
se.fit2 = sqrt(s^2 + (se.fit)^2)
xt_beta + c(-1, 1) * se.fit2 * qt(0.975, df = aLogMg[["df.residual"]])
## [1] 2.567646 6.784046

Prediction band - a sequence of prediction intervals for a dense grid of x-values:

pred.aLogMg <- predict(aLogMg, data.frame(Mg = x.Mg), interval = "prediction", level = 0.95)
head(pred.aLogMg)
##        fit      lwr      upr
## 1 4.181597 2.052618 6.310576
## 2 4.212914 2.085942 6.339886
## 3 4.243538 2.118440 6.368637
## 4 4.273501 2.150150 6.396852
## 5 4.302829 2.181106 6.424553
## 6 4.331550 2.211341 6.451759

Finally, we can visualize the result – using the logarithm scale or the original scale of the magnesium concentration:

par(mfrow = c(1,2), mar = c(4,4,2.5,0.5))
## Log scale
plot(yield ~ log2(Mg), data = Dris, 
     ylab = "Yield", xlab = "log2(Mg)", main = "Log scale",
     pch = 23, col = "orange", bg = "beige")
lines(log2(x.Mg), ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
# Band AROUND the regression line
lines(log2(x.Mg), ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(log2(x.Mg), ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Prediction band
lines(log2(x.Mg), pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(log2(x.Mg), pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("topleft", legend = c("Prediction", "Around the regression line"), bty = "n",
       col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

## Original scale
plot(yield ~ Mg, data = Dris, 
     ylab = "Yield", xlab = "Mg", main = "Original scale",
     pch = 23, col = "orange", bg = "beige")
lines(x.Mg, ciEY.aLogMg[, "fit"], lwd = 2, col = "red3")
# Band AROUND the regression line
lines(x.Mg, ciEY.aLogMg[, "lwr"], lwd = 2, col = "red3", lty = 2)
lines(x.Mg, ciEY.aLogMg[, "upr"], lwd = 2, col = "red3", lty = 2)
# Prediction band
lines(x.Mg, pred.aLogMg[, "lwr"], lwd = 2, col = "blue2", lty = 4)
lines(x.Mg, pred.aLogMg[, "upr"], lwd = 2, col = "blue2", lty = 4)
legend("bottomright", legend = c("Prediction", "Around the regression line"), bty = "n",
       col = c("blue2", "red3"), lty = c(4, 2), lwd = 2)

Individual work