Summer Semester 2024 | Lab Session 4 | 19.03.2024
The dataset mtcars (available in the standard R installation) will be used for the purposes of this lab session. Some insight into the data can be gained from the help page (command ?mtcars) or from the outputs provided below.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
For simplicity, we will only consider three explanatory (continuous) variables from the whole mtcars dataset, i.e. \(\boldsymbol{X} \in \mathbb{R}^3\). The generalization to \(\boldsymbol{X} \in \mathbb{R}^p\) for some \(p \in \mathbb{N}\) is straightforward. In particular, the (continuous) dependent variable \(Y \in \mathbb{R}\) is the car consumption (denoted in the dataset as mpg – miles per gallon) and the three independent variables will be the car's displacement (disp), the horsepower (hp), and the rear axle ratio (drat).
Three simple scatterplots showing the dependent variable \(Y\) against one independent variable at a time (based on the random sample \(\{(Y_i, \boldsymbol{X}_i^\top)^\top;~i = 1, \dots, 32\}\)) are shown below. However, only limited information about the whole multivariate structure of the underlying data can be taken from such a series of two-dimensional scatterplots.
par(mfrow = c(1,3))
plot(mpg ~ disp, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "Consumption [miles per gallon]", xlab = "Displacement [cubic inches]")
lines(lowess(mtcars$mpg ~ mtcars$disp), col = "red", lwd = 2)
plot(mpg ~ hp, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "", xlab = "Horse Power")
lines(lowess(mtcars$mpg ~ mtcars$hp), col = "red", lwd = 2)
plot(mpg ~ drat, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "", xlab = "Rear axle ratio")
lines(lowess(mtcars$mpg ~ mtcars$drat), col = "red", lwd = 2)
The red curves are non-parametric estimates of the conditional mean. These estimates offer a useful insight: the dependence of the consumption on the displacement and on the horsepower is very likely negative (regressive), while it seems to be positive (progressive) for the rear axle ratio.
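Beyond the three separate panels, a scatterplot matrix shows all pairwise relations in a single display; the following is an illustrative addition (not part of the original outputs):
### pairwise scatterplots of the response and the three regressors
pairs(mtcars[, c("mpg", "disp", "hp", "drat")], pch = 22, cex = 0.8, bg = "lightblue")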
A simple multiple regression model of the form \[ Y = \beta_0 + \beta_1 \textrm{disp} + \beta_2 \textrm{hp} + \beta_3 \textrm{drat} + \varepsilon \] or, written for the empirical data (for \(i = 1, \dots, 32\)), \[ Y_i = \boldsymbol{X}_i^\top\boldsymbol{\beta} + \varepsilon_i, \] where \(\boldsymbol{X} = (1, \textrm{disp}, \textrm{hp}, \textrm{drat})^\top\) and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)^\top \in \mathbb{R}^4\), will be considered. In the R software, the model can be fitted using the following R code:
fit1 <- lm(mpg ~ disp + hp + drat, data = mtcars)
print(fit1)
##
## Call:
## lm(formula = mpg ~ disp + hp + drat, data = mtcars)
##
## Coefficients:
## (Intercept) disp hp drat
## 19.34429 -0.01923 -0.03123 2.71498
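The lm() function computes the ordinary least squares estimate \(\widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top\boldsymbol{Y}\), where \(\mathbb{X}\) is the \(32 \times 4\) design matrix. As a minimal illustrative sketch (not part of the original lab), the closed-form solution can be verified directly:
### design matrix: intercept column plus the three regressors
X <- cbind(1, mtcars$disp, mtcars$hp, mtcars$drat)
Y <- mtcars$mpg
### solve the normal equations (X'X) beta = X'Y; should match coef(fit1)
solve(t(X) %*% X, t(X) %*% Y)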
The print(fit1) output above gives the values of the four estimated parameters – the elements of the parameter vector \(\boldsymbol{\beta} \in \mathbb{R}^4\). Note that these estimates are not the same as the estimates obtained from three separate simple linear regression models. Compare the output above with the following outputs:
fit_disp <- lm(mpg ~ disp, data = mtcars)
fit_hp <- lm(mpg ~ hp, data = mtcars)
fit_drat <- lm(mpg ~ drat, data = mtcars)
print(fit_disp)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Coefficients:
## (Intercept) disp
## 29.59985 -0.04122
print(fit_hp)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Coefficients:
## (Intercept) hp
## 30.09886 -0.06823
print(fit_drat)
##
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
##
## Coefficients:
## (Intercept) drat
## -7.525 7.678
Indeed, while the signs of the estimated parameters correspond, the magnitudes are substantially different. The simple regression lines can be plotted directly into the corresponding scatterplots:
par(mfrow = c(1,3))
plot(mpg ~ disp, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "Consumption [miles per gallon]", xlab = "Displacement [cubic inches]")
lines(lowess(mtcars$mpg ~ mtcars$disp), col = "red", lwd = 2)
abline(fit_disp, col = "blue")
plot(mpg ~ hp, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "", xlab = "Horse Power")
lines(lowess(mtcars$mpg ~ mtcars$hp), col = "red", lwd = 2)
abline(fit_hp, col = "blue")
plot(mpg ~ drat, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "", xlab = "Rear axle ratio")
lines(lowess(mtcars$mpg ~ mtcars$drat), col = "red", lwd = 2)
abline(fit_drat, col = "blue")
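One reason for the discrepancy between the simple and the multiple regression estimates is the correlation among the regressors themselves; a quick illustrative check (not part of the original outputs):
### correlations among the three regressors
cor(mtcars[, c("disp", "hp", "drat")])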
However, the multiple regression model does not estimate a line but a three-dimensional regression hyperplane. For instance, the estimated intercept (\(\approx 29.60\)) from the simple regression model fit_disp stands for the expected consumption (miles driven per one gallon) if the displacement of the engine were equal to zero cubic inches (i.e., an unrealistic scenario).
On the other hand, the intercept estimate \(\widehat{\beta}_0 = 19.34\) from the multiple model stands for the expected car consumption (again in miles driven per one gallon) if all three regressors – the displacement, the horsepower, and also the rear axle ratio – are equal to zero. Therefore, to draw a simple line in a two-dimensional scatterplot, one needs to pre-select some reasonable (realistic) values for all other covariates in the model.
For instance, consider the multiple model from above and, in particular, the scatterplot of the consumption against the displacement. Various regression lines can be considered:
plot(mpg ~ disp, data = mtcars, pch = 22, cex = 0.8, bg = "lightblue", ylab = "Consumption [miles per gallon]", xlab = "Displacement [cubic inches]")
xGrid <- seq(min(mtcars$disp), max(mtcars$disp), length.out = 100)
mCoeff <- coef(fit1)
yMeanMean <- mCoeff[1] + mCoeff[2] * xGrid + mCoeff[3] * mean(mtcars$hp) + mCoeff[4] * mean(mtcars$drat)
lines(yMeanMean ~ xGrid, col = "red", lwd = 2)
yMinMin <- mCoeff[1] + mCoeff[2] * xGrid + mCoeff[3] * min(mtcars$hp) + mCoeff[4] * min(mtcars$drat)
yMaxMax<- mCoeff[1] + mCoeff[2] * xGrid + mCoeff[3] * max(mtcars$hp) + mCoeff[4] * max(mtcars$drat)
lines(yMinMin ~ xGrid, col = "blue", lwd = 2, lty = 2)
lines(yMaxMax ~ xGrid, col = "darkgreen", lwd = 2, lty = 2)
legend("topright", legend = c("Average HP & average rear axle ratio", "Min HP & min rear axle ratio", "Max HP & max rear axle ratio"), col =c("red", "blue", "darkgreen"), lwd = c(3,3,3), lty = c(1,2,2))
The main purpose of statistics is inference – i.e., to say something relevant about the whole (unknown) population (from which we have a small random sample \(\{(Y_i, \boldsymbol{X}_i^\top)^\top;~i = 1, \dots, 32\}\)) using only the information (data) in the sample and the techniques of statistical inference. There are two main tools of statistical inference – confidence intervals for the unknown parameters and hypothesis tests.
While the confidence intervals must be calculated manually – although using just the standard output of the regression model – some statistical tests are provided automatically. The statistical output of the multiple linear regression model fitted above can be obtained by the command summary():
summary(fit1)
##
## Call:
## lm(formula = mpg ~ disp + hp + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1225 -1.8454 -0.4456 1.1342 6.4958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.344293 6.370882 3.036 0.00513 **
## disp -0.019232 0.009371 -2.052 0.04960 *
## hp -0.031229 0.013345 -2.340 0.02663 *
## drat 2.714975 1.487366 1.825 0.07863 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.008 on 28 degrees of freedom
## Multiple R-squared: 0.775, Adjusted R-squared: 0.7509
## F-statistic: 32.15 on 3 and 28 DF, p-value: 3.28e-09
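As mentioned above, confidence intervals are not printed automatically but can be calculated by hand from the estimates and standard errors in this output (the \(t\)-quantile has \(n - 4 = 28\) degrees of freedom). A minimal sketch for the coefficient of disp:
### 95% confidence interval for the coefficient of disp, by hand
est <- coef(summary(fit1))["disp", "Estimate"]
se <- coef(summary(fit1))["disp", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 28) * se
### compare with the built-in function
confint(fit1, "disp")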
Four statistical tests – one for each estimated parameter – are reported in the output by default; the corresponding \(p\)-values are given in the fourth column of the coefficient table. In each case, the null hypothesis states that the corresponding parameter (not the estimate, but the unknown true value) is equal to zero, i.e.,
\[ H_0: \beta_j = 0 \] against the general alternative \[ H_A: \beta_j \neq 0, \] where \(j \in \{0, 1, 2, 3\}\) indexes the rows of the output table, i.e., the elements of the parameter vector \(\boldsymbol{\beta} \in \mathbb{R}^4\).
What is the interpretation of the test for the intercept term ((Intercept))? And what is the interpretation of the tests for the three covariates (disp, hp, and drat)?
Considering the significance level \(\alpha = 0.05\), we cannot reject the null hypothesis \(H_0: \beta_3 = 0\), which (in words) means that, given the displacement and the horsepower, there is no statistically significant (linear) dependence of the consumption on the rear axle ratio. Thus, instead of the model with three predictor variables we can consider a smaller (restricted) model with only the information on the displacement (disp) and the horsepower (hp).
fit2 <- lm(mpg ~ disp + hp, data = mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ disp + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7945 -2.3036 -0.8246 1.8582 6.9363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
## disp -0.030346 0.007405 -4.098 0.000306 ***
## hp -0.024840 0.013385 -1.856 0.073679 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 29 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7309
## F-statistic: 43.09 on 2 and 29 DF, p-value: 2.062e-09
In the restricted model, using again the significance level \(\alpha = 0.05\), we see that the null hypothesis \(H_0: \beta_2 = 0\) (the coefficient of hp) is not rejected – the corresponding \(p\)-value \(0.0737\) is slightly higher than \(0.05\). Thus, we can again consider a submodel, where the car consumption is only explained by the information about the engine displacement:
fit3 <- lm(mpg ~ disp, data = mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
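The nested models fit3, fit2, and fit1 can also be compared directly by partial \(F\)-tests using the standard function anova(); for a single dropped term these agree with the \(t\)-tests above. An illustrative addition:
### partial F-tests between the nested models fit3, fit2, fit1
anova(fit3, fit2, fit1)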
In the last model (fit3), both statistical tests return a significant \(p\)-value and the model cannot be reduced any further. Thus, we have performed a model selection procedure: from all considered models we selected one final model that is (in some well-defined sense) the most appropriate one. This particular strategy is called the stepwise backward selection strategy. The idea is to fit the full model first and then, consecutively, to reduce the model by removing the least significant predictors. The final model is the one from which no further predictor can be removed. This is the procedure that we just applied above.
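The single-term deletions underlying the backward strategy can also be inspected in one call with the standard function drop1() (shown here only as an illustrative shortcut; the lab performs the individual steps manually):
### F-tests for dropping each term from the full model
drop1(fit1, test = "F")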
An alternative is the stepwise forward selection strategy, where the idea is reversed: we start with all possible (three, in this case) simple regression models (as fitted above) and we conclude that the most significant dependence is in the model relating the consumption to the displacement – compare the \(p\)-values below:
summary(fit_disp)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
summary(fit_hp)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
summary(fit_drat)
##
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0775 -2.6803 -0.2095 2.2976 9.0225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.525 5.477 -1.374 0.18
## drat 7.678 1.507 5.096 1.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.485 on 30 degrees of freedom
## Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
## F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
Thus, the model that we use to proceed further is the one in which the consumption is explained by the displacement. In the second step of the stepwise forward selection procedure we look for another covariate (out of the two remaining ones, hp or drat) which can be added into the model (meaning that the corresponding \(p\)-value of the test is significant). We consider two models:
fit4 <- lm(mpg ~ disp + hp, data = mtcars)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ disp + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7945 -2.3036 -0.8246 1.8582 6.9363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
## disp -0.030346 0.007405 -4.098 0.000306 ***
## hp -0.024840 0.013385 -1.856 0.073679 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 29 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7309
## F-statistic: 43.09 on 2 and 29 DF, p-value: 2.062e-09
fit5 <- lm(mpg ~ disp + drat, data = mtcars)
summary(fit5)
##
## Call:
## lm(formula = mpg ~ disp + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1265 -2.2045 -0.5835 1.4497 6.9884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.844880 6.747971 3.237 0.00302 **
## disp -0.035694 0.006653 -5.365 9.19e-06 ***
## drat 1.802027 1.542091 1.169 0.25210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.232 on 29 degrees of freedom
## Multiple R-squared: 0.731, Adjusted R-squared: 0.7125
## F-statistic: 39.41 on 2 and 29 DF, p-value: 5.385e-09
In both cases the \(p\)-value of the added covariate is larger than \(0.05\). Thus, the final model is the one denoted as fit3 (which is, of course, equivalent to the model denoted as fit_disp).
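Analogously to the backward strategy, a forward step can be inspected in a single call with the standard function add1() (again only an illustrative shortcut):
### F-tests for adding each remaining covariate to fit3
add1(fit3, scope = ~ disp + hp + drat, test = "F")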
There are also other selection strategies, based on various selection criteria. Typically used criteria are, for instance, the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), but there are others as well.
For the AIC criterion, there is a standard R function AIC():
AIC(fit3)
## [1] 170.2094
### compare also with the AIC for the remaining two (simple) regression models
AIC(fit_hp)
## [1] 181.2386
AIC(fit_drat)
## [1] 190.7999
The BIC criterion can be computed analogously with the standard function BIC(); it is also provided, for instance, by the library flexmix (installation with the command install.packages("flexmix") is needed):
library(flexmix)
## Loading required package: lattice
BIC(fit3)
## [1] 174.6066
### compare also with the BIC for the remaining two (simple) regression models
BIC(fit_hp)
## [1] 185.6358
BIC(fit_drat)
## [1] 195.1971
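Both criteria can also drive an automated search; for instance, the standard function step() performs backward selection minimizing the AIC by default, and the BIC when the penalty is set to k = log(n). An illustrative sketch:
### automated backward selection by AIC (default penalty k = 2)
step(fit1, direction = "backward")
### the same search with the BIC penalty
step(fit1, direction = "backward", k = log(nrow(mtcars)))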
Finally, we will consider a regression dependence of a continuous variable \(Y\) on a categorical covariate \(X\) (taking more than two levels). We will use the number of cylinders (there are four-, six-, and eight-cylinder cars in the dataset). Note that the categorical covariate can be treated either as nominal or as ordinal.
boxplot(mpg ~ cyl, data = mtcars, col = "lightblue")
Compare and explain the following models:
fit6 <- lm(mpg ~ cyl, data = mtcars)
print(fit6)
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
fit7 <- lm(mpg ~ as.factor(cyl), data = mtcars)
print(fit7)
##
## Call:
## lm(formula = mpg ~ as.factor(cyl), data = mtcars)
##
## Coefficients:
## (Intercept) as.factor(cyl)6 as.factor(cyl)8
## 26.664 -6.921 -11.564
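The default (treatment) coding behind fit7 – with cyl == 4 serving as the reference level – can be inspected directly; an illustrative check:
### dummy variables generated by as.factor(cyl) in fit7
head(model.matrix(~ as.factor(cyl), data = mtcars))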
fit8 <- lm(mpg ~ cyl + I(cyl^2), data = mtcars)
summary(fit8)
##
## Call:
## lm(formula = mpg ~ cyl + I(cyl^2), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2636 -1.8357 0.0286 1.3893 7.2364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.3390 11.6471 4.064 0.000336 ***
## cyl -6.3078 4.1723 -1.512 0.141400
## I(cyl^2) 0.2847 0.3451 0.825 0.416072
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
## F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
Compare and explain the interpretation of the estimated parameters in the models fit6, fit7, and fit8.
What are other possible parametrizations that can be used? Compare the following model with the model denoted as fit7.
### sum-to-zero (effect) coding of the cylinder factor by hand:
### dummy1 is 1 for four cylinders, -1 for eight cylinders, 0 otherwise
mtcars$dummy1 <- as.numeric(mtcars$cyl == 4)
mtcars$dummy1[mtcars$cyl == 8] <- -1
### dummy2 is 1 for six cylinders, -1 for eight cylinders, 0 otherwise
mtcars$dummy2 <- as.numeric(mtcars$cyl == 6)
mtcars$dummy2[mtcars$cyl == 8] <- -1
fit9 <- lm(mpg ~ dummy1 + dummy2, data = mtcars)
print(fit9)
##
## Call:
## lm(formula = mpg ~ dummy1 + dummy2, data = mtcars)
##
## Coefficients:
## (Intercept) dummy1 dummy2
## 20.5022 6.1615 -0.7593
How can the estimated parameters of the model fit7 be recovered from the model above (fit9)? Look at the output of the summary() function and explain the statistical tests that are provided for each row of the coefficient table (in model fit7 and in model fit9).
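For reference, the hand-made dummies in fit9 correspond to the built-in sum-to-zero (effect) contrasts; as an illustrative sketch, the same fit can be obtained with the standard function C():
### sum-to-zero contrasts reproduce the coefficients of fit9
fit9c <- lm(mpg ~ C(as.factor(cyl), contr.sum), data = mtcars)
print(fit9c)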