Summer semester 2024 | Exercise 3 | 12.03.2024
HTML | Markdown – Rmd source file (UTF-8 encoding)
Some necessary R packages (each package must first be installed in R – this can be achieved by running the command install.packages("package_name")). After the installation, the libraries are initialized by
library(MASS)
library(ISLR2)
The installation (command install.packages()) should be performed just once. However, the initialization of the library – the command library() – must be used every time a new R session is started.
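A common idiom (a sketch only, not required for this exercise) is to install a package only when it is not available yet:
if (!requireNamespace("ISLR2", quietly = TRUE)) {
  install.packages("ISLR2")
}
library(ISLR2)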
The ISLR2
library contains the Boston
data
set, which records medv
(median house value) for \(506\) census tracts in Boston. We will seek
to predict medv
using some of the \(12\) given predictors such as
rm
(average number of rooms per house), age
(average age of houses), or lstat
(percent of households
with low socioeconomic status).
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
To find out more about the data set, we can type ?Boston.
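Apart from the help page, the basic structure of the data can be inspected directly:
### dimensions (506 census tracts, 13 variables) and variable types
dim(Boston)
str(Boston)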
We will start with some simple exploratory analysis. For all three explanatory variables we draw a simple scatterplot.
par(mfrow = c(1,3))
plot(medv ~ rm, data = Boston, pch = 22, bg = "gray", ylab = "Median value [in 1000$]", xlab = "Average number of rooms")
plot(medv ~ age, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Proportion of owner-occupied units built prior to 1940")
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Lower status of the population [%].")
It seems that there is a positive (increasing) relationship between the median house value (dependent variable medv) and the average number of rooms in the house (independent variable rm), and a negative (decreasing) relationship between the median house value and the population status (explanatory variable lstat). For the proportion of the owner-occupied units built prior to 1940 (variable age), it is not so obvious what relationship to expect… In all three situations we will use a simple linear (ordinary) regression model of the form
\[ Y_i = a + b X_i + \varepsilon_i, \qquad i = 1, \dots, 506. \]
(Later, we will also consider the variable age dichotomized into two subpopulations – the proportion below 50% and above 50%.)
In order to get some insight into the relationship \(Y = f(X) + \varepsilon\) (and to judge the appropriateness of a straight line as a surrogate for \(f(\cdot)\)) we can take a look at some (non-parametric) data smoothing – the function lowess() in R:
par(mfrow = c(1,3))
plot(medv ~ rm, data = Boston, pch = 22, bg = "gray", ylab = "Median value [in 1000$]", xlab = "Average number of rooms")
lines(lowess(Boston$medv ~ Boston$rm, f = 2/3), col = "red", lwd = 2)
plot(medv ~ age, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Proportion of owner-occupied units built prior to 1940")
lines(lowess(Boston$medv ~ Boston$age, f = 2/3), col = "red", lwd = 2)
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Lower status of the population [%].")
lines(lowess(Boston$medv ~ Boston$lstat, f = 2/3), col = "red", lwd = 2)
Note that there is hardly any reasonable analytical expression for the red curves above (the specific form of the function \(f(\cdot)\)). Also note the parameter f = 2/3 in the function lowess(). Run the same R code with different values of this parameter to see the differences.
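For instance, the span parameter f controls the proportion of the points that influence each local fit – smaller values yield a wigglier curve, larger values a smoother one. A minimal sketch comparing several spans for the lstat relationship (the particular values and colors are arbitrary choices):
### compare several smoothing spans on the medv ~ lstat scatterplot
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "Median value [in 1000$]", xlab = "Lower status of the population [%].")
lines(lowess(Boston$medv ~ Boston$lstat, f = 2/3), col = "red", lwd = 2)
lines(lowess(Boston$medv ~ Boston$lstat, f = 1/5), col = "blue", lwd = 2)
lines(lowess(Boston$medv ~ Boston$lstat, f = 0.9), col = "darkgreen", lwd = 2)
legend("topright", legend = c("f = 2/3", "f = 1/5", "f = 0.9"), col = c("red", "blue", "darkgreen"), lwd = 2)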
Simple linear regression with lstat
We will now fit a simple linear regression line through the data using the R function lm(). In the implementation used below, the analytical form of the function \(f(\cdot)\) being used to “smooth” the data is \[
f(x) = a + bx
\] for some intercept parameter \(a \in \mathbb{R}\) and slope parameter \(b \in \mathbb{R}\). The following R code fits a simple linear regression line, with medv as the response (dependent variable) and lstat as the predictor (explanatory/independent variable/covariate). The basic syntax is lm(y ~ x, data = data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
lm.fit <- lm(medv ~ lstat, data = Boston)
If we type lm.fit
, some basic information about the
model is output. For more detailed information, we use
summary(lm.fit)
. This gives us \(p\)-values and standard errors for the
coefficients, as well as the \(R^2\)
statistic and \(F\)-statistic for the
model.
lm.fit
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
This fits a straight line through the data (the line intersects the \(y\) axis at the point \(34.55\) and the value of the slope parameter – which equals \(-0.95\) – means that for each unit increase on the \(x\) axis the line drops by \(0.95\) units with respect to the \(y\) axis).
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Lower status of the population [%].")
lines(lowess(Boston$medv ~ Boston$lstat, f = 2/3), col = "red", lwd = 2)
abline(lm.fit, col = "blue", lwd = 2)
It is important to realize that the estimated parameters \(\widehat{a} = 34.55\) and \(\widehat{b} = -0.95\) are given realizations of two random variables – \(\widehat{a}\) and \(\widehat{b}\) (i.e., for another dataset, or another subset of the observations, we would obtain different values of the estimated parameters). Thus, we are speaking about random quantities here and, therefore, it is reasonable to inspect some of their statistical properties (such as the corresponding mean parameters, the variance of the estimates, or even their full distribution…).
Some of the statistical properties (i.e., the empirical estimates for the standard errors of the estimates) can be obtained by the following command:
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
We can use the names()
function in order to find out
what other pieces of information are stored in lm.fit
.
Although we can extract these quantities by
name—e.g. lm.fit$coefficients
—it is safer to use the
extractor functions like coef()
to access them.
names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
It is important to realize that the estimates \(\widehat{a}\) and \(\widehat{b}\) are random variables. They have the corresponding variances \(Var(\widehat{a})\) and \(Var(\widehat{b})\), but the values that we see in the table above are just estimates of these theoretical quantities – thus, mathematically correctly, we should use the notation \[ \widehat{Var(\widehat{a})} \approx 0.5626^2 \] and
\[
\widehat{Var(\widehat{b})} \approx 0.0387^2.
\] The values in the table (second column) are the estimates of the true standard errors of \(\widehat{a}\) and \(\widehat{b}\) (because we do not know the true variance, or standard error respectively, of the error terms \(\varepsilon_i\) in the underlying model \(Y_i = a + b X_i + \varepsilon_i\), where we only assume that \(\varepsilon_i \sim (0, \sigma^2)\)).
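If needed, the estimated standard errors can also be extracted programmatically – either from the coefficient table returned by summary(), or as the square roots of the diagonal of the estimated variance-covariance matrix:
### estimated standard errors of the coefficient estimates
summary(lm.fit)$coefficients[, "Std. Error"]
### equivalently, via the estimated variance-covariance matrix
sqrt(diag(vcov(lm.fit)))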
The unknown variance \(\sigma^2 >
0\) is also estimated in the output above – see the value for the
Residual standard error
. Thus, we have \(\widehat{\sigma^2} = 6.216^2\).
### model residuals -- estimates for random errors and their estimated variance/standard error
var(lm.fit$residuals)
## [1] 38.55917
sqrt(var(lm.fit$residuals))
## [1] 6.209603
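Note that var() divides by \(n - 1\), which is why the value \(6.2096\) above differs slightly from the residual standard error \(6.216\) reported by summary() – the latter divides the residual sum of squares by the residual degrees of freedom \(n - 2 = 504\). The exact computation by hand:
### residual standard error computed by hand: sqrt(RSS / (n - 2))
RSS <- sum(lm.fit$residuals^2)
sqrt(RSS / lm.fit$df.residual)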
The intercept parameter \(a \in \mathbb{R}\) and the slope parameter \(b \in \mathbb{R}\) are unknown but fixed parameters, and we have the corresponding (point) estimates for both – the random quantities \(\widehat{a}\) and \(\widehat{b}\). Thus, it is also reasonable to ask for interval estimates instead – the confidence intervals for \(a\) and \(b\). These can be obtained by the command confint().
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
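These intervals can be reconstructed by hand from the point estimates and their standard errors, using the appropriate quantile of the Student \(t\) distribution with \(n - 2\) degrees of freedom. A sketch for the slope parameter \(b\):
### 95% confidence interval for the slope: estimate +/- t-quantile * standard error
est <- coef(lm.fit)["lstat"]
se <- sqrt(diag(vcov(lm.fit)))["lstat"]
est + c(-1, 1) * qt(0.975, df = lm.fit$df.residual) * se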
The estimated parameters \(\widehat{a}\) and \(\widehat{b}\) can be used to estimate some characteristic of the distribution of the dependent variable \(Y\) (i.e., medv) given the value of the independent variable “\(X = x\)” (i.e., lstat). In other words, the estimated parameters can be used to estimate the conditional expectation of the conditional distribution of \(Y\) given “\(X = x\)”, that is, \[
\widehat{\mu_x} = \widehat{E[Y | X = x]} = \widehat{a} + \widehat{b}x.
\]
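For example, the estimated conditional expectation of medv for lstat equal to 10 follows directly from the fitted coefficients:
### estimated E[medv | lstat = 10] = a.hat + b.hat * 10
sum(coef(lm.fit) * c(1, 10))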
However, sometimes it can also be useful to “predict” the value of \(Y\) for a specific value of \(X\). Since \(Y\) is random (even conditionally on \(X = x\)), we need to give some characteristic of the whole conditional distribution – and this characteristic is said to be a prediction for \(Y\). It is common to use the conditional expectation for this purpose.
In R, we can use the predict() function, which also produces confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),
interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),
interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
For instance, the 95% confidence interval associated with a lstat value of 10 is \((24.47, 25.63)\), and the 95% prediction interval is \((12.83, 37.28)\). As expected, the confidence and prediction intervals are centered around the same point (a predicted value of \(25.05\) for medv when lstat equals 10), but the latter are substantially wider. Can you explain why?
Next we examine some diagnostic plots – but we will discuss different diagnostic tools later.
Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function with its mfrow argument, which tells R to split the display into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a \(2 \times 2\) grid of panels.
par(mfrow = c(2, 2))
plot(lm.fit)
These plots can be used to judge the quality of the model that was used – in our particular case we used a simple linear regression line of the form \(Y = a + bX + \varepsilon\).
As an exercise, try fitting analogous simple regression models for the predictors age and rm.
Is there any relationship/expectation with respect to the estimated covariances and correlations below?
cov(Boston$medv, Boston$rm)
## [1] 4.493446
cov(Boston$medv, Boston$age)
## [1] -97.58902
cov(Boston$medv, Boston$lstat)
## [1] -48.44754
cor(Boston$medv, Boston$rm)
## [1] 0.6953599
cor(Boston$medv, Boston$age)
## [1] -0.3769546
cor(Boston$medv, Boston$lstat)
## [1] -0.7376627
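Recall that in a simple linear regression the estimated slope satisfies \(\widehat{b} = \widehat{Cov}(X, Y) / \widehat{Var}(X)\), so the signs of the covariances (and correlations) above must agree with the signs of the corresponding fitted slopes. A quick check for lstat:
### the OLS slope equals the sample covariance divided by the sample variance
cov(Boston$lstat, Boston$medv) / var(Boston$lstat)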
Until now, the explanatory variable \(X\) was treated as a continuous variable. However, this variable can also be used as binary information (and also as a categorical variable, but this will be discussed later).
We already mentioned a situation where the proportion of the owner-occupied houses is below or above 50%. Thus, we will create another variable in the original data that reflects this information.
Boston$fage <- (Boston$age > 50)
table(Boston$fage)
##
## FALSE TRUE
## 147 359
We can look at both subpopulations by means of two boxplots, for instance:
boxplot(medv ~ fage, data = Boston, col = "lightblue", xlab = "Proportion of owner-occupied houses is above 50%", ylab = "Median house value")
The figure above corresponds with the scatterplot of medv against age, where we observed that higher proportions (i.e., higher values of the age variable) are associated with rather lower median house values (dependent variable \(Y \equiv\) medv). In the boxplot above, we can see the same: higher proportions of the owner-occupied houses (the sub-population where the proportion is above \(50\%\)) are associated with rather lower median house values.
What happens when this information (the explanatory variable fage, which only takes two values – one for true and zero otherwise) is used in a simple linear regression model?
lm.fit2 <- lm(medv ~ fage, data = Boston)
lm.fit2
##
## Call:
## lm(formula = medv ~ fage, data = Boston)
##
## Coefficients:
## (Intercept) fageTRUE
## 26.693 -5.864
And the corresponding statistical summary of the model:
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ fage, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.829 -5.720 -1.729 2.898 29.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.6932 0.7267 36.730 < 2e-16 ***
## fageTRUE -5.8639 0.8628 -6.796 3.04e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.811 on 504 degrees of freedom
## Multiple R-squared: 0.08396, Adjusted R-squared: 0.08214
## F-statistic: 46.19 on 1 and 504 DF, p-value: 3.038e-11
Can you see some analogy with the following (partial) exploratory characteristics (respectively, their sample estimates)?
mean(Boston$medv[Boston$fage == F])
## [1] 26.6932
mean(Boston$medv[Boston$fage == T])
## [1] 20.82925
mean(Boston$medv[Boston$fage == T]) - mean(Boston$medv[Boston$fage == F])
## [1] -5.863949
The estimated model in this case takes the form \[ Y = a + bX + \varepsilon \] where \(x\) can only take two specific values – it is either equal to zero and, thus, the model becomes \(f(x) = a\), or the value of \(x\) is equal to one and the model becomes \(f(x) = a + b\). Note that any other parametrization of the explanatory variable \(X\) (for instance, \(x = -1\) for the subpopulation with the proportion below 50% and \(x = 1\) for the subpopulation with the proportion above 50%) gives a mathematically equivalent model.
### different parametrization of two subpopulations (using +/- 1 instead of 0/1)
Boston$fage2 <- (2 * as.numeric(Boston$fage) - 1)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
## fage fage2
## 1 TRUE 1
## 2 TRUE 1
## 3 TRUE 1
## 4 FALSE -1
## 5 TRUE 1
## 6 TRUE 1
lm.fit3 <- lm(medv ~ fage2, data = Boston)
lm.fit3
##
## Call:
## lm(formula = medv ~ fage2, data = Boston)
##
## Coefficients:
## (Intercept) fage2
## 23.761 -2.932
Again, more details can be obtained via summary(). Note the following analogy for the intercept estimate:
(mean(Boston$medv[Boston$fage2 == -1]) + mean(Boston$medv[Boston$fage2 == 1]))/2
## [1] 23.76122
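Analogously, the slope estimate under the \(\pm 1\) coding equals half of the difference between the two subpopulation means:
### slope under the +/- 1 parametrization: half of the difference of the group means
(mean(Boston$medv[Boston$fage2 == 1]) - mean(Boston$medv[Boston$fage2 == -1]))/2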
We already mentioned some transformations (parametrizations) of the independent/explanatory variable \(X\) in a situation where \(X\) stands for binary information (yes/no, true/false, \(\pm 1\), \(0/1\), etc.). However, different transformations can also be used for the independent covariate \(X\) if it stands for a continuous variable.
Consider the first model
lm.fit
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
and the model
lm(medv ~ I(lstat/100), data = Boston)
##
## Call:
## lm(formula = medv ~ I(lstat/100), data = Boston)
##
## Coefficients:
## (Intercept) I(lstat/100)
## 34.55 -95.00
What type of transformation is used in this example and what is the underlying interpretation of this different parametrization of \(X\)? Such and similar parametrizations are typically used to improve the interpretation of the final model. For instance, using the first model lm.fit, the interpretation of the estimated intercept parameter \(\widehat{a}\) can be given as follows: “The estimated expected value of the median house value is 34.55 thousand dollars if the lower population status is at the zero level.” But this may not be realistic when looking at the data – indeed, the minimum observed value of lstat is \(1.73\). Thus, it can be more suitable to use a different interpretation of the intercept parameter. Consider the following model and the summary statistics for lstat:
lm(medv ~ I(lstat - 11.36), data = Boston)
##
## Call:
## lm(formula = medv ~ I(lstat - 11.36), data = Boston)
##
## Coefficients:
## (Intercept) I(lstat - 11.36)
## 23.76 -0.95
summary(Boston$lstat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.73 6.95 11.36 12.65 16.95 37.97
What is the interpretation of the intercept parameter estimate in
this case? Can you obtain the value of the estimate using the original
model?
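One way to check the answer: since the covariate was centered at \(11.36\) (the median of lstat), the intercept of the centered model equals the original fitted line evaluated at that point:
### intercept of the centered model = original fit evaluated at lstat = 11.36
coef(lm.fit)[1] + coef(lm.fit)[2] * 11.36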
Transformations of the independent covariates can, however, also be used to improve the model fit. For instance, consider the following model:
Boston$lstat_transformed <- sqrt(Boston$lstat)
lm.fit4 <- lm(medv ~ lstat_transformed, data = Boston)
Both models can be compared visually:
par(mfrow = c(1,3))
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Lower status of the population [%].")
abline(lm.fit, col = "blue", lwd = 2)
plot(medv ~ lstat_transformed, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Square root of the lower status of the population.")
abline(lm.fit4, col = "blue", lwd = 2)
plot(medv ~ lstat, data = Boston, pch = 22, bg = "gray", ylab = "", xlab = "Lower status of the population [%].")
xGrid <- seq(0,40, length = 1000)
yValues <- lm.fit4$coeff[1] + lm.fit4$coeff[2] * sqrt(xGrid)
lines(yValues ~ xGrid, col = "blue", lwd = 2)
Note that the first plot is the scatterplot for the relationship medv \(\sim\) lstat – the original model, which fits a straight line through the data. The middle plot is the scatterplot for the relationship medv \(\sim (\)lstat\()^{1/2}\), and the corresponding model (lm.fit4) again fits a straight line through these data (with coordinates \((Y_i, \sqrt{X_i})\)). Finally, the third plot shows the original scatterplot (the data with the coordinates \((Y_i, X_i)\)), but the model lm.fit4 does not fit a straight line through such data. The interpretation is simple with respect to the transformed data \((Y_i, \sqrt{X_i})\) but it is not as straightforward with respect to the original data \((Y_i, X_i)\).
Nevertheless, it seems that the model lm.fit4 fits the data more closely than the model lm.fit.
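This visual impression can also be checked numerically, for instance by comparing the coefficients of determination of the two fits – if the square-root model indeed fits more closely, its \(R^2\) should be larger:
### compare the coefficients of determination of the two models
summary(lm.fit)$r.squared
summary(lm.fit4)$r.squared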