Fundamentals of Regression | NMFM 334

Summer semester 2024 | Lab session 11 | 07.05.2024
HTML Markdown – Rmd source file (UTF encoding)



Outline of lab session no. 11:

1. Outlying observations
2. Leverage points

In a standard linear regression model we postulate a specific parametric form for the conditional mean of the response variable \(Y\) given a vector of explanatory variables \(\boldsymbol{X} \in \mathbb{R}^p\), for instance \[ Y | \boldsymbol{X} \sim (\boldsymbol{X}^\top\boldsymbol{\beta}, \sigma^2), \] where \(\boldsymbol{\beta} \in \mathbb{R}^p\) is the unknown parameter vector and \(\sigma^2 > 0\) is the (usually unknown) variance parameter. The normality assumption is not strictly needed.
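Under this model, the ordinary least squares estimator \(\widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top\boldsymbol{Y}\) is exactly what `lm()` computes. A minimal sketch on simulated data (the variable names and the true parameter values below are illustrative only):

```r
# OLS via the normal equations vs. lm() on simulated data
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)          # true beta = (2, 3)
X <- cbind(1, x)                   # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
fit <- lm(y ~ x)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```

Both routes give the same estimate up to numerical tolerance; `lm()` uses a QR decomposition rather than forming \(\mathbb{X}^\top\mathbb{X}\) explicitly.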



In practical applications it can happen that some specific observation \(i \in \{1, \dots, n\}\) does not align well with the proposed model. This misalignment can be measured with respect to both axes: the \(y\) axis representing the dependent variable \(Y\), or the \(x\) axis representing the explanatory variable(s). In the first case the observation is called an outlying observation (outlier); in the second case it is called a leverage point.

Both may (or may not) have a crucial impact on the final model estimate and, also, the corresponding inference.

1. Outlying observations

For illustration purposes we will again consider the small dataset mtcars with 32 independent observations. We first focus on outlying observations and their effect on the final model fit (the estimated parameters and the model diagnostics).

Consider the following model:

library("mffSM")

summary(m1 <- lm(mpg ~ disp, data = mtcars))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183,    Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
plotLM(m1)

In the following, we introduce one outlying observation and investigate its effect on the overall model.

mtcars_out1 <- mtcars
mtcars_out1$mpg[12] <- 50

summary(m2 <- lm(mpg ~ disp, data = mtcars_out1))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars_out1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.741 -3.270 -1.972  1.352 30.574 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.915986   2.473191  12.096 4.57e-13 ***
## disp        -0.038034   0.009476  -4.014 0.000368 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.539 on 30 degrees of freedom
## Multiple R-squared:  0.3494,    Adjusted R-squared:  0.3277 
## F-statistic: 16.11 on 1 and 30 DF,  p-value: 0.0003678

Both models can be compared visually:

plot(mpg ~ disp, pch = 22, bg = "lightblue", xlab = "Displacement [cu.in.]", ylab = "Miles per Gallon", data = mtcars_out1)

points(mtcars_out1$mpg[12] ~ mtcars_out1$disp[12], pch = 22, bg = "red", cex = 2)
abline(m1, col = "blue", lwd = 2)
abline(m2, col = "red", lwd = 2)



The effect of a second (artificial) outlier can be even more serious:

mtcars_out2 <- mtcars_out1
mtcars_out2$mpg[13] <- 48

summary(m3 <- lm(mpg ~ disp, data = mtcars_out2))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars_out2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.517 -4.366 -3.241  0.479 29.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.20483    3.13503   9.635 1.08e-10 ***
## disp        -0.03513    0.01201  -2.924  0.00651 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.289 on 30 degrees of freedom
## Multiple R-squared:  0.2218,    Adjusted R-squared:  0.1959 
## F-statistic: 8.552 on 1 and 30 DF,  p-value: 0.006513

Graphically:

plot(mpg ~ disp, pch = 22, bg = "lightblue", xlab = "Displacement [cu.in.]", ylab = "Miles per Gallon", data = mtcars_out2)

points(mtcars_out2$mpg[12] ~ mtcars_out2$disp[12], pch = 22, bg = "red", cex = 2)
points(mtcars_out2$mpg[13] ~ mtcars_out2$disp[13], pch = 22, bg = "red", cex = 2)
abline(m1, col = "blue", lwd = 2)
abline(m2, col = "red", lwd = 2)
abline(m3, col = "green", lwd = 2)
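The two artificial outliers can also be flagged numerically. A common informal diagnostic (a rule of thumb, not a formal test) looks at the externally studentized residuals, available in base R as `rstudent()`, and flags observations with absolute value above 2. A sketch reproducing the modified dataset from above:

```r
# Flag suspected outliers via externally studentized residuals
mtcars_out2 <- mtcars
mtcars_out2$mpg[12] <- 50    # first artificial outlier
mtcars_out2$mpg[13] <- 48    # second artificial outlier
m3 <- lm(mpg ~ disp, data = mtcars_out2)
rs <- rstudent(m3)
which(abs(rs) > 2)           # indices of flagged observations
```

Here the two modified observations stand out clearly, since their residuals are far larger than the residual standard error.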

Individual work

  • Consider the effect of a single outlying observation and explore how much the original regression line can be changed by it.
  • Consider the effect of two outlying observations located so that each compensates the effect of the other.
  • Note that the outlying observations artificially introduced in the datasets above are perfectly typical with respect to the displacement information (i.e., the \(x\) axis); they are atypical only with respect to the response values.
  • Consider the vector of the estimated (mean) parameters and the vector of the fitted values (\(\widehat{\boldsymbol{Y}}\)) and compare these quantities between the model without outlying observations and a model with some outliers.
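A possible starting point for the last item, refitting the three models from above and putting the estimates side by side (a sketch only; the comparison of fitted values is summarized by its maximum here):

```r
# Compare coefficients and fitted values across the three models
m1 <- lm(mpg ~ disp, data = mtcars)
mtcars_out1 <- mtcars
mtcars_out1$mpg[12] <- 50
m2 <- lm(mpg ~ disp, data = mtcars_out1)
mtcars_out2 <- mtcars_out1
mtcars_out2$mpg[13] <- 48
m3 <- lm(mpg ~ disp, data = mtcars_out2)
rbind(m1 = coef(m1), m2 = coef(m2), m3 = coef(m3))
# Largest shift of the fitted values caused by the two outliers:
max(abs(fitted(m1) - fitted(m3)))
```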





Compare also the graphical diagnostic plots for both models with outlying observations.

plotLM(m2)

plotLM(m3)



2. Leverage points

Unlike outlying observations, leverage points are atypical with respect to the domain of the covariates (the \(x\) axis). As before, we introduce some leverage points into the dataset and investigate their effect on the final model.

mtcars_lev1 <- mtcars
mtcars_lev1$disp[1] <- 10

summary(m4 <- lm(mpg ~ disp, data = mtcars_lev1))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars_lev1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3136 -2.1134 -0.8407  2.0451  7.9121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.694249   1.271062   22.57  < 2e-16 ***
## disp        -0.038063   0.004899   -7.77 1.14e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.53 on 30 degrees of freedom
## Multiple R-squared:  0.668, Adjusted R-squared:  0.657 
## F-statistic: 60.37 on 1 and 30 DF,  p-value: 1.142e-08

Visually:

plot(mpg ~ disp, pch = 22, bg = "lightblue", xlab = "Displacement [cu.in.]", ylab = "Miles per Gallon", data = mtcars_lev1)

points(mtcars_lev1$mpg[1] ~ mtcars_lev1$disp[1], pch = 22, bg = "red", cex = 2)
abline(m1, col = "blue", lwd = 2)
abline(m4, col = "red", lwd = 2)
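The influence of the modified observation on \(\widehat{\boldsymbol{\beta}}\) can also be quantified directly: base R's `dfbeta()` reports, for each observation, how the estimated coefficients would change if that observation were deleted. A sketch using the modified dataset from above:

```r
# Per-observation coefficient changes upon deletion
mtcars_lev1 <- mtcars
mtcars_lev1$disp[1] <- 10    # artificial leverage point
m4 <- lm(mpg ~ disp, data = mtcars_lev1)
dfbeta(m4)[1, ]              # effect of deleting the modified observation
```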

And, again, the second leverage point can make the whole situation even worse:

mtcars_lev2 <- mtcars_lev1
mtcars_lev2$disp[25] <- 800

summary(m5 <- lm(mpg ~ disp, data = mtcars_lev2))
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars_lev2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.743 -2.908 -1.559  1.853 12.997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.990757   1.455999  17.851  < 2e-16 ***
## disp        -0.024735   0.005075  -4.874 3.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.577 on 30 degrees of freedom
## Multiple R-squared:  0.4419,    Adjusted R-squared:  0.4233 
## F-statistic: 23.76 on 1 and 30 DF,  p-value: 3.324e-05



plot(mpg ~ disp, pch = 22, bg = "lightblue", xlab = "Displacement [cu.in.]", ylab = "Miles per Gallon", data = mtcars_lev2)

points(mtcars_lev2$mpg[1] ~ mtcars_lev2$disp[1], pch = 22, bg = "red", cex = 2)
points(mtcars_lev2$mpg[25] ~ mtcars_lev2$disp[25], pch = 22, bg = "red", cex = 2)
abline(m1, col = "blue", lwd = 2)
abline(m4, col = "red", lwd = 2)
abline(m5, col = "green", lwd = 2)

Note that both leverage points are now quite typical with respect to the response values; they are atypical only with respect to the displacement value.

plotLM(m5)



Individual work

  • Compare the estimated parameters across all three models and investigate the effect of the introduced leverage points. Note that leverage points do not have that much effect (compared with outlying observations) as long as they are atypical only with respect to the covariate domain (and not with respect to the response).
  • Compare the diagonal elements of the hat matrix \(\mathbb{H}\) and try to detect the leverage points from the values on the diagonal of \(\mathbb{H}\).
  • Consider a situation where both types of atypical observations are introduced into the model: leverage points and, also, outlying observations.
  • How bad can the model become with bad leverage points and bad outlying observations?
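As a starting point for the hat-matrix item above, a sketch using `hatvalues()` together with the common informal rule of thumb that flags \(h_{ii} > 2p/n\) (a heuristic, not a formal test); `cooks.distance()` is one way to combine leverage and outlyingness in a single measure:

```r
# Detect leverage points via the diagonal of the hat matrix H
mtcars_lev2 <- mtcars
mtcars_lev2$disp[1] <- 10     # first artificial leverage point
mtcars_lev2$disp[25] <- 800   # second artificial leverage point
m5 <- lm(mpg ~ disp, data = mtcars_lev2)
h <- hatvalues(m5)
p <- length(coef(m5))
n <- nrow(mtcars_lev2)
which(h > 2 * p / n)          # observations flagged as high leverage
# Cook's distance: influence on the whole fitted mean vector
head(sort(cooks.distance(m5), decreasing = TRUE), 3)
```

Note that a moderately atypical covariate value may escape the \(2p/n\) rule even though it visibly shifts the fitted line, which is one reason to inspect the hat values and the plots together.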