7  Leverage and Influence

7.1 Influential observations and leverage

Recall that violations of model assumptions are more likely at remote points, and that these violations may be hard to detect from the ordinary residuals because the residuals at remote points are usually smaller. Points that are outlying in the \(x\)-direction are known as leverage points. Influential points are remote in terms of the regressor values and, in addition, have an observed response that is not consistent with the value that would be predicted from the other data points alone. It is important to find these influential points and assess their impact on the model.

The example below shows an influential point. The seventh point in the data set is outlying in the \(x\)-direction, and its response value is not consistent with the regression line based on the other six observations:

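set.seed(330)
# Six ordinary points plus one remote point at x = 2.5
x=c(rnorm(6),2.5)
# Responses lie exactly on the line y = 3 + 2x ...
y=x*2+3
# ... except the seventh, which is shifted up by 7
y[7]=y[7]+7
plot(x,y,pch=22,bg=1)
# Black: least squares fit to all seven points
a=lm(y~x)
curve(a$coefficients[1]+x*a$coefficients[2],add=T,lwd=3)
# Red: the line y = 3 + 2x used to generate the data
curve(x*2+3,add=T,col=2,lwd=3)

# Blue dashed: least squares fit without the seventh point
a2=lm(y[-7]~x[-7])
curve(a2$coefficients[1]+x*a2$coefficients[2],add=T,lwd=3,col='blue',lty=2)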

a$coefficients
(Intercept)           x 
   2.048937    3.979977 
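
To put numbers on the picture, the two fits can be compared directly. The sketch below rebuilds the toy data from the code above and prints both coefficient vectors; the names `fit_all` and `fit_drop7` are mine. Because the six remaining points lie exactly on \(y=3+2x\), the fit without the seventh point recovers that line.

```r
# Rebuild the toy data used above
set.seed(330)
x <- c(rnorm(6), 2.5)
y <- x * 2 + 3
y[7] <- y[7] + 7

fit_all   <- lm(y ~ x)            # all seven points
fit_drop7 <- lm(y[-7] ~ x[-7])    # seventh point removed

coef(fit_all)
coef(fit_drop7)   # intercept 3 and slope 2, up to rounding error

# Change in each coefficient caused by the single remote point
unname(coef(fit_all) - coef(fit_drop7))
```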

Sometimes we find that a regression coefficient may have a sign that does not make engineering or scientific sense, a regressor known to be important may be statistically insignificant, or a model that fits the data well and that is logical from an application-environment perspective may produce poor predictions. These situations may be the result of one or, perhaps, a few influential observations.

Recall the hat matrix \(H=X(X^\top X)^{-1}X^\top\), and recall that \({\textrm{Var}}\left[\hat\epsilon\right]=\sigma^2(I-H)\) and \({\textrm{Var}}\left[\hat Y\right]=\sigma^2 H\). Note that \(h_{ij}\) can be interpreted as the amount of leverage exerted by the \(i\)th observation \(y_i\) on the \(j\)th fitted value \(\hat y_j\). We usually focus attention on the diagonal elements \(h_{ii}\) of the hat matrix \(H\), which may be written as \[h_{ii}=x_i^\top (X^\top X)^{-1} x_i,\] where \(x_i^\top\) is the \(i\)th row of \(X\). The hat matrix diagonal is a standardized measure of the distance of the \(i\)th observation from the center (or centroid) of the \(x\)-space. Therefore, a large value of \(h_{ii}\) implies that \(x_i\) is potentially influential. Furthermore, \(H\) is idempotent with \(\operatorname{rank}(H)=p\), and the trace of an idempotent matrix equals its rank, so \(\sum_{i=1}^n h_{ii}=p\) and hence \(\bar h= p/n\). It follows that points with \(h_{ii}\) well above \(p/n\), say \(h_{ii}>2p/n\), can be called leverage points.

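# Design matrix built by hand (intercept column plus x)
X=as.matrix(cbind(rep(1,length(x)),x))
# or extracted from the fitted model

X=model.matrix(a)
# Hat matrix H = X (X'X)^{-1} X'
hat=X%*%solve(t(X)%*%X)%*%t(X)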

diag(hat)
        1         2         3         4         5         6         7 
0.2027453 0.2288737 0.2596869 0.1751432 0.1735495 0.3887329 0.5712686 
p=2
n=7
diag(hat)>2*p/n
    1     2     3     4     5     6     7 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
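
R also provides the leverages directly through `hatvalues()`. The following sketch, which rebuilds the toy data and uses my own variable names, checks that the built-in agrees with the explicit formula and that the leverages sum to \(p\). It also prints each leverage minus the cutoff \(2p/n=4/7\approx 0.5714\). In the output above the seventh point, with \(h_{77}\approx 0.5713\), falls just short of the cutoff, which suggests treating the cutoff as a rule of thumb rather than a sharp boundary.

```r
set.seed(330)
x <- c(rnorm(6), 2.5)
y <- x * 2 + 3
y[7] <- y[7] + 7
fit <- lm(y ~ x)

X <- model.matrix(fit)
h_manual  <- diag(X %*% solve(crossprod(X)) %*% t(X))  # x_i' (X'X)^{-1} x_i
h_builtin <- hatvalues(fit)

all.equal(unname(h_manual), unname(h_builtin))  # the two computations agree
sum(h_builtin)                                  # equals p = 2, the trace of H

p <- ncol(X)
n <- nrow(X)
cutoff <- 2 * p / n
round(h_builtin - cutoff, 4)  # positive entries would be flagged as leverage points
```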

7.2 Cook’s Distance

Cook’s Distance is one way to incorporate both the \(X\) and \(Y\) values into an outlyingness measure:

\[ D_i\left(X^{\top} X, p, MSE\right) \equiv D_i=\frac{\left(\hat{\beta}_{(i)}-\hat{\beta}\right)^{\top} X^{\top} X\left(\hat{\beta}_{(i)}-\hat{\beta}\right)}{p MSE}, \ i\in [n], \] where \(\hat{\beta}_{(i)}\) is the OLS estimator computed with the \(i\)th point removed. Large values of Cook’s distance signal an influential point.

What do we mean by a large value? We can compare \(D_i\) to the 50th percentile of the \(F_{p,n-p}\) distribution. This gives the interpretation that deleting the \(i\)th point moves the estimate to the boundary of a 50% confidence region for \(\beta\). This percentile is roughly 1, that is, \(F_{0.5,p,n-p}\approx 1\), and so we usually take \(D_i\geq 1\) to be large.
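
The percentile itself is easy to inspect in R with `qf()`. The short sketch below evaluates the median of \(F_{p,n-p}\) for the toy example (\(p=2\), \(n=7\)) and for a model of the size fitted later in this chapter (46 coefficients and 24429 residual degrees of freedom); the variable names are mine. It suggests that 1 is a convenient round threshold rather than an exact percentile, particularly for small samples.

```r
# Median of the F distribution with (p, n - p) degrees of freedom
f_median_toy   <- qf(0.5, df1 = 2,  df2 = 7 - 2)   # toy example: roughly 0.80
f_median_large <- qf(0.5, df1 = 46, df2 = 24429)   # large model: just below 1

c(toy = f_median_toy, large = f_median_large)
```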

Observe that \[ D_i=\frac{r_i^2}{p} \frac{\operatorname{Var}\left(\hat{Y}_i\right)}{\operatorname{Var}\left(\hat\epsilon_i\right)}=\frac{r_i^2}{p} \frac{h_{i i}}{1-h_{i i}}, \quad i=1,2, \ldots, n,\] where it is important to recall that \(r_i=\hat\epsilon_i/\sqrt{MSE\,(1-h_{ii})}\) is the (internally) studentized residual. Now, the quantity \(\frac{h_{i i}}{1-h_{i i}}\) can be shown to be the distance from the vector \(x_i\) to the centroid of the remaining data. Therefore, \(D_i\) is the product of a measure of outlyingness in the \(Y\)-direction and a measure of outlyingness in the \(X\)-direction. We may also write \(D_i\) as \[D_i=\frac{\left\lVert\hat{y}_{(i)}-\hat{y}\right\rVert^2}{p MSE},\] where \(\hat{y}_{(i)}=X\hat{\beta}_{(i)}\), which allows for the interpretation: the Cook’s distance of the \(i\)th point is the normalized squared distance between the vectors of fitted values computed with and without point \(i\).

#cut off
cooks.distance(a)
           1            2            3            4            5            6 
0.1708029420 0.2516095165 0.0180669722 0.0009569213 0.0011772793 0.2002829110 
           7 
3.3311562309 
cooks.distance(a)>1
    1     2     3     4     5     6     7 
FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
df=data.frame(cbind(y,x))
df[cooks.distance(a)>1,]
   y   x
7 15 2.5
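
As a check on the formulas in this section, Cook’s distance can be reproduced by hand. This sketch (variable names are mine) rebuilds the toy data and computes \(D_i\) twice: once from the studentized residuals and leverages, using `rstandard()` for \(r_i\), and once from the definition, by refitting the model with each point deleted. Both should agree with `cooks.distance()`.

```r
set.seed(330)
x <- c(rnorm(6), 2.5)
y <- x * 2 + 3
y[7] <- y[7] + 7
fit <- lm(y ~ x)

X   <- model.matrix(fit)
p   <- ncol(X)
mse <- summary(fit)$sigma^2
h   <- hatvalues(fit)
r   <- rstandard(fit)              # internally studentized residuals

# Version 1: D_i = r_i^2 / p * h_ii / (1 - h_ii)
D_formula <- r^2 / p * h / (1 - h)

# Version 2: the definition, refitting without observation i
D_delete <- sapply(seq_along(y), function(i) {
  b_i <- coef(lm(y[-i] ~ x[-i]))   # estimate with point i removed
  d   <- b_i - coef(fit)
  drop(t(d) %*% crossprod(X) %*% d) / (p * mse)
})

cbind(builtin = cooks.distance(fit), formula = D_formula, delete = D_delete)
```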

7.3 Data depth functions

A more modern, nonparametric approach to outlier detection is through data depth. A data depth function gives meaning to centrality, order and outlyingness in spaces beyond \(\mathbb{R}\). A data depth function is a function which takes a sample and a point, and returns how central the point is with respect to the sample. Depth functions can be written as \({\textrm{D}}\colon \mathbb{R}^{d}\times \text{Sample} \rightarrow \mathbb{R}^+\). There are several definitions of depth, so I will give two.

Let \(S^{d-1}= \{x\in \mathbb{R}^{d}\colon \left\lVert x\right\rVert=1\}\) be the set of unit vectors in \(\mathbb{R}^{d}\), let \(\mathbb{X}_{n}=\{(Y_1,X_{1,1},\ldots,X_{1,p-1}),\ldots, (Y_n,X_{n,1},\ldots,X_{n,p-1})\}\) be the sample (so that \(d=p\) here), let \(\mathbb{X}_{n}^\top u\) be \(\mathbb{X}_{n}\) projected onto \(u\in S^{d-1}\) and let \(\widehat F_u\) be the empirical CDF of the projected sample \(\mathbb{X}_{n}^\top u\).

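The halfspace depth \({\textrm{D}}_H\) of a point \(x\in \mathbb{R}^{d}\) with respect to a distribution \(F\) over \(\mathbb{R}^{d}\) is \[ {\textrm{D}}_H(x;F)=\inf_{u\in S^{d-1}} F_u(x^\top u)\wedge (1-F_u(x^\top u))=\inf_{u\in S^{d-1}} F_u(x^\top u), \] where \(F_u\) is the CDF of \(X^\top u\) for \(X\sim F\). The second equality holds for continuous \(F_u\) because replacing \(u\) with \(-u\) swaps the two tails. The sample version replaces \(F_u\) with the empirical CDF \(\widehat F_u\).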
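
In practice the infimum over directions is usually approximated. The following is a minimal sketch, not the algorithm used by any particular package: it draws random unit vectors, evaluates the empirical CDF of the projected sample at the projected point, and takes the minimum. The function name `halfspace_depth_approx`, its arguments, and the small data set are all hypothetical.

```r
# Approximate sample halfspace depth of `pt` with respect to the rows of `data`
# by minimising the projected empirical CDF over random directions.
halfspace_depth_approx <- function(pt, data, n_dir = 2000) {
  d <- ncol(data)
  U <- matrix(rnorm(n_dir * d), ncol = d)
  U <- U / sqrt(rowSums(U^2))          # rows are unit vectors u
  proj_data <- data %*% t(U)           # column k: the sample projected onto u_k
  proj_pt   <- drop(U %*% pt)          # x'u_k for each direction
  # Proportion of projected sample points at or below x'u_k, for each k
  ecdf_at_pt <- colMeans(sweep(proj_data, 2, proj_pt, "<="))
  min(ecdf_at_pt)
}

set.seed(1)
square <- cbind(c(-1, 1, -1, 1, 0), c(-1, -1, 1, 1, 0))  # four corners and the centre
depth_centre <- halfspace_depth_approx(c(0, 0), square)
depth_far    <- halfspace_depth_approx(c(10, 10), square)
c(centre = depth_centre, far = depth_far)
```

The central point receives a much larger depth than the point far outside the cloud, which is the ordering the definition is meant to produce.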

Given a translation and scale equivariant location estimate \(\mu\) and a translation invariant and scale equivariant scale estimate \(\sigma\) (for example, the median and the median absolute deviation), the outlyingness at \(x\in\mathbb{R}^{d}\) is defined as \[O(x)=\sup _{u\in S^{d-1}} \frac{\left|x^\top u-\mu(\mathbb{X}_{n}^\top u)\right|}{\sigma(\mathbb{X}_{n}^\top u)}.\] Define projection depth as \[{\textrm{D}}_p(x)=(1+O(x))^{-1}.\]

To detect outliers, we look for observations that have low depth. Continuing our toy example:

# install.packages('ddalpha')
depths=ddalpha::depth.projection(cbind(x,y),cbind(x,y))
depths
[1] 0.276409011 0.255272074 0.500000000 0.973754328 0.973046927 0.338954415
[7] 0.001740398
depths<0.015
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
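
To see what sits behind a call like this, here is a rough base-R sketch of projection depth using random directions, with the median as \(\mu\) and the MAD as \(\sigma\). These choices, the function name `projection_depth_approx`, and the number of directions are my assumptions; I have not checked them against the internals of `ddalpha`, so the values need not match the output above. The ranking is the point of interest: the seventh observation should again come out as the least deep.

```r
# Approximate projection depth of each row of `pts` with respect to `data`,
# taking the supremum in O(x) over randomly drawn unit vectors.
projection_depth_approx <- function(pts, data, n_dir = 2000) {
  d <- ncol(data)
  U <- matrix(rnorm(n_dir * d), ncol = d)
  U <- U / sqrt(rowSums(U^2))               # rows are unit vectors u
  proj_data <- data %*% t(U)                # sample projected onto each u
  proj_pts  <- pts %*% t(U)                 # points projected onto each u
  loc <- apply(proj_data, 2, median)        # location estimate per direction
  scl <- apply(proj_data, 2, mad)           # scale estimate per direction
  z <- abs(sweep(proj_pts, 2, loc, "-"))
  z <- sweep(z, 2, scl, "/")
  outlyingness <- apply(z, 1, max)          # O(x), approximated
  1 / (1 + outlyingness)
}

# Toy data from the start of the chapter
set.seed(330)
x <- c(rnorm(6), 2.5)
y <- x * 2 + 3
y[7] <- y[7] + 7
toy <- cbind(x, y)

set.seed(1)  # fixes the random directions
approx_depths <- projection_depth_approx(toy, toy)
round(approx_depths, 3)
which.min(approx_depths)
```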

Example 7.1 Recall Example 6.6. Check for leverage and influential points in the proposed models. Compute all three measures of leverage/influence/outlyingness introduced in this lesson. What do you find?

I will load in the data below:

We can now analyze the data:

# df=df[df$Lotsize<70000,]


custom_palette <- c(
  "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
  "#9467bd", "#8c564b", "#e377c2", "#7f7f7f",
  "#bcbd22", "#17becf", "#393b79", 
  "#8c6d31", "#9c9ede", "#637939", "#eb348f"
)

# Our model from the previous lecture
df=df_clean2[-which.max(df_clean2$Lotsize),]
df=df[df$Lotsize>0,]
df['district_3']=df['District']==3
df['district_4']=df['District']==4
df['district_15']=df['District']==15
model2=lm(Sale_price~.-district_3-district_4-district_15+district_3*Year_Built+district_3*Lotsize+district_4*Lotsize+district_4*Year_Built+district_15*Year_Built+district_15*Lotsize+district_3*Fin_sqft+district_4*Fin_sqft+district_15*Fin_sqft,df)


# Compute residuals
student_res2=rstudent(model2)


summ2=summary(model2); summ2

Call:
lm(formula = Sale_price ~ . - district_3 - district_4 - district_15 + 
    district_3 * Year_Built + district_3 * Lotsize + district_4 * 
    Lotsize + district_4 * Year_Built + district_15 * Year_Built + 
    district_15 * Lotsize + district_3 * Fin_sqft + district_4 * 
    Fin_sqft + district_15 * Fin_sqft, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-303393  -24422    -942   23352  719561 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -1.129e+06  3.625e+04 -31.143  < 2e-16 ***
District                    5.695e+03  7.701e+01  73.950  < 2e-16 ***
ExtwallBlock               -3.944e+03  3.688e+03  -1.069 0.284879    
ExtwallBrick                6.725e+03  7.245e+02   9.281  < 2e-16 ***
ExtwallFiber-Cement         3.847e+04  3.754e+03  10.246  < 2e-16 ***
ExtwallFrame               -4.986e+03  9.786e+02  -5.095 3.51e-07 ***
ExtwallMasonry / Frame      3.751e+03  1.711e+03   2.192 0.028404 *  
ExtwallPrem Wood            1.894e+04  5.596e+03   3.385 0.000714 ***
ExtwallStone                1.162e+04  1.531e+03   7.594 3.22e-14 ***
ExtwallStucco               3.514e+03  2.149e+03   1.635 0.102070    
Stories1                    3.684e+02  1.047e+04   0.035 0.971935    
Stories1.5                  1.317e+04  1.046e+04   1.259 0.207954    
Stories2                    1.451e+04  1.042e+04   1.392 0.163802    
Year_Built                  4.629e+02  1.606e+01  28.823  < 2e-16 ***
Fin_sqft                    6.048e+01  1.080e+00  55.986  < 2e-16 ***
Units1                      5.778e+04  7.389e+03   7.820 5.48e-15 ***
Units2                     -9.774e+03  7.386e+03  -1.323 0.185735    
Units3                     -4.087e+04  7.940e+03  -5.147 2.67e-07 ***
Bdrms0                      1.155e+05  1.854e+04   6.227 4.82e-10 ***
Bdrms1                      7.894e+04  1.016e+04   7.766 8.41e-15 ***
Bdrms2                      8.331e+04  9.086e+03   9.170  < 2e-16 ***
Bdrms3                      8.791e+04  9.033e+03   9.733  < 2e-16 ***
Bdrms4                      7.905e+04  9.001e+03   8.782  < 2e-16 ***
Bdrms5                      7.507e+04  9.002e+03   8.340  < 2e-16 ***
Bdrms6                      6.149e+04  9.006e+03   6.828 8.80e-12 ***
Bdrms7                      2.462e+04  9.572e+03   2.573 0.010100 *  
Bdrms8                      1.242e+04  1.008e+04   1.233 0.217672    
Fbath0                     -1.936e+04  1.388e+04  -1.395 0.162964    
Fbath1                     -1.050e+04  9.795e+03  -1.072 0.283695    
Fbath2                      7.289e+03  9.751e+03   0.748 0.454721    
Fbath3                      2.935e+04  9.641e+03   3.044 0.002339 ** 
Fbath4                      6.757e+04  1.025e+04   6.595 4.35e-11 ***
Lotsize                     1.419e+00  1.131e-01  12.545  < 2e-16 ***
Sale_date                   4.839e+00  2.603e-01  18.587  < 2e-16 ***
district_3TRUE              9.189e+05  1.373e+05   6.693 2.23e-11 ***
district_4TRUE              1.088e+06  2.504e+05   4.346 1.39e-05 ***
district_15TRUE             4.475e+05  1.212e+05   3.692 0.000223 ***
Year_Built:district_3TRUE  -5.100e+02  7.205e+01  -7.078 1.51e-12 ***
Lotsize:district_3TRUE      1.227e+01  4.803e-01  25.549  < 2e-16 ***
Lotsize:district_4TRUE     -3.796e+00  1.569e+00  -2.420 0.015512 *  
Year_Built:district_4TRUE  -5.558e+02  1.306e+02  -4.255 2.09e-05 ***
Year_Built:district_15TRUE -2.829e+02  6.295e+01  -4.493 7.05e-06 ***
Lotsize:district_15TRUE     3.140e+00  1.213e+00   2.589 0.009620 ** 
Fin_sqft:district_3TRUE     6.621e+01  1.517e+00  43.639  < 2e-16 ***
Fin_sqft:district_4TRUE    -2.049e+01  4.181e+00  -4.901 9.61e-07 ***
Fin_sqft:district_15TRUE   -1.161e+01  3.163e+00  -3.672 0.000241 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43870 on 24429 degrees of freedom
Multiple R-squared:  0.7301,    Adjusted R-squared:  0.7296 
F-statistic:  1468 on 45 and 24429 DF,  p-value: < 2.2e-16
summ2$adj.r.squared
[1] 0.7295574
# Compute residual analysis


MSE2=summ2$sigma^2
qqnorm(student_res2,pch=22,bg=1)
abline(0,1)

hist(student_res2,freq=F,breaks=100)
curve(dnorm(x,0,1),add=T)

# hist(student_res2,freq=F,breaks=100)
# curve(dnorm(x,0,1),add=T)
plot(model2$fitted.values,student_res2,pch=22,bg=1)
abline(h=0)

# First measure

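X=model.matrix(model2)
# Note: this forms the full n x n hat matrix, which needs a lot of memory for n this large;
# hatvalues(model2) would return the diagonal directly
hat=X%*%solve(t(X)%*%X)%*%t(X)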

# diag(hat)
p=ncol(X)
n=nrow(X)
out_1=which(diag(hat)>2*p/n)
plot(sort(diag(hat)[out_1]))
abline(h=2*p/n)

# I would still look at those after the elbow
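
For a data set of this size (\(n=24475\), that is, 24429 residual degrees of freedom plus 46 coefficients), the full \(n\times n\) hat matrix has roughly \(6\times 10^8\) entries, or about 4.8 GB in double precision, although only its diagonal is needed. The sketch below uses simulated data, since the housing data are not reproduced here, and hypothetical names. It shows two ways to get the leverages without forming \(H\): the built-in `hatvalues()`, and the row sums of squares of the thin \(Q\) factor from a QR decomposition of \(X\).

```r
set.seed(1)
n <- 2000
sim <- data.frame(
  x1 = rnorm(n),
  x2 = runif(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 + g, data = sim)

# Leverages without building the n x n hat matrix
h_builtin <- hatvalues(fit)

# Same quantity from the thin QR factor: H = Q Q', so h_ii = sum_j Q_ij^2
Q <- qr.Q(qr(model.matrix(fit)))
h_qr <- rowSums(Q^2)

all.equal(unname(h_builtin), unname(h_qr))

p <- ncol(model.matrix(fit))
flagged <- which(h_builtin > 2 * p / n)   # rule-of-thumb leverage points
length(flagged)
```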



# Cooks distances
CDS=cooks.distance(model2)
plot(sort(CDS,T)[1:100])

which(CDS>1)
22532 
22398 
max(CDS)
[1] 1.075144
df[CDS>1,]
      District Extwall Stories Year_Built Fin_sqft Units Bdrms Fbath Lotsize
22532        3   Block       1       1960     4323     1     4     3   72480
      Sale_date Sale_price District District District
22532     17713    1250000     TRUE    FALSE    FALSE
# I would still look at those two values that are far from the other distances
# I would still look at those before the elbow



# Depth functions require numeric input, so we can either keep only the non-factor columns
numer=NULL
for(i in names(df)){
  if(!is.factor(df[1,i])){
    numer=c(numer,i)
  }
}
numer
[1] "District"    "Year_Built"  "Fin_sqft"    "Lotsize"     "Sale_date"  
[6] "Sale_price"  "district_3"  "district_4"  "district_15"
df_mat=as.matrix(df[,numer])
depths=ddalpha::depth.projection(df_mat,df_mat)
which(depths<0.1)[1:10]
 [1]   5 130 326 471 473 593 637 669 673 735
plot(sort(depths,F)[1:100])

# Notice there is a gap around 0.035; I would look at those observations
plot(sort(depths,F)[1:1000])

plot(sort(depths,F))

# OR use the design matrix (factors dummy-coded) together with the response
depths=ddalpha::depth.projection(cbind(X,df$Sale_price),cbind(X,df$Sale_price))
which(depths<0.1)[1:10]
 [1]  2  3  4  5  7 13 15 16 17 20
plot(sort(depths,F)[1:100])

which.max(diag(hat))
22532 
22398 
which.max(CDS)
22532 
22398 
which.min(depths)
[1] 22398
# Hugely expensive home!
df[which.min(depths),]
      District Extwall Stories Year_Built Fin_sqft Units Bdrms Fbath Lotsize
22532        3   Block       1       1960     4323     1     4     3   72480
      Sale_date Sale_price District District District
22532     17713    1250000     TRUE    FALSE    FALSE
df[which.max(CDS),]
      District Extwall Stories Year_Built Fin_sqft Units Bdrms Fbath Lotsize
22532        3   Block       1       1960     4323     1     4     3   72480
      Sale_date Sale_price District District District
22532     17713    1250000     TRUE    FALSE    FALSE
df[which.max(diag(hat)),]
      District Extwall Stories Year_Built Fin_sqft Units Bdrms Fbath Lotsize
22532        3   Block       1       1960     4323     1     4     3   72480
      Sale_date Sale_price District District District
22532     17713    1250000     TRUE    FALSE    FALSE
model3=lm(Sale_price~.-district_3-district_4-district_15+district_3*Year_Built+district_3*Lotsize+district_4*Lotsize+district_4*Year_Built+district_15*Year_Built+district_15*Lotsize+district_3*Fin_sqft+district_4*Fin_sqft+district_15*Fin_sqft,df[-(order(depths)[1:100]),])

# OR 

model4=lm(Sale_price~.-district_3-district_4-district_15+district_3*Year_Built+district_3*Lotsize+district_4*Lotsize+district_4*Year_Built+district_15*Year_Built+district_15*Lotsize+district_3*Fin_sqft+district_4*Fin_sqft+district_15*Fin_sqft,df[-(order(CDS,decreasing = T)[1:100]),])

# Compare
s=summary(model3)
summary(model3)

Call:
lm(formula = Sale_price ~ . - district_3 - district_4 - district_15 + 
    district_3 * Year_Built + district_3 * Lotsize + district_4 * 
    Lotsize + district_4 * Year_Built + district_15 * Year_Built + 
    district_15 * Lotsize + district_3 * Fin_sqft + district_4 * 
    Fin_sqft + district_15 * Fin_sqft, data = df[-(order(depths)[1:100]), 
    ])

Residuals:
    Min      1Q  Median      3Q     Max 
-275030  -24260   -1069   23029  411038 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -1.062e+06  3.577e+04 -29.689  < 2e-16 ***
District                    5.683e+03  7.309e+01  77.760  < 2e-16 ***
ExtwallBlock               -2.139e+03  3.506e+03  -0.610 0.541686    
ExtwallBrick                6.923e+03  6.907e+02  10.024  < 2e-16 ***
ExtwallFiber-Cement         4.220e+04  3.577e+03  11.797  < 2e-16 ***
ExtwallFrame               -4.770e+03  9.303e+02  -5.128 2.96e-07 ***
ExtwallMasonry / Frame      4.223e+03  1.636e+03   2.582 0.009828 ** 
ExtwallPrem Wood            2.134e+04  5.348e+03   3.991 6.59e-05 ***
ExtwallStone                1.034e+04  1.458e+03   7.092 1.36e-12 ***
ExtwallStucco               5.659e+03  2.057e+03   2.751 0.005948 ** 
Stories1                   -1.987e+04  1.027e+04  -1.934 0.053066 .  
Stories1.5                 -6.665e+03  1.026e+04  -0.650 0.515852    
Stories2                   -3.784e+03  1.023e+04  -0.370 0.711314    
Year_Built                  4.521e+02  1.538e+01  29.398  < 2e-16 ***
Fin_sqft                    5.704e+01  1.038e+00  54.976  < 2e-16 ***
Units1                      5.644e+04  7.049e+03   8.006 1.24e-15 ***
Units2                     -9.869e+03  7.045e+03  -1.401 0.161282    
Units3                     -4.277e+04  7.578e+03  -5.644 1.68e-08 ***
Bdrms0                      9.518e+04  1.773e+04   5.369 8.01e-08 ***
Bdrms1                      5.351e+04  9.890e+03   5.410 6.35e-08 ***
Bdrms2                      5.984e+04  8.891e+03   6.731 1.73e-11 ***
Bdrms3                      6.531e+04  8.840e+03   7.388 1.54e-13 ***
Bdrms4                      5.701e+04  8.809e+03   6.472 9.83e-11 ***
Bdrms5                      5.396e+04  8.808e+03   6.126 9.14e-10 ***
Bdrms6                      4.182e+04  8.810e+03   4.747 2.07e-06 ***
Bdrms7                      1.719e+04  9.384e+03   1.832 0.066973 .  
Bdrms8                      6.934e+03  9.886e+03   0.701 0.483047    
Fbath0                     -1.844e+04  1.524e+04  -1.210 0.226294    
Fbath1                     -7.378e+03  1.205e+04  -0.612 0.540357    
Fbath2                      1.114e+04  1.202e+04   0.927 0.353926    
Fbath3                      3.273e+04  1.197e+04   2.735 0.006250 ** 
Fbath4                      4.163e+04  1.265e+04   3.292 0.000998 ***
Lotsize                     1.669e+00  1.139e-01  14.653  < 2e-16 ***
Sale_date                   4.682e+00  2.475e-01  18.917  < 2e-16 ***
district_3TRUE              9.129e+05  1.355e+05   6.737 1.65e-11 ***
district_4TRUE              1.058e+06  2.425e+05   4.365 1.28e-05 ***
district_15TRUE             4.870e+05  1.167e+05   4.173 3.01e-05 ***
Year_Built:district_3TRUE  -4.924e+02  7.127e+01  -6.908 5.03e-12 ***
Lotsize:district_3TRUE      1.496e+01  8.517e-01  17.566  < 2e-16 ***
Lotsize:district_4TRUE     -9.878e-01  2.307e+00  -0.428 0.668598    
Year_Built:district_4TRUE  -5.537e+02  1.271e+02  -4.356 1.33e-05 ***
Year_Built:district_15TRUE -3.057e+02  6.079e+01  -5.029 4.97e-07 ***
Lotsize:district_15TRUE     4.600e+00  1.240e+00   3.709 0.000208 ***
Fin_sqft:district_3TRUE     4.606e+01  1.682e+00  27.383  < 2e-16 ***
Fin_sqft:district_4TRUE    -1.509e+01  4.811e+00  -3.136 0.001717 ** 
Fin_sqft:district_15TRUE   -1.299e+01  3.012e+00  -4.312 1.62e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41620 on 24329 degrees of freedom
Multiple R-squared:  0.6746,    Adjusted R-squared:  0.674 
F-statistic:  1121 on 45 and 24329 DF,  p-value: < 2.2e-16
# Notice that some of the coefficients moved several standard errors! This is a huge change: recall that a value more than about 2 SE from the estimate lies outside the 95% confidence interval.
sort(abs(model3$coefficients-model2$coefficients)/s$coefficients[,2],T)
   Fin_sqft:district_3TRUE                   Fin_sqft 
               11.97605245                 3.31639566 
    Lotsize:district_3TRUE                     Bdrms2 
                3.15994935                 2.64035516 
                    Bdrms1                     Bdrms3 
                2.57156818                 2.55739988 
                    Bdrms4                     Bdrms5 
                2.50157649                 2.39726109 
                    Bdrms6                    Lotsize 
                2.23266080                 2.19756713 
                    Fbath4                   Stories1 
                2.05158670                 1.97035162 
                Stories1.5                (Intercept) 
                1.93341261                 1.87240899 
                  Stories2     Lotsize:district_4TRUE 
                1.78952063                 1.21723628 
   Lotsize:district_15TRUE                     Bdrms0 
                1.17739379                 1.14446105 
   Fin_sqft:district_4TRUE        ExtwallFiber-Cement 
                1.12329496                 1.04407021 
             ExtwallStucco               ExtwallStone 
                1.04255877                 0.87854123 
                    Bdrms7                 Year_Built 
                0.79208704                 0.70504462 
                 Sale_date                     Bdrms8 
                0.63434805                 0.55499651 
              ExtwallBlock   Fin_sqft:district_15TRUE 
                0.51482838                 0.45620086 
          ExtwallPrem Wood Year_Built:district_15TRUE 
                0.44975422                 0.37580752 
           district_15TRUE                     Fbath2 
                0.33901662                 0.32042867 
    ExtwallMasonry / Frame               ExtwallBrick 
                0.28873525                 0.28772014 
                    Fbath3                     Fbath1 
                0.28254847                 0.25918539 
                    Units3  Year_Built:district_3TRUE 
                0.25090184                 0.24682136 
              ExtwallFrame                     Units1 
                0.23214386                 0.19080464 
                  District             district_4TRUE 
                0.15694852                 0.12256690 
                    Fbath0             district_3TRUE 
                0.06035946                 0.04448568 
 Year_Built:district_4TRUE                     Units2 
                0.01715649                 0.01349622 

How should we treat influential observations? The easiest course of action is removal. If there are many influential observations, then you might want to try robust model fitting methods, which automatically account for outliers and influential observations.
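
As one illustration of a robust fit, the sketch below uses `rlm()` from the `MASS` package with `method = "MM"`. This is my choice of estimator for the example, not one prescribed by these notes, and the simulated data and names are hypothetical. A single remote point with an inconsistent response is planted in otherwise well-behaved data, and the ordinary and robust slopes are compared against the true value of 2.

```r
library(MASS)  # provides rlm()

set.seed(2024)
x <- c(rnorm(30), 6)                   # 30 ordinary points plus one remote point
y <- 3 + 2 * x + rnorm(31, sd = 0.5)
y[31] <- y[31] + 15                    # response inconsistent with the other points

fit_ols <- lm(y ~ x)
fit_mm  <- rlm(y ~ x, method = "MM")   # MM-estimation as implemented in MASS

# Slopes: least squares is pulled towards the planted point, the robust fit much less so
c(ols = unname(coef(fit_ols)[2]), mm = unname(coef(fit_mm)[2]))

# Final robust weight of the planted point (small values mean it was downweighted)
round(fit_mm$w[31], 3)
```

The appeal of this route is that no observation has to be deleted by hand; the fitting procedure decides how much weight each point receives.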

7.4 Homework questions

Complete the Chapter 6 textbook questions.

Exercise 7.1 What are the three methods we have learned for detecting influential/leverage points?

Exercise 7.2 Compute the hat values, Cook’s distances and the depth values for the body weight example. Are there any influential/leverage points/outliers?

Exercise 7.3 Compute the hat values, Cook’s distances and the depth values for the cars example. Are there any outliers/influential/leverage points?

Exercise 7.4 Fit a model of your choosing, without location, to the real estate data. Compute the hat values, Cook’s distances and the depth values for this model. Are there any influential/leverage points/outliers? Print out the influential/leverage points/outliers. Why do you think they are outlying? Should we remove them?