8  Multicollinearity

8.1 Multicollinearity and the problems it creates

A serious problem that may dramatically impact the usefulness of a regression model is multicollinearity: near-linear dependence among the regression variables. The regressors are the columns of the \(X\) matrix, so an exact linear dependence among the regressors would result in a singular \(X^\top X\). This destroys our ability to estimate \(\beta\).

To elaborate, assume that the regressor variables and the response have been centered and scaled to unit length. The matrix \(X^\top X\) is then a \(p\times p\) correlation matrix of the regressors, and \(X^\top Y\) is the vector of correlations between the regressors and the response. Recall that a set of vectors \(v_1,\ldots,v_p\) is linearly dependent if there exists \(c\neq 0\) such that \(\sum_{i=1}^pc_iv_i=0\). If the columns of \(X\) are linearly dependent, then \(X^\top X\) is not invertible! We say there is multicollinearity if there exists a unit vector \(c\) such that \(\left\lVert\sum_{i=1}^pc_iv_i\right\rVert<\epsilon\) for some small \(\epsilon\), where \(v_1,\ldots,v_p\) are the columns of \(X\).
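We can see the singularity directly. Below is a minimal sketch with simulated (hypothetical) data: one column is an exact multiple of another, so \(X^\top X\) loses rank and cannot be inverted.

```r
# Exact linear dependence among the columns of X makes X'X singular.
set.seed(1)
x1 <- rnorm(50)
x2 <- 3 * x1              # exactly dependent on x1
X  <- cbind(1, x1, x2)
XtX <- crossprod(X)       # X'X
det(XtX)                  # numerically ~0
qr(XtX)$rank              # rank 2 < 3, so solve(XtX) would fail
```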

Multicollinearity results in large variances and covariances for the least-squares estimators of the regression coefficients. Let \(A=(X^\top X)^{-1}\), where the regressors have been centered and scaled to unit length. That is, columns \(2,\ldots,p\) of \(X\) have their mean subtracted and are divided by their respective norms. Then \[A_{jj}=(1-R^2_j)^{-1},\] where \(R^2_j\) is the coefficient of multiple determination from the regression of \(X_j\) on the remaining \(p - 1\) regressor variables. Now, recall that \({\textrm{Var}}\left[\hat\beta_j\right]=A_{jj}\sigma^2\). What happens when the correlation between \(X_j\) and another regressor is approximately 1? It is easy to see that the variance of coefficient \(j\) goes to \(\infty\) as \({\textrm{corr}}\left[X_j,X_i\right]\to 1\).
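The identity \(A_{jj}=(1-R^2_j)^{-1}\) is easy to verify numerically. A small sketch with simulated data (the correlation 0.9 is an arbitrary choice):

```r
# With centered, unit-length regressors, the j-th diagonal of (X'X)^{-1}
# equals 1/(1 - R_j^2), where R_j^2 comes from regressing X_j on the rest.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
unit <- function(x) { x <- x - mean(x); x / sqrt(sum(x^2)) }
X <- cbind(unit(x1), unit(x2))
A <- solve(crossprod(X))
R2_1 <- summary(lm(x1 ~ x2))$r.squared
c(A[1, 1], 1 / (1 - R2_1))   # the two numbers agree
```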

This huge variance results in least-squares estimates of the regression coefficients that are large in magnitude. We have that \({\textrm{E}}\left[\left\lVert\hat\beta-\beta\right\rVert^2\right]=\sum_{i=1}^p {\textrm{Var}}\left[\hat\beta_i\right]=\sigma^2 \mathop{\mathrm{trace}}(X^\top X)^{-1}\). Recall that the trace of a matrix is the sum of its eigenvalues. If \(X\) has near linearly dependent columns, then some of the eigenvalues \(\lambda_1,\ldots,\lambda_p\) of \(X^\top X\) will be near 0 (why?). Thus, \[{\textrm{E}}\left[\left\lVert\hat\beta-\beta\right\rVert^2\right]=\sigma^2\mathop{\mathrm{trace}}(X^\top X)^{-1}=\sigma^2\sum_{i=1}^p\frac{1}{\lambda_i},\] which blows up as any \(\lambda_i\to 0\). We can also show that \[{\textrm{E}}\left[\hat{\beta}^{\top} \hat{\beta}\right]=\beta^{\top} \beta+\sigma^2 \mathop{\mathrm{trace}}\left(X^{\top} X\right)^{-1},\] which gives the same interpretation.
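The trace identity used above is quickly checked on an arbitrary simulated design matrix:

```r
# trace((X'X)^{-1}) equals the sum of the reciprocal eigenvalues of X'X.
set.seed(3)
X <- matrix(rnorm(200), 50, 4)
M <- crossprod(X)
sum(diag(solve(M)))          # trace of the inverse
sum(1 / eigen(M)$values)     # the same number
```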

We can also observe this empirically.

# Let's simulate data from a regression model with highly correlated regressors. 
simulate_coef=function(){
  n=100
  X=rnorm(n)
  X2=2*X+rnorm(n,0,0.01)
  Y=1+2*X+X2*2+rnorm(n)
  return(coef(lm(Y~X+X2)))
}

# See the HUGE variance in the estimated coefficients? This is from multicollinearity!
CF=replicate(1000,simulate_coef())
hist(CF[1,],xlab='Intercept')

hist(CF[2,],xlab='Coefficient of X')

hist(CF[3,],xlab='Coefficient of X2')

It is easy to see from this analysis that multicollinearity is a serious problem, and we should check for it in regression modelling.

8.2 Multicollinearity Diagnostics

8.2.1 VIF

We need some diagnostics to detect multicollinearity. The first of these is the correlation matrix of the regressors, which is good for detecting pairwise correlations, but not so much for more complicated dependencies! To correct this, a popular diagnostic is the variance inflation factor (VIF): these are the diagonals of \((X^\top X)^{-1}\), where the regressors have been centered and scaled to unit length. A VIF that exceeds 3, 5 or 10 is an indication that the associated regression coefficients are poorly estimated because of multicollinearity. The variance inflation factor can be written as \((1-R^2_j)^{-1}\), where \(R^2_j\) is the coefficient of determination obtained from regressing \(x_j\) on the other regressor variables. For categorical variables, we may look at their VIF together, instead of for the individual dummy variables. This is done via the generalized VIF (GVIF), which was developed by our very own Georges Monette and John Fox (Fox and Monette 1992). We can consider \(\mathrm{GVIF}^{1/(2\,\mathrm{df})}\), where \(\mathrm{df}\) is the number of dummy variables for the term. However, we then want to compare to the square root of our rules of thumb, so \(\sqrt{3},\sqrt{5},\sqrt{10}\). The values \(\mathrm{GVIF}^{1/(2\,\mathrm{df})}\) are computed automatically in R. When your model has interaction effects or polynomial terms, it is best to exclude those before computing the VIFs. In summary:

  • Compare the VIF of continuous variables to 3, 5 or 10; values above those signal multicollinearity
  • Compare the \(\mathrm{GVIF}^{1/(2\,\mathrm{df})}\) of categorical variables to \(\sqrt{3},\sqrt{5},\text{ or }\sqrt{10}\); values above those signal multicollinearity
  • When computing VIF and GVIF, it is best to exclude interaction effects and polynomial terms
  • The given thresholds are rules of thumb, and we should not discard evidence of multicollinearity if we are very close to a threshold, e.g., 9.9.
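To connect the two formulations, here is a minimal sketch on simulated data: the VIF of a regressor computed from its auxiliary regression matches the diagonal of the inverse correlation matrix (the variable names and the noise level 0.2 are arbitrary choices).

```r
# VIF two ways: 1/(1 - R_j^2) from the auxiliary regression of x_j on the
# other regressors, and the j-th diagonal of the inverse correlation matrix.
set.seed(4)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, 0, 0.2)   # x3 is nearly determined by x1 and x2
Xr <- cbind(x1, x2, x3)
vif_aux <- 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
vif_cor <- diag(solve(cor(Xr)))["x3"]
c(vif_aux, vif_cor)   # both large, signalling multicollinearity
```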

8.2.2 Condition number

One can also look at the eigenvalues of \(X^\top X\), where the regressors are centered and normalized to unit length. Small eigenvalues indicate multicollinearity. One metric computed from the eigenvalues is the condition number of \(X^\top X\): \(\kappa=\max_j \lambda_j/\min_j \lambda_j\). Condition numbers between 100 and 1000 imply moderate to strong multicollinearity, and if \(\kappa\) exceeds 1000, severe multicollinearity is indicated. Diagonalizing via \(X^\top X=V \Lambda V^\top\), where the columns of \(V\) are the eigenvectors and \(\Lambda=\mathop{\mathrm{diag}}(\lambda_1,\ldots,\lambda_p)\), helps us determine what the exact dependence between the variables is. You can check the eigenvectors associated with the small eigenvalues: components that are large in such an eigenvector indicate that the corresponding variable is contributing to the multicollinearity. Again, for this computation, it is best to exclude interaction effects and polynomial terms.
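A short sketch of this recipe on simulated data (the near-dependence x3 ≈ x1 − x2 is planted on purpose):

```r
# Condition number of X'X for centered, unit-length regressors, and the
# eigenvector of the smallest eigenvalue, which flags the dependent set.
set.seed(5)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 - x2 + rnorm(n, 0, 0.05)
unit <- function(x) { x <- x - mean(x); x / sqrt(sum(x^2)) }
X <- apply(cbind(x1, x2, x3), 2, unit)
e <- eigen(crossprod(X))
max(e$values) / min(e$values)   # condition number: very large here
round(e$vectors[, 3], 2)        # large weights on x1, x2, x3: the culprits
```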

Note

While the method of least squares will generally produce poor estimates of the individual model parameters when strong multicollinearity is present, this does not necessarily imply that the fitted model is a poor predictor.

  1. If predictions are confined to regions of the \(X\)-space where the multicollinearity holds approximately, the fitted model often produces satisfactory predictions.
  2. The linear combinations \(X\beta\) may be estimated well, even if \(\beta\) is not.
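Point 2 can be checked by extending the earlier simulation: even though the individual coefficients are wildly variable, the fitted value at a point consistent with the collinearity is stable (the prediction point X = 1, X2 = 2 lies on the line X2 = 2X that the data follow).

```r
# Coefficients are poorly estimated, but x'beta-hat inside the data region
# is not: predictions at collinearity-consistent points have small variance.
set.seed(6)
sim_pred <- function() {
  n  <- 100
  X  <- rnorm(n)
  X2 <- 2 * X + rnorm(n, 0, 0.01)
  Y  <- 1 + 2 * X + 2 * X2 + rnorm(n)
  fit <- lm(Y ~ X + X2)
  predict(fit, data.frame(X = 1, X2 = 2))  # true mean is 1 + 2 + 4 = 7
}
preds <- replicate(500, sim_pred())
c(mean(preds), sd(preds))   # centered near 7, with small spread
```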

Example 8.1 Recall Example 6.6. Check for multicollinearity in the proposed models.

I will fit the proposed models below (the data were loaded in Example 6.6):

model2=lm(Sale_price~.+District*Year_Built+District*Lotsize+District*Fin_sqft
          ,df)

X=model.matrix(model2)
depths=ddalpha::depth.projection(cbind(X,df$Sale_price),cbind(X,df$Sale_price))

# Refit after removing the 100 most outlying (least deep) observations
model3=lm(Sale_price~.+District*Year_Built+District*Lotsize+District*Fin_sqft
         ,df[-(order(depths)[1:100]),])

summary(model3)

Call:
lm(formula = Sale_price ~ . + District * Year_Built + District * 
    Lotsize + District * Fin_sqft, data = df[-(order(depths)[1:100]), 
    ])

Residuals:
    Min      1Q  Median      3Q     Max 
-534806  -27348     620   26676  331590 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             3.779e+05  7.621e+04   4.959 7.14e-07 ***
District               -9.906e+04  7.305e+03 -13.560  < 2e-16 ***
ExtwallBlock           -2.804e+03  4.300e+03  -0.652 0.514322    
ExtwallBrick            7.301e+03  8.448e+02   8.642  < 2e-16 ***
ExtwallFiber-Cement     3.649e+04  4.353e+03   8.384  < 2e-16 ***
ExtwallFrame           -4.668e+03  1.139e+03  -4.100 4.15e-05 ***
ExtwallMasonry / Frame  4.130e+03  1.999e+03   2.066 0.038834 *  
ExtwallPrem Wood        1.399e+04  6.600e+03   2.119 0.034097 *  
ExtwallStone            4.754e+03  1.785e+03   2.664 0.007732 ** 
ExtwallStucco           1.406e+04  2.517e+03   5.585 2.36e-08 ***
Stories1                1.458e+04  1.199e+04   1.215 0.224321    
Stories1.5              2.675e+04  1.197e+04   2.234 0.025488 *  
Stories2                3.379e+04  1.192e+04   2.834 0.004601 ** 
Year_Built             -3.743e+02  3.746e+01  -9.991  < 2e-16 ***
Fin_sqft                1.116e+02  1.577e+00  70.780  < 2e-16 ***
Units1                  9.867e+04  8.560e+03  11.527  < 2e-16 ***
Units2                  2.209e+04  8.559e+03   2.580 0.009874 ** 
Units3                 -1.509e+04  9.218e+03  -1.637 0.101597    
Bdrms0                  1.281e+05  2.162e+04   5.927 3.12e-09 ***
Bdrms1                  1.024e+05  1.192e+04   8.588  < 2e-16 ***
Bdrms2                  1.090e+05  1.068e+04  10.200  < 2e-16 ***
Bdrms3                  1.122e+05  1.061e+04  10.567  < 2e-16 ***
Bdrms4                  1.007e+05  1.057e+04   9.528  < 2e-16 ***
Bdrms5                  9.913e+04  1.057e+04   9.380  < 2e-16 ***
Bdrms6                  8.065e+04  1.058e+04   7.620 2.64e-14 ***
Bdrms7                  5.916e+04  1.130e+04   5.236 1.66e-07 ***
Bdrms8                  6.590e+04  1.190e+04   5.540 3.06e-08 ***
Fbath0                 -1.795e+04  1.794e+04  -1.001 0.317003    
Fbath1                 -5.482e+03  1.387e+04  -0.395 0.692759    
Fbath2                  1.495e+04  1.382e+04   1.082 0.279172    
Fbath3                  4.888e+04  1.377e+04   3.550 0.000386 ***
Fbath4                  5.684e+04  1.464e+04   3.882 0.000104 ***
Lotsize                 4.904e-01  3.018e-01   1.625 0.104159    
Sale_date               4.706e+00  3.035e-01  15.507  < 2e-16 ***
District:Year_Built     5.551e+01  3.773e+00  14.715  < 2e-16 ***
District:Lotsize        1.349e-01  3.289e-02   4.102 4.11e-05 ***
District:Fin_sqft      -4.820e+00  1.407e-01 -34.250  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51060 on 24338 degrees of freedom
Multiple R-squared:  0.4875,    Adjusted R-squared:  0.4867 
F-statistic:   643 on 36 and 24338 DF,  p-value: < 2.2e-16
# Correlations 
X=model.matrix(model3)
corrplot::corrplot(cor(X[,-1]))

corrplot::corrplot(cor(X[,-(1:23)]))

# VIFS
# predictor removes the interactions
car::vif(model3,type = 'predictor')
GVIFs computed for predictors
                   GVIF Df GVIF^(1/(2*Df))                Interacts With
District       5.706956  7        1.132476 Year_Built, Lotsize, Fin_sqft
Extwall        1.407080  8        1.021574                          --  
Stories        2.843010  3        1.190227                          --  
Year_Built    25.788359  3        1.718847                      District
Fin_sqft   38788.502554  3        5.818136                      District
Units          3.316301  3        1.221169                          --  
Bdrms          4.191319  9        1.082867                          --  
Fbath          3.176910  5        1.122537                          --  
Lotsize    12001.808204  3        4.784917                      District
Sale_date      1.018315  1        1.009116                          --  
                                                                             Other Predictors
District                                     Extwall, Stories, Units, Bdrms, Fbath, Sale_date
Extwall      District, Stories, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Lotsize, Sale_date
Stories      District, Extwall, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Lotsize, Sale_date
Year_Built                Extwall, Stories, Fin_sqft, Units, Bdrms, Fbath, Lotsize, Sale_date
Fin_sqft                Extwall, Stories, Year_Built, Units, Bdrms, Fbath, Lotsize, Sale_date
Units      District, Extwall, Stories, Year_Built, Fin_sqft, Bdrms, Fbath, Lotsize, Sale_date
Bdrms      District, Extwall, Stories, Year_Built, Fin_sqft, Units, Fbath, Lotsize, Sale_date
Fbath      District, Extwall, Stories, Year_Built, Fin_sqft, Units, Bdrms, Lotsize, Sale_date
Lotsize                Extwall, Stories, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Sale_date
Sale_date      District, Extwall, Stories, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Lotsize
model4=lm(Sale_price~Extwall +Stories+Year_Built+Fin_sqft+Units+Bdrms+Fbath+Sale_date+District*Year_Built+District+District*Fin_sqft
         ,df[-(order(depths)[1:100]),])

car::vif(model4,type = 'predictor')
GVIFs computed for predictors
                   GVIF Df GVIF^(1/(2*Df))       Interacts With
Extwall        1.393341  8        1.020948                 --  
Stories        2.798074  3        1.187071                 --  
Year_Built    10.370804  3        1.476733             District
Fin_sqft   30242.739803  3        5.581748             District
Units          3.277628  3        1.218784                 --  
Bdrms          4.148440  9        1.082248                 --  
Fbath          3.174973  5        1.122468                 --  
Sale_date      1.018192  1        1.009055                 --  
District       5.330306  5        1.182157 Year_Built, Fin_sqft
                                                                    Other Predictors
Extwall      Stories, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Sale_date, District
Stories      Extwall, Year_Built, Fin_sqft, Units, Bdrms, Fbath, Sale_date, District
Year_Built                Extwall, Stories, Fin_sqft, Units, Bdrms, Fbath, Sale_date
Fin_sqft                Extwall, Stories, Year_Built, Units, Bdrms, Fbath, Sale_date
Units      Extwall, Stories, Year_Built, Fin_sqft, Bdrms, Fbath, Sale_date, District
Bdrms      Extwall, Stories, Year_Built, Fin_sqft, Units, Fbath, Sale_date, District
Fbath      Extwall, Stories, Year_Built, Fin_sqft, Units, Bdrms, Sale_date, District
Sale_date      Extwall, Stories, Year_Built, Fin_sqft, Units, Bdrms, Fbath, District
District                            Extwall, Stories, Units, Bdrms, Fbath, Sale_date
model4=lm(Sale_price~Extwall +Stories+Year_Built+
            Lotsize+Units+Bdrms+Fbath+Sale_date+District*Year_Built+District+District*Lotsize
         ,df[-(order(depths)[1:100]),])

X=model.matrix(model4)
car::vif(model4,type = 'predictor')
GVIFs computed for predictors
                  GVIF Df GVIF^(1/(2*Df))      Interacts With
Extwall       1.278210  8        1.015460                --  
Stories       2.157990  3        1.136776                --  
Year_Built   11.078804  3        1.493077            District
Lotsize    9845.744360  3        4.629578            District
Units         3.240026  3        1.216442                --  
Bdrms         3.115442  9        1.065167                --  
Fbath         2.643774  5        1.102104                --  
Sale_date     1.017814  1        1.008868                --  
District      1.364965  5        1.031602 Year_Built, Lotsize
                                                                   Other Predictors
Extwall      Stories, Year_Built, Lotsize, Units, Bdrms, Fbath, Sale_date, District
Stories      Extwall, Year_Built, Lotsize, Units, Bdrms, Fbath, Sale_date, District
Year_Built                Extwall, Stories, Lotsize, Units, Bdrms, Fbath, Sale_date
Lotsize                Extwall, Stories, Year_Built, Units, Bdrms, Fbath, Sale_date
Units      Extwall, Stories, Year_Built, Lotsize, Bdrms, Fbath, Sale_date, District
Bdrms      Extwall, Stories, Year_Built, Lotsize, Units, Fbath, Sale_date, District
Fbath      Extwall, Stories, Year_Built, Lotsize, Units, Bdrms, Sale_date, District
Sale_date      Extwall, Stories, Year_Built, Lotsize, Units, Bdrms, Fbath, District
District                           Extwall, Stories, Units, Bdrms, Fbath, Sale_date
model4=lm(Sale_price~Extwall +Stories+
            Lotsize+Units+Bdrms+Fbath+Sale_date+District+District*Lotsize
         ,df[-(order(depths)[1:100]),])

car::vif(model4,type = 'predictor')
GVIFs computed for predictors
              GVIF Df GVIF^(1/(2*Df)) Interacts With
Extwall   1.196173  8        1.011258           --  
Stories   2.069115  3        1.128836           --  
Lotsize   1.104506  3        1.016704       District
Units     3.217255  3        1.215013           --  
Bdrms     3.031412  9        1.063550           --  
Fbath     2.629030  5        1.101487           --  
Sale_date 1.017598  1        1.008761           --  
District  1.104506  3        1.016704        Lotsize
                                                      Other Predictors
Extwall     Stories, Lotsize, Units, Bdrms, Fbath, Sale_date, District
Stories     Extwall, Lotsize, Units, Bdrms, Fbath, Sale_date, District
Lotsize               Extwall, Stories, Units, Bdrms, Fbath, Sale_date
Units     Extwall, Stories, Lotsize, Bdrms, Fbath, Sale_date, District
Bdrms     Extwall, Stories, Lotsize, Units, Fbath, Sale_date, District
Fbath     Extwall, Stories, Lotsize, Units, Bdrms, Sale_date, District
Sale_date     Extwall, Stories, Lotsize, Units, Bdrms, Fbath, District
District              Extwall, Stories, Units, Bdrms, Fbath, Sale_date
# Let's go back to the model with the District:Year_Built and District:Lotsize interactions

model4=lm(Sale_price~Extwall +Stories+Year_Built+
            Lotsize+Units+Bdrms+Fbath+Sale_date+District*Year_Built+District+District*Lotsize
         ,df[-(order(depths)[1:100]),])

# remove interactions to compute the condition number

model4=lm(Sale_price~Extwall +Stories+Year_Built+
            Lotsize+Units+Bdrms+Fbath+Sale_date+District
         ,df[-(order(depths)[1:100]),])


X=model.matrix(model4)

# Correlations 
# X=X[,-1]
X2=apply(X,2,function(x){(x-mean(x))/sqrt(sum((x-mean(x))^2))}); dim(X2)
[1] 24375    33
X2[,1]=X[,1]
X=X2
A=eigen(t(X)%*%X)

plot(sort(A$values)[1:10])

which.min(A$values)
[1] 33
dim(A$vectors)
[1] 33 33
#Condition number 
max(A$values)/min(A$values)
[1] 23006492
rownames(A$vectors)=names(model4$coefficients)

# Which components are important here

A$vectors[,which.min(A$values)]
           (Intercept)           ExtwallBlock           ExtwallBrick 
          0.000000e+00           1.183183e-04           5.023350e-04 
   ExtwallFiber-Cement           ExtwallFrame ExtwallMasonry / Frame 
          7.984521e-04          -3.717751e-04           9.720170e-05 
      ExtwallPrem Wood           ExtwallStone          ExtwallStucco 
          4.178118e-05           2.263316e-04          -1.609326e-04 
              Stories1             Stories1.5               Stories2 
         -1.496242e-01          -1.072532e-01          -1.291492e-01 
            Year_Built                Lotsize                 Units1 
         -8.772768e-04           1.317715e-03           3.384157e-02 
                Units2                 Units3                 Bdrms0 
          3.200628e-02           6.437710e-03          -7.642016e-03 
                Bdrms1                 Bdrms2                 Bdrms3 
         -2.705151e-02          -1.508608e-01          -2.224989e-01 
                Bdrms4                 Bdrms5                 Bdrms6 
         -1.881172e-01          -9.964649e-02          -9.455411e-02 
                Bdrms7                 Bdrms8                 Fbath0 
         -2.716034e-02          -2.107422e-02           3.789098e-02 
                Fbath1                 Fbath2                 Fbath3 
          6.220386e-01           6.094723e-01           2.310907e-01 
                Fbath4              Sale_date               District 
          7.411521e-02          -1.357350e-04          -2.682832e-04 
# round(A$vectors[,which.min(A$values)],1)
round(A$vectors[,which.min(A$values)][abs(round(A$vectors[,which.min(A$values)],1))>0],1)
  Stories1 Stories1.5   Stories2     Bdrms2     Bdrms3     Bdrms4     Bdrms5 
      -0.1       -0.1       -0.1       -0.2       -0.2       -0.2       -0.1 
    Bdrms6     Fbath1     Fbath2     Fbath3     Fbath4 
      -0.1        0.6        0.6        0.2        0.1 
min_4=order(A$values)[1:4]
A$vectors[,min_4]
                                [,1]          [,2]          [,3]          [,4]
(Intercept)             0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
ExtwallBlock            1.183183e-04 -1.688281e-04  2.931817e-04 -1.666650e-03
ExtwallBrick            5.023350e-04 -1.838441e-04 -7.588083e-04 -7.154054e-04
ExtwallFiber-Cement     7.984521e-04 -5.053987e-04 -5.463016e-04 -1.952110e-06
ExtwallFrame           -3.717751e-04  3.013905e-04 -7.559034e-04 -4.405188e-03
ExtwallMasonry / Frame  9.720170e-05  5.792381e-04 -6.226611e-04  1.062471e-03
ExtwallPrem Wood        4.178118e-05  6.584509e-07  1.282657e-04 -1.781336e-05
ExtwallStone            2.263316e-04 -2.112436e-04 -1.232160e-04 -3.908048e-04
ExtwallStucco          -1.609326e-04 -7.704310e-04 -5.381685e-05 -1.582375e-03
Stories1               -1.496242e-01 -5.233556e-01 -3.805896e-01  2.338171e-03
Stories1.5             -1.072532e-01 -3.781817e-01 -2.746108e-01  1.055941e-03
Stories2               -1.291492e-01 -4.513938e-01 -3.293708e-01 -2.518288e-03
Year_Built             -8.772768e-04 -2.070787e-04  1.049999e-03  2.622742e-03
Lotsize                 1.317715e-03  5.782571e-04 -1.059107e-03 -6.722199e-04
Units1                  3.384157e-02 -5.649521e-02  6.395877e-02 -6.982272e-01
Units2                  3.200628e-02 -5.429427e-02  6.133402e-02 -6.830494e-01
Units3                  6.437710e-03 -1.072306e-02  1.064539e-02 -1.710868e-01
Bdrms0                 -7.642016e-03  1.293026e-02 -1.477400e-02 -1.160899e-02
Bdrms1                 -2.705151e-02  4.587072e-02 -5.237369e-02 -1.166316e-02
Bdrms2                 -1.508608e-01  2.553062e-01 -2.921990e-01 -5.349358e-02
Bdrms3                 -2.224989e-01  3.768277e-01 -4.309299e-01 -7.858784e-02
Bdrms4                 -1.881172e-01  3.191923e-01 -3.647193e-01 -6.667615e-02
Bdrms5                 -9.964649e-02  1.698822e-01 -1.937604e-01 -3.500891e-02
Bdrms6                 -9.455411e-02  1.626039e-01 -1.853779e-01 -3.367020e-02
Bdrms7                 -2.716034e-02  4.998053e-02 -5.836753e-02 -1.518320e-02
Bdrms8                 -2.107422e-02  3.849246e-02 -4.568050e-02 -1.339025e-02
Fbath0                  3.789098e-02  2.299437e-03 -1.794899e-02  2.950374e-03
Fbath1                  6.220386e-01  3.427433e-02 -2.895872e-01 -7.283055e-04
Fbath2                  6.094723e-01  3.296367e-02 -2.846543e-01  2.088769e-03
Fbath3                  2.310907e-01  1.305409e-02 -1.112123e-01  9.476307e-04
Fbath4                  7.411521e-02  3.446950e-03 -3.524152e-02 -3.875152e-03
Sale_date              -1.357350e-04  7.861120e-04  3.368267e-04  7.174826e-03
District               -2.682832e-04 -3.924635e-04  4.123621e-04  8.067261e-04
# round(A$vectors[,which.min(A$values)],1)
for(i in min_4)
  print(round(A$vectors[,i][abs(round(A$vectors[,i],1))>0],1))
  Stories1 Stories1.5   Stories2     Bdrms2     Bdrms3     Bdrms4     Bdrms5 
      -0.1       -0.1       -0.1       -0.2       -0.2       -0.2       -0.1 
    Bdrms6     Fbath1     Fbath2     Fbath3     Fbath4 
      -0.1        0.6        0.6        0.2        0.1 
  Stories1 Stories1.5   Stories2     Units1     Units2     Bdrms2     Bdrms3 
      -0.5       -0.4       -0.5       -0.1       -0.1        0.3        0.4 
    Bdrms4     Bdrms5     Bdrms6 
       0.3        0.2        0.2 
  Stories1 Stories1.5   Stories2     Units1     Units2     Bdrms1     Bdrms2 
      -0.4       -0.3       -0.3        0.1        0.1       -0.1       -0.3 
    Bdrms3     Bdrms4     Bdrms5     Bdrms6     Bdrms7     Fbath1     Fbath2 
      -0.4       -0.4       -0.2       -0.2       -0.1       -0.3       -0.3 
    Fbath3 
      -0.1 
Units1 Units2 Units3 Bdrms2 Bdrms3 Bdrms4 
  -0.7   -0.7   -0.2   -0.1   -0.1   -0.1 
model5=lm(Sale_price~Extwall +Year_Built+
            Lotsize+Bdrms+Sale_date+District
         ,df[-(order(depths)[1:100]),])


X=model.matrix(model5)

# Correlations 
# X=X[,-1]
X2=apply(X,2,function(x){(x-mean(x))/sqrt(sum((x-mean(x))^2))}); dim(X2)
[1] 24375    22
X2[,1]=X[,1]
X=X2
A=eigen(t(X)%*%X)

plot(sort(A$values)[1:10])

which.min(A$values)
[1] 22
dim(A$vectors)
[1] 22 22
#Condition number 
max(A$values)/min(A$values)
[1] 15006590
rownames(A$vectors)=names(model5$coefficients)

# Which components are important here

A$vectors[,which.min(A$values)]
           (Intercept)           ExtwallBlock           ExtwallBrick 
          0.000000e+00           4.193092e-05          -4.846419e-04 
   ExtwallFiber-Cement           ExtwallFrame ExtwallMasonry / Frame 
         -9.029125e-05          -1.490925e-03          -1.069266e-03 
      ExtwallPrem Wood           ExtwallStone          ExtwallStucco 
          2.531984e-05           9.918108e-05           1.498931e-04 
            Year_Built                Lotsize                 Bdrms0 
          1.320937e-03          -7.893090e-04          -2.238586e-02 
                Bdrms1                 Bdrms2                 Bdrms3 
         -7.538850e-02          -4.188232e-01          -6.179918e-01 
                Bdrms4                 Bdrms5                 Bdrms6 
         -5.242135e-01          -2.799998e-01          -2.678554e-01 
                Bdrms7                 Bdrms8              Sale_date 
         -8.464429e-02          -6.611714e-02           3.865153e-04 
              District 
          5.490140e-04 
# round(A$vectors[,which.min(A$values)],1)
round(A$vectors[,which.min(A$values)][abs(round(A$vectors[,which.min(A$values)],1))>0],1)
Bdrms1 Bdrms2 Bdrms3 Bdrms4 Bdrms5 Bdrms6 Bdrms7 Bdrms8 
  -0.1   -0.4   -0.6   -0.5   -0.3   -0.3   -0.1   -0.1 
# This is fine

Example 8.2 Example 9.1 from the textbook - The Acetylene Data. Below are data concerning the percentage of conversion of \(n\)-heptane to acetylene, along with three explanatory variables. These are typical chemical process data for which a full quadratic response surface in all three regressors is often considered to be an appropriate tentative model. Let’s build the model and see how the extrapolation performs.

########### Example 2 




df <- data.frame(
  Conversion_of_n_Heptane_to_Acetylene = c(49.0, 50.2, 50.5, 48.5, 47.5, 44.5, 28.0, 31.5, 34.5, 35.0, 38.0, 38.5, 15.0, 17.0, 20.5, 29.5),
  Reactor_Temperature_deg_C = c(1300, 1300, 1300, 1300, 1300, 1300, 1200, 1200, 1200, 1200, 1200, 1200, 1100, 1100, 1100, 1100),
  Ratio_of_H2_to_n_Heptane_mole_ratio = c(7.5, 9.0, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 13.5, 17.0, 23.0, 5.3, 7.5, 11.0, 17.0),
  Contact_Time_sec = c(0.0120, 0.0120, 0.0115, 0.0130, 0.0135, 0.0120, 0.0400, 0.0380, 0.0320, 0.0260, 0.0340, 0.0410, 0.0840, 0.0980, 0.0920, 0.0860)
)

# Printing the dataframe
head(df)
  Conversion_of_n_Heptane_to_Acetylene Reactor_Temperature_deg_C
1                                 49.0                      1300
2                                 50.2                      1300
3                                 50.5                      1300
4                                 48.5                      1300
5                                 47.5                      1300
6                                 44.5                      1300
  Ratio_of_H2_to_n_Heptane_mole_ratio Contact_Time_sec
1                                 7.5           0.0120
2                                 9.0           0.0120
3                                11.0           0.0115
4                                13.5           0.0130
5                                17.0           0.0135
6                                23.0           0.0120
# For standardizing via Z scores
unit_norm=function(x){
  x=x-mean(x)
  return(sqrt(length(x)-1)*x/sqrt(sum(x^2)))
}

df2=df

#standardizing the regressors
df2[,2:4]=apply(df[,2:4],2,unit_norm)

df2=data.frame(df2)
corrplot::corrplot(cor(df2))

# Observe the high correlations between Contact time and Reactor temperature
plot(df2$Reactor_Temperature_deg_C,df2$Contact_Time_sec,pch=22,bg=1)

orig=names(df2)
names(df2)=c('P','t','H','C')

df2$t2=df2$t^2
df2$H2=df2$H^2
df2$C2=df2$C^2

RS=lm(P~t+H+C+t*H+t*C+C*H+t2+H2+C2  ,df2)
summary(RS)

Call:
lm(formula = P ~ t + H + C + t * H + t * C + C * H + t2 + H2 + 
    C2, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3499 -0.3411  0.1297  0.5011  0.6720 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  35.8958     1.0916  32.884 5.26e-08 ***
t             4.0038     4.5087   0.888 0.408719    
H             2.7783     0.3071   9.048 0.000102 ***
C            -8.0423     6.0707  -1.325 0.233461    
t2          -12.5236    12.3238  -1.016 0.348741    
H2           -0.9727     0.3746  -2.597 0.040844 *  
C2          -11.5932     7.7063  -1.504 0.183182    
t:H          -6.4568     1.4660  -4.404 0.004547 ** 
t:C         -26.9804    21.0213  -1.283 0.246663    
H:C          -3.7681     1.6553  -2.276 0.063116 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9014 on 6 degrees of freedom
Multiple R-squared:  0.9977,    Adjusted R-squared:  0.9943 
F-statistic: 289.7 on 9 and 6 DF,  p-value: 3.225e-07
# Crazy high VIF
car::vif(RS)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
          t           H           C          t2          H2          C2 
 375.247759    1.740631  680.280039 1762.575365    3.164318 1156.766284 
        t:H         t:C         H:C 
  31.037059 6563.345193   35.611286 
car::vif(RS,type='predictor')
GVIFs computed for predictors
           GVIF Df GVIF^(1/(2*Df)) Interacts With Other Predictors
t  51654.740516  6        2.470354           H, C       t2, H2, C2
H  51654.740516  6        2.470354           t, C       t2, H2, C2
C  51654.740516  6        2.470354           t, H       t2, H2, C2
t2  1762.575365  1       41.983037           --    t, H, C, H2, C2
H2     3.164318  1        1.778853           --    t, H, C, t2, C2
C2  1156.766284  1       34.011267           --    t, H, C, t2, H2
# hidden extrapolation - be careful

# Create a grid to extrapolate over
c_grid=seq(-2,2,l=100)
t_grid=seq(-2,2,l=100)
g=expand.grid(t_grid,c_grid)

# Adding a fixed value of H=-0.6082179
g=cbind(g,rep(-0.6082179,nrow(g)))
# Adding H^2
g=cbind(g,g^2)

# Making new dataframe for predictions
new_dat=data.frame(g)
names(new_dat)=c("t","C","H","t2","C2","H2")

# Predicting the values on the grid
Z=predict.lm(RS,new_dat)

contour(c_grid,t_grid,matrix(Z, ncol = length(t_grid)),lwd=2,labcex=2,ylim=c(-1,1)*2,xlim=c(-1,1)*2)
points(df2$t,df2$C,pch=22,bg=1)

## Notice that as soon as we leave the region with data, the response becomes negative. 
# Recall that it is a percentage and can't be negative
## This shows that even mild extrapolation is dangerous!

Example 8.3 Use the NFL data from the textbook - regress the number of wins against all variables. Check for multicollinearity. Propose a method to resolve the multicollinearity.

#####################################################
#
#           NFL DATA EXAMPLE
#
#####################################################

df=MPV::table.b1

df
    y   x1   x2   x3   x4  x5   x6   x7   x8   x9
1  10 2113 1985 38.9 64.7   4  868 59.7 2205 1917
2  11 2003 2855 38.8 61.3   3  615 55.0 2096 1575
3  11 2957 1737 40.1 60.0  14  914 65.6 1847 2175
4  13 2285 2905 41.6 45.3  -4  957 61.4 1903 2476
5  10 2971 1666 39.2 53.8  15  836 66.1 1457 1866
6  11 2309 2927 39.7 74.1   8  786 61.0 1848 2339
7  10 2528 2341 38.1 65.4  12  754 66.1 1564 2092
8  11 2147 2737 37.0 78.3  -1  761 58.0 1821 1909
9   4 1689 1414 42.1 47.6  -3  714 57.0 2577 2001
10  2 2566 1838 42.3 54.2  -1  797 58.9 2476 2254
11  7 2363 1480 37.3 48.0  19  984 67.5 1984 2217
12 10 2109 2191 39.5 51.9   6  700 57.2 1917 1758
13  9 2295 2229 37.4 53.6  -5 1037 58.8 1761 2032
14  9 1932 2204 35.1 71.4   3  986 58.6 1709 2025
15  6 2213 2140 38.8 58.3   6  819 59.2 1901 1686
16  5 1722 1730 36.6 52.6 -19  791 54.4 2288 1835
17  5 1498 2072 35.3 59.3  -5  776 49.6 2072 1914
18  5 1873 2929 41.1 55.3  10  789 54.3 2861 2496
19  6 2118 2268 38.2 69.6   6  582 58.7 2411 2670
20  4 1775 1983 39.3 78.3   7  901 51.7 2289 2202
21  3 1904 1792 39.7 38.1  -9  734 61.9 2203 1988
22  3 1929 1606 39.7 68.8 -21  627 52.7 2592 2324
23  4 2080 1492 35.5 68.8  -8  722 57.8 2053 2550
24 10 2301 2835 35.3 74.1   2  683 59.7 1979 2110
25  6 2040 2416 38.7 50.0   0  576 54.9 2048 2628
26  8 2447 1638 39.9 57.1  -8  848 65.3 1786 1776
27  2 1416 2649 37.4 56.3 -22  684 43.8 2876 2524
28  0 1503 1503 39.3 47.0  -9  875 53.5 2560 2241
summary(df)
       y                x1             x2             x3              x4       
 Min.   : 0.000   Min.   :1416   Min.   :1414   Min.   :35.10   Min.   :38.10  
 1st Qu.: 4.000   1st Qu.:1896   1st Qu.:1714   1st Qu.:37.38   1st Qu.:52.42  
 Median : 6.500   Median :2111   Median :2106   Median :38.85   Median :57.70  
 Mean   : 6.964   Mean   :2110   Mean   :2127   Mean   :38.64   Mean   :59.40  
 3rd Qu.:10.000   3rd Qu.:2303   3rd Qu.:2474   3rd Qu.:39.70   3rd Qu.:68.80  
 Max.   :13.000   Max.   :2971   Max.   :2929   Max.   :42.30   Max.   :78.30  
       x5               x6               x7              x8      
 Min.   :-22.00   Min.   : 576.0   Min.   :43.80   Min.   :1457  
 1st Qu.: -5.75   1st Qu.: 710.5   1st Qu.:54.77   1st Qu.:1848  
 Median :  1.00   Median : 787.5   Median :58.65   Median :2050  
 Mean   :  0.00   Mean   : 789.9   Mean   :58.16   Mean   :2110  
 3rd Qu.:  6.25   3rd Qu.: 869.8   3rd Qu.:61.10   3rd Qu.:2320  
 Max.   : 19.00   Max.   :1037.0   Max.   :67.50   Max.   :2876  
       x9      
 Min.   :1575  
 1st Qu.:1913  
 Median :2101  
 Mean   :2128  
 3rd Qu.:2328  
 Max.   :2670  
names(df)
 [1] "y"  "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9"
names(df)=c("Wins","RushY","PassY",
            "PuntA","FGP","TurnD",
            "PenY","PerR","ORY","OPY")
summary(df)
      Wins            RushY          PassY          PuntA            FGP       
 Min.   : 0.000   Min.   :1416   Min.   :1414   Min.   :35.10   Min.   :38.10  
 1st Qu.: 4.000   1st Qu.:1896   1st Qu.:1714   1st Qu.:37.38   1st Qu.:52.42  
 Median : 6.500   Median :2111   Median :2106   Median :38.85   Median :57.70  
 Mean   : 6.964   Mean   :2110   Mean   :2127   Mean   :38.64   Mean   :59.40  
 3rd Qu.:10.000   3rd Qu.:2303   3rd Qu.:2474   3rd Qu.:39.70   3rd Qu.:68.80  
 Max.   :13.000   Max.   :2971   Max.   :2929   Max.   :42.30   Max.   :78.30  
     TurnD             PenY             PerR            ORY      
 Min.   :-22.00   Min.   : 576.0   Min.   :43.80   Min.   :1457  
 1st Qu.: -5.75   1st Qu.: 710.5   1st Qu.:54.77   1st Qu.:1848  
 Median :  1.00   Median : 787.5   Median :58.65   Median :2050  
 Mean   :  0.00   Mean   : 789.9   Mean   :58.16   Mean   :2110  
 3rd Qu.:  6.25   3rd Qu.: 869.8   3rd Qu.:61.10   3rd Qu.:2320  
 Max.   : 19.00   Max.   :1037.0   Max.   :67.50   Max.   :2876  
      OPY      
 Min.   :1575  
 1st Qu.:1913  
 Median :2101  
 Mean   :2128  
 3rd Qu.:2328  
 Max.   :2670  
plot(df)

model=lm(Wins~.,data=df)
################## Multicollinearity

###### Corr plot
corrplot::corrplot(cor(df))

###### VIF
car::vif(model)
   RushY    PassY    PuntA      FGP    TurnD     PenY     PerR      ORY 
4.827645 1.420161 2.126597 1.566107 1.924035 1.275979 5.414572 4.535643 
     OPY 
1.423390 
# recalculating via the X'X matrix
X=model.matrix(model)
dim(X)
[1] 28 10
#standardizing
X2=apply(X,2,function(x){(x-mean(x))/sqrt(sum((x-mean(x))^2))}); dim(X2)
[1] 28 10
# the intercept column standardizes to 0/0 = NaN, so restore the column of ones
X2[,1]=X[,1]
X=X2
diag(solve(t(X)%*%X))
(Intercept)       RushY       PassY       PuntA         FGP       TurnD 
 0.03571429  4.82764538  1.42016105  2.12659726  1.56610698  1.92403474 
       PenY        PerR         ORY         OPY 
 1.27597850  5.41457162  4.53564335  1.42338989 
# observe that VIF_j = (1 - R_j^2)^(-1): regress RushY on the other regressors
model=lm(RushY~.,data=df[,-1])
s=summary(model)
(1-s$r.squared)^(-1)
[1] 4.827645
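Since the VIFs are the diagonal of the inverse of the regressor correlation matrix, the identity above can be checked without fitting any regressions. A minimal sketch on simulated data (the variable names are illustrative, not the NFL columns):

```r
# VIF_j is the j-th diagonal entry of the inverse correlation matrix
set.seed(1)
n=50
x1=rnorm(n)
x2=x1+rnorm(n,sd=0.1)   # nearly collinear with x1
x3=rnorm(n)             # unrelated to the others
vifs=diag(solve(cor(cbind(x1,x2,x3))))
vifs   # x1 and x2 have huge VIFs; x3 sits near 1
```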
###### Condition Number / Eigenvalue / Eigenvector
# NB: `model` was overwritten above, so this design matrix comes from the
# RushY regression (an intercept plus the other eight regressors)
X=model.matrix(model)
dim(X)
[1] 28  9
#standardizing
X2=apply(X,2,function(x){(x-mean(x))/sqrt(sum((x-mean(x))^2))}); dim(X2)
[1] 28  9
X2[,1]=X[,1]
X=X2


ev=eigen(t(X)%*%X)
xev=ev$values
condition_number=max(xev)/min(xev)
# rule of thumb: kappa < 100 is fine, 100 to 1000 suggests moderate to strong
# multicollinearity, and kappa > 1000 indicates severe multicollinearity
condition_number
[1] 215.646
sort(xev)
[1]  0.1298424  0.3552922  0.5328487  0.6867363  0.8232325  1.2191453  1.6977743
[8]  2.5551283 28.0000000
#index of minimum eigenvalue
minn=which.min(xev)
ev$vectors[,minn]
[1] -1.675215e-16 -2.109094e-01  2.615778e-01 -4.250523e-02  1.442042e-01
[6] -4.744502e-02 -6.475229e-01 -6.423531e-01  1.741790e-01
rownames(ev$vectors)=names(model$coefficients)
ev$vectors
                     [,1]          [,2]          [,3]          [,4]
(Intercept)  1.000000e+00 -3.434947e-17  5.656001e-17  1.332883e-16
PassY       -1.931673e-17  8.040232e-02  4.548137e-01 -4.590825e-01
PuntA        1.733313e-16  2.847481e-02 -5.282204e-01 -5.050070e-01
FGP          6.643674e-18 -1.498563e-02  6.360469e-01 -4.652414e-02
TurnD        1.358771e-18 -4.436052e-01  9.254740e-02 -4.700157e-01
PenY        -4.747269e-17 -3.638914e-01 -1.751243e-01  1.469074e-01
PerR        -4.843659e-17 -5.401530e-01 -1.437332e-01 -1.847958e-01
ORY         -9.496374e-17  5.364518e-01 -2.217080e-01 -1.153371e-01
OPY          1.353659e-16  2.893988e-01  2.291129e-02 -4.920345e-01
                     [,5]          [,6]          [,7]          [,8]
(Intercept)  1.753269e-17  4.497241e-17 -8.295922e-18  9.898359e-17
PassY        3.414787e-01  5.912719e-01  2.410126e-01 -8.451722e-02
PuntA        2.835269e-01  3.495064e-02 -3.785463e-01 -4.145009e-01
FGP         -1.687798e-01 -2.417375e-01 -6.140270e-01 -3.567854e-01
TurnD       -9.900605e-03 -1.291665e-01 -2.199263e-01  6.984198e-01
PenY        -5.210958e-01  6.822653e-01 -2.680418e-01 -6.487362e-02
PerR        -4.221330e-02 -3.015818e-01  2.274534e-01 -2.994520e-01
ORY         -9.681973e-02  2.538335e-02 -3.566234e-01  3.161428e-01
OPY         -7.012299e-01 -1.302810e-01  3.499387e-01 -1.101525e-01
                     [,9]
(Intercept) -1.675215e-16
PassY       -2.109094e-01
PuntA        2.615778e-01
FGP         -4.250523e-02
TurnD        1.442042e-01
PenY        -4.744502e-02
PerR        -6.475229e-01
ORY         -6.423531e-01
OPY          1.741790e-01
# names(model$coefficients)[abs(round(ev$vectors[,minn],1))>0]
round(ev$vectors[,minn],1)
(Intercept)       PassY       PuntA         FGP       TurnD        PenY 
        0.0        -0.2         0.3         0.0         0.1         0.0 
       PerR         ORY         OPY 
       -0.6        -0.6         0.2 
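The eigenvector attached to a tiny eigenvalue is useful precisely because it names the near-dependence: if \(\lambda\approx 0\) with eigenvector \(v\), then \(\lVert Xv\rVert^2=\lambda\approx 0\), so the columns with large weights in \(v\) nearly cancel. A small simulated sketch (the variables a, b, d are made up for illustration):

```r
set.seed(2)
n=40
a=rnorm(n); b=rnorm(n)
d=a-b+rnorm(n,sd=0.05)   # built-in near-dependence: d is almost a-b
# center and scale the columns to unit length, as in the notes
X=apply(cbind(a,b,d),2,function(x){(x-mean(x))/sqrt(sum((x-mean(x))^2))})
ev=eigen(t(X)%*%X)
v=ev$vectors[,which.min(ev$values)]
round(v,2)               # large weights on a, b and d
max(abs(X%*%v))          # near zero: the weighted columns cancel
```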
# What should we do?

df$RushDiff=(df$RushY-df$ORY)
# df$RushDiffPer=df$RushY/(df$RushY+df$ORY)

# paste(names(df),collapse='+')

model=lm(Wins~PassY+PuntA+FGP+TurnD+PenY+PerR+OPY+RushDiff,data=df)
summary(model)

Call:
lm(formula = Wins ~ PassY + PuntA + FGP + TurnD + PenY + PerR + 
    OPY + RushDiff, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6976 -0.8333  0.0823  0.7845  2.7968 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.5581613 12.7312595  -0.515 0.612409    
PassY        0.0037880  0.0008196   4.622 0.000186 ***
PuntA       -0.0208741  0.2052446  -0.102 0.920057    
FGP          0.0249704  0.0407274   0.613 0.547073    
TurnD       -0.0029187  0.0465199  -0.063 0.950628    
PenY         0.0021519  0.0031747   0.678 0.506053    
PerR         0.1388836  0.1504572   0.923 0.367542    
OPY         -0.0023446  0.0012749  -1.839 0.081587 .  
RushDiff     0.0023289  0.0011199   2.080 0.051347 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.822 on 19 degrees of freedom
Multiple R-squared:  0.8071,    Adjusted R-squared:  0.7258 
F-statistic: 9.935 on 8 and 19 DF,  p-value: 2.305e-05
corrplot::corrplot(cor(df))
car::vif(model)
   PassY    PuntA      FGP    TurnD     PenY     PerR      OPY RushDiff 
1.360875 1.347757 1.514088 1.914974 1.230133 5.347319 1.162295 4.774672 
model=lm(Wins~PassY+PuntA+FGP+TurnD+PenY+OPY+RushDiff,data=df)
summary(model)

Call:
lm(formula = Wins ~ PassY + PuntA + FGP + TurnD + PenY + OPY + 
    RushDiff, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9325 -0.9011 -0.1062  0.9020  3.0495 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.8643720  9.8338943   0.088 0.930833    
PassY        0.0035079  0.0007586   4.624 0.000164 ***
PuntA        0.0143946  0.2009097   0.072 0.943594    
FGP          0.0156562  0.0393115   0.398 0.694659    
TurnD        0.0114784  0.0436650   0.263 0.795336    
PenY         0.0023024  0.0031587   0.729 0.474510    
OPY         -0.0021934  0.0012596  -1.741 0.096996 .  
RushDiff     0.0031495  0.0006787   4.641 0.000158 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.815 on 20 degrees of freedom
Multiple R-squared:  0.7984,    Adjusted R-squared:  0.7279 
F-statistic: 11.32 on 7 and 20 DF,  p-value: 9.485e-06
corrplot::corrplot(cor(df))

car::vif(model)
   PassY    PuntA      FGP    TurnD     PenY      OPY RushDiff 
1.174416 1.301051 1.421150 1.699715 1.226886 1.143100 1.766523 

There are four primary sources of multicollinearity:

  • The data collection method employed: sampling only part of the regressor space can leave variables confounded, especially when the experiment/study was not well designed.
  • Constraints on the model or in the population: Sometimes, only certain combinations of levels of variables can be observed together.
  • Model specification: Sometimes you have two or more variables in your model that are measuring the same thing.
  • An overdefined model: You have too many variables in your model.

Make sure to check for each of these. Sometimes, regression coefficients have the wrong sign. This is likely due to one of the following:

  • The range of some of the regressors is too small, which inflates the variance of \(\hat\beta\).
  • Important regressors have not been included in the model.
  • Multicollinearity is present.
  • Computational errors have been made.
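The wrong-sign phenomenon is easy to reproduce: with two nearly collinear regressors whose true coefficients are both +1, the individual least-squares estimates are so variable that a negative estimate is common. A sketch on simulated data:

```r
set.seed(4)
wrong_sign=replicate(200,{
  x1=rnorm(30)
  x2=x1+rnorm(30,sd=0.05)   # nearly collinear pair
  y=x1+x2+rnorm(30)         # both true coefficients are +1
  coef(lm(y~x1+x2))["x1"]<0
})
mean(wrong_sign)            # a sizeable fraction of fits flip the sign
```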

Multicollinearity can be cured with:

  1. More data (often not possible).
  2. Model re-specification: can you include a function of the variables that preserves the information but is not linearly dependent? Can you remove a variable?
  3. A modified version of regression: one of lasso, ridge or elastic net regression.
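Of the modified regressions, ridge is the quickest to try, since the MASS package ships with R: it shrinks \(\hat\beta\) toward zero by minimizing \(\lVert Y-X\beta\rVert^2+\lambda\lVert\beta\rVert^2\). A sketch on simulated collinear data (the lambda grid here is an arbitrary choice, not a recommendation):

```r
library(MASS)
set.seed(3)
n=60
x1=rnorm(n)
x2=x1+rnorm(n,sd=0.1)          # nearly collinear pair
y=1+x1+x2+rnorm(n)
fit=lm.ridge(y~x1+x2,lambda=seq(0,10,by=0.5))
# the coefficient paths stabilize as lambda grows; pick lambda by GCV
fit$lambda[which.min(fit$GCV)]
```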

8.3 Homework questions

Complete the Chapter 9 textbook questions.

Exercise 8.1 Check for multicollinearity in all of our past examples.

Exercise 8.2 Summarize the three multicollinearity diagnostics (the correlation matrix, the VIFs, and the condition number/eigensystem).