Test for Heteroscedasticity, Multicollinearity and Autocorrelation

Anushka Agrawal
Published in Nerd For Tech
4 min read · May 7, 2021


In earlier articles, we understood why it is important to check a regression model for three behaviors: homoscedasticity, multicollinearity and autocorrelation.

We will go over the concept and R application of the most widely used tests in multiple regression modeling. The list is as follows:

  1. Breusch-Pagan test
  2. VIF (Variance Inflation Factor)
  3. Runs test
  4. Box-Cox transformation to address heteroscedasticity.

Breusch-Pagan test

Let's understand this by applying the test for heteroscedasticity to the model built in the earlier article. We set the following hypotheses for the test:
H0: The variance of the errors is constant, i.e., the model is homoscedastic.
H1: The variance of the errors is not constant, i.e., the model is heteroscedastic.

This test is available as the bptest() function in the lmtest library of R programming.

R code:
> library(lmtest)
> model <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = data)
> #implementing the test
> bptest(model)

Snapshot of the Output screen for the Test, Image by author

Here, we can observe that the p-value for the Breusch-Pagan test is < 2.2e-16, which is less than alpha = 0.05. Hence, we reject H0 at the 5% level of significance and conclude that the model, in fact, has heteroscedasticity. We would therefore have to apply some transformation to restore homoscedasticity; one such transformation is the Box-Cox transformation.
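To see what the test is actually doing, here is a hedged sketch of the studentized (default) version of the Breusch-Pagan statistic that bptest() computes, using the same variable names as the model above: regress the squared residuals on the predictors, and the statistic is n times the R-squared of that auxiliary regression.

```r
# Sketch of the studentized Breusch-Pagan statistic (assumed to mirror
# lmtest::bptest() with its default settings):
res2 <- residuals(model)^2                      # squared residuals
aux  <- lm(res2 ~ x1 + x2 + x3 + x4 + x5, data = data)
LM   <- nrow(data) * summary(aux)$r.squared     # LM statistic = n * R^2
pchisq(LM, df = 5, lower.tail = FALSE)          # p-value, chi-square, 5 df
```

A large R-squared in the auxiliary regression means the error variance depends on the predictors, which is exactly what heteroscedasticity is.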

VIF

In order to get an idea of the multicollinearity in the model, we use the vif function available in the faraway library of R programming. The VIF of a predictor j is 1 / (1 − R_j²), where R_j² comes from regressing that predictor on all the others. As a common rule of thumb, a VIF below 10 is considered acceptable; any VIF around or above 10 sets the alarm on for multicollinearity.

R code:
> library(faraway)
> vif(model)

Output snapshot of the vif values for each variable, Image by author

Here, we see that every variable's VIF is well below 10. Therefore, we can safely conclude that the model doesn't suffer from serious multicollinearity.
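The rule-of-thumb formula above can be checked by hand. This is a hedged sketch (faraway::vif does this for every column at once) that computes the VIF of x1 from first principles:

```r
# VIF of x1, computed manually as 1 / (1 - R^2) of the auxiliary
# regression of x1 on the remaining predictors:
aux_r2 <- summary(lm(x1 ~ x2 + x3 + x4 + x5, data = data))$r.squared
vif_x1 <- 1 / (1 - aux_r2)
vif_x1
```

If x1 is nearly a linear combination of the other predictors, aux_r2 approaches 1 and the VIF blows up, which is why large values signal multicollinearity.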

Runs test

This test is conducted to check whether the observations of a given variable contain some pattern or are completely random in nature, i.e., whether successive observations are related or follow a trend. Here, we require the snpar package to be installed.

Our hypotheses become:
H0: The observations of the variable are random in nature.
H1: The observations of the variable are not random, i.e., they contain a pattern.

R code:
> library(snpar)
> runs.test(x1) # bedrooms
> runs.test(x2) # bathrooms
> runs.test(x3) # sqft. living area
> runs.test(x4) # floors
> runs.test(x5) # grades

Output Screen for the code

Here, we reject H0 of the runs test for the variables X2 (number of bathrooms), X3 (square feet area) and X5 (grades) at the 5% level of significance. Hence, we conclude that these variables have some pattern to them and are not random in nature. For the other two variables we can safely assume randomness, i.e., their observations are not related and do not follow a trend. This helps us check for the problem of autocorrelation in the dataset.
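For intuition, the classical median-based runs test can be sketched by hand. This is a hedged illustration of the idea (snpar::runs.test may use an exact distribution rather than the normal approximation shown here); the variable name x1 is the bedrooms column from the article:

```r
# Count runs of observations above/below the median, then compare the
# run count to its expectation under randomness (normal approximation):
signs <- x1 > median(x1)               # TRUE/FALSE relative to the median
runs  <- 1 + sum(diff(signs) != 0)     # a new run starts at each sign change
n1 <- sum(signs); n2 <- sum(!signs)
mu    <- 1 + 2 * n1 * n2 / (n1 + n2)   # expected runs under H0
sigma <- sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
              ((n1 + n2)^2 * (n1 + n2 - 1)))
z <- (runs - mu) / sigma
2 * pnorm(-abs(z))                     # two-sided p-value
```

Too few runs means long stretches above or below the median (a trend); too many means rapid alternation. Either extreme rejects randomness.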

Box-Cox Transformation for addressing Heteroscedasticity

Here, we make use of the caret library. Throughout the testing, we will also require the gvlma and ggplot2 libraries installed in R programming. We install these packages and use them to conduct the transformation.

The Box-Cox transformation transforms the dependent variable (Y), i.e., here, the price of the flat, such that the heteroscedasticity is minimized.
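Concretely, the transformation is the one-parameter power family y(λ) = (y^λ − 1) / λ, which reduces to log(y) as λ → 0; BoxCoxTrans() estimates the λ that best stabilizes the variance. A hedged sketch of applying it manually (the function name boxcox_y is our own, for illustration):

```r
# The Box-Cox family, applied manually for a given lambda
# (lambda near 0 reduces to the log transform):
boxcox_y <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
```

For example, boxcox_y(price, 0.5) is a shifted and scaled square-root transform, a common variance-stabilizing choice for right-skewed prices.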

R code:
#Box-Cox transformation
install.packages("caret")
#conducting the transformation
library(caret)
transf <- BoxCoxTrans(y)
#building a new model by adding the transformed variable to the dataframe
data2 <- cbind(data, transf = predict(transf, y))
head(data2)
#storing the new y variable
ynew <- data2$transf
model3 <- lm(ynew ~ x1 + x2 + x3 + x4 + x5)
#the gvlma summary to check for heteroscedasticity
library(gvlma)
gvlma(model3)

Output image for the Transformation, Image by author
Output screen of the summary, Image by author

Here, we can see that the model still fails the linearity assumptions. This means there is still heteroscedasticity due to the variables in the model: the residuals have non-constant variance, which contradicts the assumption on which linear modeling is based. Hence, even though the other statistics might look good (such as a significant test of regression and significant coefficient estimates), the model is less reliable and has lower predictive capacity.

Limitations of the Study

This analysis was conducted on a dataset available online and hence might not have been ideal for fitting a multiple regression model. However, the purpose of this article is to help everyone understand the statistics behind the optimization of a multiple regression model.
