Linear Regression Model: Part 2
In my previous article, we went over fitting the model to the given data on the percentage of hardwood pulp in the paper (X) and the tensile strength of the paper (Y).
Once we have established the significance of the regression model and obtained the estimates of its population parameters, we have to analyze the trends in the residuals to verify the assumptions of the linear regression model. We run the following R code to obtain a series of insightful plots, as discussed below.
Note: The assumptions of the linear regression model can never be met perfectly. What we examine is whether they are met well enough for us to rely on this analysis and continue with the regression model.
In simple linear regression we have only two variables, so we can always plot the observed values and check for linearity directly. In multiple linear regression, however, we have several regressors, and that scatter plot becomes harder to obtain and analyze. In that case, these residual plots play a vital role.
R code for obtaining the Diagnostic Model plots:
*model refers to the variable that stores the linear regression model built in R (from the previous article)
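A minimal sketch of how these four plots are usually produced in base R. The pulp data from the previous article is not reproduced here, so the built-in cars dataset stands in for it and model is refit on that stand-in; both are assumptions made purely for illustration.

```r
# Stand-in data: the built-in `cars` dataset (an assumption for illustration;
# replace it with the hardwood pulp data and the model from the previous article).
model <- lm(dist ~ speed, data = cars)

# Arrange the four diagnostic plots in a 2 x 2 grid and draw them.
old_par <- par(mfrow = c(2, 2))
plot(model)   # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(old_par)  # restore the previous plotting layout
```

Calling plot() on an lm object draws these four plots by default; the which argument (for example, plot(model, which = 1)) selects a single one.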
Checking the Linearity Assumption
According to the theory, the plot of the Residuals v/s Fitted Values for a good model should look random. As seen from the plot above, the points are scattered randomly and do not show any trend, such as an upward-sloping line, a downward-sloping line or a cyclic pattern. Also, with reference to the vertical axis, the residuals should be concentrated around their mean value, i.e., 0 (stemming from the assumption that the errors have mean zero and constant variance, Errors ~ N(0, σ²)).
Ideally, the red line should be fairly flat and lie along the horizontal line at residual = 0, but that is not quite the case here.
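The Residuals v/s Fitted plot can also be rebuilt by hand, which makes it clearer what the red line represents. This is only a sketch: the built-in cars dataset is an assumed stand-in for the pulp data, and the lowess smoother is used to approximate the trend line that plot(model, which = 1) draws.

```r
# Stand-in model on the built-in `cars` data (assumption for illustration).
model <- lm(dist ~ speed, data = cars)

res <- resid(model)    # observed minus fitted values
fit <- fitted(model)   # predicted values

# With an intercept in the model, least-squares residuals average to zero
# (up to floating-point error), so the plot is centred on the zero line.
mean_res <- mean(res)

plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)                # reference line at residual = 0
lines(lowess(fit, res), col = "red")  # smoothed trend of the residuals
```

If the linearity assumption is reasonable, the red smoothed line stays close to the dashed zero line across the whole range of fitted values.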
Normal QQ plot
Since one of the assumptions of the linear regression model is the normality of the residuals, we use the Normal QQ plot to check it. It plots the Standardized residuals of the model against the Theoretical Quantiles of the normal distribution. If the observed points fall close to the straight 45-degree line, we can reasonably conclude that the residuals are approximately normal (as also discussed in the article Testing for Normality).
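As a rough sketch, the same QQ plot can be drawn directly from the standardized residuals. The cars dataset is again an assumed stand-in for the pulp data, and the Shapiro-Wilk test is added here only as an optional numeric companion to the visual check, not as part of the original analysis.

```r
# Stand-in model on the built-in `cars` data (assumption for illustration).
model <- lm(dist ~ speed, data = cars)

std_res <- rstandard(model)   # standardized residuals

# Sample quantiles of the residuals against theoretical normal quantiles.
qqnorm(std_res, ylab = "Standardized residuals")
qqline(std_res, col = "red")  # reference line through the first and third quartiles

# Optional numeric check: a small p-value suggests departure from normality.
sw <- shapiro.test(std_res)
sw$p.value
```

Points that bend away from the reference line at the ends indicate heavier or lighter tails than a normal distribution would give.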
Non-Linearities and Non-Constant Variance
The last two of the four diagnostic plots help us check for non-linearities and non-constant variance in the model. In Plot 3, i.e., the Scale-Location Plot, we should ideally observe a roughly horizontal red line with the points spread evenly around it; a clear upward or downward trend suggests non-constant variance. Both plots use the Standardized residuals. Practically, we have three types of residuals in a model: Regular Residuals, Standardized Residuals and Studentized Residuals.
Plot 4 is the Residuals vs Leverage plot. Leverage is a measure of how far an observation's predictor value lies from the central tendency (the mean) of the predictor values.
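The three types of residuals can be extracted with base R functions, as sketched below. The cars dataset is an assumed stand-in for the pulp data, and the manual formula is included only to show how the standardized residuals relate to the regular ones.

```r
# Stand-in model on the built-in `cars` data (assumption for illustration).
model <- lm(dist ~ speed, data = cars)

raw_res  <- resid(model)      # regular residuals: observed minus fitted
std_res  <- rstandard(model)  # standardized residuals
stud_res <- rstudent(model)   # studentized residuals (each point left out in turn)

# Standardized residuals rescale each regular residual by its estimated
# standard deviation, s * sqrt(1 - h), where h is the leverage of the point.
s <- summary(model)$sigma
h <- hatvalues(model)
manual_std <- raw_res / (s * sqrt(1 - h))

# Scale-Location plot: square root of |standardized residuals| against fitted values.
plot(fitted(model), sqrt(abs(std_res)),
     xlab = "Fitted values", ylab = "sqrt(|Standardized residuals|)",
     main = "Scale-Location")
lines(lowess(fitted(model), sqrt(abs(std_res))), col = "red")
```

A red line that stays roughly level is consistent with constant variance; a steady rise or fall is a hint that the spread of the residuals changes with the fitted values.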
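For simple linear regression, leverage has a closed form that makes this "distance from the centre" idea concrete: h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2). The sketch below checks that formula against R's hatvalues(); the cars dataset is an assumed stand-in for the pulp data.

```r
# Stand-in model on the built-in `cars` data (assumption for illustration).
model <- lm(dist ~ speed, data = cars)

h <- hatvalues(model)   # leverage of each observation

# Closed-form leverage for simple linear regression with an intercept.
x <- cars$speed
n <- length(x)
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Leverages add up to the number of estimated coefficients (here 2).
total_leverage <- sum(h)

plot(model, which = 5)  # Residuals vs Leverage plot
```

Observations whose predictor value lies far from the mean get a larger h_i, which is why they sit further to the right in Plot 4.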
Limitations of the Model Fit
You might notice some imperfections in the model fit, due to the lack of a sufficiently large dataset. With a larger dataset that covers the relevant situations more exhaustively, we can obtain much better estimates and a much better fit of the model. More data also makes the diagnostic plots easier to read, so we can judge more reliably whether the assumptions of linear regression hold reasonably well.
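The effect of sample size on the estimates can be illustrated with a small simulation. Everything in this sketch is hypothetical: the helper slope_se, the true line y = 2 + 3x and the noise level are made up for illustration and have nothing to do with the pulp data.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical helper: simulate y = 2 + 3x + noise for n points,
# fit a simple linear regression and return the slope's standard error.
slope_se <- function(n) {
  x <- runif(n, 0, 10)
  y <- 2 + 3 * x + rnorm(n, sd = 2)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

se_small <- slope_se(20)    # small sample
se_large <- slope_se(2000)  # large sample

c(small = se_small, large = se_large)
```

The standard error of the slope shrinks as the sample grows, which is one way in which a larger dataset gives better estimates. It does not by itself repair a violated assumption, so the diagnostic plots remain worth checking.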