Simple Linear Regression Modeling-Part 1

Anushka Agrawal
Published in Nerd For Tech · May 5, 2021


Regression analysis is one of the most widely recognized and useful tools in statistics. It is one of the most efficient ways to understand the relationship between variables while making logical predictions for the future.

Let's understand simple linear regression with an example and R code. But first, we should know why it is referred to as "simple" linear regression. This is because we study the relationship between only two variables, one dependent and one independent (explained clearly in the example below).

Example
Consider the example given below:
Suppose we wish to study the impact of the percentage of hardwood pulp in paper on its tensile strength.

In order to begin the data analysis, we first have to establish the role of each variable we are dealing with. We ask which variable affects the other while remaining unaffected itself; this variable is called the independent variable. In this example, because we are studying the impact of the percentage of hardwood pulp in the paper on tensile strength, the percentage of hardwood pulp is our independent variable (or regressor), generally denoted X. The variable that is affected by the independent variable is called the dependent variable (or regressand), generally denoted Y.

Moving forward, we run the following R commands to fit a model of Y on X. Our dependent variable (Y, here the tensile strength of the paper) changes with variation in the independent variable (X, here the percentage of hardwood pulp in the paper), so expressing Y in terms of X helps us visualize the exact relationship between the two variables.

R code for linear regression:

# independent variable
X <- c(10, 15, 15, 20, 20, 20, 25, 25, 28, 30)
# dependent variable
Y <- c(50, 75, 102, 112, 115, 110, 125, 120, 137, 150)

# fitting a linear regression model
model <- lm(Y ~ X)
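As a cross-check on what lm() computes, the least-squares estimates can be reproduced by hand using the standard formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. Below is a minimal Python sketch of that calculation on the article's data (Python is used here purely for illustration):

```python
# Hardwood-pulp data from the article
X = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
Y = [50, 75, 102, 112, 115, 110, 125, 120, 137, 150]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# corrected sum of squares of X and cross-products of X and Y
Sxx = sum((x - x_bar) ** 2 for x in X)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b1 = Sxy / Sxx           # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

print(round(b0, 2), round(b1, 2))  # 18.91 4.36
```

These match the coefficients that summary(model) reports in R.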

Now, the simple linear regression model is of the form:

Yi = β0 + β1·Xi + εi,  i = 1, 2, …, n

This is the general form of the linear regression model. Here, β0 is the intercept, β1 is the slope, Xi are the values of the independent variable, and εi is the random error term. The population parameters of the model (the intercept β0 and the slope β1) are practically impossible to find, hence we replace them with their unbiased estimates b0 and b1 to form the fitted model:

Ŷi = b0 + b1·Xi

Once we have those estimates, we can make predictions for various values of X (provided the regression is statistically significant and logical).

According to the output snapshot shown below, the intercept estimate for the model is 18.91 and the slope estimate is 4.36. So, the fitted equation for our linear regression is Y = 18.91 + 4.36·X.
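Plugging a value of X into this fitted equation gives a predicted tensile strength. A quick Python check of the arithmetic (the value X = 22 is an illustrative choice, not from the article):

```python
def predict(x):
    """Predicted tensile strength from the fitted equation Y = 18.91 + 4.36*X."""
    return 18.91 + 4.36 * x

print(round(predict(22), 2))  # 18.91 + 4.36*22 = 114.83
```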

Other Important Statistics

On running the R code given below, we obtain some very important statistics that help us better understand the model's usefulness.

R code:

model
summary(model)

Output snapshot for the above R commands

Things to Observe:

1. The significance of the regression model
Like any other estimation, we always have to test for significance, to see whether the observed relationship holds for a logical reason or only due to chance. Here, we are concerned with the p-value corresponding to the F-test in the last line of the output. Just as a t-test is used to check the significance of a mean, an F-test is conducted to test the significance of the regression.

Hypothesis:
H0: The regression is insignificant (β1 = 0).
H1: The regression is significant (β1 ≠ 0).

The p-value for the F-test of the regression is 3.086e-05 < 0.05 (alpha). Hence, we reject H0 and conclude that the regression is significant.
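The F statistic behind that p-value can be reconstructed from the sums of squares: F = MSR/MSE with 1 and n − 2 degrees of freedom. A Python sketch of that decomposition for this data set (the exact p-value quoted above comes from R's summary output):

```python
X = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
Y = [50, 75, 102, 112, 115, 110, 125, 120, 137, 150]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

Sxx = sum((x - x_bar) ** 2 for x in X)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
Syy = sum((y - y_bar) ** 2 for y in Y)

SSR = Sxy ** 2 / Sxx   # regression sum of squares
SSE = Syy - SSR        # residual (error) sum of squares
MSR = SSR / 1          # 1 numerator degree of freedom
MSE = SSE / (n - 2)    # n - 2 denominator degrees of freedom

F = MSR / MSE
print(round(F, 1))  # 70.4
```

This should match the "F-statistic" line of R's summary output for these data.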

2. The significance of the estimates of the population parameters
Obtaining the estimates is not enough; if they are not significant, the model is rendered useless. The p-values for the t-tests of the significance of these parameters are given in tabular form under Coefficients.

Hypothesis:
H0: The parameter is insignificant (equal to zero).
H1: The parameter is significant (not equal to zero).

We see that the p-value for the t-test of the intercept parameter is 0.131 > 0.05 (alpha). Here, we fail to reject H0 and conclude that the estimate of the intercept parameter is insignificant and can be excluded.

Also, the p-value for the t-test of the slope parameter is 3.09e-05 < 0.05 (alpha). Here, we reject H0 and conclude that the estimate of the slope parameter is significant.
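The t statistics in the Coefficients table are each estimate divided by its standard error, where se(b1) = sqrt(MSE/Sxx) and se(b0) = sqrt(MSE·(1/n + x̄²/Sxx)). A Python sketch of those calculations (the p-values quoted above are taken from R's output):

```python
import math

X = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
Y = [50, 75, 102, 112, 115, 110, 125, 120, 137, 150]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

Sxx = sum((x - x_bar) ** 2 for x in X)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
Syy = sum((y - y_bar) ** 2 for y in Y)

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
MSE = (Syy - Sxy ** 2 / Sxx) / (n - 2)  # residual mean square

se_b1 = math.sqrt(MSE / Sxx)
se_b0 = math.sqrt(MSE * (1 / n + x_bar ** 2 / Sxx))

t_slope = b1 / se_b1      # large t -> small p-value (significant)
t_intercept = b0 / se_b0  # small t -> large p-value (insignificant)
print(round(t_intercept, 2), round(t_slope, 2))  # 1.68 8.39
```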

Therefore, dropping the insignificant intercept, our revised model becomes:

Y = 4.36·X

We can make predictions using this equation: for various percentages of hardwood pulp in the paper (X), we can obtain an estimate of the tensile strength of the paper (Y).
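One caveat worth noting (this point is not from the article): if the intercept is dropped and the model is actually refitted through the origin — lm(Y ~ X - 1) in R — the slope estimate changes, because it becomes ΣXY/ΣX² rather than Sxy/Sxx. A Python sketch of that refit:

```python
X = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
Y = [50, 75, 102, 112, 115, 110, 125, 120, 137, 150]

# least-squares slope for a no-intercept (through-the-origin) model
b1_origin = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
print(round(b1_origin, 2))  # 5.2
```

So reusing the slope 4.36 after excluding the intercept is an approximation; a full refit without the intercept gives a noticeably larger slope.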

Further topic of Discussion

In the next article, we will cover the various other factors that help us understand the usefulness and predictive capacity of the fitted model.
