Testing for Normality
In my previous article, we went over hypothesis testing for a single variable. The same ideas extend to tests involving two variables, such as testing the significance of a difference in means or a difference in population variances.
We now move on to testing the normality of a sample (data). But first, what does normality mean? In the context of hypothesis testing, testing for normality means testing whether the sample appears to originate from a normal distribution, i.e., whether the whole sample can be modeled by some normal distribution.
Data Visualizations for Normality
If we would like to see visually whether the sample appears to follow normality, or to form an expectation before running a formal test, we can use two visualization techniques: the PP plot and the QQ plot.
The PP plot (probability-probability plot or percent-percent plot) plots two cumulative distribution functions against each other (for example, the empirical CDF of the sample against the CDF of a normal distribution) to see how closely they agree.
R code: There is no built-in function for the PP plot in base R; nevertheless, we can code it from scratch.
The reference link for the PP plot is shared at the end of this article.
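As a rough illustration of coding a PP plot from scratch, here is a minimal sketch. It is not the code from the referenced link. The seed, the uniform sample, the variable names, and the choice to estimate the normal mean and standard deviation from the data are all assumptions made for this example.

```r
# Minimal PP plot sketch (all names and parameters here are illustrative assumptions)
set.seed(1)                      # assumed seed, only for reproducibility
x <- runif(50)                   # assumed example data: 50 draws from Uniform(0, 1)

x_sorted <- sort(x)
n <- length(x_sorted)

# Empirical cumulative probabilities at the sorted observations
empirical_p <- ppoints(n)

# Cumulative probabilities under a normal distribution whose mean and
# standard deviation are estimated from the sample (an assumption)
theoretical_p <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

plot(theoretical_p, empirical_p,
     xlab = "Theoretical cumulative probability (normal)",
     ylab = "Empirical cumulative probability",
     main = "PP plot")
abline(0, 1, col = "red")        # points near this line suggest agreement
```

The closer the points lie to the red 45-degree line, the better the two cumulative distribution functions agree.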
The QQ plot plots the quantiles of two samples against each other (by default, the sample against a normal distribution) to check whether they line up with each other. Given below is an example of the QQ plot for our sample.
R code:
qqnorm(sample)
qqline(sample, col="red")
We can see that the observed quantiles roughly follow the straight line, so the sample appears to satisfy normality. A plot alone is not conclusive, however, so to check whether the sample originates from a normal distribution, we perform the tests of normality discussed below.
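To make the idea of "plotting quantiles against each other" concrete, the following sketch rebuilds the normal QQ plot by hand and compares it with what qqnorm() computes. The data, the seed, and the variable names are assumptions for illustration only.

```r
# Rebuilding a normal QQ plot by hand (data and names are illustrative assumptions)
set.seed(1)                          # assumed seed
x <- runif(50)                       # assumed example data

n <- length(x)
observed_q    <- sort(x)             # sample quantiles
theoretical_q <- qnorm(ppoints(n))   # standard normal quantiles at the same probabilities

plot(theoretical_q, observed_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "QQ plot built by hand")
qqline(x, col = "red")               # same reference line that accompanies qqnorm()

# qqnorm() returns the coordinates it would plot; plot.it = FALSE suppresses drawing
qq <- qqnorm(x, plot.it = FALSE)
same_quantiles <- isTRUE(all.equal(sort(qq$x), theoretical_q))
```

If same_quantiles is TRUE, the hand-built theoretical quantiles match the ones qqnorm() uses.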
Tests of Normality
Let's explore one well-known test for normality, the Kolmogorov-Smirnov (KS) test, along with its application in R.
The Kolmogorov-Smirnov (KS) test measures the significance of the difference between the cumulative frequencies of the empirical observations and those expected under the assumed distribution. We could use SPSS, Python or R to conduct this test. Let us take the following example to test for normality:
Example: Let's test the normality of a randomly generated sample of 50 values from a uniform distribution. Here, we first draw random samples from a uniform and a normal distribution and then apply the KS test to them.
Solution: Setting up the hypotheses for this test:
H0: The sample follows a normal distribution, i.e., the sample seems to originate from a normal distribution.
H1: The sample does not follow a normal distribution, i.e., the sample doesn't seem to originate from a normal distribution.
R code for KS test:
ks.test(sample, pop)
Here, sample contains the sample from the uniform distribution
and pop contains the random sample from the normal distribution. Because two samples are supplied, ks.test() performs the two-sample form of the KS test.
Here we can see that the p-value for the test of normality is 4.808e-06 < 0.05 (alpha). Hence, we reject H0 at the 5% level of significance and conclude that the sample does not appear to originate from a normal distribution.
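For readers who want to run a comparison of this kind end to end, here is a minimal sketch. The seed, the Uniform(0, 1) and standard normal parameters, and the size of pop are assumptions, since the article does not state them, so the p-value it produces will generally differ from the one reported above.

```r
# End-to-end sketch of the KS comparison (seed, parameters, and sizes are assumptions)
set.seed(42)           # assumed seed, only for reproducibility

sample <- runif(50)    # 50 values from Uniform(0, 1); parameters assumed
pop    <- rnorm(50)    # values from a standard normal; size and parameters assumed

# Two-sample KS test: compares the empirical CDFs of sample and pop
result <- ks.test(sample, pop)
print(result)

alpha <- 0.05
reject_h0 <- result$p.value < alpha   # TRUE means H0 is rejected at the 5% level
```

Naming a vector sample shadows the name of the base function sample(); this follows the article's naming and does not stop R from finding the function.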
KS Plot
We can also obtain the KS plot using the following set of commands for the random values generated:
R code for KS plot:
plot(ecdf(sample), col="red")
plot(ecdf(pop), col="blue", add=TRUE)
Here, the ecdf() function computes the empirical cumulative distribution function of each sample.
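The KS plot ties directly to the test: the two-sample KS statistic D is the largest vertical gap between the two empirical CDFs. The sketch below computes that gap by hand and marks it on the plot. The data, the seed, and the helper names (F_sample, F_pop, grid) are assumptions for illustration.

```r
# KS statistic as the largest gap between two ECDFs (data and names are assumptions)
set.seed(42)                       # assumed seed
sample <- runif(50)                # assumed example data, as in the article's setup
pop    <- rnorm(50)

F_sample <- ecdf(sample)           # ecdf() returns a function
F_pop    <- ecdf(pop)

# The largest gap between two step functions occurs at an observed data point,
# so it is enough to evaluate both ECDFs on the pooled observations
grid <- sort(c(sample, pop))
gaps <- abs(F_sample(grid) - F_pop(grid))
D <- max(gaps)
x_at_max <- grid[which.max(gaps)]

plot(F_sample, col = "red", xlim = range(grid), main = "KS plot")
plot(F_pop, col = "blue", add = TRUE)
abline(v = x_at_max, lty = 2)      # location of the largest gap

# Statistic reported by ks.test(), for comparison with the hand-computed D
D_from_test <- unname(ks.test(sample, pop)$statistic)
```

The hand-computed D should agree with D_from_test up to floating-point error.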
Reference link for the R-code for PP plot:
https://www.r-bloggers.com/2009/12/r-tutorial-series-graphic-analysis-of-regression-assumptions/
Further topic of Discussion
I hope this article helped you explore some frequently used techniques for testing normality. Follow me to stay updated on various other fascinating statistical concepts.