Data Screening (EDA) Interpretation Guide

Measures of Homogeneity of Variance

The Levene test is a homogeneity-of-variance test that is less dependent on the assumption of normality than most tests and thus is particularly useful with analysis of variance. It is obtained by computing, for each case, the absolute difference from its cell mean and performing a one-way analysis of variance on these differences. If the Levene test statistic is significant, the group variances are not homogeneous, and we may need to consider transforming the original data or using a non-parametric statistic.
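The computation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for a statistics package; the function name `levene_statistic` and the sample data are hypothetical. It returns the test statistic and its degrees of freedom only; judging significance still requires comparing the statistic against the F distribution.

```python
from statistics import mean


def levene_statistic(groups):
    """Return (W, df_between, df_within) for a mean-centered Levene test."""
    # Step 1: absolute difference of each case from its own group (cell) mean.
    deviations = []
    for group in groups:
        centre = mean(group)
        deviations.append([abs(x - centre) for x in group])

    # Step 2: one-way analysis of variance on those absolute differences.
    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = mean([x for d in deviations for x in d])
    ss_between = sum(len(d) * (mean(d) - grand_mean) ** 2 for d in deviations)
    ss_within = sum((x - mean(d)) ** 2 for d in deviations for x in d)
    df_between, df_within = k - 1, n_total - k
    w = (ss_between / df_between) / (ss_within / df_within)
    return w, df_between, df_within


# Hypothetical data: two groups with visibly different spread.
group_a = [10, 12, 11, 13, 14]
group_b = [5, 15, 8, 20, 2]
w, df1, df2 = levene_statistic([group_a, group_b])
print(f"Levene W = {w:.3f} on ({df1}, {df2}) degrees of freedom")
```

A large W relative to the F distribution with those degrees of freedom points toward unequal variances.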

A variance ratio can be obtained for two groups by dividing the larger group variance by the smaller group variance. Concern arises if the resulting ratio is about 4 to 5 or more, which indicates that the largest variance is at least 4 to 5 times the smallest variance.
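As a quick illustration, the ratio can be computed as follows. The function name `variance_ratio` and the data are hypothetical, and the sketch assumes the usual sample variance (n - 1 denominator).

```python
from statistics import variance


def variance_ratio(group_1, group_2):
    """Return the larger sample variance divided by the smaller one."""
    var_1, var_2 = variance(group_1), variance(group_2)
    return max(var_1, var_2) / min(var_1, var_2)


# Hypothetical data for two groups.
group_a = [10, 12, 11, 13, 14]
group_b = [5, 15, 8, 20, 2]
ratio = variance_ratio(group_a, group_b)
print(f"Variance ratio = {ratio:.1f}")
```

The ratio is always 1 or greater because the larger variance goes in the numerator, so it can be compared directly with the 4 to 5 guideline.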

Also, you can eyeball the similarity in the heights of the boxes (each box spans the middle 50% of cases) in a comparative graph of the groups’ box plots. Additionally, you can check whether the group standard deviations are similar (close in value).
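The same visual checks can be backed with numbers. The sketch below is only an illustration: `spread_summary` is a hypothetical helper, and the quartiles come from Python's default method in `statistics.quantiles`, which may differ slightly from the quartile definition your plotting software uses for its boxes.

```python
from statistics import quantiles, stdev


def spread_summary(values):
    """Return the box height (interquartile range) and standard deviation."""
    q1, _median, q3 = quantiles(values, n=4)
    return {"iqr": q3 - q1, "sd": stdev(values)}


# Hypothetical data for two groups.
groups = {
    "A": [10, 12, 11, 13, 14],
    "B": [5, 15, 8, 20, 2],
}
for name, values in groups.items():
    summary = spread_summary(values)
    print(f"Group {name}: IQR = {summary['iqr']:.2f}, SD = {summary['sd']:.2f}")
```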

Measures of Normality

In a normal probability plot, each observed value is paired with its expected value from the normal distribution. The expected value from the normal distribution is based on the number of cases in the sample and the rank order of the case in the sample. If the sample is from a normal distribution, we expect that the points will fall more or less on a straight line.
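The pairing of observed and expected values can be sketched as follows. This is an assumption-laden illustration: the expected values here use Blom's plotting position, (rank - 0.375) / (n + 0.25), which is one common convention; statistical packages may use a different formula, and ties are not given averaged ranks. The function name `normal_plot_pairs` is hypothetical.

```python
from statistics import NormalDist


def normal_plot_pairs(values):
    """Pair each observed value with its expected standard normal score."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for rank, observed in enumerate(sorted(values), start=1):
        # Assumed plotting position (Blom); depends only on rank and n.
        proportion = (rank - 0.375) / (n + 0.25)
        pairs.append((observed, std_normal.inv_cdf(proportion)))
    return pairs


# Hypothetical sample.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4]
for observed, expected in normal_plot_pairs(sample):
    print(f"observed = {observed:4.1f}   expected normal score = {expected:6.3f}")
```

Plotting the observed values against the expected scores gives the normal probability plot; roughly straight-line points are consistent with normality.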

A detrended normal plot shows the actual deviations of the points from a straight line. If the sample is from a normal population, the points should cluster around a horizontal line through 0, and there should be no pattern. A striking pattern suggests departure from normality.
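One way to compute such deviations is sketched below. It assumes the deviation is the standardized observed value minus its expected normal score, with expected scores taken from Blom's plotting position; your software may define both differently. The function name `detrended_deviations` is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def detrended_deviations(values):
    """Return (observed, deviation) pairs for a detrended normal plot."""
    n = len(values)
    centre, spread = mean(values), stdev(values)
    std_normal = NormalDist()
    result = []
    for rank, observed in enumerate(sorted(values), start=1):
        # Expected normal score from an assumed plotting position (Blom).
        expected = std_normal.inv_cdf((rank - 0.375) / (n + 0.25))
        # Standardize the observed value so both are on the same scale.
        z_score = (observed - centre) / spread
        result.append((observed, z_score - expected))
    return result


# Hypothetical sample.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4]
for observed, deviation in detrended_deviations(sample):
    print(f"observed = {observed:4.1f}   deviation = {deviation:+.3f}")
```

Deviations scattered around 0 with no trend or curve are what a normal sample would be expected to show.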

The Shapiro-Wilk test and the Lilliefors test are statistical tests of the hypothesis that the data are from a normal distribution. If either test is significant, the hypothesis of normality is rejected. It is important to remember that whenever the sample size is large, almost any goodness-of-fit test will result in rejection of the null hypothesis, since it is almost impossible to find data that are exactly normally distributed. For most statistical tests, it is sufficient that the data are approximately normally distributed.
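To make the Lilliefors idea concrete, the sketch below computes only its test statistic: the largest gap between the sample's cumulative proportions and a normal curve fitted with the sample mean and standard deviation. It is an illustration under those assumptions, and `lilliefors_statistic` is a hypothetical name. Deciding significance requires Lilliefors critical values or a statistics package, which this sketch does not provide.

```python
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(values):
    """Return the largest distance between the empirical and fitted normal CDFs."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    distance = 0.0
    for i, x in enumerate(sorted(values), start=1):
        cdf = fitted.cdf(x)
        # Check the gap just after and just before the step at x.
        distance = max(distance, i / n - cdf, cdf - (i - 1) / n)
    return distance


# Hypothetical samples: one roughly bell-shaped, one with a long right tail.
bell_shaped = [4.1, 5.0, 5.2, 5.9, 6.0, 6.3, 6.8, 7.4]
long_tailed = [1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 6.0, 15.0]
print(f"D (bell-shaped) = {lilliefors_statistic(bell_shaped):.3f}")
print(f"D (long-tailed) = {lilliefors_statistic(long_tailed):.3f}")
```

A larger distance indicates a poorer fit to the normal curve; whether it is large enough to be significant depends on the sample size.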

A distribution that is not symmetric but has more cases (more of a "tail") toward one end of the distribution than the other is said to be skewed.

Value of 0 = normal

Positive Value = positive skew

Negative Value = negative skew

Concern arises when the absolute value of the skewness exceeds roughly 2.50 to 3.00.
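The sign convention above can be checked with a small sketch. It assumes the simple moment-based skewness (third central moment divided by the 1.5 power of the second); statistical packages often report a bias-adjusted version, so their values can differ somewhat in small samples. The function name `skewness` and the data are hypothetical.

```python
from statistics import mean


def skewness(values):
    """Return the moment-based skewness of a sample."""
    n = len(values)
    centre = mean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]
left_tailed = [-10, -2, -1, -1, -1]
for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(f"{name}: skewness = {skewness(data):+.3f}")
```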

Kurtosis is the relative concentration of scores in the center, the upper and lower ends (tails) and the shoulders (between the center and the tails) of a distribution.

Value of 0 = normal (mesokurtic)

Positive Value = leptokurtic

Negative Value = platykurtic

Concern arises when the absolute value of the kurtosis exceeds roughly 2.50 to 3.00.
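A kurtosis value on this scale, where 0 corresponds to the normal distribution, can be sketched as below. It assumes the simple moment-based form (fourth central moment divided by the squared second central moment, minus 3); statistical packages often apply a small-sample adjustment, so reported values may differ. The function name `excess_kurtosis` and the data are hypothetical.

```python
from statistics import mean


def excess_kurtosis(values):
    """Return moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(values)
    centre = mean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - centre) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical samples.
flat = [1, 2, 3, 4, 5]                        # evenly spread, light tails
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]    # tight center, heavy tails
print(f"flat:   kurtosis = {excess_kurtosis(flat):+.2f}")
print(f"peaked: kurtosis = {excess_kurtosis(peaked):+.2f}")
```

The flat sample gives a negative (platykurtic) value and the peaked, heavy-tailed sample gives a positive (leptokurtic) value.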