I've seen in many analysis reports that, to test for normality in a discrete variable as well as in a continuous one, people use a frequency chart and a density plot to check visually for normality. But I can't understand how that would be useful, since our data are discrete and a density curve represents the continuous case. Can anyone explain this to me?
1 Answer
There are several distinct issues here.
To start, in principle, a discrete variable can't be normal at all, as the normal is a continuous distribution. The exceptions are where discreteness or rounding is a reporting convention and there are enough distinct values for the data to come close to a normal distribution, or for a normal distribution to be a reasonable reference distribution. People's heights in inches or cm can be examples. (By a reference distribution is meant just a relevant standard with which distributions can be compared. Using sea level as a reference for altitude doesn't mean that the world is expected to be flat, and using the freezing point of water as a reference for temperature doesn't imply anything about variability. So also, using normal distributions as a reference doesn't mean that not being normal is exceptional or problematic in itself.)
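To make that concrete, here is a minimal sketch in Python (simulated heights, not real data): values drawn from a normal distribution and then rounded to whole inches keep many distinct values and still sit close to a normal reference.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    heights = rng.normal(loc=70, scale=3, size=1000)  # continuous "true" heights
    reported = np.round(heights)                      # reporting convention: whole inches

    # Many distinct values survive the rounding...
    print("distinct reported values:", np.unique(reported).size)
    # ...and a normality test is typically untroubled by rounding alone.
    print("Shapiro-Wilk p-value:", stats.shapiro(reported).pvalue)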
At some level, declaring (close enough to) normal (for my purpose) is always an approximation, e.g. in so far as variables have finite limits and don't extend over the entire real line.
In the pursuit of moderate rigour, many texts are firm that discrete variables have probability mass functions and that only continuous variables can be said to have probability density functions.
Contrariwise, a variable like height reported discretely can be compared with a normal density because the idea is that height is in principle continuous and so we are interested in how far a normal distribution is a good approximation.
Contrariwise too, several texts are happy with density as a general idea applicable to any random variable, underlining that the question is just density with respect to what measure (e.g. counting measure). These texts are alluding to the measure theory the reader is expected to know (about). As I know almost no measure theory, this wording may not be entirely felicitous to those who know more.
At some point, most people with statistical experience would say that there are discrete variables which just can't be compared meaningfully or helpfully with normal distributions. Concretely:

- A categorical variable with distinct named or nominal values, say "frog", "toad", "newt".
- An ordinal (graded) response such as 1 "strongly disagree" to 5 "strongly agree". I have often seen (in this forum alone) concern over whether such variables can be said to be normally distributed. At best, this is a learner's misinterpretation of good texts or courses; at worst, it is an echo of bad texts or courses. For what it is worth, noting whether such a variable has (say) an approximately symmetric or strongly skewed distribution can I think be worthwhile (although such summaries take the numbers literally; a quick numeric sketch follows this list).
- Counts that are usually small integers, such as the number of children, cats, or cars per household.
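As flagged above, here is a sketch of such a literal-minded summary (made-up response frequencies for a 1 to 5 item): the skewness calculation takes the codes 1 to 5 at face value.

    import numpy as np
    from scipy import stats

    # Hypothetical frequencies for responses coded 1 ("strongly disagree")
    # to 5 ("strongly agree"): most answers piled up at the agree end.
    codes = np.repeat([1, 2, 3, 4, 5], [5, 10, 30, 80, 75])
    print("sample skewness:", stats.skew(codes))  # negative: skewed towards low codes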
FWIW, while a histogram with a superimposed density curve can often be helpful, I consider a normal quantile plot to be by far the best graphical check of normality; see also Benefits of using QQ-plots over histograms. For a strongly discrete variable the histogram should collapse to a bar chart of frequencies of the distinct values observed, and artefacts of bin width or origin should barely bite; conversely, distinct discrete values would be obvious on a normal quantile plot if they matter.
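A small sketch of that comparison (simulated small-integer counts, which a normal distribution should not fit well):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(1)
    counts = rng.poisson(lam=2, size=500)  # counts that are usually small integers

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

    # The "histogram" of a strongly discrete variable is really a bar chart
    # of the frequencies of the distinct values observed.
    values, freqs = np.unique(counts, return_counts=True)
    ax1.bar(values, freqs)
    ax1.set_title("Frequencies of distinct values")

    # On the normal quantile plot the distinct values show up as
    # horizontal bands of repeated points.
    stats.probplot(counts, dist="norm", plot=ax2)
    ax2.set_title("Normal quantile plot")

    plt.tight_layout()
    plt.show()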
There have been dozens of posts about checking for normality, and the over-arching question is always: why do you think it matters whether this variable is normally distributed? See Is normality testing 'essentially useless'? for several points of view.
The nuances here are everywhere dense: e.g. a Pearson correlation can be calculated as a measure of linearity of relationship and be used as a descriptive statistic, but some of the machinery for calculating P-values is based on an ideal condition of bivariate normality. A related and even more widespread misunderstanding is that regression depends on variables being distributed normally: if that were true, using (0, 1) indicators as predictors would be out of order, yet everyone in the know agrees that it is utterly standard as part of the machinery. So, normality there isn't or is important, depending on what you want to do.
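To underline the regression point with a sketch (made-up data): an ordinary least-squares fit with a (0, 1) indicator as predictor behaves exactly as intended, even though the predictor is about as non-normal as a variable can be.

    import numpy as np

    rng = np.random.default_rng(2)
    group = rng.integers(0, 2, size=200)          # (0, 1) indicator predictor
    y = 1.0 + 2.0 * group + rng.normal(size=200)  # normal errors, non-normal x

    # Ordinary least squares: intercept plus group effect.
    X = np.column_stack([np.ones_like(group), group])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("intercept, group effect:", beta)       # recovers roughly (1, 2)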

- This was highly exploratory and helpful. Thank you! – Atilla Colak May 14 '21 at 10:20
- What do you think causes the misunderstanding that normality is required for regression predictors (IVs)? Is it to do with the fact that a normally distributed predictor is symmetric and has reasonable kurtosis (by definition an excess kurtosis of zero), which in a sense is better than a highly skewed continuous predictor? You may want to transform the latter to make it more symmetric, with a more evenly distributed density giving more stability to the regression process; for example, it presumably reduces the number of high-leverage points. – Single Malt May 14 '21 at 11:05
- @SingleMalt (a) Some books say so. You might not believe how many bad books are out there. As just one example, see the reviews at https://www.amazon.com/dp/0521763223#customerReviews where my detailed criticisms supporting 1 star lie alongside positive reviews. (b) There is a general myth of "you need normal distributions for parametric statistics". (c) People misread normality of errors as implying normality of variables. Even competent authors can, in my view, over-emphasise normality of errors by calling it an assumption. Assumption in statistics often just means "ideal condition". – Nick Cox May 14 '21 at 12:22
- I've seen discussions pointing out that U-shaped distributions for predictors are good for minimising uncertainty about parameters. This is more relevant for experimental design than for regressions with observational data. – Nick Cox May 14 '21 at 12:24
- Very interesting review. The worth of using real data for exposition includes that it adds relevance and naturally contains real-world complications such as non-linearities or missing values. You mention they "confuse distribution shape with location and scale"; I am uncertain of the distinction between those, so it is perhaps a not-rare source of confusion. – Single Malt May 14 '21 at 13:10
- Location = where centred, scale = how big around location, shape = everything else. – Nick Cox May 14 '21 at 13:20