
As an assumption of linear regression, the normality of the errors is sometimes wrongly "extended", or interpreted, as a requirement that y or x themselves be normally distributed.

Is it possible to construct a scenario/dataset where X and Y are non-normal but the error term is normal, so that the resulting linear regression estimates are valid?

ECII
  • Trivial example: X has a Bernoulli distribution (i.e., taking the values 0 or 1); Y = X + N(0, 0.1). Neither X nor Y is normally distributed on its own, but regressing Y on X still works (see the sketch after these comments). – Hong Ooi Feb 17 '14 at 08:45
  • I guess you are thinking about the distribution of the residuals, not the distribution of the variables. – tashuhka Feb 17 '14 at 10:03
  • I have an example worked out here: [What if residuals are normally distributed but Y is not?](https://stats.stackexchange.com/questions/12262//33320#33320) – gung - Reinstate Monica Feb 17 '14 at 14:52
  • Related: https://stats.stackexchange.com/questions/148803/how-does-linear-regression-use-the-normal-distribution – kjetil b halvorsen Dec 07 '19 at 13:25
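
A minimal sketch of Hong Ooi's example (the Bernoulli probability, sample size, and seed are assumed here for illustration):

set.seed(42)
x <- rbinom(1000, 1, 0.5)          # X is Bernoulli: only the values 0 and 1
y <- x + rnorm(length(x), 0, 0.1)  # Y = X + N(0, 0.1) noise
coef(lm(y ~ x))                    # recovers intercept near 0 and slope near 1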

1 Answer


Expanding on Hong Ooi's comment with an image: here is a dataset where neither marginal distribution is normal but the residuals still are, so the assumptions of linear regression are satisfied:

[Scatter plot of y against x with marginal histograms showing the non-normal marginal distributions of x and y]

The image was generated by the following R code:

library(psych)  # provides scatter.hist()

# X is Bernoulli(0.3), so it takes only the values 0 and 1
x <- rbinom(100, 1, 0.3)
# Y is normal conditional on X; its marginal is a bimodal mixture of N(5, 1) and N(10, 1)
y <- rnorm(length(x), 5 + x * 5, 1)

# Scatter plot of y vs. x with marginal histograms
scatter.hist(x, y, correl=F, density=F, ellipse=F, xlab="x", ylab="y")
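
To check the claim about the residuals directly, one can fit the regression and inspect them; a minimal follow-up sketch, continuing from the code above:

fit <- lm(y ~ x)
qqnorm(resid(fit)); qqline(resid(fit))  # points should fall close to the line
shapiro.test(resid(fit))                # should not reject normality of the residuals

Since Y is 5 + 5X plus N(0, 1) noise by construction, the residuals are essentially a sample from that normal noise distribution.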
Rasmus Bååth