
I keep encountering claimed assumptions of linear regression (especially ordinary least squares regression) that are untrue or unnecessary. For example:

  • independent variables must have a Gaussian distribution
  • outliers are the points above the upper whisker or below the lower whisker (in boxplot terminology)
  • the sole purpose of transformations is to bring a distribution closer to normal in order to suit the model.

I would like to know which myths are generally taken as facts or assumptions about linear regression, especially concerning nonlinear transformations and distributional assumptions. How did such myths come about?

smci
  • Does "linear regression" refer specifically to ordinary least squares regression? (Or is this another misconception/assumption, that regression equals minimizing the squared error?) – Sextus Empiricus Feb 05 '22 at 13:56
  • Here's a source of several myths that was referenced in a recent question: https://www.albert.io/blog/ultimate-properties-of-ols-estimators-guide/. Perhaps including such references in our answers would be a good thing because (a) it would demonstrate these issues are real and (b) point to some of the offenders. – whuber Feb 05 '22 at 16:00
  • BTW, **please keep each answer to one myth** (or set of closely related ones). Otherwise the voting will be ambiguous and the answers will be more difficult to curate. – whuber Feb 05 '22 at 16:01
  • linear regression itself is a myth – Aksakal Feb 05 '22 at 16:02
  • Incidentally, this question is too broad: it asks about a large set of things. Could we please focus answers on issues directly related to **ordinary least squares regression**? Otherwise this thread will quickly become a mess and have to be closed. – whuber Feb 05 '22 at 16:05
  • The question is closed because it needs more focus, but are there suggestions for bringing more focus? Or is the closing about the unclear sentence 'the purpose of transformations, linear regression, and different kinds of distributions', which is very broad when taken literally? – Sextus Empiricus Feb 06 '22 at 12:28
  • Do you consider differencing to be included as a data transformation? If so, there are endless questions on this site where people are confused about whether and how stationarity is required for linear regression, if differencing should be applied first, etc. – Chris Haug Feb 06 '22 at 19:43
  • Linear regression is widely used by people with little background in statistics or indeed in math (physicists, computer scientists, economists, but also biologists, psychologists, etc.). Thus what looks like a *myth* might often be just an incorrectly stated (mathematically/statistically) but empirically/intuitively correct understanding of where the regression is applicable or not. – Roger Vadim Feb 07 '22 at 09:21
  • The assumptions, even when stated correctly, are of importance when one's agenda is making inference from the data (at least, that is what I have come across in an overwhelming number of answers on this website). But I think the linearity assumption (the relationship between IV and DV) holds importance for predictions too. Am I right? – Ritik P. Nayak Feb 08 '22 at 10:18

10 Answers


There are three myths that bother me.

  1. Predictor variables need to be normal.

  2. The pooled/marginal distribution of $Y$ has to be normal.

  3. Predictor variables should not be correlated, and if they are, some should be removed.

I believe that the first two come from misunderstanding the standard normality assumption in an OLS linear regression: it is the error terms, estimated by the residuals, that are assumed to be normal. People seem to have misinterpreted this to mean that the pooled/marginal distribution of all $Y$ values has to be normal.
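
To make the distinction concrete, here is a minimal simulation sketch (my own hypothetical numpy/scipy code, not part of the original answer): the errors are exactly normal, yet the marginal distribution of $Y$ is strongly bimodal because the predictor is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Bimodal predictor: two well-separated groups of x values.
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])
y = 2 + 3 * x + rng.normal(0, 1, 1000)   # the errors ARE normal

# Fit the line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# The residuals look normal; the marginal distribution of Y does not.
print("Shapiro-Wilk p-value, residuals :", stats.shapiro(resid).pvalue)  # typically large
print("Shapiro-Wilk p-value, marginal Y:", stats.shapiro(y).pvalue)      # typically ~0
```

The normality assumption concerns `resid`; the marginal distribution of `y` can look like almost anything.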

For the myth about correlated predictors, I have two hypotheses.

  1. People misinterpret the Gauss-Markov assumption about error term independence to mean that the predictors are independent.

  2. People think they can eliminate features to get strong performance with fewer variables, reducing overfitting.

I understand the idea of dropping predictors in order to have less overfitting risk without sacrificing much of the information in your feature space, but that seems not to work out. My post here gets into why and links to further reading.

Dave
  • Thank you @Dave for such a riveting demonstration of the facts. But on your 3rd point, isn't the first half of it, "Predictor variables should not be correlated", a norm? Isn't it true that multicollinearity is a matter of concern for inference when it comes to making sense of the parameters associated with the predictors? Does multicollinearity not lead to vague CIs of the parameters? In fact, when making dummy variables of categorical data, I would drop one of the dummies in order to avoid multicollinearity. How do you defend your third argument then? – Ritik P. Nayak Feb 06 '22 at 19:46
  • @RitikP.Nayak Yes, confidence intervals are wider when features are correlated. If you start screening features for correlation, however, you’re influencing downstream inference. If you want uncorrelated features in order to have the narrowest confidence intervals you can have, that might call for a designed experiment. – Dave Feb 06 '22 at 19:59
  • Thank you @Dave. I'm with you on it. But I request a comment on the second point I made in the earlier comment: "when making dummy variables of categorical data, I would drop one of the dummies in order to avoid multicollinearity". To add to this, does multicollinearity not matter at all for prediction purposes? – Ritik P. Nayak Feb 06 '22 at 20:07
  • You’re not dropping a category; you’re letting it be subsumed by the intercept. – Dave Feb 06 '22 at 20:23
  • Even normality of the error terms isn't needed for most things. The only thing that will buy you is efficiency. OLS will be unbiased and/or consistent even if errors aren't normally distributed. – RoyalTS Feb 07 '22 at 09:35

@Dave's answer is excellent. Here are some more myths.

  1. The original scale/transformation for $Y$ is the one you should use in the model.
  2. The central limit theorem means you don't have to worry about any of this if $N$ is moderately large.
  3. Trying different transformations for $Y$ does not distort standard errors, $p$-values, or confidence interval widths (see the sketch below).
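
As a hedged illustration of the third myth (a simulation sketch of my own, not from this answer): under a true null, trying several transformations of $Y$ and keeping the smallest $p$-value rejects a nominal 5% test more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sim = 50, 2000
rejections = 0

for _ in range(n_sim):
    x = rng.normal(size=n)
    y = rng.lognormal(size=n)   # positive Y, independent of x: the null is true
    # Fit on three candidate scales and keep the most "significant" result.
    pvals = [stats.linregress(x, f(y)).pvalue for f in (lambda v: v, np.log, np.sqrt)]
    rejections += min(pvals) < 0.05

# Typically noticeably above 0.05: transformation shopping distorts inference.
print("rejection rate of a nominal 5% test:", rejections / n_sim)
```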
Frank Harrell
  • If I could, I would give a separate upvote for each of these points because they address issues that appear to be particularly poorly understood by people who visit this site. – whuber Feb 05 '22 at 14:18
  • I need an explanation about the second point about the central limit theorem, because I just posted an answer that is about the opposite statement 'That you need to worry about it', although I have tried to make it more nuanced and make it *not* about the CLT and instead phrased it to be about the 'principle behind the CLT' and tried to focus on the idea of the limiting behaviour... – Sextus Empiricus Feb 05 '22 at 15:37
  • .. is that the idea behind the second point, the myth of the CLT? That people take the CLT, which is about the behaviour of a limit at infinity, to describe the behaviour of statistics computed from samples of finite size? – Sextus Empiricus Feb 05 '22 at 15:40
  • Yes, and many, many people fail to appreciate that if you should have taken log(Y) instead of Y (or a square root or some other nonlinear transformation), the CLT will not rescue you. – Frank Harrell Feb 05 '22 at 16:30
  • This answer could be greatly improved by adding some more context (like most answers here have done) by e.g. clearly stating what the truth is that stands opposed to the myths, explaining why the myths are wrong and/or trying to explain what might be causing people to believe the myths. – NotThatGuy Feb 06 '22 at 04:02

Myth

A linear regression model can only model linear relationships between the outcome $y$ and the explanatory variables.

Fact

Despite the name, linear regression models can easily accommodate nonlinear relationships using polynomials, fractional polynomials, splines, and other methods. The term "linear" in linear regression pertains to the fact that the model is linear in the parameters $\beta_0, \beta_1, \ldots$. For an in-depth explanation of the term "linear" with regard to models, I highly recommend this post.
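
As a small sketch (hypothetical numpy code, not from the original post): a clearly curved relationship fit by ordinary least squares, because the model stays linear in its parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 200)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.3, 200)   # nonlinear in x

# Still a linear model: y = b0 + b1*x + b2*x^2 is linear in (b0, b1, b2).
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # close to the true [1, 2, -0.5]
```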

COOLSerdash
  • I'm confused by the statement of the "_myth_" -- it's true that linear-regression models can only model linear relationships (pretty much by definition). I _think_ you're mixing two different points together: (1) that the term "_linear_" can mean different things (which seems to be the main focus of the linked answer); (2) that nonlinear problems can be approximated as linear problems to enable linear regression (e.g., doing a linear regression on $y=mz+b$ where $z\equiv x^2$ as an approximation of doing a nonlinear regression on $y=mx^2+b$). – Nat Feb 05 '22 at 20:03
  • @Nat I encourage you to read through the linked post. The word "linear" in linear regression means linear in the parameters, not linear in $x$. The model $y=mx^2+b$ would therefore not be called nonlinear regression, because it's still linear in its parameters. It's a linear model of a nonlinear relationship. – COOLSerdash Feb 05 '22 at 20:09
  • An example of a nonlinear problem would be $y = b_0 + x^{b_1} + \text{error}$, where you cannot describe $y$ in terms of a linear sum of the $b$ coefficients you are solving for. An example of how this model would show up is if you wanted to measure the decay rate of something (e.g., heart rate during exercise), but the rate and the asymptotic bounds both depend on independent variables (e.g., initial measurement, time passed, temperature, and subject ID). – Max Candocia Feb 05 '22 at 20:30
  • @Max If you intend $x$ to be an explanatory variable and $b_1$ to be a parameter, that is *not* the kind of "nonlinear relationship" described in this answer, because $b_1$ does not enter the model in a linear fashion (nor can the model be transformed to make it so). – whuber Feb 05 '22 at 21:56

Myth: Variables that are not "significant" should be removed from a multiple regression.

See When should one include a variable in a regression despite it not being statistically significant? for a discussion. Then search our site for "model identification," "regularization," "Lasso," etc.
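
As a sketch of one way the myth bites (my own hypothetical simulation, not from the linked thread): a confounder whose coefficient is often "non-significant" at this sample size still cannot be dropped without biasing the coefficient of interest.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60                                     # small enough that x2 is often "non-significant"
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y = 1.0 * x1 + 0.4 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])
X_drop = np.column_stack([np.ones(n), x1])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

print("x1 coefficient, full model:", b_full[1])   # near the true 1.0
print("x1 coefficient, x2 dropped:", b_drop[1])   # biased, roughly 1.0 + 0.4 * 0.8
```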

whuber
  • Thank you @whuber. I have upvoted and accepted your answer. But let's for a moment not look at it through the 'statistical significance' lens. Let's talk about the correlation between the IV and DV. Say, for instance, that I have 3 predictors in a dataset (x1, x2, and x3). There's no linear relationship between the three. But x1 and x2 are highly correlated with the DV (say, y), whereas x3 has a very weak correlation with the DV, say a correlation coefficient of 0.2. Does this then become a suitable reason to exclude x3 from making predictions? – Ritik P. Nayak Feb 08 '22 at 10:09
  • The answer is a very strong, resounding, *no.* The whole basket of variables matters in multiple regression: removing or adding just one can completely change the picture. The raw (bivariate) correlation tells you even less than the multiple regression p-value might. For insight into what goes on in multiple regression, see https://stats.stackexchange.com/a/28493/919. For additional examples and discussion see https://stats.stackexchange.com/a/34813/919, https://stats.stackexchange.com/a/372416/919, *etc.* – whuber Feb 08 '22 at 15:15

Myth: You should always standardize (or somehow "normalize") variables for the purpose of fitting regression models.

Usually not: software will either do this automatically (under the hood, as it were) or use algorithms that accommodate huge ranges of values among the variables without losing numerical precision.

When one explanatory variable is more than about eight orders of magnitude greater than another, though, watch out: even preliminary standardization can run into trouble. (Eight orders of magnitude is the square root of double precision, which is about 15.6 orders of magnitude.) The commonest example is when a date is used along with other variables, because some dates are represented as the number of seconds elapsed since approximately 1970, which is on the order of $10^9$ seconds.
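
A sketch of the date problem (hypothetical numbers, assuming numpy): per-second Unix timestamps near $1.6 \times 10^9$ sit about nine orders of magnitude above the intercept column, so the raw design matrix is numerically degenerate, while centering the time variable recovers the slope.

```python
import numpy as np

rng = np.random.default_rng(4)
t = 1.6e9 + np.arange(100.0)                              # Unix timestamps, one second apart
y = 5.0 + 0.05 * (t - t[0]) + rng.normal(0, 0.1, 100)     # true slope: 0.05 per second

X_raw = np.column_stack([np.ones_like(t), t])
X_ctr = np.column_stack([np.ones_like(t), t - t.mean()])
print("condition numbers:", np.linalg.cond(X_raw), np.linalg.cond(X_ctr))

b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_ctr, *_ = np.linalg.lstsq(X_ctr, y, rcond=None)
# The raw t column is numerically indistinguishable from a multiple of the
# intercept column, so the raw slope is typically far from the truth.
print("slope, raw design:", b_raw[1])
print("slope, centered  :", b_ctr[1])   # close to 0.05
```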

whuber
  • @Max I cannot discern anything in your comment that appears relevant to this issue. Perhaps you have a different conception of what "standardize" and "normalize" mean in statistics. The former means to recenter and rescale a variable to achieve zero mean and unit variance, while the latter applies a comparable affine transformation to place the variable's range at $0$ to $1$. – whuber Feb 05 '22 at 21:13

Myths:

  • The normality of residuals (and possibly other assumptions of the model) should be tested with a formal hypothesis test, such as the Shapiro-Wilk test.
  • A small $p$-value of such tests indicates that the model is invalid.

Facts:

  • Formal tests of normality (and of other assumptions such as homoskedasticity) do not answer the relevant questions, and if they are used to guide subsequent actions, they can distort the operating characteristics of the models (e.g., inflating type I error, changing the distribution of $p$-values under the null, etc.).
  • A "significant" Shapiro-Wilk test of the residuals just indicates some degree of incompatibility with a normal distribution. It does not say that the (inevitable) deviation from a normal distribution is meaningful or impactful for the operating characteristics of the model. Some aspects - e.g., prediction intervals - are more sensitive to the distribution of the errors than others. The $t$-tests of the coefficients are reasonably robust (with regard to type I error), for example. Whether or not the deviation of the residuals from a normal distribution is worrisome depends on a number of things: the goal of the analysis, the sample size, the degree of deviation, and more (see the sketch below).
COOLSerdash

Where do these ideas come from?

Poor texts (correction: very poor texts), after treating descriptive statistics, often include some more or less mangled version of the idea that (1) you ideally need normally distributed variables to do anything inferential, or else (2) you need non-parametric tests. Then they may or may not mention that transformations could get you nearer to (1).

The first context for writing like this is often Student $t$ tests for comparing means and (Pearson) correlations. There is some historical context for this, for example in treatments that focused on a reference case of a bivariate normal distribution with a correlation as one parameter.

So then writers start talking about regression.

These texts are usually innocent of any formal specification of a data generation process.

Nick Cox
  • Regarding sources, do not forget tons of machine learning tutorials, blog posts, YouTube videos and such, the authors of which have never understood (or even studied) statistics at an appropriate level. – Richard Hardy Feb 05 '22 at 18:39

"Also, how did such myths come about?"

One common assumption in regression is homoscedasticity (and a myth is that it is necessary). Transformations are used to bring the data closer to this assumption.

The violation of the assumption doesn't make the fitting method bad: least squares regression remains an unbiased linear estimator no matter what the underlying distribution is (although it is only guaranteed to be the *best* such estimator, in the sense of lowest variance of the estimates, under homoscedasticity).

But violating the assumptions may cause wrong inferences when we express the observed effects in terms of significance/$p$-values.

There is a difference between the assumptions that are necessary for least squares regression to work, and the assumptions that are necessary for the significance and hypothesis tests based on least squares regression to work.
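
A sketch of that difference (my own hypothetical simulation): with error variance growing with $|x|$, the least squares slope remains unbiased, but the naive $t$-test's type I error drifts away from its nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, n_sim = 100, 2000
slopes, rejections = [], 0

for _ in range(n_sim):
    x = rng.normal(size=n)
    e = rng.normal(size=n) * (0.5 + np.abs(x))   # heteroskedastic errors
    y = 2.0 + 0.0 * x + e                        # the true slope is zero
    fit = stats.linregress(x, y)
    slopes.append(fit.slope)
    rejections += fit.pvalue < 0.05              # p-value assumes homoskedasticity

print("mean slope estimate:", np.mean(slopes))      # near 0: still unbiased
print("type I error       :", rejections / n_sim)   # typically above 0.05 here
```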

Sextus Empiricus
  • Is OLS still blue under heteroskedasticity? Would that not be GLS instead? – Richard Hardy Feb 05 '22 at 18:44
  • A hobby-horse of mine is that we would be better off not talking about assumptions but about ideal conditions. Blame is shared by those who don't teach the difference between (1) assumptions that imply theorems and (2) ideal conditions that are often Utopian on the one hand and often not crucial on the other. But statistics (like many other subjects) is a spiral subject that requires many cycles of learning and unlearning. Much still boils down to assertions from experience with a flavour like "With this kind of data, logarithms are always a good idea" or "That assumption doesn't usually bite hard". – Nick Cox Feb 05 '22 at 18:50

Myth: The error/deviation of the observations needs to be normally distributed.

No, it doesn't.

It is not just the distribution of the errors of the observations that matters. Instead, what often matters is the distribution of the errors of the estimates.

These estimates are computed as a weighted sum of the observations $$\hat\beta = M \cdot y$$ with $$M = (X^TX)^{-1}X^T$$

If we want to estimate the error or significance of the estimates $\hat\beta$, then it is sufficient that those estimates approximately follow a normal distribution. This can happen even when the errors of the observations $y$ do not follow a normal distribution.

By the same principle that underlies the central limit theorem, a statistic that is a weighted sum of variables, or some sort of mean of variables, will approach a normal distribution.

So even if the distribution of the errors/deviations of the observations is not normal, the errors/deviations of the estimates might still be approximately normally distributed.
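
A sketch of this point (my own hypothetical simulation): with strongly right-skewed errors, the sampling distribution of the slope estimate is already close to symmetric at a modest sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_sim = 100, 5000
slope_hats = []

for _ in range(n_sim):
    x = rng.normal(size=n)
    e = rng.exponential(size=n) - 1.0    # centered but strongly right-skewed errors
    y = 1.0 + 2.0 * x + e
    slope_hats.append(stats.linregress(x, y).slope)

print("skewness of the errors     :", stats.skew(rng.exponential(size=10**5) - 1))  # about 2
print("skewness of slope estimates:", stats.skew(np.array(slope_hats)))             # much closer to 0
```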

Sextus Empiricus
  • This sounds like [a common misconception about the central limit theorem.](https://stats.stackexchange.com/questions/473455/debunking-wrong-clt-statement) – Dave Feb 05 '22 at 15:36
  • @Dave I mentioned on purpose the 'principle of the central limit theorem' and not plainly 'the central limit theorem'. – Sextus Empiricus Feb 05 '22 at 15:44
  • The fact is that the estimates from OLS are a linear combination of observations $$\hat\beta = M \cdot y$$ with $$M = (X^TX)^{-1}X^T$$ For the distribution of sums we know that they approach normal distributions. The CLT is the statement about the limit for a sample that becomes infinitely large, but the principle that "sums of random variables can be approximated by a normal distribution" is no less true because of that. Sure, it doesn't work for all types of variables. The point is that the OLS estimate is about the distribution of the estimate and not the distribution of the errors. – Sextus Empiricus Feb 05 '22 at 15:51
  • The CLT helps here far less than you think. Often when the residuals are skewed the wrong transformation of Y was used. This can ruin all the $\beta$s. – Frank Harrell Feb 05 '22 at 16:29
  • It is not true for *every* sort of sum of variables, but for many variables a sum like $\hat\beta = M \cdot y$ can be well approximated by a normal distribution. The myth is that the $y$ needs to be normally distributed; **that** is not true. I am not saying that the opposite is instead true, but I am just adding a nuance. The problem of inference and computing p-values is often not a matter of $\hat\beta$ not following a normal distribution (the bulk of the distribution can be well approximated), but instead that other assumptions cause wrong estimates of the variance or bias present. – Sextus Empiricus Feb 05 '22 at 17:24
  • You start talking about errors but suddenly mention $y$ where I would expect to see $\varepsilon$: *This can happen also when the $y$ are not normally distributed.* I find that confusing. I think it would be clearer if you stuck with errors/residuals throughout. Moreover, I wonder if your last sentence is correct. There is no CLT at play when we talk about random variables and their realizations (errors and residuals and such) rather than sums and averages. So I wonder if normal errors are compatible with nonnormal residuals. – Richard Hardy Feb 05 '22 at 18:45
  • Just for pedagogical reasons: I think your point would be more easily accessible to a wider audience if you kept the reference to the unconditional distribution of $\varepsilon$ than to the conditional distribution of $y$. (And if you prefer the latter, consider stating explicitly what the conditioning variables are.) – Richard Hardy Feb 06 '22 at 06:46
  • @RichardHardy I have edited the post. Feel free to improve it further. I believe you get the point that I wish to make. – Sextus Empiricus Feb 06 '22 at 09:03

Myth: If the histogram of the residuals is nicely bell-shaped, and if the normal q-q plot of the residuals is very close to a straight line (and the sample size is reasonably large so that sampling error is minor), then the normality assumption is reasonable.

BigBendRegion
  • This needs elaboration: which "normality assumption" are you referring to and *why* do you call this a "myth"? One overarching problem is that your assertion is completely qualitative and subjective: how are we intended to understand "nicely bell-shaped," "very close to .. straight," and "reasonable"? Readers might be surprised but not enlightened. – whuber Feb 05 '22 at 21:53
  • It's all just textbook stuff. – BigBendRegion Feb 05 '22 at 22:41
  • That doesn't clarify anything. Also, since many sources state the opposite of this answer, *some* explanation is needed! – whuber Feb 05 '22 at 22:53
  • Without further elaboration, this one seems to be wrong. – Dave Feb 06 '22 at 19:13