
There are many posts regarding linear regression, so I'm sorry to come back to this subject, but I still have some questions about it. I know for sure that the model should be linear in the parameters, meaning we cannot have something like:

$$y = \beta_0 + \beta_1^2x_1$$

But I've read somewhere that the relationship between Y and X should be linear. Is this correct? If so, what does that mean? Does this mean that the relationship between Y and all $x$'s should be linear? If that is correct, why would $y = \beta_1 \cdot x^2$ be a linear regression model?

trder
    The literature is a little to blame here. In many texts and courses, the emphasis on first meeting regression is on fitting straight lines (or planes or ...) to data, so relationships are modelled by forms linear in the variables. Then later -- if a student survives until then -- this has to be unlearned, or at least modified, to appreciating that linearity in the parameters is the real deal. With $y = \beta_0 + \beta_1 x_1$ the form is linear either way, but nothing stops $\beta_1 = \beta_2^2$ or $x_1 = \ln x_2$. – Nick Cox Feb 10 '20 at 12:01
  • @NickCox $\beta_1=\beta_2^2$ precludes a value less than zero. You’d still call that linear? – Dave Feb 10 '20 at 12:27
  • @Dave Good point; let's just say nothing _in this definition_ stops the parameters or variables being functions of some other quantities. After all, you could not have $x_1 = \ln x_2$ either if $x_2 \le 0$. – Nick Cox Feb 10 '20 at 12:30
  • @NickCox The difference with $x_1 = \ln(x_2)$ is that if your $x_2$ data weren't all positive, you wouldn't take a log transform, and if you did, I would expect a software package to squawk. With $\beta_2^2 = \beta_1$, what do you infer about $\beta_2$ if $\widehat{\beta}_1 <0$? – Dave Feb 10 '20 at 15:57
  • The first statement is true about me, but not in general. I am afraid I know researchers who take logarithms of negative numbers by accident and ignore the consequences of missing values. But again, sure, and I am just reacting to an example the OP gave. In many circumstances, other details not part of the problem could qualify or complicate simplified statements. As you know, I can't edit comments this late. I could delete and reissue them, which would just confuse the thread. If you think that the qualification I gave in my second comment is not enough, then sorry about that. – Nick Cox Feb 10 '20 at 16:05
  • The inference if $\beta_1$ were estimated as negative is that thinking of it as the square of something else doesn't match the message in the data. – Nick Cox Feb 10 '20 at 16:06
  • @Dave I encourage you to post your own answer. I don't think either answer to date is quite on-target. – Nick Cox Feb 10 '20 at 16:08

2 Answers


You are right. In statistics, linear models are models that are linear with respect to the coefficients $\beta_i$. However, there is also a "common meaning" of linear, denoting a linear relationship (with random error $\epsilon$) between the variables, for example $y = a \cdot x + b$.

However, many non-linear relationships (in terms of $x$ and $y$) can be linearized. For example:

$$y=a \cdot x^b$$

Taking the logarithm $\ln(\cdot)$ of both sides gives:

$$\ln(y)=\ln(ax^b)$$

Using some properties of logarithms one can rearrange this to:

$$\ln(y)=\ln(a)+\ln(x^b) = b\cdot \ln(x)+\ln(a)$$

which you can present as:

$$Y = A\cdot X+ B$$

where $Y=\ln(y)$, $A=b$, $X=\ln(x)$, $B=\ln(a)$. This means that you just have to transform your values of $x$ and $y$ and perform the linear regression to find $A$ and $B$. Then you can go back to the original model by recovering $a$ and $b$: $$a=e^B, \qquad b=A.$$
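The recipe above can be sketched numerically. This is my own minimal illustration (not part of the original answer), using synthetic data with assumed true values $a = 2$ and $b = 1.5$:

```python
import numpy as np

# Hypothetical data from a power law y = a * x^b with a = 2.0, b = 1.5,
# perturbed by multiplicative noise (additive on the log scale)
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
a_true, b_true = 2.0, 1.5
y = a_true * x**b_true * np.exp(rng.normal(0.0, 0.05, x.size))

# Fit the linearized model ln(y) = A * ln(x) + B by ordinary least squares
X, Y = np.log(x), np.log(y)
A, B = np.polyfit(X, Y, 1)  # returns slope A, then intercept B

# Back-transform to the original parameters: b = A, a = e^B
a_hat, b_hat = np.exp(B), A
print(a_hat, b_hat)  # close to 2.0 and 1.5
```

Note that the noise here is multiplicative, which is exactly the error structure under which the log-transformed model has well-behaved additive errors.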

So you can see that it is, in a sense, still a linear model.

Of course, trying to apply a linear model to nonlinear data in a straightforward way will give you a useless model. Try fitting a linear regression to data generated from the function $\sin(x)$ for $x \in [0,2\pi]$.
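As a sketch of this point (my own illustration, not from the original answer): a straight line fitted to $\sin(x)$ on $[0, 2\pi]$ leaves large, systematic residuals, while regressing $y$ on the basis function $\sin(x)$, which is still a model linear in the coefficients, fits exactly:

```python
import numpy as np

# Noise-free data from y = sin(x) on [0, 2*pi]
x = np.linspace(0.0, 2.0 * np.pi, 100)
y = np.sin(x)

# Straight-line fit: the residuals are large and systematic
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
print(np.max(np.abs(resid)))  # roughly 1: the line cannot follow the curve

# Regressing y on the basis function sin(x) is still a linear model
# in the coefficients, and here it fits exactly (slope 1, intercept 0)
coef = np.polyfit(np.sin(x), y, 1)
print(coef)
```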

Nick Cox
  • $y_i = \beta_0+\beta_1 \sin(x_i) +\epsilon_i$ for $\epsilon_i \overset{iid}{\sim} N(0,1)$ seems to meet your criteria. – Dave Feb 10 '20 at 11:30
  • So Y and X don't need to be linear? (You wrote about a nonlinear relationship when you wrote about $y = a \cdot x^b$.) – trder Feb 10 '20 at 11:31
  • My main question is: would it be wrong to say that Y and X being linear is a linear regression assumption? – trder Feb 10 '20 at 11:34
  • @trder The model in my comment is a linear model. I suggest MathematicalMonk’s video on basis functions: https://youtube.com/watch?v=rVviNyIR-fI. This post, however, contains nonlinear models, such as $\ln(a)$. – Dave Feb 10 '20 at 11:37
  • @Dave yes, statistically it is a linear model. However trying to apply a linear model (in terms of $x$ and $y$) to such relationship will turn out to be useless – Wojciech Artichowicz Feb 10 '20 at 11:43
  • @trder Your question was about the meaning of the term "linear". In statistics, linear models are models that have linear coefficients. However, a linear relationship refers to the relationship between the variables. Those are two different things. And additionally, you can use linear regression (in terms of both the model and the relationship) to find the coefficients of non-linear models, by applying transformations to the feature space. – Wojciech Artichowicz Feb 10 '20 at 11:45

A short answer:

Yes, the linearity between $X$ and $Y$ is an assumption.

A somewhat longer answer:

Statistical terminology might be confusing. "Linear" in "linear regression" means being linear in the coefficients, but "logistic" in "logistic regression" does not mean being logistic in the coefficients! Logistic regression is just a name, a convention, under which statisticians understand a certain method.

Now, the only explicit assumption behind linear regression is that the errors are independent and normally distributed, with constant variance. However, if you attempt to fit a straight line to data generated by a non-linear process (like the quadratic one from your question), the residuals (deviations of the data from the fitted line) will not behave like that: they will show a systematic pattern rather than random, mean-zero noise.

You can transform your data, by taking $Z = f(X)$, where $f(\cdot)$ is some non-linear function (possibly a vector), and fit a linear function (a line, a plane, a hyperplane...) through such transformed data, $Y = \beta_0 + \beta Z$ ($\beta$ also possibly being a vector). It's a legitimate mathematical trick, and you'd still be linear in both senses of the word.

Igor F.