Do explanatory variables have to have a linear relationship with the response variable in multiple linear regression? What is the reason for this assumption?

Also, why are heteroscedastic relationships between IVs (independent variables) and DVs (dependent variables) a problem in multiple regression?

2 Answers

I assume you are talking about OLS/linear regression. Using OLS already implies the assumption of a linear relationship, because you explain the response variable by a linear combination of the regressors. Using OLS when you don't believe there is a linear relationship between the explanatory variables and the response variable therefore defeats the purpose of OLS in the first place. Think about it like this: trying to identify a linear relationship between two variables when the true relationship isn't even close to linear is like buying apples for a cherry pie.
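
For concreteness, the model OLS fits is

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i,$$

where each regressor enters with a constant coefficient $\beta_j$. That constant marginal effect is exactly the linear relationship being assumed.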

If you are talking not about linear regression but about non-linear regression, there is no assumption of a linear relationship between the response variable and the explanatory variables. Think about including the square of a regressor and calculating the partial effect for that regressor: the effect changes with the value of the regressor, so no linear relationship is assumed or needed. Cheers.
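
A minimal sketch of the squared-regressor point (in Python, assuming numpy and statsmodels are available; the data are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Invented data with a curved relationship between x and y
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 1.5 * x - 0.1 * x**2 + rng.normal(0, 1, 200)

# Regress y on x and x^2
X = sm.add_constant(np.column_stack([x, x**2]))
b0, b1, b2 = sm.OLS(y, X).fit().params

# With x and x^2 as regressors, the partial effect of x is
# b1 + 2*b2*x, so it changes with the value of x
for x0 in (1.0, 5.0, 9.0):
    print(f"marginal effect at x = {x0}: {b1 + 2 * b2 * x0:.3f}")
```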

J3lackkyy
  • Thank you for the answer, but my question is about the pairwise linear relationship between predictors and targets in multiple linear regression. –  Feb 19 '21 at 09:52
  • To your second question: an IV should control for variation that would otherwise end up in the error term, in order to ensure exogeneity. There are certain properties an IV should have in order to be a good IV. If the relationship of an IV to the response variable is heteroskedastic, it is obviously not a good IV, as it will bias the identification of the coefficient: it stays correlated with the variation in the error term that you wanted to address with the IV in the first place. – J3lackkyy Feb 19 '21 at 09:52
  • Yes, the pairwise relationship is assumed to be linear by the model you are using. For instance, with $Y = b_0 + b_1 x_1 + b_2 x_2$, the partial derivative with respect to the first predictor/regressor is $b_1$. Hence, a linear relationship is assumed. – J3lackkyy Feb 19 '21 at 09:55
  • Could you please expand on your second comment? I don't quite understand. –  Feb 19 '21 at 09:59
  • Try to phrase a question about what is unclear to you, please. – J3lackkyy Feb 19 '21 at 10:06
  • I don't quite understand exactly why the relationships between the predictors and responses are assumed to be linear. Regarding your answer, I don't understand why taking the derivative w.r.t. the predictor would mean that the predictor has a linear relationship with the response. –  Feb 19 '21 at 10:12
  • The partial derivative represents the effect of a change in the respective regressor. For instance, with $Y = \beta_1 x_1$: if $x_1$ is increased by one unit, $Y$ is on average increased by $\beta_1$ units. This holds for any value of $x_1$, and hence the relationship is called linear. Think about something practical: $Y$ = wage, $X$ = education (years). If there is a linear relationship between $Y$ and $X$, the effect on the wage is the same whether somebody goes from 15 to 17 years of education or from 50 to 52 years. – J3lackkyy Feb 19 '21 at 10:30
  • When you think about that, you recognize that this makes no sense, because the effect should decrease as the number of education years increases (after a PhD, every additional year at university doesn't bring the same increase in knowledge). Hence, in this case the linearity assumption does not hold and you should not use a linear regression/linear OLS. – J3lackkyy Feb 19 '21 at 10:33

The 'linear' in 'linear regression' means linear in the parameters, which isn't necessarily what people normally mean by 'linear' outside of statistics. (To help clarify the issues, it may help you to read through this CV thread: How to tell the difference between linear and non-linear regression models?) The linearity at issue isn't really an assumption, but just a statement of fact about the kind of model it is.

So is there, then, an assumption of the colloquial sense of linearity? Sort of. The model is being fit with the variables you chose and the structure / functional form you chose, and the results are conditional on those choices. That said, you aren't required to simply input the raw form of each variable (call it '$X$'), you can input a transformation, $f(X)$, or several versions of it. It is common, for example, to include both $X$ and $X^2$ in a regression model. This works just fine. The model really does just fit straight lines / flat planes / etc., but they can capture what you need. To see this, it may help you to read my answer here: Why is polynomial regression considered a special case of multiple linear regression?
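
To illustrate (a sketch in Python with numpy; the data are made up): fitting $X$ and $X^2$ is still ordinary least squares on a design matrix whose columns are $1$, $X$, and $X^2$ (linear in the parameters), so a direct least-squares solve and np.polyfit return the same coefficients.

```python
import numpy as np

# Made-up data with a quadratic trend
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 100)
y = 0.5 + 2.0 * x - 0.75 * x**2 + rng.normal(0, 0.5, 100)

# Design matrix with columns 1, x, x^2: the model is linear in the betas
A = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit fits the same model (coefficients come back highest power first)
print(beta)                       # [b0, b1, b2]
print(np.polyfit(x, y, 2)[::-1])  # same values
```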

Regarding homoscedasticity, that is an assumption (see here). The problem with using a model that assumes homoscedasticity with data that include substantial heteroscedasticity is that you are using the information in your data inefficiently to find the best parameter values and to understand the amount of noise in the data / uncertainty in the relationship. That could mean that you have less power to detect an association (see here) or that you have an increased probability of type I errors (see here).
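
A small simulation can make this concrete (a sketch assuming numpy and statsmodels; the data are fabricated for illustration). Here the error variance grows with the regressor, so the classical standard errors, which assume homoscedasticity, differ from heteroscedasticity-robust (HC3) ones:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where the noise variance grows with x (heteroscedasticity)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 500)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.3 * x, 500)

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()              # assumes constant error variance
robust = sm.OLS(y, X).fit(cov_type="HC3")   # heteroscedasticity-robust SEs

# Same point estimates, but the standard errors (and hence p-values) differ
print("classical SEs:", classical.bse)
print("robust SEs:   ", robust.bse)
```

Whether the classical standard errors come out too small or too large depends on the form of the heteroscedasticity, which is why both reduced power and inflated type I error rates are possible.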

gung - Reinstate Monica