
I am new to machine learning, and I'm trying to cover some of the basics. One of the assumptions of linear regression is a linear relationship.

However, on Reddit I was told today that no machine learning model requires a correlation between any of the predictors and the output. My question is: is there a difference between correlation and a linear relationship?

kjetil b halvorsen
Jweir136
  • Regarding the connection between simple linear regression and correlation, there are some useful answers on SE. See e.g. the answers to [this Q](https://stats.stackexchange.com/q/2125/136579) and [this answer](https://stats.stackexchange.com/a/133331/136579). – statmerkur Dec 28 '18 at 10:43

1 Answer


One of the assumptions of linear regression is a linear relationship.

There is a fairly common confusion on this matter that makes the scope of linear regression look narrower than it actually is. In regression analysis, we model the expected value of a response variable $Y_i$ conditional on some regressors $\mathbf{x}_i$. In general, we write the response variable as:

$$Y_i = \mathbb{E}(Y_i|\mathbf{x}_i) + \varepsilon_i \quad \quad \quad \varepsilon_i \equiv Y_i - \mathbb{E}(Y_i|\mathbf{x}_i),$$

where the first part is the true regression function and the second part is the error term. (This model form implies that $\mathbb{E}(\varepsilon_i | \mathbf{x}_i) = 0$.) In a linear regression we assume that the true regression function is a linear function of the parameter vector $\boldsymbol{\beta} = (\beta_0,...,\beta_m)$. This gives us the model form:

$$Y_i = \sum_{k=0}^m \beta_k x_{i,k}^* + \varepsilon_i \quad \quad \quad x_{i,k}^* \equiv f_k(\mathbf{x}_i).$$

You can see from this model form that we can transform the original regressors $\mathbf{x}_i$ via any transform we want (including a non-linear transform). Hence, the important thing to notice is that linear regression does not necessarily assume linearity with respect to the regressor variables. The "linear" in linear regression comes from the fact that the model is linear with respect to the parameters in the regression function. Nonlinear regression occurs when the regression function has one or more parameters that cannot be linearised.
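To make this concrete, here is a minimal sketch (using numpy; the choice of tooling and the particular transforms $f_0 = 1$, $f_1 = x$, $f_2 = x^2$ are my own illustration, not part of the answer above) of fitting a model that is nonlinear in the regressor $x$ but linear in the parameters, so ordinary least squares still applies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a quadratic (i.e. nonlinear in x) regression function
x = rng.uniform(-2, 2, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=100)

# Design matrix built from transforms of x: f_0(x) = 1, f_1(x) = x, f_2(x) = x^2.
# The model is still *linear* in the parameter vector beta, so OLS applies.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)  # estimates should be close to the true values (1.0, 2.0, -0.5)
```

The point is that "linear regression" covers this quadratic-in-$x$ model, because linearity is required only in $\boldsymbol{\beta}$.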

...is there a difference between correlation and a linear relationship?

Correlation is a measure of the strength of a linear relationship between two variables. It occurs as a special case of linear regression. If we use a simple linear regression model under standard assumptions, then we have a single regressor $x_i$, with no transformation of this variable. The simple linear regression model is:

$$Y_i = \beta_0 + \beta_1 x_{i} + \varepsilon_i \quad \quad \quad \varepsilon_i \sim \text{N}(0, \sigma^2).$$

If we fit the simple linear regression model using ordinary least squares (OLS) estimation (the standard estimation method) then we get a coefficient of determination that is equal to the square of the sample correlation between the $y_i$ and $x_i$ values. This gives a close connection between simple linear regression and sample correlation analysis.
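As a quick numerical check of this connection (again a sketch with numpy; the simulated data are my own assumption), the coefficient of determination from an OLS fit of the simple model agrees with the squared sample correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 + 0.8 * x + rng.normal(size=200)

# OLS fit of the simple linear regression model y = b0 + b1 * x + error
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

# Coefficient of determination R^2 from the fitted model
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Squared sample correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(r_squared, r**2)  # the two quantities agree to numerical precision
```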

Ben
  • Hmm... Even though this answer is interesting, linear regression (in most basic stats undergrad courses) does assume that the relationship is linear. Here you are describing Generalized Linear Models (GLMs), which do relax some assumptions and have a link function (here `f`) which is non linear – D1X Aug 02 '21 at 14:20
  • The above would be considered within the scope of linear regression. GLMs encompass linear regression as a special case. – Ben Aug 02 '21 at 21:55