
I want to determine whether I get the same regression line when I regress $y$ on $x$ as when I regress $x$ on $y$.

Using R's built-in lm function, I get the following results.

##
## Call:
## lm(formula = y ~ x, data = df1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x             0.5001     0.1179   4.241  0.00217 **

And

##
## Call:
## lm(formula = x ~ y, data = df1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6522 -1.5117 -0.2657  1.2341  3.8946
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.9975     2.4344  -0.410  0.69156
## y             1.3328     0.3142   4.241  0.00217 **

I figured that if the regression lines are the same, then

$$y_1 = \alpha_1 + \beta_1 x_1 \Longleftrightarrow x_1 = \frac{y_1 - \alpha_1}{\beta_1}$$

from lm(y ~ x, data = df1) and

$$x_2 = \alpha_2 + \beta_2 y_2$$

from lm(x ~ y, data = df1) should match up. (Is this correct?)

In my case that would give (for $y_1 = y_2 = 1$)

$$\begin{align*}
x_1 &= \frac{y_1 - \alpha_1}{\beta_1} = \frac{1 - 3.0001}{0.5001} \approx -3.9994 \\
x_2 &= \alpha_2 + \beta_2 y_2 = -0.9975 + 1.3328 \cdot 1 = 0.3353
\end{align*}$$
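
As a quick sanity check, the same arithmetic in R, plugging in the coefficients copied from the two summaries above:

    a1 <- 3.0001;  b1 <- 0.5001    # from lm(y ~ x, data = df1)
    a2 <- -0.9975; b2 <- 1.3328    # from lm(x ~ y, data = df1)
    (1 - a1) / b1    # x1: invert the first line at y = 1, about -3.9994
    a2 + b2 * 1      # x2: evaluate the second line at y = 1, about 0.3353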

So $x_1 \neq x_2$, and thus there is a difference between the linear regression of $y$ on $x$ and that of $x$ on $y$.

Is this correct?

Thanks in advance.

Mevve
    You are forgetting the error term in your equations. There are other issues as well, but that should be a start. Please also see [this](https://stats.stackexchange.com/questions/22718/what-is-the-difference-between-linear-regression-on-y-with-x-and-x-with-y). – Dayne Oct 30 '20 at 17:33

2 Answers


In the case of a simple linear regression:

$$y = \alpha + \beta x + \epsilon$$

The slope can be estimated as $\hat{\beta} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$. If we flip $x$ and $y$, the covariance in the numerator stays the same; only the variance in the denominator changes, from $\text{Var}(x)$ to $\text{Var}(y)$. So from there I imagine you can work out when the two slopes will (or will not) be equal!
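
A quick numeric illustration of this (a minimal sketch with simulated data, since the df1 from the question isn't shown):

    # Simulated stand-in for the question's df1
    set.seed(42)
    df1 <- data.frame(x = rnorm(10, mean = 9, sd = 3))
    df1$y <- 3 + 0.5 * df1$x + rnorm(10)

    # Slope of y ~ x is Cov(x, y) / Var(x); flipping swaps the denominator
    b_yx <- cov(df1$x, df1$y) / var(df1$x)
    b_xy <- cov(df1$x, df1$y) / var(df1$y)
    all.equal(b_yx, unname(coef(lm(y ~ x, data = df1))["x"]))  # TRUE
    all.equal(b_xy, unname(coef(lm(x ~ y, data = df1))["y"]))  # TRUE

    # Their product is the squared correlation, so the two slopes are
    # reciprocals of each other only when |cor(x, y)| = 1
    b_yx * b_xy
    cor(df1$x, df1$y)^2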

Andy W

It depends on your loss function. A common choice is to minimize the residual sum of squares (for the case $y \sim x$):

$$\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \rightarrow \min$$

This is what lm in R does. It takes into account only the vertical distances (when $y$ is on the vertical axis).

By flipping $x$ and $y$, it is the original horizontal distances that get minimized (after summation, of course).

So the two fits are not the same, but other methods exist. As a loss function you can take the Euclidean (perpendicular) distance of each point from the regression line and minimize the sum of their squares; this is known as orthogonal or total least squares regression. In that case your solution should work, because the fitted line is the same whichever variable you flip.
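
For illustration, a minimal sketch of that orthogonal fit via the first principal component, with simulated data standing in for the question's df1; flipping $x$ and $y$ leaves the fitted line unchanged:

    # Simulated stand-in for the question's df1
    set.seed(42)
    df1 <- data.frame(x = rnorm(10, mean = 9, sd = 3))
    df1$y <- 3 + 0.5 * df1$x + rnorm(10)

    # Orthogonal fit: the line through the centroid along the
    # first principal component of (x, y)
    pc <- prcomp(df1[, c("x", "y")])
    v  <- pc$rotation[, 1]
    slope     <- unname(v["y"] / v["x"])
    intercept <- mean(df1$y) - slope * mean(df1$x)

    # With the roles of x and y swapped, the slope simply inverts,
    # i.e. both fits describe the same line in the plane
    slope_flipped <- unname(v["x"] / v["y"])
    all.equal(slope, 1 / slope_flipped)  # TRUE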

jumpini