
If $y$ depends linearly on $x$, so that a linear regression gives $y=\alpha x + \beta + \eta$, where the noise $\eta$ is normally distributed with zero mean and variance $\sigma^{2}$, and $x$ is itself normally distributed with known mean and variance (say $\mu$ and $\sigma_{x}^{2}$), then what is the joint distribution $P(x,y)$?

I set out to show what my intuition was telling me, namely that the distribution should of course be bivariate normal. The way I set out the proof (I can provide more details of the algebra if useful) is to explicitly calculate $P(x,y)=P(y\mid x)P(x)$, collect all coefficients of $x^2, y^2, xy, x$ and $y$, and then try to write this as $\Lambda_{11}(x-a)^2 + 2 \Lambda_{12} (x-a)(y-b) + \Lambda_{22}(y-b)^2$, again collecting all $x^2, y^2, xy, x$ and $y$ terms and matching coefficients (making use of the fact that the correlation matrix is symmetric, hence its inverse is too, and thus $\Lambda_{12}=\Lambda_{21}$).

This appears to work, and I then note that $\Lambda $ is supposed to be the inverse of the correlation matrix $\Sigma$, so by using the explicit formulae for matrix inversion in 2d, given by

$$ \Sigma_{11} =\frac{\Lambda_{22}}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^{2}}$$

$$ \Sigma_{12} =\frac{-\Lambda_{12}}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^{2}}$$

$$ \Sigma_{22} =\frac{\Lambda_{11}}{\Lambda_{11}\Lambda_{22} - \Lambda_{12}^{2}}$$

I find that: $$\Sigma = \begin{pmatrix}\sigma_x^2& \alpha \sigma_x^2\\ \alpha \sigma_x^2&\sigma^2 + \alpha^{2}\sigma_x^2\end{pmatrix}$$

$$a = \mu \text{ and } b = \alpha \mu + \beta$$

(Consequently, since I can match all of the terms in $x$ and $y$ when writing the joint distribution as a bivariate Gaussian, I conclude that the constant terms must also agree, because both forms of the distribution are normalised.)
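
For anyone who wants to verify the coefficient matching without doing the algebra by hand, here is a small SymPy sketch (the symbol names are my own choice, not part of the derivation above) that expands $-2\log\big(P(y\mid x)P(x)\big)$, reads off $\Lambda$, recovers the centre $(a,b)$, and inverts $\Lambda$ to get $\Sigma$:

```python
import sympy as sp

x, y, alpha, beta, mu = sp.symbols('x y alpha beta mu', real=True)
sigma, sigma_x = sp.symbols('sigma sigma_x', positive=True)

# -2 * log of P(y|x) * P(x), dropping the normalisation constants
quad = (y - alpha * x - beta)**2 / sigma**2 + (x - mu)**2 / sigma_x**2

# The centre (a, b) is the minimiser of the quadratic form
centre = sp.solve([sp.diff(quad, x), sp.diff(quad, y)], [x, y])
print(centre)  # {x: mu, y: alpha*mu + beta}

# Read off the precision matrix Lambda from the x^2, y^2 and xy coefficients
poly = sp.Poly(sp.expand(quad), x, y)
Lam = sp.Matrix([
    [poly.coeff_monomial(x**2), poly.coeff_monomial(x * y) / 2],
    [poly.coeff_monomial(x * y) / 2, poly.coeff_monomial(y**2)],
])

# Invert to get the matrix Sigma appearing in the bivariate Gaussian
Sigma = Lam.inv().applyfunc(sp.simplify)
print(Sigma)  # Matrix([[sigma_x**2, alpha*sigma_x**2],
              #         [alpha*sigma_x**2, alpha**2*sigma_x**2 + sigma**2]])
```

This reproduces the $\Sigma$, $a$ and $b$ quoted above.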

What I find very strange about this is that the off-diagonals of the correlation matrix are not affected by $\sigma$. In the limit $\sigma \to \infty$, $y$ no longer depends on $x$, so presumably their correlation should go to zero. I would expect the off-diagonals to be inversely proportional to $\sigma$.

Have I made a mistake in my calculation or, if I haven't, what is the best way to interpret this result?

    Even under the assumption of joint normality, why do you conclude that "y is not dependent on x" when $\sigma$ grows large? Just the opposite would seem to be the case: the larger $\sigma$ gets, the more apparent and consistent their linear relationship ought to be. – whuber Dec 23 '18 at 14:35
  • @StubbornAtom : I didn't claim that y is univariate normal, I said it's normal conditioned on x, which is the assumption behind standard linear regression. – gazza89 Dec 23 '18 at 14:37
  • @whuber : It's not under any normality assumption that I make that claim, it's more of an intuitive claim that if $\sigma $ is large (and I think this probably needs to be fleshed out more quantitatively, that it has to be large wrt $\alpha$, potentially with some powers in there), then knowing x gives you increasingly little information about y. Consider plotting the line $y=\alpha x$, and then dotting some points along the line. As you let the dots lie further and further either side of the line (increasing $\sigma$), the correlation between x and y goes down – gazza89 Dec 23 '18 at 14:40
  • On the contrary--draw some pictures!--provided the $x$ values ultimately become widely spread out (that's a mild distributional assumption), the correlation must approach $\pm 1$ unless there is no underlying correlation at all. – whuber Dec 23 '18 at 14:47
  • Fair enough @whuber, I'll play around with this in Python at some point and report back. Does it look right to you, that the correlation is independent of $\sigma$ and only depends on $\sigma _{x}$ ? – gazza89 Dec 23 '18 at 14:48
  • Yes, because I would expect the correlation to depend only on the ratio $\sigma/\sigma_x.$ – whuber Dec 23 '18 at 15:28

1 Answer


$\Sigma$ is the covariance matrix, not the correlation matrix. In the correlation matrix you will see both $\sigma$ and $\sigma_x$ terms.

Intuitively, the covariance is the part of the variance that the two variables share, which here is the variance of the $X$ variable. See also the following rule for the covariance of sums of variables:

$$\text{Cov}(aX+bY,cU+dV) = ac \text{Cov}(X,U) +ad \text{Cov}(X,V) +bc \text{Cov}(Y,U) +bd \text{Cov}(Y,V) $$

and

$$ \text{Cov}(Y,X) = \text{Cov}(\alpha X+\beta+\eta,X) = \alpha \text{Cov}(X,X) + \text{Cov}(\eta,X) = \alpha \text{Var}(X) $$

where the constant $\beta$ contributes nothing to the covariance, and the last equality requires only that the error term $\eta$ is uncorrelated with the independent variable $X$ (independence is sufficient).

So for the Pearson correlation you will have:

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_x\,\sigma_y} = \frac{\alpha \sigma_x^2}{\sigma_x\sqrt{\sigma^2 + \alpha^{2}\sigma_x^{2}}} = \frac{\alpha \sigma_x}{\sqrt{\sigma^2 + \alpha^{2}\sigma_x^{2}}} $$

which goes to zero as $\sigma$ grows (and to $\pm 1$ as $\sigma \to 0$), corresponding to your intuition that the correlation gets smaller for larger $\sigma$. Note that, for fixed $\alpha$, it depends on $\sigma$ and $\sigma_x$ only through the ratio $\sigma/\sigma_x$, as noted in the comments.
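
As a quick check, here is a small simulation sketch (the parameter values are arbitrary, chosen only for illustration). For every value of $\sigma$ the covariance stays at $\alpha\sigma_x^2$, while the correlation shrinks as predicted by the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters
alpha, beta, mu, sigma_x = 2.0, 1.0, 0.5, 1.5
n = 500_000

for sigma in (0.1, 1.0, 10.0, 100.0):
    x = rng.normal(mu, sigma_x, size=n)
    y = alpha * x + beta + rng.normal(0.0, sigma, size=n)  # noise drawn independently of x

    emp_cov = np.cov(x, y)[0, 1]
    emp_rho = np.corrcoef(x, y)[0, 1]
    rho = alpha * sigma_x / np.sqrt(sigma**2 + alpha**2 * sigma_x**2)

    print(f"sigma={sigma:6.1f}  cov={emp_cov:6.2f} (alpha*sigma_x^2={alpha * sigma_x**2:.2f})"
          f"  rho={emp_rho:+.3f} (theory {rho:+.3f})")
```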


Regarding the question in the title about the joint distribution: you cannot determine $P(X,Y)$ without first defining the joint distribution $P(X,\eta)$. It is not enough to know that $\eta$ and $X$ are individually normally distributed. This relates to the question How to calculate conditional probability when only marginals are known?
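
To illustrate that point, here is a hypothetical construction (my own, not part of the question's setup) in which $x$ and $\eta$ are each marginally normal and even uncorrelated, yet $(x,y)$ is clearly not bivariate normal because $\eta$ is not independent of $x$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(0.0, 1.0, size=n)      # x ~ N(0, 1)
b = rng.choice([-1.0, 1.0], size=n)   # random sign, independent of x
eta = b * x                           # eta ~ N(0, 1) marginally and Cov(x, eta) = 0,
                                      # but eta is clearly not independent of x
alpha, beta = 2.0, 1.0
y = alpha * x + beta + eta            # 50/50 mixture of the lines y = 3x+1 and y = x+1

# The marginals of x and eta look perfectly normal ...
print(np.std(x), np.std(eta))         # both close to 1

# ... but y is not normal, so (x, y) cannot be bivariate normal:
excess_kurtosis = np.mean((y - y.mean()) ** 4) / np.var(y) ** 2 - 3
print(excess_kurtosis)                # roughly 1.9; a normal variable would give ~0
```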

  • ah yes indeed, correlation vs covariance is the key here. On your final point about $P(X, \eta)$ not being defined: implicitly, when I defined $P(y|x)$, I was assuming that all of the usual assumptions of linear regression hold. I never really thought about it in that much depth, but I think the noise being independent of x is necessary in linear regression. – gazza89 Dec 24 '18 at 17:59
  • Independent and identically distributed errors are common in regression (e.g. ordinary least squares) but not necessary. – Sextus Empiricus Dec 24 '18 at 18:40