I am confused about why this formula is called the correlation of two variables $X$ and $Y$. Also, how is it derived?

The part where we divide the covariance by the product of the standard deviation of $X$ and the standard deviation of $Y$ confuses me the most. Please explain the reasoning or point me to a good source on this.

StatsStudent
Vicrobot
  • One way of thinking about the correlation and its specific representation is that it is a unitless quantity. So by dividing the product moment estimator $E(XY)$ by the SDs of $X$ and $Y$, you get something that doesn't depend on the scale of either variable (in a sense). – AdamO Jan 07 '19 at 17:22
  • Correlation equals covariance of standardized variables. – Michael M Jan 07 '19 at 17:23
  • Have a look here maybe that helps: https://stats.stackexchange.com/questions/256344/why-is-correlation-not-very-useful-when-one-of-the-variables-is-categorical/256380#256380 Or here: https://stats.stackexchange.com/questions/70969/how-to-understand-the-correlation-coefficient-formula?rq=1 – Stefan Jan 07 '19 at 17:24

1 Answer

Maybe going back to the notion of covariance would help.

Say we have two random variables $X$ and $Y$, with $n$ independent realizations $x_1,x_2,\dots,x_n$ and $y_1,y_2,\dots,y_n$. The formula for the sample covariance is

$$\sigma_{xy} =\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})$$

where $\bar{x}$ and $\bar{y}$ are respectively sample means for $X$ and $Y$.
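As a minimal sketch of the formula above, the sample covariance can be computed by hand in a few lines of Python. The function names and the height/weight numbers are made up for illustration and are not part of the original question.

```python
def sample_mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance with the 1/(n-1) normalization used above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length n >= 2")
    x_bar = sample_mean(xs)
    y_bar = sample_mean(ys)
    # Sum of products of deviations from the respective sample means.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (len(xs) - 1)


# Hypothetical data: heights in centimetres, weights in kilograms.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 84.0]

cov_cm_kg = sample_covariance(heights_cm, weights_kg)
print(cov_cm_kg)  # 207.5, in units of cm * kg
```

Note that the result carries the units of both variables (here cm times kg), which is what the division by the standard deviations later removes.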

Now, thanks to the Cauchy-Schwarz inequality, the sample covariance is bounded in absolute value by the product of the sample standard deviations of the two variables, which I will denote by $\sigma_x$ and $\sigma_y$. We then have

$$-\sigma_x\sigma_y \leq \sigma_{xy}\leq \sigma_x\sigma_y$$
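To spell out the Cauchy-Schwarz step, here is one way to write it. The shorthand $a_i = x_i-\bar{x}$ and $b_i = y_i-\bar{y}$ is introduced only for this sketch. Cauchy-Schwarz applied to the two vectors of deviations says

$$\left|\sum_{i=1}^n a_i b_i\right| \leq \sqrt{\sum_{i=1}^n a_i^2}\,\sqrt{\sum_{i=1}^n b_i^2}$$

Dividing both sides by $n-1$ gives

$$|\sigma_{xy}| = \frac{1}{n-1}\left|\sum_{i=1}^n a_i b_i\right| \leq \sqrt{\frac{1}{n-1}\sum_{i=1}^n a_i^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^n b_i^2} = \sigma_x\sigma_y$$

Equality holds exactly when the deviations are proportional, that is, when $b_i = c\,a_i$ for some constant $c$ and all $i$ (or one of the two vectors is zero), which means the points $(x_i, y_i)$ lie on a straight line.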

Now divide all terms in the inequality by $\sigma_x\sigma_y$ (which is strictly positive as long as neither sample is constant) and you have the formula for correlation

$$-\frac{\sigma_x\sigma_y}{\sigma_x\sigma_y} \leq \frac{\sigma_{xy}}{\sigma_x\sigma_y}\leq \frac{\sigma_x\sigma_y}{\sigma_x\sigma_y} $$

$$-1 \leq \frac{\sigma_{xy}}{\sigma_x\sigma_y} \leq 1$$

with the correct bounds, $-1$ and $1$. If you grasp the notion of covariance, then you'll surely see that correlation is simply a standardized version of it.
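The following sketch illustrates what the standardization buys you. It uses made-up height and weight data and hypothetical helper names, none of which come from the original post. Changing the units of one variable rescales the covariance but leaves the correlation unchanged, the correlation equals the covariance of the standardized variables, and an exact linear relationship hits the bounds $\pm 1$.

```python
from math import isclose, sqrt


def mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    """Sample covariance with the 1/(n-1) normalization."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def sample_sd(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return sqrt(sample_cov(xs, xs))


def sample_corr(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))


def standardize(xs):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(xs), sample_sd(xs)
    return [(x - m) / s for x in xs]


# Hypothetical data, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 84.0]
heights_m = [h / 100 for h in heights_cm]  # same heights, different unit

# Covariance depends on the unit of measurement ...
cov_cm = sample_cov(heights_cm, weights_kg)
cov_m = sample_cov(heights_m, weights_kg)

# ... while correlation does not.
r_cm = sample_corr(heights_cm, weights_kg)
r_m = sample_corr(heights_m, weights_kg)

# Correlation is the covariance of the standardized variables.
r_std = sample_cov(standardize(heights_cm), standardize(weights_kg))

# Exact linear relationships reach the bounds +1 and -1.
r_up = sample_corr(heights_cm, [2.0 * h + 3.0 for h in heights_cm])
r_down = sample_corr(heights_cm, [-0.5 * h + 1.0 for h in heights_cm])

print(cov_cm, cov_m)  # differ by a factor of 100
print(r_cm, r_m, r_std)  # agree up to floating-point rounding
print(r_up, r_down)  # approximately 1 and -1
```

The helpers are written out by hand to mirror the formulas in this answer; in practice a library routine would normally be used instead.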

Easymode44
  • What does "standardized version" mean? – Vicrobot Jan 07 '19 at 18:02
  • It means that it gives you the same information as the covariance, but on a scale that varies from $-1$ to $1$. As others have put it, it is indeed unitless, while covariance is expressed in the units in which the variables were measured. – Easymode44 Jan 07 '19 at 18:06
  • So how does that division ensure that the resulting quantity (i.e. the correlation) still increases in magnitude as the relationship becomes more linear? – Vicrobot Jan 07 '19 at 19:46
  • This behavior is already an inherent quality of covariance. CS ensures the inequality; division only gives a measure of linear relationship that does not depend on the units of measurement (which can be a problem, especially when dealing with large magnitudes). – Easymode44 Jan 07 '19 at 20:01
  • One last thing: what guarantees that, as the relationship tends to linearity, the covariance tends in magnitude to $\sigma_X\sigma_Y$? – Vicrobot Jan 07 '19 at 20:35
  • I think that the answer provided [above](https://stats.stackexchange.com/questions/70969/how-to-understand-the-correlation-coefficient-formula) provides a sufficient explanation to your last question, surely not answerable in a comment. – Easymode44 Jan 08 '19 at 08:24
  • Please comment on this understanding: as the relationship between the variables becomes linear, the covariance approaches its maximum magnitude and thus tends to $|\sigma_X\sigma_Y|$, since that is its bound. – Vicrobot Aug 05 '19 at 13:25