Why is the correlation formula the way it is? How is it derived?

Question

I am confused about why this formula is called the correlation of two variables $X$ and $Y$. Also, how is it derived?

The part where we divide the covariance by the product of the standard deviation of $X$ and the standard deviation of $Y$ is the most confusing for me. Please explain the reasoning or point to a good source on this.

One way of thinking about the correlation and its specific representation is that it is a unitless quantity. So by dividing the product moment estimator $E(XY)$ by the SDs of $X$ and $Y$, you get something that doesn't depend on the scale of either variable (in a sense). – AdamO, Jan 07 '19 at 17:22
Have a look here, maybe it helps: https://stats.stackexchange.com/questions/256344/why-is-correlation-not-very-useful-when-one-of-the-variables-is-categorical/256380#256380 Or here: https://stats.stackexchange.com/questions/70969/how-to-understand-the-correlation-coefficient-formula?rq=1 – Stefan, Jan 07 '19 at 17:24

Easymode44 · Answer 1 · 2022-01-17T13:40:12.333

Maybe going back to the notion of covariance would help.

Say we have two random variables $X$ and $Y$, with a certain number $n$ of independent realizations $x_1,x_2,\dots x_n$ and $y_1,y_2,\dots y_n$. We know that the formula for the sample covariance is

$$\sigma_{xy} =\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})$$

where $\bar{x}$ and $\bar{y}$ are respectively sample means for $X$ and $Y$.
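As a minimal sketch of this formula in Python (standard library only), the sum can be computed directly. The function name `sample_covariance` and the height/weight numbers are made up for illustration and are not part of the original answer. The second call shows that the covariance changes when the units of one variable change, which is the motivation for standardizing it later.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator, as in the formula above."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations from the sample means.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (len(xs) - 1)


# Hypothetical data: heights in centimetres and weights in kilograms.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 84.0]

cov_cm = sample_covariance(heights_cm, weights_kg)

# Same heights expressed in metres: the covariance shrinks by a factor of 100.
heights_m = [h / 100 for h in heights_cm]
cov_m = sample_covariance(heights_m, weights_kg)

print(cov_cm, cov_m)
```

The two printed values describe the same relationship but differ by the unit conversion factor, so the raw covariance on its own says little about how strong the relationship is.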

Now, thanks to the Cauchy-Schwarz inequality, the sample covariance is bounded in absolute value by the product of the sample standard deviations of the two variables, which I will denote by $\sigma_x$ and $\sigma_y$. We then have

$$-\sigma_x\sigma_y \leq \sigma_{xy}\leq \sigma_x\sigma_y$$
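To spell out where this bound comes from, write $a_i = x_i-\bar{x}$ and $b_i = y_i-\bar{y}$ (shorthand introduced here, not in the original answer). The Cauchy-Schwarz inequality for the vectors $(a_1,\dots,a_n)$ and $(b_1,\dots,b_n)$ states

$$\left(\sum_{i=1}^n a_i b_i\right)^2 \leq \left(\sum_{i=1}^n a_i^2\right)\left(\sum_{i=1}^n b_i^2\right)$$

Dividing both sides by $(n-1)^2$ and using $\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^n a_i^2$ and $\sigma_y^2 = \frac{1}{n-1}\sum_{i=1}^n b_i^2$ gives

$$\sigma_{xy}^2 \leq \sigma_x^2\,\sigma_y^2 \quad\Longrightarrow\quad |\sigma_{xy}| \leq \sigma_x\sigma_y$$

Equality holds exactly when one vector of centered values is a multiple of the other, for example $b_i = c\,a_i$ for all $i$, which means the points $(x_i, y_i)$ lie on a straight line.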

Now divide all terms in the inequality by $\sigma_x\sigma_y$ (which is strictly positive as long as neither sample is constant) and you have the formula for correlation

$$-\frac{\sigma_x\sigma_y}{\sigma_x\sigma_y} \leq \frac{\sigma_{xy}}{\sigma_x\sigma_y}\leq \frac{\sigma_x\sigma_y}{\sigma_x\sigma_y} $$

$$-1 \leq \frac{\sigma_{xy}}{\sigma_x\sigma_y} \leq 1$$

with the correct bounds, $-1$ and $1$. Once you grasp the notion of covariance, you can see that correlation is simply a standardized version of it.
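The standardization can be checked numerically with a short Python sketch (standard library only). The function name `sample_correlation` and the data are hypothetical choices for illustration. It assumes neither sample is constant, so the denominator is non-zero. The sketch looks at three things: the result stays within $[-1, 1]$, rescaling one variable leaves it unchanged, and an exactly linear relationship reaches the bound.

```python
from math import isclose
from statistics import mean, stdev


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    # statistics.stdev also uses the n - 1 denominator, matching the covariance.
    return cov / (stdev(xs) * stdev(ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [2.1, 3.9, 6.2, 7.8, 10.1]          # roughly linear in xs
exact_line = [3.0 - 2.0 * x for x in xs]    # exactly linear, negative slope

r_noisy = sample_correlation(xs, noisy)
r_rescaled = sample_correlation([1000 * x for x in xs], noisy)  # change of units
r_line = sample_correlation(xs, exact_line)

print(r_noisy, r_rescaled, r_line)
print(isclose(r_noisy, r_rescaled), isclose(r_line, -1.0))
```

The change of units multiplies both the covariance and $\sigma_x$ by the same factor, so it cancels in the ratio. The exactly linear case hits the Cauchy-Schwarz bound, with the sign given by the slope.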

edited Jan 17 '22 at 13:40

answered Jan 07 '19 at 17:18

Easymode44


standardized version means? – Vicrobot Jan 07 '19 at 18:02
It means that it gives you the same information as the covariance, but on a scale that varies from $-1$ to $1$. As others have put it, it is indeed unitless, while covariance is expressed in the units in which the variables were measured. – Easymode44 Jan 07 '19 at 18:06
So how does that division ensure that the resulting quantity (i.e. the correlation) still increases in magnitude as the relationship becomes more linear? – Vicrobot Jan 07 '19 at 19:46
This behavior is already an inherent property of covariance. Cauchy-Schwarz ensures the inequality; the division only gives a measure of linear relationship that does not depend on the units of measure (which can be a problem, especially when dealing with large magnitudes). – Easymode44 Jan 07 '19 at 20:01
One last thing: what guarantees that, as the relationship tends to linearity, the magnitude of the covariance tends to $\sigma_X \sigma_Y$? – Vicrobot Jan 07 '19 at 20:35
I think that the answer provided [above](https://stats.stackexchange.com/questions/70969/how-to-understand-the-correlation-coefficient-formula) provides a sufficient explanation to your last question, surely not answerable in a comment. – Easymode44 Jan 08 '19 at 08:24
Please comment on this understanding: as the relationship between the variables becomes linear, the covariance approaches its maximum magnitude, and so tends to $\sigma_X \sigma_Y$ in absolute value, since that is its bound. – Vicrobot Aug 05 '19 at 13:25
