
I'm re-posting a question from math.stackexchange.com; I think the current answer on math.SE is not right.

Select $n$ numbers from the set $\{1,2,...,U\}$ without replacement, where $n$ is always smaller than $U$. Let $y_i$ be the $i$th number selected and let $x_i$ be the rank of $y_i$ among the $n$ selected numbers, where the rank is the position of a number after the $n$ numbers are sorted in ascending order.

We can get $n$ data points $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, and a best-fit line for these data points can be found by linear regression. The correlation coefficient $r_{xy}$ measures the goodness of fit of this line, and I want to calculate $\mathbb{E}(r_{xy})$ or $\mathbb{E}(r_{xy}^2)$ (the coefficient of determination).

If $\mathbb{E}[r_{xy}]$ cannot be calculated exactly, an estimate or a lower bound is still OK.

Updated: By calculating the sample correlation coefficient on randomly generated data, we can see that $r_{xy}$ is quite close to 1, so I want to prove this from a theoretical point of view, or show theoretically that the data generated by the above method are very close to linear.
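For concreteness, here is a minimal R sketch of the simulation just described (the values of $U$, $n$ and the number of replications are arbitrary choices of mine, not part of the question):

```r
set.seed(1)                    # for reproducibility
U <- 10000; n <- 100           # arbitrary example values
r <- replicate(1000, {
  y <- sort(sample(1:U, n))    # n numbers drawn without replacement, in ascending order
  cor(1:n, y)                  # correlation between rank (1..n) and value
})
summary(r)                     # typically very close to 1
```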

Updated: Is it possible to get the distribution of the sample correlation coefficient?

Fan Zhang
  • Please post a link to the math.SE question. Usually it's not good to cross-post unless significant time has elapsed. – cardinal May 19 '11 at 15:30
  • Can the same number be selected twice? Is n smaller than or greater than U? – Nick Sabbe May 19 '11 at 15:30
  • Here is the previous question on math.SE: http://math.stackexchange.com/questions/32569/a-question-about-linear-regression – cardinal May 19 '11 at 15:31
  • @Nick Sabbe The selection is without replacement. n is always smaller than U. – Fan Zhang May 19 '11 at 15:34
  • The problem can be stated more simply: let $\{(x_i,y_i)\} = (1,y_1), \ldots, (n,y_n)$ be the data with the $y_i$ integers, $1 \le y_1 \lt y_2 \lt \cdots \lt y_n \le U$. Finding $\mathbb{E}[r_{xy}]$ is hopeless. There's some chance of attacking $\mathbb{E}[r^2_{xy}]$ but it's not by any means easy (it's a ratio of two quadratic forms in the $y_i$). – whuber May 19 '11 at 16:17
  • @whuber Thank you for your comments. If $E[r_{xy}]$ cannot be calculated, an estimation or lower bound is ok. – Fan Zhang May 19 '11 at 16:20
  • @Fan That's a good comment: by allowing approximation or bounding, you are opening up the question to many more solution methods. Consider making this point explicit in the question itself. If you can, please indicate the ranges of values of $U$ and $n$ you're interested in. For instance, would you be interested in asymptotics as $U \to \infty$? – whuber May 19 '11 at 16:24
  • @whuber I think if $U\rightarrow\infty$, the problem can be seen as selecting with replacement? This would also be of interest to me. – Fan Zhang May 19 '11 at 16:29
  • @Fan for $U \sim \infty$, then after rescaling by $1/U$ you can view the $y_i$ as the order statistics for a sample of $n$ iid variables from a uniform distribution on $[0,1]$, thereby replacing the discrete sums in the original problem with integrals. – whuber May 19 '11 at 16:33
  • @whuber I see. Do you have some relevant materials about your approach? – Fan Zhang May 19 '11 at 16:39
  • @Fan Applicable techniques would include quadratic forms in random variables (http://stats.stackexchange.com/questions/9220), the "delta method" for estimating moments of functions of random variables; distributions of order statistics for uniform variables; the relationship between gaps between uniform variables and the exponential distribution, and possibly even saddlepoint methods, normal approximations, Central Limit Theorem, etc. – whuber May 19 '11 at 17:12
  • @whuber Thank you very much, the information helps me much!! – Fan Zhang May 19 '11 at 17:34

2 Answers


If you only want to show that $r^2_{xy}$ must be close to 1, and to compute a lower bound for it, it's straightforward: for given $U$ and $n$ you only need to find the configuration that maximizes the variance of the residuals. This can be done in exactly four symmetric ways. The two extremes (lowest and highest possible correlations) are illustrated below for $U=20$, $n=9$.

[Figure: the two extreme correlation configurations for $U=20$, $n=9$]

For large values of $U$ and appropriate values of $n$, $r^2_{xy}$ can actually get close to 0. For example, with $n=100$ and very large values of $U \gg n$, $r^2_{xy} \sim 0.03$ in the worst case.
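For what it's worth, here is a small R sketch (my own addition, not part of the original answer) that evaluates $r^2$ for the single-big-gap configuration discussed in the comments below, which appears to be the minimising one:

```r
r2 <- function(y) cor(seq_along(y), y)^2   # r^2 of rank vs. value

# conjectured worst case (see comments): all gaps equal to 1 except one large gap at an end
worst <- function(U, n) c(1:(n - 1), U)

r2(worst(20, 9))      # the low-correlation extreme for U = 20, n = 9
r2(worst(1e6, 100))   # with n = 100 and U >> n this is roughly 0.03, as quoted above
```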

whuber
  • So, we're trying to show $E(r_{xy}^2)$ is close to 1 (or that we can expect to find nearly a straight line) [I'm working on this problem with Zhang Fan]. While $r_{xy}^2$ is bad in some cases, there should be relatively few of these cases. One hope for resolving this problem is therefore to bound the number of cases in which e.g. $r_{xy}^2 \leq 0.99$ (or some other bound). – Douglas S. Stones May 20 '11 at 01:42
  • I'm hoping that 0.99 should be a reasonable bound to consider. E.g. if we compute some examples in R, we can consistently obtain cor(x,y)'s that are _very_ close to 1. E.g. 0.9994561 is a typical result returned by: m – Douglas S. Stones May 20 '11 at 01:48
  • @Douglas That suggests what might be a considerable simplification: by taking a target like 0.99, you can convert the ratio of quadratic forms appearing in $r^2_{xy}$ into a *difference* (by clearing the denominator) and then investigate the chance that the difference is positive. So now you're looking at the distribution of a quadratic form under uniformly random permutations: that's an accessible problem. – whuber May 20 '11 at 13:31
  • @Douglas, @Fan We can do some heuristics, too: asymptotically, the $y_i$ will look uniform, indicating $r^2 \to 1$ as $U \to \infty$. Consider the Kolmogorov-Smirnov statistic $D$, for instance: using that as an upper bound for the residuals shows it is proportional to a (gross) overestimate of $1-r^2$. Therefore $r^2 \to 1$ at least as fast as $D \to 0$. – whuber May 20 '11 at 13:36
  • @whuber Do you have some more detailed description about Kolmogorov-Smirnov statistic D? – Fan Zhang May 24 '11 at 15:54
  • @Fan http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test – whuber May 24 '11 at 15:55
  • @whuber How can one prove that the points $(1,1),(2,2),\dots,(n-1,n-1),(n,U)$ have the least correlation among all the sample sets? – Fan Zhang Jun 22 '11 at 15:00
  • @Fan There may be an elegant way, but here is one: Use Lagrange multipliers (with the KKT conditions) to show the optimum occurs when all but one of the $y_{i+1} - y_i = 1$: that is, the plot looks like two low-sloping lines with a gap. Then show the minimum occurs when the gap is at one end, as shown in the illustration. – whuber Jun 22 '11 at 16:06

Re-arrange the problem in terms of new variables $z_1,\dots,z_n$ (the selected numbers in ascending order), so that $1\leq z_1<z_2<\dots<z_n\leq U$. Then we have $(x_i,y_i)=(x_i,z_{x_i})$, as @whuber pointed out in the comments, so you are effectively regressing $z_j$ on $j$, and $r_{xy}=r_{xz}$. Thus, if we can work out the marginal distribution of $z_j$ and show that its expectation is basically linear in $j$, the problem is done, and we will have $r_{xy}\approx 1$.

We first need the joint distribution of $z_1,\dots,z_n$. This is quite simple once you have the solution, but I did not find it straightforward before doing the maths; it is a brief lesson in doing the maths paying off. So I will present the maths first, and then the easy answer.

Now, the original joint distribution is $p(y_1,\dots,y_n)\propto 1$. For discrete probabilities, changing variables simply relabels things, so the probability is still constant. However, the relabelling is not 1-to-1, so we can't simply write $p(z_1,\dots,z_n)=\frac{(U-n)!}{U!}$. Instead, we have

$$p(z_1,\dots,z_n)=\frac{1}{C}, \qquad 1\leq z_1<z_2<\dots<z_n\leq U$$

And we can find $C$ by normalisation $$C=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_2=2}^{z_3-1}\sum_{z_1=1}^{z_2-1}(1)=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_2=2}^{z_3-1}(z_2-1)$$ $$=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_3=2}^{z_4-1}\frac{(z_3-1)(z_3-2)}{2}=\sum_{z_n=n}^{U}\dots\sum_{z_4=4}^{z_5-1}\frac{(z_4-1)(z_4-2)(z_4-3)}{(2)(3)}$$ $$=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_{j}=j}^{z_{j+1}-1}{z_j-1 \choose j-1}={U \choose n}$$

This shows that the relabelling ratio is equal to $\frac{(U-n)!}{U!}{U \choose n}=\frac{1}{n!}$: for each $(z_1,\dots,z_n)$ there are $n!$ values of $(y_1,\dots,y_n)$. This makes sense, because any permutation of the labels on the $y_i$ leads to the same set of ranked $z_i$ values. Now, for the marginal distribution of $z_1$, we repeat the above but with the sum over $z_1$ dropped and a different range of summation for the remaining variables; namely, the minimums change from $(2,\dots,n)$ to $(z_1+1,\dots,z_1+n-1)$, and we get:

$$p(z_1)=\sum_{z_n=z_1+n-1}^{U}\;\;\sum_{z_{n-1}=z_1+n-2}^{z_n-1}\dots\sum_{z_2=z_1+1}^{z_3-1}p(z_1,z_2,\dots,z_n)=\frac{{U-z_1 \choose n-1}}{{U \choose n}}$$

The support is $z_1\in\{1,2,\dots,U+1-n\}$. This form, combined with a bit of intuition, shows that the marginal distribution of any $z_j$ can be reasoned out by:

  1. choosing $j-1$ values below $z_j$, which can be done in ${z_j-1\choose j-1}$ ways (if $z_j\geq j$);
  2. choosing the value $z_j$, which can be done 1 way; and
  3. choosing $n-j$ values above $z_j$, which can be done in ${U-z_j\choose n-j}$ ways (if $z_j\leq U+j-n$).

This method of reasoning generalises effortlessly to joint distributions, such as $p(z_j,z_k)$ (which can be used to calculate the expected value of the sample covariance if you want). Hence we have:

$$\begin{array}{ll}p(z_j)=\frac{{z_j-1\choose j-1}{U-z_j\choose n-j}}{{U \choose n}} & j\leq z_j\leq U+j-n \\ p(z_j,z_k)=\frac{{z_j-1\choose j-1}{z_k-z_j-1 \choose k-j-1}{U-z_k\choose n-k}}{{U \choose n}} & j\leq z_j\leq z_k+j-k\leq U+j-n \end{array}$$

Now, the marginal is the probability mass function of a negative hypergeometric distribution with parameters $k=j,r=n,N=U$ (in terms of the paper's notation). While this is clearly not exactly linear in $j$, the marginal expectation of $z_j$ is

$$E(z_j)=j\frac{U+1}{n+1}$$

This is indeed linear in $j$: from the regression you would expect a slope coefficient of $\frac{U+1}{n+1}$ and an intercept of zero.
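As a quick sanity check, the marginal pmf and this expectation formula can be verified numerically for small $U$ and $n$; a minimal R sketch (the values $U=12$, $n=4$ are arbitrary):

```r
U <- 12; n <- 4                       # arbitrary small example
pz <- function(z, j) choose(z - 1, j - 1) * choose(U - z, n - j) / choose(U, n)

z <- 1:U
for (j in 1:n) {
  p <- pz(z, j)
  # prints: j, the total probability (should be 1), E(z_j) by direct summation,
  # and the closed form j(U+1)/(n+1)
  cat(j, sum(p), sum(z * p), j * (U + 1) / (n + 1), "\n")
}
```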

UPDATE

I stopped my answer a bit short before; hopefully it is now more complete.

Letting $\overline{j}=\frac{n+1}{2}$, and $\overline{z}=\frac{1}{n}\sum_{j=1}^{n}z_j$, the expected square of the sample covariance between $j$ and $z_j$ is given by:

$$E[s_{xz}^2]=E\left[\frac{1}{n}\sum_{j=1}^{n}(j-\overline{j})(z_j-\overline{z})\right]^2$$ $$=\frac{1}{n^2}\left[\sum_{j=1}^{n}(j-\overline{j})^2E(z_j^2)+2\sum_{k=2}^{n}\sum_{j=1}^{k-1}(j-\overline{j})(k-\overline{j})E(z_jz_k)\right]$$

So we need $E(z_j^2)=V(z_j)+E(z_j)^2=Aj^2+Bj$, where $A=\frac{(U+1)(U+2)}{(n+1)(n+2)}$ and $B=\frac{(U+1)(U-n)}{(n+1)(n+2)}$ (using the formula in the pdf file). So the first sum becomes

$$\sum_{j=1}^{n}(j-\overline{j})^2E(z_j^2)=\sum_{j=1}^{n}(j^2-2j\overline{j}+\overline{j}^2)(Aj^2+Bj)$$ $$=\frac{n(n-1)(U+1)}{120}\bigg(2U(2n+1)+(3n-1)\bigg)$$

We also need $E(z_jz_k)=E[z_j(z_k-z_j)]+E(z_j^2)$.

$$E[z_j(z_k-z_j)]=\sum_{z_k=k}^{U+k-n}\sum_{z_j=j}^{z_k+j-k}z_j(z_k-z_j) p(z_j,z_k)$$ $$=j(k-j)\sum_{z_k=k}^{U+k-n}\sum_{z_j=j}^{z_k+j-k}\frac{{z_j\choose j}{z_k-z_j \choose k-j}{U-z_k\choose n-k}}{{U \choose n}}=j(k-j)\sum_{z_k=k}^{U+k-n}\frac{{z_k+1 \choose k+1}{U+1-(z_k+1)\choose n-k}}{{U \choose n}}$$ $$=j(k-j)\frac{{U+2\choose n+2}}{{U \choose n}}=j(k-j)\frac{(U+1)(U+2)}{(n+1)(n+2)}$$ $$\implies E(z_jz_k)=jk\frac{(U+1)(U+2)}{(n+1)(n+2)}+j\frac{(U+1)(U-n)}{(n+1)(n+2)}$$
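These moment formulas can be checked by brute force for small $U$ and $n$, since each of the $\binom{U}{n}$ possible samples is equally likely; a short R sketch (the particular values are arbitrary):

```r
U <- 8; n <- 4                      # arbitrary small example
A <- (U + 1) * (U + 2) / ((n + 1) * (n + 2))
B <- (U + 1) * (U - n) / ((n + 1) * (n + 2))

S <- combn(U, n)                    # each column is one sorted sample z_1 < ... < z_n
j <- 2; k <- 4                      # any pair with j < k

mean(S[j, ] * S[k, ])               # E(z_j z_k) by enumeration (all samples equally likely)
A * j * k + B * j                   # closed form above
mean(S[j, ]^2)                      # E(z_j^2) by enumeration
A * j^2 + B * j                     # closed form above
```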

And the second sum is:

$$2\sum_{k=2}^{n}\sum_{j=1}^{k-1}(j-\overline{j})(k-\overline{j})E(z_jz_k)$$ $$=\frac{n(U+1)(n-1)}{720(n+2)}\bigg((U+2)(5n^3-24n^2-35n+6)+6(U-n)(n^2-5n-4)\bigg)$$

And so, after some rather tedious manipulations, you get the following expected value for the squared covariance:

$$E[s_{xz}^2]=\frac{(U+1)(n-1)\Big(U(5n^3+6n^2-5n+6)+4n(n-2)(n+2)\Big)}{720\,n(n+2)}$$
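This closed form can also be checked exactly against a complete enumeration of all samples for small $U$ and $n$; for example:

```r
U <- 8; n <- 4                               # arbitrary small example
S <- combn(U, n)                             # all equally likely sorted samples
s_xz <- apply(S, 2, function(z) {
  j <- 1:n
  mean((j - mean(j)) * (z - mean(z)))        # sample covariance with divisor n
})
mean(s_xz^2)                                 # exact E[s_xz^2] by enumeration

(U + 1) * (n - 1) *
  (U * (5 * n^3 + 6 * n^2 - 5 * n + 6) + 4 * n * (n - 2) * (n + 2)) /
  (720 * n * (n + 2))                        # closed form above
```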

Now if we have $U\gg n$, the first term inside the large brackets dominates: it contributes $O(U^2n^2)$ to $E[s_{xz}^2]$, whereas the second term contributes only $O(Un^2)$. We can show that for large $n$ the dominant part is well approximated by $E[s_{x}^2s_{z}^2]$, and so we have another theoretical reason why the Pearson correlation is very close to $1$ (beyond the fact that $E(z_j)\propto j$).

Now the expected sample variance of $j$ is just the sample variance, which is $s_x^2=\frac{1}{n}\sum_{j=1}^{n}(j-\overline{j})^2=\frac{(n+1)(n-1)}{12}$. The expected sample variance for $z_j$ is given by:

$$E[s_z^2]=E\left[\frac{1}{n}\sum_{j=1}^{n}(z_j-\overline{z})^2\right]=\frac{1}{n}\sum_{j=1}^{n}E(z_j^2)-E(\overline{z}^2)=\frac{1}{n}\sum_{j=1}^{n}E(z_j^2)-\left(\frac{U+1}{2}\right)^2-V(\overline{z})$$ $$=\frac{A(n+1)(2n+1)}{6}+\frac{B(n+1)}{2}-\frac{(U+1)^2}{4}-\frac{(U+1)(U-n)}{12n}=\frac{(U+1)(U-1)}{12}-\frac{(U+1)(U-n)}{12n}$$ $$=\frac{(n-1)U(U+1)}{12n}$$ where $V(\overline{z})=\frac{(U+1)(U-n)}{12n}$ is the variance of the sample mean under sampling without replacement from $\{1,\dots,U\}$.

Combining everything together, and noting that $E[s_x^2s_z^2]=s_x^2E[s_z^2]$ (because $s_x^2$ is not random), we have:

$$E[s_x^2s_z^2]=\frac{(n+1)(n-1)^2U(U+1)}{144n}$$ For $U\gg n\gg 1$, both this and $E[s_{xz}^2]$ behave like $\frac{n^2U^2}{144}$, so $E[s_{xz}^2]\approx E[s_x^2s_z^2]$.

This is approximately the same thing as saying that $E[r_{xz}^2]\approx 1$.
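To see how the approximation behaves numerically, one can compare a Monte Carlo estimate of $E[r_{xz}^2]$ with the ratio of the two closed forms above (an illustrative R sketch; the values of $U$, $n$ and the number of replications are arbitrary):

```r
set.seed(1)
U <- 10000; n <- 100                         # arbitrary example values

r2_mc <- mean(replicate(2000, cor(1:n, sort(sample(1:U, n)))^2))

E_sxz2 <- (U + 1) * (n - 1) *
  (U * (5 * n^3 + 6 * n^2 - 5 * n + 6) + 4 * n * (n - 2) * (n + 2)) /
  (720 * n * (n + 2))                        # E[s_xz^2] from above
E_sx2sz2 <- (n + 1) * (n - 1)^2 * U * (U + 1) / (144 * n)   # s_x^2 * E[s_z^2] from above

c(monte_carlo = r2_mc, ratio_of_means = E_sxz2 / E_sx2sz2)
```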

probabilityislogic
  • I understand your answer, and my question is how to get the expectation of correlation coefficient from your current state. – Fan Zhang Oct 24 '11 at 05:26
  • I am sorry that I only saw the answer today. One thing I think should be clarified is why, when $E[s_x^2s_z^2] \approx E[s_{xz}^2]$, we get $E[r_{xz}^2] \approx 1$. – Fan Zhang Jan 12 '12 at 13:11
  • I think the last step is wrong. E[X/Y] is not equal to E[X]/E[Y]. – Fan Zhang Jan 13 '12 at 03:27
  • @FanZhang - The last step is *approximately* correct. This is because we can expand $g(X,Y)=\frac{X}{Y}$ to first order about $(E[X],E[Y])$. And we get $\frac{X}{Y}\approx\frac{E[X]}{E[Y]}+(X-E[X])\frac{1}{E[Y]}-(Y-E[Y])\frac{E[X]}{E[Y]^2}$. Because $E\left(X-E[X]\right)=0$ for any random variable whose expectation exists, we get $E\left(\frac{X}{Y}\right)\approx\frac{E[X]}{E[Y]}$. – probabilityislogic Jan 14 '12 at 23:39
  • Thank you. And what is this kind of approximation called? – Fan Zhang Jan 15 '12 at 05:21
  • The delta method, linearisation, taylor series approximation. There's probably other names it goes by, but this is all I can think of atm. – probabilityislogic Jan 15 '12 at 19:41
  • http://en.wikipedia.org/wiki/Taylor_expansions_for_the_moments_of_functions_of_random_variables Please see this page, is it possible to get $E[\frac{X}{Y}] \approx E[X]/E[Y] - cov[X,Y]/E[Y]^2 + E[X]/E[Y]^3 var[Y]$ – Fan Zhang Jan 16 '12 at 03:58
  • where X is $s_{xz}^2$ and Y is $s_x^2s_z^2$ – Fan Zhang Jan 16 '12 at 03:59
  • This is a second order Taylor series expansion (hence there are two extra terms). Note that it is not always an improvement over the first order. But in any event, we already know that $E[Y]\sim O(U^2)$ so $\frac{1}{E[Y]^2}\sim O(U^{-4})$. You also have $cov(X,Y)\sim O(U^4)$ as it involves 4th order expectations of $z_j$. Similarly $var[Y]\sim O(U^4)$ and $\frac{E[X]}{E[Y]^3}\sim O(U^{-4})$, and this means that the adjustment to the 1st order approximation is likely to be $E\left[\frac{X}{Y}\right]\approx\frac{E[X]}{E[Y]}-O(1)+O(1)\approx \frac{E[X]}{E[Y]}$ – probabilityislogic Jan 17 '12 at 12:17
  • The second order approximation is much more involved, and will most likely only lead to a different form of the "n" parts of the expectation, which are negligible when $U>>n$ – probabilityislogic Jan 17 '12 at 12:19
  • I see. Currently, I have another question. Would you please have a look? I put it at mathoverflow, but no one know the result. http://mathoverflow.net/questions/85409/problem-about-expectation-of-maximum-partial-sum – Fan Zhang Jan 17 '12 at 20:21