How to infer correlation of population from correlation of samples without assuming normality?

Question

There is a very large number of data points ($\sim 10^{100}$), which form a discrete joint distribution $(X,Y)$, where $X,Y$ are discrete random variables. Note that we have no knowledge of these distributions. We randomly sample a small number $n$ of data points and calculate the Pearson correlation coefficient (PCC) $\rho_s$ of the sample. How can I then infer the PCC $\rho_p$ of the population? In general, what is the relation between $\rho_s$, $\rho_p$, and $n$? If I sample, say, $100n$ data points, does the sample have a PCC close to $\rho_p$? We cannot assume that any of the above distributions is normal. Can we say anything interesting if we assume $n$ to be less than 20 (or, failing that, less than 50)?

Edit: If you think the PCC is not very useful when normality of the distribution cannot be assumed, you can use other quantities such as Spearman's rank correlation instead.

Is there a specific reason you're asking about sampling from a large finite population that is itself drawn from $(X, Y)$, instead of sampling from $(X, Y)$ directly? Also, is the sample size $n$ or $100n$? Also, are you trying to say that $(X, Y)$ **might not** be bivariate normal, or that it **must not** be bivariate normal? — Kodiologist, Sep 11 '17 at 20:52
There is a very large number of data points with x- and y-components. The distribution of these data is unknown, and hence the component-wise distributions of $X$ and $Y$ are also unknown. In particular, $X,Y,(X,Y)$ are not necessarily normal, but it's possible that they are. I want to know how the PCC compares between two cases, where the sample size is $n$ and $100n$, respectively. — Math.StackExchange, Sep 12 '17 at 04:32
Why do you use $\sigma$ for Pearson correlation in population? This is normally used for standard deviation and could be confusing. — Richard Hardy, Sep 12 '17 at 05:38
I'm sorry for the confusion. As I'm from a different field, I'm unfamiliar with the conventions used in statistics. I changed the notation. — Math.StackExchange, Sep 12 '17 at 05:54

score 1 · Accepted Answer · answered Oct 07 '18 at 14:50

Your question is not entirely clear, but: suppose $n$ is reasonably large (and in your case there is no need to choose a small $n$, so think of at least $n=1000$ or $n=10000$). In that case the sample Pearson correlation should be approximately unbiased, unless the data distribution is very different from Gaussian. But you don't need to assume that; you can estimate the bias using bootstrapping.
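As a rough illustration of the bootstrap idea, here is a minimal Python sketch using NumPy. The helper names (`pearson`, `bootstrap_bias`), the number of resamples, and the synthetic Poisson data are all assumptions made for this example; in practice `x` and `y` would be your own sampled pairs. The sketch resamples the observed pairs with replacement and compares the average resampled correlation with the original sample correlation.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def bootstrap_bias(x, y, n_boot=2000, seed=0):
    """Bootstrap estimate of the bias and standard error of the sample correlation.

    Hypothetical helper for illustration; n_boot and seed are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r_hat = pearson(x, y)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # Resample (x, y) pairs together, with replacement.
        idx = rng.integers(0, n, size=n)
        reps[b] = pearson(x[idx], y[idx])
    bias = reps.mean() - r_hat   # bootstrap bias estimate
    se = reps.std(ddof=1)        # bootstrap standard error
    return r_hat, bias, se

# Synthetic stand-in for a sample from an unknown discrete, non-normal population.
rng = np.random.default_rng(1)
x = rng.poisson(3.0, size=1000)
y = x + rng.poisson(2.0, size=1000)

r_hat, bias, se = bootstrap_bias(x, y)
print(f"sample r = {r_hat:.4f}, bootstrap bias = {bias:+.5f}, bootstrap SE = {se:.4f}")
```

If the estimated bias is small compared with the bootstrap standard error, the bias is unlikely to be the main concern at that sample size.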

Then, for sample size $100 n$, the bias should be smaller, but how much smaller? You could just investigate that also with the bootstrap.
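To get a feel for how much the sample size matters, one option is a small Monte Carlo experiment on a synthetic population whose correlation is known. This is not the bootstrap itself, only a sketch under stated assumptions: the Poisson-based population, the function names `draw_sample` and `correlation_spread`, and the choice $n = 20$ are all made up for illustration. For distributions with finite fourth moments, the spread of the sample correlation is expected to shrink roughly like $1/\sqrt{n}$, so going from $n$ to $100n$ should reduce it by about a factor of 10.

```python
import numpy as np

def draw_sample(rng, size):
    """Draw `size` pairs from a synthetic discrete, non-normal joint distribution."""
    x = rng.poisson(3.0, size=size)
    y = x + rng.poisson(2.0, size=size)
    return x, y

def correlation_spread(size, n_rep=500, seed=0):
    """Mean and standard deviation of the sample correlation over repeated samples."""
    rng = np.random.default_rng(seed)
    rs = np.empty(n_rep)
    for i in range(n_rep):
        x, y = draw_sample(rng, size)
        rs[i] = np.corrcoef(x, y)[0, 1]
    return rs.mean(), rs.std(ddof=1)

# For this synthetic population: Var(X) = 3, Var(Y) = 5, Cov(X, Y) = 3.
rho_p = 3.0 / np.sqrt(3.0 * 5.0)

n = 20
mean_small, sd_small = correlation_spread(n)
mean_large, sd_large = correlation_spread(100 * n)

print(f"population rho = {rho_p:.4f}")
print(f"n = {n}:   mean r = {mean_small:.4f}, sd = {sd_small:.4f}")
print(f"n = {100 * n}: mean r = {mean_large:.4f}, sd = {sd_large:.4f}")
```

With real data the population correlation is unknown, so the same comparison would have to be made by bootstrapping samples of each size rather than by redrawing from a known population.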
