The Pearson population correlation (rho) measures the linear dependence between two variables (x and y). It’s also known as a parametric correlation test because it depends on the distribution of the data. I have also come across the term "sample correlation" (r). My question: do these terms differ, and in what sense? What are the formulas?
-
Could you please provide a link to the formula for “sample correlation”? – EdM Aug 02 '20 at 22:02
-
See related https://stats.stackexchange.com/q/148346/3277 – ttnphns Aug 03 '20 at 17:18
-
That talks as if r-squared and rho (the population correlation) are similarly situated. Pearson r, to me, indicates a distance/d-statistic. Thanks for the link. – Aug 04 '20 at 00:56
1 Answer
The Pearson correlation can be calculated, with the same formula, on any dataset of paired numeric values or other types of values that might be considered numeric (e.g., ordinal). As it's calculated from a sample drawn from some underlying population it is a sample correlation coefficient, often denoted $r_{xy}$. That's distinguished from the "true" population correlation coefficient, often denoted $\rho_{xy}$. The sample value $r_{xy}$ is an estimate of the population value $\rho_{xy}$, similar to the way that a sample mean $\bar x$ is an estimate of a population mean $\mu_x$.
The multiple ways to write the formula for a Pearson correlation can lead to some confusion. All the formulas for the sample estimates are related to corresponding formulas for the population value. A combination of formulas culled from this page and the Wikipedia page makes this clear.
For the population correlation coefficient some formulas are:
$$\rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}=\frac{\operatorname{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\sigma_Y}=\frac{\operatorname{E}[\,X\,Y\,]-\operatorname{E}[\,X\,]\operatorname{E}[\,Y\,]}{\sqrt{\operatorname{E}[\,X^2\,]-\left(\operatorname{E}[\,X\,] \right)^2} ~ \sqrt{\operatorname{E}[\,Y^2\,]- \left(\operatorname{E}[\,Y\,] \right)^2}}.$$
Correspondingly, for the sample correlation coefficient we have:
$$r_{xy}= \frac{s_{xy}}{s_x s_y} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}} =\frac{n\sum x_iy_i-\sum x_i\sum y_i} {\sqrt{n\sum x_i^2-(\sum x_i)^2}~\sqrt{n\sum y_i^2-(\sum y_i)^2}},$$
where $s_{xy}$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations.
That last form is the one you cite from byjus.com in a comment. Its relationship to the corresponding formula for the population coefficient becomes even clearer if you divide its numerator and denominator by $n^2$.
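As a rough numerical check of the equivalence among these forms, here is a minimal Python sketch using only the standard library. The function names and the small data set are made up for illustration; the third function is the raw-sums form after dividing through by $n^2$, so that sample averages stand in for the expectations in the population formula.

```python
import math

def r_from_deviations(x, y):
    """Sample Pearson r from sums of centered cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

def r_from_raw_sums(x, y):
    """Sample Pearson r from the raw-sums form (the byjus.com layout)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return num / den

def r_from_sample_moments(x, y):
    """Raw-sums form divided by n^2: averages replace E[.] in the population formula."""
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    ex2 = sum(a * a for a in x) / n
    ey2 = sum(b * b for b in y) / n
    return (exy - ex * ey) / (math.sqrt(ex2 - ex ** 2) * math.sqrt(ey2 - ey ** 2))

# Illustrative data only (not from the question).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

r1 = r_from_deviations(x, y)
r2 = r_from_raw_sums(x, y)
r3 = r_from_sample_moments(x, y)
print(r1, r2, r3)  # the three values should agree up to rounding error
```

The raw-sums form is convenient by hand but can lose precision in floating point when the means are large relative to the spread, so the centered form is usually the safer one to code.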
Although there are non-linear factors involved in the calculations, both the sample and the population coefficients solely capture the linear association between $x$ and $y$, not any nonlinear relationships. See the figures on the Wikipedia page for examples.
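A small sketch can make the point concrete. Assuming made-up data in which $y$ is either an exact linear function or an exact quadratic function of $x$ (with $x$ symmetric about 0), the sample correlation is essentially 1 in the first case and essentially 0 in the second, even though $y$ is fully determined by $x$ both times. The helper name `pearson_r` is just for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# x values symmetric about 0 (illustrative data, not from the question).
x = [i / 10 for i in range(-10, 11)]

y_linear = [2 * a + 1 for a in x]   # exact linear dependence
y_quadratic = [a * a for a in x]    # exact, but purely nonlinear, dependence

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)
print(r_linear, r_quadratic)  # close to 1 and close to 0, respectively
```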
A significance test on the sample Pearson correlation typically estimates the probability that you would get a value at least that large in magnitude if the true correlation within the underlying population were actually 0. The assumption behind some forms of that test is that the data have a bivariate normal distribution. Otherwise you can still calculate a sample correlation that way, but inference about significance relies on methods like bootstrapping. Even then special care must be taken to avoid the problems that arise from the bias and skew in sample estimates of correlation coefficients.
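For illustration, here is a minimal sketch of the simplest bootstrap approach, a percentile interval obtained by resampling $(x_i, y_i)$ pairs with replacement. The simulated data, the seeds, the number of resamples, and the function names are all assumptions made for the example. As noted above, the plain percentile interval does not correct for bias or skew, so treat this as a starting point rather than a recommended procedure.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_percentile_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # r is undefined if a resample has no variation; skip such resamples.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi

# Simulated data with a true correlation of 0.8 (assumed for the example).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(50)]
y = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x]

r_hat = pearson_r(x, y)
lo, hi = bootstrap_percentile_ci(x, y)
print(r_hat, (lo, hi))
```

Resampling whole pairs, rather than $x$ and $y$ separately, is what preserves the dependence between the two variables in each bootstrap sample.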
Spearman and Kendall rank-based correlation coefficients are also available, with associated tests that do not depend on an assumption of bivariate normality. All that matters for those tests is the relative ranks of the observations within each of the two variables.
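To show what "only the ranks matter" means in practice, here is a minimal sketch that computes Spearman's coefficient as the Pearson correlation of the ranks, with tied values given the average of their ranks. The function names and data are made up for the example. With a monotone but nonlinear relationship such as $y = e^x$, the rank-based coefficient is 1 while the Pearson coefficient falls below 1.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotone but nonlinear relationship (illustrative data).
x = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
y = [math.exp(a) for a in x]

rho_s = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho_s, r)  # the rank-based value is 1; the Pearson value is lower
```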

-
Generally, it is understood that correlation is a measure of the degree of linear association between a pair of variables. Karl Pearson, it seems, follows a covariance approach that would invoke interval estimation or vectors of x and y. The standardized coefficient, covariance ÷ ((s.d. of x) × (s.d. of y)), produces a linear correlation coefficient. It does not generate a sample correlation in this sense. – Aug 02 '20 at 21:23
-
The sample correlation formula picks up nonlinear association that is actually representing a relationship of cause and effect. – Aug 02 '20 at 21:30
-
The sample correlation formula removes something like common association/linearity and then standardizes the net covariance to assess the correlation coefficient. – Aug 02 '20 at 21:38
-
Would you please help me understand/refute my assertions that perturb my mind ? – Aug 02 '20 at 21:40
-
byjus.com Pearson’s Correlation Coefficient Formula: Also known as bivariate correlation, the Pearson’s correlation coefficient formula is the most widely used correlation method among all the sciences. The correlation coefficient is denoted by “r”. To find r, let us suppose the two variables as x & y, then the correlation coefficient r is calculated as: $r=\frac{n(\sum xy)-(\sum x)(\sum y)}{\sqrt{[n\sum x^2-(\sum x)^2][n\sum y^2-(\sum y)^2]}}$ Notations: – Aug 02 '20 at 22:38
-
@SubhashC.Davar the [Wikipedia page](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) shows formulas for the Pearson correlation coefficient. The definition in terms of population covariances and variances is for the underlying population. The formula for the sample correlation coefficient is the estimate of the population value derived from a particular data sample. The relationship is like that between a population mean $\mu$ and a sample mean $\bar x$. Although the formulas involve non-linear terms and factors, they represent linear relationships between $x$ and $y$. – EdM Aug 03 '20 at 02:37
-
@SubhashC.Davar also see [this answer](https://stats.stackexchange.com/a/104577/28500) with several equivalent formulas for sample correlation coefficients. It can be shown that the first formula there is the same as what's in your most recent comment. The second formula is the sample version directly comparable to the population version you showed in your first comment. – EdM Aug 03 '20 at 02:59
-
The linear correlation coefficient formula is different from the (sample) correlation coefficient. Please refer to byjus.com. Also, please see the edit to my question. – Aug 03 '20 at 04:25
-
@SubhashC.Davar I hope that the several formulas I added to the answer make the close relationship between the population and sample correlation coefficients clearer for you. – EdM Aug 03 '20 at 14:57
-
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/111367/discussion-between-subhash-c-davar-and-edm). – Aug 03 '20 at 15:20