Relationship between correlation and sample variance

Question

A correlation coefficient is lower if there's a low variance in the characteristic of the sample.

For example, the correlation between IQ and school achievement follows this pattern. The correlation is lower if you only include students with similar school achievement. It gets higher if you include students from very different school types.

Is there a (mathematical) explanation?

See the third illustration in the thread on [R^2](http://stats.stackexchange.com/a/13317), Julia: It slices a scatterplot (having high correlation) into skinny vertical pieces, each of which clearly has low correlation. The mathematics will just describe what your eye clearly sees in this picture. — whuber, Feb 15 '12 at 15:23

score 3 · Answer 1 · answered Feb 15 '12 at 15:11

This can be explained as follows. Mathematically, given two variables X and Y, their correlation is defined as

Cov(X, Y) / (SD(X) * SD(Y)),

where SD denotes the standard deviation.

In other words, for fixed standard deviations the correlation is proportional to the covariance of the two variables. The divisor acts as a scaling factor on the covariance, so that the resulting correlation lies between -1 and +1.
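As a quick check of the definition, here is a minimal sketch that computes the correlation by hand as covariance over the product of standard deviations and compares it with NumPy's built-in. The data are simulated; the "IQ" and "achievement" scales, the slope and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: IQ-like scores and a noisy linear "achievement" score.
x = rng.normal(100, 15, size=1000)
y = 0.5 * x + rng.normal(0, 10, size=1000)

# Correlation by the definition: Cov(X, Y) / (SD(X) * SD(Y)).
# The same ddof must be used for the covariance and both standard deviations.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's Pearson correlation, for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two numbers should agree up to floating-point error, since `np.corrcoef` implements the same ratio.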

So, all other things being equal, reducing the covariance will reduce the correlation. The effect of having similar school achievement is to reduce the covariance between IQ and school achievement. For example, given a wide range of IQs, if school achievement is similar, then school achievement doesn't co-vary with IQ. The relationship between achievement and IQ then looks relatively random, and the correlation is close to zero, indicating (relatively speaking) no relationship.

On the other hand, given a wide range of IQs, if school achievement is also spread over a wide range, then the correlation can still take any value between -1 (a negative relationship) and +1 (a positive relationship), including 0 (indicating no relationship).

Getting back to your question, it is the reduction in covariance that is important here rather than the reduction in variance.
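The restriction effect can be seen in a small simulation. The sketch below is only an illustration under assumed values: a population correlation of 0.6, standardized achievement scores, and a slice that keeps students within one third of a standard deviation of the mean achievement. None of these numbers come from real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical population: IQ ~ N(100, 15), achievement standardized,
# constructed so that the population correlation is 0.6.
iq = rng.normal(100, 15, size=n)
achievement = 0.6 * (iq - 100) / 15 + rng.normal(0, 0.8, size=n)

# Correlation over the full range of achievement.
r_full = np.corrcoef(iq, achievement)[0, 1]

# Keep only students with similar achievement (a thin slice of the scatterplot).
similar = np.abs(achievement) <= 1 / 3
r_slice = np.corrcoef(iq[similar], achievement[similar])[0, 1]

print(f"full sample:  r = {r_full:.3f}")
print(f"thin slice:   r = {r_slice:.3f}")
```

Within the slice, the part of the variation that the two variables share shrinks much faster than the unexplained scatter does, so the ratio that defines the correlation drops.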

score 1 · Answer 2 · answered Jul 24 '19 at 01:25

Yes, there is a mathematical, although rather conceptual, explanation. I was puzzled by the same question until now.

First, why we were puzzled:

1) If you are calculating the correlation coefficient in a sample with lower variance (e.g. all with similar school achievement) BUT which truly and perfectly represents the larger population to which it belongs (the city's population, which has higher variance), the correlation coefficients should be very similar, since covariance and SD will change together. This could hold for simulated data.

2) Real samples almost never represent the population perfectly, so the sample's correlation coefficient can be either higher or lower than the population's, depending on which section of the population you selected (that is, if the population correlation coefficient is not perfect, i.e. less than 1 in absolute value, of course). However, the overwhelming tendency is for the lower-variance sample's coefficient to be lower than that of the higher-variance population (or of another same-sized sample with higher variance). Why???

My opinion (and answer): noise.

Every measuring tool has a degree of error and a degree of precision. Measurement error explains the reduced coefficients in a thin slice of scale/continuous data mentioned before. While the absolute size of the error is always the same, its relative size increases as you "zoom in". The "shrinking variance" will approach the size of the error itself, thus increasing the contribution of noise and decreasing the measured correlation (not the true correlation!), even if everything else is controlled for. Blunt instruments such as questionnaires suffer more from imprecision: a measured category, post-graduation for example, is too coarse, represents a wide variety of achievements, and might have blurred boundaries (is any course taken after graduation a post-graduation course?). Plus, and very frequently, people use Pearson's correlation coefficient to measure those relationships, which is inappropriate for ordinal data and further contributes to the dampening of coefficients in the face of lower variance.
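A rough way to see the noise argument is to simulate it. In the sketch below the true values are perfectly correlated by construction, both are measured with the same fixed error, and the sample is narrowed step by step. The error size (0.3 SD), the slice widths, and the choice to select on the true value rather than the measured one are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# True scores: perfectly correlated by construction (true r = 1 in any slice).
true_x = rng.normal(0, 1, size=n)
true_y = true_x.copy()

# Measured scores: the same absolute measurement error everywhere.
noise_sd = 0.3
meas_x = true_x + rng.normal(0, noise_sd, size=n)
meas_y = true_y + rng.normal(0, noise_sd, size=n)


def measured_r(half_width):
    """Measured correlation among cases whose true score lies within +/- half_width."""
    keep = np.abs(true_x) <= half_width
    return np.corrcoef(meas_x[keep], meas_y[keep])[0, 1]


r_wide = measured_r(3.0)    # nearly the whole population
r_medium = measured_r(1.0)  # within one SD of the mean
r_narrow = measured_r(0.2)  # a thin slice

print(r_wide, r_medium, r_narrow)
```

As the slice gets thinner, the spread of true scores approaches the size of the measurement error, and the measured correlation falls even though the true correlation inside every slice is still 1.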

score 0 · Answer 3 · answered Feb 15 '12 at 15:50

The correlation coefficient (I frequently use the intraclass as a measure of test-retest reliability) is often defined as the ratio of between-subject variation to the total variation (between-subject + within-subject). If the between-subject variation is high (e.g., persons with very different school types) compared to the within-subject variation, then the correlation coefficient would be high.
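The ratio described here can be written out in a few lines. This is only a sketch of the variance-ratio form of the intraclass correlation, with made-up variance components; the function name `icc` and the numbers are hypothetical, and estimating the components from real data would need an ANOVA or mixed model that is not shown.

```python
def icc(between_var, within_var):
    """Intraclass correlation as between-subject variance over total variance."""
    return between_var / (between_var + within_var)


# Same within-subject (error) variance, different between-subject spread.
within_var = 1.0
icc_similar = icc(0.25, within_var)  # homogeneous group of subjects
icc_diverse = icc(4.0, within_var)   # heterogeneous group of subjects

print(icc_similar, icc_diverse)  # 0.2 0.8
```

With the within-subject variance held fixed, widening the spread between subjects is enough to raise the coefficient.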

William Whitworth

This is rather confusing and is not correct. "The ratio of between-subject variation to the total variation" is, in fact, the square of the correlation. It is always non-negative, whereas correlation can be negative. You are also using terms related to ANOVA, i.e. between-subject and within-subject. The question relates to two variables and the relationship between them. – martino Feb 15 '12 at 16:34
You are right to direct our attention to possible points of confusion, martino, but you may also be reading the question too narrowly. The use of just a single independent variable is offered only as an example; the question makes sense for multivariate regression, too. ANOVA is (or can be) considered a particular case of regression. Finally, the sense of "lower" in the question really is "smaller in absolute value," so arguably the squared correlation coefficient is a better target for our conversation than the correlation coefficient itself! – whuber Feb 16 '12 at 15:24