Why is it not OK to do a Pearson correlation on proportion data?

Question

An online module I am studying states that one should never use Pearson correlation with proportion data. Why not?

Or, if it is sometimes OK or always OK, why?

What says this, and in what context? "Never" seems far too strong unless they're talking about some very limited situation. It may be that whoever wrote it is simply wrong, but without *context* how are we to guess? — Glen_b, Mar 31 '14 at 04:04
The online module is proprietary and I can't link it. However, I have found a video that states the same thing: http://australianbioinformatics.net/the-pipeline/2013/3/19/dont-correlate-proportions.html. Both the module I have seen and this video indicate that there are no contexts in which correlating proportions is acceptable. — user1205901 - Reinstate Monica, Mar 31 '14 at 04:05
"Never" is too strong. There are reasons to be cautious about interpreting correlation coefficients involving proportions, especially those based on small counts. But the same analysis supporting those reasons also shows that when proportions are based on large counts and the proportions are "sufficiently far" from $0$ or $1$, then the correlation coefficients are not problematic. Furthermore, one can *always* report a correlation coefficient for any set of paired data (where both components exhibit variation) as a *summary* (descriptive) statistic. — whuber, Apr 05 '14 at 00:45

score 11 · Answer 1 · edited Apr 13 '17 at 12:44

The video linked in your comment sets the context to that of compositions, which may also be called mixtures. In these cases, the proportions of the components sum to 1. For example, air is 78% nitrogen, 21% oxygen, and 1% other (total is 100%). Given that the amount of one component is completely determined by the others, the components jointly satisfy an exact linear relationship (perfect multicollinearity). For the air example, we have:

$x_{1} + x_{2} + x_{3} = 1$

so then:

$x_{1} = 1 - x_{2} - x_{3} $

$x_{2} = 1 - x_{1} - x_{3}$

$x_{3} = 1 - x_{1} - x_{2}$

So if you know any two components, the third is immediately known.

In general, the constraint on mixtures is

$\sum_{i=1}^{q} x_{i} = 1$

This constraint makes the levels of the factors $x_{i}$ non-independent.

You can compute a correlation between two components, but it is not informative on its own, because the constraint itself induces correlation among the components. You can read more about compositional analysis in Analysing data measured as proportional composition.
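
One way to see why the constraint forces correlation is a short derivation (a sketch that is not part of the original answer; it assumes only that each $x_{i}$ has finite variance). Because the sum is constant, its covariance with any single component is zero:

$\operatorname{Cov}\left(x_{i}, \sum_{j=1}^{q} x_{j}\right) = \operatorname{Var}(x_{i}) + \sum_{j \neq i} \operatorname{Cov}(x_{i}, x_{j}) = 0$

so that:

$\sum_{j \neq i} \operatorname{Cov}(x_{i}, x_{j}) = -\operatorname{Var}(x_{i})$

Every component that varies at all must therefore have a negative covariance with at least one other component, whatever the underlying process. With $q = 2$ this gives $\operatorname{Cov}(x_{1}, x_{2}) = -\operatorname{Var}(x_{1}) = -\operatorname{Var}(x_{2})$, hence a correlation of exactly $-1$.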

You can use correlation when the proportion data are from different domains. Say your response is fraction of dead pixels on an LCD screen. You could try to correlate this to, say, the fraction of helium used in a chemical processing step of the screen.

edited Apr 13 '17 at 12:44 by Community

answered Mar 31 '14 at 05:16 by blackeneth

I see - I had mistakenly thought that the compositions were just an example. Is it thus fair to say that correlating proportions is generally unproblematic unless you've got a situation in which compositions 'force' a correlation to exist? – user1205901 - Reinstate Monica Mar 31 '14 at 05:40
`Given that the amount of one component is completely determined by the others, any two components will have a perfect co-linear relationship` is not clear. Can you expand it? – ttnphns Mar 31 '14 at 05:48
I also do not understand this answer. In your 3-variable example, each is "determined" by TWO others, but the Pearson correlation only analyzes one variable in relation to ONE other. So, e.g, if looking at nitrogen vs. oxygen you could have a (nitrogen, oxygen) data set [ (0.78, 0.21), (0.20, 0.41), (0.44, 0.44) ], and you could do a valid correlation coefficient calculation on that data (and it's certainly not co-linear). The Pearson correlation coefficient does not know or care about "other" there... – Jason C Mar 31 '14 at 07:50
As a kind of meta-comment, I would not expect to see inaccessible material cited as authority for any statistical point, not that you are proposing to do that. So, it's simple at one level: there is a literature on compositional data analysis, which is where to look; I am not an expert, so I can't say what's most authoritative on correlation, but my instinct is that the warning is exaggerated. Descriptive use of correlation can be helpful. It is just that inferences are complicated by the constraint on totals. – Nick Cox Mar 31 '14 at 08:12
I think the "fraction of dead pixels" would be fine if we were gathering measurements from LCD screens that have the same number of pixels and the gas pressure in the process remained constant. But once you start allowing the denominators of these proportions to change, who can say what the effect of helium is? – David Lovell May 17 '15 at 05:26

ttnphns · Accepted Answer · 2014-03-31T11:21:21.457

7

This applies to the case where several variables sum to 1 within each observation. My answer will be intuition-level; this is intentional (and also, I'm not an expert in compositional data).

Let us have i.i.d. (hence zero-correlated) positive-valued variables which we then sum up and recompute as proportions of that sum. Then,

In the case of two variables V1 V2, if V1 is said to vary freely then V2 has no room for freedom (since V1+V2=constant) and is fully fixed; the greater V1 is, the lesser V2 is, and vice versa. Their correlation is exactly $-1$, always.
In the case of 3 variables V1 V2 V3, if V1 is said to vary freely then V2+V3 is fixed; which is to say that inside (V2+V3) each of the two variables is still partly free: they are on average $1/2$ fixed each, fully fixed in total. So, if any one of the three variables is taken as free (like we took V1), either of the remaining two is expected to be $1/2$ fixed, so that the correlation between them is $-0.5$. This is the expected correlation; it may vary from sample to sample.
In the case of 4 variables V1 V2 V3 V4, by the same reasoning, if we take any one of the four as free then any one of the remaining three is expected to be $1/3$ fixed; so the expected correlation between any pair of the four (one as free, the other as $1/3$ fixed) is $-1/3 \approx -0.333$.
As the number of (initially i.i.d.) variables grows, the expected pairwise correlation grows from negative towards $0$, and its variation from sample to sample becomes larger.
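
The pattern above can be checked with a small simulation. This is a minimal sketch, not code from the answer: it assumes exponential draws for the i.i.d. positive variables (an arbitrary choice), and `pearson` and `closed_correlation` are hypothetical helper names. Because the resulting proportions are exchangeable and sum to 1, the population correlation between any two of $q$ of them is $-1/(q-1)$, which the simulated values should approximate.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def closed_correlation(q, n_obs=20000, seed=0):
    """Correlation between the first two of q proportions.

    Each observation draws q i.i.d. exponential values (an assumed,
    arbitrary positive distribution) and divides them by their sum.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_obs):
        raw = [rng.expovariate(1.0) for _ in range(q)]
        total = sum(raw)
        first.append(raw[0] / total)
        second.append(raw[1] / total)
    return pearson(first, second)


# Print q, the simulated correlation, and the theoretical -1/(q-1).
for q in (2, 3, 4, 10):
    print(q, round(closed_correlation(q), 3), round(-1 / (q - 1), 3))
```

The simulated column should sit close to the theoretical one, moving from $-1$ towards $0$ as $q$ grows; the exact digits depend on the seed and sample size.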

edited Mar 31 '14 at 11:21

answered Mar 31 '14 at 10:33 by ttnphns

OK, but I guess the interest is in pairs V1, V2, each V summing to 1 ( 100%), but no constraint on individual V except each being a fraction. – Nick Cox Mar 31 '14 at 10:57
`each V summing to 1 ( 100%)` Excuse me? I didn't understand you. I put no constraint on individual V, only being a fraction. However, the initial constraint was that my example assumes zero correlations prior to turning Vs into fractions. – ttnphns Mar 31 '14 at 11:12
Did you mean that each V has values summing to 1 ("vertically")? No, I meant "horizontally", across variables. But unfortunately the OP didn't elucidate the point in their question. So I took it as I took it. – ttnphns Mar 31 '14 at 11:18
Yes; that is I think what is usually meant here, but the question is not especially clear. – Nick Cox Mar 31 '14 at 11:54
@ttnphns I saw a statement that one should never do a Pearson correlation on two variables measured as proportions. I've tried to make this clearer by editing the OP to highlight the word 'never'. The video makes the same statement in its title ("Don't correlate proportions!"), though they only discuss this in the context of compositional data. I deliberately left the context undefined because my source stated that Pearson correlations should not be used on proportion data in any context. However, it seems the answer to my question is: "Correlating proportions is fine, except in some contexts." – user1205901 - Reinstate Monica Mar 31 '14 at 23:48
@user1205901, I don't know if my answer answered your question at any point, but if it did, then the moral is this: with data as "horizontal" proportions, the baseline "level of uncorrelatedness" is not 0 but lower. With 3 variables, it is -0.5; so if your two selected (out of the 3) variables correlate as -0.3, this corresponds to r=+0.2 for usual, non-proportion data. – ttnphns Apr 01 '14 at 06:55

David Lovell · Answer 3 · 2015-05-17T05:28:41.177

This is a deep question, and one with some subtleties that need to be stated. I'll try my best, but even though I've published on this topic (Proportionality: A Valid Alternative to Correlation for Relative Data) I'm always prepared to be surprised by new insights on the analysis of data containing only relative information.

As contributors to this thread have pointed out, correlation is notorious (in some circles) for being meaningless when applied to the compositional data that arise when a set of components is constrained to add up to a constant (as we see with proportions, percentages, parts-per-million, etc.).

Karl Pearson coined the term spurious correlation with this in mind. (Note: Tyler Vigen's popular Spurious Correlation site is not so much about spurious correlation as the "correlation implies causation" fallacy.)

Section 1.7 of Aitchison's (2003) A Concise Guide to Compositional Data Analysis provides a classic illustration of why correlation is an inappropriate measure of association for compositional data (for convenience, quoted in this Supplementary Information).

Compositional data arise not only when a set of non-negative components are made to sum to a constant; data are said to be compositional whenever they carry only relative information.

I think the main problem with the correlation of data that carry only relative information is in the interpretation of the result. This is an issue that we can illustrate with a single variable; let's say "donuts produced per dollar of GDP" across the nations of the world. If one nation's value is higher than another, is that because

their donut production is higher?
their GDP is lower?

...who can say?

Of course, as people remark on this thread, one can calculate correlations of these sorts of variables as a descriptive statistic. But what do such correlations mean?

score 3 · Answer 4 · edited Jan 24 '15 at 23:15

I had the same question. I found this reference on bioRxiv useful:

Lovell D., V. Pawlowsky-Glahn, J. Egozcue, S. Marguerat, J. Bähler (2014),
"Proportionality: a valid alternative to correlation for relative data"

In the supporting information of this paper (Lovell, David, et al.; doi: dx.doi.org/10.1101/008417), the authors mention that correlations between relative abundances do not provide any information in some cases. They give an example of the relative abundances of two mRNA expressions. In Figure S2, the relative abundances of the two different mRNAs are perfectly negatively correlated, even though the correlation of these two mRNAs in absolute values is not negative (green points and purple points).
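
The effect described for Figure S2 can be reproduced in a toy setting. The sketch below does not use the paper's data: the abundances are made-up values, `pearson` is a hypothetical helper, and it assumes the simplest case in which the two mRNAs alone make up the total, so each relative abundance is one minus the other.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)

# Hypothetical absolute abundances: both mRNAs scale with a shared factor,
# so their absolute values are positively correlated.
shared = [rng.uniform(1.0, 10.0) for _ in range(200)]
mrna_a = [s * rng.uniform(0.8, 1.2) for s in shared]
mrna_b = [2.0 * s * rng.uniform(0.8, 1.2) for s in shared]

# Relative abundances: each mRNA as a share of the two-part total.
rel_a = [a / (a + b) for a, b in zip(mrna_a, mrna_b)]
rel_b = [b / (a + b) for a, b in zip(mrna_a, mrna_b)]

r_abs = pearson(mrna_a, mrna_b)  # positive by construction
r_rel = pearson(rel_a, rel_b)    # -1 up to rounding, since rel_b = 1 - rel_a
print(round(r_abs, 3), round(r_rel, 3))
```

The correlation of the relative abundances is fixed at $-1$ by the two-part closure alone, so it says nothing about how the absolute abundances move together.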

Maybe it could help you.

edited Jan 24 '15 at 23:15 by Glen_b

answered Jan 24 '15 at 09:31 by sue

Thanks for your suggestion. I didn't make it clear. In the supporting information of this paper (Lovell, David, et al.; doi: http://dx.doi.org/10.1101/008417), the authors mention that correlations between relative abundances do not provide any information in some cases. They give an example of the relative abundances of two mRNA expressions. In Figure S2, the relative abundances of the two different mRNAs are perfectly negatively correlated, even though the correlation of these two mRNAs in absolute values is not negative (green points and purple points). – sue Jan 24 '15 at 12:16
@shu maybe you could say *why* this article has helped you with a similar problem and summarize it..? Pasting a link is *not* an answer, so please elaborate a little bit more. The reason for that is also because links die, and if you want your answer to be helpful for someone in the future you should make it self-contained. Of course providing references *in addition* to your answer is a good habit. – Tim Jan 24 '15 at 21:08
