An online module I am studying states that one should never use Pearson correlation with proportion data. Why not?
Or, if it is sometimes OK or always OK, why?
The video linked in your comment sets the context to that of compositions, which may also be called mixtures. In these cases, the proportions of the components add up to 1. For example, air is 78% nitrogen, 21% oxygen, and 1% other (total is 100%). Given that the amount of one component is completely determined by the others, the components taken together are perfectly collinear. For the air example, we have:
$x_{1} + x_{2} + x_{3} = 1$
so then:
$x_{1} = 1 - x_{2} - x_{3} $
$x_{2} = 1 - x_{1} - x_{3}$
$x_{3} = 1 - x_{1} - x_{2}$
So if you know any two components, the third is immediately known.
In general, the constraint on mixtures is
$\sum_{i=1}^{q} x_{i} = 1$
This constraint makes the levels of the factors $x_{i}$ non-independent.
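To make the dependence concrete, here is a short worked step using only standard covariance rules, assuming each $x_{i}$ varies across observations with finite variance. Because the sum is the constant 1, its covariance with any component is zero:
$\operatorname{Cov}\left(x_{i}, \sum_{j=1}^{q} x_{j}\right) = \operatorname{Var}(x_{i}) + \sum_{j \neq i} \operatorname{Cov}(x_{i}, x_{j}) = 0$
so that:
$\sum_{j \neq i} \operatorname{Cov}(x_{i}, x_{j}) = -\operatorname{Var}(x_{i}) < 0$
Every component that varies must therefore have negative covariance with at least one other component, regardless of how the underlying quantities behave.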
You can compute a correlation between two components, but it is not informative, because the sum constraint forces the components to be correlated whatever the underlying process. You can read more about compositional analysis in Analysing data measured as proportional composition.
You can use correlation when the proportion data are from different domains. Say your response is the fraction of dead pixels on an LCD screen. You could try to correlate this with, say, the fraction of helium used in a chemical processing step for the screen. These two proportions are not parts of the same whole, so no sum constraint links them.
This is for the case when several variables sum to 1 in each observation. My answer will be intuition-level; this is intentional (and also, I'm not an expert in compositional data).
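Let us have i.i.d. (hence zero-correlated) positive-valued variables which we then sum up and recompute as proportions of that sum. Then the resulting proportions are no longer uncorrelated: the shared denominator and the unit-sum constraint induce negative correlation among them.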
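A minimal simulation sketch of this setup may help. It is not taken from the answer itself: the exponential distribution, the number of parts `q = 3`, the sample size, and the helper names `pearson` and `simulate` are all assumptions chosen for illustration. It draws independent positive parts, divides each row by its sum, and compares the correlation before and after.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(q, n, seed):
    """Correlation of parts 1 and 2 before and after closing rows to sum 1."""
    rng = random.Random(seed)
    # Raw parts: i.i.d. positive draws (exponential is an arbitrary choice).
    raw = [[rng.expovariate(1.0) for _ in range(q)] for _ in range(n)]
    # Closure: divide every row by its own sum so the parts add up to 1.
    props = [[v / sum(row) for v in row] for row in raw]
    r_raw = pearson([row[0] for row in raw], [row[1] for row in raw])
    r_prop = pearson([row[0] for row in props], [row[1] for row in props])
    return r_raw, r_prop


q = 3
r_raw, r_prop = simulate(q=q, n=20000, seed=0)
print(f"raw parts:   r = {r_raw:+.3f}")
print(f"proportions: r = {r_prop:+.3f}")
print(f"-1/(q-1)     =   {-1 / (q - 1):+.3f}")
```

The raw parts should show a correlation near zero, while the proportions should sit near $-1/(q-1)$, the value that follows for identically distributed parts from the sum constraint alone. The negative correlation is produced entirely by the division, not by any relationship between the raw variables.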
This is a deep question, and one with some subtleties that need to be stated. I'll try my best, but even though I've published on this topic (Proportionality: A Valid Alternative to Correlation for Relative Data) I'm always prepared to be surprised by new insights on the analysis of data containing only relative information.
As contributors to this thread have pointed out, correlation is notorious (in some circles) for being meaningless when applied to the compositional data that arise when a set of components is constrained to add up to a constant (as we see with proportions, percentages, parts-per-million, etc.).
Karl Pearson coined the term spurious correlation with this in mind. (Note: Tyler Vigen's popular Spurious Correlations site is not so much about spurious correlation as about the "correlation implies causation" fallacy.)
Section 1.7 of Aitchison's (2003) A Concise Guide to Compositional Data Analysis provides a classic illustration of why correlation is an inappropriate measure of association for compositional data (for convenience, quoted in this Supplementary Information).
Compositional data arise not only when a set of non-negative components are made to sum to a constant; data are said to be compositional whenever they carry only relative information.
I think the main problem with the correlation of data that carry only relative information is in the interpretation of the result. This is an issue that we can illustrate with a single variable; let's say "donuts produced per dollar of GDP" across the nations of the world. If one nation's value is higher than another's, is that because it produces more donuts, because its GDP is lower, or because of some mix of the two?
...who can say?
Of course, as people remark on this thread, one can calculate correlations of these sorts of variables as a descriptive statistic. But what do such correlations mean?
I had the same question. I found this reference at biorxiv useful:
Lovell D., V. Pawlowsky-Glahn, J. Egozcue, S. Marguerat, J. Bähler (2014),
"Proportionality: a valid alternative to correlation for relative data"
In the supporting information of this paper (Lovell, David, et al.; doi: dx.doi.org/10.1101/008417), the authors mention that correlations between relative abundances do not provide any information in some cases. They give an example using the relative abundances of two mRNAs. In Figure S2, the relative abundances of the two mRNAs are perfectly negatively correlated, even though the absolute abundances of these two mRNAs are not negatively correlated (green points and purple points).
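The two-part case can be sketched in a few lines. This is a hypothetical illustration, not a reproduction of Figure S2: the abundances below are simulated, and the names `shared`, `a`, `b` and the helper `pearson` are assumptions made for the example. Two absolute abundances are built to rise and fall together, then each is expressed as a share of their total.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
n = 1000

# Hypothetical absolute abundances: both scale with a shared factor,
# so in absolute terms they are positively correlated.
shared = [rng.uniform(1.0, 10.0) for _ in range(n)]
a = [s * rng.uniform(0.5, 1.5) for s in shared]
b = [s * rng.uniform(0.5, 1.5) for s in shared]

# Relative abundances: each part as a share of the two-part total.
rel_a = [x / (x + y) for x, y in zip(a, b)]
rel_b = [y / (x + y) for x, y in zip(a, b)]

r_abs = pearson(a, b)
r_rel = pearson(rel_a, rel_b)
print(f"absolute abundances: r = {r_abs:+.3f}")
print(f"relative abundances: r = {r_rel:+.3f}")
```

With only two parts, `rel_b` is `1 - rel_a` by construction, so the correlation of the relative abundances is -1 (up to floating-point rounding) no matter how the absolute abundances are related.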
Maybe it could help you.