
Looking at Pearson's, Spearman's and Kendall's correlation coefficients for the same data, we can see that both Spearman's Rho and Kendall's Tau misrepresent the actual correlation if the data is more than ordinally scaled and ranks therefore don't represent the actual data well.

Let's look at some examples first and my own data last. Here is some R code (please scroll the code window to see all of it) and a plot for the different data:

### perfectly correlated:

p <- c(1, 1000)
q <- c(1, 1000)
cor(p, q, method = "pearson")
# 1
cor(p, q, method = "spearman")
# 1
cor(p, q, method = "kendall")
# 1
par(mfrow = c(2, 2))
plot(p, q, main = "correlated", xlab = "", ylab = "", axes = FALSE)
box()

### perfectly uncorrelated:

p <- c(1, 2, 1, 2)
q <- c(1, 1, 2, 2)
cor(p, q, method = "pearson")
# 0
cor(p, q, method = "spearman")
# 0
cor(p, q, method = "kendall")
# 0
plot(p, q, main = "uncorrelated", xlab = "", ylab = "", axes = FALSE)
box()

### almost perfectly correlated:

p <- c(1, 2, 999, 1000)
q <- c(2, 1, 1000, 999)
cor(p, q, method = "pearson")
# 0.999998
cor(p, q, method = "spearman")
# 0.6
cor(p, q, method = "kendall")
# 0.3333333
plot(jitter(p, 100), jitter(q, 100), main = "almost", xlab = "", ylab = "", axes = FALSE)
box()

### my data

p <- c(1.139434, 1.901322, 1.461096, 2.459053, 4.643259, 2.397895, 1.99243, 3.013225, 1.654558, NA, 1.529395, 3.861899, 1.07881, 2.942148, 3.791436, 3.349904, NA, 2.34857, 2.944439, 3.251079, 3.766229, 3.94266, 2.125251, 1.934076, 2.238047, 1.731135, 1.511458, 3.311585, 2.66921, NA, 0.4700036, 1.751754, 1.548813, 4.01228, 0.7503056, 3.430397, 3.718977, 3.154634, 0.8873032, 1.824549, 2.837728, 3.057768, 3.709399, 2.674149, 1.832581, NA, 2.710713, 1.738219, 0.8754687, NA, 3.272417, 2.89395, 1.386294, 1.814749, 2.1366, 4.857225, 0.8043728, 3.531694, 4.75359, 1.791759, 1.754019, 2.367124, 2.736221, 4.004119, 4.39834, 3.745575)
q <- c(0.9162907, 1.332227, 0.415127, 1.765906, 1.523495, 1.722767, 1.622683, 2.455054, 0.6931472, NA, 1.495494, 2.890372, 0.05715841, 2.221092, 3.326474, 2.732743, NA, 1.791759, 2.273598, 2.524516, 2.803572, 3.028522, 1.252763, 1.538763, 1.558145, 1.386294, 1.029619, 2.655252, 2.397895, NA, 0.9808293, 1.32567, 1.548813, 2.585711, 0.6931472, 2.914763, 2.86537, 2.654806, 0.6931472, 1.386294, 2.135531, 2.95491, 2.632064, 2.564949, 1.098612, NA, 1.99606, 0.4770875, 0.4054651, 1.213682, 3.107944, 2.383124, 1.072637, 1.249435, 1.644123, 3.628776, 0.1625189, 2.008824, 3.590034, 1.920377, 0.7985077, 1.813738, 2.436116, 3.754337, 3.335957, 2.908721)
cor(p, q, use = "pairwise.complete.obs", method = "pearson")
# 0.8890321
cor(p, q, use = "pairwise.complete.obs", method = "spearman")
# 0.9087856
cor(p, q, use = "pairwise.complete.obs", method = "kendall")
# 0.7669589
plot(p, q, main = "my data", xlab = "", ylab = "", axes = FALSE)
box()

*[plot: the four scatterplots "correlated", "uncorrelated", "almost" and "my data" produced by the code above]*

Even if eyeballing is a bad way to assess data, I'm sure you'll agree that in my almost perfectly correlated third example at least some of the three calculated correlation coefficients must be off the mark: rP = 0.999998, rS = 0.6, rK = 0.3333333.

Considering that my data are metric and ratio-scaled (the values are seconds) but not normally distributed, what is the best way to calculate their correlation?

gung - Reinstate Monica
  • Reading [another question](http://stats.stackexchange.com/q/18112/3277) might help you. – ttnphns Sep 08 '15 at 07:59
  • `I can simply look at the plot and answer ... how correlated the data is`. Unfortunately, it is not simple to answer. Just try and reflect on, and find that you are basing your description of the correlation plot on some visual "good" prototype or premise. If all people in any time interpreted such plots identically there would be no reason to invent different correlation measures. – ttnphns Sep 08 '15 at 08:58
  • @ttnphns Re: duplicate. Please see my edit. Maybe I'm just too mathematically disabled to understand the answers to the other question, but I don't see how they answer my question. –  Sep 08 '15 at 09:20
  • And can you publish the data itself, not simply the plot of it? To give data is always a good idea. – ttnphns Sep 08 '15 at 09:24
  • My data is in the R code in my question. Please scroll the code window to see it. –  Sep 08 '15 at 09:24
  • Kendall and Spearman correlations are computed on the _ranked_ data, `1 2 3 4` and `2 1 4 3`, respectively. Did you just forget it? – ttnphns Sep 08 '15 at 09:43
  • Your "almost" plot is *only* "almost" when viewed in terms of the raw data, as PMCC would do. After rank transformation it doesn't look so "almost" after all. Apply a monotonic transformation so that the X and Y values don't tend to cluster up around two values (something visually meaningful to you, but not at all detected by rank-based correlation measures) and the correlation would not look so strong. If you use visual intuition based on raw data, don't be surprised if this intuition is not reflected in rank-based measures, especially when you picked an arrangement where they differ so much! – Silverfish Sep 08 '15 at 09:44
  • Thank you, @Silverfish and ttnphns, that actually helps. I have edited my question and given it a different aim. –  Sep 08 '15 at 10:35
  • `Looking at Pearson's, Spearman's and Kendall's correlation coefficients for the same data, we can see that both Spearman's Rho and Kendall's Tau misrepresent the actual correlation, if the data is higher than ordinally scaled and ranks therefore don't represent the actual data well.` I'm returning to my 2nd comment. **What** is the "actual correlation"? There is no such thing. Not a single one. We always imply some clear or dim model of relationship. Models can be plenty; there can be an infinite number of correlation coefficients. – ttnphns Sep 08 '15 at 10:42
  • Regarding your particular example of the "almost" plot, the choice of Pearson's r is most warranted. First, you have so few data points and, second, the configuration is such a "thin rectangle" shape that anyone intuitively will pick the relationship with Pearson's r. Among well-known coefficients it is what first comes to mind. – ttnphns Sep 08 '15 at 10:51
  • Thank you, @ttnphns, I think I finally begin to understand. But this still leaves me with the problem (phrased at the end of the current edit of my question) of how to calculate the correlation in my data. What would you recommend? As I understand it, I should not use Pearson's r if the data is not normally distributed, and my data certainly isn't. –  Sep 08 '15 at 10:57
  • You might find it instructive to compare the Pearson and Spearman correlation on bivariate normal data, at various correlations. – Glen_b Sep 08 '15 at 11:37
  • @what It is fine to use Pearson's on non-normally distributed data! Just don't apply a significance test that assumes normality. And there are times when another measure makes more sense, e.g. when faced with a curvilinear relationship. – Silverfish Sep 08 '15 at 14:41
