
I've been staring at the Wikipedia page for distance correlation, where it seems to be characterized by how it can be calculated. While I could do the calculations, I struggle to get what distance correlation measures and why the calculations look the way they do.

Is there a more intuitive characterization (or several) of distance correlation that could help me understand what it measures?

I realize that asking for intuition is a bit vague, but if I knew what kind of intuition I was asking for I would probably not have asked in the first place. I would also be happy for intuition regarding the case of the distance correlation between two random variables (even though distance correlation is defined between two random vectors).

– Rasmus Bååth

1 Answer


This answer of mine doesn't answer the question correctly. Please read the comments.

Let us compare the usual covariance and the distance covariance. The effective part of both is the numerator (the denominators simply do the averaging). The numerator of covariance is the summed cross-product (= scalar product) of deviations from one point, the mean: $\Sigma_i (x_i-\mu^x)(y_i-\mu^y)$, with the superscripted $\mu$ denoting that centroid. Rewriting the expression in this style: $\Sigma_i d_{i\mu}^x d_{i\mu}^y$, with $d$ standing for the deviation of point $i$ from the centroid, i.e. its (signed) distance to the centroid. The covariance is thus defined by the sum, over all points, of the products of the two distances.
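To make the parallel concrete, here is a minimal numerical sketch of that numerator (the toy data and variable names are my own illustration, not part of any particular library):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 4.1])

dx = x - x.mean()            # signed distance of each point to the centroid
dy = y - y.mean()
numerator = np.sum(dx * dy)  # the covariance numerator: sum of cross-products
# Dividing by n (the "averaging" denominator) gives the biased covariance:
print(numerator / len(x), np.cov(x, y, bias=True)[0, 1])  # identical values
```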

How do things stand with distance covariance? The numerator is, as you know, $\Sigma_{i,j} d_{ij}^x d_{ij}^y$. Isn't it very much like what we've written above? And what is the difference? Here the distance $d$ is between varying data points, not between a data point and the mean as above. The distance covariance is defined by the sum, over all pairs of points, of the products of the two distances.
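As the comments below work out, the sample statistic in the Wikipedia article additionally double-centers the distance matrices before this cross-product is taken. Here is a minimal sketch of that sample definition (the function names and toy data are my own):

```python
import numpy as np

def double_center(d):
    # Subtract each row mean and each column mean, add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov(x, y):
    # Sample distance covariance: average the elementwise product of the two
    # double-centered pairwise-distance matrices, then take the square root.
    a = double_center(np.abs(np.subtract.outer(x, x)))
    b = double_center(np.abs(np.subtract.outer(y, y)))
    return np.sqrt(np.mean(a * b))

rng = np.random.default_rng(0)
x = rng.normal(size=300)
print(dcov(x, 2 * x + 1))              # linear dependence: clearly positive
print(dcov(x, rng.normal(size=300)))   # independent samples: near zero
```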

A scalar product (between two entities, in our case the variables $x$ and $y$) based on co-distance from one fixed point is maximized when the data are arranged along one straight line. A scalar product based on co-distance from a *variable* point is maximized when the data are arranged along a straight line locally, piecewise; in other words, when the data overall represent a chain of any shape, a dependency of any shape.

And indeed, the usual covariance is bigger when the relationship is closer to perfectly linear and the variances are bigger. If you standardize the variances to a fixed unit, the covariance depends only on the strength of the linear association, and it is then called Pearson correlation. And, as we know (and have just gotten some intuition about why), distance covariance is bigger when the relationship is closer to a perfect curve and the data spreads are bigger. If you standardize the spreads to a fixed unit, the covariance depends only on the strength of some curvilinear association, and it is then called Brownian (distance) correlation.
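Reusing the `dcov` sketch from above, that standardization step can be illustrated: dividing by the distance spreads gives the distance correlation (the `dcor` helper name is mine), which picks up a perfectly curvilinear dependence that Pearson correlation misses:

```python
def dcor(x, y):
    # Standardize the spreads: divide dCov by the distance standard deviations.
    return dcov(x, y) / np.sqrt(dcov(x, x) * dcov(y, y))

x = rng.normal(size=500)
y = x ** 2                             # deterministic dependence, but not linear
print(np.corrcoef(x, y)[0, 1])         # Pearson correlation: near zero
print(dcor(x, y))                      # distance correlation: clearly positive
```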

– ttnphns
  • The second paragraph made it click for me. I don't know why I didn't see that on the Wikipedia page... Thanks! – Rasmus Bååth Mar 06 '14 at 08:13
  • Just wondering, where in http://en.wikipedia.org/wiki/Brownian_covariance#Distance_covariance is the numerator from your example (or how does one get from your numerator to the Wikipedia version)? Wikipedia just describes how to calculate the square of the distance covariance, and I'm having a bit of trouble matching your description against the description there... – Rasmus Bååth Mar 06 '14 at 08:25
  • @Rasmus, my "numerator formula" agrees with the Wikipedia formula for the squared sample distance covariance. But I missed one (important) nuance: the distances $d$ are transformed by double centering. I might have to edit my answer, therefore. I hope to find time to return in a few days, if not tomorrow. – ttnphns Mar 06 '14 at 17:44
  • Yes, the double centering has been puzzling me. It would be greatly appreciated if you had the time to clarify that! :) – Rasmus Bååth Mar 06 '14 at 21:09
  • @Rasmus, double centering of distances is well familiar to me. It is done on _squared_ distances. In a univariate situation the result is products of deviations from the mean, so we get exactly the formula of my 1st paragraph. Indeed, the section "Generalization" of the above Wikipedia article says that with power $\alpha=2$ for the distances, dCov actually reflects the usual covariance (a numerical check of this appears below the thread). This I understand well. – ttnphns Mar 08 '14 at 11:33
  • What still **evades me** is why a lower power, e.g. the default $\alpha=1$, which shrinks and decenters the deviations obtained in the double centering, makes dCov the statistic with its unique property: it is 0 iff X and Y are statistically independent. Because I have no intuition or knowledge of it, I'm afraid that my answer's 2nd paragraph is a misinterpretation or an oversimplification. I am therefore inclined to delete my answer. Can you say anything? – ttnphns Mar 08 '14 at 11:41
  • Well, I'm sure more evades me than evades you :) Maybe keep the answer but add your extra question at the end; maybe someone will be able to "fill it in" later? – Rasmus Bååth Mar 08 '14 at 15:55
  • @Rasmus, I still don't have a clear intuition, like you; and the mathematical idea behind dCov is too complex for me. Here is what I think. As I've said above, with $\alpha=2$ the $d_{ij}^\alpha$, when doubly centered, expand by the simple binomial theorem, and consequently the entries of the matrix $A$ (see Wikipedia) become just products of deviations from the mean; thence dCov degenerates into the usual covariance (up to a constant factor). This is clear. – ttnphns Mar 22 '14 at 09:17
  • (Cont.) But with $\alpha=1$ the double-centered matrix conveys values that arise as infinite series, produced by a Taylor expansion. The key point (that I can't grasp) is that somehow these values of $A$ then relate to the empirical _characteristic function_ of the distribution, and it is this relatedness that makes dCov a measure of any dependency, linear or nonlinear. – ttnphns Mar 22 '14 at 09:22
  • @ttnphns maybe this helps? (PDF) https://www.researchgate.net/profile/Gabor_Szekely3/publication/243786506_E-Statistics_The_energy_of_statistical_samples/links/5609857f08ae840a08d3b06a.pdf – kram1032 Jun 10 '18 at 11:52
  • Starting with section 5 (page 10), and as stated in Proposition 3 (page 12), you basically need a strictly negative definite kernel for this to work. It turns out that using $\mathbb{E}\left[\left| X - Y \right|^2\right]$ (etc.) you have only a *conditionally* negative definite kernel, which only works in certain situations (such as, in the case of the usual covariance, for linear dependence) - but any $\left|\cdot\right|^\alpha$ with $0 < \alpha < 2$ gives a strictly negative definite kernel. – kram1032
  • @kram1032, thank you for the paper! I'll read it later. But really, why don't you post an answer to the question if you know what to say? It would be good for everybody, including myself. – ttnphns Jun 10 '18 at 12:32
  • @ttnphns Because I'm not entirely sure it would answer the question here. I'm still trying to make sense of it myself. That's why I said "*Maybe* this helps?" - I really am not sure whether it would. (My background very much isn't statistics, and I'm not confident I could give a decently phrased answer at all.) – kram1032 Jun 10 '18 at 12:35
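As a postscript to the thread above, the $\alpha=2$ observation is easy to check numerically: double-centering the _squared_ distances turns each entry of $A$ into $-2$ times a product of deviations from the mean, so the squared dCov collapses into $4$ times the squared usual covariance. A small sketch (mine, not from the paper), assuming the same double-centering helper as before:

```python
import numpy as np

def double_center(d):
    # Subtract row and column means, add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

rng = np.random.default_rng(2)
x, y = rng.normal(size=300), rng.normal(size=300)

# alpha = 2: use squared pairwise distances instead of plain ones.
a = double_center(np.subtract.outer(x, x) ** 2)
b = double_center(np.subtract.outer(y, y) ** 2)

cov = np.mean((x - x.mean()) * (y - y.mean()))   # biased sample covariance
print(np.mean(a * b), (2 * cov) ** 2)            # numerically equal: dCov^2 = (2*cov)^2
```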