
I have seen a formula on Wikipedia that relates Mahalanobis distance and leverage:

Mahalanobis distance is closely related to the leverage statistic, $h$, but has a different scale: $$D^2 = (N - 1)(h - \tfrac{1}{N}).$$

In a linked article, Wikipedia describes $h$ in these terms:

In the linear regression model, the leverage score for the $i^{th}$ data unit is defined as:$$h_{ii}=(H)_{ii},$$ the $i^{th}$ diagonal element of the hat matrix $H=X(X^{\top}X)^{-1}X^{\top}$, where $^{\top}$ denotes the matrix transpose.
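
To make sure I am reading the definition correctly, here is a small R check (a sketch using the built-in mtcars data as a stand-in design) that hat() returns the diagonal of $H$:

# Check that hat() gives the diagonal of H = X (X'X)^{-1} X'
X <- cbind(1, as.matrix(mtcars))             # model matrix with an intercept column
H <- X %*% solve(crossprod(X)) %*% t(X)      # hat matrix
all.equal(unname(diag(H)), hat(as.matrix(mtcars), intercept = TRUE))   # TRUE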

I can't find a proof anywhere. I tried to start from the definitions but I can't make any progress. Can anyone give a hint?

kjetil b halvorsen
dave2d

2 Answers


My description of the Mahalanobis distance at "Bottom to top explanation of the Mahalanobis distance?" includes two key results:

  1. By definition, it does not change when the regressors are uniformly shifted.

  2. The squared Mahalanobis distance between vectors $x$ and $y$ is given by $$D^2(x,y) = (x-y)^\prime \Sigma^{-1}(x-y)$$ where $\Sigma$ is the covariance of the data.
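
For concreteness, here is a quick R check of both facts, using the mtcars columns as stand-in regressors:

x <- as.matrix(mtcars)
D2 <- mahalanobis(x, colMeans(x), cov(x))    # squared Mahalanobis distances

# (1) Shifting every regressor by a constant leaves the distances unchanged.
xs <- sweep(x, 2, rnorm(ncol(x)), "+")
all.equal(D2, mahalanobis(xs, colMeans(xs), cov(xs)))        # TRUE

# (2) The same values from the quadratic form (x - y)' Sigma^{-1} (x - y).
xc <- sweep(x, 2, colMeans(x))               # center at y = column means
all.equal(D2, rowSums((xc %*% solve(cov(x))) * xc))          # TRUE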

(1) allows us to assume the means of the regressors are all zero. It remains to compute $h_i$. However, for the claim to be true, we need to add one more assumption:

The model must include an intercept.

Allowing for this, let there be $k \ge 0$ regressors and $n$ data, writing the value of regressor $j$ for observation $i$ as $x_{ij}$. Let the column vector of these $n$ values for regressor $j$ be written $\mathbf{x}_{,j}$ and the row vector of these $k$ values for observation $i$ be written $\mathbf{x}_i$. Then the model matrix is

$$X = \pmatrix{ 1 &x_{11} &\cdots &x_{1k} \\ 1 &x_{21} &\cdots &x_{2k} \\ \vdots &\vdots &\vdots &\vdots \\ 1 &x_{n1} &\cdots &x_{nk}}$$

and, by definition, the hat matrix is

$$H = X(X^\prime X)^{-1} X^\prime,$$

whence entry $i$ along the diagonal is

$$h_i = h_{ii} = (1; \mathbf{x}_i) (X^\prime X)^{-1} (1; \mathbf{x}_i)^\prime.\tag{1}$$

There's nothing for it but to work out that central matrix inverse--but by virtue of the first key result, it's easy, especially when we write it in block-matrix form:

$$X^\prime X = n\pmatrix{1 & \mathbf{0}^\prime \\ \mathbf{0} & C}$$

where $\mathbf{0} = (0,0,\ldots,0)^\prime$ and

$$C_{jk} = \frac{1}{n} \sum_{i=1}^n x_{ij} x_{ik} = \frac{n-1}{n}\operatorname{Cov}(\mathbf{x}_{,j}, \mathbf{x}_{,k}) = \frac{n-1}{n}\Sigma_{jk}.$$

(I have written $\Sigma$ for the sample covariance matrix of the regressors.) Because this is block diagonal, its inverse can be found simply by inverting the blocks:

$$(X^\prime X)^{-1} = \frac{1}{n}\pmatrix{1 & \mathbf{0}^\prime \\ \mathbf{0} & C^{-1}} = \pmatrix{\frac{1}{n} & \mathbf{0}^\prime \\ \mathbf{0} & \frac{1}{n-1}\Sigma^{-1}}.$$
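
A quick numerical illustration of this block structure (a sketch with centered mtcars columns as stand-in regressors):

z <- scale(as.matrix(mtcars), center = TRUE, scale = FALSE)    # centered regressors
n <- nrow(z)
XtX <- crossprod(cbind(1, z))                # X'X with an intercept column
max(abs(XtX[1, -1]))                         # ~0: the off-diagonal blocks vanish
all.equal(solve(XtX)[1, 1], 1/n)                             # TRUE
all.equal(solve(XtX)[-1, -1], solve(cov(z))/(n - 1))         # TRUE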

From the definition $(1)$ we obtain

$$\eqalign{ h_i &= (1; \mathbf{x}_i) \pmatrix{\frac{1}{n} & \mathbf{0}^\prime \\ \mathbf{0} & \frac{1}{n-1}\Sigma^{-1}}(1; \mathbf{x}_i)^\prime \\ &=\frac{1}{n} + \frac{1}{n-1}\mathbf{x}_i \Sigma^{-1}\mathbf{x}_i^\prime \\ &=\frac{1}{n} + \frac{1}{n-1} D^2(\mathbf{x}_i, \mathbf{0}). }$$

Solving for the squared Mahalanobis length $D_i^2 = D^2(\mathbf{x}_i, \mathbf{0})$ yields

$$D_i^2 = (n-1)\left(h_i - \frac{1}{n}\right),$$

QED.

Looking back, we may trace the additive term $1/n$ to the presence of an intercept, which introduced the column of ones into the model matrix $X$. The multiplicative term $n-1$ appeared after assuming the Mahalanobis distance would be computed using the sample covariance estimate (which divides the sums of squares and products by $n-1$) rather than the covariance matrix of the data (which divides the sum of squares and products by $n$).
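
A quick check of that last point (a sketch, again with mtcars): replacing the sample covariance by its $n$-divisor version turns the factor $n-1$ into $n$:

x <- as.matrix(mtcars); n <- nrow(x)
h <- hat(x, intercept = TRUE)
D2n <- mahalanobis(x, colMeans(x), cov(x) * (n - 1)/n)   # n-divisor covariance
all.equal(unname(D2n), n * (h - 1/n))                    # TRUE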


The chief value of this analysis is to impart a geometric interpretation to the leverage, which measures how much a unit change in the response at observation $i$ will change the fitted value at that observation: high-leverage observations are at large Mahalanobis distances from the centroid of the regressors, exactly as a mechanically efficient lever operates at a large distance from its fulcrum.


R code to show that the relation indeed holds:

x <- mtcars

# Compute leverages and Mahalanobis distances
h <- hat(x, intercept = TRUE); names(h) <- rownames(mtcars)
M <- mahalanobis(x, colMeans(x), cov(x))

# Compute D^2 of the question
n <- nrow(x); D2 <- (n-1)*(h - 1/n)

# Compare.
all.equal(M, D2)               # TRUE
print(signif(cbind(M, D2), 3))
whuber
  • Excellent answer, very well rounded with rigor and intuition. Cheers! – cgrudz Mar 27 '19 at 22:53
  • Thanks for the post @whuber! For a sanity check, here is R code to show that the relation indeed holds: x – Tal Galili Jul 16 '19 at 07:21
  • @Tal I didn't think I needed a sanity check--but thank you for the code. :-) I have made modifications to clarify it and its output a little. – whuber Jul 16 '19 at 12:33
  • @whuber, I wanted an example that shows how to make the equality work (making clear to me that I got the assumptions right). I've also extended the relevant Wiki entry: https://en.wikipedia.org/wiki/Leverage_(statistics)#Mahalanobis_distance (feel free to also expand on it there, as you see fit :) ) – Tal Galili Jul 16 '19 at 13:33
  • How do we justify that $X^T X$ is the sample covariance rather than the sample second moment? I buy that the Mahalanobis distance does not change when the regressors are all shifted by the same amount, but why does that mean we can assume the data is already centered when computing $X^T X$? @whuber – Rylan Schaeffer Dec 14 '21 at 19:24
  • @whuber I'm probably misreading something, but it appears that only under the assumption that the data is centered that we get the sample covariance and through the sample covariance reach the Mahalanobis distance. But that doesn't justify assuming the data is already centered *before* the Mahalanobis distance enters the scene – Rylan Schaeffer Dec 14 '21 at 19:31
  • @Rylan Centering is not an assumption. Another way to view this part of the argument would be to restore the term "$\mu$" in the formulas and track it all the way through any algebra involving Mahalanobis distances--only to discover it disappears at the end. Intuitively, this is because the distance doesn't depend on the origin of our units of measurement for the values of the vectors, so we are free to choose any suitable origin--and the means do nicely. – whuber Dec 14 '21 at 19:51
  • I agree that $\mu$ will disappear in anything involving the Mahalanobis distance, but how do we reach the Mahalanobis distance without assuming that the data is centered? – Rylan Schaeffer Dec 14 '21 at 20:00
  • To see what I mean, suppose that the data isn't centered. Your $h_i$ term becomes $\frac{1}{n} + x_i^T M^{-1} x_i$ where $M = \frac{1}{n}\sum_i x_i x_i^T = \Sigma + \mu \mu^T$. How do we then reach the Mahalanobis distance? – Rylan Schaeffer Dec 14 '21 at 20:01
  • @Rylan I'm not following the thrust of your comments. The Mahalanobis distance is [defined](https://stats.stackexchange.com/questions/62092) relative to an origin placed at the point of averages. The point is that this is mathematically the same thing as computing it relative to the vector space origin at $(0,\ldots,0)$ after a preliminary centering of the data. – whuber Dec 14 '21 at 20:02
  • See my above comment. My concern is *before* the Mahalanobis distance appears. You reach $(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$ and then say this is the squared Mahalanobis distance, which is correct. I'm pointing out that $(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)$ is reached only if the data is centered; if we don't make that assumption, we instead reach $x_i^T (\Sigma + \mu \mu^T)^{-1} x_i$. How do we get the Mahalanobis distance from that? – Rylan Schaeffer Dec 14 '21 at 20:08
  • In your comment to my other question, you seem to contradict yourself: https://stats.stackexchange.com/questions/552005/mistake-about-covariance-on-wikipedia-page-for-leverage. There, I specifically asked whether the covariance was the correct quantity in computing the hat matrix / leverages, and you said, "You are correct and that passage [on Wikipedia] is wrong." – Rylan Schaeffer Dec 14 '21 at 20:27

I fear @whuber's answer relies on a key assumption that is not justified. I don't know how to fix it, but I want to explain the problem. The issue is the statement "(1) allows us to assume the means of the regressors are all zero." While that is true of the Mahalanobis distance, @whuber assumes the raw data is mean zero before any Mahalanobis distance enters the picture. I don't see how this is justifiable.

To see this, let's redo the calculation without assuming that $X$ is centered. As above, we have:

$$[H]_{ii} = \begin{bmatrix} 1 \\ x_i \end{bmatrix}^T (X^T X)^{-1} \begin{bmatrix} 1 \\ x_i \end{bmatrix}$$

One can see that $X^T X = \sum_{i} \begin{bmatrix} 1 \\ x_i \end{bmatrix} \begin{bmatrix} 1 \\ x_i \end{bmatrix}^T$. @whuber assumes that the data is centered and thus writes:

$$X^T X = n \begin{bmatrix} 1 & 0^T \\ 0 & C \end{bmatrix}$$

But how is this justified? No Mahalanobis distance has (yet) entered the picture. Nothing tells us that the data is centered. In actuality, writing $\mu$ for the vector of regressor means and $\Sigma$ for the $n$-divisor covariance $\frac{1}{n}\sum_i (x_i - \mu)(x_i - \mu)^T$, the correct expression is:

$$X^T X = n \begin{bmatrix} 1 & \mu^T \\ \mu & \Sigma + \mu \mu^T \end{bmatrix}$$
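
For what it is worth, this expression can be checked numerically (a sketch using mtcars as a stand-in):

x <- as.matrix(mtcars); n <- nrow(x)
mu <- colMeans(x); S <- cov(x) * (n - 1)/n               # n-divisor covariance
lhs <- crossprod(cbind(1, x)) / n                        # (1/n) X'X
rhs <- rbind(c(1, mu), cbind(mu, S + tcrossprod(mu)))
all.equal(lhs, rhs, check.attributes = FALSE)            # TRUE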

An easy block inversion is no longer possible and I don't know how to complete the derivation.

Rylan Schaeffer
  • -1 Please read my answer again: I make no such assumptions. If you wish to start with your last expression, feel free to do so: every occurrence of "$\mu$" will vanish after the initial row and column reduction steps. Thus, with extra initial work, you will arrive at the point where I began and you can continue from there. If this isn't perfectly clear, carry out a calculation by hand with a tiny design matrix, such as a $3\times 2$ matrix. – whuber Dec 15 '21 at 02:36
  • Whether you call it an assumption or not, you do not show that $\mu$ vanishes. You just say that it does. Even if it does, the onus falls on you to show that it does. – Rylan Schaeffer Dec 15 '21 at 02:46
  • @whuber, you're a bully. I'm correct in pointing out that your answer is incomplete without showing that "every occurrence of $\mu$ will vanish after the initial row and column reduction steps" is true, and you're downvoting me – Rylan Schaeffer Dec 15 '21 at 02:54
  • We have had an extensive discussion about this. Trying to address your misunderstanding by posting an incorrect answer is counterproductive because it risks confusing others. My downvote is an effort to signal that there are fundamental problems in your answer. I am sorry if you view that as "bullying:" I see such commenting and voting as being a necessary part of how this site helps develop high-quality answers. For the same reasons, your revenge downvote of my answer is unrelated to its usefulness and thereby detracts from the quality of this site. – whuber Dec 15 '21 at 13:52