Kim (2000) gives a formula for the decomposition of the (squared) Mahalanobis distance for $d$ variables for a random vector $\mathbf{x}$ with mean vector $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$:
$$ (\mathbf{x}-\mathbf{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu}) = \sum_{i = 1}^{d}\left(1 - \color{red}{\rho_{i}^{2}}\right)^{-1/2}\left[\color{blue}{b_{(i)dd}(x_{i} - \mathrm{E}(x_{i}|x_{-i}))}\right]\left[\color{green}{(x_{i} - \mu_{i})/\sigma_{i}}\right]\tag{1} $$
Explaining the components of the formula
$x_{-i}$ denotes the vector obtained from $\mathbf{x}$ by deleting its $i$th component variable.
$\color{red}{\rho_{i}^2}$ is the multiple correlation coefficient between $x_{i}$ and the vector of the remaining variables $x_{-i}$.
He further writes:
Here $\mathrm{E}(x_{i}|x_{-i})$ denotes the conditional expectation of $x_{i}$ when the other variables in $x_{i}$ are fixed [...]. It becomes the regression line of $x_{i}$ on the variables in $x_{-i}$ [...].
And:
The second component $\color{blue}{b_{(i)dd}(x_{i} - \mathrm{E}(x_{i}|x_{-i}))}$ is the standardized residual of $x_{i}$ from its regression on the variables in $x_{-i}$ [...].
$\mu_{i}$ and $\sigma_{i}$ denote the mean and variance of the $i$th component variable $x_{i}$ of $\mathbf{x}$. So the expression $\color{green}{(x_{i} - \mu_{i})/\sigma_{i}}$ denotes the standardized values of $x_{i}$.
Problem
Assume for the moment that we have $d=3$ variables and want to calculate the component for variable $x_{1}$. The R
code below contains a real dataset with three variables.
According to the explanations above, I could do the following steps (see below for corresponding R
code):
- Calculate a regression with $x_{1}$ the dependent and $x_{2}$ and $x_{3}$ as the independent variables.
- The (unadjusted) $R^{2}$ fromt he regression is the multiple correlation coefficient $\color{red}{\rho_{1}^{2}}$ in the formula above.
- The standardized residuals obtained by $\hat{\epsilon_{i}}/(\hat{\sigma}\sqrt{(1 - h_{ii}})$ where $\hat{\sigma}$ is the estimated residual standard deviation and $h_{ii}$ is the $i$th diagonal of the hat matrix. Obtained by
rstandard
inR
. - $\mu_{1}$ and $\sigma_{1}$ would simply be the mean and standard deviation of $x_{1}$. So I simply standardize $x_{1}$ to obtain $\color{green}{(x_{1} - \mu_{1})/\sigma_{1}}$.
Using this procedure gives components that do not quite sum of to be the mahalanobis distance (see picture below). The author writes that to get the sample values of $(1)$, we could use the maximum likelihood estimates in the, which the above steps do not do (i.e. for the variance estimates). I have tried using the maximum likelihood estimators but the components were still not adding up (I think).
How could I calculate the necessary components of the formula correctly?
Example with code
#------------------------------------------------------------------------------------------
# Data
#------------------------------------------------------------------------------------------
fuel <- c(16.44, 7.19, 9.92, 4.24, 11.2, 14.25, 13.5, 13.32, 29.11, 12.68,
7.51, 9.9, 10.25, 11.11, 12.17, 10.24, 10.18, 8.88, 12.34, 8.51,
26.16, 12.95, 16.93, 14.7, 10.32, 8.98, 9.7, 12.72, 9.49, 8.22,
13.7, 8.21, 15.86, 9.18, 12.49, 17.32)
repair <- c(12.43, 2.7, 1.35, 5.78, 5.05, 5.78, 10.98, 14.27, 15.09, 7.61,
5.8, 3.63, 5.07, 6.15, 14.26, 2.59, 6.05, 2.7, 7.73, 14.02, 17.44,
8.24, 13.37, 10.78, 5.16, 4.49, 11.59, 8.63, 2.16, 7.95, 11.22,
9.85, 11.42, 9.18, 4.67, 6.86)
capital <- c(11.23, 3.92, 9.75, 7.78, 10.67, 9.88, 10.6, 9.45, 3.28, 10.23,
8.13, 9.13, 10.17, 7.61, 14.39, 6.09, 12.14, 12.23, 11.68, 12.01,
16.89, 7.18, 17.59, 14.58, 17, 4.26, 6.83, 5.59, 6.23, 6.72,
4.91, 8.17, 13.06, 9.49, 11.94, 4.44)
x <- cbind(fuel, repair, capital)
#------------------------------------------------------------------------------------------
# Calculate mean vector and covariance matrix
#------------------------------------------------------------------------------------------
m_vec <- colMeans(x)
Sigma <- cov(x)
#------------------------------------------------------------------------------------------
# Calculate the squared Mahalanobis distance
#------------------------------------------------------------------------------------------
maha_d <- mahalanobis(x, center = m_vec, cov = Sigma, inverted = FALSE)
#------------------------------------------------------------------------------------------
# Calculate the multiple correlation coefficients and standardized residuals
#------------------------------------------------------------------------------------------
mod_fuel <- lm(fuel~repair + capital)
r2_fuel <- summary(mod_fuel)$r.squared
rs_fuel <- rstandard(mod_fuel)
x_fuel <- (1 - r2_fuel)^(-1/2)*(rs_fuel)*scale(fuel) # Component 1
mod_repair <- lm(repair~fuel + capital)
r2_repair <- summary(mod_repair)$r.squared
rs_repair <- rstandard(mod_repair)
x_repair <- (1 - r2_repair)^(-1/2)*(rs_repair)*(scale(repair)) # Component 2
mod_capital <- lm(capital~fuel + repair)
r2_capital <- summary(mod_capital)$r.squared
rs_capital <- rstandard(mod_repair)
x_capital <- (1 - r2_capital)^(-1/2)*(rs_capital)*(scale(capital)) # Component 3
zz <- cbind(x_fuel, x_repair, x_capital)
#------------------------------------------------------------------------------------------
# Sum of components
#------------------------------------------------------------------------------------------
rowSums(zz)