3

Kim (2000) gives a formula for the decomposition of the (squared) Mahalanobis distance for $d$ variables for a random vector $\mathbf{x}$ with mean vector $\mathbf{\mu}$ and covariance matrix $\mathbf{\Sigma}$:

$$ (\mathbf{x}-\mathbf{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu}) = \sum_{i = 1}^{d}\left(1 - \color{red}{\rho_{i}^{2}}\right)^{-1/2}\left[\color{blue}{b_{(i)dd}(x_{i} - \mathrm{E}(x_{i}|x_{-i}))}\right]\left[\color{green}{(x_{i} - \mu_{i})/\sigma_{i}}\right]\tag{1} $$

Explaining the components of the formula

$x_{-i}$ denotes the vector obtained from $\mathbf{x}$ by deleting its $i$th component variable.

$\color{red}{\rho_{i}^2}$ is the multiple correlation coefficient between $x_{i}$ and the vector of the remaining variables $x_{-i}$.

He further writes:

Here $\mathrm{E}(x_{i}|x_{-i})$ denotes the conditional expectation of $x_{i}$ when the other variables in $x_{i}$ are fixed [...]. It becomes the regression line of $x_{i}$ on the variables in $x_{-i}$ [...].

And:

The second component $\color{blue}{b_{(i)dd}(x_{i} - \mathrm{E}(x_{i}|x_{-i}))}$ is the standardized residual of $x_{i}$ from its regression on the variables in $x_{-i}$ [...].

$\mu_{i}$ and $\sigma_{i}$ denote the mean and variance of the $i$th component variable $x_{i}$ of $\mathbf{x}$. So the expression $\color{green}{(x_{i} - \mu_{i})/\sigma_{i}}$ denotes the standardized values of $x_{i}$.

Problem

Assume for the moment that we have $d=3$ variables and want to calculate the component for variable $x_{1}$. The R code below contains a real dataset with three variables.

According to the explanations above, I could do the following steps (see below for corresponding R code):

  1. Calculate a regression with $x_{1}$ the dependent and $x_{2}$ and $x_{3}$ as the independent variables.
  2. The (unadjusted) $R^{2}$ fromt he regression is the multiple correlation coefficient $\color{red}{\rho_{1}^{2}}$ in the formula above.
  3. The standardized residuals obtained by $\hat{\epsilon_{i}}/(\hat{\sigma}\sqrt{(1 - h_{ii}})$ where $\hat{\sigma}$ is the estimated residual standard deviation and $h_{ii}$ is the $i$th diagonal of the hat matrix. Obtained by rstandard in R.
  4. $\mu_{1}$ and $\sigma_{1}$ would simply be the mean and standard deviation of $x_{1}$. So I simply standardize $x_{1}$ to obtain $\color{green}{(x_{1} - \mu_{1})/\sigma_{1}}$.

Using this procedure gives components that do not quite sum of to be the mahalanobis distance (see picture below). The author writes that to get the sample values of $(1)$, we could use the maximum likelihood estimates in the, which the above steps do not do (i.e. for the variance estimates). I have tried using the maximum likelihood estimators but the components were still not adding up (I think).

How could I calculate the necessary components of the formula correctly?

Maha_Comparison

Example with code

#------------------------------------------------------------------------------------------
# Data
#------------------------------------------------------------------------------------------

fuel <- c(16.44, 7.19, 9.92, 4.24, 11.2, 14.25, 13.5, 13.32, 29.11, 12.68, 
          7.51, 9.9, 10.25, 11.11, 12.17, 10.24, 10.18, 8.88, 12.34, 8.51, 
          26.16, 12.95, 16.93, 14.7, 10.32, 8.98, 9.7, 12.72, 9.49, 8.22, 
          13.7, 8.21, 15.86, 9.18, 12.49, 17.32)
repair <- c(12.43, 2.7, 1.35, 5.78, 5.05, 5.78, 10.98, 14.27, 15.09, 7.61, 
            5.8, 3.63, 5.07, 6.15, 14.26, 2.59, 6.05, 2.7, 7.73, 14.02, 17.44, 
            8.24, 13.37, 10.78, 5.16, 4.49, 11.59, 8.63, 2.16, 7.95, 11.22, 
            9.85, 11.42, 9.18, 4.67, 6.86)
capital <- c(11.23, 3.92, 9.75, 7.78, 10.67, 9.88, 10.6, 9.45, 3.28, 10.23, 
             8.13, 9.13, 10.17, 7.61, 14.39, 6.09, 12.14, 12.23, 11.68, 12.01, 
             16.89, 7.18, 17.59, 14.58, 17, 4.26, 6.83, 5.59, 6.23, 6.72, 
             4.91, 8.17, 13.06, 9.49, 11.94, 4.44)

x <- cbind(fuel, repair, capital)

#------------------------------------------------------------------------------------------
# Calculate mean vector and covariance matrix
#------------------------------------------------------------------------------------------

m_vec <- colMeans(x)
Sigma <- cov(x)

#------------------------------------------------------------------------------------------
# Calculate the squared Mahalanobis distance
#------------------------------------------------------------------------------------------

maha_d <- mahalanobis(x, center = m_vec, cov = Sigma, inverted = FALSE)

#------------------------------------------------------------------------------------------
# Calculate the multiple correlation coefficients and standardized residuals
#------------------------------------------------------------------------------------------

mod_fuel <- lm(fuel~repair + capital)
r2_fuel <- summary(mod_fuel)$r.squared
rs_fuel <- rstandard(mod_fuel)

x_fuel <- (1 - r2_fuel)^(-1/2)*(rs_fuel)*scale(fuel) # Component 1

mod_repair <- lm(repair~fuel + capital)
r2_repair <- summary(mod_repair)$r.squared
rs_repair <- rstandard(mod_repair)

x_repair <- (1 - r2_repair)^(-1/2)*(rs_repair)*(scale(repair)) # Component 2

mod_capital <- lm(capital~fuel + repair)
r2_capital <- summary(mod_capital)$r.squared
rs_capital <- rstandard(mod_repair)

x_capital <- (1 - r2_capital)^(-1/2)*(rs_capital)*(scale(capital)) # Component 3

zz <- cbind(x_fuel, x_repair, x_capital)

#------------------------------------------------------------------------------------------
# Sum of components
#------------------------------------------------------------------------------------------

rowSums(zz)
COOLSerdash
  • 25,317
  • 8
  • 73
  • 123
  • Notice that in the formula you have given, the covariance matrix and the expected value of X are known, while in your code you estimate them. – Kozolovska Oct 20 '19 at 08:28
  • @t.f Thanks for the hint. The author writes that we could replace the parameters by their maximum likelihood estimates in the formula to get the sample values. – COOLSerdash Oct 20 '19 at 08:29

1 Answers1

0

I was wrong: By consistently using maximum likelihood estimates, the problem vanishes. That means:

  • Variances and covariances have to be calculated using $n$ in the denominator instead of the usual $n-1$.
  • Standardized residuals have to be obtained by dividing the residuals by the biased estimate of the residual standard deviation $\hat{\sigma}$, i.e. the sum of squared residuals dividied by $n$ instead of the residual degrees of freedom ($n-p$).

Mahalanobis_comparison

I still wonder how the decomposition would have to be modified for it to work for the usual unbiased estimators of the involved quantities?

Here is the R code to illustrate the correct calculations:

#------------------------------------------------------------------------------------------
# Data
#------------------------------------------------------------------------------------------

fuel <- c(16.44, 7.19, 9.92, 4.24, 11.2, 14.25, 13.5, 13.32, 29.11, 12.68, 
          7.51, 9.9, 10.25, 11.11, 12.17, 10.24, 10.18, 8.88, 12.34, 8.51, 
          26.16, 12.95, 16.93, 14.7, 10.32, 8.98, 9.7, 12.72, 9.49, 8.22, 
          13.7, 8.21, 15.86, 9.18, 12.49, 17.32)
repair <- c(12.43, 2.7, 1.35, 5.78, 5.05, 5.78, 10.98, 14.27, 15.09, 7.61, 
            5.8, 3.63, 5.07, 6.15, 14.26, 2.59, 6.05, 2.7, 7.73, 14.02, 17.44, 
            8.24, 13.37, 10.78, 5.16, 4.49, 11.59, 8.63, 2.16, 7.95, 11.22, 
            9.85, 11.42, 9.18, 4.67, 6.86)
capital <- c(11.23, 3.92, 9.75, 7.78, 10.67, 9.88, 10.6, 9.45, 3.28, 10.23, 
             8.13, 9.13, 10.17, 7.61, 14.39, 6.09, 12.14, 12.23, 11.68, 12.01, 
             16.89, 7.18, 17.59, 14.58, 17, 4.26, 6.83, 5.59, 6.23, 6.72, 
             4.91, 8.17, 13.06, 9.49, 11.94, 4.44)

x <- cbind(fuel, repair, capital)

#------------------------------------------------------------------------------------------
# Calculate mean vector and covariance matrix
#------------------------------------------------------------------------------------------

n <- nrow(x)

m_vec <- colMeans(x)
Sigma <- var(x)*((n - 1)/n)

#------------------------------------------------------------------------------------------
# Calculate the squared Mahalanobis distance
#------------------------------------------------------------------------------------------

maha_d <- mahalanobis(x, center = m_vec, cov = Sigma, inverted = FALSE)

#------------------------------------------------------------------------------------------
# Calculate the multiple correlation coefficients and standardized residuals
#------------------------------------------------------------------------------------------

mod_fuel <- lm(fuel~repair + capital)
sigma_fuel <- sqrt(sigma(mod_fuel)^2*(mod_fuel$df.residual)/n)

r2_fuel <- summary(mod_fuel)$r.squared
rs_fuel <- residuals(mod_fuel)
rs_fuels <- rs_fuel/sigma_fuel
z_fuel <- (fuel - m_vec["fuel"])/sqrt(Sigma["fuel", "fuel"])

comp_fuel <- (1 - r2_fuel)^(-1/2)*rs_fuels*z_fuel


mod_repair <- lm(repair~fuel + capital)
sigma_repair <- sqrt(sigma(mod_repair)^2*(mod_repair$df.residual)/n)

r2_repair <- summary(mod_repair)$r.squared
rs_repair <- residuals(mod_repair)
rs_repairs <- rs_repair/sigma_repair
z_repair <- (repair - m_vec["repair"])/sqrt(Sigma["repair", "repair"])

comp_repair <- (1 - r2_repair)^(-1/2)*rs_repairs*z_repair


mod_capital <- lm(capital~fuel + repair)
sigma_capital <- sqrt(sigma(mod_capital)^2*(mod_capital$df.residual)/n)

r2_capital <- summary(mod_capital)$r.squared
rs_capital <- residuals(mod_capital)
rs_capitals <- rs_capital/sigma_capital
z_capital <- (capital - m_vec["capital"])/sqrt(Sigma["capital", "capital"])

comp_capital <- (1 - r2_capital)^(-1/2)*rs_capitals*z_capital

zz <- cbind(comp_fuel, comp_repair, comp_capital, comp_fuel + comp_repair + comp_capital, maha_d)
COOLSerdash
  • 25,317
  • 8
  • 73
  • 123