
Here is my real-data example (these are real documents; see the images below). I am comparing real documents, with words represented as TF-IDF values. I equalize the vector lengths by adding missing words to each list with the value 0.

Matrix 1 (1×17, values rounded for easier reading)

0.29 0.29 0.29 0.29 0.29 0.18 0.18 0.29 0.29 0.29 0.29 0.29 0.29 0 0 0 0 

Matrix 2 (1×17, values rounded for easier reading)

0 0 0 0 0 0.29 0.29 0 0 0 0 0 0 0.46 0.46 0.46 0.46 

[Image: the first document's real words]

[Image: the second document's real words]

Here is the formula of the Mahalanobis distance:

$$ d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y}) \cdot \mathbf{S}^{-1} \cdot (\mathbf{x} - \mathbf{y})^T} $$

where $\mathbf{S}$ is the covariance matrix.

I have implemented the formula from this tutorial, and it works perfectly fine: http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html

I implemented that algorithm, but it does not work in my case, because the algorithm requires at least 2×N matrices. Am I right?

Is the algorithm at that link correct or not?

  • Please explain what the seemingly contradictory "single double" value is. – whuber Nov 30 '15 at 23:41
  • @whuber I will use the Mahalanobis distance to calculate the distance between string documents. So I have no clue how to use the Mahalanobis distance as a metric if the result is another matrix. – MonsterMMORPG Nov 30 '15 at 23:48
  • But the formula you quote clearly produces a single number! What, then, do you mean by "single double"? – whuber Nov 30 '15 at 23:56
  • @whuber How does it produce a single number for a matrix with multiple rows? I am really confused, which is why I am asking :D How do I proceed from where I am stuck? If you can answer the question step by step, I would really appreciate it. – MonsterMMORPG Nov 30 '15 at 23:59
  • 1
    I cannot tell what your data are. What do the matrices represent? Why do you present two matrices of different dimensions as data? How do you compute a covariance from these two matrices? This does not appear to be a setting in which a Mahalanobis distance makes sense or can be calculated. – whuber Dec 01 '15 at 00:01
  • @whuber I used this example to calculate the covariance matrix: http://stattrek.com/matrix-algebra/covariance-matrix.aspx . My data example is real. Assume that the first matrix is an integer representation of the first document cluster, where each integer represents a word. The second matrix is the same. – MonsterMMORPG Dec 01 '15 at 07:16
  • It is not apparent whether or how the formulas you reference apply to your two matrices. – whuber Dec 01 '15 at 14:59
  • @whuber I guess you were right. I have implemented the system from here: http://people.revoledu.com/kardi/tutorial/Similarity/MahalanobisDistance.html . However, since I compare one document against one document, each has one row of data. So can we say that if my matrices are 1×N, the Mahalanobis distance is not applicable? – MonsterMMORPG Dec 01 '15 at 20:54
  • @whuber I completely updated my question, thank you. – MonsterMMORPG Dec 01 '15 at 21:26
  • You should not completely change your question. If you have a new question please post it as a new question. "It does not work in my case" is not enough information to figure out what you are doing wrong. Please improve your question. – Biswajit Banerjee Dec 01 '15 at 22:40

1 Answer


There are several answers to similar questions on StackExchange (see, for example, Pairwise Mahalanobis distance in R).

Centering the data

Let us assume your matrices are "data matrices", i.e., they have the dimensions $n \times p$, where $n$ is the number of observations in a sample and $p$ is the number of variables.

Ideally you should center the data matrices by subtracting the column means from the columns before forming your covariance matrix and inverting it. Let us assume that, in your example, two samples are $$ \mathbf{X} = \begin{bmatrix} 0.678177 & 0.989365 & 0.558944\\ 0.652491 & 0.799086 & 0.514262\\ 0.230560 & 0.299587 & 0.023458 \end{bmatrix} $$ and $$ \mathbf{Y} = \begin{bmatrix} 0.015513 & 0.932740 & 0.351038 \\ 0.400391 & 0.515025 & 0.179851 \end{bmatrix} $$ Let us center the data by subtracting the sample means which are $$ E[\mathbf{X}] = \begin{bmatrix} 0.52041 & 0.69601 & 0.36555 \end{bmatrix} $$ and $$ E[\mathbf{Y}] = \begin{bmatrix} 0.20795 & 0.72388 & 0.26544\end{bmatrix} $$ The centered data are $$ \mathbf{X} = \begin{bmatrix} 0.15777 & 0.29335 & 0.19339\\ 0.13208 & 0.10307 & 0.14871\\ -0.28985 & -0.39643 & -0.34210 \end{bmatrix} $$ and $$ \mathbf{Y} = \begin{bmatrix} -0.192439 & 0.208857 & 0.085594\\ 0.192439 & -0.208857 & -0.085594 \end{bmatrix} $$
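For concreteness, here is a minimal NumPy sketch of this centering step (the array names `X` and `Y` are mine; the values are the sample matrices quoted above):

```python
import numpy as np

# Two data matrices: rows are observations, columns are variables (n x p).
X = np.array([[0.678177, 0.989365, 0.558944],
              [0.652491, 0.799086, 0.514262],
              [0.230560, 0.299587, 0.023458]])
Y = np.array([[0.015513, 0.932740, 0.351038],
              [0.400391, 0.515025, 0.179851]])

# Center each sample by subtracting its column (per-variable) means.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
```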

Covariance

Then the cross-covariance matrix can be computed using $$ \begin{aligned} \mathbf{S}(\mathbf{X},\mathbf{Y}) & = E\left[(\mathbf{X} - E[\mathbf{X}]) \otimes (\mathbf{Y} - E[\mathbf{Y}])\right] \\ & = E\left[\mathbf{X}\otimes\mathbf{Y} - E[\mathbf{X}]\otimes\mathbf{Y} - \mathbf{X}\otimes E[\mathbf{Y}] + E[\mathbf{X}]\otimes E[\mathbf{Y}]\right] \\ & = E[\mathbf{X}\otimes\mathbf{Y}] - E[\mathbf{X}]\otimes E[\mathbf{Y}] - E[\mathbf{X}]\otimes E[\mathbf{Y}] + E[\mathbf{X}]\otimes E[\mathbf{Y}]\\ & = E[\mathbf{X}\otimes\mathbf{Y}] - E[\mathbf{X}]\otimes E[\mathbf{Y}] \end{aligned} $$ In terms of matrix components $$ S_{ij} = E[X_i\,Y_j] - E[X_i]\,E[Y_j] $$ If we assume that the expected value can be estimated by the sample mean, then we can use $$ E[\mathbf{X}] = \frac{1}{n_x}\sum_{i=1}^{n_x} \mathbf{X}^{(i)} ~,~~ E[\mathbf{Y}] = \frac{1}{n_y}\sum_{i=1}^{n_y} \mathbf{Y}^{(i)} $$ and, with $N = n_x + n_y$ if the $\mathbf{X}$ and $\mathbf{Y}$ are different matrices and $N = n_x$ if $\mathbf{X} = \mathbf{Y}$, $$ E[\mathbf{X} \otimes \mathbf{Y}] = \frac{1}{N-1}\sum_{i=1}^N (\mathbf{X} \otimes \mathbf{Y})^{(i)} $$ If the matrices have been centered, $E[X_i] = E[Y_j] = 0$ and we have $$ S_{ij} = E[X_i\,Y_j] \quad \implies \mathbf{S}(\mathbf{X},\mathbf{Y}) = E[\mathbf{X} \otimes \mathbf{Y}]\,. $$

Pooled covariance

The correlations among the data in each sample are estimated using the sample covariance and an estimate of the population covariance is computed using the pooled covariance.

The pooled covariance is given by $$ \mathbf{S} = \frac{n_x}{n_x+n_y} \mathbf{S}(\mathbf{X},\mathbf{X}) + \frac{n_y}{n_x+n_y} \mathbf{S}(\mathbf{Y},\mathbf{Y}) $$
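Continuing the NumPy sketch from above (again, only a sketch; `Xc` and `Yc` are the centered samples):

```python
n_x, n_y = Xc.shape[0], Yc.shape[0]

# Sample covariances of each centered sample (divisor n - 1).
S_xx = Xc.T @ Xc / (n_x - 1)
S_yy = Yc.T @ Yc / (n_y - 1)

# Pooled covariance, weighted by the sample sizes as in the formula above.
S = (n_x * S_xx + n_y * S_yy) / (n_x + n_y)
```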

Covariance example

In the example, if $$ \mathbf{X} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \\ \mathbf{X}^{(3)} \end{bmatrix} = \begin{bmatrix} [X_1,X_2,X_3]^{(1)} \\ [X_1,X_2,X_3]^{(2)} \\ [X_1,X_2,X_3]^{(3)} \end{bmatrix} \quad \text{and} \quad \mathbf{Y} = \begin{bmatrix} \mathbf{Y}^{(1)} \\ \mathbf{Y}^{(2)} \end{bmatrix} = \begin{bmatrix} [Y_1,Y_2,Y_3]^{(1)} \\ [Y_1,Y_2,Y_3]^{(2)} \end{bmatrix} $$ Then, $$ (\mathbf{X} \otimes \mathbf{X})^{(1)} = (\mathbf{X}^{(1)})^T \cdot \mathbf{X}^{(1)} = \begin{bmatrix} 0.024891 & 0.046281 & 0.030511\\ 0.046281 & 0.086056 & 0.056731\\ 0.030511 & 0.056731 & 0.037399\end{bmatrix} $$ $$ (\mathbf{X} \otimes \mathbf{X})^{(2)} = (\mathbf{X}^{(2)})^T \cdot \mathbf{X}^{(2)} = \begin{bmatrix} 0.017446 & 0.013614 & 0.019642\\ 0.013614 & 0.010624 & 0.015328 \\ 0.019642 & 0.015328 & 0.022114 \end{bmatrix} $$ $$ (\mathbf{X} \otimes \mathbf{X})^{(3)} = (\mathbf{X}^{(3)})^T \cdot \mathbf{X}^{(3)} = \begin{bmatrix} 0.084013 & 0.114904 & 0.099157\\ 0.114904 & 0.157153 & 0.135616 \\ 0.099157 & 0.135616 & 0.117030 \end{bmatrix} $$ Adding these and dividing by $n_x - 1 = 2$ gives $$ \mathbf{S}(\mathbf{X},\mathbf{X}) = \begin{bmatrix} 0.063174 & 0.087400 & 0.074654 \\ 0.087400 & 0.126916 & 0.103837 \\ 0.074654 & 0.103837 & 0.088272 \end{bmatrix} $$ Note that $$ \mathbf{X}^T \cdot \mathbf{X} = (\mathbf{X} \otimes \mathbf{X})^{(1)} + (\mathbf{X} \otimes \mathbf{X})^{(2)} + (\mathbf{X} \otimes \mathbf{X})^{(3)} \,. $$ Similarly, $$ \mathbf{S}(\mathbf{Y},\mathbf{Y}) = \begin{bmatrix} 0.074066 &-0.080385 &-0.032943 \\ -0.080385 & 0.087243 & 0.035754 \\ -0.032943 & 0.035754 & 0.014653 \end{bmatrix} $$

Pooled covariance example

Then the pooled covariance is $$ \mathbf{S} = \frac{3}{5}\begin{bmatrix} 0.063174 & 0.087400 & 0.074654 \\ 0.087400 & 0.126916 & 0.103837 \\ 0.074654 & 0.103837 & 0.088272 \end{bmatrix} + \frac{2}{5}\begin{bmatrix} 0.074066 &-0.080385 &-0.032943 \\ -0.080385 & 0.087243 & 0.035754 \\ -0.032943 & 0.035754 & 0.014653 \end{bmatrix} $$ Therefore $$ \mathbf{S} = \begin{bmatrix} 0.067531 & 0.020286 & 0.031615 \\ 0.020286 & 0.111047 & 0.076604 \\ 0.031615 & 0.076604 & 0.058824 \end{bmatrix} $$

Inverse of the pooled covariance matrix

Now compute the inverse of $\mathbf{S}$ which clearly has to be a square matrix. This is the most computationally intensive part of the calculation and requires special attention for high dimensional data. $$ \mathbf{S}^{-1} = \begin{bmatrix} 84.030 & 155.460 & -247.610 \\ 155.460 & 376.188 & -573.445 \\ -247.610 & -573.445 & 896.850 \end{bmatrix} $$
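In NumPy this is a one-liner. As an aside, if the covariance matrix turns out to be singular (common with sparse, high-dimensional data such as TF-IDF vectors), the Moore-Penrose pseudo-inverse `np.linalg.pinv` is a common fallback:

```python
# Invert the pooled covariance; fall back to the pseudo-inverse if singular.
try:
    S_inv = np.linalg.inv(S)
except np.linalg.LinAlgError:
    S_inv = np.linalg.pinv(S)
```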

Mahalanobis distance

Now to find the distance (squared) between $\mathbf{X}^{(i)}$ and $\mathbf{Y}^{(j)}$ we use the relation $$ d^2_{ij} = (\mathbf{X}^{(i)} - \mathbf{Y}^{(j)})\cdot\mathbf{S}^{-1} \cdot(\mathbf{X}^{(i)} - \mathbf{Y}^{(j)})^T $$ First compute $\mathbf{Z} := \mathbf{X}^{(i)} - \mathbf{Y}^{(j)}$. Both vectors have to be the same length for this operation to work. Define $\mathbf{T} := \mathbf{S}^{-1}$. Then $$ d^2_{ij} = \mathbf{Z} \cdot \mathbf{T} \cdot \mathbf{Z}^T $$ In terms of indices, with $p$ as the length of the vectors (the number of variables), $$ d^2_{ij} = \sum_{k=1}^p \sum_{l=1}^p Z_k T_{kl} Z_l $$ That should give you the distances you seek.
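A direct sketch of this relation, assuming `x` and `y` are equal-length 1-D NumPy arrays and `S_inv` is the inverted pooled covariance from above:

```python
def mahalanobis_sq(x, y, S_inv):
    """Squared Mahalanobis distance between two equal-length vectors."""
    z = x - y                      # Z := X(i) - Y(j)
    return float(z @ S_inv @ z)    # Z . T . Z^T
```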

Mahalanobis distance example

Let us find the distance between the vectors $$ \mathbf{X}^{(2)} = \begin{bmatrix} 0.13208 & 0.10307 & 0.14871 \end{bmatrix} $$ and $$ \mathbf{Y}^{(1)} = \begin{bmatrix} -0.192439 & 0.208857 & 0.085594 \end{bmatrix} $$ We have, $$ \mathbf{Z} = \mathbf{X}^{(2)} - \mathbf{Y}^{(1)} = \begin{bmatrix} 0.324521 &-0.105784 & 0.063114 \end{bmatrix} $$ Therefore, $$ \begin{aligned} (d_{21})^2 &= \begin{bmatrix} 0.324521 &-0.105784 & 0.063114 \end{bmatrix} \begin{bmatrix} 84.030 & 155.460 & -247.610 \\ 155.460 & 376.188 & -573.445 \\ -247.610 & -573.445 & 896.850 \end{bmatrix}\begin{bmatrix} 0.324521 \\ -0.105784 \\ 0.063114 \end{bmatrix} \\ &= 3.4722 \,. \end{aligned} $$
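Running the sketch above on the centered rows reproduces this hand calculation up to rounding:

```python
d2 = mahalanobis_sq(Xc[1], Yc[0], S_inv)
print(d2)   # approximately 3.47, matching the worked example
```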

Samples with one observation each

Merge the two samples into one and just use the sample covariance of the merged data instead of the pooled covariance. The result will not make much sense, however, and a simpler distance measure is preferable.
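A short sketch shows why the merged-sample result is uninformative for two single-observation samples: the merged covariance has rank one, so a pseudo-inverse is required, and the squared distance then comes out as 2 for any pair of distinct vectors. (The vectors below are the 1×17 TF-IDF rows from the question.)

```python
x = np.array(5*[0.29] + 2*[0.18] + 6*[0.29] + 4*[0.0])    # Matrix 1
y = np.array(5*[0.0]  + 2*[0.29] + 6*[0.0]  + 4*[0.46])   # Matrix 2

M = np.vstack([x, y])                # merge the two samples
Mc = M - M.mean(axis=0)              # center
C = Mc.T @ Mc / (M.shape[0] - 1)     # sample covariance: rank 1, singular
d2 = (x - y) @ np.linalg.pinv(C) @ (x - y)
print(d2)   # 2.0 -- the same for ANY two distinct vectors
```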

  • Thank you very much for the answer, but I still don't get it :( I edited my question and replied to you. – MonsterMMORPG Dec 01 '15 at 07:15
  • I completely updated my question, thank you. – MonsterMMORPG Dec 01 '15 at 21:26
  • 1
    Even though the new problem is different, the approach discussed in my answer can be applied provided your documents come from the same distribution and a mean of the distribution is known. – Biswajit Banerjee Dec 01 '15 at 22:37
  • Thank you for the reply. Can you update your answer? It is just another mess of equations that is available on the internet a million times over, without any actual hand-worked solution to understand. – MonsterMMORPG Dec 04 '15 at 11:47
  • 1
    Unfortunately, if you don't understand the equations I can't help you. If you pay a consultant they can work out the details for you. – Biswajit Banerjee Dec 04 '15 at 21:22
  • If we could understand it from equations, we wouldn't ask here, since there are millions of equations on the web :) all ready to be read. – MonsterMMORPG Mar 30 '16 at 22:10