I am analyzing a dataset with 5 factors (Y1, Y2, Y3, Y4, Y5).

ID          Y1    Y2    Y3    Y4   Y5   
1           5     1     2     9    40     
2           6     1    17     9    49     
3           5     1     6    10    25     
4           5     1    14     6    69     
5           7     1    19    15    66     
6           5     1     6     7    24     
.           .     .     .     .     .
.           .     .     .     .     .
300         6     1     2    12    28 

The mean and standard deviation of each factor (Y_i) are as follows:

avg1   avg2   avg3   avg4   avg5   
5.39   1.02   11.8   9.61   42.1  

sd1   sd2    sd3   sd4   sd5
1.22  0.145  10.1  3.61  14.5

I have two new observations

   ID         Y1    Y2    Y3    Y4   Y5   
   *          6     1    18     7    36 
   **         3     5     1     3    37

What statistical method should I use if my goal is to determine which of these two observations is closer, or most similar, to the sample? Thanks.


While this might be a terrible assumption for your case, if you are willing to assume that your data are multivariate normal, then you might be looking for the Mahalanobis distance.

$$ d(\vec x) = \sqrt{ (\vec x - \vec \mu)^TS^{-1}(\vec x - \vec \mu) } $$

Each of your two five-dimensional points would be an $\vec x$. The multivariate mean is $\vec\mu$, so $\vec\mu = (5.39, 1.02, 11.8, 9.61, 42.1 )^T$. $S$ is the covariance matrix of your five $Y$ variables.
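To make this concrete, here is a minimal sketch of the computation in NumPy. Since the full 300-row dataset isn't shown, it simulates stand-in data with roughly the reported means and standard deviations (and no correlation, which is an assumption); with the real data, you would replace `data` with your actual observations.

```python
import numpy as np

# Stand-in for the real 300 x 5 dataset (assumed independent normals,
# using the means and sds reported in the question).
rng = np.random.default_rng(0)
data = rng.normal(loc=[5.39, 1.02, 11.8, 9.61, 42.1],
                  scale=[1.22, 0.145, 10.1, 3.61, 14.5],
                  size=(300, 5))

mu = data.mean(axis=0)                              # sample mean vector
S_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix

def mahalanobis(x, mu, S_inv):
    """d(x) = sqrt((x - mu)^T S^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

# The two new observations from the question
for label, x in [("*",  np.array([6, 1, 18, 7, 36])),
                 ("**", np.array([3, 5, 1, 3, 37]))]:
    print(label, round(mahalanobis(x, mu, S_inv), 2))
```

The observation with the smaller distance is the one "closer" to the sample. Note that `**` has Y2 = 5 while the sample's Y2 has standard deviation 0.145, so it should come out far more atypical than `*`.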

The intuition is as follows:

In a univariate Gaussian, we know that $\sim68\%$ of the density is contained within one standard deviation of the mean, $\sim95\%$ is within two standard deviations of the mean, etc. In that sense, the standard deviation gives a standardized measurement: no matter how big or small the standard deviation, you know that being $0.1\sigma$ from the mean is a small deviation. In the univariate case, the Mahalanobis distance is exactly the magnitude of the z-score.

In the univariate case, $S$ is just the variance, a single number. Likewise, $(x - \mu)$ is a number, so the transpose is not meaningful, and I am content to drop it (just for the univariate case). All of these quantities are just numbers, so we are free to multiply them in whatever order we wish. Therefore:

$$ d(x) = \sqrt{ (x - \mu)\,\mathrm{Var}(X)^{-1}(x - \mu) } = \sqrt{ \frac{(x-\mu)^2}{\mathrm{Var}(X)} } = \frac{|x - \mu|}{\sigma} $$

This is exactly the magnitude of the z-score!
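This reduction is easy to check numerically. A minimal sketch, using Y1's reported mean and standard deviation from the question as the univariate distribution:

```python
import numpy as np

mu, sigma = 5.39, 1.22   # Y1's mean and sd from the question
x = 6.0                  # a hypothetical univariate observation

# Mahalanobis distance with a 1x1 "covariance matrix" [[sigma^2]]
S_inv = np.array([[1 / sigma**2]])
d = np.sqrt((x - mu) * S_inv[0, 0] * (x - mu))

# Absolute z-score
z = abs(x - mu) / sigma

print(np.isclose(d, z))  # the two formulas agree
```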

In that sense, Mahalanobis distance is somewhat of a generalization of z-score. Mahalanobis distance measures how far you are from the mainstream of the density, accounting for the fact that the multivariate density might be longer in some dimensions than others (different univariate variances) and rotated (due to correlations).

Note that the rule about how much density lies within a distance of $1$ or $2$ (or any other number) is not as simple in the multivariate case; it depends on the dimension of the vector.
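This dimension dependence can be made precise: for multivariate normal data, the squared Mahalanobis distance follows a chi-squared distribution with $p$ degrees of freedom, where $p$ is the dimension. A short sketch of how the mass within a fixed distance shrinks as $p$ grows:

```python
from scipy.stats import chi2

# P(d <= r) = P(d^2 <= r^2) = chi2.cdf(r^2, df=p) for MVN data in p dimensions
for p in (1, 2, 5):
    within1 = chi2.cdf(1**2, df=p)   # probability of lying within distance 1
    within2 = chi2.cdf(2**2, df=p)   # probability of lying within distance 2
    print(f"p={p}: P(d<=1)={within1:.3f}, P(d<=2)={within2:.3f}")
```

For $p = 1$ this recovers the familiar $\sim68\%$ / $\sim95\%$ figures; for $p = 5$ (your case) far less mass sits within the same radii.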

Dave
  • Thanks Dave, I was thinking about this approach but did not know how to implement it; your explanation clears some cobwebs in my mind. For the multivariate normality assumption, I am planning to check for normality at the univariate level for each Y using QQ plots – Ahir Bhairav Orai Mar 04 '22 at 06:32
  • QQ plots are a nice visualization to check univariate normality. It’s also important to have joint normality, and I don’t have a good solution to checking that in five dimensions. That could be a nice topic for a new question! – Dave Mar 04 '22 at 14:14