Mahalanobis distance, when used for classification, typically assumes a multivariate normal distribution, and the squared distances from the centroid should then follow a $\chi^2$ distribution with $d$ degrees of freedom (where $d$ is the number of dimensions/features). We can then calculate the probability that a new data point belongs to the set from its Mahalanobis distance.
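For concreteness, here is a minimal sketch of that standard normal-theory test in Python. The names `X` (the training points for one class) and `x_new` are hypothetical placeholders:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # n = 500 points, d = 3 features (toy data)
x_new = np.array([0.5, -1.0, 2.0])

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

# Squared Mahalanobis distance of the new point from the class centroid
diff = x_new - mu
d2 = diff @ np.linalg.solve(cov, diff)

# Under multivariate normality, d2 ~ chi^2 with d degrees of freedom;
# a small tail probability flags x_new as unlikely to belong to this class
d = X.shape[1]
p_value = chi2.sf(d2, df=d)
print(d2, p_value)
```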
I have data sets that do not follow a multivariate normal distribution ($d \approx 1000$). In theory, each feature should follow a Poisson distribution; empirically this seems to be the case for many ($\approx 200$) of the features, and those that do not are in the noise and can be removed from the analysis. How can I classify new points against this data?
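One simple way to do that empirical screening (a hedged sketch, not necessarily the best test): for a Poisson variable the variance equals the mean, so the dispersion index variance/mean should be close to 1. Here `X` is a hypothetical $(n, d)$ count matrix and the threshold is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(1000, 6)).astype(float)  # toy count data

means = X.mean(axis=0)
variances = X.var(axis=0, ddof=1)
dispersion = variances / means        # ~1 for Poisson-like features

# Keep features whose dispersion index is near 1 (tolerance chosen arbitrarily)
keep = np.abs(dispersion - 1.0) < 0.3
print(dispersion, keep)
```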
I guess there are two components:
- What is an appropriate "Mahalanobis distance" formula for this data (i.e. for a multivariate Poisson distribution)? Is there a generalization of the distance to other distributions? (One candidate under an independence assumption is sketched after this list.)
- Whether I use the usual Mahalanobis distance or another formulation, what should the distribution of these distances be? Is there a different way to do the hypothesis test?
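Here is the sketch referenced in the first bullet. Under the (strong) assumption of *independent* Poisson features with per-class means $\lambda_i$, the covariance matrix is approximately $\operatorname{diag}(\lambda)$, since mean equals variance for a Poisson variable. The Mahalanobis quadratic form then reduces to the Pearson statistic $\sum_i (x_i - \lambda_i)^2 / \lambda_i$, which is roughly $\chi^2_d$ when the $\lambda_i$ are not too small. All names below are placeholders:

```python
import numpy as np
from scipy.stats import chi2

def poisson_pearson_distance2(x_new, lam):
    """Squared 'Mahalanobis-like' distance assuming independent Poisson features.

    Equivalent to the Mahalanobis form with covariance diag(lam), since a
    Poisson variable's variance equals its mean.
    """
    lam = np.asarray(lam, dtype=float)
    return float(np.sum((x_new - lam) ** 2 / lam))

rng = np.random.default_rng(0)
lam = rng.uniform(2.0, 20.0, size=200)   # per-feature class means (estimated in practice)
x_new = rng.poisson(lam=lam).astype(float)

d2 = poisson_pearson_distance2(x_new, lam)
p_value = chi2.sf(d2, df=lam.size)       # approximate; relies on lam_i not being tiny
print(d2, p_value)
```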
Alternatively...
The number of known data points $n$ in each class varies widely, from $n=1$ (too few; I'll determine a minimum empirically) to around $n=6000$. The Mahalanobis distance scales with $n$, so distances from one model/class cannot be compared directly with those from another. When the data are normally distributed, the $\chi^2$ test provides a way to compare distances across models (in addition to providing critical values and probabilities). If there is another way to compare "Mahalanobis-like" distances directly, even one that does not provide probabilities, I could work with that.
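For reference, a hedged sketch of the cross-class comparison described above: instead of comparing raw squared distances (which depend on each class's fitted model), convert each one to a tail probability under its reference distribution and compare those. This assumes the $\chi^2_d$ approximation is adequate for every class, which the question is precisely doubting for non-normal data:

```python
import numpy as np
from scipy.stats import chi2

def class_p_value(x_new, mu, cov):
    """Tail probability of x_new's squared Mahalanobis distance under chi^2_d."""
    diff = x_new - mu
    d2 = diff @ np.linalg.solve(cov, diff)
    return chi2.sf(d2, df=mu.size)

rng = np.random.default_rng(0)
classes = {}
for name, n in [("A", 50), ("B", 6000)]:   # wildly different class sizes
    X = rng.normal(loc=(0.0 if name == "A" else 3.0), size=(n, 4))
    classes[name] = (X.mean(axis=0), np.cov(X, rowvar=False))

x_new = rng.normal(loc=0.2, size=4)
scores = {name: class_p_value(x_new, mu, cov) for name, (mu, cov) in classes.items()}
best = max(scores, key=scores.get)          # assign to the class with the largest p-value
print(scores, best)
```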