I have ongoing research on clustering, and to assign each data point to a specific cluster I use the Mahalanobis distance criterion, which relies heavily on the covariance matrix of the cluster. As is well known, if the cardinality of a cluster is less than or equal to its dimensionality, its covariance matrix is singular, so the Mahalanobis distance is no longer reliable!
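
To make the problem concrete, here is a minimal numpy sketch (just an illustration with made-up toy data, not part of my actual method) showing that the sample covariance matrix of a cluster with n <= p points is rank-deficient, so the inverse needed by the Mahalanobis distance does not exist:

```python
import numpy as np

# Hypothetical toy "cluster": n = 4 points in p = 5 dimensions (n <= p),
# so the sample covariance matrix has rank at most n - 1 = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))

cov = np.cov(X, rowvar=False)           # 5 x 5 sample covariance
print(np.linalg.matrix_rank(cov))       # prints 3 (< 5), i.e. singular

# The Mahalanobis distance needs the inverse of this matrix; for a
# rank-deficient estimate the inverse is undefined (the condition
# number is enormous or infinite).
print(np.linalg.cond(cov))
```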

But the thing is, I've heard that such problems can be overcome by using Robust PCA methods. As mentioned in the ROBPCA method, in the third paragraph of its Introduction section:

> The goal of robust PCA methods is to obtain principal components that are not influenced much by outliers

I read this as saying that Robust PCA methods are only useful when we want to extract the principal components supported by the genuine data points rather than by grossly corrupted observations. So it would not be correct to say that we can use them to overcome the Mahalanobis-distance issue caused by the singularity of the cluster's covariance matrix!

Is this correct?

Regards ...

SANN
  • Not experienced in this, so I'll just write a comment: PCA finds the largest variance among linear combinations of the input (PCs). However, since the variance is proportional to the sum of squared distances from the mean, an outlier *squared* may cause you to find PCs that only have the highest variance because of these outliers. So I agree with you that it should not solve the singularity problem, which can still occur e.g. if the specified number of clusters is larger than the actual number of underlying clusters. – Frans Rodenburg Aug 04 '18 at 08:27
  • @Frans Rodenburg: Yes, you're right about the case where the number of discovered clusters is larger than the actual number of underlying clusters. But there is another reference, ["Robust principal component analysis?"](https://arxiv.org/pdf/0912.3599), that addresses this issue and leaves me confused about whether it can be used to overcome the singularity problem or not. Thank you anyway ... – SANN Aug 04 '18 at 08:44
  • I think I can answer b/c I work in that precise area, but I cannot make heads or tails of your question. Can you try to reformulate what your question is? What problem are you trying to solve? – user603 Aug 04 '18 at 11:01
  • @user603: The question is that I have a scalable clustering problem that uses the Mahalanobis distance criterion to assign each object to a cluster (you can find more info about my research [here](https://stackoverflow.com/questions/51492439/is-isolation-forest-iforest-a-method-that-could-be-directly-applied-to-big-dat)). The problem occurs when I try to assign a point to a cluster whose covariance matrix is singular, and I had heard that Robust PCA would be helpful! Thanks a lot ... – SANN Aug 04 '18 at 12:23
  • @user603 what exactly is "robust statistics"? – mnm Aug 04 '18 at 13:57
  • The solution to your primary problem (computing the MD of singular data) is addressed in many places here (use the search button). See [this](https://stats.stackexchange.com/a/49914/603) one, for example. Also, many clustering algorithms have deeper, more fundamental issues with singular data than just computing MDs (unbounded likelihood...). But this has nothing to do with robust statistics. Because of this, I would point out that robust statistics and clustering algorithms (or vice versa) solve intrinsically different issues (spoons and forks come to mind). Hope this helps! – user603 Aug 04 '18 at 14:03
  • @nilāmbara: robust statistics is a collection of methods designed to fit the majority of the data (as opposed to all of it, or, in the case of clustering, arbitrarily sized chunks of it). The [wiki article](https://en.wikipedia.org/wiki/Robust_statistics) is a good place to start. Think of counterparts to the classical parametric statistics toolbox (regression, MD, ...) `that are not influenced much by outliers` (one example is the relationship between trimmed means and classical means). – user603 Aug 04 '18 at 14:07
  • @user603: That was a useful comment! thanks a lot ... – SANN Aug 04 '18 at 15:44
  • Methods that are robust to outliers won't help if you *already* have too little data! Singularity is only the symptom (that can be trivially avoided by adding a tiny constant to the diagonal), the cause is an underspecified problem that allows for perfect overfitting. Mahalanobis **cannot** work with too little data. – Has QUIT--Anony-Mousse Aug 05 '18 at 07:38
  • @Anony-Mousse: Yes, you're right that MD is not viable with too little data, and that traditional MD-based outlier detection methods may become degenerate. In my research I simply disregard clusters with n <= p, without resorting to the pseudo-inverse or anything like that, and wait for them to grow as more data arrive, because I'm afraid the final results might otherwise become unreliable. But using the pseudo-inverse of the covariance matrix is still an option (a small sketch of the diagonal-regularization and pseudo-inverse workarounds appears after these comments). ;-) – SANN Aug 05 '18 at 16:37
  • You can use the k nearest neighbors of each point, with k = 3p; then you'll always have enough points, at least for a rough estimate (you may want to look at the COP outlier method; there is a small sketch of this after these comments as well). But still, I would rather not use local Mahalanobis in this case. – Has QUIT--Anony-Mousse Aug 05 '18 at 17:59
  • @Anony-Mousse: Thank you for your advice on the issue, but the problem is that the whole method is based on the MD criterion, and I cannot simply drop it for one small part of the proposed approach! – SANN Aug 05 '18 at 18:19
  • Well, MD is sensitive to the quality of PCA, and PCA is sensitive to outliers, so the whole method may be problematic. Existing work may provide some ideas for these problems. – Has QUIT--Anony-Mousse Aug 05 '18 at 20:14
  • Plus, MD based on clusters is essentially the EM-outlier approach, where you use Gaussian mixture model probabilities as outlier scores. It works well on Gaussian toy data, but much less well on real data. Just to point you towards another piece of prior work. – Has QUIT--Anony-Mousse Aug 05 '18 at 20:16
  • @Anony-Mousse: You may not believe it, but almost all of the real data sets I used in my research had nearly Gaussian clusters! You can test it yourself: cluster a real data set according to the prior knowledge available about its intrinsic structure (such as the number of classes), apply PCA to each cluster, and plot the histogram of each feature of the class separately; you will see that the distribution of the class is almost normal. It's really awesome ;-) ... – SANN Aug 05 '18 at 20:26
  • That is an artifact of projecting with PCA. Since it mixes all dimensions, the resulting factors tend to appear more normal. So at that point, you have already lost this information. – Has QUIT--Anony-Mousse Aug 05 '18 at 21:23
  • More formally, if you have uniform cube data with many dimensions, PCA is expected to produce components that approach normal distributions because of the CLT. So the PCA visualization may be misleading. – Has QUIT--Anony-Mousse Aug 05 '18 at 21:30
  • @Anony-Mousse: Thank you for pointing out this phenomenon! I just created a uniform random matrix, applied PCA to it, and checked the histograms (a sketch of that experiment is included after these comments); the result was disappointing. But what do you mean by CLT? I didn't find anything useful in a Google search. – SANN Aug 06 '18 at 03:32
  • And as you know, PCA doesn't change the intrinsic manifold of the data, so the MD does not necessarily become unreliable. Even outlier detection methods based on MD work well for data containing clusters with convex shapes. – SANN Aug 06 '18 at 03:42
  • Central Limit Theorem. Yes, PCA is invertible, so from this point of view, no information is lost. But as you can see in the unit cube example, the original axes can be valuable, and this information is encoded in the projection; it cannot be recovered easily from the projected data. Same holds for MD: at least p points are completely encoded in the weight matrix, and the MDs do not provide differentiation of these points. – Has QUIT--Anony-Mousse Aug 06 '18 at 05:13
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/81199/discussion-between-sann-and-anony-mousse). – SANN Aug 06 '18 at 05:29
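
Following up on the comments above, here is a minimal sketch (my own illustration, not taken from the linked answer) of the two workarounds mentioned there for a singular cluster covariance: adding a tiny constant to the diagonal, and using the Moore-Penrose pseudo-inverse. Both make a Mahalanobis-style distance computable, but, as noted above, neither fixes the underlying shortage of data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))            # toy cluster with n = 4 <= p = 5
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)          # singular sample covariance

x = rng.normal(size=5)                 # a point to assign/score
d = x - mu

# Workaround 1: ridge regularization ("tiny constant on the diagonal").
eps = 1e-6
md_ridge = np.sqrt(d @ np.linalg.inv(cov + eps * np.eye(5)) @ d)

# Workaround 2: Moore-Penrose pseudo-inverse of the covariance matrix.
md_pinv = np.sqrt(d @ np.linalg.pinv(cov) @ d)

print(md_ridge, md_pinv)
```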
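
The k-nearest-neighbor suggestion can be sketched the same way (again an illustration with made-up data, assuming scipy's cKDTree): estimate a local covariance from the k = 3p nearest neighbors of a point, so the estimate is always based on more than p observations and has full rank:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical data set, p = 5
k = 3 * X.shape[1]                     # k = 3p = 15 neighbors

tree = cKDTree(X)
_, idx = tree.query(X[0], k=k + 1)     # +1: the query point is its own neighbor
local_cov = np.cov(X[idx[1:]], rowvar=False)

print(np.linalg.matrix_rank(local_cov))   # full rank (5), since k > p
```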
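
Finally, a sketch of the uniform-cube experiment discussed above, using plain numpy (PCA via SVD): each principal component is a linear combination of many independent uniform coordinates, so by the Central Limit Theorem its distribution looks approximately normal even though every original axis is uniform:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10_000, 50))       # uniform "cube" data

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # PCA via SVD
scores = Xc @ Vt.T                                   # projections onto the PCs

def excess_kurtosis(v):
    """Quick helper: about 0 for a normal variable, about -1.2 for a uniform one."""
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

print(excess_kurtosis(X[:, 0]))       # close to -1.2: the raw axis is uniform
print(excess_kurtosis(scores[:, 0]))  # noticeably closer to 0: looks near-normal
```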

0 Answers