
It is stated in [1] that ROBPCA can be used to detect outliers in multivariate data. After reading the package vignette ([2], page 12: "multivariate normal model etc."), I gather that the ROBPCA method is designed for outlier detection in (approximately) normally distributed data, but I would like confirmation from someone.

In my case, I have a large multivariate dataset containing more than 110,000 observations. Most of the features are normalized frequency measures (counts of some event per minute). These features follow an approximately Poisson distribution, with a great many zeros.

Anyway, I tried the PcaHubert function in the rrcov package, which flags potential outliers; it flagged around 4000 observations. However, PcaHubert produced a warning:

Warning message:
In covMcd(x = x, alpha = alpha, nsamp = nsamp, seed = seed, trace = trace,  :
The covariance matrix has become singular during the iterations of the MCD algorithm.
There are 84899 observations (in the entire dataset of 110556 obs.) 
lying on the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) = 0 
with (m_1,...,m_p) the mean of these observations and coefficients a_i from the 
vector a <- c(0.1075804, 0.9790165, -0.0639553, -0.1608192)  

I guess the reason is that many observations share the same value (0), which makes the covariance matrix singular. I don't know whether the outlier detection results are trustworthy in this case. Perhaps ROBPCA is simply not suitable for zero-inflated data?
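For reference, a minimal sketch of the call (an editorial addition; `X` is a placeholder name for the data matrix):

```r
library(rrcov)

## ROBPCA as implemented in rrcov; alpha = 0.75 is the default coverage
pca <- PcaHubert(X, alpha = 0.75)

## The 'flag' slot marks regular observations; outliers receive flag FALSE/0
sum(!pca@flag)   # 3317 in my run, hence "around 4000 outliers"
```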

I think I have two options now:

  1. Use an outlier detection method that is known to work for large multivariate Poisson data. I don't know which methods serve this purpose.
  2. Transform the Poisson data toward normality, an "old friend" in statistics. There are many zeros in all of the features, which is problematic for a square-root or log transformation. I don't know the best way to transform this huge zero-inflated dataset (a transformation sketch follows this list).
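As one illustration of option 2 (an editorial sketch, not a recommendation from the thread): for approximately Poisson counts, the Anscombe transform $2\sqrt{x + 3/8}$ is a standard variance-stabilizing choice and, unlike the logarithm, is well defined at zero. Assuming the features are the columns of a data frame `X` (a placeholder name):

```r
## Anscombe variance-stabilizing transform for approximately Poisson counts;
## unlike log(x), it is well defined at x = 0
anscombe <- function(x) 2 * sqrt(x + 3/8)

X_vst <- as.data.frame(lapply(X, anscombe))

## Caveat: a monotone transform keeps tied values tied, so it will not by
## itself remove the exact singularity reported by covMcd
```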

In summary, let me formulate my questions as follows:

  1. Was I right to use ROBPCA to detect outliers in the data I described, given the warning messages and the non-normality of the data?
  2. If not, which of the two options above should I adopt, and which methods would address my problem in each case?

Reference:

[1] Identifying outlier data in high-dimensional settings

[2] http://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf

nan
  • First, ROBPCA found 25660 outliers, not 4000 as you claim. This makes me suspect you don't really understand what you are doing. Concerning the error message you get: do you understand it? E.g., have you read section 6, p. 15 of [this](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.5870&rep=rep1&type=pdf) paper? – user603 May 27 '14 at 12:14
  • @user603 Hi, thanks for your reply. I haven't read the paper you mentioned, but I will do so now. Regarding the number of outliers, I don't know how you came up with 25660. PcaHubert has a "flag" attribute, according to http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/PcaHubert-class.html, which indicates whether or not each observation is an outlier. I counted the number of FALSE flags and got 3317, so I said it finds around 4000 outliers. How is 25660 calculated, then? – nan May 27 '14 at 12:46
  • @user603 I see: you get 25667 by subtracting 84899 from 110556. The figure of around 4000 comes from the flag attribute. I'll take a look at the paper to see how this is explained. – nan May 27 '14 at 12:56
  • if the message you posted above is correct and corresponds to the output of the ROBPCA function, then it means FMCD has identified 25660 data points that are literally *arbitrarily* far away from the other 84899 observations. You should first identify these observations and set them aside (a sketch of locating them follows this comment thread). – user603 May 27 '14 at 13:31
  • With a large number of outliers, I would suspect the model is not very useful. ROBPCA assumes linearity. Inspecting the residuals of the outliers, i.e., where they lie, may help in determining a useful data transformation. There are many transformations, such as square and square root, to try in order to see whether the linear model applies better in those circumstances. – Carl Jan 23 '17 at 16:11
  • @user603 I do not agree. It is not the explained observations that are problematic; it is the 25667 unexplained ones that require attention. There is usually a better model, but if it is not looked for, it will not be found. The other possibility is corrupted data. So, is the data corrupted? – Carl Apr 13 '18 at 22:59
  • @Carl: geometrically, this situation is identical to the following two-dimensional setting. We have a sample $(x_i, y_i)_{i=1}^{110556}$. 84899 of those lie *exactly* on a line $y_i = \alpha + \beta x_i$, and 25667 points are off this line (have $|e_i|>0$ w.r.t. that line). We all agree it is a sign something is off with the data. I am not sure I understand how transforming the space $(x_i, y_i)$ would address that. – user603 Apr 15 '18 at 09:41
  • @user603 Actually, this may just be a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix). BTW, for the OP, $\sqrt{0}=0$, so zeros do not affect a square-root transformation. I suggest better sparse-data treatment. – Carl Apr 15 '18 at 15:55
  • @Carl: We'll never know. It's certainly true that sparsity is one particular case of what I describe (corresponding to $\alpha=\beta=0$). In the general case ($\alpha$ and $\beta$ not both 0), transforming the data will prevent you from identifying the subspace (the values of $\alpha$ and $\beta$) that ROBPCA found on the raw data and returns as part of its output. – user603 Apr 15 '18 at 16:02
  • @user603 Without knowing a lot more about the data, discussion is futile. However, the OP does say "nearly Poisson distribution, with many many zeros," which suggests taking the square root to linearize deviation and/or treating the data as sparse. – Carl Apr 15 '18 at 16:09
  • @Carl: you are right on both counts (I had missed the latter point). But if sparsity is the issue (e.g., $\alpha=\beta=0$), then taking the square root will not affect the outcome: the 25667 points that are off this line before the transformation will remain off it *after* the transformation, and likewise the 84899 that are on the line will remain on it. – user603 Apr 15 '18 at 16:23
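An editorial sketch of the "set them aside" step suggested above (not from the thread): observations on the reported hyperplane $a'(x - m) = 0$ all share the same projection value $a'x$, so they can be located by finding the modal value of that projection. This assumes the warning's coefficients refer to the columns of the matrix passed to covMcd, here called `X` (a placeholder); the rounding precision and tolerance are arbitrary choices:

```r
## Coefficients copied from the covMcd warning above
a <- c(0.1075804, 0.9790165, -0.0639553, -0.1608192)

## Points satisfying a'(x - m) = 0 share a single value of the projection a'x
s  <- as.matrix(X) %*% a
s0 <- as.numeric(names(which.max(table(round(s, 6)))))  # modal projection value
on_plane <- abs(s - s0) < 1e-6   # tolerance is an arbitrary choice
table(on_plane)                  # expect roughly 84899 TRUE if the assumption holds
```

Setting the on-plane rows aside (or analyzing the two groups separately) then makes the remaining analysis meaningful.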

1 Answer


Try adding very small noise to the data. It will make the matrix non-singular, and covMcd will then work.
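A minimal sketch of this idea (editorial; the noise scale is an arbitrary choice, and `X` is a placeholder for the numeric data matrix):

```r
library(rrcov)

set.seed(42)                                # reproducible noise
eps <- 1e-6 * sd(as.vector(as.matrix(X)))   # tiny relative to the data; arbitrary
X_noisy <- X + matrix(rnorm(nrow(X) * ncol(X), sd = eps),
                      nrow = nrow(X))

pca <- PcaHubert(X_noisy)   # covMcd no longer encounters an exact hyperplane
```

As the comment below points out, the resulting outlyingness values then depend on the arbitrary noise scale, so the flags should be treated with caution.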

curio17
    I think the question relates more to outlier detection than to dealing with the singular matrices. – Michael R. Chernick Jun 02 '17 at 04:13
  • I do not think this is a good idea. Consider the univariate case: the MCD outlyingness reduces to $|x_i - \mathrm{ave}_{\mathrm{MCD}}(x)| / \mathrm{sd}_{\mathrm{MCD}}(x)$, and here $\mathrm{sd}_{\mathrm{MCD}}(x)$ is 0. By adding noise $\varepsilon_i$ to the data, you set the 'new' denominator to an arbitrary value $\mathrm{sd}_{\mathrm{MCD}}(\varepsilon)$, which makes the MCD outlyingness take on an arbitrary value as well. Why would that be useful? – user603 Nov 16 '17 at 00:24