It is stated here [1] that ROBPCA can be used to detect outliers in multivariate data. After reading the manual ([2], page 12: "multivariate normal model etc."), I think the ROBPCA method is also designed for outlier detection in normally distributed data, but I would like someone to confirm this.
In my case, I have a very large multivariate dataset containing more than 110,000 observations. Most of the features are normalized frequency measures (number of something per minute). These features follow an approximately Poisson distribution, with a great many zeros.
Anyway, I tried the PcaHubert function in the rrcov package, which flags potential outliers, and it flagged around 4000 observations. However, PcaHubert gives a warning message:
Warning message:
In covMcd(x = x, alpha = alpha, nsamp = nsamp, seed = seed, trace = trace, :
The covariance matrix has become singular during the iterations of the MCD algorithm.
There are 84899 observations (in the entire dataset of 110556 obs.)
lying on the hyperplane with equation a_1*(x_i1 - m_1) + ... + a_p*(x_ip - m_p) =0
with (m_1,...,m_p) the mean of these observations and coefficients a_i from the
vector a <- c(0.1075804, 0.9790165, -0.0639553, -0.1608192)
I guess the reason is that many observations share the same value (0), which leads to singular matrices. I don't know whether the outlier detection results can be trusted in this case. Is ROBPCA perhaps unsuitable for zero-inflated data?
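One way to check this guess is sketched below on simulated data rather than my real dataset. In the warning, 84899 of 110556 observations (about 77%) lie on the hyperplane. The sketch projects every row onto the coefficient vector from the warning and compares the share of rows on the most common hyperplane with the share of all-zero rows. The simulated matrix, its 75% zero rate, and the Poisson mean are hypothetical stand-ins. It also assumes the four coefficients refer to four original features; if the data were first projected onto a lower-dimensional subspace, they would refer to those projected coordinates instead.

```r
# Simulated stand-in for the real data (hypothetical): 4 zero-inflated count features.
set.seed(1)
n <- 10000
p <- 4
is_active <- rbinom(n, 1, 0.25)                      # about 25% of rows have any activity
x <- matrix(rpois(n * p, lambda = 3), n, p) * is_active  # inactive rows become all zeros

# Coefficients copied from the warning message.
a <- c(0.1075804, 0.9790165, -0.0639553, -0.1608192)

# Project every row onto a; rows on the same hyperplane share the same projection.
proj <- drop(x %*% a)
modal_value <- as.numeric(names(which.max(table(round(proj, 8)))))
on_plane <- abs(proj - modal_value) < 1e-8

frac_on_plane <- mean(on_plane)               # share of rows on the most common hyperplane
frac_all_zero <- mean(rowSums(x != 0) == 0)   # share of rows that are zero in every feature

print(c(on_plane = frac_on_plane, all_zero = frac_all_zero))
```

If the two fractions agree on the real data as well, the hyperplane in the warning is simply the set of rows that are zero in every feature, and the singularity comes from those ties rather than from anything specific to the flagged observations.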
I think I have two options now:
- Use an outlier detection method that is known to work for large multivariate Poisson data. I don't know which methods serve this purpose.
- Transform the Poisson data toward normality. This is an "old friend" in statistics. However, all of the Poisson-like features contain many zeros, which is problematic for the square-root or log transformation. I don't know the best way to transform such a large zero-inflated dataset.
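To make the second option concrete, here is a minimal sketch of two standard transformations that stay finite at zero: the Anscombe transform 2*sqrt(x + 3/8) and log1p, i.e. log(1 + x). The simulated vector, its Poisson mean, and the 60% share of extra zeros are hypothetical, and the sketch assumes raw counts rather than my per-minute rates. Note what it shows: both transformations are monotone, so the spike at zero is only relocated, not removed.

```r
# Hypothetical zero-inflated counts: Poisson values with extra zeros mixed in.
set.seed(42)
counts <- rpois(1000, lambda = 4)
counts[runif(1000) < 0.6] <- 0                # force about 60% of the values to zero

# Anscombe variance-stabilizing transform for Poisson counts (finite at zero).
anscombe <- function(x) 2 * sqrt(x + 3 / 8)

x_ans <- anscombe(counts)
x_log <- log1p(counts)                        # log(1 + x), also finite at zero

# The share of observations sitting at the transformed "zero" is unchanged.
print(c(raw      = mean(counts == 0),
        anscombe = mean(x_ans == anscombe(0)),
        log1p    = mean(x_log == 0)))
```

All three printed proportions are identical, which is why a transformation alone may not be enough for data with this many zeros.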
To summarize, my questions are as follows:
- Was I right to use ROBPCA to detect outliers in the data I described, given the warning message and the non-normality of the data?
- If not, which of the two options above should I adopt, and which methods would address my problem under each option?
References:
[1] Identifying outlier data in high-dimensional settings
[2] http://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf