
I need to fit a generalized Gaussian distribution to a 7-dimensional cloud of points that contains a significant number of high-leverage outliers. Do you know a good R package for this job?

kjetil b halvorsen
    You will find links to at least four R packages for identifying multivariate outliers in the replies to a similar question at http://stats.stackexchange.com/questions/213/what-is-the-best-way-to-identify-outliers-in-multivariate-data. That might be a good start. – whuber Jul 07 '11 at 13:51
  • Maybe the question is eluding me, but as far as fitting a multivariate Gaussian distribution, why not just use the empirical mean and SD as the MLE? You can then focus on diagnostic statistics if there are high influence/leverage points. – AdamO Feb 09 '18 at 15:22
  • I think the question is about using something like a Huberized loss function to estimate the parameters. I'm not an expert, but perhaps using Huber loss to fit the mean would be a start. – Tom Dietterich Apr 20 '20 at 22:00
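To make Tom Dietterich's comment concrete, here is a minimal base-R sketch of a Huberized mean: residuals inside a band get full weight, residuals outside it are down-weighted, and the estimate is iterated to convergence. The function name and tuning constant `k = 1.345` (the usual choice for high efficiency at the Gaussian) are illustrative; for the full 7-dimensional problem a proper robust location/scatter estimator such as `MASS::cov.rob()` would be the real tool.

```r
# Coordinate-wise Huberized mean via iteratively reweighted least squares.
# Illustrative sketch only -- not a multivariate robust estimator.
huber_mean <- function(x, k = 1.345, tol = 1e-8, maxit = 100) {
  mu <- median(x)
  s  <- mad(x)                    # robust scale estimate
  if (s == 0) return(mu)
  for (i in seq_len(maxit)) {
    r <- (x - mu) / s             # standardized residuals
    w <- pmin(1, k / abs(r))      # Huber weights: 1 inside band, k/|r| outside
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol) break
    mu <- mu_new
  }
  mu
}

set.seed(1)
x <- c(rnorm(100), 50, 60, 70)    # Gaussian data plus gross outliers
c(mean = mean(x), huber = huber_mean(x))
```

The ordinary mean is dragged toward the outliers, while the Huberized estimate stays near zero; the same weighting idea extends to multivariate scatter via Mahalanobis distances.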

2 Answers


This sounds like a classic multivariate Gaussian mixture model problem. I think the bayesm package might work.

Here are some multivariate Gaussian mixture packages:

  • bayesm: cran.r-project.org/web/packages/bayesm/index.html
  • mixtools: www.jstatsoft.org/v32/i06/paper
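As a rough sketch of how this could look with mixtools: fit a 2-component mixture and let one component absorb the outliers, leaving the other as the "clean" Gaussian. The choice of `k = 2` and the simulated 7-dimensional data are my assumptions for illustration, not from the question.

```r
library(mixtools)

set.seed(42)
d <- 7
clean    <- matrix(rnorm(500 * d), ncol = d)           # bulk of the cloud
outliers <- matrix(rnorm(25 * d, mean = 8), ncol = d)  # high-leverage points
X <- rbind(clean, outliers)

# EM fit of a 2-component multivariate normal mixture
fit <- mvnormalmixEM(X, k = 2)
fit$lambda     # mixing proportions
fit$mu[[1]]    # component means (a list of length k)
```

The component with the larger mixing proportion should recover the mean and covariance of the clean bulk.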
EngrStudent

There's also mclust: http://www.stat.washington.edu/research/reports/2012/tr597.pdf http://cran.r-project.org/web/packages/mclust/index.html
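A minimal mclust sketch (the simulated data and component range are my assumptions): `Mclust()` chooses the number of mixture components by BIC, and for data with gross outliers it also supports adding a uniform "noise" component via its `initialization` argument.

```r
library(mclust)

set.seed(7)
X <- rbind(matrix(rnorm(300 * 7), ncol = 7),           # clean 7-dim cloud
           matrix(rnorm(15 * 7, mean = 6), ncol = 7))  # outlying points

# Model-based clustering; BIC picks G from the candidate range
fit <- Mclust(X, G = 1:3)
fit$G                      # number of components selected
head(fit$classification)   # component assignment per point
```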

One caution, though: mixture modelling in high-dimensional space can get quite CPU- and memory-intensive if your cloud of points is large. About four years ago I was running a batch of 11-dimensional, 50-200K-point data sets, and each case tended to need 4-11 GB of RAM and up to a week of compute (and I had 400 of them). It's certainly feasible, but it can be a headache if you're on a shared compute cluster or have limited resources.