
I have about 1,000 data points from some thick-tailed distribution to which I would like to fit a parametrized distribution. From my data, I've made some adjustments and constructed an empirical distribution (so I have percentiles).

What is the best way to fit a mixture of parametrized distribution functions (Pareto, lognormal, gamma, etc.) to this empirical distribution?

So far I have been using Excel to maximize the grouped MLE function, using Solver to maximize over the parameters subject to sum(weights) = 1. I have R as well but am new to it. It is pretty obvious that Excel is getting stuck in local maxima.

How would you maximize a MLE function for a mixture distribution?

  • @robert1 If you could disclose the purpose of this fit you will increase your chances of getting really effective responses. Why do you need a parametrization instead of, say, the empirical distribution function (as revealed by the sample quantiles)? Do you have any theoretical expectations concerning the shapes of the mixture components (or concerning why one might expect this to be a mixture)? What might be the consequences of uncertainty in the parameters? (That uncertainty will be *huge* unless your distribution is sharply multimodal.) – whuber Nov 03 '10 at 00:45
  • I'm modeling liability - in a personal auto context (let's temporarily assume there is no insurance limit); the amount of liability is almost certainly a mixture - you have a "fender bender" process, a "third party severely injured" process; a catastrophic process ("DWI + injury involving school bus w/ multiple youth injuries"). I'd like to have a parametrized distribution for use with simulating future loss and certain other applications. My sample data is small relative to the population, but I know this type of process is long tailed so I am thinking logn/pareto/gamma type distributions. –  Nov 03 '10 at 01:45

3 Answers


You should look up the EM algorithm. Wikipedia has a description of the algorithm with an example of its application to Gaussian mixtures. Perhaps someone else can point out an R package for you.
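To make the idea concrete, here is a minimal sketch of EM for a two-component Gaussian mixture (the same setup as the Wikipedia example). The data, component count, and starting values below are made up purely for illustration; in practice you would use one of the R packages mentioned in the comments rather than roll your own:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a mixture of two Gaussians with made-up parameters
x = np.concatenate([rng.normal(0, 1, 600), rng.normal(5, 2, 400)])

# Initial guesses for weights, means, and standard deviations
w = np.array([0.5, 0.5])
mu = np.array([1.0, 4.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: posterior probability that each point came from each component
    dens = np.stack([w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate weights, means, and standard deviations;
    # the weights automatically sum to 1, so no explicit constraint is needed
    n_k = resp.sum(axis=1)
    w = n_k / len(x)
    mu = (resp * x).sum(axis=1) / n_k
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / n_k)
```

Note that the M-step produces weights that sum to 1 by construction, which sidesteps the constrained-optimization difficulty described in the question.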

    The EM algorithm looks promising, thank you. I'll have to become better acquainted with it - I'd be interested in exploring any R packages for this. –  Nov 03 '10 at 01:56
  • @Skrikant The [mclust](http://cran.r-project.org/web/packages/mclust/index.html) package should work here. To reproduce the example on Wikipedia (two-cluster solution), you will have to fix the parameter `modelNames` (shape of covariance matrices) or `G` (number of clusters). Otherwise, the returned solution (i.e., the one optimizing BIC) is a three-cluster solution. Maybe you could add it to your original answer, for illustration purposes. – chl Nov 03 '10 at 08:26
  • @Skrikant After browsing the site, I also came across @csgillespie's example with [mixtools](http://cran.r-project.org/web/packages/mixtools/index.html), http://stats.stackexchange.com/questions/899/separating-two-populations-from-the-sample/1010#1010. – chl Nov 03 '10 at 09:09

If you want to fit a univariate distribution to your data, try the fitdistr function in the MASS package in R; its documentation explains how to use it. I am assuming that you have the full data set in addition to the quantiles.
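For comparison, the same kind of single-distribution MLE fit can be sketched outside R with `scipy.stats` in Python; the lognormal choice and the synthetic heavy-tailed sample below are assumptions made for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Made-up heavy-tailed sample: lognormal with mu = 1.0, sigma = 0.8
data = rng.lognormal(mean=1.0, sigma=0.8, size=1000)

# MLE fit of a lognormal; floc=0 pins the location parameter at zero,
# giving the usual two-parameter lognormal
shape, loc, scale = stats.lognorm.fit(data, floc=0)

# In scipy's parametrization: sigma = shape, mu = log(scale)
sigma_hat, mu_hat = shape, np.log(scale)
```

With 1,000 points the estimates should land close to the generating parameters, which is a useful sanity check before moving on to mixtures.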

suncoolsu
    Thanks suncoolsu; I looked into fitdistr but I've found that a mixture model fits my data better than a univariate distribution. I'm not sure how to use fitdistr for a multivariate distribution while constraining the sum of the weights to 1 - any tips? I think my problem is primarily in how to optimize a multivariate function subject to constraints. Thanks, Robert –  Nov 03 '10 at 00:29
    @robert1 Sorry for the previous mistake. I thought you wanted to fit _a_ parametric distribution and _not_ mixtures. OK, before I suggest anything, can you please tell what your data set looks like - I mean how do you get your data, do you have any guesses about the number of mixtures there will be ... and other similar aspects. – suncoolsu Nov 03 '10 at 01:32

Write down the complete likelihood, take the derivative, and do a gradient-based optimization.

You can do this online very easily (that is, process one point after the other) and this might result in far faster convergence than EM if the redundancy in your data is high.
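To sketch this direct-likelihood route, and the sum-to-one weight constraint the asker struggled with in Excel, one common trick is to optimize an unconstrained logit and map it to a weight with a sigmoid (a softmax, in the two-component case), so the optimizer never sees the constraint at all. The two-component lognormal mixture and synthetic data below are assumptions for illustration:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(2)
# Made-up data: mixture of two lognormals (70% / 30%)
data = np.concatenate([rng.lognormal(0.0, 0.5, 700),
                       rng.lognormal(2.0, 0.6, 300)])

def neg_loglik(theta):
    # theta = (logit, mu1, log_sig1, mu2, log_sig2); the sigmoid keeps the
    # weight in (0, 1) and exp() keeps the scale parameters positive
    w1 = 1.0 / (1.0 + np.exp(-theta[0]))
    pdf = (w1 * stats.lognorm.pdf(data, np.exp(theta[2]), scale=np.exp(theta[1]))
           + (1 - w1) * stats.lognorm.pdf(data, np.exp(theta[4]), scale=np.exp(theta[3])))
    return -np.sum(np.log(pdf + 1e-300))

x0 = np.array([0.0, 0.5, np.log(0.5), 1.5, np.log(0.5)])
res = minimize(neg_loglik, x0, method="BFGS")  # finite-difference gradient
w1_hat = 1.0 / (1.0 + np.exp(-res.x[0]))
```

The reparametrization turns a constrained problem into an unconstrained one, which is exactly where a generic gradient-based optimizer (as opposed to Excel's Solver) becomes easy to apply; it remains vulnerable to local maxima, so trying several starting points is still advisable.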

bayerj