
I am working on a model that I am fitting to a medium-sized dataset ($\approx 300,000$ observations). The likelihood I have derived is a mixture of Beta distributions which also involves the convolution of two Betas, so evaluating the log-likelihood is slow. The likelihood has nine parameters I am optimizing over, which makes even a coarse grid search computationally expensive, even if I parallelize it. A grid search of just four values per parameter requires $4^9 = 262,144$ evaluations of the log-likelihood, and still does not give particularly good starting parameters for the optimization. I then use the best set of parameters from the grid search as my starting guess in the actual optimization (optim in R using the L-BFGS-B method, as many parameters need to be non-negative or bounded between 0 and 1). Does anyone know of a better and/or faster way to do MLE over a large number of parameters?
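To make the setup concrete, here is a minimal sketch of this grid-search-then-optim workflow. The real likelihood is too long to post, so a simpler two-component Beta mixture (5 parameters) stands in for the actual 9-parameter mixture-plus-convolution likelihood; all names, grid values and bounds below are placeholders only.

```r
## Stand-in negative log-likelihood: two-component Beta mixture (5 parameters)
## instead of the real 9-parameter mixture-plus-convolution model.
negloglik <- function(par, x) {
  w  <- par[1]                 # mixing weight, in (0, 1)
  a1 <- par[2]; b1 <- par[3]   # shapes of component 1
  a2 <- par[4]; b2 <- par[5]   # shapes of component 2
  -sum(log(w * dbeta(x, a1, b1) + (1 - w) * dbeta(x, a2, b2)))
}

set.seed(1)
x <- c(rbeta(25000, 2, 5), rbeta(25000, 6, 2))   # toy data

## Coarse grid of candidate starting values (4 values per parameter)
grid  <- expand.grid(w  = c(0.2, 0.4, 0.6, 0.8),
                     a1 = c(0.5, 1, 3, 6), b1 = c(0.5, 1, 3, 6),
                     a2 = c(0.5, 1, 3, 6), b2 = c(0.5, 1, 3, 6))
vals  <- apply(grid, 1, function(p) negloglik(p, x))
start <- as.numeric(grid[which.min(vals), ])

## Bounded optimization from the best grid point
fit <- optim(start, negloglik, x = x, method = "L-BFGS-B",
             lower = c(0.001, rep(0.01, 4)), upper = c(0.999, rep(50, 4)))
fit$par
```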

Drunk Deriving
  • The answers here seem promising https://stats.stackexchange.com/questions/193306/optimization-when-cost-function-slow-to-evaluate/193310#193310 – Sycorax Jun 24 '20 at 02:36

2 Answers


Consider subsampling your observations: you don't need to use all 300,000 observations to estimate 9 parameters. That is an incredibly large amount of data for a fairly low-dimensional parameter vector. If you pick a sub-sample of (say) 10,000 observations, you can minimise this partial likelihood instead, which will be a lot faster, and its minimum is probably going to be close to the true minimum using all the data (so you can then use the minimum you find as an initial value for the full minimisation using all the data). If you do this using multiple sub-samples it might even help you find different local minima (but are you sure that your likelihood is really highly multimodal? Because with 300,000 observations and only 9 parameters I would expect it to be very sharply peaked).
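A rough sketch of this idea, reusing the hypothetical negloglik() and toy data x from the sketch in the question (the sub-sample size, starting values and bounds are arbitrary):

```r
## Fit on a random sub-sample first, then warm-start the full-data fit
set.seed(42)
sub <- x[sample(length(x), 10000)]

start     <- c(0.5, 1, 3, 3, 1)   # rough initial guess for (w, a1, b1, a2, b2)
bounds_lo <- c(0.001, rep(0.01, 4))
bounds_hi <- c(0.999, rep(50, 4))

fit_sub <- optim(start, negloglik, x = sub, method = "L-BFGS-B",
                 lower = bounds_lo, upper = bounds_hi)

## Use the sub-sample optimum as the initial value for the full-data fit
fit_full <- optim(fit_sub$par, negloglik, x = x, method = "L-BFGS-B",
                  lower = bounds_lo, upper = bounds_hi)
fit_full$par
```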

Better yet, just use stochastic gradient descent instead, where you process the observations one at a time rather than in batch: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
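A bare-bones minibatch SGD sketch, again reusing the hypothetical negloglik() and x from the question. A real implementation would use analytic gradients and a decaying step size; here a crude finite-difference gradient and a log/logit reparameterisation (one way to respect the bound constraints) keep the example short.

```r
## Optimise on an unconstrained scale: logit for the weight, log for the shapes
to_par   <- function(theta) c(plogis(theta[1]), exp(theta[2:5]))
nll_mean <- function(theta, xb) negloglik(to_par(theta), xb) / length(xb)

## Crude central-difference gradient (analytic gradients would be better)
num_grad <- function(theta, xb, h = 1e-5) {
  sapply(seq_along(theta), function(j) {
    e <- rep(0, length(theta)); e[j] <- h
    (nll_mean(theta + e, xb) - nll_mean(theta - e, xb)) / (2 * h)
  })
}

set.seed(2)
theta <- c(0, 0, 1, 1, 0)       # start: w = 0.5, shapes (1, e, e, 1)
lr    <- 0.02                   # learning rate; needs tuning in practice
for (it in 1:2000) {
  xb    <- x[sample(length(x), 100)]      # minibatch of 100 observations
  theta <- theta - lr * num_grad(theta, xb)
}
to_par(theta)                   # back-transform to (w, a1, b1, a2, b2)
```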

James
  • I don't think this is a good set of advice for this problem. First of all, there is no need for subsampling. With subsampling you lose information; sure, you can estimate 9 parameters with far fewer samples (you could go all the way down to 9+1), but the more samples, the more precise the result will be. 300k observations should fit in the RAM of a desktop PC with no problem and the computation should take no more than minutes, so there is no reason why the size should be a problem. – Tim Jun 24 '20 at 06:23
  • Moreover, yes, mixture models are multimodal by design; there is no single true minimum. Also, SGD is not the best choice of algorithm: first, it is not the most efficient optimization algorithm in general; second, for mixtures there are specialised algorithms (like E-M) that are known to work better. – Tim Jun 24 '20 at 06:25
  • Hi Tim, I think you misunderstood my suggestion. The point of using subsampling is to get a good initial value for the full minimisation problem, i.e. to find an estimate of theta which is likely to be close to the true minimum, and then use this as the initial condition for optimisation using the full 300,000 observations. – James Jun 24 '20 at 16:57
  • There is no reason why EM would be superior to stochastic gradient descent in a mixture model context when the number of observations is large, since so much computational effort in EM is 'wasted' by using the full set of observations to produce overly accurate estimates of quantities which will already be precisely estimated using only a small subset of the data. As such, SGD will be able to run hundreds of thousands of iterations in the time it takes EM to run a single iteration. I would expect SGD to perform better for this reason, but it's data-set dependent. – James Jun 24 '20 at 17:06
  • Finally, the inherent multimodality of mixture models typically isn't a problem in this context, since these multiple modes are just relabellings of the mixture components and hence give equivalent likelihoods. Finding any of these modes is as good as finding any of the others (there may be other modes of course, but with 300,000 observations I would expect these to be largely smoothed out, so the main multimodality would just be due to the above label/component-index switching). – James Jun 24 '20 at 17:06
  • EM is a standard approach to mixture models, SGD is not (see e.g. https://stats.stackexchange.com/questions/64193/fitting-a-gaussian-mixture-model-using-stochastic-gradient-descent), so it'd help if you could give some references showing the superiority you mention. – Tim Jun 24 '20 at 17:51
  • "Standard" applications of mixture models don't have 300,000 observations. Stochastic gradient descent is a standard technique for optimising parameters over large data sets since it uses the data more efficiently than batch algorithms like EM (it does not 'waste time' computing quantities to unneeded levels of precision). Batch algorithms typically don't scale well to large data sets for this reason. Here are some specific discussions of SGD in a mixture model framework, but really this is part of a much larger discussion on batch vs stochastic algorithms: – James Jun 24 '20 at 18:54
  • https://papers.nips.cc/paper/8021-stochastic-expectation-maximization-with-variance-reduction.pdf https://arxiv.org/pdf/0712.4273.pdf (note that these are both discussing stochastic vs batch EM. If I were the OP I'd first try vanilla SGD to see how well it performed, since it's trivial to implement, and only think about implementing stochastic EM if it didn't work well. But with 300,000 observations I'd definitely be using some kind of stochastic/minibatch algorithm if the likelihood is taking too long to evaluate) – James Jun 24 '20 at 18:55
  • To turn the question around: if the OP had 300 billion observations rather than 300,000, and the (complete data) likelihood took over a day to evaluate, would you accept that batch methods like EM are completely infeasible and that stochastic techniques are the more sensible approach? The only question is the cross-over point at which stochastic algorithms are likely to be better, and I would cautiously suggest that 300,000 observations is likely to be above that point if the likelihood is complicated. But really the OP just needs to try it and see how it works; there are R packages. – James Jun 24 '20 at 19:05

Except for trivial problems, such as simple hyperparameter tuning, nobody uses grid search as an optimization algorithm. A mixture of Beta distributions is parametrized by continuous parameters, so there are infinitely many values of the parameters, and combinations of them, that you would need to check; an exhaustive grid search over them would never finish in finite time.

There is a whole branch of mathematics devoted to optimization and a number of algorithms for optimizing different functions. In R there are the built-in optim and optimize functions, while in Python there is scipy.optimize.minimize; these provide the most popular optimization algorithms. They take the function that you want to minimize and find the minimizing values for you. However, please take the time to read the documentation and read about the algorithms first, since these functions implement many different algorithms, and some are known to work better for certain classes of problems than others.

As for mixtures, the most popular algorithm for fitting them is the E-M algorithm; in fact, the linked Wikipedia page uses a Gaussian mixture as the example illustrating the algorithm.
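For illustration only, here is a bare-bones EM sketch for a plain two-component Beta mixture, which is a stand-in for, not an implementation of, the 9-parameter model in the question. Because Beta shape parameters have no closed-form weighted MLE, the M-step below uses a small numerical optimization for each component.

```r
## EM for a two-component Beta mixture (illustrative sketch)
em_beta_mix <- function(x, n_iter = 50) {
  w   <- 0.5                                              # mixing weight
  shp <- matrix(c(1, 3, 3, 1), nrow = 2, byrow = TRUE)    # (a, b) per component
  for (it in 1:n_iter) {
    ## E-step: responsibilities of component 1
    d1 <- w * dbeta(x, shp[1, 1], shp[1, 2])
    d2 <- (1 - w) * dbeta(x, shp[2, 1], shp[2, 2])
    r  <- d1 / (d1 + d2)
    ## M-step: update the weight, then each component's shapes by
    ## maximizing the responsibility-weighted Beta log-likelihood
    w   <- mean(r)
    wts <- list(r, 1 - r)
    for (k in 1:2) {
      obj <- function(p) -sum(wts[[k]] * dbeta(x, p[1], p[2], log = TRUE))
      shp[k, ] <- optim(shp[k, ], obj, method = "L-BFGS-B",
                        lower = c(0.01, 0.01), upper = c(100, 100))$par
    }
  }
  list(weight = w, shapes = shp)
}

## Toy usage with simulated data
set.seed(3)
x <- c(rbeta(5000, 2, 5), rbeta(5000, 6, 2))
em_beta_mix(x)
```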

Since you do not seem to be familiar with optimization, I'd recommend starting by learning more about it. The handbook Algorithms for Optimization by Kochenderfer and Wheeler and the recorded Stanford lectures by Stephen Boyd could serve as a nice introduction.

Tim
  • You are correct that I am not familiar with the algorithms used in optimization, but I think you misunderstood me. I am using the grid search only to get a good starting guess for my optimization (I am using optim in R for the actual optimization). Without the grid search, optim falls into many local minima of the negative log-likelihood and fails to find the global minimum, as it is a complicated likelihood. My question was whether there is a better way to do this. I have thought about writing an EM algorithm, but didn't know how much it would help/speed things up. – Drunk Deriving Jun 23 '20 at 23:01
  • @DrunkDeriving I wouldn't expect grid search to help here; mixture models simply have many local optima because a mixture distribution is built from several distributions and so has many peaks. That is why there are specialized algorithms like E-M for this; general optimization algorithms are likely to fail here. – Tim Jun 24 '20 at 06:20