
I have a rather straightforward algorithm for finding the maximum-likelihood parameter of a probability distribution using sub-sampling. I'm fairly confident this algorithm is not novel and was hoping someone might recognize it from the following description. I will use biased coin-flipping as the example, as it is probably the easiest to understand.

Assume we have observed $N$ coin flips. Call the set of observations $S$. We would like to compute an approximation to the Bernoulli parameter $p^*$ that maximizes the likelihood of the observed data without overfitting.

  1. Select a trial value $p$ for the parameter.
  2. Uniformly randomly select $n$ samples from $S$. Call this sub-sampled set $s$.
  3. Compute the maximum-likelihood parameter $\theta(s)$ for the sub-sample.
  4. Using $\lVert p - \theta(s) \rVert^2$ (or similar) as the error, take one gradient-descent step to update $p$.
  5. Repeat steps 2–4 until the desired convergence.
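A minimal sketch of these steps in Python (function and parameter names are mine, and I've assumed sub-sampling without replacement; the learning rate and step count are arbitrary):

```python
import random

def subsample_descent(S, n, lr=0.05, steps=2000, p0=0.5):
    """Sketch of the algorithm above for Bernoulli data (0/1 flips)."""
    p = p0                              # step 1: trial value for the parameter
    for _ in range(steps):
        s = random.sample(S, n)         # step 2: uniform sub-sample of size n
        theta = sum(s) / n              # step 3: sub-sample MLE (empirical frequency)
        grad = 2 * (p - theta)          # step 4: gradient of ||p - theta||^2 w.r.t. p
        p -= lr * grad
    return p
```

Note that for squared error the update `p -= lr * 2 * (p - theta)` just pulls $p$ toward an exponentially weighted average of the sub-sample MLEs, so in this simple Bernoulli case it converges to a neighborhood of the full-sample frequency.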

Can anyone recognize this as a specific case of a well-known algorithm?

1 Answer


This sounds like a variation of Stochastic Gradient Descent.

However, one key difference is point 3. Typically, the sub-sample likelihood function is not fully optimized, but merely increased (more precisely, a single gradient-descent step is taken). In general, this is safer. For example, consider logistic regression. There, it is very easy for a subset of the data to exhibit perfect separation (in fact, this is guaranteed if the number of coefficients is at least the number of sub-samples). This is known as complete separation (it also underlies the Hauck–Donner effect), and it results in non-finite MLEs of the coefficients. Note that if this occurred at any step of your algorithm, your estimates would become non-finite and never recover. This won't happen with SGD, because the derivative (and thus the step) is finite at every iteration.

Cliff AB