
I have a rather straightforward algorithm for finding the maximum-likelihood parameter of a probability distribution using sub-sampling. I'm fairly confident this algorithm is not novel and was hoping someone might recognize it from the following description. I will use biased coin-flipping as the example, as it is probably the easiest to understand.

Assume we have observed $N$ coin flips. Call the set of observations $S$. We would like to compute an approximation to the Bernoulli parameter $p^*$ that maximizes the likelihood of the observed data without overfitting.

  1. Select a trial value $p$ for the parameter.
  2. Uniformly randomly select $n$ samples from $S$. Call this sub-sampled set $s$.
  3. Compute the maximum-likelihood parameter $\theta(s)$ for the sub-sample.
  4. Using $\lVert p - \theta(s) \rVert^2$ (or similar) as the error, take one gradient-descent step to update $p$.
  5. Repeat steps 2–4 until the desired convergence.
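A minimal sketch of these steps in Python (function and parameter names are mine, and I've assumed sub-sampling without replacement; the learning rate and step count are arbitrary):

```python
import random

def subsample_descent(S, n, lr=0.05, steps=2000, p0=0.5):
    """Sketch of the algorithm above for Bernoulli data (0/1 flips)."""
    p = p0                              # step 1: trial value for the parameter
    for _ in range(steps):
        s = random.sample(S, n)         # step 2: uniform sub-sample of size n
        theta = sum(s) / n              # step 3: sub-sample MLE (empirical frequency)
        grad = 2 * (p - theta)          # step 4: gradient of ||p - theta||^2 w.r.t. p
        p -= lr * grad
    return p
```

Note that for squared error the update `p -= lr * 2 * (p - theta)` just pulls $p$ toward an exponentially weighted average of the sub-sample MLEs, so in this simple Bernoulli case it converges to a neighborhood of the full-sample frequency.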

Can anyone recognize this as a specific case of a well-known algorithm?

1 Answer


This sounds like a variation of Stochastic Gradient Descent.

However, one key difference is point 3. Typically, the sub-sample likelihood function is not fully optimized, but merely increased (more precisely, a single gradient-descent step is taken). In general, this is safer. For example, consider logistic regression. There, it is very easy for a subset of the data to exhibit perfect separation (in fact, this is guaranteed if the number of coefficients is at least the number of sub-samples). This is known as complete separation (it also underlies the Hauck–Donner effect), and it results in non-finite MLEs of the coefficients. Note that if this occurred at any step of your algorithm, your estimates would become non-finite and never recover. This won't happen with SGD, because the derivative (and thus the step) is finite at every iteration.

Cliff AB