I understand individually what different optimizers, such as gradient descent, Adam, etc., do, and I understand what estimators, such as maximum likelihood, do.
But I am having a hard time putting the two concepts together. Is MLE used instead of an optimizer, or does the optimizer maximize the likelihood? In what part of the network does MLE come into play?
My current understanding is the following. An input is fed into the network; assuming we're not using a pre-trained network, we randomly initialize the weights/biases. Once the input has been forward-propagated, a loss function measures the error between the output and the target. This loss is then backpropagated through the network and the weights are updated, roughly like the sketch below. This seems like a closed loop, so where in this does MLE come into play?
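For concreteness, here is a minimal sketch of the loop I have in mind (PyTorch, with a made-up classification setup and dummy data, so the layer sizes, learning rate, and loss choice are just placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical tiny network, just to make the loop concrete
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 10)          # dummy inputs
y = torch.randint(0, 3, (64,))   # dummy targets

for epoch in range(100):
    optimizer.zero_grad()
    logits = model(x)            # forward pass with the current weights
    loss = loss_fn(logits, y)    # error between output and target
    loss.backward()              # backpropagate the loss
    optimizer.step()             # tweak the weights
```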