
I am reading Chris Bishop's Pattern Recognition and Machine Learning.

In Section 2.3.5 he introduces some ideas on the contribution of the $n$th observation in a data set to the maximum likelihood estimator of the mean.

He says that the larger the number of observations, the less the last data point contributes to the estimate. This makes good sense.
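For reference, the sequential update he derives there for the Gaussian mean is (eq. 2.126, if I recall the numbering correctly)

$$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right),$$

so the correction contributed by the $N$th observation shrinks like $1/N$.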

He then goes on to introduce the following:

"However, we will not always be able to derive a sequential algorithm by this route, and so we seek a more general formulation of sequential learning, which leads us to the Robbins-Monro algorithm."

My question is: I am not entirely clear on the motivation here. How are these two ideas connected? I would be glad for some insight into what is happening.

  • Related question/answer [here](https://stats.stackexchange.com/questions/255792/confusion-about-robbins-monro-algorithm-in-bishop-prml/) – Don Slowik Apr 08 '19 at 21:12

1 Answer


Robbins-Monro gives an iterative way to find the zero of the regression function $E[z\mid\theta_{ML}]$. Bishop then shows that the role of $z$ is played here by the gradient of the log-likelihood of a single observation, so the maximum likelihood solution, being a stationary point of the log-likelihood, is precisely the root of that regression function. That is the connection: when this route does not give a closed-form sequential update, Robbins-Monro still finds the root one observation at a time. For the Gaussian mean, $z$ is a random variable whose distribution given $\theta_{ML}$ is $\mathcal{N}\!\bigl(z \mid (\theta - \theta_{ML})/\sigma^{2},\, 1/\sigma^{2}\bigr)$, as explained in Section 2.3.5, eq. 2.136. So iterating on $\theta_{ML}$ until $E[z\mid\theta_{ML}] = 0$ yields $\theta_{ML} = \theta$, the true value of the mean.
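To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the book): it runs the Robbins-Monro update $\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,z(\theta^{(N-1)})$ with $z$ the single-observation log-likelihood gradient, and with step sizes $a_N = \sigma^2/N$, a choice that happens to reproduce the sequential-mean update from the same section. The values of `mu_true` and `sigma` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a Gaussian with known variance; we estimate the mean sequentially.
mu_true, sigma = 2.0, 1.0                 # arbitrary values for illustration
x = rng.normal(mu_true, sigma, size=10_000)

theta = 0.0                               # initial guess for the mean
for n, x_n in enumerate(x, start=1):
    z = (x_n - theta) / sigma**2          # gradient of ln N(x_n | theta, sigma^2) w.r.t. theta
    a = sigma**2 / n                      # step sizes satisfying the Robbins-Monro conditions
    theta += a * z                        # Robbins-Monro update

print(theta)      # sequential (Robbins-Monro) estimate of the mean
print(x.mean())   # batch maximum likelihood estimate; identical here by construction
```

With another valid step-size sequence the iterates would still converge to the true mean, just not along the running-mean path, which is exactly why the Robbins-Monro formulation is more general than the direct sequential derivation.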

Don Slowik