I am trying to understand the principles behind Maximum Likelihood Estimation. And I would be appreciative if someone spotted any misunderstandings or else confirmed that I am understanding this correctly. Following is in the context of normal distribution.
The likelihood for a single observation is: $$L(y_i\mid x_i, β, σ^2) = \frac{1}{\sqrt{2πσ^2}}\,e^{-(y_i−x_iβ)^2/2σ^2} $$
So if I understand correctly, we start with a single observation $y_i$, and given some values of $x_i$, β and $σ^2$, we calculate the likelihood of that observation — that is, the height of the normal density with mean $x_iβ$ and variance $σ^2$, evaluated at $y_i$.
Then, given the same values of β and $σ^2$ (and each observation's own $x_i$), we calculate the likelihoods of the remaining y values in the same way. Because the observations are assumed independent, we take the product of these likelihood values.
So the joint likelihood for the whole sample:
$$L(y\mid X, β, σ^2) = \prod_{i=1}^{N}\frac{1}{\sqrt{2πσ^2}}\,e^{-(y_i−x_iβ)^2/2σ^2} $$
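To check my understanding, I tried computing this numerically on some made-up data (the values of β, $σ^2$ and the sample here are my own toy example, not from any real problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one regressor, with "true" beta = 2 and sigma^2 = 1 (my own made-up values)
N = 50
x = rng.normal(size=N)
beta, sigma2 = 2.0, 1.0
y = x * beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Per-observation likelihood: (1/sqrt(2*pi*sigma^2)) * exp(-(y_i - x_i*beta)^2 / (2*sigma^2))
resid = y - x * beta
lik = (1.0 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-resid**2 / (2 * sigma2))

# Joint likelihood of the whole sample is the product over observations
joint = np.prod(lik)
print(joint)  # a tiny positive number
```

The product comes out as an extremely small number, which I gather is one practical reason for moving to logs.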
Then we work with the log-likelihood of this function, because the log turns the product into a sum, which makes differentiation much easier; and since the log is monotone, it has the same maximiser.
So in the matrix form, the log-likelihood is given by
$$\ln L = −\frac{N}{2} \ln(2π) − \frac{N}{2}\ln(σ^2) − \frac{1}{2σ^2}(y − Xβ)'(y − Xβ) $$
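As a sanity check (again on my own toy data), the matrix form $−\frac{N}{2}\ln(2π) − \frac{N}{2}\ln(σ^2) − \frac{1}{2σ^2}(y−Xβ)'(y−Xβ)$ should agree exactly with the sum of the per-observation log densities:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data just for the numeric check
N = 30
X = rng.normal(size=(N, 2))
beta = np.array([0.5, -1.0])
sigma2 = 2.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Sum of log densities, term by term
resid = y - X @ beta
loglik_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

# Matrix form: -(N/2) ln(2*pi) - (N/2) ln(sigma^2) - (1/(2*sigma^2)) (y-Xb)'(y-Xb)
loglik_mat = (-N / 2 * np.log(2 * np.pi)
              - N / 2 * np.log(sigma2)
              - resid @ resid / (2 * sigma2))

print(loglik_sum, loglik_mat)
```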
Now to find the maximum likelihood parameters we need to find where this function is maximised. So we differentiate with respect to β and $σ^2$ separately, set the derivatives to zero, and solve for β and $σ^2$. In matrix form we get:
$$\hat{β} = (X'X)^{−1} X'y$$
$$ \hat{σ}^2 = e'e/N, \quad \text{where } e = y − X\hat{β} $$
So by solving these equations we get both the mean and variance parameters of the normal distribution curve fitted to our sample data with maximum likelihood.
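Putting the closed-form expressions into code (on my own simulated data with assumed "true" values, so I can see the estimates land near them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: intercept plus one regressor (my own made-up example)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)  # true sigma^2 = 0.25

# Closed-form MLE: beta_hat = (X'X)^{-1} X'y, sigma2_hat = e'e / N
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / N

print(beta_hat, sigma2_hat)
```

(I used `np.linalg.solve` rather than explicitly inverting $X'X$, which I understand is the numerically preferred way to apply the formula.)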
Question: why do we need iterative procedures, and where exactly do they enter this process? My intuitive feeling would be that at the very beginning, when the likelihoods of the individual y data points are calculated given some values of x, β and $σ^2$, these β and $σ^2$ parameters are randomly chosen and all likelihoods are evaluated. Then perhaps the process is repeated with other randomly chosen β and $σ^2$, and iterations continue until some 'convergence happens' — but I am not sure what has to converge, and why, since the closed-form solution to the likelihood maximisation already exists.
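To make my question concrete, here is the kind of iterative search I have in mind — a sketch using `scipy.optimize.minimize` on the negative log-likelihood (the starting values are arbitrary; the data are my own simulation). It does seem to converge to the same answer as the closed-form formulas, so I am asking when and why this route would be necessary:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulated data (my own example)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)

# Negative log-likelihood, parameterised as (beta_0, beta_1, log sigma^2);
# optimising over log sigma^2 keeps the variance positive without constraints
def negloglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    return 0.5 * N * np.log(2 * np.pi * s2) + resid @ resid / (2 * s2)

res = minimize(negloglik, x0=np.zeros(3))  # arbitrary starting point
beta_hat_iter = res.x[:-1]
s2_hat_iter = np.exp(res.x[-1])

# Closed-form answers for comparison
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
print(beta_hat_iter, beta_hat, s2_hat_iter, e @ e / N)
```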