I am trying to understand the principles behind Maximum Likelihood Estimation. And I would be appreciative if someone spotted any misunderstandings or else confirmed that I am understanding this correctly. Following is in the context of normal distribution.
The likelihood for a single observation is: $$L(y_i\mid x_i, β, σ^2) = \frac{1}{\sqrt{2πσ^2}}\,e^{-(y_i−x_iβ)^2/2σ^2} $$
So if I understand correctly, we start with a single observation $y_i$, and given some values of $x_i$, β and $σ^2$, we calculate the likelihood of that observation — that is, the height of the normal density with mean $x_iβ$ and variance $σ^2$, evaluated at $y_i$.
Then, given the same values of β and $σ^2$ (and each observation's own $x_i$), we calculate the likelihoods of the remaining y values in the same way. Because the observations are assumed independent, we take the product of these likelihood values.
So the joint likelihood for the whole sample:
$$L(y\mid X, β, σ^2) = \prod_{i=1}^{N}\frac{1}{\sqrt{2πσ^2}}\,e^{-(y_i−x_iβ)^2/2σ^2} $$
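To check my understanding, I tried computing this numerically on some made-up data (the values of β, $σ^2$ and the sample here are my own toy example, not from any real problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one regressor, with "true" beta = 2 and sigma^2 = 1 (my own made-up values)
N = 50
x = rng.normal(size=N)
beta, sigma2 = 2.0, 1.0
y = x * beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Per-observation likelihood: (1/sqrt(2*pi*sigma^2)) * exp(-(y_i - x_i*beta)^2 / (2*sigma^2))
resid = y - x * beta
lik = (1.0 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-resid**2 / (2 * sigma2))

# Joint likelihood of the whole sample is the product over observations
joint = np.prod(lik)
print(joint)  # a tiny positive number
```

The product comes out as an extremely small number, which I gather is one practical reason for moving to logs.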
Then we work with the log-likelihood of this function, because the log turns the product into a sum, which makes differentiation much easier; and since the log is monotone, it has the same maximiser.
So in the matrix form, the log-likelihood is given by
$$\ln L = −\frac{N}{2} \ln(2π) − \frac{N}{2}\ln(σ^2) − \frac{1}{2σ^2}(y − Xβ)'(y − Xβ) $$
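As a sanity check (again on my own toy data), the matrix form $−\frac{N}{2}\ln(2π) − \frac{N}{2}\ln(σ^2) − \frac{1}{2σ^2}(y−Xβ)'(y−Xβ)$ should agree exactly with the sum of the per-observation log densities:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data just for the numeric check
N = 30
X = rng.normal(size=(N, 2))
beta = np.array([0.5, -1.0])
sigma2 = 2.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Sum of log densities, term by term
resid = y - X @ beta
loglik_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

# Matrix form: -(N/2) ln(2*pi) - (N/2) ln(sigma^2) - (1/(2*sigma^2)) (y-Xb)'(y-Xb)
loglik_mat = (-N / 2 * np.log(2 * np.pi)
              - N / 2 * np.log(sigma2)
              - resid @ resid / (2 * sigma2))

print(loglik_sum, loglik_mat)
```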
Now to find the maximum likelihood parameters we need to find where this function is maximised. So we differentiate with respect to β and $σ^2$ separately, set the derivatives to zero, and solve for β and $σ^2$. In matrix form we get:
$$\hat{β} = (X'X)^{−1} X'y$$
$$ \hat{σ}^2 = e'e/N, \quad \text{where } e = y − X\hat{β} $$
So by solving these equations we get both the mean and variance parameters of the normal distribution curve fitted to our sample data with maximum likelihood.
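Putting the closed-form expressions into code (on my own simulated data with assumed "true" values, so I can see the estimates land near them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: intercept plus one regressor (my own made-up example)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=N)  # true sigma^2 = 0.25

# Closed-form MLE: beta_hat = (X'X)^{-1} X'y, sigma2_hat = e'e / N
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / N

print(beta_hat, sigma2_hat)
```

(I used `np.linalg.solve` rather than explicitly inverting $X'X$, which I understand is the numerically preferred way to apply the formula.)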
Question: why do we need iterative procedures, and where exactly do they enter this process? My intuitive feeling would be that at the very beginning, when the likelihoods of the individual y data points are calculated given some values of x, β and $σ^2$, these β and $σ^2$ parameters are randomly chosen and all likelihoods are evaluated. Then perhaps the process is repeated with other randomly chosen β and $σ^2$, and iterations continue until some 'convergence happens' — but I am not sure what has to converge, and why, since the closed-form solution to the likelihood maximisation already exists.
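To make my question concrete, here is the kind of iterative search I have in mind — a sketch using `scipy.optimize.minimize` on the negative log-likelihood (the starting values are arbitrary; the data are my own simulation). It does seem to converge to the same answer as the closed-form formulas, so I am asking when and why this route would be necessary:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Simulated data (my own example)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)

# Negative log-likelihood, parameterised as (beta_0, beta_1, log sigma^2);
# optimising over log sigma^2 keeps the variance positive without constraints
def negloglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    return 0.5 * N * np.log(2 * np.pi * s2) + resid @ resid / (2 * s2)

res = minimize(negloglik, x0=np.zeros(3))  # arbitrary starting point
beta_hat_iter = res.x[:-1]
s2_hat_iter = np.exp(res.x[-1])

# Closed-form answers for comparison
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
print(beta_hat_iter, beta_hat, s2_hat_iter, e @ e / N)
```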