When you write
> [...] to initialize a time series of random white noise (the errors), and then perform a first fit to obtain a first model, then calculate the errors compared to the actual data, then fit it again with the newly obtained errors, compare it to the actual data to obtain new errors, and so on.
you are in effect outlining the need for an estimation method that estimates the vector of residuals and the vector of parameters simultaneously.
Say one has,
${y}_t = \boldsymbol{x}_t\boldsymbol{\beta} + {\varepsilon}_t$
where $\boldsymbol{x}_t$ may contain anything you want, for instance lagged values ${y}_{t-i}$ for $i=1,\dots,p$ (setting aside the discussion of the conditions attached to $p$).
In the MA($q$) case, one assumes that ${\varepsilon}_t = {r}_t + \sum_{i=1}^q \lambda_i {r}_{t-i}$, which leads to
${y}_t = \boldsymbol{x}_t\boldsymbol{\beta} + {r}_t + \sum_{i=1}^q \lambda_i {r}_{t-i}$
or, reformulated in matrix terms using the backshift operator $\boldsymbol{B}$,
$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \left(\boldsymbol{I} + \sum_{i=1}^q \lambda_i \boldsymbol{B}^i\right)\boldsymbol{r}$
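To make the matrix form concrete, here is a minimal numpy sketch (the sample size, the coefficient values `lam` and `beta`, and all variable names are illustrative assumptions of mine) that simulates $\boldsymbol{y}$ exactly as written above, for $q = 2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 200, 2                               # illustrative sample size and MA order
lam = np.array([0.6, -0.3])                 # hypothetical lambda_1, lambda_2
beta = np.array([1.0, 0.5])                 # hypothetical regression coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n)])
r = rng.normal(size=n)                      # white-noise innovations r_t

# Backshift matrix B: (B r)_t = r_{t-1}, with an implicit zero before the sample.
B = np.eye(n, k=-1)
Theta = np.eye(n) + sum(lam[i] * np.linalg.matrix_power(B, i + 1) for i in range(q))

y = X @ beta + Theta @ r                    # y = X beta + (I + sum_i lambda_i B^i) r
```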
Since working with MLE actually means working with the errors whose distribution is assumed, you have to rearrange the last equation as
$\left(\boldsymbol{I} + \sum_{i=1}^q \lambda_i \boldsymbol{B}^i\right)^{-1}\left(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\right) = \boldsymbol{r}$
In practice, this means working with the distribution-conditioned residuals
$\left(\boldsymbol{I} + \sum_{i=1}^q \widehat{\lambda}_i \boldsymbol{B}^i\right)^{-1}\left(\boldsymbol{y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}}\right) = \widehat{\boldsymbol{r}}$.
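One never inverts that operator explicitly: it is lower triangular with a unit diagonal, so the residuals come out of a forward recursion. A minimal sketch, with a hypothetical helper `ma_residuals` (it can be checked against the simulated `y`, `X` above):

```python
import numpy as np

def ma_residuals(y, X, beta_hat, lam_hat):
    """Solve (I + sum_i lambda_i B^i) r = y - X beta forward in time:
    r_t = (y_t - x_t beta) - sum_i lambda_i r_{t-i}, with pre-sample r set to 0."""
    e = y - X @ beta_hat
    r = np.zeros_like(e)
    for t in range(len(e)):
        r[t] = e[t] - sum(lam_hat[i] * r[t - 1 - i]
                          for i in range(len(lam_hat)) if t - 1 - i >= 0)
    return r
```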
Arranged this way, one can maximize the likelihood assumed for the residuals and estimate the parameters at the same time, that is, simultaneously. Hence the frequent use of MLE.
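As an illustration of that joint estimation, here is a hedged sketch of conditional (Gaussian) MLE for the regression with MA($q$) errors, reusing `y`, `X`, `q` from the simulation and the hypothetical `ma_residuals` helper above, together with scipy's generic optimizer; a single optimizer call estimates $\boldsymbol{\beta}$ and the $\lambda_i$ together:

```python
import numpy as np
from scipy.optimize import minimize

def neg_conditional_loglik(params, y, X):
    """Concentrated Gaussian log-likelihood (negated); sigma^2 is profiled out."""
    k = X.shape[1]
    beta_hat, lam_hat = params[:k], params[k:]
    r = ma_residuals(y, X, beta_hat, lam_hat)     # helper sketched above
    sigma2 = np.mean(r ** 2)
    return 0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1.0)

# One optimizer call estimates beta and lambda_1, ..., lambda_q jointly.
start = np.zeros(X.shape[1] + q)
fit = minimize(neg_conditional_loglik, start, args=(y, X), method="BFGS")
beta_mle, lam_mle = fit.x[:X.shape[1]], fit.x[X.shape[1]:]
```

Packaged ARMA/ARMAX routines do essentially this, typically through an exact (state-space) likelihood rather than the conditional one sketched here.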
But iterative approaches like the one you describe are also used in practice, for instance with GMM when dealing with endogeneity, stopping when a convergence criterion is met.
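For completeness, the iterative scheme you describe can be sketched as follows (an iterative least-squares procedure in the spirit of Hannan-Rissanen, which is my guess at what you have in mind, not a GMM routine): regress $\boldsymbol{y}$ on $\boldsymbol{X}$ and the lagged current residuals, recompute the errors against the actual data, and repeat until the coefficients stop moving.

```python
import numpy as np

def iterative_ma_fit(y, X, q, max_iter=100, tol=1e-8):
    """Alternate between an OLS fit on [X, lagged residuals] and refreshing the
    residuals against the actual data, until the coefficients stop changing."""
    n, k = X.shape
    r = np.zeros(n)                             # initial residual guess
    coef_old = np.zeros(k + q)
    for _ in range(max_iter):
        # lagged-residual regressors r_{t-1}, ..., r_{t-q} (zeros before the sample)
        R = np.column_stack([np.r_[np.zeros(i + 1), r[:n - i - 1]] for i in range(q)])
        Z = np.column_stack([X, R])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ coef                        # new errors compared to the actual data
        if np.max(np.abs(coef - coef_old)) < tol:   # convergence criterion
            break
        coef_old = coef
    return coef[:k], coef[k:]                   # (beta_hat, lambda_hat)
```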