In short: I wonder when I would ever want to use a multilevel model as opposed to a linear regression with appropriate structure.
In detail:
When I look at Wikipedia, I understand that multilevel models describe the following situation:
$$ y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + \epsilon_{ij} $$
with the usual meanings. In the following, I take it that $i$ denotes an individual (or the smallest unit), and $j$ is some kind of level. For simplicity, say it is a state of the country, or a school.
To me, the meaning of $\beta_{0j}$ is that the intercept may vary across levels. Similarly, to me $\beta_{1j}$ indicates that the effect of $X_{ij}$ varies over the levels (state, school, ...).
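For reference, my reading of the standard random-coefficient formulation (the symbols $\gamma_{00}$, $\gamma_{10}$, $u_{0j}$, $u_{1j}$ are my own labels, not taken from the equation above) is that the group-level coefficients are themselves modeled as draws around a common mean:
$$ \beta_{0j} = \gamma_{00} + u_{0j}, \qquad \beta_{1j} = \gamma_{10} + u_{1j} $$
where $(u_{0j}, u_{1j})$ are group-level random effects with mean zero, typically assumed jointly normal and independent of $\epsilon_{ij}$. As far as I can tell, this is where the distributional assumptions of the multilevel model enter.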
If you told me to regress some $y$ on some $X$ in a way that takes into account that the overall means may differ between groups $j$, and that the effect of $X$ on $y$ may differ between groups $j$, here is what I would naturally do: I would run the following regression:
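$$ y_{ij} = \beta_1 + \sum_{k=2}^K \alpha_k 1(j=k) + \beta_2 X_{ij} + \sum_{k=2}^K \delta_k 1(j=k) X_{ij} + \epsilon_{ij} $$
where $1(j=k)$ is a dummy for membership in group $k$, $\alpha_k$ and $\delta_k$ are the intercept and slope deviations of group $k$ from the baseline group $1$, and it is understood that there are $K$ groups.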
What have I done? I have included dummies or, as we call them in econometrics, fixed effects for each group. These take mean differences between groups into account. Similarly, I have interacted the regressor of interest $X_{ij}$ with said dummies to see whether its effect differs by group.
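To make this concrete: with a full set of group dummies and group-by-$X$ interactions, the point estimates coincide with those from a separate OLS line per group, expressed as a baseline plus deviations. Below is a minimal sketch in plain Python; the data are made up and the helper `ols_line` is a hypothetical name of my own.

```python
from collections import defaultdict


def ols_line(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: (group j, X_ij, y_ij). Purely illustrative.
data = [
    ("A", 1.0, 2.1), ("A", 2.0, 3.9), ("A", 3.0, 6.2), ("A", 4.0, 7.8),
    ("B", 1.0, 5.0), ("B", 2.0, 5.6), ("B", 3.0, 5.9), ("B", 4.0, 6.5),
]

# Split the observations by group.
by_group = defaultdict(list)
for j, x, y in data:
    by_group[j].append((x, y))

# One least-squares line per group.
group_fits = {}
for j, rows in by_group.items():
    xs, ys = zip(*rows)
    group_fits[j] = ols_line(xs, ys)

# Re-express the fits as "baseline group A plus deviations", which is
# the parameterization of the dummy-and-interaction regression.
baseline_intercept, baseline_slope = group_fits["A"]
dummy_coef_B = group_fits["B"][0] - baseline_intercept
interaction_coef_B = group_fits["B"][1] - baseline_slope

print(group_fits, dummy_coef_B, interaction_coef_B)
```

Note that each group's line is estimated from that group's observations only; no information is shared across groups.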
Of course, there is the question of inference. By using all these dummies and interactions, one may hope that all structural dependencies have been absorbed and the remaining error term is white noise. However, as econometricians, we worry intensely about incorrect inferential statements due to some form of heteroskedasticity or unobserved heterogeneity (not to mention endogeneity!). In particular, group-specific heteroskedasticity is the econometrician's incarnation of Freddy Krueger, and in this example it seems fair to say that individual members of group $j$ may have some elements of the variance-covariance matrix in common. I would thus compute a so-called cluster-robust variance-covariance matrix, which I can use to get correct or at least conservative inferential statements.
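For concreteness, here is a minimal sketch of the cluster-robust "sandwich" estimator I have in mind, in plain Python. To keep it short it makes several assumptions of my own: it uses a pooled regression of $y$ on a constant and $X$ (not the full dummy-and-interaction specification), made-up data, and no small-sample correction. The function name `cluster_robust_cov` is hypothetical.

```python
def cluster_robust_cov(xs, ys, clusters):
    """Pooled OLS of y on [1, x] with a cluster-robust covariance matrix.

    Returns ((intercept, slope), 2x2 covariance as nested lists).
    No small-sample correction is applied.
    """
    n = len(xs)

    # Bread: (X'X)^{-1} for the design matrix X = [1, x].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

    # OLS coefficients: (X'X)^{-1} X'y.
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy

    # Per-cluster score sums: s_g = sum over i in g of (1, x_i)' * u_i.
    scores = {}
    for x, y, g in zip(xs, ys, clusters):
        u = y - b0 - b1 * x
        s = scores.setdefault(g, [0.0, 0.0])
        s[0] += u
        s[1] += x * u

    # Meat: sum over clusters of the outer product s_g s_g'.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for a in range(2):
            for b in range(2):
                meat[a][b] += s[a] * s[b]

    # Sandwich: (X'X)^{-1} * meat * (X'X)^{-1}.
    tmp = [[sum(inv[a][k] * meat[k][b] for k in range(2)) for b in range(2)]
           for a in range(2)]
    cov = [[sum(tmp[a][k] * inv[k][b] for k in range(2)) for b in range(2)]
           for a in range(2)]
    return (b0, b1), cov


# Made-up data: four groups with three observations each.
groups = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
xs = [1.0, 2.0, 3.0] * 4
ys = [2.0, 3.1, 3.9, 4.2, 5.1, 6.3, 1.1, 1.8, 3.2, 3.0, 4.1, 4.8]

(b0, b1), cov = cluster_robust_cov(xs, ys, groups)
cluster_se_slope = cov[1][1] ** 0.5
print(b0, b1, cluster_se_slope)
```

Passing a distinct cluster label for every observation reduces the meat to $\sum_i u_i^2 x_i x_i'$, i.e. the usual heteroskedasticity-robust estimator without clustering.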
Now let me compare both models to make sure you understand what I believe to be true (also to make it easier for you to point out flaws in my understanding):
Similarities: Both account for differences in group means. Both can account for differences in the coefficient of interest by level. And both the multilevel model (if its distributional assumptions hold) and the simple linear regression with fixed effects and interactions (computed with a cluster-robust variance-covariance matrix) provide me with correct inferential statements.
Differences: In the linear regression case, I don't need any distributional assumptions, and as far as I know even in the absence of clustering, the procedure provides valid inferential statements, though possibly conservative ones. In the multilevel model, if distributional assumptions do not hold, I am not sure what happens, but I would guess nothing good.
My question: In what kind of situation would I ever prefer to fit a multilevel model? Is there something the multilevel model can do that the linear regression with "level dummies" and group-interactions cannot do?