
I am an author of a paper in which we show that the maximum-likelihood (ML) method can be derived as a limiting case of an iterated weighted least-squares fit: https://arxiv.org/abs/1807.07911

We, the authors of this paper, reached no consensus on whether this is a true derivation of the maximum-likelihood method, in the sense that one only needs to accept the iterated weighted least-squares method as fundamental (which in turn can be founded on simpler arguments).

In the literature that I know, the maximum-likelihood method is introduced ad hoc. It can be motivated by, but not derived from:

  • the likelihood principle (which is an axiom itself as far as I know)
  • the optimal properties of likelihood ratios in certain tests (there is no ratio in ML)
  • the optimal properties of the ML estimator in the asymptotic limit

The latter explains the success of ML, but you cannot start from the desired properties and derive ML as the only possible solution (or perhaps one can and I am not aware of it).

So, can the maximum-likelihood method for estimation be derived from more basic arguments?

PS: Why do we care? In statistics, as elsewhere, it would be nice to have a simple unified foundation for everything.

olq_plo

2 Answers


Your paper gives known results; see for instance Can you give a simple intuitive explanation of IRLS method to find the MLE of a GLM? or search this site under the corresponding tag. IRLS is in reality a version of the Newton method for optimization, and is just one of many ways to optimize the likelihood function numerically. I cannot see how that by itself can be used as a justification for the likelihood function or for estimation by maximum likelihood.
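
To make that concrete, here is a minimal sketch (my own toy example, not code from the paper) of IRLS for a Poisson GLM with a log link. Each weighted least-squares pass is algebraically the same update as a Fisher-scoring (Newton-type) step on the Poisson log-likelihood, which is the sense in which IRLS is just one way to optimize the likelihood numerically.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """IRLS for a Poisson GLM with log link.

    Each pass is one weighted least-squares solve; the update is
    algebraically identical to a Fisher-scoring (Newton-type) step
    on the Poisson log-likelihood.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta               # linear predictor
        mu = np.exp(eta)             # mean via the inverse link
        w = mu                       # working weights: (dmu/deta)^2 / Var(y) = mu
        z = eta + (y - mu) / mu      # working response
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ z)   # weighted least-squares step
    return beta

# toy usage with simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))
print(irls_poisson(X, y))
```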

For an intuitive explanation of likelihood, see Maximum Likelihood Estimation (MLE) in layman terms. There is no need to go via least squares, which is much less general.

kjetil b halvorsen
  • Thank you for pointing out additional references; we reworked the draft to include better references to prior work. We do not claim that we found something new, but the connection between likelihood and GLMs is not well known in our community, hence the paper. – olq_plo Dec 17 '18 at 18:48

I would say that the methods of 'maximum likelihood' and 'iterated reweighted least squares' are independent concepts. It is only in a specific case, that of generalized linear models (GLMs), that the two coincide.

The article speaks of two different types of derivation.

  • One is the derivation done in the 70s by Nelder and Wedderburn, and in different forms by others, that linked GLMs with iterated reweighted least squares (IRLS).
  • Another (section 4, "Unbinned maximum-likelihood from IRLS") is a derivation which is a bit different from the typical iterated reweighted least squares. It shows that minimizing Pearson's $\chi^2$ statistic (by means of IRLS) coincides with maximizing the likelihood in the limit of the bin sizes going to zero.

The "derivation" that is spoken about in the references (Wedderburn's work on GLM) is not a derivation of the maximum likelihood out of iterated reweighted least squares or vice versa.

That derivation is about proof of the equivalence between the two in a particular situation (in the case of generalized linear models).

I think that the Venn diagram below shows intuitively where these two methods coincide and where they differ. The methods are not the same (they coincide only on a specific subset, namely GLMs). Therefore they are independent concepts that stand on their own, and the one is not derived from the other.

[Venn diagram: maximum likelihood and iterated reweighted least squares as overlapping sets, intersecting in the GLMs]

Iterated Reweighted Least Squares

The generalized least-squares method, and the iterated reweighted least squares (IRLS) derived from it, stand on their own.

The motivation for this IRLS method is pragmatic: it yields the best linear unbiased estimator (BLUE). This is the content of the Gauss–Markov theorem (and of Aitken's extension to generalized least squares). The variance of a linear estimator does not depend on the specific error distribution, only on the covariance matrix of the errors, and under the right circumstances one linear estimator, the (generalized) least-squares solution, can be shown to have the lowest variance among all linear unbiased estimators.

(And so, because of that independence from the specific error distribution except for its variance, whenever ML coincides with such a weighted linear estimator, as it does for GLMs, only the variance of the error distribution matters, and the two happen to give the same result.)
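
As an illustration, here is a small simulation sketch (my own toy setup, not from the paper): with heteroscedastic errors of known scale, the weighted (generalized) least-squares slope has smaller variance than the ordinary least-squares slope, and this holds whatever the shape of the error distribution, because only the error covariance enters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
x = np.linspace(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.5 * x                       # assumed known heteroscedastic error scale
w_gls = 1.0 / sigma**2                # GLS weights: inverse error variances (up to a constant)
beta_true = np.array([1.0, 2.0])

def wls(X, y, w):
    """Weighted least squares: solve (X' W X) beta = X' W y."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

ols_slope, gls_slope = [], []
for _ in range(reps):
    # the shape of the error distribution is irrelevant for this comparison;
    # a (scaled) Student-t is used here only to make that point
    y = X @ beta_true + sigma * rng.standard_t(df=5, size=n)
    ols_slope.append(wls(X, y, np.ones(n))[1])
    gls_slope.append(wls(X, y, w_gls)[1])

print(np.var(ols_slope), np.var(gls_slope))   # the weighted estimator has the smaller variance
```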

Maximum Likelihood

Maximum likelihood also stands on its own.

The motivation for maximum likelihood is not something that was derived. It grew out of ideas about inverse probability and fiducial probability (but without the Bayesian prior). As long as the functions involved are 'well behaved', those principles/ideas/intuitions make sense, and the MLE (or related estimates, such as bias-corrected ones) turns out to be consistent and efficient.

The MLE, and likelihood more generally, relate to results that are independent of least squares, such as Wilks' theorem and the Cramér–Rao bound.

The maximum-likelihood method is an independent principle that was applied early in the history of statistical/probabilistic thinking, before and independently of generalized least-squares methods.

Difference with Least Squares

The maximum-likelihood estimate extends to a larger region than what can be done with IRLS. Note that the Gauss–Markov theorem and generalized least squares relate specifically to linear estimators.

Example (Laplace distribution): Say you estimate the location parameter of a Laplace-distributed variable (which is not a GLM). In that case the sample median (the maximum-likelihood estimate) has a lower variance than the sample mean (the least-squares estimate): the mean is indeed the BLUE, but it is not the minimum-variance unbiased estimator (MVUE), because being the best *linear* unbiased estimator does not make it the best unbiased estimator overall.
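
A quick simulation sketch of this comparison (my own toy numbers, not from the paper or the thread):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 101, 20000
samples = rng.laplace(loc=0.0, scale=1.0, size=(reps, n))

mean_est = samples.mean(axis=1)              # least-squares estimate of the location (the BLUE)
median_est = np.median(samples, axis=1)      # maximum-likelihood estimate of the location

print(np.var(mean_est), np.var(median_est))  # the median has the smaller variance
```

With 101 observations per replication, the variance of the median comes out at roughly half that of the mean, in line with the asymptotic efficiency ratio for the Laplace distribution.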

So such a simple example already shows that there is more than just least squares estimation.

The Poisson distribution happens to be a special case where the BLUE is also the MVUE. The $\lambda$ that maximizes the likelihood function happens to be the sample mean, $\hat\lambda = \frac{1}{n} \sum_{i=1}^n y_i$. The same is true for the normal distribution, where the ML estimate of the mean coincides with the least-squares estimate (but least squares is not a general principle that applies to any distribution).
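
For completeness, the one-line calculation behind the Poisson claim:

$$\ell(\lambda) = \sum_{i=1}^n \left( y_i \log\lambda - \lambda - \log y_i! \right), \qquad \frac{d\ell}{d\lambda} = \frac{1}{\lambda}\sum_{i=1}^n y_i - n = 0 \;\Rightarrow\; \hat\lambda = \frac{1}{n}\sum_{i=1}^n y_i .$$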

Maximum-likelihood estimation, like other likelihood-based estimation, is a more general principle that applies far beyond generalized linear models.

The equivalence between iterated reweighted least squares and maximum likelihood

In the article, it is mentioned that

the maximum-likelihood (ML) method can be derived as a limiting case of an iterated weighted least-squares fit.

But I believe that this goes in the wrong direction, or maybe neither of the principles can be derived from the other.

In your article you refer to sources such as Wedderburn (1974), where this is extended further: you can take any IRLS as a starting point and turn it into a quasi-likelihood model (and in the case of a single-parameter distribution, the quasi-likelihood function derived from IRLS is the same as a true likelihood function).

The same article by Wedderburn shows how iterated reweighted least squares is equivalent to the Gauss–Newton method for optimizing the likelihood function (as mentioned in Kjetil's answer).


About the article.

I find it confusing. I am not sure what point it wants to make. You write:

The insights discussed here are known in the statistics community [1, 2], but less so in the high-energy physics community.

It is admirable to explain the points from those two references, but which insights are being shared is not so clear to me. There seems to be something about differences in computation speed (or in the number of iterations, which is not the same as speed) and in robustness (whatever that might mean) between the iterated reweighted least-squares method and the MIGRAD algorithm in MINUIT, but this is not made very clear (and my intuition tells me that IRLS, like Gauss–Newton, is also a straightforward minimizer of the likelihood, so it is not clear why it should be better than MIGRAD). Figure 1 shows something, but I do not understand how it makes the point clearer or whether it is even sufficient (heuristic) evidence.

About the derivation.

We start by considering a histogram of $k$ samples $x_i$. Since the samples are independently drawn from a PDF, the histogram counts $n_l$ are uncorrelated and Poisson-distributed. Following the IWLS approach, we minimize the function

$$Q( \mathbf{p} ) = \sum_l \frac{( n_l - k P_l )^2}{k \hat{P}_l}$$

and iterate, where $P_l ( \mathbf{p} ) = \int_{x_l}^{x_l + \delta x} f(x;\mathbf{p}) dx$ is the expected fraction of the samples in bin $l$...

I do not see this as a typical iterated reweighted least-squares method where you only assume the variance as a function of the mean.

Instead, in this derivation, you very explicitly use the probability density $f(x;\mathbf{p})$ as a starting point, and not only the variance of the distribution as a function of its mean.

What you are basically doing is showing that a discretized version of likelihood maximization, based on histograms with Poisson-distributed bin counts, is asymptotically equal to continuous likelihood maximization. (You find the distribution that optimizes Pearson's chi-squared statistic and then take the limit of the bin sizes going to zero.)

While it is an interesting derivation of the equivalence of the two, this is not really the typical iterated reweighted least-squares method, where you only need the variance of the distribution as a function of its mean. What you do here is not a fit of the (conditional) expectation, but instead a fit of the entire probability density (by means of a histogram).
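
To illustrate this reading of the derivation numerically, here is a small sketch (my own toy example with an exponential density, not code from the paper; `scipy` does the inner one-parameter minimization). It iterates the Pearson chi-squared fit with frozen weights over a histogram, and as the bins shrink the result approaches the unbinned maximum-likelihood estimate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
lam_true = 1.5
x = rng.exponential(scale=1.0 / lam_true, size=2000)
print("unbinned MLE:", 1.0 / x.mean())            # for the exponential pdf the MLE is 1/mean

def fractions(lam, edges):
    """Expected bin fractions P_l(lam) for the exponential pdf f(x; lam)."""
    return np.diff(1.0 - np.exp(-lam * edges))

def binned_iwls(x, n_bins, n_outer=30):
    k = len(x)
    edges = np.linspace(0.0, x.max() + 1e-9, n_bins + 1)
    n_l, _ = np.histogram(x, bins=edges)
    lam = 1.0                                     # starting value
    for _ in range(n_outer):
        P_hat = fractions(lam, edges)             # weights frozen at the current estimate
        def Q(lam_new):                           # Pearson chi-squared with frozen denominator
            P = fractions(lam_new, edges)
            return np.sum((n_l - k * P) ** 2 / (k * P_hat))
        lam = minimize_scalar(Q, bounds=(0.1, 5.0), method="bounded").x
    return lam

for n_bins in (5, 20, 100, 400):
    print(n_bins, binned_iwls(x, n_bins))         # approaches the unbinned MLE as the bins shrink
```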

Iteratively minimizing the IRLS expression

$$Q( \mathbf{p} ) = \sum_l \frac{( n_l - k P_l )^2}{k \hat{P}_l}$$

is, at its fixed point, equivalent to minimizing the related loss function

$$L( \mathbf{p} ) = \sum_l \left( P_l - \frac{n_l}{k} \log P_l \right),$$

or, since $\sum_l P_l$ is constant, to maximizing

$$\sum_l \frac{n_l}{k} \log P_l .$$
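
A quick way to see this, assuming the iteration converges so that the frozen weights equal the current model fractions, $\hat{P}_l = P_l(\mathbf{p})$, at the fixed point: setting the gradient of $Q$ (with the weights held fixed) to zero gives

$$\frac{\partial Q}{\partial p_j} = -2 \sum_l \frac{n_l - k P_l}{\hat{P}_l} \, \frac{\partial P_l}{\partial p_j} = 0,$$

and at the fixed point, where $\hat{P}_l = P_l$, this becomes

$$\sum_l \left( \frac{n_l}{k P_l} - 1 \right) \frac{\partial P_l}{\partial p_j} = 0,$$

which is exactly the stationarity condition of $L(\mathbf{p})$ above (and, since $\sum_l P_l$ is fixed, of $\sum_l \tfrac{n_l}{k}\log P_l$).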

So you presuppose some sort of likelihood or goodness-of-fit measure and just use IRLS (optimizing $L$ by iteratively optimizing $Q$) to compute the most likely fit of the data with the distribution function. This is not really maximum likelihood being derived from IRLS. It is more like maximum likelihood being derived from the decision to fit the distribution function $f(x;\mathbf{p})$.

Sextus Empiricus
  • I am accepting this answer because it is very thoughtful, but I would be glad if we could discuss it some more. Regarding the paper, the intent was two-fold. There is a practical aspect: using iterated weighted least squares may solve some problems better than a direct numerical optimisation of the log-likelihood. The second aspect was the deeper philosophical one, to show how the maximum-likelihood method can be derived if you start from weighted least squares. I think this starting point is as valid as just declaring maximum likelihood an axiom. – olq_plo Oct 01 '20 at 20:46
  • Regarding the question of what is more fundamental, it does not matter what came first historically. Physicists thought for 400 years that Newton's axioms were fundamental, but now we believe that there is a deeper axiom, that of extremal action. – olq_plo Oct 01 '20 at 20:49
  • In the part of the paper where we derive maximum likelihood from iterated least squares, we derive the general maximum-likelihood formula, i.e. find the extremum of the log-probabilities, for a general probability density. So it is not true that IWLS/IRLS does not cover all cases that maximum likelihood covers. We use the trick of starting from a histogram, where the bin counts are Poisson-distributed, and then letting the bin width go to zero. The data are then just a 1D point cloud and the model is a generic pdf with some parameters. It does not have to be part of the exponential family. – olq_plo Oct 01 '20 at 21:01
  • @olq_plo I have revised my answer somewhat in a way that is, I hope, a bit clearer. (But to be honest, I still have to go through it and also see whether it might help with your last three comments; I have rewritten it a bit quickly.) – Sextus Empiricus Oct 02 '20 at 09:06
  • Regarding your 3rd comment... I have to review your derivation again (to be honest, I skipped quickly over it because I did not expect anything new from it), but I doubt that you can derive a general ML formula. You can derive likelihood formulas for GLMs, or quasi-likelihood formulas, but not *all* likelihood formulas. The Laplace distribution is an easy example that cannot be derived from IRLS. (Your trick with bin counts sounds like something different from parameter estimation.) – Sextus Empiricus Oct 02 '20 at 09:27
  • I just re-read your derivation, and I added my comments about it in my answer. In my opinion, it conflates different applications of iterated reweighted least squares, and it is *not* the case that you derive the maximum-likelihood method from the iterated reweighted least-squares method. The argument is circular because you explicitly insert the distribution $f(x;\mathbf{p})$. That is different from Wedderburn's work and from Charles, Frome, and Yu, where the distribution $f(x;\mathbf{p})$ (quasi or not) follows from the least-squares method, in which the only assumption made is the variance of the errors. – Sextus Empiricus Oct 02 '20 at 16:02
  • I have to remove the "accept" again if you admit that you haven't really read our mathematical argument in detail. We are not confusing anything, and we are not making a circular argument. – olq_plo Oct 03 '20 at 13:18
  • You also did not remove your statement that you find our paper confusing. Everything is confusing if you do not take the time to think about it. Whether you see the point of the paper or not is also not part of the answer to the question. – olq_plo Oct 03 '20 at 13:22
  • "I do not see this as a typical iterated reweighted least-squares method where you only assume the variance as a function of the mean." In this part you correctly describe what we have shown, so it seems you have understood what we have written, but you disregard it based on vague grounds. If the derivation is correct, we have derived the score functions as a limiting case from IWLS. That is what we said and that is what we did. If you did not know that maximum likelihood existed, you could have derived it here. As a starting point you only need IWLS, histograms, and pdfs/cdfs. – olq_plo Oct 03 '20 at 13:38
  • GLMs have nothing to do with what we show in the paper. – olq_plo Oct 03 '20 at 13:39
  • @olq_plo you write "The insights discussed here are known in the statistics community [1, 2], but less so in the high-energy physics community". Those two references are *all* about GLMs. What you do with the derivation in section 4 is an entirely different thing from the typical IRLS used in those two references (which is estimating the mean of Y, knowing only the variance as a function of the mean and nothing else about the distribution; that is not estimating the distribution of Y, which is what you do with those histograms). – Sextus Empiricus Oct 03 '20 at 13:44
  • @olq_plo *"If the derivation is correct, we have derived the score functions as a limiting case from IWLS."* I would not say that your derivation is completely IWLS. IWLS was just the optimization technique that was used. But you made an *additional* assumption that it is a good idea to fit the histogram of the conditional distribution $y|x \sim f(x;\mathbf{p})$. Where does that idea come from that you need to fit the histogram? That idea is not contained in IWLS, IWLS is just used to *do* the fit, but it does not decide *what* to fit. – Sextus Empiricus Oct 03 '20 at 14:09
  • The reason that your limiting case (bin size going to zero) ended up maximizing the likelihood is that you decided to fit the distribution function, not that you decided to use IWLS. – Sextus Empiricus Oct 03 '20 at 14:11