19

My questions are:

  1. Are generalized linear models (GLMs) guaranteed to converge to a global maximum? If so, why?
  2. Furthermore, what constraints are there on the link function to ensure convexity?

My understanding of GLMs is that they maximize a highly nonlinear likelihood function. Thus, I would imagine that there are several local maxima, and that the parameter set you converge to depends on the initial conditions for the optimization algorithm. However, after doing some research I have not found a single source that indicates there are multiple local maxima. Furthermore, I am not very familiar with optimization techniques, but I know that the Newton-Raphson method and the IRLS algorithm are highly prone to getting stuck in local maxima.

Please explain, if possible, on both an intuitive and a mathematical basis!

EDIT: dksahuji answered my original question, but I want to add the follow-up question (2) above. ("What constraints are there on the link function to ensure convexity?")

gung - Reinstate Monica
DankMasterDan
  • I think some restrictions must be required before that could be so. What is the source for the statement? – Glen_b Feb 24 '14 at 04:26
  • Several sites seemed to imply it; however, I couldn't find anything that stated it outright, so I also welcome its disproof! – DankMasterDan Feb 24 '14 at 14:59
  • As long as the likelihood is well defined everywhere on the domain (and ignoring some tangential numerical issues) I think yes. Under those conditions, the Hessian is $<0$ everywhere on the domain, so the likelihood is globally concave. Btw, the functions are not 'highly non-linear' *in the parameters*, and that's what matters. – user603 Feb 24 '14 at 15:30
  • @user603 what is your source/proof that the Hessian is $<0$ everywhere? – DankMasterDan Feb 25 '14 at 21:10
  • 1
    Logistic, Poisson, and Gaussian regressions are often convex given a "good" link function. However, with arbitrary link function, they are not convex. – Memming Feb 25 '14 at 22:07
  • @Memming: assuming I use the canonical link, what is your proof/source that it will be convex for those distributions? – DankMasterDan Feb 25 '14 at 22:08
  • 1
    Just a note about my edit (which DankMasterDan brought up): the OP talked about minimizing the likelihood, which was obviously wrong; one would either maximize the likelihood or minimize the deviance. – StasK Feb 26 '14 at 00:32

2 Answers

16

The definition of exponential family is:

$$ p(x|\theta) = h(x)\exp(\theta^T\phi(x) - A(\theta)), $$

where $A(\theta)$ is the log partition function. Now one can prove that the following three things hold in the 1D case (they generalize to higher dimensions; see the properties of exponential families and of the log partition function):

  1. $ \frac{dA}{d\theta} = \mathbb{E}[\phi(x)]$

  2. $ \frac{d^2A}{d\theta^2} = \mathbb{E}[\phi^2(x)] -\mathbb{E}[\phi(x)]^2 = {\rm var}(\phi(x)) $

  3. $ \frac{ \partial ^2A}{\partial\theta_i\partial\theta_j} = \mathbb{E}[\phi_i(x)\phi_j(x)] -\mathbb{E}[\phi_i(x)] \mathbb{E}[\phi_j(x)] = {\rm cov}(\phi(x))_{ij} \Rightarrow \nabla^2A(\theta) = {\rm cov}(\phi(x))$
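For a concrete 1D instance of these identities, consider the Bernoulli distribution in canonical form, with $h(x) = 1$, $\phi(x) = x$, and $A(\theta) = \log(1 + e^{\theta})$:

$$ \frac{dA}{d\theta} = \frac{e^{\theta}}{1+e^{\theta}} = \mathbb{E}[x], \qquad \frac{d^2A}{d\theta^2} = \frac{e^{\theta}}{(1+e^{\theta})^2} = \mathbb{E}[x]\big(1-\mathbb{E}[x]\big) = {\rm var}(x) \ge 0. $$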

The above results prove that $A(\theta)$ is convex (since ${\rm cov}(\phi(x))$ is positive semidefinite). Now let us take a look at the likelihood function for the MLE:

\begin{align} p(\mathcal{D}|\theta) &= \bigg[\prod_{i=1}^{N}{h(x_i)}\bigg]\ \exp\!\bigg(\theta^T\bigg[\sum_{i=1}^{N}\phi(x_i)\bigg] - NA(\theta)\bigg) \\ \log\!\big(p(\mathcal{D}|\theta)\big) &= \sum_{i=1}^{N}\log h(x_i) + \theta^T\bigg[\sum_{i=1}^{N}\phi(x_i)\bigg] - NA(\theta) \\ &= {\rm const} + \theta^T\phi(\mathcal{D}) - NA(\theta), \end{align}

where $\phi(\mathcal{D}) = \sum_{i=1}^{N}\phi(x_i)$ and the constant does not depend on $\theta$.

Now $\theta^T\phi(\mathcal{D})$ is linear in $\theta$ and $-NA(\theta)$ is concave, so the log-likelihood is concave in $\theta$: every local maximum is a global maximum, and the maximizer is unique whenever $A$ is strictly convex.
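Explicitly, combining this with identities 1-3 above, the gradient and Hessian of the log-likelihood are

$$ \nabla_\theta \log p(\mathcal{D}|\theta) = \phi(\mathcal{D}) - N\,\mathbb{E}_\theta[\phi(x)], \qquad \nabla^2_\theta \log p(\mathcal{D}|\theta) = -N\,{\rm cov}_\theta(\phi(x)) \preceq 0, $$

which is exactly the curvature information that Newton-Raphson and IRLS exploit.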

There is a generalized version, the curved exponential family, for which a similar argument applies, but most of the proofs are given for the canonical form.
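To see this numerically, here is a minimal sketch (simulated data; the code and names are illustrative, not from the thread): Newton-Raphson/IRLS for logistic regression with the canonical logit link reaches the same maximizer from several random starting points, as concavity predicts.

```python
# Minimal sketch: because the logistic log-likelihood is concave,
# Newton-Raphson / IRLS should reach the same optimum from any reasonable start.
import numpy as np

rng = np.random.default_rng(0)

# Simulate N observations with an intercept and two covariates.
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def newton_logistic(X, y, beta0, n_iter=50):
    """Newton-Raphson (equivalently IRLS) for the canonical logit link."""
    beta = beta0.astype(float).copy()
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # E[y | x]
        W = mu * (1.0 - mu)                    # Var(y | x), the IRLS weights
        grad = X.T @ (y - mu)                  # score vector
        hess = -X.T @ (W[:, None] * X)         # Hessian: negative semidefinite
        beta -= np.linalg.solve(hess, grad)    # Newton update
    return beta

# Five random initializations all land on the same maximizer.
fits = np.array([newton_logistic(X, y, rng.normal(size=3)) for _ in range(5)])
print(np.ptp(fits, axis=0))  # coordinate-wise spread across starts: ~0
```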

gung - Reinstate Monica
dksahuji
  • so does this mean that GLMs have a unique global optimum no matter which link function is chosen (including the noncanonical ones)? – DankMasterDan Feb 26 '14 at 02:25
  • 3
    I will try to answer as far as I perceive it. $p(x|\theta) = h(x)\exp(\eta(\theta)^T\phi(x) - A(\eta(\theta)))$ is the case you are talking about. This is still concave in $\eta$ but may not be in $\theta$, so $\eta$ should be such that the whole log-likelihood is concave in $\theta$. – dksahuji Feb 26 '14 at 05:17
  • Note that the question asks about convergence, rather than just existence, but with a few restrictions, that, too, may be doable. – Glen_b Feb 26 '14 at 06:22
  • @Glen_b Can you elaborate? I don't know of any such restrictions. Maybe something like restrictions on the step size in a gradient-based optimizer to guarantee convergence in the case of a concave function. – dksahuji Feb 26 '14 at 11:11
  • 1
    As an example, there'd be a need to say we can't have things like complete separation. The convergence criteria will in practice leave us some distance from the optimum and I've seen cases where that distance can be quite large. That sort of thing might be regarded as distinct from the theoretical convergence. – Glen_b Feb 26 '14 at 13:14
  • 1
    @Glen_b That might be true in general, but I am not able to see any reason for a concave function not to converge to the optimum within a small tolerance. But I would say that I don't have any practical experience with these and I have just started. :) – dksahuji Feb 26 '14 at 15:27
  • @dksahuji Thank you! So two follow-ups: 1) is there any guide on how to choose the link function so that the whole likelihood will still be convex? 2) Do the canonical link functions have this property? – DankMasterDan Feb 26 '14 at 22:01
  • Regarding IRLS, see the last slide in [this](http://web.as.uky.edu/statistics/users/pbreheny/760/S13/notes/2-19.pdf) presentation ... there is no guarantee that IRLS will find the unique solution. – Stat May 14 '14 at 19:45
  • @Stat if we choose $\eta$ such that the log-likelihood is concave in $\theta$, then there should be no reason not to get close enough to the global optimum. I would just use gradient descent or maybe better first-order methods. Do you agree? – dksahuji Dec 14 '17 at 23:25
  • Are there any models for which $-A(\theta)$ is strictly concave? – newbie Nov 06 '20 at 01:19
  • _"Now $\theta^T[\phi(\mathcal{D})]$ is linear in theta and $-A(\theta)$ is concave. Therefore, there is a unique global maximum."_ How does the uniqueness of the global maximum follow from the concavity of the function? (Note that the _strict_ concavity has not been established) – paperskilltrees Jul 08 '21 at 05:19
5

I was investigating this heavily during my thesis. The answer is that the GLM likelihood is not always concave; it is only so under the right assumptions. A very good investigation of this was made by Wedderburn in his paper "On the Existence and Uniqueness of the Maximum Likelihood Estimates for Certain Generalized Linear Models", which can be found at https://www.jstor.org/stable/2335080
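As a concrete illustration of the kind of assumption needed (a standard sufficient condition, not a claim taken from the paper): for a binary-response GLM with inverse link $F$, the log-likelihood

$$ \ell(\beta) = \sum_{i=1}^{N}\Big[y_i\log F(x_i^T\beta) + (1-y_i)\log\big(1-F(x_i^T\beta)\big)\Big] $$

is concave in $\beta$ whenever $\log F$ and $\log(1-F)$ are concave functions of the linear predictor, because a concave function composed with an affine map remains concave. The logit and probit links satisfy this; an arbitrary link need not.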

Jon Lachmann