I am currently reading Bishop [1] and am confused about why we use IRLS (Iteratively Reweighted Least Squares) at all: it seems that gradient descent, which needs only first derivatives, would already solve the problem, so what is the point of introducing the Hessian matrix? Or have I misunderstood the algorithm somewhere?
- Could IRLS be faster than gradient descent? If so, why?
- What does "re-weighted" mean here? Gradient descent also updates the weights iteratively, so aren't those weights "re-weighted" too? (See the logistic-regression sketch at the end.)
- The method IRLS uses is Newton-Raphson, which gives exactly the same result as the standard least-squares solution in the linear regression model, as shown below.
$$ w_{\text{new}} = w_{\text{old}} - H^{-1}\nabla E(w) $$
$$ \nabla E(w) = \sum_{n=1}^{N}\left( w^{T}\phi_{n} - t_{n} \right)\phi_{n} = \Phi^{T}\Phi w - \Phi^{T}t $$
$$ H = \nabla\nabla E(w) = \sum_{n=1}^{N}\phi_{n}\phi_{n}^{T} = \Phi^{T}\Phi $$
$$ w_{\text{new}} = w_{\text{old}} - \left( \Phi^{T}\Phi \right)^{-1}\left\{ \Phi^{T}\Phi w_{\text{old}} - \Phi^{T}t \right\} = \left( \Phi^{T}\Phi \right)^{-1}\Phi^{T}t $$
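To check this numerically, here is a minimal sketch of one Newton-Raphson step on a least-squares problem. The toy data and the names `Phi`, `t`, `w_old` are mine, not Bishop's:

```python
import numpy as np

# Hypothetical toy data: Phi is the N x M design matrix, t the targets.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_old = rng.normal(size=3)                     # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t         # gradient of E(w)
H = Phi.T @ Phi                                # Hessian
w_new = w_old - np.linalg.solve(H, grad)       # one Newton-Raphson step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]  # closed-form least squares
print(np.allclose(w_new, w_ls))                # True: one step suffices
```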
Given this, I wonder: when the method is applied to linear regression, does the algorithm need only a single iteration to converge?
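And for the "re-weighted" sub-question above, this is how I currently read the IRLS update for logistic regression from Bishop's Section 4.3.3, sketched in the same notation. The function name and toy data are my own:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """IRLS for logistic regression (my reading of Bishop Section 4.3.3)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)       # current predictions y_n
        r = y * (1.0 - y)          # diagonal of the weight matrix R
        z = Phi @ w - (y - t) / r  # working targets z = Phi w - R^{-1}(y - t)
        # Weighted least-squares solve: w = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ (Phi * r[:, None]), Phi.T @ (r * z))
    return w

# Hypothetical toy data just to run the function
rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 2))
t = (rng.random(200) < sigmoid(Phi @ np.array([2.0, -1.0]))).astype(float)
print(irls_logistic(Phi, t))
```

As I understand it, `R` depends on the current `w`, so the effective least-squares problem is re-weighted on every iteration; in the linear-regression case above the Hessian is constant, which would explain why the update collapses to the closed-form solution.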
[1]: Bishop, Christopher M. (2006). *Pattern Recognition and Machine Learning*. Springer-Verlag New York.