I am currently reading Bishop [1] and am confused about why we use IRLS (Iteratively Reweighted Least Squares) at all: it seems that gradient descent, which needs only first derivatives, would already solve the problem, so what is the point of introducing the Hessian matrix? Or have I misunderstood the algorithm somewhere?
- Could IRLS be faster than gradient descent? If so, why?
- What does "re-weighted" mean here? Gradient descent also updates the weights iteratively, so aren't those weights "re-weighted" too? (See the logistic-regression sketch at the end.)
- The method IRLS uses is Newton-Raphson, which gives exactly the same result as the standard least-squares solution in the linear regression model, as shown below.
$$ w_{\text{new}} = w_{\text{old}} - H^{-1}\nabla E(w) $$
$$ \nabla E(w) = \sum_{n=1}^{N}\left( w^{T}\phi_{n} - t_{n} \right)\phi_{n} = \Phi^{T}\Phi w - \Phi^{T}t $$
$$ H = \nabla\nabla E(w) = \sum_{n=1}^{N}\phi_{n}\phi_{n}^{T} = \Phi^{T}\Phi $$
$$ w_{\text{new}} = w_{\text{old}} - \left( \Phi^{T}\Phi \right)^{-1}\left\{ \Phi^{T}\Phi w_{\text{old}} - \Phi^{T}t \right\} = \left( \Phi^{T}\Phi \right)^{-1}\Phi^{T}t $$
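To check this numerically, here is a minimal sketch of one Newton-Raphson step on a least-squares problem. The toy data and the names `Phi`, `t`, `w_old` are mine, not Bishop's:

```python
import numpy as np

# Hypothetical toy data: Phi is the N x M design matrix, t the targets.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w_old = rng.normal(size=3)                     # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t         # gradient of E(w)
H = Phi.T @ Phi                                # Hessian
w_new = w_old - np.linalg.solve(H, grad)       # one Newton-Raphson step

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]  # closed-form least squares
print(np.allclose(w_new, w_ls))                # True: one step suffices
```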
Given this, I wonder: when the method is applied to linear regression, does the algorithm need only a single iteration to converge?
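And for the "re-weighted" sub-question above, this is how I currently read the IRLS update for logistic regression from Bishop's Section 4.3.3, sketched in the same notation. The function name and toy data are my own:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """IRLS for logistic regression (my reading of Bishop Section 4.3.3)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)       # current predictions y_n
        r = y * (1.0 - y)          # diagonal of the weight matrix R
        z = Phi @ w - (y - t) / r  # working targets z = Phi w - R^{-1}(y - t)
        # Weighted least-squares solve: w = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ (Phi * r[:, None]), Phi.T @ (r * z))
    return w

# Hypothetical toy data just to run the function
rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 2))
t = (rng.random(200) < sigmoid(Phi @ np.array([2.0, -1.0]))).astype(float)
print(irls_logistic(Phi, t))
```

As I understand it, `R` depends on the current `w`, so the effective least-squares problem is re-weighted on every iteration; in the linear-regression case above the Hessian is constant, which would explain why the update collapses to the closed-form solution.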
[1]: Bishop, Christopher M. (2006). *Pattern Recognition and Machine Learning*. Springer-Verlag New York.