Questions tagged [hessian]

For on-topic questions involving the Hessian matrix, the square matrix of second-order partial derivatives that generalizes the second derivative. Please also include a statistical-methods tag. Purely mathematical questions about the Hessian are better asked on math.SE at https://math.stackexchange.com/.

Wikipedia has an article with further references.

92 questions
219 votes · 9 answers

Why is Newton's method not widely used in machine learning?

This is something that has been bugging me for a while, and I couldn't find any satisfactory answers online, so here goes: After reviewing a set of lectures on convex optimization, Newton's method seems to be a far superior algorithm to gradient…
Fei Yang · 2,181
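
The trade-off behind this question can be seen in a few lines of NumPy: on a quadratic with Hessian A, one Newton step solves a d × d linear system while a gradient step only needs the gradient itself. A minimal sketch, with an assumed toy problem:

```python
import numpy as np

# Sketch: one step of gradient descent vs. Newton's method on the
# quadratic f(x) = 0.5 x'Ax - b'x, with gradient Ax - b and constant
# Hessian A. The problem data here is synthetic.
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)              # symmetric positive definite Hessian
b = rng.standard_normal(d)
x = rng.standard_normal(d)

grad = A @ x - b
x_gd = x - 0.01 * grad                   # gradient step: O(d) extra work
x_newton = x - np.linalg.solve(A, grad)  # Newton step: O(d^3) solve

# On a quadratic the Newton step lands exactly on the minimizer A^{-1}b,
# which is why it looks superior; forming and factoring the Hessian is
# what becomes prohibitive when d runs into the millions.
```
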
47 votes · 1 answer

Explanation of min_child_weight in xgboost algorithm

The definition of the min_child_weight parameter in xgboost is given as the: minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than…
User123456789 · 613
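
For intuition, min_child_weight can be exercised directly. A minimal sketch with synthetic data (the parameter names are real xgboost options; the data and values are illustrative):

```python
import numpy as np
import xgboost as xgb

# Sketch: with squared-error loss, each instance's hessian is 1, so
# min_child_weight acts like a minimum number of instances per leaf;
# for other losses it is a minimum *summed hessian* per leaf.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(200)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "max_depth": 4,
    "min_child_weight": 10,  # reject splits whose child's summed hessian < 10
}
booster = xgb.train(params, dtrain, num_boost_round=20)
```
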
28 votes · 6 answers

Why not use the third derivative for numerical optimization?

If Hessians are so good for optimization (see e.g. Newton's method), why stop there? Why not use the third, fourth, fifth, and sixth derivatives as well?
echo · 823
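
One concrete answer involves methods that do use the third derivative, such as Halley's method in one dimension; the usual objection is that in d dimensions the third derivative is a d × d × d tensor. A 1-d sketch with an assumed example function:

```python
import math

# Sketch: minimizing f means solving f'(x) = 0. Newton uses f''; Halley's
# method also uses f''' to correct for how the curvature changes along
# the step. Example function: f(x) = exp(x) - 2x, minimized at x = ln 2.
f1 = lambda x: math.exp(x) - 2.0   # f'
f2 = lambda x: math.exp(x)         # f''
f3 = lambda x: math.exp(x)         # f'''

x_newton = x_halley = 3.0
for _ in range(5):
    x_newton -= f1(x_newton) / f2(x_newton)
    g, gp, gpp = f1(x_halley), f2(x_halley), f3(x_halley)
    x_halley -= 2 * g * gp / (2 * gp**2 - g * gpp)

print(x_newton, x_halley, math.log(2))  # Halley is closer after 5 steps
```
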
11 votes · 2 answers

Name for outer product of gradient approximation of Hessian

Is there a name for approximating the Hessian as the outer product of the gradient with itself? If one is approximating the Hessian of the log-loss, then the outer product of the gradient with itself is the Fisher information matrix. What about in…
Neil G · 13,633
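
The approximation the question refers to is easy to state in code. A minimal sketch, with stand-in per-example gradients:

```python
import numpy as np

# Sketch of the outer-product-of-gradients approximation
#   H ≈ (1/n) Σ_i g_i g_i'
# built from per-example gradients g_i (random stand-ins here).
rng = np.random.default_rng(1)
n, d = 100, 3
per_example_grads = rng.standard_normal((n, d))

H_approx = per_example_grads.T @ per_example_grads / n
# By construction this matrix is symmetric positive semidefinite,
# unlike the exact Hessian, which is one reason it is popular.
print(np.linalg.eigvalsh(H_approx))
```
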
7 votes · 1 answer

Gradient descent and local maximum

I read that gradient descent always converges to a local minimum, while for other methods such as Newton's method this is not guaranteed (if the Hessian is not positive definite); but if the starting point in GD is unfortunately a local maximum (and then the…
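
The edge case in the question is easy to demonstrate: every stationary point, including a local maximum, is a fixed point of gradient descent. A sketch with an assumed 1-d function:

```python
# Sketch: f(x) = x**4 - 2*x**2 has a local maximum at x = 0 (f'(0) = 0,
# f''(0) < 0) and minima at x = ±1. Started exactly at 0, GD never moves.
def grad(x):
    return 4 * x**3 - 4 * x

x = 0.0
for _ in range(100):
    x -= 0.1 * grad(x)
print(x)  # still 0.0; any tiny perturbation would escape toward ±1
```
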
7 votes · 1 answer

Why is the Hessian of the log likelihood function in the logit model not negative *semi*definite?

The Hessian of the log likelihood function is $$\frac{\partial^2 \ln L(\beta \mid x)}{\partial \beta \partial \beta'} = -\sum_{i=1}^n…
Fredrik P · 436
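
The sign structure behind the question can be checked numerically. A sketch with synthetic data:

```python
import numpy as np

# Sketch: the logit log-likelihood Hessian is
#   H(β) = -Σ_i p_i (1 - p_i) x_i x_i',   p_i = 1 / (1 + exp(-x_i'β)),
# so v'Hv = -Σ_i p_i (1 - p_i) (x_i'v)^2 ≤ 0, with equality only when
# Xv = 0: H is negative definite whenever X has full column rank.
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
beta = rng.standard_normal(3)

p = 1.0 / (1.0 + np.exp(-X @ beta))
H = -(X.T * (p * (1 - p))) @ X           # = -Σ_i w_i x_i x_i'
print(np.linalg.eigvalsh(H))             # all strictly negative here
```
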
6 votes · 1 answer

Gradient and hessian of the MAPE

I want to use MAPE (Mean Absolute Percentage Error) as my loss function.

```python
def mape(y, y_pred):
    grad = <<<>>>
    hess = <<<>>>
    return grad, hess
```

Can someone help me understand the hessian and gradient for MAPE as a loss function? We need to…
Arc · 235
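
One common answer: the true second derivative of MAPE is zero almost everywhere, so custom-objective implementations substitute a small positive surrogate hessian. A hedged sketch in the xgboost/lightgbm custom-objective style; the surrogate constant is a choice, not a canonical value:

```python
import numpy as np

# Sketch: per-point MAPE loss |y - p| / |y| as a boosting objective,
# assuming y and y_pred are float arrays.
def mape_objective(y, y_pred):
    denom = np.maximum(np.abs(y), 1e-8)   # guard against y == 0
    grad = np.sign(y_pred - y) / denom    # d/dp of |y - p| / |y|
    hess = np.full_like(y, 1e-6)          # surrogate; true hessian is 0 a.e.
    return grad, hess
```
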
6 votes · 1 answer

Interpretation of eigenvectors of Hessian inverse

I'm reading a paper in which they use the eigenvectors of the inverse Hessian of a continuous probability distribution to characterize dimensions along which the distribution is most and least constrained. I'm having some trouble with the intuition…
Vivek Subramanian · 2,613
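
The Gaussian case gives the cleanest intuition here: at the mode, the inverse Hessian of the negative log density is exactly the covariance matrix. A sketch with an assumed 2-d example:

```python
import numpy as np

# Sketch: eigenvectors of the inverse Hessian with LARGE eigenvalues are
# high-variance, weakly constrained directions; SMALL eigenvalues mark
# tightly constrained ones.
H = np.array([[10.0, 0.0],
              [0.0, 0.1]])        # Hessian of -log p at the mode
cov = np.linalg.inv(H)            # = covariance for a Gaussian
vals, vecs = np.linalg.eigh(cov)
print(vals)                       # [0.1, 10.0]
print(vecs[:, -1])                # least constrained direction (2nd axis)
```
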
5 votes · 2 answers

What is a consequence of an ill-conditioned Hessian matrix?

In this publication I found an explanation of the Hessian matrix, along with what it means for it to be ill-conditioned. In the paper, there is this link given between the error surface and the eigenvalues of the Hessian matrix: The curvature of…
kamilazdybal · 672
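
The consequence is easiest to see on a quadratic bowl: the step size must respect the largest eigenvalue, so progress along the smallest one crawls. A minimal sketch:

```python
import numpy as np

# Sketch: gradient descent on f(x) = 0.5 x'Hx with condition number
# κ = λ_max / λ_min = 100. A step size of roughly 1/λ_max is needed
# for stability, which throttles the low-curvature coordinate.
H = np.diag([100.0, 1.0])
x = np.array([1.0, 1.0])
step = 1.0 / 100.0
for _ in range(100):
    x = x - step * (H @ x)
print(x)  # first coordinate is gone; second has only decayed to ~0.37
```
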
5 votes · 0 answers

How the Hessian matrix is used in optimization if you can't invert it

I've seen quite a lot of work on approximating the Hessian, such as the Hessian-vector product, but I'm not entirely sure how knowing the Hessian helps us evaluate the gradient step to take. Newton's method utilizes the inverse Hessian such…
tryingtolearn · 499
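
The standard resolution is Newton-CG: conjugate gradient solves H d = -g using only Hessian-vector products, so the Hessian is never formed or inverted. A sketch where the product is faked with a small explicit matrix; in practice it would come from autodiff or a finite difference of two gradient evaluations:

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # stand-in; never inverted below
g = np.array([1.0, 2.0])          # current gradient

def hvp(v):
    return H @ v                  # the only access to H we allow ourselves

# Conjugate gradient for H d = -g.
d = np.zeros(2)
r = -g - hvp(d)                   # initial residual
p = r.copy()
for _ in range(2):                # exact in n steps for an n x n SPD matrix
    Hp = hvp(p)
    alpha = (r @ r) / (p @ Hp)
    d = d + alpha * p
    r_new = r - alpha * Hp
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

print(d, np.linalg.solve(H, -g))  # the two agree
```
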
5 votes · 3 answers

How does the second derivative inform an update step in Gradient Descent?

I was reading the deep learning book by Bengio, Goodfellow, and Courville, and there was one section where they explain the second derivative that I don't understand (section 4.3.1): The second derivative tells us how the first derivative will change…
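
The passage being asked about reduces to a 1-d calculation. A sketch of the point, with an assumed quadratic:

```python
# Sketch: for f(x) = 0.5 * a * x**2 the curvature a = f''(x) determines
# whether a fixed step size over- or undershoots; the curvature-aware
# choice (a Newton step, eps = 1/f''(x)) lands exactly at the minimum.
a = 4.0                        # f''(x)
x = 1.0
f_prime = a * x
x_small = x - 0.1 * f_prime    # fixed step, ignores curvature: undershoots
x_newton = x - f_prime / a     # = 0.0, the minimizer
print(x_small, x_newton)
```
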
5 votes · 1 answer

Why does the determinant of the Hessian grow with n?

Context: I'm trying to understand BIC on a deeper level. I'm using BIC for model/structure selection for Bayesian networks. I'm confused because BIC is an approximation to the likelihood of a model, and the likelihood should never decrease when the…
Lizzie Silver · 1,009
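
One way to see the growth: the log-likelihood Hessian is a sum of n per-observation terms, so it scales like n times an average matrix, and in d dimensions det(n · H̄) = n^d · det(H̄). A numerical sketch with a stand-in matrix:

```python
import numpy as np

# Sketch: det(n * H_bar) = n**d * det(H_bar) for a d x d matrix H_bar.
d = 3
H_bar = np.diag([1.0, 2.0, 0.5])         # stand-in per-observation Hessian
for n in (10, 100, 1000):
    print(n, np.linalg.det(n * H_bar))   # grows like n**3 here
```
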
4 votes · 1 answer

Parameter uncertainty in least squares optimization: rescaling Hessian

Given a least squares optimization problem of the form: $$ C(\lambda) = \sum_i ||y_i - f(x_i, \lambda)||^2$$ I have found in multiple questions/answers (e.g. here) that an estimate for the covariance of the parameters can be computed from the…
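
The recipe usually cited goes as follows; a hedged sketch, where the factor of 2 comes from H being the Hessian of the sum of squares rather than of half of it, and the function name is illustrative, not a library API:

```python
import numpy as np

# Sketch: with C(λ) = Σ_i r_i(λ)^2, Gaussian noise, n data points and
# p parameters, Cov(λ̂) ≈ 2 σ̂² H^{-1}, where σ̂² = C(λ̂) / (n - p).
def covariance_from_hessian(H, rss, n, p):
    sigma2_hat = rss / (n - p)           # residual variance estimate
    return 2.0 * sigma2_hat * np.linalg.inv(H)
```
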
4 votes · 1 answer

Variance of maximum likelihood estimator in R

In different sources there is an algorithm for how to calculate the variance of the MLE in R. To keep it short: construct the negative log-likelihood function, minimize it via nlm or optim with hessian=TRUE, invert the Hessian, and read out the diagonal…
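
An analogous sketch in Python (using scipy in place of R's nlm/optim), assuming a normal model with synthetic data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of the recipe: minimize the negative log-likelihood, take the
# inverse Hessian at the optimum, and read standard errors off its
# diagonal. Note the SEs are for the (mu, log_sigma) parametrization.
rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=200)

def negloglik(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")
# BFGS already returns an approximate INVERSE Hessian as res.hess_inv;
# with a finite-difference Hessian one would invert it explicitly.
std_errors = np.sqrt(np.diag(res.hess_inv))
print(res.x, std_errors)
```
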
4 votes · 1 answer

Why calculate the standard error of an MLE (and confidence intervals) from Hessian matrices?

I might not have fully understood these concepts, and I am confused about how the standard error is calculated. Here are my understandings and confusions; let me know where I went wrong. EDIT: I was talking about the hessian matrix output from R…