
I am having difficulty deriving the Hessian of the objective function, $l(\theta)$, in logistic regression, where $l(\theta)$ is: $$ l(\theta)=\sum_{i=1}^{m} \left[y_{i} \log(h_\theta(x_{i})) + (1- y_{i}) \log (1 - h_\theta(x_{i}))\right] $$

$h_\theta(x)$ is the logistic function. The Hessian is $X^T D X$. I tried to derive it by computing $\frac{\partial^2 l(\theta)}{\partial \theta_i \partial \theta_j}$, but it wasn't obvious to me how to get from $\frac{\partial^2 l(\theta)}{\partial \theta_i \partial \theta_j}$ to the matrix notation.

Does anybody know any clean and easy way of deriving $X^T D X$?

COOLSerdash
DSKim
  • What did you get for $\frac{\partial^2 l}{\partial \theta_i \partial \theta_j}$? – Glen_b Aug 27 '13 at 02:23
  • Here is a good set of slides that show the exact calculation you are looking for: http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf – Aug 27 '13 at 05:19
  • I found a wonderful video which computes the Hessian step by step: [Logistic regression (binary) - computing the Hessian](https://www.youtube.com/watch?v=jUwjbiBUR-k&list=PLD0F06AA0D2E8FFBA&index=112) – Naomi Jan 22 '18 at 06:54

1 Answer


Here I derive all the necessary properties and identities so that the solution is self-contained, but apart from that the derivation itself is clean and easy. Let us formalize our notation and write the loss function a little more compactly. Consider $m$ samples $\{x_i,y_i\}$ such that $x_i\in\mathbb{R}^d$ and $y_i\in\{0,1\}$. Recall that in binary logistic regression the hypothesis function $h_\theta$ is typically the logistic function. Formally

$$h_\theta(x_i)=\sigma(\omega^Tx_i)=\sigma(z_i)=\frac{1}{1+e^{-z_i}},$$

where $\omega\in\mathbb{R}^d$ and $z_i=\omega^Tx_i$. The loss function (note the leading negative sign, which I believe is missing from the OP's expression) is then defined as:

$$l(\omega)=\sum_{i=1}^m -\Big( y_i\log\sigma(z_i)+(1-y_i)\log(1-\sigma(z_i))\Big)$$

There are two important properties of the logistic function which I derive here for future reference. First, note that $1-\sigma(z)=1-1/(1+e^{-z})=e^{-z}/(1+e^{-z})=1/(1+e^z)=\sigma(-z)$.

Also note that

\begin{equation} \begin{aligned} \frac{\partial}{\partial z}\sigma(z)=\frac{\partial}{\partial z}(1+e^{-z})^{-1}=e^{-z}(1+e^{-z})^{-2}&=\frac{1}{1+e^{-z}}\frac{e^{-z}}{1+e^{-z}} =\sigma(z)(1-\sigma(z)) \end{aligned} \end{equation}
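Both identities are easy to confirm numerically. Here is a throwaway NumPy sketch (the `sigmoid` helper and the test grid are my own, not part of the derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Identity 1: 1 - sigma(z) == sigma(-z)
print(np.allclose(1.0 - sigmoid(z), sigmoid(-z)))

# Identity 2: d/dz sigma(z) == sigma(z) * (1 - sigma(z)), checked with a central difference
finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(np.allclose(finite_diff, sigmoid(z) * (1.0 - sigmoid(z))))
```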

Instead of taking derivatives with respect to individual components, we will work directly with vectors (you may want to review vector and matrix derivatives first). The Hessian of the loss function $l(\omega)$ is given by $\vec{\nabla}^2l(\omega)$, but first recall that $\frac{\partial z}{\partial \omega} = \frac{\partial \omega^Tx}{\partial \omega}=x^T$ and $\frac{\partial z}{\partial \omega^T}=\frac{\partial \omega^Tx}{\partial \omega^T} = x$.

Let $l_i(\omega)=-y_i\log\sigma(z_i)-(1-y_i)\log(1-\sigma(z_i))$. Using the properties we derived above and the chain rule

\begin{equation} \begin{aligned} \frac{\partial \log\sigma(z_i)}{\partial \omega^T} &= \frac{1}{\sigma(z_i)}\frac{\partial\sigma(z_i)}{\partial \omega^T} = \frac{1}{\sigma(z_i)}\frac{\partial\sigma(z_i)}{\partial z_i}\frac{\partial z_i}{\partial \omega^T}=(1-\sigma(z_i))x_i\\ \frac{\partial \log(1-\sigma(z_i))}{\partial \omega^T}&= \frac{1}{1-\sigma(z_i)}\frac{\partial(1-\sigma(z_i))}{\partial \omega^T} =-\sigma(z_i)x_i \end{aligned} \end{equation}

It's now trivial to show that

$$\vec{\nabla}l_i(\omega)=\frac{\partial l_i(\omega)}{\partial \omega^T} =-y_ix_i(1-\sigma(z_i))+(1-y_i)x_i\sigma(z_i)=x_i(\sigma(z_i)-y_i)$$

whew!
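If you want a numerical sanity check of the summed gradient $\sum_i x_i(\sigma(z_i)-y_i)=X\big(\sigma(X^T\omega)-y\big)$, here is a small NumPy sketch with random data (all variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # X is d x m with one sample x_i per column, y is a length-m 0/1 vector
    s = sigmoid(X.T @ w)
    return -np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

rng = np.random.default_rng(0)
d, m = 4, 100
X = rng.normal(size=(d, m))
y = (rng.random(m) < 0.5).astype(float)
w = rng.normal(size=d)

# Closed-form gradient: sum_i x_i (sigma(z_i) - y_i) = X (sigma(X^T w) - y)
grad = X @ (sigmoid(X.T @ w) - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.zeros(d)
for j in range(d):
    e = np.zeros(d); e[j] = eps
    fd[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2.0 * eps)

print(np.allclose(grad, fd, atol=1e-4))
```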

Our last step is to compute the Hessian, which we obtain by differentiating the gradient once more with respect to $\omega$:

$$\vec{\nabla}^2l_i(\omega)=\frac{\partial^2 l_i(\omega)}{\partial \omega\,\partial \omega^T}=\frac{\partial}{\partial \omega}\Big(x_i(\sigma(z_i)-y_i)\Big)=x_i\,\sigma(z_i)(1-\sigma(z_i))\,\frac{\partial z_i}{\partial \omega}=x_ix_i^T\sigma(z_i)(1-\sigma(z_i))$$

For $m$ samples we have $\vec{\nabla}^2l(\omega)=\sum_{i=1}^m x_ix_i^T\sigma(z_i)(1-\sigma(z_i))$. This is equivalent to concatenating column vectors $x_i\in\mathbb{R}^d$ into a matrix $X$ of size $d\times m$ such that $\sum_{i=1}^m x_ix_i^T=XX^T$. The scalar terms are combined in a diagonal matrix $D$ such that $D_{ii}=\sigma(z_i)(1-\sigma(z_i))$. Finally, we conclude that

$$ \vec{H}(\omega)=\vec{\nabla}^2l(\omega)=XDX^T$$
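As before, you can sanity-check the closed form numerically; this NumPy sketch (random data, my own names) compares $XDX^T$ against a finite-difference Hessian of $l(\omega)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # X is d x m with one sample x_i per column, y is a length-m 0/1 vector
    s = sigmoid(X.T @ w)
    return -np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

rng = np.random.default_rng(1)
d, m = 3, 80
X = rng.normal(size=(d, m))
y = (rng.random(m) < 0.5).astype(float)
w = rng.normal(size=d)

# Closed-form Hessian: X D X^T with D_ii = sigma(z_i) (1 - sigma(z_i))
s = sigmoid(X.T @ w)
H = X @ np.diag(s * (1.0 - s)) @ X.T

# Finite-difference Hessian, entry by entry
eps = 1e-4
fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei = np.zeros(d); ei[i] = eps
        ej = np.zeros(d); ej[j] = eps
        fd[i, j] = (loss(w + ei + ej, X, y) - loss(w + ei - ej, X, y)
                    - loss(w - ei + ej, X, y) + loss(w - ei - ej, X, y)) / (4.0 * eps**2)

print(np.allclose(H, fd, atol=1e-3))
```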

A faster approach can be derived by considering all samples at once from the beginning and working with matrix derivatives instead. As an extra note, with this formulation it's trivial to show that $l(\omega)$ is convex. Let $\delta\in\mathbb{R}^d$ be any vector. Then

$$\delta^T\vec{H}(\omega)\delta = \delta^T\vec{\nabla}^2l(\omega)\delta = \delta^TXDX^T\delta = \big(D^{1/2}X^T\delta\big)^T\big(D^{1/2}X^T\delta\big) = \|D^{1/2}X^T\delta\|^2\geq 0$$

since $D_{ii}=\sigma(z_i)(1-\sigma(z_i))>0$ for every $i$, so the diagonal square root $D^{1/2}$ is well defined. This implies $H$ is positive semidefinite and therefore $l$ is convex (but not strongly convex).
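Empirically this is easy to see as well: in the following small NumPy sketch (random data, my own names), the smallest eigenvalue of $XDX^T$ is non-negative up to floating-point noise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, m = 5, 200
X = rng.normal(size=(d, m))
w = rng.normal(size=d)

s = sigmoid(X.T @ w)
H = X @ np.diag(s * (1.0 - s)) @ X.T   # X D X^T

# All eigenvalues of a positive-semidefinite matrix are >= 0 (up to numerical noise)
print(np.linalg.eigvalsh(H).min() >= -1e-10)
```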

Mark Cramer
Manuel Morales
  • In the last equation, shouldn't it be $\|\delta D^{1/2}X\|$ since $XDX^\top = XD^{1/2}(XD^{1/2})^\top$? – appletree Sep 04 '18 at 11:42
  • Shouldn't it be $X^T D X$? – Chintan Shah Feb 03 '19 at 21:21
  • @ChintanShah No, because the answer defines $X$ to be a $d \times m$ matrix that is the concatenation of the column vectors $x_i$. This is already the transpose of what is usually the $m \times d$ design matrix used in stats. – gowrath Apr 21 '20 at 14:15
  • Re: the missing sign of the loss function you mention, I think the OP's $l$ is actually the log-likelihood (to be maximized), not a loss (to be minimized), and is therefore not missing a minus sign after all. – chsk Oct 08 '21 at 18:50