Questions tagged [matrix-calculus]


Matrix calculus deals with the problem of differentiating (possibly matrix-valued) functions of matrices, extending the familiar formulae of single-variable calculus.

From "regular" calculus, the section on derivatives of powers, we know that $$ \frac{\rm d}{{\rm d}x} x^{-1} = -x^{-2} $$ How does this look for matrices? What is the derivative of the inverse of a matrix... if any? Well it turns out that that the concept of a derivative of a matrix with respect to a matrix is a fairly complicated one: if you take a derivative of an $m\times n$ matrix by a $p\times q$ matrix, you end up with an object that must have $m\cdot n\cdot p\cdot q$ entries. This could be a tensor with four dimensions, but few statisticians have training in tensors (most theoretical physicists do, though). It is, however, reasonably easy to talk about differentials of matrices.

If a matrix $A$ is given by $$ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} $$ with its elements treated akin to the components of a vector in multivariable calculus, then its differential is $$ {\rm d}A = \begin{pmatrix} {\rm d}a_{11} & {\rm d}a_{12} & \cdots & {\rm d}a_{1n} \\ {\rm d}a_{21} & {\rm d}a_{22} & \cdots & {\rm d}a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ {\rm d}a_{m1} & {\rm d}a_{m2} & \cdots & {\rm d}a_{mn} \end{pmatrix} $$

Let us rewrite the derivative expression for the inverse in differentials: $$ {\rm d}\bigl( x^{-1} \bigr) = -x^{-2} \, {\rm d}x $$

The matrix differential of the inverse simply takes care of the lack of commutativity of matrix multiplication: $${\rm d}(A^{-1}) = -A^{-1} \, {\rm d} A \, A^{-1}$$ Of course, this simplifies to the "standard" expression when $m=n=1$ and the matrix $A$ is simply a scalar.
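This identity is easy to check numerically. The sketch below (a NumPy illustration, with a random well-conditioned $A$ and a small perturbation standing in for ${\rm d}A$) compares the actual change in the inverse with the differential formula:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # keep A well away from singular
dA = 1e-6 * rng.standard_normal((4, 4))          # small perturbation playing the role of dA

lhs = np.linalg.inv(A + dA) - np.linalg.inv(A)   # actual change in the inverse
rhs = -np.linalg.inv(A) @ dA @ np.linalg.inv(A)  # differential formula

print(np.max(np.abs(lhs - rhs)))  # tiny: agreement up to second-order terms in dA
```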

Most of the operations have the differentials you would expect them to have: $$ {\rm d}(A+B) = {\rm d}A + {\rm d}B $$ $$ {\rm d}(AB) = {\rm d}A \, B + A \, {\rm d}B $$ $$ {\rm d}(A \otimes B) = {\rm d}A \otimes B + A \otimes {\rm d}B $$ where the latter is the Kronecker product. Some operations are, however, unique to matrices, and their differentials require extra work, such as the determinant $$ {\rm d} \det A = \det A \cdot {\rm tr} (A^{-1} \, {\rm d}A ) $$ where $A$ has full rank, or the eigenproblem: $$ A=A^T, \; Au = \lambda u, \; \| u \| =1 \;\Rightarrow\; {\rm d}\lambda = u^T ({\rm d}A) u, \quad {\rm d}u = (\lambda I_n - A)^+ ( {\rm d}A) u $$ where $X^+$ denotes the Moore-Penrose inverse of $X$. If $A$ is not symmetric, the eigenvalues and eigenvectors are, in general, complex, and the expressions for their differentials are more complicated.
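The determinant and eigenvalue rules can also be verified numerically. This NumPy sketch (an added illustration using a random symmetric matrix) checks ${\rm d}\det A = \det A \cdot {\rm tr}(A^{-1}{\rm d}A)$ and ${\rm d}\lambda = u^T({\rm d}A)u$ for the smallest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
A = A + A.T                                # symmetric, so eigenvalues/eigenvectors are real
dA = 1e-6 * rng.standard_normal((n, n))
dA = dA + dA.T                             # keep the perturbation symmetric as well

# Determinant rule: d det A = det A * tr(A^{-1} dA)
lhs_det = np.linalg.det(A + dA) - np.linalg.det(A)
rhs_det = np.linalg.det(A) * np.trace(np.linalg.solve(A, dA))
print(lhs_det, rhs_det)                    # agree up to second-order terms

# Eigenvalue rule: for a simple eigenvalue lambda with unit eigenvector u, d lambda = u^T dA u
w, U = np.linalg.eigh(A)
w_new, _ = np.linalg.eigh(A + dA)
k = 0                                      # track the smallest eigenvalue
print(w_new[k] - w[k], U[:, k] @ dA @ U[:, k])
```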

If $x \in \mathbb{R}^n$ and $f: \mathbb{R}^n \to \mathbb{R}^m$ is an $m$-dimensional vector function of $x$, then the Jacobian matrix of this transformation is the $m \times n$ matrix of partial derivatives: $$ {\rm D} f(x) = \frac{\partial f(x)}{\partial x'} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \ldots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \ldots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \ldots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} $$
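A finite-difference version of this Jacobian is a useful sanity check when deriving matrix-calculus results by hand; the following is a minimal NumPy sketch (the helper name is arbitrary):

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m, returned as an m x n matrix."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(f(xp)) - f0) / eps
    return J

# Example: f(x) = (x1*x2, x1^2, sin(x2)) has an easily checked 3 x 2 Jacobian
f = lambda x: np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])
print(jacobian_fd(f, np.array([1.0, 2.0])))
# approximately [[2, 1], [2, 0], [0, cos(2)]]
```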

Having introduced the differentials, Magnus and Neudecker (2007) argue in favor of the following definition of the derivative of a matrix function $F(X)$, if you really want to define one, based on the Jacobians: $$ {\rm D} F(X) = \frac{\partial \, {\rm vec} F(X)}{\partial ({\rm vec} X)'} $$ They discuss the advantages of this definition over some other definitions in terms of the mathematical consistency it provides: for instance, the derivative of the identity function is an identity matrix, one-to-one transformations have non-degenerate Jacobians with non-zero determinants, and the chain rule takes its traditional form for matrix functions. However, in many problems working directly with differentials is more convenient than working with derivatives.
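As an illustration of this definition (a NumPy sketch added here, using the standard Kronecker-product form of the differential of the inverse, which is not derived above): for $F(X)=X^{-1}$ one has ${\rm d}\,{\rm vec}(X^{-1}) = -(X^{-T}\otimes X^{-1})\,{\rm vec}({\rm d}X)$, so ${\rm D}F(X) = -(X^{-T}\otimes X^{-1})$, which can be checked against a finite-difference Jacobian of ${\rm vec}\,F(X)$ with respect to $({\rm vec}\,X)'$:

```python
import numpy as np

vec = lambda M: M.flatten(order="F")  # column-stacking vec, as in Magnus & Neudecker

def DF_fd(F, X, eps=1e-6):
    """Finite-difference version of D F(X) = d vec F(X) / d (vec X)'."""
    f0 = vec(F(X))
    J = np.zeros((f0.size, X.size))
    for j in range(X.size):
        dx = np.zeros(X.size)
        dx[j] = eps
        J[:, j] = (vec(F(X + dx.reshape(X.shape, order="F"))) - f0) / eps
    return J

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 3)) + 3 * np.eye(3)

numeric = DF_fd(np.linalg.inv, X)
closed_form = -np.kron(np.linalg.inv(X).T, np.linalg.inv(X))  # -(X^{-T} kron X^{-1})
print(np.max(np.abs(numeric - closed_form)))                  # small, of the order of eps
```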

References:

Wikipedia article on matrix calculus

Magnus, J. R., and H. Neudecker (2007). Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed. Wiley.

55 questions
22 votes, 4 answers

Textbooks on Matrix Calculus?

See this question on Math SE. Short story: I read The Elements of Statistical Learning and got frustrated when I was trying to verify some of the results, e.g., given $$\text{RSS}(\beta) =…
Clarinetist
7 votes, 1 answer

How to differentiate with respect to a matrix?

How can I differentiate the following by $\mathbf{W}$? \begin{equation} \mathbf{Y} = (\mathbf{W}^T\mathbf{x} + b)^2 \end{equation} where $\mathbf{W} \in \mathcal{R}^{d\times D}$ and $\mathbf{x}\in \mathcal{R}^{d\times 1}$. How to calculate…
user570593
6 votes, 1 answer

What are the 2nd derivatives of the log multivariate normal density?

I develop open-source statistical software (http://openmx.psyc.virginia.edu/), but matrix calculus is not my strong point. I need the 1st and 2nd derivatives of the log multivariate normal density. I was happy to find the 1st derivatives here on…
6 votes, 2 answers

Resources for matrix calculus for optimization

I'm a grad student trying to absorb from the book Pattern Recognition and Machine Learning. However, I found that I really need a good grasp of matrix calculus before I can derive the formulas myself (since I think in this way learning could be more…
Eddie Xie
5 votes, 2 answers

matrix-calculus - Understanding numerator/denominator layouts

Consider the following machine-learning model: Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, and $m$ is the number of training-examples. While performing reverse-mode differentiation (or back-propagation), I have the…
5 votes, 2 answers

Proof of normal equation in regression using tensor notation

I'm struggling with a proof of the normal equation, so I posted a question which hopefully will get resolved soon. However, I mentioned there that I'm uncomfortable with the proofs dealing with matrix calculus, particularly when it comes to…
Robert Smith
5 votes, 2 answers

Derivative of a quadratic form wrt a parameter in the matrix

I want to compute the derivative of: $\frac{\partial y^T C^{-1}(\theta)y}{\partial \theta_{k}}$, (Note that C is a covariance matrix that depends on a set of parameters $\theta$) for which I used the chain rule: $ \frac{\partial y^T…
Emily W
5 votes, 1 answer

What if do not use any activation function in the neural network?

Or, for example, is it good to use an activation function only for the last layer? As I understand it, if there are no activation functions in a neural network, the feedforward pass is just a simple matrix multiplication, but I don't understand why this is bad.
5 votes, 1 answer

Derivative of $x^T A^Ty$ with respect to $\Sigma$ where $A$ is (an upper triangular matrix and) the Cholesky decomposition of $\Sigma$

I would like to evaluate: $$ \frac{ \partial x^T A^Ty}{\partial \Sigma} $$ where $A$ is the Cholesky decomposition of $\Sigma$ and an upper triangular matrix such that $\Sigma = A^T A$, and $x$ and $y$ are vectors of length equal to the dimension of the…
4 votes, 1 answer

Differentiation step in OLS

In deriving the parameter estimate in OLS, we differentiate the following (in matrix form) $$y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$ The part of the differentiation I don't understand is why $$\beta^T X^T X \beta$$ differentiates to $$2X^T X…
Bill
4 votes, 1 answer

What reparametrization of vector parameters makes the Jeffreys prior correspond to the uniform prior?

What reparametrization of vector of parameters $\theta$ makes the Jeffreys prior $$\sqrt{\det I(\theta)}$$ correspond to the uniform prior? A change of parametrization from $\theta$ to $\eta$ changes the Fisher information as follows (I…
Neil G
4 votes, 1 answer

Gaussian process regression - Matérn kernel gradient issue

I'm trying to use a Matérn 5/2 kernel for GP regression, so my kernel function is $ K(x,x')\triangleq\theta_0(1+\sqrt{5r(x,x')}+5/3r)\exp(-\sqrt{5r}), $ where $r(x,x')\triangleq\sum_{d=1}^D (x_d-x'_d)^2/\theta_d^2$ I want to optimize the marginal…
Mike Adriano
3 votes, 1 answer

Minimize SSE function

Consider a data set in which each target $t_n$ is associated with a weighting factor $r_n > 0$, so that the sum-of-squares error function becomes $$SE(w)= \frac{1}{2} \sum_{n=1}^N r_n \left(\mathbf{w}^T \phi(x_n) - t_n\right)^2.$$ Find an expression…
3 votes, 1 answer

Integrate out (covariance) matrix in Normal-Wishart distribution

In Gelman's Bayesian Data Analysis Chapter 3.6, he introduces the multivariate normal with unknown mean and variance, with the priors $\Sigma\sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1})$ $\mu\rvert \Sigma \sim N(\mu_0, \Sigma/\kappa_0)$ and the…
bayes
3 votes, 1 answer

Integrating out parameter with improper prior

I got this problem while I was reading the book "Machine Learning: A Probabilistic Perspective" by Kevin Murphy. It is in section 7.6.1 of the book. Assume the likelihood is given by $$ \begin{split} p(\mathbf{y}|\mathbf{X},\mathbf{w},\mu,\sigma^2)…