Consider the following machine-learning model:
Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, where $m$ is the number of training examples and $\mathbf{L}$ denotes the $(1, m)$ row vector of per-example losses.
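For concreteness, here is a minimal NumPy sketch of the quantities involved; the squared-error loss and the specific numbers are assumptions for illustration only, since the question does not fix a particular model:

```python
import numpy as np

m = 4                                      # number of training examples (assumed)
y = np.array([[0., 1., 1., 0.]])           # labels, shape (1, m)
y_hat = np.array([[0.1, 0.8, 0.6, 0.3]])   # predictions, shape (1, m)

# Per-example loss, laid out as a (1, m) row vector (squared error is an
# assumption here; any per-example loss L(y_hat_i, y_i) would do).
L = 0.5 * (y_hat - y) ** 2                 # shape (1, m)

# Cost J = (1/m) * sum_i L_i, a scalar.
J = L.sum() / m
```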
While performing reverse-mode differentiation (backpropagation), and using the numerator layout throughout, I have the following questions:
- What would be the dimension of the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$?
- Should it be a column vector of dimension $(m, 1)$, because $\mathbf{L}$ is a row vector of dimension $(1, m)$? (Source: here)
- However, this choice causes trouble when computing the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{a}} = \frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}} \frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$: $\frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$ would be an $(m, m)$ matrix, so left-multiplying it by an $(m, 1)$ vector $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$ does not conform (see the shape check after this list).
- But this notation does serve well when computing derivatives of the form $\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}}$, where $\mathbf{X}$ is a matrix of dimension $(m, n)$ and $y = f(\mathbf{X})$ is a scalar-valued function.
- Or should it be a row vector of dimension $(1, m)$, because according to the numerator layout the derivative has the shape $\text{numerator dimension} \times (\text{denominator dimension})^\intercal = (1,1)\times(m,1)$? (Source: here)
- Also, is my understanding of this point even correct?
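To make the shape bookkeeping concrete, here is a hedged NumPy check of the chain rule under the row-vector convention; the squared-error loss and the identification $\mathbf{a} = \hat{y}$ are assumptions made purely for illustration:

```python
import numpy as np

m = 4
y = np.array([[0., 1., 1., 0.]])        # labels, shape (1, m)
a = np.array([[0.1, 0.8, 0.6, 0.3]])    # activations, shape (1, m); a = y_hat (assumed)

# Per-example loss L_i = 0.5 * (a_i - y_i)^2, laid out as a (1, m) row vector.
L = 0.5 * (a - y) ** 2

# Row-vector convention: J = (1/m) * sum_i L_i, so each partial dJ/dL_i = 1/m.
dJ_dL = np.full((1, m), 1.0 / m)        # shape (1, m)

# dL/da is (m, m): L_i depends only on a_i, so it is diagonal,
# with entries dL_i/da_i = (a_i - y_i).
dL_da = np.diag((a - y).ravel())        # shape (m, m)

# Chain rule conforms: (1, m) @ (m, m) -> (1, m).
dJ_da = dJ_dL @ dL_da
assert dJ_da.shape == (1, m)

# With the column convention, dJ/dL would be (m, 1), and
# (m, 1) @ (m, m) raises a ValueError -- the mismatch described above.

# Sanity check against the known gradient dJ/da = (1/m) * (a - y):
assert np.allclose(dJ_da, (a - y) / m)
```

Under the row-vector convention every factor in the chain $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}} \frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$ multiplies from the left without any transposes, which is exactly the composability the numerator layout is meant to provide.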
PS: Also, is there any definitive guide from which I can learn matrix calculus from first principles? The following sources are good, but they still leave a lot of gaps: