Consider the following machine-learning model:
Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, where $m$ is the number of training examples and $\mathbf{L}$ denotes the $(1, m)$ row vector of per-example losses.
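For concreteness, here is a minimal NumPy sketch of the quantities involved; the squared-error loss and the specific numbers are assumptions for illustration only, since the question does not fix a particular model:

```python
import numpy as np

m = 4                                      # number of training examples (assumed)
y = np.array([[0., 1., 1., 0.]])           # labels, shape (1, m)
y_hat = np.array([[0.1, 0.8, 0.6, 0.3]])   # predictions, shape (1, m)

# Per-example loss, laid out as a (1, m) row vector (squared error is an
# assumption here; any per-example loss L(y_hat_i, y_i) would do).
L = 0.5 * (y_hat - y) ** 2                 # shape (1, m)

# Cost J = (1/m) * sum_i L_i, a scalar.
J = L.sum() / m
```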
While performing reverse-mode differentiation (backpropagation), and using the numerator layout throughout, I have the following questions:
- What would be the dimension of the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$?
- Should it be a column vector of dimension $(m, 1)$, because $\mathbf{L}$ is a row vector of dimension $(1, m)$? (Source: here)
- However, this choice causes trouble when computing the derivative $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{a}} = \frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}} \frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$: $\frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$ would be an $(m, m)$ matrix, so left-multiplying it by an $(m, 1)$ vector $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}}$ does not conform (see the shape check after this list).
- But this notation does serve well when computing derivatives of the form $\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}}$, where $\mathbf{X}$ is a matrix of dimension $(m, n)$ and $y = f(\mathbf{X})$ is a scalar-valued function.
- Or should it be a row vector of dimension $(1, m)$, because according to the numerator layout the derivative has the shape $\text{numerator dimension} \times (\text{denominator dimension})^\intercal = (1,1)\times(m,1)$? (Source: here)
- Also, is my understanding of this point even correct?
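To make the shape bookkeeping concrete, here is a hedged NumPy check of the chain rule under the row-vector convention; the squared-error loss and the identification $\mathbf{a} = \hat{y}$ are assumptions made purely for illustration:

```python
import numpy as np

m = 4
y = np.array([[0., 1., 1., 0.]])        # labels, shape (1, m)
a = np.array([[0.1, 0.8, 0.6, 0.3]])    # activations, shape (1, m); a = y_hat (assumed)

# Per-example loss L_i = 0.5 * (a_i - y_i)^2, laid out as a (1, m) row vector.
L = 0.5 * (a - y) ** 2

# Row-vector convention: J = (1/m) * sum_i L_i, so each partial dJ/dL_i = 1/m.
dJ_dL = np.full((1, m), 1.0 / m)        # shape (1, m)

# dL/da is (m, m): L_i depends only on a_i, so it is diagonal,
# with entries dL_i/da_i = (a_i - y_i).
dL_da = np.diag((a - y).ravel())        # shape (m, m)

# Chain rule conforms: (1, m) @ (m, m) -> (1, m).
dJ_da = dJ_dL @ dL_da
assert dJ_da.shape == (1, m)

# With the column convention, dJ/dL would be (m, 1), and
# (m, 1) @ (m, m) raises a ValueError -- the mismatch described above.

# Sanity check against the known gradient dJ/da = (1/m) * (a - y):
assert np.allclose(dJ_da, (a - y) / m)
```

Under the row-vector convention every factor in the chain $\frac{\mathrm{d} J}{\mathrm{d} \mathbf{L}} \frac{\mathrm{d} \mathbf{L}}{\mathrm{d} \mathbf{a}}$ multiplies from the left without any transposes, which is exactly the composability the numerator layout is meant to provide.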
PS: Also, is there any definitive guide from which I can learn matrix calculus from first principles? The following sources are good, but they still leave a lot of gaps: