
In Goodfellow et al.'s *Deep Learning*, the authors write on page 203:

Let $w \in \mathbb{R}$ be the input to the graph. We use the same function $f: \mathbb{R} \rightarrow \mathbb{R}$ as the operation that we apply at every step of a chain: $x = f(w)$, $y = f(x)$, $z = f(y)$. To compute $\frac{\partial z}{\partial w}$, we apply [the chain rule of calculus] and obtain:

\begin{align}
&\frac{\partial z}{\partial w} \tag{6.50}\\
= &\frac{\partial z}{\partial y}\frac{\partial y}{\partial x}\frac{\partial x}{\partial w} \tag{6.51}\\
= &f'(y)f'(x)f'(w) \tag{6.52}\\
= &f'(f(f(w)))f'(f(w))f'(w) \tag{6.53}
\end{align}

Equation (6.52) suggests an implementation in which we compute the value of $f(w)$ only once and store it in the variable $x$. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation (6.53), where the subexpression $f(w)$ appears more than once. In the alternative approach, $f(w)$ is recomputed each time it is needed.
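To check my reading of the two expressions, here is a small sketch in code (the concrete choice $f(u) = \sin u$ and the function names are mine, purely for illustration):

```python
import math

# Placeholder f, chosen only for illustration (any differentiable f would do).
def f(u):
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

def grad_stored(w):
    """My reading of eq. (6.52): run the forward pass, keep the intermediates
    x = f(w) and y = f(x), and reuse them in the derivative factors."""
    x = f(w)   # computed once, stored
    y = f(x)   # computed once, stored
    return f_prime(y) * f_prime(x) * f_prime(w)

def grad_recomputed(w):
    """My reading of eq. (6.53): substitute the definitions back in, so the
    subexpression f(w) is re-evaluated wherever it appears."""
    return f_prime(f(f(w))) * f_prime(f(w)) * f_prime(w)

w = 0.3
print(grad_stored(w), grad_recomputed(w))  # same number, different amount of work
```

Counting calls for this three-step chain, `grad_stored` evaluates `f` twice while `grad_recomputed` evaluates it three times, so the stored version does strictly less work even here.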

Two questions:

  1. It's unclear to me why equation (6.52) suggests that we compute $f(w)$ once and store it in $x$. I'm not sure what this equation is suggesting at all. Equation (6.53) makes sense to me, but apparently it's not what is used by backprop, so to really understand the algorithm, I would like to understand the importance of (6.52) and how it relates to the mechanics of backprop.
  2. Also, in the alternative approach (equation (6.53)), why would you recompute $f(w)$ each time rather than storing it and reusing the stored value?

Edit: On page 204, they write:

Backpropagation thus avoids the exponential explosion in repeated subexpression evaluations.

as a reason to use backprop.
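To get a feel for the quoted claim, here is a rough count of `f` evaluations for a chain of length $n$ (the counting and names are mine; for a plain chain the wasted work seems to grow only quadratically with depth, and as I understand it the exponential blow-up shows up in graphs where many paths share the same subexpressions):

```python
def f_calls_stored(n):
    """Forward pass with storage: each of the n - 1 intermediate values
    x_1 = f(w), x_2 = f(x_1), ... is computed exactly once and reused
    by the corresponding derivative factor."""
    return n - 1

def f_calls_recomputed(n):
    """Fully expanded expression (eq. (6.53) is the n = 3 case): the k-th
    derivative factor re-derives its argument from w, costing k - 1 calls."""
    return sum(k - 1 for k in range(1, n + 1))  # = n * (n - 1) / 2

for n in (3, 10, 100):
    print(n, f_calls_stored(n), f_calls_recomputed(n))
# 3 2 3
# 10 9 45
# 100 99 4950
```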

  3. I'm also wondering: if you didn't use backprop, how would you fit the neural network?
Vivek Subramanian
  • for 3 - see [Is it possible to train a neural network without backpropagation?](https://stats.stackexchange.com/a/235868/215801) – Oren Milman Sep 27 '18 at 09:40
