3

In this very nice answer, the intuitive explanation of the formula for the density of a transformed random variable, $Y = g(X)$, leads naturally to an expression like

$$f_Y(y) = \frac{f_X(g^{-1}(y))}{g'(g^{-1}(y))},$$ where $f_X(x)$ is the density function of $X$ (and assuming for simplicity that $g(x)$ is monotone increasing).

However, this formula is often presented (without much explanation) as

$$f_Y(y) = f_X(g^{-1}(y)) (g^{-1})'(y) ,$$

which follows from an application of the Inverse Function Theorem. I have seen this pattern in several places: expositions derive the first expression (for example here), but the canonical result is usually stated in terms of the second expression, as in the Wikipedia reference. Some write-ups motivate the result via the former and then explicitly invoke the substitution $$\frac{1}{g'(g^{-1}(y))} = (g^{-1})'(y).$$
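To fix ideas, the two forms can be checked numerically against each other. The sketch below uses a hypothetical concrete choice, $X \sim N(0,1)$ and the monotone increasing transform $g(x) = x^3$, evaluated at a point $y > 0$:

```python
import math

# Standard normal density for X (a hypothetical concrete choice).
def f_X(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Monotone increasing transform g(x) = x^3, restricted to y > 0.
g_inv = lambda y: y ** (1 / 3)
g_prime = lambda x: 3 * x ** 2                   # g'(x)
g_inv_prime = lambda y: (1 / 3) * y ** (-2 / 3)  # (g^{-1})'(y)

y = 1.7
intuitive = f_X(g_inv(y)) / g_prime(g_inv(y))  # f_X(g^{-1}(y)) / g'(g^{-1}(y))
standard = f_X(g_inv(y)) * g_inv_prime(y)      # f_X(g^{-1}(y)) (g^{-1})'(y)
assert abs(intuitive - standard) < 1e-12
```

The agreement is exact up to floating point, as the Inverse Function Theorem guarantees; the question is purely about which form to teach.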

Is there anything pedagogically interesting to say about this? Is there a reason to disprefer what seems to be the more "intuitive" expression? Is the more standard version in terms of the derivative of the inverse simply easier for students to remember and calculate with?

R Hahn
  • 181
  • 6
  • 1
    The intuitive explanation I usually give to my students is that$$f_Y(y)\,\text dy=f_X(x)\,\text dx$$when $y=g(x)$ leads to$$f_Y(y)=f_X(x)\,\frac{\text dx}{\text dy}$$which is exactly$$f_Y(y)=f_X(g^{-1}(y))\,\frac{\text d}{\text dy}g^{-1}(y)$$ – Xi'an May 22 '21 at 15:42
  • 2
    I have never bothered to memorize formulas like this, finding it easier and more insightful just to remember that we must transform the entire *probability element* rather than just the density. For examples see [this site search for "probability element"](https://stats.stackexchange.com/search?q=%22probability+element%22+score%3A2). – whuber May 22 '21 at 18:54
  • 1
    @whuber Well, I can't argue with that. I suspect the story is simply that it is easy enough to define $f_Y(y) = \frac{d}{dy} F_X(g^{-1}(y))$, and then the "usual formula" follows from the fundamental theorem of calculus and the chain rule, which are two ideas (or at least phrases) students in a calc based probability course have heard of before. But I think your earlier answer does a better job at motivating what the chain rule is accomplishing specifically in the probability context. – R Hahn May 22 '21 at 19:28

2 Answers

4

It seems that the heuristic described by @whuber in their answer to the linked problem can be modified slightly to yield the change-of-variables formula for the density in its more familiar form. Consider a finite-sum approximation to the probability elements; the "conservation of mass" requirement stipulates that $$h_X(x_j) \Delta_X(x_j) = h_Y(y_j) \Delta_Y(y_j).$$ Here $h_X(x_j)$ is the height and $\Delta_X(x_j)$ is the width of the interval centered at $x_j$ (and similarly for $Y$).

Suppose that $h_X(x)$ is known and $y = g(x)$ for a monotone continuous function $g(\cdot)$. The goal is to solve for $h_Y(y)$ in terms of $g(\cdot)$ and $h_X(\cdot)$. To do so, we will fix either $\Delta_X(x_j)$ or $\Delta_Y(y_j)$ to be some constant $\Delta$ for all values of its argument. Then we will solve for $h_Y(y)$ and take a limit as $\Delta \rightarrow 0$. Which of $\Delta_X(x_j)$ or $\Delta_Y(y_j)$ is set to the constant determines which of the two forms of the formula is arrived at.
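Before taking limits, the conservation-of-mass requirement itself can be sanity-checked numerically on one small interval. The sketch below uses a hypothetical concrete case, $X \sim N(0,1)$ with $g = \exp$, so that $h_Y$ is the (known) lognormal density:

```python
import math

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # h_X: N(0,1) density
g = math.exp                           # hypothetical monotone transform
h_Y = lambda y: phi(math.log(y)) / y   # known lognormal density, for comparison

xj, dx = 0.3, 1e-4                     # center and width of one small X-interval
mass_x = phi(xj) * dx                  # h_X(x_j) * Delta_X(x_j)
dy = g(xj + dx / 2) - g(xj - dx / 2)   # width of the image interval, Delta_Y(y_j)
mass_y = h_Y(g(xj)) * dy               # h_Y(y_j) * Delta_Y(y_j)
assert abs(mass_x - mass_y) < 1e-10
```

The two masses agree up to a higher-order error that vanishes as the interval shrinks, which is exactly what the finite-sum approximation relies on.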

Setting $\Delta_Y(y_j) = \Delta$ gives the more common form. $$\begin{aligned} h_Y(y) \Delta &= h_X(x)\left [g^{-1} \left(y + \dfrac{\Delta}{2} \right) - g^{-1} \left(y - \dfrac{\Delta}{2} \right) \right ],\\ h_Y(y) &= h_X(g^{-1}(y))\frac{\left [g^{-1} \left(y + \dfrac{\Delta}{2} \right) - g^{-1} \left(y - \dfrac{\Delta}{2} \right) \right ]}{\Delta},\\ h_Y(y) &\rightarrow h_X(g^{-1}(y)) (g^{-1})'(y). \end{aligned} $$

Setting $\Delta_X(x_j) = \Delta$ gives the other (equivalent) expression. $$\begin{aligned} h_X(x) \Delta &= h_Y(y) \left [g \left(x + \dfrac{\Delta}{2} \right) - g \left(x - \dfrac{\Delta}{2} \right) \right ],\\ h_Y(y) &= h_X(g^{-1}(y)) \frac{ \Delta}{g \left(x + \dfrac{\Delta}{2} \right) - g \left(x - \dfrac{\Delta}{2} \right) },\\ h_Y(y) &\rightarrow \frac{h_X(g^{-1}(y))}{g'(g^{-1}(y))}. \end{aligned} $$
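A quick numerical sketch (with the hypothetical choice $g = \exp$, so $(g^{-1})'(y) = 1/y$) confirms that the two difference quotients above converge to the same limit:

```python
import math

g, g_inv = math.exp, math.log  # hypothetical monotone transform
y = 2.0
x = g_inv(y)
exact = 1 / y                  # (g^{-1})'(y) = 1/y when g = exp

for delta in (1e-1, 1e-3, 1e-5):
    # Delta_Y fixed: width of the X-interval divided by Delta
    q_y = (g_inv(y + delta / 2) - g_inv(y - delta / 2)) / delta
    # Delta_X fixed: Delta divided by the width of the Y-interval
    q_x = delta / (g(x + delta / 2) - g(x - delta / 2))

# both central quotients approach (g^{-1})'(y) = 1/g'(g^{-1}(y)) = 0.5
assert abs(q_y - exact) < 1e-8 and abs(q_x - exact) < 1e-8
```

Which quotient one writes down determines which of the two algebraic forms falls out, but the limit is the same number.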

Presumably this argument fails when Riemann sums fail and more measure theory is called for, but this line of reasoning satisfies my curiosity well enough. Specifically, the first approach, setting $\Delta_Y(y) = \Delta$ at the outset, inherits the same intuition as explained in @whuber's answer to the other question, but arrives at an expression that will match most other texts (which is desirable to me for pragmatic reasons). Of course, intuition is very personal, so YMMV.

R Hahn
  • 181
  • 6
2

One heuristic way to look at this is to consider the probability density as a scaled probability by considering an "infinitesimally small" region encompassing a point. For any infinitesimally small distances $\Delta_X > 0$ and $\Delta_Y > 0$ you have:

$$\begin{align} \Delta_X \times f_X(x) &= \mathbb{P}(x \leqslant X \leqslant x + \Delta_X) \quad \quad \quad \quad (1) \\[12pt] \Delta_Y \times f_Y(y) &= \mathbb{P}(y \leqslant Y \leqslant y + \Delta_Y) \quad \quad \quad \quad \ (2) \\[12pt] \end{align}$$

Now, suppose we consider a point $y$ where $g^{-1}$ is differentiable. To facilitate our analysis, we will define the infinitesimal quantity $\Delta_X \equiv g^{-1}(y + \Delta_Y) - g^{-1}(y)$. We then have:

$$\begin{align} f_Y(y) &= \frac{\mathbb{P}(y \leqslant Y \leqslant y + \Delta_Y)}{\Delta_Y} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \text{from } (2) \\[6pt] &= \frac{\mathbb{P}(y \leqslant g(X) \leqslant y + \Delta_Y)}{\Delta_Y} \\[6pt] &= \frac{\mathbb{P}(g^{-1}(y) \leqslant X \leqslant g^{-1}(y + \Delta_Y))}{\Delta_Y} \\[6pt] &= f_X(g^{-1}(y)) \times \frac{g^{-1}(y + \Delta_Y) - g^{-1}(y)}{\Delta_Y} \quad \quad \quad \quad \text{from } (1) \\[8pt] &= f_X(g^{-1}(y)) \times \frac{\Delta_X}{\Delta_Y} \\[12pt] &= f_X(g^{-1}(y)) \times (g^{-1})'(y) \\[12pt] \end{align}$$

(The step from the third to the fourth line follows from taking $x = g^{-1}(y)$, so that $g^{-1}(y + \Delta_Y) = g^{-1}(y) + \Delta_X$, and applying equation $(1)$ to express the probability as a scaled density.) Alternatively, letting $\Delta_X$ be the free infinitesimal and defining $\Delta_Y \equiv g(x+\Delta_X) - g(x)$, we have:

$$\begin{align} f_X(x) &= \frac{\mathbb{P}(x \leqslant X \leqslant x + \Delta_X)}{\Delta_X} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \text{from } (1) \\[6pt] &= \frac{\mathbb{P}(g(x) \leqslant g(X) \leqslant g(x + \Delta_X))}{\Delta_X} \\[6pt] &= \frac{\mathbb{P}(g(x) \leqslant Y \leqslant g(x + \Delta_X))}{\Delta_X} \\[6pt] &= \frac{\mathbb{P}(g(x) \leqslant Y \leqslant g(x) + \Delta_Y)}{\Delta_Y} \times \frac{\Delta_Y}{\Delta_X} \\[6pt] &= f_Y(g(x)) \times \frac{\Delta_Y}{\Delta_X} \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \ \ \text{from } (2) \\[8pt] &= f_Y(g(x)) \times g'(x) \\[12pt] \end{align}$$
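The probability-ratio heuristic in the first derivation can be checked numerically. The sketch below uses a hypothetical concrete case, $X \sim N(0,1)$ with $g = \exp$ (so $Y$ is lognormal and $(g^{-1})'(y) = 1/y$), comparing the scaled probability of a small interval against the transformed density:

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

g_inv = math.log  # hypothetical transform g = exp; X ~ N(0,1), Y lognormal
y, dy = 1.5, 1e-6

# P(y <= Y <= y + Delta_Y) / Delta_Y, computed through the
# corresponding region [g^{-1}(y), g^{-1}(y + Delta_Y)] for X
prob_ratio = (Phi(g_inv(y + dy)) - Phi(g_inv(y))) / dy
# f_X(g^{-1}(y)) * (g^{-1})'(y); here (g^{-1})'(y) = 1/y since g^{-1} = log
density = phi(g_inv(y)) / y
assert abs(prob_ratio - density) < 1e-6
```

As $\Delta_Y$ shrinks, the scaled probability converges to the density given by the change-of-variables formula, which is the content of the heuristic.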

Now, this argument can be tightened to give a formal demonstration of the result, but the heuristic version shows how the derivative term arises. It arises from the fact that the region $[y, y+\Delta_Y]$ for the original random variable $Y$ corresponds to the region $[g^{-1}(y), g^{-1}(y + \Delta_Y)]$ for the random variable $X$. The derivative term is just the ratio of the length of the latter region to the length of the former, when $\Delta_Y$ is small.

Ben
  • 91,027
  • 3
  • 150
  • 376
  • +1. I'd be grateful for an explanation of how you get from the third line to the fourth line in your derivation. – COOLSerdash May 23 '21 at 08:37
  • 1
    @COOLSerdash: I have expanded the answer to be more explicit about this step. – Ben May 23 '21 at 11:13
  • Thanks Ben, I appreciate it. – COOLSerdash May 23 '21 at 11:14
  • +1 Thanks, this is a little cleaner than what I came up with. I would just add that defining $$\Delta_Y = g(x + \Delta_X) - g(x) = g(g^{-1}(y) + \Delta_X) - g(g^{-1}(y))$$ gives the alternative expression. – R Hahn May 23 '21 at 13:30
  • @Ben If you add the remark as I mentioned above, I'll accept this answer. I don't want to leave it unanswered and I'd rather not accept my own answer, but I think it is important that an answer to my question covers the relationship between the heuristic/derivation and the two distinct forms of the resulting expression. – R Hahn May 26 '21 at 03:05
  • @RHahn: Okay, I have updated the answer to show that you can go the other way to get the alternative expression. – Ben May 26 '21 at 04:03
  • I'm sorry, I was unclear. I still want to solve for $f_Y(y)$, but using the other free infinitesimal. Compare to my answer to see what I mean? Not trying to make extra work for you. I appreciate your time. – R Hahn May 26 '21 at 05:08
  • 2
    Okay. In any case, I'm comfortable with my answer as it stands. Thanks, Ben. – Ben May 26 '21 at 09:43