
In Bishop's *Pattern Recognition and Machine Learning* I read the following, just after the probability density was introduced via $p(x\in(a,b))=\int_a^b p(x)\,\textrm{d}x$:

> Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables $x = g(y)$, then a function $f(x)$ becomes $\tilde{f}(y) = f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$, where the suffices denote the fact that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in the range $(x, x + \delta x)$ will, for small values of $\delta x$, be transformed into the range $(y, y + \delta y)$, where $p_x(x)\,\delta x \simeq p_y(y)\,\delta y$, and hence $p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| = p_x(g(y))\,|g'(y)|$.

What is the Jacobian factor, and what exactly does everything mean (maybe qualitatively)? Bishop says that a consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable. What does this mean?

To me this all comes a bit out of the blue (considering it's in the introduction chapter). I'd appreciate some hints, thanks!

Akimiya
    ["Intuitive explanation for the density of a transformed variable"](http://stats.stackexchange.com/questions/14483) might be helpful. Concerning "Jacobian," please [search our site](http://stats.stackexchange.com/search?q=Jacobian). – whuber Sep 26 '16 at 14:42
For a great description of the Jacobian factor, see [Khan Academy's video tutorial on the Jacobian determinant](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/jacobian/v/the-jacobian-determinant). – JStrahl Jul 04 '17 at 12:14

1 Answer


I suggest reading the solution to Question 1.4, which provides good intuition.

In a nutshell, if you have an arbitrary function $f(x)$ and two variables $x$ and $y$ that are related by $x = g(y)$, then you can find the maximum of the function either by directly analyzing $f(x)$: $\hat{x} = \arg\max_x f(x)$, or via the transformed function $f(g(y))$: $\hat{y} = \arg\max_y f(g(y))$. Not surprisingly, $\hat{x}$ and $\hat{y}$ will be related to each other as $\hat{x} = g(\hat{y})$ (here I assume that $g^\prime(y) \neq 0$ for all $y$).
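Here is a minimal numerical sketch of that claim. The choices $f(x) = e^{-(x-1)^2}$ and $g(y) = \ln y$ are my own illustrative assumptions, not from the book:

```python
import numpy as np

# Ordinary function: f(x) = exp(-(x - 1)^2), maximized at x = 1.
f = lambda x: np.exp(-(x - 1.0) ** 2)

# Change of variables x = g(y) = ln(y), so the transformed
# function is f(g(y)); note g'(y) = 1/y != 0 for y > 0.
g = np.log

x = np.linspace(-5.0, 5.0, 200001)
y = np.linspace(0.1, 20.0, 200001)

x_hat = x[np.argmax(f(x))]       # arg max of f(x)
y_hat = y[np.argmax(f(g(y)))]    # arg max of f(g(y)), lands at y = e

print(x_hat)      # ~1.0
print(g(y_hat))   # ~1.0  ->  x_hat = g(y_hat), as claimed
```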

This is not the case for probability densities. If you have a density $p_x(x)$ and two random variables related to each other by $x = g(y)$, then in general there is no such direct relation between $\hat{x} = \arg\max_x p_x(x)$ and $\hat{y} = \arg\max_y p_y(y)$. This happens because of the Jacobian factor $|g^\prime(y)|$ in $p_y(y) = p_x(g(y))\,|g^\prime(y)|$, a factor that measures how much the volume is locally stretched or compressed by the function $g(\cdot)$: maximizing $p_y$ trades the transformed density off against this factor, so the mode can shift away from $g^{-1}(\hat{x})$.
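To make this concrete, here is a small sketch (again my own example, not from the answer): take $x \sim \mathcal{N}(1, 1)$, whose mode is $\hat{x} = 1$, and $x = g(y) = \ln y$, so $y = e^x$ is lognormal. The lognormal mode is $e^{\mu - \sigma^2} = e^0 = 1$, not $e^{\hat{x}} = e$:

```python
import numpy as np

# Density of x: Gaussian with mean 1 and sigma = 1, so x_hat = 1.
p_x = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

# Change of variables x = g(y) = ln(y), hence |dx/dy| = 1/y and
# p_y(y) = p_x(ln y) * (1/y)   <- the Jacobian factor.
y = np.linspace(1e-3, 20.0, 200001)
p_y = p_x(np.log(y)) / y

y_hat = y[np.argmax(p_y)]
print(y_hat)           # ~1.0, the lognormal mode e^{mu - sigma^2}
print(np.log(y_hat))   # ~0.0, NOT equal to x_hat = 1
```

Without the `1/y` factor the maximum of `p_x(np.log(y))` would sit at $y = e$, exactly mirroring the function case above; the Jacobian factor is what drags the mode elsewhere.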

MajidL