In this post, you can read the statement:
Models are usually represented by points $\theta$ on a finite dimensional manifold.
In Differential Geometry and Statistics by Michael K. Murray and John W. Rice, these concepts are explained in prose that is readable even if the mathematical expressions are ignored. Unfortunately, there are very few illustrations. The same goes for this post on MathOverflow.
I want to ask for help with a visual representation to serve as a map or motivation towards a more formal understanding of the topic.
What are the points on the manifold? This quote from this online find seemingly indicates that they can be either the data points or the distribution parameters:
Statistics on manifolds and information geometry are two different ways in which differential geometry meets statistics. While in statistics on manifolds, it is the data that lie on a manifold, in information geometry the data are in $\mathbb R^n$, but the parameterized family of probability density functions of interest is treated as a manifold. Such manifolds are known as statistical manifolds.
I have drawn this diagram, inspired by this explanation of the tangent space here:
[Edit to reflect the comment below about $C^\infty$:] On a manifold $\mathcal M$, the tangent space at a point $p\in \mathcal M$ is the set of all possible derivatives ("velocities") at $p$ associated with every possible curve $\psi: \mathbb R \to \mathcal M$ on the manifold running through $p.$ Each such velocity can be seen as a map $C^\infty (\mathcal M)\to \mathbb R$ sending a test function $f$ to the derivative of the composition, $\left(f \circ \psi \right)'(t),$ where $\psi$ denotes a curve (a function from the real line to the manifold $\mathcal M$) running through the point $p,$ depicted in red on the diagram above, and $f$ is a smooth test function. The white "iso-$f$" contour lines, which surround the point $p,$ all map to the same point on the real line.
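To make the composition $\left(f \circ \psi \right)'(t)$ concrete, here is a minimal numerical sketch. The specific choices (the unit sphere as $\mathcal M$, a great circle as $\psi$, and a coordinate function as $f$) are my own illustrative assumptions, not anything prescribed by the references above:

```python
import numpy as np

# Illustrative assumptions (mine, not from the references): the manifold M is
# the unit sphere S^2, the curve psi is a great circle through p = psi(0),
# and the test function f: M -> R is the y-coordinate restricted to M.

def psi(t):
    """A curve psi: R -> M running through p = psi(0) = (1, 0, 0)."""
    return np.array([np.cos(t), np.sin(t), 0.0])

def f(x):
    """A smooth test function f: M -> R."""
    return x[1]

def tangent_action(t, h=1e-6):
    """Central-difference approximation of (f o psi)'(t)."""
    return (f(psi(t + h)) - f(psi(t - h))) / (2.0 * h)

# The tangent vector defined by psi at p sends f to (f o psi)'(0).
# Analytically (f o psi)(t) = sin(t), so the value at t = 0 is cos(0) = 1.
print(tangent_action(0.0))  # ~1.0
```

Different curves through $p$ with the same velocity give the same map $f \mapsto \left(f \circ \psi \right)'(0)$, which is exactly the equivalence that lets one identify tangent vectors with these maps.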
The equivalence (or one of the equivalences as applied to statistics) is discussed here and would relate to the following quote:
If the parameter space for an exponential family contains an $s$ dimensional open set, then it is called full rank.
An exponential family that is not full rank is generally called a curved exponential family, as typically the parameter space is a curve in $\mathbb R^s$ of dimension less than $s.$
This seems to suggest the following interpretation of the plot: the distributional parameters (in this case, of a family of exponential distributions) lie on the manifold. The data points in $\mathbb R$ would map to a curve on the manifold through the function $\psi: \mathbb R \to \mathcal M$ in the case of a rank-deficient, non-linear optimization problem. This would parallel the calculation of velocity in physics: taking the derivative of $f$ along the curve as it crosses the "iso-$f$" contour lines (the directional derivative, in orange): $\left(f \circ \psi \right)'(t).$ The function $f: \mathcal M \to \mathbb R$ would play the role of optimizing the selection of a distributional parameter as the curve $\psi$ travels along the contour lines of $f$ on the manifold.
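As a concrete instance of the "curved" case (a standard textbook example, added by me rather than taken from the quote): the normal family

$$\mathcal F_{\text{full}} = \{\, N(\mu, \sigma^2) : \mu \in \mathbb R,\ \sigma^2 > 0 \,\}$$

is full rank with $s = 2$, while constraining the variance to equal the squared mean,

$$\mathcal F_{\text{curved}} = \{\, N(t, t^2) : t > 0 \,\}, \qquad \psi(t) = (t, t^2),$$

leaves a one-dimensional curve $\psi$ inside the two-dimensional parameter space, i.e. a parameter space of dimension less than $s$, exactly the role the curve $\psi$ plays in the plot.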
ADDED BACKGROUND:
Of note, I believe these concepts are not immediately related to non-linear dimensionality reduction in ML; they appear more akin to information geometry. Here is a quote:
Importantly, statistics on manifolds is very different from manifold learning. The latter is a branch of machine learning where the goal is to learn a latent manifold from $\mathbb R^n$-valued data. Typically, the dimension of the sought-after latent manifold is less than $n$. The latent manifold may be linear or nonlinear, depending on the particular method used.
The following information is from Statistics on Manifolds with Applications to Modeling Shape Deformations by Oren Freifeld:
While $M$ is usually nonlinear, we can associate a tangent space, denoted by $T_pM$, to every point $p \in M$. $T_pM$ is a vector space whose dimension is the same as that of $M$. The origin of $T_pM$ is at $p$. If $M$ is embedded in some Euclidean space, we may think of $T_pM$ as an affine subspace such that: 1) it touches $M$ at $p$; 2) at least locally, $M$ lies completely on one side of it. Elements of $T_pM$ are called tangent vectors.
[...] On manifolds, statistical models are often expressed in tangent spaces.
[...]
[We consider two] datasets consisting of points in $M$:
$D_L = \{p_1, \cdots , p_{N_L}\} \subset M$;
$D_S = \{q_1, \cdots , q_{N_S}\} \subset M$
Let $\mu_L$ and $\mu_S$ represent two, possibly unknown, points in $M$. It is assumed that the two datasets satisfy the following statistical rules:
$\{\log_{\mu_L} (p_1), \cdots , \log_{\mu_L}(p_{N_L})\} \subset T_{\mu_L}M, \quad \log_{\mu_L}(p_i) \overset{\text{i.i.d.}}{\sim} \mathscr N(0, \Sigma_L)$

$\{\log_{\mu_S} (q_1), \cdots , \log_{\mu_S}(q_{N_S})\} \subset T_{\mu_S}M, \quad \log_{\mu_S}(q_i) \overset{\text{i.i.d.}}{\sim} \mathscr N(0, \Sigma_S)$
[...]
In other words, when $D_L$ is expressed (as tangent vectors) in the tangent space (to $M$) at $\mu_L$, it can be seen as a set of i.i.d. samples from a zero-mean Gaussian with covariance $\Sigma_L$. Likewise, when $D_S$ is expressed in the tangent space at $\mu_S$ it can be seen as a set of i.i.d. samples from a zero-mean Gaussian with covariance $\Sigma_S$. This generalizes the Euclidean case.
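Here is a minimal numerical sketch of that generalization, assuming $M$ is the unit sphere $S^2$ with $\mu$ at the north pole (my own concrete choices, not Freifeld's): draw i.i.d. zero-mean Gaussian tangent vectors in $T_\mu M$, push them onto the manifold with the exponential map, and check that the log map recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed concrete setting (mine, not Freifeld's): M = unit sphere S^2,
# mu is the north pole, so the tangent space T_mu M is the xy-plane.
mu = np.array([0.0, 0.0, 1.0])

def exp_map(mu, v):
    """Exponential map on S^2: tangent vector v at mu -> point on the sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return mu
    return np.cos(norm_v) * mu + np.sin(norm_v) * (v / norm_v)

def log_map(mu, p):
    """Log map on S^2: point p -> tangent vector at mu (inverse of exp_map)."""
    w = p - np.dot(mu, p) * mu                       # component orthogonal to mu
    theta = np.arccos(np.clip(np.dot(mu, p), -1.0, 1.0))  # geodesic distance
    norm_w = np.linalg.norm(w)
    if norm_w < 1e-12:
        return np.zeros(3)
    return theta * (w / norm_w)

# Draw i.i.d. zero-mean Gaussian tangent vectors in T_mu M (the xy-plane)...
Sigma = np.diag([0.1, 0.05])
v_2d = rng.multivariate_normal(np.zeros(2), Sigma, size=5)
V = np.hstack([v_2d, np.zeros((5, 1))])              # embed them in R^3

# ...push them onto the manifold, then express the dataset back in T_mu M.
D = np.array([exp_map(mu, v) for v in V])            # dataset of points on M
V_back = np.array([log_map(mu, p) for p in D])       # tangent-vector samples
print(np.allclose(V, V_back))                        # True: log_mu(p_i) ~ N(0, Sigma)
```

In the Euclidean special case, $\log_\mu(p)$ reduces to $p - \mu$, which is why the quote calls this a generalization of the usual zero-mean Gaussian model.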
In the same reference, I find the closest (and practically the only) online example of the graphical concept I am asking about:
Would this indicate that the data lie on the surface of the manifold and are expressed as tangent vectors, while the parameters would be mapped onto a Cartesian plane?