
In ML the term "embedding" gets tossed around a lot, and it basically seems to mean the construction of a function that takes a high-dimensional vector to a low-dimensional vector in such a way that the high-dimensional vector can be recovered from the low-dimensional one. (...at least in the case of autoencoders; I'm not sure whether word embeddings care about reconstructing the word.)

Embedding was also a word that I encountered in topology and differential geometry. https://en.wikipedia.org/wiki/Embedding

Does any correspondence exist between how the term is used in math and how it is used in ML?

1 Answer


In pure mathematics an embedding is any function $f\colon X \to Y$ that is injective and structure-preserving. What do these terms mean?

  • Injective Different elements of $X$ are always mapped to different elements of $Y$. Formally: For every $x_1,x_2 \in X$ with $x_1 \neq x_2$, $f(x_1) \neq f(x_2)$.

  • Structure-preserving This depends on the context, but generally means that if some property holds for $x_1, x_2, \ldots, x_n$, then the same property holds for $f(x_1), f(x_2), \ldots, f(x_n)$. For example, if $X$ and $Y$ are equipped with multiplication, then an embedding $f$ would preserve it: For every $x_1,x_2 \in X$, $f(x_1 \cdot x_2) = f(x_1) \cdot f(x_2)$. (A small code sketch after this list illustrates both conditions.)

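To make these two conditions concrete, here is a small toy sketch in Python (an illustrative example of my own choosing): the map $f(x) = 2^x$ embeds $(\mathbb{N}, +)$ into $(\mathbb{N}, \cdot)$, because it is injective and turns addition into multiplication.

```python
# Toy check that f(x) = 2**x embeds (N, +) into (N, *):
# it is injective and maps addition to multiplication.
def f(x: int) -> int:
    return 2 ** x

sample = range(20)

# Injectivity (on the sample): distinct inputs give distinct outputs.
assert len({f(x) for x in sample}) == len(sample)

# Structure preservation: f(x1 + x2) == f(x1) * f(x2).
assert all(f(x1 + x2) == f(x1) * f(x2) for x1 in sample for x2 in sample)

print("f(x) = 2**x is injective and structure-preserving on the sample")
```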
The machine-learning use of the term embedding is similar to this. Here we are concerned with (finite) subsets $X \subset \mathbb{R}^n$ and functions $f\colon \mathbb{R}^n \to \mathbb{R}^m$ such that $f(X)$ has approximately the same structure as that of $X$. (Here $f(X)$ denotes the image of $X$ under $f$. Formally: $f(X) = \{f(x) : x \in X \}$.) Like the mathematical definition, exactly what this means depends on the context. Two examples:

  • In the case of t-SNE, a set $X$ of high-dimensional vectors in $\mathbb{R}^n$ ($n>3$) is embedded into a low-dimensional space $\mathbb{R}^m$ (typically $m = 2$ or $m = 3$) in such a way that if $x_1$ and $x_2$ are neighbours in $\mathbb{R}^n$, then their images under the embedding are also neighbours in $\mathbb{R}^m$. The embedding is calculated by means of probability densities; full details can be found in the original paper by van der Maaten and Hinton. (A minimal scikit-learn sketch follows this list.)

  • The idea behind autoencoders is similar: We use an artificial neural network to find an $m$-dimensional approximation of a set $X$ of $n$-dimensional vectors. This again amounts to finding an embedding $f\colon X \to \mathbb{R}^m$ such that if $x_1$ and $x_2$ are neighbours in $\mathbb{R}^n$, then their images $f(x_1)$ and $f(x_2)$ are neighbours in $\mathbb{R}^m$. The embedding $f$ is calculated in a very different way from t-SNE; in fact, an autoencoder finds the embedding $f$ by simultaneously finding a second map $g\colon f(X) \to \mathbb{R}^n$ such that the composed map $g\circ f\colon X \to \mathbb{R}^n$ preserves the identity of each $x \in X$ as closely as possible, i.e. the distance between $x$ and $g\circ f(x)$ should be as small as possible. Another difference from t-SNE is that the embedding produced by the autoencoder should generalise to points outside the training set $X$. (A minimal autoencoder sketch also follows this list.)
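For the t-SNE case, a minimal sketch using scikit-learn's `TSNE` might look as follows; the data and parameter values are arbitrary toy choices, not recommendations.

```python
# Minimal t-SNE sketch: embed a finite set X in R^50 into R^2
# so that neighbours in R^50 tend to remain neighbours in R^2.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                 # 500 toy points in R^50

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)             # the image f(X), shape (500, 2)

print(X_embedded.shape)
```

Note that `TSNE` only offers `fit_transform` and no separate `transform` for new points, which matches the remark above that, unlike an autoencoder, t-SNE does not produce a map that generalises beyond the set it was computed on.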

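A correspondingly minimal autoencoder sketch, here in PyTorch with an arbitrary toy architecture and training loop: the encoder plays the role of $f\colon \mathbb{R}^n \to \mathbb{R}^m$, the decoder the role of $g\colon f(X) \to \mathbb{R}^n$, and training minimises the distance between $x$ and $g\circ f(x)$.

```python
# Minimal autoencoder sketch: encoder f: R^50 -> R^2, decoder g: R^2 -> R^50,
# trained so that g(f(x)) stays close to x on the training set.
import torch
import torch.nn as nn

n, m = 50, 2
encoder = nn.Sequential(nn.Linear(n, 16), nn.ReLU(), nn.Linear(16, m))   # f
decoder = nn.Sequential(nn.Linear(m, 16), nn.ReLU(), nn.Linear(16, n))   # g

X = torch.randn(500, n)                        # toy training set in R^n
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    reconstruction = decoder(encoder(X))       # g(f(X))
    loss = loss_fn(reconstruction, X)          # how far g∘f is from the identity
    loss.backward()
    optimizer.step()

X_embedded = encoder(X).detach()               # the learned embedding f(X), shape (500, 2)
# Unlike t-SNE, encoder(x_new) is also defined for points outside the training set.
```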
dwolfeu
  • +1 This is a good answer. Just want to mention that embeddings in machine learning aren't restricted to mappings from $\mathbb{R}^n$ to $\mathbb{R}^m$. For example, one can embed objects like graphs, strings, etc. Also, we aren't necessarily concerned with finite sets. For example, in the autoencoder case you mentioned, we certainly train on a finite set, but we'd like the learned mapping to generalize well over the underlying distribution, which is often defined on a continuous space. – user20160 Jun 20 '20 at 20:05