Linear projection (or a fully connected layer) is perhaps one of the most common operations in deep learning models. In a linear projection, we project a vector $x$ of dimension $n$ to a vector $y$ of dimension $m$ by multiplying by a projection matrix $W$ of shape $n \times m$. My question is: is it a principle that $n$ should always be larger than or equal to $m$? In other words, it would seem that it does not make much sense to project a vector to a space of even larger dimension. If that is true, is there any theoretical foundation for this kind of 'bottleneck' operation?
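For concreteness, here is a minimal NumPy sketch of what I mean, assuming the row-vector convention $y = xW$; all sizes are arbitrary:

```python
import numpy as np

n, m = 8, 3                        # arbitrary toy dimensions
x = np.random.randn(n)

W_down = np.random.randn(n, m)     # n x m with m < n: the usual "bottleneck"
W_up = np.random.randn(n, 4 * n)   # n x 4n: projection to a larger space

y_down = x @ W_down                # shape (m,)
y_up = x @ W_up                    # shape (4n,): does this ever make sense?
print(y_down.shape, y_up.shape)
```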
1 Answer
It's most common to go from a larger dimension to a smaller dimension, but it's not a rule or a requirement.
One prominent example of a model that projects to a higher dimension is the second layer of word2vec. First, the model projects the input words down to an embedding dimension $k < n$, then projects back up to the original dimension $n$ to compute the loss. See: "Efficient Estimation of Word Representations in Vector Space" by Tomas Mikolov et al.
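To make the shapes concrete, here is a rough NumPy sketch of the two projections (this is only an illustration, not the actual word2vec training code; the vocabulary and embedding sizes are made up):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 128                    # hypothetical n and k

W_in = 0.01 * np.random.randn(vocab_size, embed_dim)   # n x k: project down
W_out = 0.01 * np.random.randn(embed_dim, vocab_size)  # k x n: project back up

word_id = 42                     # index of the input word
h = W_in[word_id]                # same as one_hot(word_id) @ W_in, shape (k,)
logits = h @ W_out               # back up to the vocabulary size, shape (n,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()             # softmax over the vocabulary for the loss
print(h.shape, probs.shape)      # (128,) (10000,)
```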
Another example is overcomplete representations. The basic idea is to use a basis that is larger than the dimensionality of the input. A good resource is Michael S. Lewicki & Terrence J. Sejnowski, "Learning Overcomplete Representations," Neural Computation 2000; 12 (2): 337–365. From the abstract:
In an overcomplete basis, the number of basis vectors is greater than the dimensionality of the input, and the representation of an input is not a unique combination of basis vectors. Overcomplete representations have been advocated because they have greater robustness in the presence of noise, can be sparser, and can have greater flexibility in matching structure in the data. Overcomplete codes have also been proposed as a model of some of the response properties of neurons in primary visual cortex. Previous work has focused on finding the best representation of a signal using a fixed overcomplete basis (or dictionary). We present an algorithm for learning an overcomplete basis by viewing it as a probabilistic model of the observed data. We show that overcomplete bases can yield a better approximation of the underlying statistical distribution of the data and can thus lead to greater coding efficiency. This can be viewed as a generalization of the technique of independent component analysis and provides a method for Bayesian reconstruction of signals in the presence of noise and for blind source separation when there are more sources than mixtures.
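As a toy NumPy illustration of that non-uniqueness (not the learning algorithm from the paper, just a fixed random dictionary with more basis vectors than input dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 8                          # m > n: overcomplete
D = rng.standard_normal((n, m))      # columns of D are the basis vectors
x = rng.standard_normal(n)

a_min = np.linalg.pinv(D) @ x        # minimum-norm coefficients with D @ a_min == x

# Since rank(D) = n < m, D has a nontrivial null space; adding any null-space
# vector to the coefficients gives a different representation of the same x.
_, _, Vt = np.linalg.svd(D)
z = Vt[-1]                           # a direction with D @ z == 0
a_alt = a_min + 3.0 * z

print(np.allclose(D @ a_min, x), np.allclose(D @ a_alt, x))   # True True
```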

Autoencoders also project down and back up again -- usually not linearly, but the linear version is still a sensible denoising operation by principal components. (Hinton talks about it here: https://www.youtube.com/watch?v=PSOt7u8u23w) – Thomas Lumley Mar 01 '22 at 07:07
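A small NumPy sketch of that linear case, using the top principal components as the optimal linear encoder/decoder (the data, noise level, and bottleneck width are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 500, 20, 3          # k is the bottleneck width

# Low-rank "signal" plus isotropic noise (a synthetic example)
clean = rng.standard_normal((n_samples, k)) @ rng.standard_normal((k, n_features))
noisy = clean + 0.3 * rng.standard_normal((n_samples, n_features))

# Linear autoencoder at its optimum: project onto the top-k principal
# components and back up again.
mean = noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(noisy - mean, full_matrices=False)
V_k = Vt[:k].T                                  # n_features x k weights
denoised = (noisy - mean) @ V_k @ V_k.T + mean  # down to k, back up to n_features

print(np.mean((noisy - clean) ** 2))            # error before denoising
print(np.mean((denoised - clean) ** 2))         # typically smaller after denoising
```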