TL;DR: Subspaces are low-dimensional, linear portions of the whole signal space that are expected to contain (or lie close to) most of the observable, useful signals, or transformations thereof. They come with additional tools that allow us to compute interesting things on the data.
We are given a set of data. To manipulate it more easily, it is common to embed or represent it in a well-adapted mathematical structure (chosen from the many structures available in algebra or geometry), so as to perform operations, prove things, develop algorithms, etc. For instance, in channel coding, group or ring structures can be better adapted. In a domain called mathematical morphology, one uses lattices.
Here, for standard signals or images, we often assume a linear structure: signals can be weighted and added, as in $\alpha x + \beta y$. This is the basis for linear systems, such as traditional windowing, filtering (convolution), differentiation, etc.
So, a mathematical structure of choice is the vector space: a vector space equipped with tools, namely a dot product (which can be used to compare data) and a norm (to measure distances). These tools help us compute. Indeed, energy minimization and linearity are strongly related.
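As a minimal sketch (with made-up toy signals, nothing taken from the text above), the dot product and the norm already give us concrete ways to compare data:

```python
import numpy as np

# Two toy signals (purely illustrative values).
x = np.array([1.0, 2.0, 0.5, -1.0])
y = np.array([0.9, 2.1, 0.4, -1.2])

dot = np.dot(x, y)                        # inner product: a similarity measure
dist = np.linalg.norm(x - y)              # norm of the difference: a distance
cos_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))  # normalized comparison

print(f"inner product = {dot:.3f}, distance = {dist:.3f}, cosine = {cos_sim:.3f}")
```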
Then, a signal of $N$ samples naturally lives in the classical linear space of dimension $N$. This space is quite big (think of million-pixel images). It contains an awful lot of other, "uninteresting" data: any $N$-dimensional "random" vector. Most of these have never been and will never be observed, have no meaning, etc.
The reasonable quantity of signals that you can record, up to variations, is very small relative to this big space.
Moreover, we are often interested in structured information. So if you subtract noise effects and unimportant variations, the proportion of useful signals within the whole potential signal space is vanishingly small.
One very useful hypothesis (a heuristic, to help discovery) is that these interesting signals live close together, or at least along regions of the space that "make sense". An example: suppose that some extraterrestrial intelligence has no other detection system than a very precise dog detector. Across the Solar system, they will record almost nothing, except many points located on something vaguely resembling a sphere, with large empty spaces (oceans) and some very concentrated clusters (urban areas). And the point cloud revolves around a center with a constant period, while rotating on itself. These aliens have discovered something!
Anyway, the partial-sphere-looking point cloud is interpretable... maybe a planet?
So, our dog point cloud could have filled the full 3D space, but it is concentrated on a 2D surface (a lower dimension) that seems relatively regular in altitude and smooth: most dogs live at intermediate altitudes.
These smooth, low-dimensional parts of space are sometimes called smooth manifolds or varieties. Their structure and operators allow us to compute things: for instance, distances, distributions, etc. Inter-dog distances make more sense when computed along the Earth's surface (in spherical 2D coordinates) than straight through the planet with the standard 3D norm! But this can still be complicated to deal with. Let us simplify a bit more.
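As an illustrative sketch (the coordinates below are approximate and purely hypothetical), here is the difference between the geodesic distance along the sphere and the straight-line 3D distance through it:

```python
import numpy as np

R = 6371.0  # mean Earth radius in km

def to_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude (degrees) to 3D Cartesian coordinates on the sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return R * np.array([np.cos(lat) * np.cos(lon),
                         np.cos(lat) * np.sin(lon),
                         np.sin(lat)])

dog_a = to_xyz(48.85, 2.35)     # a dog in Paris (approximate coordinates)
dog_b = to_xyz(-33.87, 151.21)  # a dog in Sydney (approximate coordinates)

chord = np.linalg.norm(dog_a - dog_b)  # straight line through the planet (3D norm)
angle = np.arccos(np.clip(np.dot(dog_a, dog_b) / R**2, -1.0, 1.0))
great_circle = R * angle               # distance along the surface (geodesic)

print(f"through the planet: {chord:.0f} km, along the surface: {great_circle:.0f} km")
```

The geodesic distance, computed on the 2D manifold, is the one that "makes sense" for dogs walking on the surface.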
Looking a little closer, the dog points almost lie on close-to-flat surfaces: countries, even continents. Those flat surfaces are portions of linear (or affine) subspaces. There, you can compute inter-dog distances even more easily, and design a dog-matching algorithm that will make you rich.
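A minimal sketch of this idea on synthetic points (the data and the rank-2 assumption are made up for illustration): fit the flat patch with an SVD of the centered cloud, then work directly in its low-dimensional coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
basis_true = rng.standard_normal((3, 2))                       # a hypothetical 2D patch in 3D
points = rng.standard_normal((200, 2)) @ basis_true.T + 5.0    # flat cloud, affine offset
points += 0.01 * rng.standard_normal(points.shape)             # small "thickness"

mean = points.mean(axis=0)                  # affine offset of the patch
_, _, vt = np.linalg.svd(points - mean, full_matrices=False)
basis = vt[:2]                              # orthonormal basis of the fitted plane

coords = (points - mean) @ basis.T          # 2D coordinates within the subspace
d_flat = np.linalg.norm(coords[0] - coords[1])   # distance computed in 2D
d_full = np.linalg.norm(points[0] - points[1])   # distance in the ambient 3D space
print(f"in-plane distance {d_flat:.3f} vs ambient distance {d_full:.3f}")
```

Because the cloud is nearly flat, the two distances almost coincide, but the subspace coordinates are smaller, cheaper, and easier to feed to further algorithms.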
The story continues a bit. Sometimes, natural data does not assemble around a clear structure directly. Unveiling this inherent structure is at the core of DSP. To help us in this direction, we can resort to data transformations that concentrate it better (Fourier, time-frequency, wavelets), or to filtering.
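A small illustration of this concentration effect (the signal below is made up; any smooth, structured signal behaves similarly):

```python
import numpy as np

N = 1024
t = np.arange(N) / N
# A structured signal: energy is spread over all N time samples...
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# ...but the Fourier transform concentrates it on a handful of coefficients.
spectrum = np.fft.rfft(signal)
energy = np.abs(spectrum) ** 2
top = np.sort(energy)[::-1]

print(f"fraction of energy in the 4 largest coefficients: {top[:4].sum() / energy.sum():.4f}")
```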
And if we find a suitable subspace, most algorithms become simpler, more tractable, and so on: adaptive filtering, denoising, matching.
[ADDITION] A typical use is the following: a signal can be better concentrated with a well-chosen orthogonal transform. Meanwhile, zero-mean Gaussian noise remains Gaussian under an orthogonal transformation. Typically, the data covariance matrix can be diagonalized. If you sort its eigenvalues in decreasing order, the smallest ones tend to flatten out (they correspond to the noise), while the largest ones more or less correspond to the signal. Hence, by thresholding the eigenvalues, it becomes possible to remove the noise.
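Here is a minimal sketch of that recipe on synthetic data (the threshold rule below is an arbitrary choice for illustration, not a prescribed one):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, rank = 64, 500, 3                              # dimension, realizations, signal rank
basis = np.linalg.qr(rng.standard_normal((N, rank)))[0]   # a low-dimensional signal subspace
clean = rng.standard_normal((M, rank)) @ basis.T           # clean signals living in it
noisy = clean + 0.3 * rng.standard_normal((M, N))          # add zero-mean Gaussian noise

cov = np.cov(noisy, rowvar=False)                    # sample covariance (N x N)
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                    # re-sort in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 2 * np.median(eigvals)              # crude threshold above the noise floor
P = eigvecs[:, keep]                                 # basis of the retained signal subspace
denoised = noisy @ P @ P.T                           # project the data onto that subspace

err_before = np.linalg.norm(noisy - clean) / np.linalg.norm(clean)
err_after = np.linalg.norm(denoised - clean) / np.linalg.norm(clean)
print(f"relative error before {err_before:.3f}, after {err_after:.3f}")
```

The small eigenvalues form the flat "noise floor"; keeping only the dominant eigenvectors and projecting onto their span removes the noise living in the discarded directions while preserving the signal.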