
Imagine you have a dataset of 1000 observations. To keep things intuitive, imagine they are (x, y) coordinates. They are temporally independent, which keeps things simple.

You wish you had about a million observations, but you only have 1000. How should you generate a million simulated observations?

Are there any proofs that describe the most mathematically precise way to do this?

You want to be true to your original dataset. How do you do that without adding your own bias?

This is a simple problem, and a general one, but I don't know whether it's trivial. It seems like it should be.

Legit Stack
  • You may look for the "bootstrap" keyword. – Xi'an Mar 05 '20 at 10:16
  • @Xi'an that's actually why I asked this question. I heard of the "over-dispersed Poisson bootstrap model" in actuarial reserving and was told that if you use any model other than the chain ladder method in that context, you lose data and therefore bias your simulated observations ("resamples"?) in some guaranteed way. So that should be a general principle, well understood and applicable to all kinds of data augmentation in all domains; only your domain-specific priors need to change. Yet the only answers I get here or anywhere else are essentially "it depends." – Legit Stack Mar 05 '20 at 14:10
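For concreteness, here is a minimal sketch of the nonparametric bootstrap that Xi'an's comment points to, applied to the (x, y) setting in the question. The array names and sizes are illustrative, and this is only one of several possible resampling schemes:

```python
# Minimal nonparametric bootstrap sketch: resample rows of the original
# 1000 (x, y) observations with replacement until we have one million rows.
# This reproduces the empirical distribution exactly; it adds no new information.
import numpy as np

rng = np.random.default_rng(42)

data = rng.normal(size=(1000, 2))          # stand-in for the real (x, y) dataset
idx = rng.integers(0, len(data), size=1_000_000)
simulated = data[idx]                      # 1,000,000 resampled observations

print(simulated.shape)                     # (1000000, 2)
```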

3 Answers


The reason you "wish you had a million observations" is typically that you want to use the data to infer something you don't already know. For example, you might want to fit a model, or make predictions. In this context, the data processing inequality implies that, unfortunately, simulating additional data is less helpful than one might hope (but this doesn't mean it's useless).

To be more specific, let $Y$ be a random vector representing unknown quantities we'd like to learn about, and let $X$ be a random vector representing the data. Now, suppose we simulate new data using knowledge learned from the original data. For example, we might fit a probability distribution to the original data and then sample from it. Let $\tilde{X}$ be a random vector representing the simulated data, and $Z = [X, \tilde{X}]$ represent the augmented dataset. Because $Z$ was generated based on $X$, we have that $Z$ and $Y$ are conditionally independent, given $X$. That is:

$$p(x,y,z) = p(x,y) p(z \mid x)$$

According to the data processing inequality, the mutual information between $Z$ and $Y$ can't exceed that between $X$ and $Y$:

$$I(Z; Y) \le I(X; Y)$$

Since $Z$ contains $X$, this is actually an equality. In any case, it says that no matter how we process the data--including using it to simulate new data--we cannot gain additional information about our quantity of interest beyond what is already contained in the original data.

But, here's an interesting caveat. Note that the above result holds when $\tilde{X}$ is generated based on $X$. If $\tilde{X}$ is also based on some external source $S$, then it may be possible to gain additional information about $Y$ (if $S$ carries this information).

Given the above, it's interesting to note that data augmentation can work well in practice. For example, as Haitao Du mentioned, when training an image classifier, randomly transformed copies of the training images are sometimes used (e.g. translations, reflections, and various distortions). This encourages the learning algorithm to find a classifier that's invariant to these transformations, thereby increasing performance. Why does this work? Essentially, we're introducing a useful inductive bias (similar in effect to a Bayesian prior). We know a priori that the true function ought to be invariant, and the augmented images are a way of imposing this knowledge. From another perspective, this a priori knowledge is the additional source $S$ that I mentioned above.
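To see the first point numerically, here is a rough sketch (the linear-regression setup and names are my own, chosen only for illustration): fit a slope from n = 1000 points, then "augment" to one million rows by resampling and refit. The estimate barely changes, while the naive standard error shrinks dramatically; the apparent extra precision is an artifact, consistent with $I(Z; Y) \le I(X; Y)$.

```python
# Sketch: bootstrap-augmenting a dataset does not add information about the slope.
import numpy as np

rng = np.random.default_rng(0)

n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)           # true slope is 2

def slope_and_naive_se(x, y):
    """OLS slope (no intercept) and its textbook standard error."""
    beta = np.dot(x, y) / np.dot(x, x)
    resid = y - beta * x
    se = np.sqrt(np.sum(resid**2) / (len(x) - 1) / np.dot(x, x))
    return beta, se

beta_orig, se_orig = slope_and_naive_se(x, y)

# "Augment" by drawing one million rows with replacement from the original 1000.
idx = rng.integers(0, n, size=1_000_000)
beta_aug, se_aug = slope_and_naive_se(x[idx], y[idx])

print(f"original  n=1000      : slope {beta_orig:.4f}, naive SE {se_orig:.4f}")
print(f"augmented n=1,000,000 : slope {beta_aug:.4f}, naive SE {se_aug:.4f}")
# The slope barely moves, but the naive SE shrinks by roughly sqrt(1000) ~ 30x:
# the extra "precision" is an illusion, since no new information was added.
```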

user20160

Are there any proofs that describe the most mathematically precise way to do this?

Any transformation would have some math behind it.

However, I do think image data augmentation depends on the specific use case and on domain knowledge in the specific field.

For example, if we want to detect dogs or cats, we can flip images for augmentation, because we know an upside-down dog is still a dog. On the other hand, if we are doing digit recognition, flipping images upside down may not be a good idea, because 6 and 9 are different digits.

For other domains, say computer vision on medical images, I do not know whether flipping or mirroring a chest X-ray makes sense.

Therefore, it is domain specific and may not be captured by some general mathematical model.
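As a concrete sketch of this point (hypothetical 28x28 grayscale arrays; function names are mine): the same flip operation is a valid augmentation in one domain and an invalid one in another, and only domain knowledge tells you which.

```python
import numpy as np

def augment_horizontal_flip(images: np.ndarray) -> np.ndarray:
    """Return original images plus their left-right mirrored copies (shape (N, H, W))."""
    return np.concatenate([images, images[:, :, ::-1]], axis=0)

def augment_vertical_flip(images: np.ndarray) -> np.ndarray:
    """Return original images plus upside-down copies. Label-preserving only if
    the domain says so (fine for cats vs. dogs, wrong for 6 vs. 9)."""
    return np.concatenate([images, images[:, ::-1, :]], axis=0)

batch = np.random.rand(32, 28, 28)           # stand-in for real images
print(augment_horizontal_flip(batch).shape)  # (64, 28, 28)
```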

Haitao Du

The question is, why do you want to do data augmentation?

Of course, more data is better, but your augmented dataset is redundant: your million augmented data points are not as good as a million actual data points.

An alternative way of thinking about data augmentation is in terms of teaching invariances. For example, CNNs in deep learning are translationally invariant, which is a good thing for image recognition. We might also wish they were invariant to rotations (a leaning cat is still a cat), but that is not easy to achieve within the architecture.

In summary: Data augmentation is a way to create a model that is roughly invariant with respect to a set of transformations when you cannot force that invariance elsewhere (be it the features or the model).

Answering your question, the only way to determine the valid data augmentation procedures is to apply domain knowledge. How can your data points be perturbed or modified without substantially changing them? What do you want your model to learn to ignore?

Let me argue that there is no general way, and that there cannot be one. Consider predicting the position of an object at $t=1$, given that your $(x, y)$ points are the initial positions. A logical data augmentation scheme would be to displace the points microscopically; surely they will end up at almost the same position, right? But if the system is chaotic (for example, a double pendulum), those microscopic deviations produce exponentially diverging trajectories. What data augmentation can you apply there? Maybe perturbations of the points that lie in large basins of attraction. That would bias your data, since you would have fewer samples from the chaotic regimes (which is not necessarily a bad thing!). In any case, any perturbation scheme you come up with will come from a careful analysis of the problem at hand.
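As a deliberately naive sketch of such a perturbation scheme for the (x, y) data in the question (names and values are illustrative): resample with replacement and add small Gaussian jitter. The scale `sigma` is exactly where the domain knowledge enters, and the chaotic-regime example above is precisely the case where no safe value of `sigma` exists.

```python
import numpy as np

def jitter_augment(points: np.ndarray, n_out: int, sigma: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Resample rows with replacement and add Gaussian noise of scale sigma."""
    idx = rng.integers(0, len(points), size=n_out)
    return points[idx] + rng.normal(scale=sigma, size=(n_out, points.shape[1]))

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))                   # the 1000 (x, y) observations
augmented = jitter_augment(data, n_out=1_000_000, sigma=0.05, rng=rng)
print(augmented.shape)                              # (1000000, 2)
```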

Davidmh
    Worth noting that there _are_ attempts to hard-bake invariances like rotation into NN architectures as well, like convolution does with translations. https://arxiv.org/abs/1801.10130 . Arguably, with the right kind of model, augmentation should never have any benefit. – leftaroundabout Mar 05 '20 at 20:48
  • @leftaroundabout true, but for problems complex enough to be practical, I bet I can think of more invariances than you can bake into the model. For images, exposure, noise level, illumination conditions, white balance... are all things you may not want to depend on. And for those of us working on less common kinds of data, it is even more relevant, since we don't have the benefit of a large body of NN research. – Davidmh Mar 05 '20 at 21:18
    Actually, exposure, illumination and noise are amongst the easiest: they're largely taken care of already by (pseudo-) linearity of any matrix- and relu layers. Where it gets really tricky is with things like _spatial deformation_ of 3D objects visible in an image. But also even with relatively simple blurs I'm not sure there are good solutions yet. – leftaroundabout Mar 05 '20 at 21:31