
I was thinking about what the outcome of the following idea would be. Let's say we have a Generative Adversarial Network (GAN) that has "successfully" (i.e., the discriminator is unable to differentiate between real and fake samples) mapped a noise distribution to the distribution of the "real" data set (e.g., human faces). If we take the images produced by the generator and include them in the human-faces data set, would this mean that the original "real" data set can keep growing in size, and hence that we are generating new, valuable data that can be used to train other models? I've formulated the question in a deliberately loose manner, to invite someone to expose the immediate problems with it.
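For concreteness, here is a minimal sketch of the loop I have in mind, in Python; the `generator` is a stub standing in for a trained GAN generator, and all names and sizes are placeholders:

```python
import numpy as np

# Stub for a trained GAN generator: maps noise vectors to flattened "images".
# In practice this would be the trained generator network, not random weights.
def generator(noise: np.ndarray) -> np.ndarray:
    return np.tanh(noise @ np.random.randn(100, 64 * 64))

real_faces = np.random.rand(10_000, 64 * 64)  # stand-in for the real data set

# Sample noise, generate fakes, and fold them back into the "real" data set.
noise = np.random.randn(5_000, 100)
fake_faces = generator(noise)
augmented_faces = np.concatenate([real_faces, fake_faces], axis=0)

# The question: is `augmented_faces` genuinely more valuable for training
# other models than `real_faces` alone?
```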

Thanks

1 Answer


What you mean by "Discriminator is not able to differentiate between real and fake" is important. At one extreme, suppose the discriminator performs perfectly, i.e., it can always tell whether a new sample was drawn from the true dataset or produced by the generator. Then for the generator to be successful, it would have to perfectly reproduce samples from the true dataset. In that case, you wouldn't gain anything from augmenting your data with generated cases. At the other extreme, say the discriminator performs no better than randomly guessing "real" or "fake". In this case the generator could produce anything and still be viewed as successful, and you certainly wouldn't want to augment your data with those generated cases. These extremes may seem contrived, but one can tilt the performance of a GAN in either direction through the choice of architecture and/or objective functions. For example, if the depth/width of the discriminator is very small, it may not be able to perform better than random guessing.
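As an illustration of the capacity point, here is a sketch of how one might build discriminators of varying depth/width (assuming a PyTorch-style setup; the sizes are arbitrary choices, not recommendations):

```python
import torch.nn as nn

def make_discriminator(width: int, depth: int) -> nn.Sequential:
    """Build an MLP discriminator; tiny width/depth limits what it can detect."""
    layers, in_dim = [], 64 * 64  # flattened 64x64 images
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.LeakyReLU(0.2)]
        in_dim = width
    layers.append(nn.Linear(in_dim, 1))  # single real/fake logit
    return nn.Sequential(*layers)

# A discriminator this small can barely beat random guessing, so the
# generator "fooling" it says very little about sample quality.
weak_discriminator = make_discriminator(width=4, depth=1)

# A larger one pushes the GAN toward the other extreme.
strong_discriminator = make_discriminator(width=512, depth=4)
```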

In short, the ways in which a GAN augments your data (the kinds of variation it would introduce) can be difficult to control or predict. The success of this kind of augmentation would certainly require an intelligent entity to assess the quality of the generated data beyond what the GAN's own performance measures (i.e., the cost functions of its components) can reflect. Other methods, like perturbing the training data with admissible transformations (for example, adding a little noise to images), are easier to control.
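By contrast, the controlled perturbation mentioned above can be stated explicitly. A minimal sketch, assuming images with pixel values in [0, 1] (the noise level sigma is an arbitrary choice):

```python
import numpy as np

def augment_with_noise(images: np.ndarray, sigma: float = 0.02,
                       copies: int = 2) -> np.ndarray:
    """Augment a batch of images by adding small Gaussian pixel noise."""
    noisy = [np.clip(images + np.random.normal(0.0, sigma, images.shape), 0.0, 1.0)
             for _ in range(copies)]
    return np.concatenate([images] + noisy, axis=0)

# Unlike GAN-based augmentation, the variation introduced here (pixel-wise
# noise of magnitude sigma) is fully specified and easy to audit.
```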

fprac