I'm trying to wrap my head around generating from a VQ-VAE with a PixelCNN prior. Mostly, I'm curious how to generate variations of a given "class", or object. My (foggy) understanding at the moment is that the model quantizes the latent space, so that the encoder outputs assigned to a given codebook vector represent a similar "class", or at least share some form of similarity between images. But it's not clear to me how to explore variations within one such class. (Intuitively I'd perturb the latent, of course, but how would I avoid drifting into another cluster/class?)
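To make my mental model concrete, here's how I picture the quantization step (toy codebook, made-up sizes; nothing here is from a particular implementation):

```python
import torch

# Hypothetical codebook: K embedding vectors of dimension D.
K, D = 512, 64
codebook = torch.randn(K, D)

def quantize(z_e):
    # The VQ step: snap each continuous encoder output to its
    # nearest codebook vector (Euclidean nearest neighbor).
    dists = torch.cdist(z_e, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=1)        # discrete code per latent
    return codebook[indices], indices

z_e = torch.randn(8, D)                  # stand-in for encoder outputs
z_q, idx = quantize(z_e)

# A small perturbation of the continuous latent only changes the
# discrete code if it crosses into a neighboring Voronoi cell:
_, idx_pert = quantize(z_e + 0.01 * torch.randn_like(z_e))
print((idx == idx_pert).float().mean())  # fraction of codes unchanged
```

If that picture is right, then "staying in the same cluster" is really a question about staying inside the same Voronoi cells of the codebook, which is part of what confuses me about how to vary things.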
Specifically, my application involves taking a given input x and producing variations of it (x1 ... xN) that maintain some perceptible relationship to the original. Is VQ-VAE + PixelCNN a good choice for such a task? I should mention that, although I can provide class labels, I'm also interested in models that can classify or cluster in a self-supervised manner. If I were to provide labels, there would be a relatively large number of them (e.g., 200+). I'm also interested in direct generation/sampling, without an input x.
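The approach I'm imagining (and would like a sanity check on) is to encode x to its grid of discrete codes, keep most of them fixed, and let the PixelCNN prior resample the rest, so the result stays perceptibly related to x. A rough sketch; `vq_vae.encode_indices`, `vq_vae.decode_indices`, and `pixelcnn.predict_logits` are hypothetical interfaces, not from any library:

```python
import torch

def generate_variations(x, vq_vae, pixelcnn, n_variations=8, resample_frac=0.3):
    """Sketch: resample a random subset of x's latent codes under the
    PixelCNN prior, keeping the rest fixed. All model interfaces here
    are hypothetical placeholders."""
    idx = vq_vae.encode_indices(x)          # (H, W) grid of code indices
    H, W = idx.shape
    variations = []
    for _ in range(n_variations):
        new_idx = idx.clone()
        mask = torch.rand(H, W) < resample_frac   # positions to resample
        # Resample in raster order; the PixelCNN conditions on the
        # already-decided codes above/left of each position. (Keeping
        # later positions fixed is a heuristic, not exact conditioning.)
        for i in range(H):
            for j in range(W):
                if mask[i, j]:
                    logits = pixelcnn.predict_logits(new_idx, i, j)  # (K,)
                    new_idx[i, j] = torch.multinomial(
                        torch.softmax(logits, dim=-1), 1).item()
        variations.append(vq_vae.decode_indices(new_idx))
    return variations
```

Presumably, for direct sampling without an input x, I'd just let the PixelCNN sample every position from scratch (i.e., resample_frac = 1.0 with no encoded grid). Does this kind of partial resampling make sense, or is there a more standard way to do it?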