I'm trying to wrap my head around generating from a VQ-VAE with a PixelCNN prior. Mostly, I'm curious how to go about generating variations of a given "class", or object. My (foggy) understanding, at the moment, is that the model quantizes the latent space, so that the vectors associated with a given quantization point represent a similar "class", or at least some form of similarity between images. But I'm not clear on how to explore variations of one such class. (Intuitively, I'd vary the latent, of course, but how would I avoid moving into another cluster/class?)

Specifically, my application involves taking a given input (x) and producing variations on that input (x1 ... xN) that maintain some perceptible relationship to the original. Is the VQ-VAE + PixelCNN a good choice for such a task? I should mention that, although I can provide class labels, I'm also interested in models that can classify or cluster in a self-supervised manner. If I were to provide labels, there would be a relatively large number of them (e.g., 200+). Also, I will be interested in direct generation/sampling, without an input x.

jbm

1 Answer

At the end of the VQ-VAE training process you have a categorical distribution over the discrete latent codes, which you can sample from to generate images that look "real" with respect to the distribution of your training set. You can encode the class into one of the dimensions of the latent space, which makes it possible to control which class to generate: you fix your own class, sample the rest of the code integers from the trained prior, and decode an image of that class (see the first sketch below).

I am not sure which changes exactly you want to perform on your images, but every time you sample randomly from the prior you will get different images, i.e. different images for different sampled codes. You cannot, however, "control" which specific changes are made, because the latent space is not easily interpretable.

Regarding the self-supervised clustering, you can try the following: after sampling from the prior and getting codes (or, alternatively, running the encoder over a given image to get its codes), look up the relevant embeddings according to the codes and concatenate them; this creates one vector that represents the image. You can then build such a vector for each image and cluster these vectors with your favourite clustering algorithm (see the second sketch below).
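
Here is a minimal sketch of the sampling loop, assuming a trained PixelCNN prior over a K-way grid of code indices and a trained VQ-VAE whose codebook is an embedding table and whose decoder maps embeddings back to pixels. The handles (prior, vqvae) and their interfaces are hypothetical stand-ins for your own code, and it conditions the prior on a label directly, which is one common way of implementing the class control described above:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sample_images(prior, vqvae, class_id, n_samples=8, grid_hw=(8, 8), device="cpu"):
        # Hypothetical interfaces: prior(codes, labels) -> (n, K, H, W) logits,
        # vqvae.codebook = nn.Embedding(K, D), vqvae.decoder(z) -> images.
        H, W = grid_hw
        labels = torch.full((n_samples,), class_id, dtype=torch.long, device=device)
        codes = torch.zeros(n_samples, H, W, dtype=torch.long, device=device)
        # The PixelCNN is autoregressive: sample one code index at a time,
        # conditioning on the class label and on all previously sampled codes.
        for i in range(H):
            for j in range(W):
                logits = prior(codes, labels)
                probs = F.softmax(logits[:, :, i, j], dim=-1)
                codes[:, i, j] = torch.multinomial(probs, 1).squeeze(-1)
        # Look up each discrete code's embedding and decode back to pixel space.
        z_q = vqvae.codebook(codes).permute(0, 3, 1, 2)   # (n, D, H, W)
        return vqvae.decoder(z_q)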

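And a sketch of the clustering idea, assuming a hypothetical vqvae.encode_to_codes helper that returns the (H, W) grid of code indices for each image; k-means with 200 clusters is purely illustrative (matching the "200+" classes you mention), so substitute whichever clustering algorithm you prefer:

    import torch
    from sklearn.cluster import KMeans

    @torch.no_grad()
    def image_vectors(vqvae, images):
        codes = vqvae.encode_to_codes(images)            # (n, H, W) integer codes
        emb = vqvae.codebook(codes)                      # (n, H, W, D) embeddings
        # Concatenate the per-position embeddings into one vector per image.
        return emb.flatten(start_dim=1).cpu().numpy()

    vectors = image_vectors(vqvae, images)               # images: a batch from your dataset
    cluster_ids = KMeans(n_clusters=200, n_init=10).fit_predict(vectors)
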
ofer-a