
I understand that, in order to improve the performance of a generative model, it is quite useful to compare the output and the target in feature space, as described in the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution, where this idea is applied to style transfer and super-resolution generative models:

[Figure from the paper: an image transformation network trained with a fixed, pretrained loss network (VGG16).]

My question is about how to select the layers of the support (loss) network to use in the loss calculation (in the example above a VGG16 network is used), as seen in the following component of the loss:

$$\ell_{feat}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j}\left\lVert \phi_j(\hat{y}) - \phi_j(y)\right\rVert_2^2$$

which computes the (normalized) squared L2 distance between the target and our output in the feature space of layer $j$ of the loss network $\phi$.
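For concreteness, here is a minimal sketch of what I mean (my own code, not the paper's; the torchvision `weights` argument and the layer index used for relu2_2 are assumptions that may depend on the torchvision version):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Fixed loss network phi: a pretrained VGG16 feature extractor.
# Inputs are assumed to be ImageNet-normalized RGB tensors.
vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # phi stays frozen; only the generator is trained

def phi_j(x, j):
    """Activations of vgg.features after layer index j (e.g. j=8 ~ relu2_2)."""
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i == j:
            return x
    raise ValueError(f"layer index {j} is out of range")

def feature_reconstruction_loss(y_hat, y, j=8):
    """Squared L2 distance in feature space, averaged over C_j * H_j * W_j (and the batch)."""
    return F.mse_loss(phi_j(y_hat, j), phi_j(y, j))

# Toy usage: random tensors standing in for the generator output and the target.
y_hat = torch.rand(2, 3, 256, 256, requires_grad=True)
y = torch.rand(2, 3, 256, 256)
feature_reconstruction_loss(y_hat, y).backward()  # gradients flow back into y_hat
```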

In many implementations, a set of these layers is selected and the corresponding losses are summed in the overall loss function. My questions are:

  • How do I select these layers?

  • Should I treat them as hyperparameters to tune on a validation set?

  • Should I select them using some other criterion (for example, knowing which kind of features each layer represents)?

Nikaido

1 Answer


The selection of layers for the feature reconstruction loss has to do with how one wants the content image to be transferred. Consider Figure 3 of the same paper mentioned in the OP. It shows the feature representations at different layers of the loss network ($\phi$). A closer look indicates that image reconstructions from the lower (initial) layers of the network more or less result in similar-looking images. But when the same images are reconstructed by minimizing a loss defined on the higher layers of the network, we see that the overall spatial structure is preserved. Since the general idea of style transfer is to infuse as much of the stylistic features as possible while keeping the fundamental appearance of the content image intact, it makes more sense to use the higher layers of the loss network for this term.

Another possible advantage of using higher layers in the loss function is that it forces the network to learn the overall spatial structure rather than optimize towards exactly matching the colors, textures, and shapes of the target image.

Figure 3 of this paper shows that the emphasis put on matching the style versus matching the content can indeed be treated as a hyperparameter. More weight on the content results in an output that is very close to the original content image, with only a minor effect from the style, and vice versa.
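To make that trade-off concrete, here is a minimal sketch (not the paper's code; the weight values are purely illustrative assumptions) of how it usually shows up as two scalars multiplying a feature reconstruction term and a Gram-matrix style term computed on activations of the loss network:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) activations from one layer of the loss network.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # (B, C, C), normalized as in the paper

def style_loss(feat_hat, feat_style):
    # Squared Frobenius norm of the difference of Gram matrices.
    return ((gram_matrix(feat_hat) - gram_matrix(feat_style)) ** 2).sum()

# Illustrative scalars: this is the content-vs-style emphasis treated as a hyperparameter.
content_weight, style_weight = 1.0, 1e5

# Stand-ins for phi_j(output), phi_j(content target), phi_j(style target).
f_hat = torch.rand(1, 128, 64, 64)
f_content = torch.rand(1, 128, 64, 64)
f_style = torch.rand(1, 128, 64, 64)

total_loss = (content_weight * F.mse_loss(f_hat, f_content)
              + style_weight * style_loss(f_hat, f_style))
```

Raising `content_weight` relative to `style_weight` pushes the output towards the content image, and vice versa.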

Therefore, the bottom line is this: features from the top (higher) layers are better at capturing the spatial structure (and are therefore more abstract), while features from the lower layers tend to focus more on the "visual" aspects of the image. So one has to choose the layers based on the balance of style and content one desires in the output.
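If you also want to tune the layer choice on a validation set, one simple way (a sketch under my own assumptions; the layer indices and weights below are illustrative, not prescribed by the paper) is to expose the set of layers and their weights as a single dictionary and sweep over candidate configurations:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hyperparameter to tune: which vgg16.features indices to match, and how strongly.
# 3 ~ relu1_2, 8 ~ relu2_2, 15 ~ relu3_3, 22 ~ relu4_3 (lower = more "visual", higher = more structural).
layer_weights = {8: 1.0, 15: 1.0, 22: 0.5}

vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract(x, indices):
    """Collect activations at the requested layer indices in a single forward pass."""
    feats, last = {}, max(indices)
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in indices:
            feats[i] = x
        if i == last:
            break
    return feats

def perceptual_loss(y_hat, y, weights=layer_weights):
    f_hat, f_tgt = extract(y_hat, weights), extract(y, weights)
    return sum(w * F.mse_loss(f_hat[j], f_tgt[j]) for j, w in weights.items())
```

Each candidate `layer_weights` configuration can then be scored with whatever validation metric matters for your task (e.g. PSNR/SSIM for super-resolution, or visual inspection for style transfer).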

nagaK