Why is "weight clipping" needed for Wasserstein GANs?

Question

I am reading the original paper on the Wasserstein GAN:

https://arxiv.org/pdf/1701.07875.pdf

and I came across this paragraph:

I don't understand the statement: "$\mathcal{W}$ is compact implies that all the functions $f_w$ will be $K$-Lipschitz for some $K$ that only depends on $\mathcal{W}$". Here, we are talking about a family of functions $\{f_w\}_{w \in \mathcal{W}}$. Why does the index coming from a compact space mean that the functions will be $K$-Lipschitz continuous? If I can understand this, then I can understand why we need to clip the weights to a compact space such as a box.

Danica · Answer 1 · 2020-01-22T01:29:30.773

Strictly as written, this statement is not always true. Letting $\sigma$ be the logistic function $\sigma(x) = 1 / (1 + \exp(-x))$, consider a (very simple) network of the form $$ f_w(x) = \begin{cases} \sigma\left(\frac{x}{w}\right) & w \ne 0 \\ \mathrm{sgn}(x) & w = 0 \end{cases} .$$ Letting $\mathcal W = [-1, 1]$, $\mathcal W$ is compact, yet $f_0$ is not even continuous, let alone Lipschitz. Each other $f_\epsilon$ is Lipschitz, but its best constant is $1 / (4 \lvert \epsilon \rvert)$ (since $\sigma'$ peaks at $\sigma'(0) = 1/4$), which diverges as $\epsilon \to 0$. So no single $K$ works for the whole family.
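As a quick numerical sanity check of this counterexample, here is a minimal sketch in plain Python. The helper name `max_slope`, the grid size, and the sampled values of $w$ are illustrative assumptions, not anything from the paper. It estimates the steepest slope of $x \mapsto \sigma(x / w)$ on a grid and prints it next to $1 / (4 \lvert w \rvert)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def max_slope(w, half_width=1.0, n=20_000):
    """Largest finite-difference slope of x -> sigmoid(x / w) on [-half_width, half_width].

    This is a lower estimate of the Lipschitz constant (hypothetical helper).
    """
    h = 2.0 * half_width / n
    xs = [-half_width + i * h for i in range(n + 1)]
    ys = [sigmoid(x / w) for x in xs]
    return max(abs(b - a) for a, b in zip(ys, ys[1:])) / h

# The estimate should track 1 / (4|w|), which grows without bound as w -> 0.
for w in [1.0, 0.1, 0.01]:
    print(f"w = {w}: estimated {max_slope(w):.4f}, 1/(4|w|) = {1 / (4 * abs(w)):.4f}")
```

The estimated slope grows as $w$ shrinks, so a compact index set alone does not give a uniform Lipschitz bound.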

But consider a more typical network $f_w^{(L)}$, given recursively by $$ f_w^{(0)}(x) = x \qquad f_w^{(\ell)}(x) = \sigma_\ell(W_\ell f_w^{(\ell-1)}(x) + b_\ell) ,$$ where $w$ contains all of the parameters $W_\ell$, $b_\ell$ for each layer $\ell$, and each $\sigma_\ell$ is some fixed Lipschitz activation function. The Lipschitz constant of a composition is at most the product of the Lipschitz constants of its parts, and the bias $b_\ell$ is only a shift, so $$ \lVert f_w^{(L)} \rVert_\mathrm{Lip} \le \lVert \sigma_{L} \rVert_\mathrm{Lip} \; \lVert W_L \rVert_\mathrm{op} \lVert f_w^{(L-1)} \rVert_\mathrm{Lip} \le \prod_{\ell=1}^L \lVert \sigma_{\ell} \rVert_\mathrm{Lip} \; \lVert W_\ell \rVert_\mathrm{op} .$$

Now, if $\mathcal W$ is compact, then there is some single constant $D$ such that $\lVert W_\ell \rVert_\mathrm{op} \le D$ for every layer $\ell$ and every $w \in \mathcal W$.^* We've also assumed that each $\lVert \sigma_\ell \rVert_\mathrm{Lip}$ is constant and independent of $w$. Thus, for any $w \in \mathcal W$, we have that $$ \lVert f_w^{(L)} \rVert_\mathrm{Lip} \le \prod_{\ell=1}^L \lVert \sigma_{\ell} \rVert_\mathrm{Lip} \; \lVert W_\ell \rVert_\mathrm{op} \le D^L \prod_{\ell=1}^L \lVert \sigma_{\ell} \rVert_\mathrm{Lip} ,$$ a constant independent of the particular choice of $w$.

^{* $\mathcal W$ being compact, assuming we're making reasonable decisions about what topology we mean "compact" in, implies that the set of valid $W_\ell$ in $\mathcal W$ is also compact (it is the image of $\mathcal W$ under a continuous projection). The operator norm is continuous, so it is bounded on that set.}
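To make the link to weight clipping concrete, here is a minimal sketch in plain Python. The layer widths, the clipping threshold `c`, and the helper names are all illustrative assumptions, and the operator norm is replaced by the Frobenius norm, which is a valid upper bound on it. It clips the weights of a small ReLU network to the box $[-c, c]$ and compares an empirical Lipschitz ratio against the product bound above. Each $\sigma_\ell$ is ReLU here, which is 1-Lipschitz. For an $m \times n$ matrix with entries in $[-c, c]$, $\lVert W \rVert_\mathrm{op} \le \lVert W \rVert_F \le c \sqrt{mn}$, which plays the role of $D$.

```python
import math
import random

def clip(W, c):
    """Clamp every entry of the matrix W (a list of rows) to [-c, c]."""
    return [[max(-c, min(c, v)) for v in row] for row in W]

def frobenius(W):
    """Frobenius norm, used here as an upper bound on the operator norm."""
    return math.sqrt(sum(v * v for row in W for v in row))

def forward(layers, x):
    """Apply f^(l)(x) = relu(W_l f^(l-1)(x) + b_l) for each layer in turn."""
    for W, b in layers:
        x = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return x

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

rng = random.Random(0)
c = 0.1               # hypothetical clipping threshold
sizes = [4, 8, 8, 1]  # hypothetical layer widths
shapes = list(zip(sizes, sizes[1:]))

layers = []
for n_in, n_out in shapes:
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [max(-c, min(c, rng.gauss(0.0, 1.0))) for _ in range(n_out)]
    layers.append((clip(W, c), b))  # biases do not enter the Lipschitz bound

# Bound for this particular w: product of per-layer norm bounds.
bound_w = math.prod(frobenius(W) for W, _ in layers)
# Bound valid for every w in the box, independent of the weights drawn.
bound_box = math.prod(c * math.sqrt(n_in * n_out) for n_in, n_out in shapes)

# Largest observed ratio ||f(x) - f(y)|| / ||x - y|| over random input pairs.
worst = 0.0
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
    y = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
    worst = max(worst, dist(forward(layers, x), forward(layers, y)) / dist(x, y))

print(f"observed ratio {worst:.6f} <= bound for w {bound_w:.6f} <= box bound {bound_box:.6f}")
```

The last quantity depends only on `c` and the layer shapes, which is the sense in which $K$ "only depends on $\mathcal{W}$". The bound is typically very loose, but it is uniform over the box.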
