Feature: experiment with a z which is a mix of normals, censored normals, binomials, and categoricals.
Typically, in almost all GANs, the z random noise is just a bunch of Gaussian variables (N(0,1)). This is chosen purely as a default (when in doubt, a random variable is a Gaussian), but there is no reason z couldn't be something totally different. StyleGAN, for example, finds it important to transform z into w using no fewer than 8 layers (#26), which suggests that the standard z is... not great, and affects the results quite a bit, both the quality of generated samples and how controllable the model is for editing: everyone finds that edits on w work much better than edits on z in StyleGAN. The w layers are learning to transform the Gaussians into some more appropriate distribution - potentially a much rougher, spikier, more discrete one?
The intuition here is that using just normals forces the GAN to find 128 or 512 or whatever 'factors' which can vary smoothly; but we want a GAN to learn things like 'wearing glasses', which are inherently discrete: a face either is or is not wearing glasses, there's no such thing as wearing +0.712σ glasses, or wearing −0.12σ of glasses-ness, even though that's more or less what using only normals forces the GAN to do. So you get weird things where face interpolations look like raccoons halfway through, as you force the GAN to generate partial-glass-ness. If the GAN had access to a binary variable, it could assign that variable to 'glasses' and simply snap between glasses/no-glasses while interpolating on the latent space, without any nonsensical intermediates being forced by smoothness. (Since G isn't being forced to generate artifactual images which D can easily detect & penalize, it may also have an easier time learning.)
mooch's BigGAN paper experiments with the latent space to offer something better than a giant w (which we've thus far found a little hard to stabilize inside BigGAN training). In particular, he finds that binary 0/1 (Bernoulli) variables and rectified (censored) normals perform surprisingly better than the default Gaussians.
A mix of Gaussians and binary variables, inspired by InfoGAN (where the z is the usual Gaussian provided for randomness & modeling nuisance variables, and c is a vector of binary variables which are the human-meaningful latent variables one hopes the GAN will automatically learn), also performs well.
mooch did not thoroughly explore this space because he wanted to use truncation, so there's potential for improvements here. (We could, for example, use a mix of Bernoullis and censored normals, and apply the truncation trick only to the censored normals, as sketched below.)
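A minimal sampling-time sketch of that idea, assuming a z laid out as a censored-normal half followed by a Bernoulli half (the dimensions and the truncation value are illustrative, not from the paper): the truncation trick is applied only to the continuous half, while the Bernoulli half stays as hard 0/1 bits.

```python
import tensorflow as tf

def sample_truncated_mixed_z(batch_size, z_dim, truncation=0.5):
    # Hypothetical sampling-time helper (not from compare_gan or the BigGAN code).
    half = z_dim // 2
    # Truncation trick on the continuous half only: draw from a truncated N(0,1)
    # (values beyond 2 sigma are resampled), scale by the truncation factor,
    # then rectify negatives to 0 to get a censored normal.
    censored = tf.maximum(truncation * tf.random.truncated_normal([batch_size, half]), 0.0)
    # The Bernoulli(0.5) half is untouched by truncation: hard 0/1 values by construction.
    bernoulli = tf.cast(tf.random.uniform([batch_size, z_dim - half]) < 0.5, tf.float32)
    return tf.concat([censored, bernoulli], axis=-1)
```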
One suggestion would be to expand the latent z a lot by tacking a few hundred censored normals or binomials onto the existing normals, feeding a mix into each BigGAN block; a sketch of what that might look like follows.
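Something like this (all dimensions are made-up illustrations; the per-block split mimics BigGAN's hierarchical latent, where z is chopped into one chunk per generator block):

```python
import tensorflow as tf

batch_size = 64
# Keep the existing 128 Gaussian dims and tack on 128 censored normals plus
# 128 Bernoullis (all sizes here are illustrative).
gaussian  = tf.random.normal([batch_size, 128])
censored  = tf.maximum(tf.random.normal([batch_size, 128]), 0.0)
bernoulli = tf.cast(tf.random.uniform([batch_size, 128]) < 0.5, tf.float32)
z = tf.concat([gaussian, censored, bernoulli], axis=-1)      # [64, 384]
# BigGAN-style hierarchical latent: one chunk of the mixed z per generator block.
per_block_chunks = tf.split(z, num_or_size_splits=6, axis=-1)  # six [64, 64] chunks
```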
Simply modifying the z code should be easy. compare_gan already provides a z_generator abstraction which lets us pass in distribution_fn=tf.random.uniform or distribution_fn=tf.random.normal etc. tf.random supports categorical/Bernoulli and normals (but not, it seems, censored normals, only truncated normals, which are different; censoring is easy, though: if x < 0, set it to 0, i.e. max(x, 0)). So we could define a distribution which, say, returns a first half of censored normals and a second half of Bernoullis.
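A minimal sketch of such a distribution_fn, assuming z_generator calls it with a [batch_size, z_dim] shape and that any extra keyword arguments it passes (e.g. stddev or minval/maxval) can be safely ignored:

```python
import tensorflow as tf

def censored_bernoulli_z(shape, dtype=tf.float32, name=None, **unused_kwargs):
    """Hypothetical distribution_fn for compare_gan's z_generator:
    first half censored normals, second half Bernoulli(0.5) bits."""
    batch_size, z_dim = shape
    half = z_dim // 2
    # Censored normal: N(0,1) with negative draws clipped to exactly 0
    # (unlike tf.random.truncated_normal, which resamples rather than clips,
    # so it puts no point mass at 0).
    censored = tf.maximum(tf.random.normal([batch_size, half], dtype=dtype), 0.0)
    # Bernoulli(0.5) via thresholded uniform noise: hard 0/1 latent bits.
    bernoulli = tf.cast(tf.random.uniform([batch_size, z_dim - half]) < 0.5, dtype)
    return tf.concat([censored, bernoulli], axis=-1, name=name)

# Hypothetical usage: z = z_generator([batch_size, z_dim], distribution_fn=censored_bernoulli_z)
```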
