It would be nice to make BigGAN more like StyleGAN in a few ways, primarily its stability. StyleGAN also seems to have a better latent space with more disentangled and linearized factors, which is great for editing but may also play a role in the stability.
As far back as IllustrationGAN, people have noted that feeding the z embedding (which is typically just a bunch of random normals, like 128 N(0,1)s) straight into the CNN doesn't work as well as feeding z into 1 or 2 FC layers first. FeepingCreature & I noticed this for WGAN as well. BigGAN experimented with a variety of other starting distributions, but kept N(0,1) for simplicity. StyleGAN took the breathtaking step of plopping in no fewer than 8 FC layers to transform z into... something, before passing the final resulting w into the rest of StyleGAN. (It's also worth noting that StackGAN's implementation appears to pass its text embeddings through at least 1 FC layer as part of its "noise augmentation", which may be part of why its text→image works and ours doesn't.)

The theory is that the 'true' latent space of the data distribution is highly nonlinear, complex, and non-normal, so starting from N(0,1)s is unhelpful to the convolution layers trying to create something realistic; passing z through a deep stack of FC layers lets it be massaged into something that encodes, in an easy-to-understand way, everything the rest of StyleGAN needs to know, and reduces the need for things like global attention layers to enforce consistency.
Perhaps this would be useful for BigGAN? We can simply paste in an 8x512 FC block after z and see how it goes (see the sketch below). As tweaks go, this one should be very easy to do and potentially quite helpful, so it's relatively high priority. If it works well, we can consider whether self-attention layers would work even better. (The better this works, the more evidence it provides for the idea of an all-attention GAN.)
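A minimal sketch of what such a block might look like in PyTorch, assuming a 128-dim z as in BigGAN; the depth, width, and pixel-norm preprocessing are borrowed from StyleGAN's mapping network, and the `MappingNetwork` name and defaults here are illustrative, not part of any existing codebase:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """StyleGAN-style z -> w mapping: a deep FC stack pasted in
    before the generator proper. Depth/width default to the
    8x512 block proposed above (hypothetical, not BigGAN code)."""
    def __init__(self, z_dim=128, w_dim=512, n_layers=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # StyleGAN first normalizes z onto the unit hypersphere ("pixel norm")
        z = z * torch.rsqrt(z.pow(2).mean(dim=1, keepdim=True) + 1e-8)
        return self.net(z)  # w, fed to the generator in place of raw z

# Usage: replace G(z) with G(mapping(z))
mapping = MappingNetwork()
w = mapping(torch.randn(16, 128))  # -> (16, 512)
```

One StyleGAN detail worth copying if we try this: StyleGAN trains its mapping network at a ~100x lower learning rate than the rest of the generator, which may itself matter for stability.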