Something to consider as a highly speculative research project once Ganbooru is done.
A distinct trend in recent DL has been self-attention moving beyond sequence or text data to image data as well, increasingly augmenting or replacing convolutions ("attention is all you need"). Examples:
- https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/
- https://arxiv.org/abs/1904.09925
- https://openreview.net/forum?id=HkxaFoC9KQ
- https://arxiv.org/abs/1911.03584
- https://arxiv.org/abs/1911.12287
- https://arxiv.org/abs/2001.04451
- https://arxiv.org/abs/1904.10509
In Bello et al., the classifier using solely attention is both parameter- and FLOP-efficient compared to the classical CNN, and although it doesn't set the SOTA, the performance curves look good. Another paper argues that self-attention learns convolution-like attention patterns, although ones that are different and more tailored; still other papers note that NNs with attention appear to have better inductive biases, generalize in more human-like ways, and resist adversarial examples better; and of course, we've long noted in GANs that attention appears to help a lot with global coherency.
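To make the mechanism concrete, here is a minimal sketch (my own, not taken from any of the papers above) of what "replacing a convolution with self-attention" looks like on image feature maps; the class name and hyperparameters are placeholders:

```python
# Minimal sketch: a drop-in self-attention block for image feature maps,
# flattening HxW positions into a token sequence. Names/sizes are illustrative.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels, num_heads=8):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per position
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```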
So the logical question becomes: can we create a GAN which drops convolution layers entirely in favor of just self-attention layers? The self-attention may require a lot of data to surpass convolutions' hardwired priors & compute-efficiency, but it will ultimately deliver better results (making it another bitter lesson, #28?). Fortunately, we have both a lot of data and a lot of compute in TPU pods.
When I brought this up on Twitter, no one had a solid theoretical argument for why an all-attention GAN would fail, other than the obvious point that the memory consumption of straightforward 256px or 512px self-attention layers may be too large for TPUs.
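To make that memory concern concrete, a back-of-the-envelope calculation for a single naive dense attention matrix, assuming one token per pixel and fp16 scores:

```python
# Rough memory for one dense N x N attention matrix, one token per pixel,
# 2 bytes (fp16) per score, per head, per layer.
for px in (256, 512):
    n = px * px                  # number of tokens
    bytes_per_head = n * n * 2   # N x N attention scores in fp16
    print(f"{px}px: {n} tokens, ~{bytes_per_head / 2**30:.0f} GiB per head per layer")
# 256px: 65536 tokens, ~8 GiB per head per layer
# 512px: 262144 tokens, ~128 GiB per head per layer
```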
We can work around that with reversible attention layers to save memory, with more efficient self-attention approximations like Reformer, or by rethinking the architecture entirely: why do we need 256px or 512px upscaling modules at all? You wouldn't necessarily have to take the obvious approach of iterated self-attention + nearest-neighbor-upscaling blocks (16→32→64→128→256→512px). You could have a deep stack of small self-attention layers, and then do the pixel generation in two or three big blocks. Why not use something like the deep w stack of StyleGAN (#26), where the self-attention layers keep transforming a latent representation of the image before spitting out the actual pixels? This would be a 'skinny' GAN. You could see this as taking the StyleGAN z→w 8x512 FC embedding to its logical limit, and souping it up with self-attention instead of regular FC layers. It's one of the weirdest parts of StyleGAN, but it seems important for generating a really useful linear, disentangled w seed. The final output could also be done with a single pass into Reformer, which scales up to 256^2 = 65536 pixels and could potentially do 512^2 = 262144 pixels (this is easier than the usual sequence task because it only needs to 'decode' the latent state, like a bottleneck).
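Purely as a sketch of the shape of that 'skinny' generator idea (not a worked-out design: all module names, token counts, and widths below are my assumptions, and a real version would substitute Reformer-style attention for the plain transformer stack):

```python
# Hypothetical 'skinny' generator: a deep stack of self-attention layers keeps
# transforming a set of latent tokens, then one big projection emits all pixels.
# Module names, token counts, and widths are illustrative placeholders.
import torch
import torch.nn as nn

class SkinnyAttentionGenerator(nn.Module):
    def __init__(self, latent_dim=512, num_tokens=64, depth=16, out_px=256):
        super().__init__()
        # num_tokens must be a perfect square; out_px divisible by its square root
        self.num_tokens = num_tokens
        self.out_px = out_px
        # map a single z vector to a set of latent tokens (the deep 'w stack')
        self.embed = nn.Linear(latent_dim, num_tokens * latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, dim_feedforward=2048, batch_first=True)
        self.stack = nn.TransformerEncoder(layer, num_layers=depth)
        # one big decode step: each token is responsible for one patch of pixels
        self.patch_px = out_px // int(num_tokens ** 0.5)   # e.g. 256 / 8 = 32
        self.to_pixels = nn.Linear(latent_dim, 3 * self.patch_px * self.patch_px)

    def forward(self, z):                                  # z: (B, latent_dim)
        b = z.shape[0]
        tokens = self.embed(z).view(b, self.num_tokens, -1)
        tokens = self.stack(tokens)            # repeatedly transform the latent image
        patches = self.to_pixels(tokens)       # (B, T, 3 * p * p)
        side = int(self.num_tokens ** 0.5)
        img = patches.view(b, side, side, 3, self.patch_px, self.patch_px)
        img = img.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, self.out_px, self.out_px)
        return torch.tanh(img)
```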
Another interesting option towards all-attention GANs: combine a pixel-level loss, implemented with a U-Net discriminator doing pixel-wise segmentation ("A U-Net Based Discriminator for Generative Adversarial Networks", Schönfeld et al 2020), with FB's new, simpler object detector, "DETR: End-to-End Object Detection with Transformers", Carion et al 2020.
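As a rough illustration of the pixel-level-loss half of that combination (in the spirit of Schönfeld et al 2020, not their code: the `unet_disc` module is an assumed placeholder that returns a per-pixel real/fake logit map):

```python
# Per-pixel discriminator loss sketch: every output pixel of the U-Net
# discriminator is supervised as real (1) or fake (0) individually.
import torch
import torch.nn.functional as F

def pixel_d_loss(unet_disc, real_imgs, fake_imgs):
    real_logits = unet_disc(real_imgs)            # (B, 1, H, W) logit map
    fake_logits = unet_disc(fake_imgs.detach())   # (B, 1, H, W) logit map
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake                  # averaged over every pixel
```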
Back in early March 2020, shawwn tested the simplest, most obvious thing: just replacing (most of) the convolution layers in a 128px StyleGAN with self-attention, and it does still work with anime faces, so no immediate problem there...