Something to consider as a highly speculative research project once Ganbooru is done.
A distinct trend in recent DL has been self-attention moving beyond sequence or text data to image data as well, increasingly augmenting or replacing convolutions ("attention is all you need"). Examples:
- https://ai.facebook.com/blog/making-transformer-networks-simpler-and-more-efficient/
- https://arxiv.org/abs/1904.09925
- https://openreview.net/forum?id=HkxaFoC9KQ
- https://arxiv.org/abs/1911.03584
- https://arxiv.org/abs/1911.12287
- https://arxiv.org/abs/2001.04451
- https://arxiv.org/abs/1904.10509
In Bello et al., the classifier using solely attention is both parameter- and FLOP-efficient compared to the classical CNN, and although it doesn't set the SOTA, the performance curves look good. Another paper argues that self-attention learns convolution-like attention patterns, although ones that are different and more tailored; still other papers note that NNs with attention appear to have better inductive biases, generalize in more human-like ways, and resist adversarial examples better; and of course, we've long noted in GANs that attention appears to help a lot with global coherency.
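To make the mechanism concrete, here is a minimal sketch (my own, not taken from any of the papers above) of what "replacing a convolution with self-attention" looks like on image feature maps; the class name and hyperparameters are placeholders:

```python
# Minimal sketch: a drop-in self-attention block for image feature maps,
# flattening HxW positions into a token sequence. Names/sizes are illustrative.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels, num_heads=8):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per position
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```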
So the logical question becomes: can we create a GAN which drops convolution layers entirely in favor of just self-attention layers? The self-attention may require a lot of data to surpass convolutions' hardwired priors & compute-efficiency, but it will ultimately deliver better results (making it another bitter lesson, #28?). Fortunately, we have both a lot of data and a lot of compute in TPU pods.
When I brought this up on Twitter, no one had a solid theoretical argument for why an all-attention GAN would fail, other than the obvious point that the memory consumption of straightforward 256px or 512px self-attention layers may be too large for TPUs.
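To make that memory concern concrete, a back-of-the-envelope calculation for a single naive dense attention matrix, assuming one token per pixel and fp16 scores:

```python
# Rough memory for one dense N x N attention matrix, one token per pixel,
# 2 bytes (fp16) per score, per head, per layer.
for px in (256, 512):
    n = px * px                  # number of tokens
    bytes_per_head = n * n * 2   # N x N attention scores in fp16
    print(f"{px}px: {n} tokens, ~{bytes_per_head / 2**30:.0f} GiB per head per layer")
# 256px: 65536 tokens, ~8 GiB per head per layer
# 512px: 262144 tokens, ~128 GiB per head per layer
```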
We can work around that with reversible attention layers to save memory, with more efficient self-attention approximations like Reformer, or by rethinking the architecture entirely: why do we need 256px or 512px upscaling modules at all? You wouldn't necessarily have to take the obvious approach of iterated self-attention + nearest-neighbor-upscaling blocks (16→32→64→128→256→512px). You could have a deep stack of small self-attention layers, and then do the pixel generation in two or three big blocks. Why not use something like the deep w stack of StyleGAN (#26), where the self-attention layers keep transforming a latent representation of the image before spitting out the actual pixels? This would be a 'skinny' GAN. You could see this as taking the StyleGAN z→w 8x512 FC embedding to its logical limit, and souping it up with self-attention instead of regular FC layers. It's one of the weirdest parts of StyleGAN, but it seems important for generating a really useful linear, disentangled w seed. The final output could also be done with a single pass into Reformer, which scales up to 256^2 = 65536 pixels and could potentially do 512^2 = 262144 pixels (this is easier than the usual sequence task because it only needs to 'decode' the latent state, like a bottleneck).
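Purely as a sketch of the shape of that 'skinny' generator idea (not a worked-out design: all module names, token counts, and widths below are my assumptions, and a real version would substitute Reformer-style attention for the plain transformer stack):

```python
# Hypothetical 'skinny' generator: a deep stack of self-attention layers keeps
# transforming a set of latent tokens, then one big projection emits all pixels.
# Module names, token counts, and widths are illustrative placeholders.
import torch
import torch.nn as nn

class SkinnyAttentionGenerator(nn.Module):
    def __init__(self, latent_dim=512, num_tokens=64, depth=16, out_px=256):
        super().__init__()
        # num_tokens must be a perfect square; out_px divisible by its square root
        self.num_tokens = num_tokens
        self.out_px = out_px
        # map a single z vector to a set of latent tokens (the deep 'w stack')
        self.embed = nn.Linear(latent_dim, num_tokens * latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=8, dim_feedforward=2048, batch_first=True)
        self.stack = nn.TransformerEncoder(layer, num_layers=depth)
        # one big decode step: each token is responsible for one patch of pixels
        self.patch_px = out_px // int(num_tokens ** 0.5)   # e.g. 256 / 8 = 32
        self.to_pixels = nn.Linear(latent_dim, 3 * self.patch_px * self.patch_px)

    def forward(self, z):                                  # z: (B, latent_dim)
        b = z.shape[0]
        tokens = self.embed(z).view(b, self.num_tokens, -1)
        tokens = self.stack(tokens)            # repeatedly transform the latent image
        patches = self.to_pixels(tokens)       # (B, T, 3 * p * p)
        side = int(self.num_tokens ** 0.5)
        img = patches.view(b, side, side, 3, self.patch_px, self.patch_px)
        img = img.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, self.out_px, self.out_px)
        return torch.tanh(img)
```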
Another interesting option towards all-attention GANs: combine a pixel-level loss, implemented with a U-Net discriminator doing pixel-wise segmentation ("A U-Net Based Discriminator for Generative Adversarial Networks", Schönfeld et al 2020), with FB's new, simpler object detector, "DETR: End-to-End Object Detection with Transformers", Carion et al 2020.
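As a rough illustration of the pixel-level-loss half of that combination (in the spirit of Schönfeld et al 2020, not their code: the `unet_disc` module is an assumed placeholder that returns a per-pixel real/fake logit map):

```python
# Per-pixel discriminator loss sketch: every output pixel of the U-Net
# discriminator is supervised as real (1) or fake (0) individually.
import torch
import torch.nn.functional as F

def pixel_d_loss(unet_disc, real_imgs, fake_imgs):
    real_logits = unet_disc(real_imgs)            # (B, 1, H, W) logit map
    fake_logits = unet_disc(fake_imgs.detach())   # (B, 1, H, W) logit map
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake                  # averaged over every pixel
```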
Back in early March 2020, shawwn tested the simplest, most obvious thing: just replacing (most of) the convolution layers in a 128px StyleGAN with self-attention, and it does still work with anime faces, so no immediate problem there...