Beyond BigGAN: all-attention GANs? #31

@gwern

Something to consider as a highly speculative research project once Ganbooru is done.


A distinct trend in recent DL has been self-attention moving beyond sequence or text data to image data as well, increasingly augmenting or replacing convolutions ("attention is all you need"). Examples:

In Bello et al, the classifier using solely attention is both parameter- and FLOP-efficient compared to the classical CNN, and although it doesn't set the SOTA, the performance curves look good. Another paper argues that self-attention learns convolution-like attention patterns, but ones that are different and more tailored; still other papers note that NNs with attention appear to have better inductive biases, generalize in more human-like ways, and resist adversarial examples better; and of course, we've long noted in GANs that attention appears to help a lot with global coherency.
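
To make concrete what "attention over an image" means here, below is a minimal sketch (PyTorch, my own illustrative choice; the class name and channel sizes are placeholders, not the layer from any of the papers above) of a single-head, SAGAN-style self-attention layer over the spatial positions of a feature map. The quadratic attention matrix it builds is also the memory issue discussed further down.

```python
# Minimal sketch of spatial self-attention over an image feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels, key_channels=None):
        super().__init__()
        key_channels = key_channels or channels // 8
        # 1x1 convs project features into query/key/value spaces.
        self.to_q = nn.Conv2d(channels, key_channels, 1)
        self.to_k = nn.Conv2d(channels, key_channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual gate

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.to_q(x).flatten(2).transpose(1, 2)   # (b, h*w, k)
        k = self.to_k(x).flatten(2)                   # (b, k, h*w)
        v = self.to_v(x).flatten(2).transpose(1, 2)   # (b, h*w, c)
        # Every pixel attends to every other pixel: O((h*w)^2) memory.
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (b, h*w, h*w)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.gamma * out

# Even a 32x32 feature map already yields a 1024x1024 attention matrix per head.
layer = SpatialSelfAttention(256)
y = layer(torch.randn(2, 256, 32, 32))
```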

So the logical question becomes: can we create a GAN which drops convolution layers entirely in favor of just self-attention layers? The self-attention may require a lot of data in order to surpass convolutions' hardwired priors & compute-efficiency, but it may ultimately deliver better results (making it another bitter lesson, #28?). Fortunately, we have both a lot of data and a lot of compute in TPU pods.

When I brought this up on Twitter, no one had a solid theoretical argument for why an all-attention GAN would fail, other than the obvious point that the memory consumption of straightforward 256px or 512px self-attention layers may be too large for TPUs.
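
As a rough sanity check on that memory objection, here is the back-of-the-envelope arithmetic (my own figures, assuming a single head and 2-byte floats; real models multiply this by heads × layers). The attention matrix alone at 256px is ~8 GiB and at 512px ~128 GiB, versus roughly 16 GB of HBM on a TPUv3 core:

```python
# Naive pixel-level self-attention builds an (H*W) x (H*W) attention matrix,
# so memory grows with the fourth power of resolution.
for px in (64, 128, 256, 512):
    n = px * px                      # sequence length = number of pixels
    bytes_attn = n * n * 2           # one attention matrix in 2-byte floats
    print(f"{px}px: {n:>7} tokens -> {bytes_attn / 2**30:7.2f} GiB attention matrix")

# 64px:    4096 tokens ->    0.03 GiB attention matrix
# 128px:   16384 tokens ->    0.50 GiB attention matrix
# 256px:   65536 tokens ->    8.00 GiB attention matrix
# 512px:  262144 tokens ->  128.00 GiB attention matrix
```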

We can work around that with reversible attention layers to save memory, with more efficient self-attention approximations like Reformer, or by rethinking the architecture entirely: why do we need 256px or 512px upscaling modules at all?

You wouldn't necessarily have to do the obvious approach of using self-attention+nearest-neighbors-upscaling blocks iterated (16→32→64→128→256→512px). You could have a deep stack of small self-attention layers, and then do the pixel generation in two or three big blocks. Why not use something like the deep w stack of StyleGAN (#26), where the self-attention layers keep transforming a latent representation of the image before spitting out the actual pixels? This would be a 'skinny' GAN. You could see this as taking the StyleGAN z→w 8×512 FC embedding to its logical limit, and souping it up with self-attention instead of regular FC layers. It's one of the weirdest parts of StyleGAN, but it seems important in generating a really useful linearly-disentangled w seed. The final output could also be done with a single pass through Reformer, which scales up to 256^2 = 65,536 pixels and could potentially do 512^2 = 262,144 pixels (this is easier than the usual sequence task because it only needs to 'decode' the latent state, like a bottleneck).
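
A very hand-wavy sketch of what that 'skinny' generator could look like (PyTorch; the class name, token count, widths, and the single-shot patch decode are placeholder choices of mine, not a tested design): a z→tokens embedding, a deep stack of self-attention blocks operating only on a small latent token grid, and one big linear decode at the end that emits all pixels at once.

```python
# 'Skinny' all-attention generator sketch: deep latent-only attention,
# then a single large decode to pixels instead of iterated upscaling blocks.
import torch
import torch.nn as nn

class SkinnyAttentionGenerator(nn.Module):
    def __init__(self, z_dim=512, n_tokens=64, width=512, depth=16, img_size=256):
        super().__init__()
        self.grid = int(n_tokens ** 0.5)        # 8x8 grid of latent tokens
        self.patch = img_size // self.grid      # each token emits a 32x32 patch
        self.to_tokens = nn.Linear(z_dim, n_tokens * width)
        self.pos = nn.Parameter(torch.randn(1, n_tokens, width) * 0.02)
        block = nn.TransformerEncoderLayer(d_model=width, nhead=8,
                                           dim_feedforward=4 * width,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.to_pixels = nn.Linear(width, 3 * self.patch * self.patch)

    def forward(self, z):
        b = z.shape[0]
        tokens = self.to_tokens(z).view(b, -1, self.pos.shape[-1]) + self.pos
        tokens = self.blocks(tokens)             # deep stack of self-attention
        patches = self.to_pixels(tokens)         # (b, n_tokens, 3*patch*patch)
        # Fold the per-token patches back into an image grid.
        img = patches.view(b, self.grid, self.grid, 3, self.patch, self.patch)
        img = img.permute(0, 3, 1, 4, 2, 5).reshape(
            b, 3, self.grid * self.patch, self.grid * self.patch)
        return torch.tanh(img)

g = SkinnyAttentionGenerator()
fake = g(torch.randn(4, 512))    # -> (4, 3, 256, 256)
```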

Another interesting option towards all-attention GANs: combine a pixel-level loss, implemented with a U-Net discriminator doing per-pixel real/fake segmentation ("A U-Net Based Discriminator for Generative Adversarial Networks", Schönfeld et al 2020), with FB's new, simpler object detector, "DETR: End-to-End Object Detection with Transformers", Carion et al 2020.
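
One loose way to read that combination (my own interpretation, not the architecture of either paper): a DETR-style transformer encoder over image patches whose outputs drive both a global real/fake logit and a per-patch decision map, the latter standing in for the U-Net discriminator's per-pixel loss. All sizes below are placeholders.

```python
# Patch-transformer discriminator sketch: global + per-patch real/fake outputs.
import torch
import torch.nn as nn

class PatchTransformerDiscriminator(nn.Module):
    def __init__(self, img_size=128, patch=8, width=384, depth=6):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, width, kernel_size=patch, stride=patch)  # patchify
        self.cls = nn.Parameter(torch.zeros(1, 1, width))
        self.pos = nn.Parameter(torch.randn(1, n_patches + 1, width) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=6,
                                           dim_feedforward=4 * width,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.global_head = nn.Linear(width, 1)   # whole-image real/fake
        self.local_head = nn.Linear(width, 1)    # per-patch real/fake map

    def forward(self, img):
        b = img.shape[0]
        tokens = self.embed(img).flatten(2).transpose(1, 2)           # (b, n, width)
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], 1) + self.pos
        tokens = self.encoder(tokens)
        global_logit = self.global_head(tokens[:, 0])                 # (b, 1)
        local_logits = self.local_head(tokens[:, 1:]).squeeze(-1)     # (b, n_patches)
        return global_logit, local_logits

d = PatchTransformerDiscriminator()
g_logit, patch_logits = d(torch.randn(2, 3, 128, 128))
```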

Back in early March 2020, shawwn tested the simplest, most obvious thing of just replacing (most of) the convolution layers in a 128px StyleGAN with self-attention, and it does still work with anime faces, so no immediate problem there...
