Kieran Didi

The unification of representation learning and generative modelling

2025-12-31T00:00:00+00:00

Introduction
Background
Overview of the Four Phases
Phase 1: Aligning Diffusion Features to Vision Foundation Models
Phase 2: Aligning the VAE Latent Space to Foundation Models
Phase 3: Operating Directly in Vision Foundation Model Feature Spaces
Phase 4: Questioning the Need for Pretrained Representations
The Other Direction: Generative Models as Representations
- From Pixel Prediction to Embedding Prediction
- Diffusion Models Learn Representations Too
Representation Learning and Alignment in Molecular Machine Learning
- Molecular Embeddings: Borrowing from NLP and Computer Vision
- Where to Go from Here?
Conclusion
Credits
References

For the TLDR version of this post, see this version on the OPIG blog.

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{didi2025r4g,
  author = {Didi, Kieran},
  title = {The unification of representation learning and generative modelling},
  url = {https://kdidi.netlify.app/blog/ml/2025-12-31-r4g/},
  year = {2025}
}

Introduction

Both generative modeling and representation learning have made impressive advances in recent years, particularly in computer vision. Diffusion ¹²³ and flow models ⁴⁵ have achieved unprecedented generation quality, while self-supervised paradigms like CLIP ⁶, DINO ⁷, and MAE ⁸ have enabled state-of-the-art performance on classification, detection, and depth estimation. Yet generation has remained separate from other vision tasks, raising a natural question: can we create unified representations useful for both discriminative and generative tasks?

Fig 1. Three parallel timelines showing the independent evolution of representation learning methods, latent generative modeling architectures, and the recent convergence of these fields in R4G (Representation for Generation).

This field—sometimes termed Representation for Generation (R4G)—has evolved rapidly over the past year, with multiple groups independently converging on similar insights. The rapid development reveals fundamental questions about visual representations: Are features learned during generation inherently different from those learned discriminatively? Can we bridge these paradigms for more efficient systems? Recent evidence suggests diffusion models already learn semantically meaningful representations ⁹¹⁰, and that generative classifiers exhibit surprisingly human-like properties ¹¹. Is there a way these two can benefit from each other more explicitly?

Fig 2. Evolution from no alignment (Phase 0) through feature alignment (Phase 1), VAE alignment (Phase 2), and VAE-less direct embedding diffusion (Phase 3). Phase 4 (pixel-space diffusion without pretrained models) represents a parallel evolution that has been ongoing throughout, with complementary contributions to pure latent-space methods. Each block shows the architectural approach and which papers introduced key innovations.

When I started reading into the literature, I was honestly quite overwhelmed by the sheer number of papers and approaches proposed this year alone, with new papers coming out every week. But after some reading, and after discussing some of these papers with their authors and with colleagues, some patterns started to emerge. In this blog post I try to organize recent developments into four phases reflecting my take on how the field developed during 2025: from initial alignment strategies to questioning whether pretrained representations are necessary at all. As part of this I also touch upon pixel-space versus latent diffusion models (again) and how the trend goes both ways, i.e. how we can use generative models for representation learning. Finally, because at heart I am a molecule guy, I share some of my thoughts on how these ideas are beginning to influence molecular machine learning, and exciting directions to pursue there.

Fig 3. An explosion of papers in this field has made it hard to keep an overview; by the end of this post you should hopefully be able to pick up any of these papers, place it on your mental map in one of the phases we will discuss, and compare it to similar approaches.

Background

To talk about the unification of representation learning and generative modelling, it is worth briefly looking at each of them separately and reviewing what happened in each recently. It has been known for quite a while that they are intimately related (for a recent take on this see this excellent talk by Kaiming He from a CVPR 2025 workshop). However, in practice they still function quite differently, both in terms of the losses and training recipes employed and the neural network architectures used. What is the latest in both of these fields?

Fig 4. Generative Modelling and Representation Learning are intimately connected, two sides of the same coin: while representation learning tries to map from data to some semantic representation space (e.g. to allow for easier classification of objects in an image), generative modelling wants to map from abstract concepts like text prompts to actual data samples. Image from the “Foundations of Computer Vision” online book.

Generative modelling: Latent Diffusion Models

Fig 5. Two-stage training: first, a VAE compresses images into latent space; second, diffusion operates in this latent space. This modular design accelerated adoption but separated tokenizer training from diffusion model training.

Diffusion models revolutionized generation by framing it as iterative denoising. While Sohl-Dickstein et al. ¹² first introduced the core idea of learning to reverse a diffusion process in 2015, the approach didn’t scale until Ho et al. ¹ introduced DDPMs that gradually add noise and learn to reverse it; Song et al. looked at it from a score-based perspective ¹³; and in 2021 they got together to formalize a unified perspective through stochastic differential equations ². This was all still in pixel space: each pixel was denoised in RGB space, which hindered both scalability and performance. The key breakthrough came with latent diffusion: Vahdat et al. adapted their earlier NVAE work ¹⁴ to propose LSGM ¹⁵, a theoretically principled framework for joint VAE+diffusion training with tractable score matching and proper variational bounds. However, despite superior theory, LSGM’s engineering complexity, including spectral regularization, careful hyperparameter tuning and variance reduction, limited practical adoption.

Rombach et al.’s Latent Diffusion Models (LDMs) ¹⁶ simplified this dramatically. Rather than joint end-to-end training, LDMs adopted a two-stage design: first, a VAE compresses images into lower-dimensional latents (typically $256 \times 256 \times 3 \to 32 \times 32 \times 4$ ); second, diffusion operates in this latent space. A key insight of the LDM paper—and where it fundamentally differs from LSGM—is that during autoencoding the model only encodes perceptually relevant details in the latent space, but not all fine-grained ultra high entropy information (like texture details). This pixel-level detail is re-generated in the decoder. This comes from incorporating a patch-based discriminator loss (discard local details and regenerate them) as well as the LPIPS loss (reconstruct in a feature space of perceptually relevant features) in addition to the regular MSE loss. The LDM paper calls this semantic vs. perceptual compression (see Fig. 2 in ¹⁶). This is drastically different from LSGM: in LSGM, all high-entropy local details are encoded in latent space—the same information that regular pixel-space DDPM models must model—which is why perhaps LSGM took longer to train and required bigger models in the latent space. LDM’s smart compression scheme for visual signals makes things way more scalable for image and video data. For a more in-depth discussion on latent diffusion models, see Sander Dieleman’s excellent blog ¹⁷. This simplified approach produced better perceptual quality and more stable training, allowing extension of the approach to other modalities like video generation ¹⁸¹⁹.
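To make the compression concrete, here is the arithmetic for the shapes quoted above. The helper name is my own and not something from the LDM paper; it simply counts scalar values before and after encoding.

```python
from math import prod

def compression_ratio(pixel_shape, latent_shape):
    """Ratio of scalar values in the image to scalar values in the latent."""
    return prod(pixel_shape) / prod(latent_shape)

# Shapes quoted in the text: a 256x256 RGB image and a 32x32 latent with 4 channels.
ratio = compression_ratio((256, 256, 3), (32, 32, 4))
spatial_downsampling = 256 // 32  # per-side downsampling factor of the encoder

print(f"values per image: {prod((256, 256, 3))}, values per latent: {prod((32, 32, 4))}")
print(f"compression ratio: {ratio}, spatial downsampling: {spatial_downsampling}x")
```

The diffusion model therefore sees far fewer values per sample than a pixel-space model would, which is where most of the compute saving comes from.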

The modular two-stage approach provided significant advantages: VAEs pretrained once could be reused across different diffusion models, researchers could iterate independently on each component, and pretrained autoencoders from other work could be directly incorporated. This modularity accelerated research and deployment and enabled breakthroughs like Stable Diffusion XL ²⁰. However, as subsequent sections discuss, this separation between tokenizer and generative model is now being reconsidered.

Peebles and Xie’s Diffusion Transformer (DiT) ²¹ demonstrated that transformers could replace U-Nets, achieving state-of-the-art ImageNet generation with favorable scaling. DiT operates on latent patches, treating them as sequences like Vision Transformers. A key finding: model complexity correlates strongly with sample quality—increasing depth, width, or tokens consistently improves generation. The largest DiT-XL/2 model established transformers as scalable alternatives for diffusion, serving as the baseline against which subsequent alignment methods would be measured.

Recent developments have also explored alternative generative paradigms. Flow matching ⁴ provides a simulation-free approach to training continuous normalizing flows with conceptually simpler and more flexible formulations than standard diffusion. The relationship between diffusion and flow matching has been clarified ²², showing they are fundamentally equivalent under certain conditions, differing primarily in parameterization and sampling schedules: a well-tuned diffusion model can work just as well, and flow matching can equally be set up with suboptimal paths. The popularity of flow matching likely stems more from its conceptual simplicity than from clear performance advantages. Rectified flow transformers trained with this kind of objective have been successfully scaled to production systems ²³.
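As a rough illustration of why flow matching is often described as conceptually simple, here is a minimal NumPy sketch of how one training pair could be built. It assumes a straight-line path between noise and data with the convention that t=0 is noise and t=1 is data; the function names are hypothetical, and real implementations differ in time sampling and parameterization.

```python
import numpy as np

def flow_matching_pair(x_data, rng):
    """Build (x_t, t, target velocity) for a linear path from noise to data.

    Assumed convention: x_t = (1 - t) * noise + t * data.
    """
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform(size=(x_data.shape[0], 1))   # one time value per sample
    x_t = (1.0 - t) * noise + t * x_data
    target_velocity = x_data - noise             # time derivative of x_t on this path
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
x_data = rng.standard_normal((8, 4))             # toy batch standing in for latents
x_t, t, target_velocity = flow_matching_pair(x_data, rng)

# Following the target velocity for the remaining time (1 - t) lands on the data.
reconstructed = x_t + (1.0 - t) * target_velocity
```

No ODE has to be simulated during training: the regression target is available in closed form for every sampled time.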

Representation Learning: self-supervised vision foundation models

Self-supervised learning aims to learn general-purpose visual representations without manual labels, enabling models to exploit vast unlabeled corpora ²⁴²⁵²⁶²⁷²⁸²⁹. Early approaches were largely contrastive: they defined positive and negative pairs and trained encoders so that positives map to nearby features while negatives are pushed apart ²⁴²⁵³⁰. Subsequent work progressively weakened the dependence on labels, explicit negatives, and even pixel-level reconstruction, moving toward architectures that predict high-level, semantic representations ⁷³¹³²²⁸³³.

A natural starting point for self-supervised representation learning is cross-modal contrastive learning, where aligned pairs provide supervision “for free.” CLIP jointly trains image and text encoders so that the similarity between matching image–caption pairs is maximized and that between mismatched pairs is minimized, using a large-scale contrastive objective over Internet-scale image-text datasets ⁶¹⁹²³. This removes the need for class labels but depends on enormous amounts of paired data, and on sufficiently many negatives in each batch to avoid trivial solutions where the model encodes only coarse semantics ³⁴.

SimCLR showed that contrastive learning can work in a purely single-modal setting ³⁰. Two heavily augmented views of the same image form a positive pair, and all other images in the batch serve as negatives. Combined with strong data augmentation, a temperature-scaled InfoNCE loss, and large encoder capacity, SimCLR achieves supervised-level performance on ImageNet, demonstrating that labels are not strictly necessary for high-quality features. However, as with most contrastive learning methods including the ones described before, this comes at the cost of extremely large batch sizes, which are needed to provide enough negative examples so that the contrastive loss encourages fine-grained, non-trivial representations rather than collapsing to coarse global features ²⁴.
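The role of in-batch negatives is easier to see in code. The following is a simplified, one-directional sketch of a temperature-scaled InfoNCE loss in NumPy; SimCLR's actual objective is symmetrized over both views, and the temperature value here is only illustrative.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Temperature-scaled InfoNCE over a batch of paired embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented views of image i;
    every other row of z_b acts as a negative for z_a[i].
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                     # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))              # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))
aligned_loss = info_nce_loss(z, z)                         # views agree
shuffled_loss = info_nce_loss(z, np.roll(z, 1, axis=0))    # positives are wrong
```

Because the softmax denominator runs over the batch, the number of negatives is tied to the batch size, which is the scaling issue the text describes.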

SwAV improves on this regime by replacing explicit pairwise comparisons with online clustering ²⁶²⁷. Instead of contrasting features directly, SwAV assigns representations to prototype clusters and enforces consistency of these assignments across multiple augmentations of the same image. This “swapped prediction” mechanism preserves many advantages of contrastive learning while being more memory-efficient and less sensitive to batch size, making it easier to scale to large datasets and long training schedules.

Vision–Language Models and Sigmoid Contrastive Losses

Vision-language pretraining extends contrastive learning to cross-modal settings at scale. CLIP demonstrated that large-scale image-text contrastive learning yields highly transferable visual representations and strong zero-shot performance across tasks ⁶¹⁹. However, CLIP’s softmax-based loss ties batch size directly to the number of effective negatives, which complicates scaling and makes training expensive.

SigLIP addresses this by replacing the softmax contrastive loss with a pairwise sigmoid loss over image-text similarities ³⁴. This loss operates independently on each pair, enabling smaller batch sizes while still learning strong fine-grained alignments between images and text. SigLIP 2 further augments this recipe by combining contrastive training with captioning-style objectives, self-supervised losses, and improved data mixtures, leading to better semantic understanding, localization, and dense prediction performance ³⁵.
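To show what "operates independently on each pair" means, here is a small NumPy sketch of a pairwise sigmoid loss. In SigLIP the temperature and bias are learned; the fixed defaults below are illustrative assumptions, and the function name is my own.

```python
import numpy as np

def sigmoid_pairwise_loss(img, txt, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    Each pair is an independent binary problem: label +1 for matching
    pairs (the diagonal) and -1 for every other combination.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = temperature * (img @ txt.T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return float(per_pair.sum() / len(img))

rng = np.random.default_rng(0)
emb = rng.standard_normal((16, 8))
matched_loss = sigmoid_pairwise_loss(emb, emb)
mismatched_loss = sigmoid_pairwise_loss(emb, np.roll(emb, 1, axis=0))
```

There is no normalization across the batch, so each term can be computed without knowing the other pairs, unlike the softmax loss used by CLIP.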

Self-Distillation and Momentum Encoders

A key limitation of contrastive methods is their reliance on negatives. BYOL and related methods showed that it is possible to dispense with explicit negatives by using a momentum-updated teacher network ³⁶³⁷. The student is trained to match the teacher’s representation of a differently augmented view of the same image; the teacher parameters are an exponential moving average of the student’s, which stabilizes training and prevents collapse in practice.
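The exponential moving average itself is a one-line update. The sketch below uses plain Python dictionaries of scalars as stand-ins for network parameters; the parameter names and the momentum value are illustrative, not taken from BYOL.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved a small step toward the student."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy parameters: the student is fixed here, so the teacher drifts toward it.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
print(teacher)
```

In actual training the student changes at every step through gradient descent, and the teacher receives no gradients at all; it only follows this slow average.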

DINO extends this self-distillation paradigm and reveals several surprising properties of the resulting representations ⁷. Without labels or negatives, DINO learns features whose attention maps correspond to object boundaries and support unsupervised semantic segmentation, indicating non-trivial semantic organization. In principle, such momentum-encoder methods require only a single image per batch, since supervision comes from matching teacher and student outputs rather than contrasting with other samples.

DINOv2 scales this recipe with larger Vision Transformers, improved optimization, and a carefully curated, diverse training set ³¹. The resulting models produce highly robust and transferable features that rival or surpass supervised pretraining across many benchmarks, as well as serving as strong vision foundation encoders for downstream tasks, including generative modeling ³⁸³⁹. However, prolonged self-distillation can gradually erode fine-grained spatial information, especially in dense feature maps used for pixel-level tasks.

To address this, DINOv3 introduces Gram anchoring, a regularization that stabilizes dense feature representations over long training schedules by constraining second-order statistics across patches and scales ³². This mitigates the tendency of self-distillation to over-smooth features, preserving detailed structure that is crucial for dense prediction and generative tokenization while maintaining the semantic strengths of the DINO family.

Masked Image Modeling and Predictive Architectures

In parallel, masked image modeling treats images analogously to masked language modeling in NLP. MAE masks a large fraction of image patches (typically around 75%) and trains an asymmetric encoder-decoder architecture to reconstruct the missing pixels ⁸. This forces the encoder to focus on global structure rather than local texture, producing efficient representations that work well for many downstream tasks with modest finetuning.
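A minimal sketch of the masking step, assuming a 14x14 patch grid and a hypothetical embedding width of 768 (neither is stated above). Only the visible patches are passed to the encoder, which is what makes the asymmetric design cheap.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Split patch indices into a visible set and a masked set."""
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    order = rng.permutation(num_patches)
    keep_idx = np.sort(order[:num_keep])
    mask_idx = np.sort(order[num_keep:])
    return keep_idx, mask_idx

rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))   # 14x14 patches, toy embeddings
keep_idx, mask_idx = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

visible = patches[keep_idx]     # encoder input: a quarter of the patches
targets = patches[mask_idx]     # what the decoder is asked to reconstruct
```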

iBOT combines masked prediction with self-distillation, using a teacher network as an online tokenizer that predicts semantic tokens for masked patches instead of raw pixels ⁴⁰. This hybrid objective closes much of the gap between contrastive and masked modeling approaches, yielding representations that perform strongly on both image-level classification and dense prediction.

Joint-embedding predictive architectures such as I-JEPA take a more explicitly semantic view: instead of reconstructing pixels, they predict high-level latent representations of masked regions from visible context ²⁸. By operating entirely in representation space, I-JEPA avoids over-emphasizing low-level details and focuses learning on abstract structure, leading to scalable training and strong transfer across tasks.

Toward a Platonic Representation and Implications for Generative Models

Recent work from Phillip Isola’s lab at MIT has provided empirical evidence for a remarkable phenomenon: representations learned by different models, architectures, and even modalities converge toward a shared structure as models scale and training data diversifies ²⁹. This convergent behavior has motivated the Platonic Representation Hypothesis, which posits that as models grow in capacity and are trained on increasingly rich data, their internal representations converge toward a shared statistical model of reality, a “platonic” representation that is largely independent of any specific task or architecture ²⁹.

The evidence for this convergence comes from multiple angles. The foundational work demonstrates that features from independently trained vision and language models become more aligned as scale and data diversity increase, and that different self-supervised objectives yield embeddings that occupy similar subspaces up to simple linear transformations ²⁹. Subsequent research has shown that cross-modal training can benefit each modality individually: Gupta et al. demonstrate that leveraging unpaired multimodal data (e.g., text, audio, or images) consistently improves downstream performance in unimodal tasks, exploiting the assumption that different modalities are projections of a shared underlying reality ⁴¹. Perhaps most strikingly, Wang et al. show that when language models are prompted with sensory instructions (e.g., “see” or “hear”), their representations become more similar to specialist vision and audio encoders, revealing that text-only models implicitly encode multimodal structure that can be activated through appropriate prompting ⁴². This suggests that even purely text-trained language models converge toward similar representations as vision models, with the convergence becoming stronger as models scale ²⁹⁴².

This hypothesis has direct implications for generative modeling. If discriminative vision foundation models such as DINOv2, DINOv3, and MAE converge toward an approximately optimal visual representation, then explicitly leveraging these encoders can accelerate the training and improve the quality of generative models that would otherwise have to discover similar structures from scratch ³¹³⁸³⁹⁴³. Recent work on aligning diffusion models to pretrained visual encoders—through feature alignment, representation regularization, or joint training of tokenizers and generators—can thus be viewed as an attempt to steer generative models toward this platonic representation early in training ⁴⁴⁴⁵⁴⁶⁴⁷⁴⁸⁴⁹. This perspective sets the stage for our discussion of Phase 1 methods that explicitly align diffusion features to vision foundation models, and for later sections analyzing the emerging convergence between generative and discriminative representations.

Overview of the Four Phases

Before diving into the details, let me briefly outline the four phases and what to expect. Phase 1 introduces representation alignment—regularizing diffusion features to match pretrained vision encoders like DINOv2. Phase 2 takes this deeper by incorporating semantic structure into the VAE latent space itself. Phase 3 questions whether we need VAE compression at all, proposing to diffuse directly in pretrained representation spaces. Phase 4 represents a parallel evolution that has been ongoing throughout: improving pixel-space diffusion through architectural innovation, questioning whether pretrained representations are necessary at all.

A key insight that will emerge: spatial structure alignment matters more than global semantic information for generation quality. Methods that preserve local self-similarity patterns consistently outperform those optimizing for classification accuracy. Additionally, while these techniques are presented in the context of Latent Diffusion Models (where most scaling happens), many of the core ideas—representation alignment, semantic regularization—are general and apply equally to pixel-space methods.

It’s also worth noting that in its most general form, the “representation space” we generate from can be pure noise, as in standard pixel-space diffusion or GANs. Even in latent models, all generation ultimately originates from noise—the question is what structure we impose on the intermediate representations.

Beyond these four phases, we’ll also explore the reverse direction: how generative modeling itself can serve as a pretraining objective for learning discriminative representations. This bidirectional relationship—representations helping generation, and generation producing useful representations—suggests these paradigms may be more unified than historically assumed.

Phase 1: Aligning Diffusion Features to Vision Foundation Models

The first wave at the end of 2024/start of 2025 recognized that diffusion models learn semantically meaningful representations during training, but more slowly and less effectively than specialized discriminative models ⁵⁰⁵¹. The solution: align intermediate diffusion features with pretrained vision encoders to guide training.

REPA ⁴⁴ introduced this paradigm in October 2024 through straightforward regularization. The method extracts features from intermediate diffusion layers, projects them through small MLPs, and maximizes cosine similarity with frozen DINOv2 encoder features. This auxiliary loss complements standard denoising:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diffusion}} + \lambda \mathcal{L}_{\text{align}}

The paper builds upon insights from earlier work showing that diffusion models learn discriminative representations during denoising, but it takes the critical step of showing that aligning these emerging representations with high-quality pretrained features accelerates convergence. Longer training improves the weak natural alignment on its own, but the REPA loss strengthens this alignment from the start, leading to better representations and better generation, a dual benefit suggesting a genuinely helpful inductive bias.
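The alignment term can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions rather than the authors' implementation: the projection is a two-layer MLP with a ReLU, the feature widths are made up, and the weight `lam` is an arbitrary example value.

```python
import numpy as np

def mlp_project(hidden, w1, w2):
    """Small two-layer MLP mapping diffusion features to the encoder width."""
    return np.maximum(hidden @ w1, 0.0) @ w2

def cosine_alignment_loss(projected, target, eps=1e-8):
    """Negative mean patch-wise cosine similarity, so lower is better."""
    p = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(-np.mean(np.sum(p * t, axis=-1)))

def total_loss(diffusion_loss, projected, target, lam=0.5):
    """L_total = L_diffusion + lambda * L_align."""
    return diffusion_loss + lam * cosine_alignment_loss(projected, target)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((256, 32))    # toy intermediate diffusion features, one row per patch
target = rng.standard_normal((256, 48))    # stand-in for frozen encoder features
w1 = rng.standard_normal((32, 64)) * 0.1   # hypothetical projection weights
w2 = rng.standard_normal((64, 48)) * 0.1

projected = mlp_project(hidden, w1, w2)
loss = total_loss(diffusion_loss=0.8, projected=projected, target=target)
```

The encoder features are treated as constants, so gradients from the alignment term only reach the projection and the diffusion layers below the tap point.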

Fig 6. Left shows baseline LDM architecture. Right shows REPA with alignment loss from intermediate diffusion features to frozen DINOv2 representations, speeding early training through semantic guidance.

REG ⁴⁵ extended this by entangling semantic class tokens with latent content during denoising. Rather than just aligning intermediate features, REG concatenates the [CLS] token from frozen DINOv2 with noisy latents, training the diffusion model to jointly reconstruct noise and original [CLS] token. This minimal overhead (single token, $<0.5\%$ FLOPs increase) provides stronger guidance than feature alignment alone. Interestingly, class token concatenation helps substantially even without explicit REPA alignment, though combining both works best—suggesting multiple mechanisms for incorporating semantic structure can be complementary.
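The following sketch only illustrates the concatenation step and the resulting sequence-length overhead; how the class token is noised and supervised is left out. The token count of 256 assumes a 32x32 latent with patch size 2, and the feature widths and projection matrix are hypothetical.

```python
import numpy as np

def concat_class_token(noisy_latent_tokens, cls_token, w_proj):
    """Append a linearly projected class token to the latent token sequence."""
    cls_as_token = (cls_token @ w_proj)[None, :]            # shape (1, D)
    return np.concatenate([noisy_latent_tokens, cls_as_token], axis=0)

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((256, 64))   # toy noisy latent tokens
cls_token = rng.standard_normal(768)             # stand-in for the encoder's [CLS] feature
w_proj = rng.standard_normal((768, 64)) * 0.02   # hypothetical projection to token width

sequence = concat_class_token(latent_tokens, cls_token, w_proj)
extra_tokens = sequence.shape[0] / latent_tokens.shape[0] - 1.0
print(f"sequence length {sequence.shape[0]}, relative increase {extra_tokens:.2%}")
```

One extra token in a sequence of a few hundred is a rough proxy for why the added compute stays so small.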

Fig 7. REG concatenates the [CLS] token from frozen DINOv2 with noisy latents, enabling joint reconstruction of both image content and semantic class information directly from pure noise.

HASTE ⁴⁸ addressed a key REPA limitation: alignment helps dramatically early but can plateau or degrade later. Once the generative model begins modeling the full data distribution, the lower-dimensional discriminative teacher becomes a constraint rather than a guide. The discriminative encoder focuses on task-relevant semantics while discarding generative details; forcing continued alignment may prevent learning the distribution’s full complexity. HASTE introduces two-phase training: Phase I simultaneously distills attention maps (relational priors) and feature projections like REPA (semantic anchors) for rapid initial convergence. Phase II terminates alignment at a predetermined iteration, freeing the model to exploit its generative capacity. This simple modification achieves dramatic acceleration, with the key insight that alignment is most valuable for initial structure but counterproductive once basic semantic organization is learned.
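The two-phase schedule amounts to a step function on the alignment weight. A minimal sketch, with hypothetical function names and an arbitrary stopping step:

```python
def alignment_weight(step, stop_step, base_weight=1.0):
    """Alignment loss weight: constant during Phase I, zero from stop_step on."""
    return base_weight if step < stop_step else 0.0

def training_loss(step, diffusion_loss, align_loss, stop_step):
    """Total loss with the alignment term switched off after stop_step."""
    return diffusion_loss + alignment_weight(step, stop_step) * align_loss

# Before the cutoff both terms contribute; afterwards only the diffusion loss remains.
early = training_loss(step=0, diffusion_loss=1.0, align_loss=0.5, stop_step=100)
late = training_loss(step=100, diffusion_loss=1.0, align_loss=0.5, stop_step=100)
print(early, late)
```

In practice one would also skip the teacher's forward pass once the weight is zero, since it no longer affects training.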

Fig 8. Phase I applies holistic alignment distilling both attention maps and features. Phase II terminates alignment one-shot at a fixed iteration, freeing the diffusion model to model the full distribution without the discriminative teacher constraint.

Several puzzling trends emerged in representation alignment that defied conventional understanding. Larger model variants within the same encoder family often led to similar or even worse generation performance despite higher ImageNet-1K accuracy: DINOv2’s larger variants showed diminishing returns, while PE and C-RADIO exhibited this counterintuitive pattern even more starkly ⁴⁴. More strikingly, representations with dramatically higher global semantic understanding consistently underperformed: PE-Core-G (82.8% ImageNet accuracy) generated worse images than PE-Spatial-B (53.1% accuracy), and SAM2-S achieved strong generation performance despite only 24.1% ImageNet accuracy, roughly 60 percentage points lower than many competing encoders. Perhaps most revealing, controlled experiments showed that explicitly injecting global information through CLS token mixing improved linear probing accuracy from 70.7% to 78.5% while simultaneously degrading generation quality, with FID worsening from 19.2 to 25.4.

iREPA’s analysis in December 2025⁵² resolved these contradictions by demonstrating that spatial structure—the self-similarity patterns between patch tokens—not global semantics, drives representation alignment effectiveness. To quantify this insight, the authors measured spatial self-similarity structure ⁵³ across patch tokens and performed large-scale correlation analysis across 27 vision encoders and three model sizes. Spatial structure metrics exhibited remarkably strong correlation with generation FID (Pearson |r| > 0.852 for metrics like Local Distance Similarity, Short-Range Spatial Similarity, Cosine Distance Similarity, and Relative Mean Spatial Contrast), far exceeding ImageNet-1K accuracy’s predictive power (|r| = 0.26). This explained SAM2’s paradoxical success: despite poor classification accuracy, it maintained strong spatial structure that proved ideal for generation. The authors then took this further and made two small modifications to the REPA recipe: by replacing standard MLP projection with convolutional layers that preserve local spatial relationships and implementing spatial normalization to accentuate relational structure transfer, iREPA (implemented in fewer than 4 lines of code) consistently improves convergence speed across diverse encoders, model sizes, and training recipes including REPA, REPA-E (more on this later), and MeanFlow (a few-step training method). This aligns with HASTE’s emphasis on attention distillation: the success lies in teaching spatial organization coherence rather than transferring high-level semantic concepts.
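To give a feel for what "spatial self-similarity between patch tokens" means, here is a generic NumPy sketch. The contrast score below is my own simplified stand-in and not one of the metrics named in the paper: it compares how similar adjacent patches are to how similar non-adjacent patches are.

```python
import numpy as np

def patch_self_similarity(tokens, eps=1e-8):
    """Cosine similarity between every pair of patch tokens, shape (N, N)."""
    normed = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    return normed @ normed.T

def local_vs_distant_contrast(tokens, grid_h, grid_w):
    """Mean similarity of adjacent patches minus that of non-adjacent patches."""
    sim = patch_self_similarity(tokens)
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    dist = np.maximum(np.abs(rows[:, None] - rows[None, :]),
                      np.abs(cols[:, None] - cols[None, :]))
    return float(sim[dist == 1].mean() - sim[dist > 1].mean())

# Toy feature maps on an 8x8 grid: one varies smoothly with position, one is random.
grid = 8
r, c = np.divmod(np.arange(grid * grid), grid)
smooth_tokens = np.stack([np.cos(0.4 * r), np.sin(0.4 * r),
                          np.cos(0.4 * c), np.sin(0.4 * c)], axis=1)
random_tokens = np.random.default_rng(0).standard_normal((grid * grid, 4))

smooth_contrast = local_vs_distant_contrast(smooth_tokens, grid, grid)
random_contrast = local_vs_distant_contrast(random_tokens, grid, grid)
```

An encoder can score well on a measure like this while carrying little class information, which is the kind of decoupling the SAM2 result points to.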

These findings are quite intuitive when you think about what generative models need to do: they must model all spatial structure in detail, which is expensive to discover from scratch. In contrast, global semantic understanding is more relevant for classification than for pixel-level generation. A model that knows “this is a dog” but doesn’t understand the spatial relationships between patches will generate poorly, while a model with strong spatial coherence but weaker global semantics can still produce coherent images.

Another angle to think about why alignment helps is that the diffusion training objective is inherently high variance: at each iteration, we present a noisy input $\mathbf{x}_t$ and ask the model to predict $\mathbf{x}_0$ , but the optimal prediction is not any single $\mathbf{x}_0$ seen during training—it’s the expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ over all possible clean images consistent with that noisy input ⁵⁴. At high noise levels, this expectation corresponds roughly to the mean of the entire dataset. The model must learn to implicitly average over many possible reconstructions, but it only ever sees individual samples as supervision. This mismatch between what we supervise (samples) and what we want (expectations) makes the objective noisy and slows representation learning.
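A one-dimensional toy example makes the sample-versus-expectation gap visible. It assumes a three-point dataset and the simple corruption x_t = x_0 + sigma * noise (no signal scaling), which is a simplification of the schedules used in practice.

```python
import math

def posterior_mean(x_t, dataset, sigma):
    """E[x_0 | x_t] for x_t = x_0 + sigma * noise, with x_0 uniform over `dataset`."""
    log_w = [-((x_t - x0) ** 2) / (2.0 * sigma ** 2) for x0 in dataset]
    shift = max(log_w)                                   # avoid underflow in exp
    weights = [math.exp(lw - shift) for lw in log_w]
    return sum(w * x0 for w, x0 in zip(weights, dataset)) / sum(weights)

dataset = [-2.0, 0.5, 4.5]
dataset_mean = sum(dataset) / len(dataset)

low_noise = posterior_mean(0.6, dataset, sigma=0.1)      # dominated by the nearest point
high_noise = posterior_mean(0.6, dataset, sigma=100.0)   # close to the dataset mean
print(low_noise, high_noise, dataset_mean)
```

The regression target at high noise is an average that never appears as a training sample, so the network has to infer it from many noisy draws.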

Representation alignment methods like REPA address this by providing a low-variance auxiliary signal. The pretrained vision encoder already captures useful spatial and semantic structure through objectives that don’t suffer from this sample-vs-expectation mismatch. By aligning to these representations, we essentially provide the denoiser with a shortcut to the internal structure it needs for effective denoising, bypassing the slow process of discovering this structure through the high-variance diffusion objective alone.

In summary, Phase 1 establishes a clear pattern: using pretrained vision foundation models through representation alignment dramatically accelerates diffusion training. However, these methods operate at the level of intermediate diffusion features, leaving the VAE latent space unchanged. Phase 2 takes the logical next step: incorporating semantic structure into the latent space itself.

Phase 2: Aligning the VAE Latent Space to Foundation Models

While Phase 1 aligned intermediate diffusion features, Phase 2 recognized that the latent space itself—the compressed VAE representation—could incorporate semantic structure from vision foundation models. This deeper integration addresses the fundamental trade-off between reconstruction quality and learnability of the latent distribution.

Fig 9. LSGM (2021) aims for smooth trajectories in latent space by normalizing distributions. LDM (2022) emphasizes highly compressed latents for computational efficiency. EQ-VAE and VA-VAE tackle the trade-off: improving encoder equivariance and aligning latent encodings with pretrained models to create learnable high-dimensional spaces. Image kindly adapted from Arash Vahdat.

Standard LDM treats VAE and diffusion training as independent, with the VAE optimized solely for pixel reconstruction (and perceptual quality by auxiliary losses relying on discriminators or metrics like LPIPS ¹⁷). This pixel-focused objective produces latents encoding low-level details effectively but lacking semantic structure. Increasing latent dimensionality improves reconstruction but creates higher-dimensional, more complex spaces for diffusion to learn—an “optimization dilemma” where better reconstruction leads to harder generation.

Fig 10. VAE encoder trains with both reconstruction loss and alignment loss to frozen VFM features, creating latents that are both reconstructive and semantically meaningful for efficient diffusion model training.

VA-VAE ⁴⁷ directly tackles this: rather than relying solely on pixel reconstruction, it aligns the VAE’s latent space with pretrained vision foundation models during tokenizer training, using what the authors call the VF loss:

\mathcal{L}_{\text{VA-VAE}} = \mathcal{L}_{\text{recon}} + \beta \cdot \text{KL} + \lambda_{\text{align}} \mathcal{L}_{\text{VF}}.

\mathcal{L}_{\text{VF}} = w_{\text{hyper}} \, w_{\text{adaptive}} \Big( \mathcal{L}_{\text{mcos}} + \mathcal{L}_{\text{mdms}} \Big).

with

\mathcal{L}_{\text{mcos}} = \frac{1}{h w} \sum_{i=1}^{h} \sum_{j=1}^{w} \operatorname{ReLU} \!\left( 1 - m_{1} - \frac{z'_{ij} \cdot f_{ij}} {\lVert z'_{ij}\rVert \,\lVert f_{ij}\rVert} \right),

\mathcal{L}_{\text{mdms}} = \frac{1}{N^{2}} \sum_{i,j} \operatorname{ReLU} \left( \left| \frac{z_i \cdot z_j}{\lVert z_i\rVert \,\lVert z_j\rVert} - \frac{f_i \cdot f_j}{\lVert f_i\rVert \,\lVert f_j\rVert} \right| - m_{2} \right),

where $z$ denotes latent vectors from the VAE latent feature map and $f$ denotes vectors from the frozen vision foundation feature map for the same image; $Z' = WZ$ linearly projects VAE latents to the VFM feature dimension; $z'_{ij}$ and $f_{ij}$ are the projected latent and VFM feature at spatial position $(i,j)$ ; $z_i, f_i$ are vectors at position $i$ after flattening the $h \times w$ grid into $N = h w$ tokens; $m_1$ and $m_2$ are cosine-similarity margins; $\mathcal{L}_{\text{mcos}}$ enforces pointwise alignment, $\mathcal{L}_{\text{mdms}}$ aligns pairwise relational structure; $w_{\text{adaptive}} = \lVert\nabla \mathcal{L}_{\text{recon}}\rVert / \lVert\nabla \mathcal{L}_{\text{VF,raw}}\rVert$ rescales VF gradients to match the reconstruction loss, and $w_{\text{hyper}}$ is a user-set scalar (e.g., $0.1$ ) controlling the overall VF strength.

The VF loss encourages both point-by-point alignment (individual latent vectors close to VFM features) and relative alignment (relationships between latents match relationships between features), using adaptive weighting similar in spirit to loss balancing in GANs ⁵⁵. This yields semantically organized high-dimensional latent spaces that retain reconstruction quality while being more learnable for downstream generative models.
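To make the two terms concrete, here is a minimal pure-Python sketch of the pointwise and pairwise losses as written in the formulas above. It is an illustration under stated assumptions, not the VA-VAE implementation: the function names are mine, the token grid is assumed to be already flattened into a list of vectors, and `w_adaptive` is passed in as a number rather than computed from gradient norms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mcos_loss(z_proj, f, m1):
    """Pointwise term: hinge on (1 - m1 - cos) for each projected latent / VFM feature pair."""
    terms = [max(0.0, 1.0 - m1 - cosine(zp, fv)) for zp, fv in zip(z_proj, f)]
    return sum(terms) / len(terms)

def mdms_loss(z, f, m2):
    """Pairwise term: hinge on the gap between the two cosine-similarity matrices."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gap = abs(cosine(z[i], z[j]) - cosine(f[i], f[j]))
            total += max(0.0, gap - m2)
    return total / (n * n)

def vf_loss(z, z_proj, f, m1, m2, w_hyper, w_adaptive):
    """Weighted sum of both terms; w_adaptive is supplied by the caller in this sketch."""
    return w_hyper * w_adaptive * (mcos_loss(z_proj, f, m1) + mdms_loss(z, f, m2))
```

Note that the pairwise term only compares similarity matrices, so it can use the unprojected latents even when their dimension differs from that of the VFM features, while the pointwise term needs the projection $Z' = WZ$.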

By semantically structuring the VAE’s latent space, VA-VAE reduces the diffusion model’s burden, allowing it to focus on learning the distribution rather than also discovering semantic organization. The latent space provides an appropriate inductive bias: semantic structure is “baked in” through VFM alignment while pixel details are captured through reconstruction.

Fig 11. Left shows stage-wise training with frozen VAE. Right shows REPA-E with careful gradient flow: alignment loss flows to both components, diffusion loss uses stop-gradient on VAE encoder, and VAE receives alignment gradients through BatchNorm for latent normalization.

With both diffusion-model alignment and VAE alignment demonstrated, REPA-E ⁴⁶ takes integration further through joint VAE+diffusion training, challenging the convention that these components should be trained separately. It shows that naive end-to-end training with the diffusion loss alone is ineffective (it causes latent space collapse), but that representation alignment provides the constraints needed for successful joint optimization. The key innovation is careful gradient control. The alignment loss flows to both the VAE and the diffusion model, while the diffusion loss uses a stop-gradient on the VAE encoder to prevent collapse (the VAE should not change to make diffusion easier at the cost of reconstruction). In addition, to keep the latent space normalised, the VAE receives alignment gradients only through a BatchNorm normalisation layer. This enables joint optimization: the VAE improves to produce latents that are both reconstructive and well aligned, while the diffusion model learns in this evolving but stable space. Joint optimization also improves the VAE itself, leading to better latent structure (higher VFM alignment, better class separation) and better downstream performance. Whereas LSGM ¹⁵ required pretraining the VAE, true end-to-end training is now possible.
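The gradient routing can be summarised in one line. This is a schematic in my own notation rather than REPA-E's exact objective: $\theta$ denotes the diffusion parameters, $\phi$ the VAE parameters, $z_\phi$ the BatchNorm-normalised latents, $\operatorname{sg}[\cdot]$ the stop-gradient operator, and the weights $\lambda_{\text{REPA}}$ and $\lambda_{\text{VAE}}$ as well as the presence of a separate VAE term are assumptions made for illustration:

\mathcal{L}_{\text{joint}}(\theta, \phi) = \mathcal{L}_{\text{diff}}\big(\theta;\, \operatorname{sg}[z_\phi]\big) + \lambda_{\text{REPA}} \, \mathcal{L}_{\text{align}}(\theta, \phi) + \lambda_{\text{VAE}} \, \mathcal{L}_{\text{VAE}}(\phi).

Because of the stop-gradient, $\partial \mathcal{L}_{\text{diff}} / \partial \phi = 0$, so the VAE is only ever updated by the alignment term and its own reconstruction and regularisation losses, never by the incentive to make denoising trivially easy.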

3-Stage-Aligner ⁵⁶ proposes an alternative strategy: rather than training the VAE from scratch with alignment, freeze a pretrained encoder (e.g., DINOv2), map its features into a low-dimensional space via an adapter block, and learn to align a decoder through three stages. Stage 1 (Latent Alignment) freezes the VFM encoder and trains the adapter plus decoder, establishing a semantic latent space with basic reconstruction capabilities. The resulting latents are semantically grounded but exhibit color shifts and missing fine-grained details, since the frozen encoder was not trained for reconstruction. Stage 2 (Perceptual Alignment) jointly optimizes the adapter and the now-unfrozen encoder, with a semantic preservation loss that maintains alignment with the original frozen VFM features:

\mathcal{L}_{\text{Stage 2}} = \mathcal{L}_{\text{recon}} + \lambda_2 \lVert\text{Enc}(x) - \text{Enc}_{\text{frozen}}(x) \rVert^2

The L2 loss prevents encoder drift from the pretrained semantic structure while allowing capture of fine-grained color and texture. Stage 3 (Decoder Refinement) freezes both encoder and adapter, allowing the decoder to better exploit the latent representation changed during Stage 2 without disturbing semantic structure.
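The staging can be written down as a small schedule of which components receive gradients, plus the Stage 2 loss from the formula above. This is a sketch under assumptions: the names `STAGE_TRAINABLE`, `trainable_components`, and `stage2_loss` are hypothetical, and training the decoder during Stage 2 is my reading, since the text above only names adapter and encoder for that stage.

```python
# Which components are trainable (True) or frozen (False) in each stage.
STAGE_TRAINABLE = {
    1: {"encoder": False, "adapter": True, "decoder": True},   # latent alignment
    2: {"encoder": True, "adapter": True, "decoder": True},    # decoder=True is an assumption
    3: {"encoder": False, "adapter": False, "decoder": True},  # decoder refinement
}

def trainable_components(stage):
    """Return the sorted names of the components updated in the given stage."""
    return sorted(name for name, on in STAGE_TRAINABLE[stage].items() if on)

def stage2_loss(recon_loss, enc_out, enc_frozen_out, lambda_2):
    """Reconstruction loss plus squared L2 drift from the frozen encoder's features."""
    drift = sum((a - b) ** 2 for a, b in zip(enc_out, enc_frozen_out))
    return recon_loss + lambda_2 * drift
```

With `lambda_2 = 0` Stage 2 reduces to unconstrained fine-tuning, and as `lambda_2` grows it approaches the frozen-encoder behaviour of Stage 1, which is the trade-off the three stages are designed to navigate.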

Fig 12. Stage 1 establishes semantic grounding with frozen encoder. Stage 2 allows encoder refinement with semantic preservation loss. Stage 3 optimizes decoder for reconstruction quality, carefully balancing semantic preservation with fine-grained detail capture.

This yields semantically rich tokenizers whose latent space inherits discriminative structure from the pretrained encoder. The three-stage process carefully balances semantic preservation (maintaining VFM structure) with reconstruction quality (capturing fine-grained details), avoiding both the color shifts of purely frozen encoders and the semantic drift of fully unconstrained fine-tuning. Note the contrast with REPA-E’s end-to-end approach: 3-Stage-Aligner returns to explicit staged training rather than joint optimization. This reflects a broader pattern in Phase 2: there is a zoo of different methods (end-to-end vs. staged, joint vs. separate alignment), and it remains unclear which approach is best, as many of these concurrent works do not compare directly to each other under identical conditions.

Phase 2 establishes that incorporating semantic structure directly into the VAE latent space, whether through an alignment loss during training (VA-VAE), end-to-end joint optimization (REPA-E), or staged adaptation of frozen encoders (3-Stage-Aligner), produces superior results compared to standard pixel-focused VAE training. These semantically structured spaces are easier to learn (faster convergence) and produce better final quality. However, they still rely on the two-stage VAE+diffusion pipeline, raising a natural question: do we need VAE compression at all?

Phase 3: Operating Directly in Vision Foundation Model Feature Spaces

The third phase represents a more radical departure, questioning whether the VAE bottleneck is necessary at all. Instead of compressing images through a VAE and then aligning the latent space, these methods propose directly using pretrained vision foundation model features as the “latent space” for diffusion, or training autoencoders specifically to preserve discriminative information rather than minimize reconstruction error.

Based on the observation made in Perception Encoder ⁵⁷ that the best visual embeddings for downstream tasks are often not at the output of vision networks but rather in intermediate layers, VFM-VAE ⁴³ merges frozen VFM features from different parts of the network as latent representations. However, VFMs focus on semantic understanding, producing spatially coarse features (e.g., DINOv2 ViT-L outputs $16 \times 16$ for $256 \times 256$ images) sacrificing pixel fidelity. VFM-VAE redesigns the decoder with multi-scale latent fusion (combining features from multiple VFM layers, providing both semantic guidance from deep layers and spatial detail from shallow layers) and progressive resolution reconstruction (building up resolution gradually through decoder blocks, starting from coarse VFM features and progressively adding detail). In addition, the embedding dimensionality of VFMs is often too high for effective generative modelling; VFM-VAE circumvents this by mapping the different embeddings into a compressed latent space that is regularised via KL divergence, thereby still containing a VAE but with strong initialisation by a VFM.

Fig 13. In VFM-VAE, multiple VFM encodings are compressed into a single latent representation that is then projected out to pixel space via multi-scale decoders. The right side shows that this latent space is more robust to geometric perturbations and achieves strong reconstruction as well as generation.

This enables high-quality reconstruction from semantically rich but spatially compact representations. The work also introduces the SE-CKNNA metric for diagnosing representation dynamics during diffusion training. SE-CKNNA measures how well semantic structure in latent space is preserved during noising, revealing that semantic structure degrades nonlinearly with noise level, with critical thresholds where class separability breaks down. Using these insights, the authors develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. The frozen pretrained encoder ensures the latent space maintains semantic alignment even under distribution shifts: Phase 2 methods that fine-tune encoders risk semantic drift, whereas VFM-VAE’s frozen encoder keeps the structure consistent. However, this requires architectural innovations (multi-scale fusion, progressive reconstruction) to overcome the reconstruction challenges of coarse frozen features, which hinders easy adoption.

SVG ⁵⁸ tries to avoid these complex architectural modifications by first analyzing why VAE latent spaces are problematic: they lack clear semantic separation and strong discriminative structure. Standard VAE latents exhibit semantic entanglement (different classes overlap) and poor class compactness (same-class samples are widely dispersed). This makes the distribution difficult for diffusion to learn, as it must simultaneously discover semantic structure and model fine-grained variation. To overcome this, SVG constructs latent representations from frozen DINO features, which provide semantically discriminative structure with clear class separation, augmented with a lightweight residual branch that captures fine-grained details:

z_{\text{final}} = z_{\text{DINO}} + \alpha \cdot z_{\text{residual}}

where the frozen DINOv3 encoder provides semantics and a learned residual encoder captures color, texture, and other details that DINO discards. Normal VAE latents are semantically entangled, whereas building on VFM features enables clearer class separation and more compact classes. The full SVG encoder (DINO features plus the residual branch) proves important for fine-grained color details. No diffusion model tricks are needed since, for the chosen VFM (DINOv3), the latent space is small enough (384-dimensional) to be modelled without compression. However, the alignment loss is crucial: without it, the decoder over-relies on the residual encoder, and numerical range differences between the normalized frozen DINOv3 features and the unnormalized learned residuals can distort the semantic embeddings.
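The range-mismatch issue is easy to see numerically. The toy sketch below applies the additive formula above to made-up two-dimensional vectors (the values and function names are purely illustrative, not SVG's code) and checks how far the combined latent rotates away from the original semantic direction.

```python
import math

def combine(z_dino, z_residual, alpha):
    """Additive combination: z_final = z_dino + alpha * z_residual."""
    return [d + alpha * r for d, r in zip(z_dino, z_residual)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

z_dino = [0.6, 0.8]        # unit-norm "semantic" feature (toy values)
z_residual = [5.0, -5.0]   # unnormalized residual with a much larger range (toy values)

cos_small = cosine(z_dino, combine(z_dino, z_residual, 0.01))  # residual kept small
cos_large = cosine(z_dino, combine(z_dino, z_residual, 1.0))   # residual dominates
```

When the residual stays small relative to the frozen feature, the combined latent remains almost parallel to it; when the residual dominates, the semantic direction is largely lost. Keeping the two on comparable scales is one way to read the role of the alignment loss described above.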

Fig 14. Left shows VA-VAE with learned encoder aligned to VFM. Right shows SVG with frozen DINO encoder plus lightweight residual encoder capturing fine-grained details, enabling clearer semantic separation without VAE training.

While SVG emphasises the need for a modest embedding dimensionality and for a residual encoder that makes up for missing pixel-level details in the VFM embeddings, RAE ⁵⁹ tries to replace the VAE solely with pretrained representation encoders paired with trained decoders, without additional compression or auxiliary encoders. The authors systematically explore encoders from diverse self-supervised methods (DINO, SigLIP, MAE) and analyze the challenges of operating diffusion transformers in the resulting high-dimensional spaces. While standard VAE latents are low-dimensional ($32 \times 32 \times 4$, or 4K dimensions), representation encoder outputs are much higher-dimensional ($16 \times 16 \times 1024$ for DINOv2 ViT-L, or 262K dimensions). This poses challenges for diffusion transformers, which generally perform poorly in such high-dimensional spaces.
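The dimension counts quoted above follow from simple arithmetic, spelled out here as a sanity check. The token counts at the end rest on an assumed patchification (2x2 patches on the VAE grid, one token per position on the encoder grid) that is not stated in the text.

```python
def flat_dim(height, width, channels):
    """Total number of scalars in a height x width x channels latent grid."""
    return height * width * channels

vae_dim = flat_dim(32, 32, 4)      # 4096, the "4K" figure
rae_dim = flat_dim(16, 16, 1024)   # 262144, the "262K" figure
ratio = rae_dim // vae_dim         # 64

# Assumed patchification, for illustration only.
vae_tokens = (32 // 2) * (32 // 2)  # 256 tokens of 2 * 2 * 4 = 16 values each
rae_tokens = 16 * 16                # 256 tokens of 1024 values each

print(vae_dim, rae_dim, ratio, vae_tokens, rae_tokens)
```

Under that assumption the sequence length is identical in both cases, and the entire 64-fold increase sits in the per-token dimension.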

RAE identifies and addresses sources of difficulty through theoretically motivated solutions. First, standard DiT bottlenecks all tokens through the same hidden dimension, so when input tokens have higher dimensionality, this creates an information bottleneck. RAE introduces a wide DDT head that maintains high-dimensional representations through a final shallow-but-wide layer while keeping the majority of the DiT block lower-dimensional. Second, standard schedules are designed based on spatial dimensions assuming certain statistical properties. Representation encoder outputs have different characteristics (already normalized, different variance structure). Therefore, RAE makes the noise schedule depend on actual data statistics rather than assuming fixed properties. Third, since the decoder trains separately from the frozen encoder, mismatch can occur at inference—the diffusion model produces slightly imperfect samples, but the decoder was trained on clean representations. Following TarFlow ⁶⁰, RAE adds noise augmentation during decoder training for robustness to imperfect samples.
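The third fix, noise-augmented decoder training, amounts to perturbing the clean encoder output before it is passed to the decoder. A minimal sketch follows, assuming a noise level drawn uniformly from `[0, sigma_max]`; the sampling distribution, the function name, and the parameter names are assumptions for illustration, not taken from RAE.

```python
import random

def noise_augment(latent, sigma_max, rng):
    """Add Gaussian noise with a randomly drawn scale to a clean latent vector.

    Returns the noisy latent and the scale used, so the caller can log it.
    """
    sigma = rng.uniform(0.0, sigma_max)
    noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in latent]
    return noisy, sigma

rng = random.Random(0)  # fixed seed for reproducibility
clean = [0.5, -1.0, 2.0]
noisy, sigma = noise_augment(clean, 0.5, rng)
# The decoder would then be trained to reconstruct the image from `noisy`
# rather than from `clean`.
```

Training on perturbed inputs exposes the decoder to the kind of slightly off-manifold latents the diffusion model produces at inference time.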

RAE demonstrated that high-quality reconstruction from frozen DINO encoders with strong representations is possible. Computational overhead is minimal since DiT cost depends mostly on sequence length, not token dimension (which the wide head addresses). The DiT adjustments are necessary: scaling width to the token dimension, making the noise schedule data-dependent rather than dependent on spatial dimensions, and using noise-augmented decoding because the decoder is trained only on clean encoder outputs. An additional benefit of RAE is that high-resolution synthesis is trivially enabled by swapping decoders with different patch sizes, while the frozen encoder and trained diffusion model remain unchanged.

Fig 15. Shows systematic exploration of different pretrained encoders (DINO, SigLIP, MAE) as frozen latent encoders, with DiT adjustments (wide DDT head, data-dependent noise schedule, noise-augmented decoding) enabling effective diffusion in high-dimensional representation spaces.

However, while RAE allowed the direct use of pretrained VFMs as encoders, it has two main limitations:

- The modifications of the diffusion model required to make this work were substantial.
- There is no emphasis whatsoever on reconstruction, limiting editing capabilities of these models and making them potentially vulnerable to drifting off the data manifold.

FAE ⁶¹ tackles the first of these limitations by introducing a simple adaptation, a single attention layer, that allows the use of standard LightningDiT recipes. By then training to both reconstruct images and preserve pretrained features, FAE creates a truly unified representation that serves as both generative latent space and discriminative feature space. The translation layer (a single attention layer between frozen encoder features and generative decoder) provides a minimal but effective transformation. This allows standard diffusion models to be used again without the RAE modifications, demonstrating that the right architectural intervention can eliminate the need for extensive model adjustments. It also shows that the translation layer preserves the spatial structure of the latent space, which is consistent with the iREPA finding that spatial structure is the main determinant of how effective alignment is for generation quality ⁵². While these tricks seem useful for avoiding RAE architecture modifications such as noise-augmented decoding and the wide DDT head, recent work suggests that these modifications are unnecessary once RAE models are scaled to sizes where the DiT width already exceeds the VFM embedding dimension ⁶².

Fig 16. Unlike RAE, FAE introduces a lightweight “translation layer” (a single attention block) to align frozen pretrained encoder features with the generative decoder. This minimal intervention preserves spatial structure and discriminative power.

An alternative perspective on RAE’s convergence difficulties argues that the problem is fundamentally geometric, not architectural. Standard flow matching uses linear interpolation between noise and data, creating probability paths that cut straight through Euclidean space. When data lies on a curved manifold, such as the hypersphere on which the features of representation encoders like DINOv2 lie, these straight-line paths pass through low-density regions in the interior of the sphere rather than following its surface. This “Geometric Interference” causes standard diffusion transformers to fail on representation spaces.

RJF (Riemannian Flow Matching with Jacobi Regularization) ⁶³ addresses this by explicitly modifying the probability paths: when the manifold geometry is known (a hypersphere for DINO features), it replaces linear interpolation with geodesic interpolation (SLERP), constraining paths to follow the manifold surface using Riemannian flow matching ⁶⁴. It additionally corrects for curvature-induced error propagation via Jacobi regularization ⁶⁵. This enables standard DiT architectures to converge without width scaling: a DiT-B (131M parameters) achieves FID 3.37 where prior methods fail entirely. RJF is in spirit similar to CDC-FM (Carré du champ flow matching) ⁶⁶, which also modifies probability paths to respect data geometry; the key difference is that RJF requires explicit knowledge of the manifold (enabling geodesic paths), while CDC-FM estimates local curvature from data via geometry-aware noise covariances, making it more general but less precise when the manifold structure is known.
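The difference between the two interpolation schemes is easy to verify numerically. The sketch below compares linear interpolation with SLERP for two unit vectors; it only illustrates the geometry, not RJF's training procedure, and it assumes both endpoints are already normalised to the unit sphere.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))

def lerp(x0, x1, t):
    """Straight-line interpolation, as in standard flow matching."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def slerp(x0, x1, t):
    """Geodesic interpolation between two unit vectors on the sphere.

    Antipodal inputs (angle of pi) are not handled: the geodesic is not unique there.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)
    if omega < 1e-8:  # nearly identical points: fall back to the straight line
        return lerp(x0, x1, t)
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]

x0, x1 = [1.0, 0.0], [0.0, 1.0]
mid_linear = lerp(x0, x1, 0.5)      # lies inside the sphere
mid_geodesic = slerp(x0, x1, 0.5)   # stays on the sphere
```

For these orthogonal endpoints the linear midpoint has norm of about 0.71, while the SLERP midpoint keeps norm 1. In high dimensions, where independently drawn directions are nearly orthogonal, the shortfall of the straight-line path is of similar size.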

Fig 17. Both CDC-FM (left) and RJF (right) modify the probability path structure to follow the data manifold structure, with CDC-FM using a spatially varying, anisotropic Gaussian noise whose covariance captures local manifold geometry and RJF using Riemannian flow matching and Jacobi regularization.

While FAE and RJF tackled the architectural adaptation problem of RAE, PS-VAE ⁶⁷ tackles the editing problem that arises because RAE does not explicitly encourage the latent space to support reconstruction. By sequentially training representation and pixel decoders, and by fine-tuning the pretrained representation encoder with reconstruction losses, the authors find a good balance between reconstruction and representation capabilities and show that this balance yields superior generation and editing.

Most recently, UAE ⁶⁸ offers a theoretical unification through its “Prism Hypothesis,” which posits that semantic and pixel representations correspond to different frequency bands of a shared spectrum. Unlike SVG which adds a separate residual encoder, or RAE which relies on a heavy decoder, UAE initializes its encoder from DINOv2 and utilizes a frequency-band modulator to disentangle the latent space. It explicitly aligns the low-frequency band to the semantic teacher while dedicating high-frequency bands to residual details, effectively harmonizing semantic abstraction with pixel fidelity in a single compact latent space. For related work on frequency-band analysis in autoencoders, see also work on spectral autoencoders ⁶⁹ and the associated blog post.
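As a toy analogy for the band split, the sketch below separates a one-dimensional signal into a smooth low-frequency part and a high-frequency remainder using a simple box filter. This is not UAE's frequency-band modulator: the filter choice, the 1D setting, and the function names are assumptions made only to show that the two bands sum back to the original signal.

```python
def low_pass(signal, radius=1):
    """Box-filter smoothing with edge replication; keeps the slowly varying part."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(k, 0), n - 1)] for k in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def split_bands(signal, radius=1):
    """Return (low, high) such that low[i] + high[i] reconstructs signal[i]."""
    low = low_pass(signal, radius)
    high = [x - l for x, l in zip(signal, low)]
    return low, high

signal = [0.0, 4.0, 0.0, 4.0, 0.0]
low, high = split_bands(signal)
```

In the picture painted by the Prism Hypothesis, the low band would be the part aligned to the semantic teacher and the high band the part reserved for residual pixel detail.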

Phase 3 methods establish that VAE compression is not fundamental to high-quality latent diffusion. By directly using pretrained vision foundation model features as latent representations (with appropriate architectural modifications handling high-dimensionality, spatial coarseness, and reconstruction challenges), we achieve generation quality comparable to or exceeding VAE-based methods while maintaining discriminative power of the original pretrained encoder. However, all Phase 3 methods still rely on pretrained vision foundation models. Phase 4 takes the final step: questioning whether we need pretrained representations at all.

Phase 4: Questioning the Need for Pretrained Representations

After three phases focused on progressively sophisticated ways to leverage pretrained models, Phase 4 represents a countertrend: can we achieve similar benefits by training from scratch with better objectives and architectures? This phase questions whether dependency on external pretrained models is fundamental or merely a workaround for suboptimal training procedures.

USP ⁷⁰ embodies this philosophy through fully end-to-end training jointly optimized for both generative and discriminative objectives. Rather than initializing from external representations, it employs a multi-task loss combining generation and discrimination such as contrastive learning, masked prediction, or classification. Generative and discriminative objectives complement one another: generative learning encourages modeling the full data distribution, while discriminative tasks promote the discovery of semantically meaningful structure. Joint optimization thus produces representations that are simultaneously generative (capable of synthesis) and discriminative (useful downstream), reducing the reliance on separate pretraining stages. This raises a critical question: does representation alignment solve deep architectural deficiencies, or does it merely accelerate learning? If the latter, the necessity of pretrained models could wane as compute, data, and training recipes continue to scale.

A similar spirit underlies large-scale systems such as FLUX2-VAE ⁷¹, which demonstrates that sophisticated tokenizers can be learned directly through end-to-end training rather than depending on pretrained vision foundation features. Although little is publicly known about its technical details, FLUX2-VAE’s production success suggests that with sufficient scale and engineering, high-quality tokenizers and representations can emerge organically from task training alone. Yet, “without pretrained representations” does not necessarily mean “cheap to train”: the total computational cost may rival or even exceed that of conventional pretraining pipelines. Whether the elegance of end-to-end architectures outweighs the modularity, interpretability, and reusability of pretrained components remains an open question.

The same shift is visible in the recent renaissance of pixel-space diffusion models, which challenge the long-held assumption that latent diffusion is a prerequisite for high-resolution, high-quality generation. Methods such as JiT (Just image Transformer) ⁷², PixelDiT ⁷³, DeCo (frequency-DeCoupled diffusion) ⁷⁴, DiP (Diffusion in Pixel space) ⁷⁵, and SiD2 (Simpler Diffusion v2) ⁷⁶ illustrate a broader trend: architectural innovation can substitute for latent-space compression. By employing patch-based Transformers, efficient multi-scale attention, or frequency-aware loss designs, these models demonstrate that the efficiency, quality, and stability advantages traditionally attributed to latent spaces can also be achieved through direct pixel-space training.

EPG (End-to-End Pixel-Space Generative Pretraining) ⁷⁷ pushes this idea further by integrating representation learning into pixel-space diffusion itself. Rather than discarding the notion of learned structure, it reimagines representation pretraining as part of the diffusion process. EPG pretrains encoders through self-supervised objectives along deterministic diffusion trajectories, learning temporally consistent and semantically distinct features directly in pixel space. This pretraining endows the encoder with structured initialization analogous to pretrained vision models, but derived natively from the diffusion task. The result is a model that successfully trains consistency and diffusion systems from scratch, reportedly the first to achieve stable training of high-resolution consistency models without any pretrained VAEs or diffusion models. EPG leverages the dispersive loss ⁴⁹, a simple plug-and-play regularizer that encourages diffusion model representations to disperse in the model’s intermediate feature space (analogous to contrastive learning) without requiring positive pairs, improving generation quality without interfering with the sampling process.
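To give a feel for the dispersive loss, here is a minimal sketch of a contrastive-style objective without positive pairs: the log of the mean of `exp(-distance / tau)` over all pairs in a batch. The exact form (squared L2 distance, inclusion of the `i == j` terms, the temperature name `tau`) is my assumption for illustration and may differ from the published formulation.

```python
import math

def dispersive_loss(reps, tau):
    """Log-mean-exp of negative squared distances over all pairs in the batch.

    Lower values mean the representations are more spread out.
    """
    n = len(reps)
    total = 0.0
    for zi in reps:
        for zj in reps:
            dist_sq = sum((a - b) ** 2 for a, b in zip(zi, zj))
            total += math.exp(-dist_sq / tau)
    return math.log(total / (n * n))

collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # all samples identical
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]      # samples pushed apart

loss_collapsed = dispersive_loss(collapsed, tau=1.0)
loss_spread = dispersive_loss(spread, tau=1.0)
```

A fully collapsed batch attains the maximum value of zero and any spread lowers the loss, so minimising it pushes intermediate features apart without needing augmented positive pairs.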

However, it’s important to note that these pixel-space methods mostly tackle ImageNet-scale generation. At true production level (think FLUX, Sora, Veo, and similar systems), these are, to the best of my knowledge, all latent models. The field is moving toward video generation, which is so computationally expensive that compression remains essential. Scaling pixel-space methods to high-resolution, text-driven image or video generation at production quality remains to be demonstrated. Additionally, at production level, efficiency is critically important: serving models to millions of users requires fast generation and manageable inference costs, especially as video and world models become the next frontier. For some discussion of the latent vs. pixel-space trade-off, see the replies to this tweet by Sander Dieleman.

The Other Direction: Generative Models as Representations

The four phases above focused on one direction of unification: using pretrained representations to improve generation. But the relationship is bidirectional—generative modeling itself can serve as a powerful pretraining objective for learning representations useful in discriminative tasks. This “other direction” has gained significant momentum, suggesting that generation and representation learning may be two views of the same underlying process.

From Pixel Prediction to Embedding Prediction

MAE ⁸ pioneered masked pixel reconstruction for vision, demonstrating that predicting masked image patches creates strong representations. However, the pixel-level reconstruction objective tends to focus on low-level details rather than high-level semantics. Could predicting embeddings instead of pixels yield better representations?

AIM v1 (Autoregressive Image Models) ⁷⁸ revisits autoregressive modeling for vision with modern architectures and large-scale data. Unlike early work like iGPT ⁷⁹ or D-iGPT ⁸⁰, AIM uses Vision Transformers and is trained on billions of images. The work demonstrates two key findings: (1) visual feature performance scales with both model capacity and data quantity, exhibiting similar scaling laws to large language models, and (2) the value of the autoregressive objective function correlates with downstream performance, providing a meaningful training signal. AIM-7B achieves 84.0% ImageNet-1k accuracy with a frozen trunk and shows particularly strong performance when trained on diverse, uncurated web data.

AIM v2 ⁸¹ extends this to multimodal autoregressive models, demonstrating that the same autoregressive paradigm can be applied across images and text, creating unified representations that span modalities. NEPA (Next-Embedding Prediction) ⁸² takes this further by predicting embeddings from pretrained models rather than raw pixels—by operating in a semantic embedding space, NEPA focuses on high-level features rather than low-level details, bridging generative objectives with the representation-focused methods discussed in earlier sections.

Diffusion Models Learn Representations Too

The broader pattern is that generative objectives—whether autoregressive, masked, or diffusion-based—can serve dual purposes: they enable sampling of new examples and, as a byproduct, learn representations useful for discriminative tasks. Recent work on improving diffusion autoencoders ³⁸ and using masked autoencoders as tokenizers ³⁹ further blurs this line.

Several methods explicitly bridge generative and discriminative training within diffusion models. Robust representation consistency models ⁸³ use contrastive denoising to learn consistent representations along diffusion trajectories, improving both robustness and downstream performance. EPG ⁷⁷, discussed in Phase 4, exemplifies this approach by pretraining encoders along diffusion trajectories to learn structured representations natively from the generative task.

These developments suggest that the distinction between “representation learning” and “generative modeling” may be more historical than fundamental—both aim to learn useful structure from data, just with different downstream applications in mind.

Representation Learning and Alignment in Molecular Machine Learning

The ideas from visual representation learning and generative modeling are beginning to influence molecular and protein modeling, suggesting broader applicability of these concepts beyond computer vision. Neural network potentials (NNPs), particularly MACE (Message Passing Atomic Cluster Expansion) ⁸⁴⁸⁵, have emerged as foundation models for atomistic chemistry that exhibit striking parallels to vision foundation models like DINO:

- Embeddings transfer across tasks: MACE’s internal representations, learned for predicting quantum-mechanical energies and forces, generalise remarkably well to diverse downstream tasks. These embeddings can predict molecular properties far beyond the original training objective, allowing accurate property predictions not only in materials (the original domain) but also in small molecules ⁸⁶ and proteins ⁸⁷.
- Platonic convergence with scale: Just as vision models trained with different objectives converge toward similar representations as they scale, independently trained molecular models exhibit the same phenomenon. Work from MIT demonstrates that ostensibly different molecular models can be mapped into a common latent space with minimal performance loss ⁸⁸, while complementary work from London shows that NNPs trained on large, diverse datasets discover comparable latent organizations ⁸⁹, a molecular analogue of the Platonic Representation Hypothesis.
- Representation alignment benefits generative models: MACE-REPA directly applies the Phase 1 alignment paradigm to molecular force fields ⁹⁰. Instead of aligning diffusion features to DINO, it aligns force-field encoder representations to frozen MACE features using auxiliary losses. This demonstrates that the core insight (leveraging structured pretrained representations to accelerate training) transfers robustly from image diffusion to atomistic simulations.

Molecular Embeddings: Borrowing from NLP and Computer Vision

The pattern of using foundation model embeddings for downstream prediction is not unique to vision or molecules—it directly parallels the NLP paradigm where LLM embeddings are aggregated (e.g., via mean pooling) and fed to prediction heads. In computer vision, DINO embeddings (particularly the [CLS] token) serve the same role. For molecules and proteins, MACE produces atomic descriptors that must be aggregated into molecular or residue-level representations before downstream prediction.

Fig 17. The embedding paradigm across domains: foundation models (LLMs, DINO, MACE) produce local embeddings that are aggregated and fed to task-specific prediction heads. In molecules, MACE atomic descriptors are pooled via learned aggregators like REM3DI; in proteins, they can be pooled to residue-level descriptors via GNN pooling.

This pattern has been successfully instantiated for both molecules and proteins. REM3DI ⁸⁶ learns to aggregate MACE atomic descriptors into smooth, rotation-invariant molecular representations that achieve state-of-the-art performance on property prediction benchmarks. For proteins, similar approaches pool MACE atomic descriptors to the residue level, enabling prediction of per-residue properties like NMR chemical shifts or pKa values ⁸⁷.
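The simplest version of this aggregation step is plain mean pooling of atomic descriptors by group, where a group is a whole molecule or a single residue. The sketch below shows only that baseline: the function name and the toy descriptors are hypothetical, no MACE API is used, and learned aggregators such as REM3DI replace the mean with something more expressive.

```python
def pool_by_group(atom_embeddings, group_ids):
    """Mean-pool per-atom descriptor vectors into one vector per group id."""
    sums = {}
    counts = {}
    for emb, gid in zip(atom_embeddings, group_ids):
        if gid not in sums:
            sums[gid] = [0.0] * len(emb)
            counts[gid] = 0
        sums[gid] = [s + e for s, e in zip(sums[gid], emb)]
        counts[gid] += 1
    return {gid: [s / counts[gid] for s in vec] for gid, vec in sums.items()}

# Three atoms with 2-dimensional toy descriptors; the first two belong to residue 0.
atom_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
residue_ids = [0, 0, 1]
residue_embeddings = pool_by_group(atom_embeddings, residue_ids)
```

Passing the same id for every atom yields a single molecule-level descriptor, which mirrors mean pooling of LLM token embeddings in the NLP setting.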

Fig 18. Left: REM3DI aggregates MACE atomic descriptors into molecular descriptors for property prediction. Right: MACE atomic descriptors can be pooled to residue-level representations for protein property prediction, extracting canonical environment descriptors from local atomic neighborhoods.

Where to Go from Here?

Fig 19. Three complementary approaches to leveraging atomistic foundation models: (1) using pretrained embeddings directly for downstream tasks, (2) aligning generative model representations to foundation model features, and (3) drawing architectural inspiration from what makes foundation models generalise well.

Using pretrained embeddings remains powerful for property prediction, though naively incorporating them does not seem to help structure prediction—as we explored in the RF3 paper ⁹¹, simply conditioning on MACE embeddings did not improve structure prediction accuracy. Representation alignment (as in MACE-REPA) has started but remains in its infancy compared to the sophisticated alignment methods developed for images.

A third, perhaps underexplored angle is architectural inspiration: why does MACE work so well and generalise so broadly? Three factors likely contribute: (1) training on large-scale DFT data, (2) physics-grounded objectives (energies and forces), and (3) strong locality bias—MACE operates on strictly local atomic environments rather than global molecular graphs. SLAE ⁹² takes exactly this approach for proteins: it adopts the physics-grounded objective by predicting Rosetta energy terms (hydrogen bonding, solvation, electrostatics) and embraces strict locality by encoding all-atom environments rather than full protein graphs. This places SLAE conceptually close to aligned VAEs in vision—reconstruction ensures geometric fidelity while auxiliary physics heads encourage the latent space to align with physically meaningful axes.

Conclusion

The field has undergone rapid evolution, progressing through four distinct phases: (1) aligning diffusion features with pretrained representations, (2) incorporating semantic structure into VAE latent spaces, (3) directly using pretrained representations as latent spaces, and (4) questioning whether pretrained representations are necessary at all. Parallel developments in pixel-space diffusion and generative representation learning have further enriched the landscape.

Several clear patterns emerge: representation alignment dramatically accelerates training, spatial structure may be more important than global semantics, VAE compression is not fundamental, and principles transfer beyond vision to molecular modeling. However, fundamental questions remain: What makes representations learnable? What is the optimal compression rate? How do we unify multiple modalities? Should we train jointly or in stages?

The answers likely depend on scale, application requirements, and computational constraints. At research scale, leveraging pretrained models provides clear advantages for rapid iteration and exploration. At production scale, state-of-the-art systems like FLUX, Veo, and Sora demonstrate that multi-stage latent approaches—not necessarily end-to-end training—can achieve maximum quality. This suggests that the modularity of staged training, where VAEs are pretrained and reused, offers both efficiency and quality benefits at scale. While pixel-space methods continue to advance and may avoid certain reconstruction artifacts, latent-space methods currently dominate production deployments due to their computational efficiency, which is critical when serving millions of users or generating expensive video content.

Looking forward, I am very excited to see how these advances in vision will translate to the molecular world. I myself have worked quite a bit on generative modelling for proteins ⁹³ and small molecules ⁹⁴, recently also leveraging latent diffusion ⁹⁵, so I will follow this space with great interest!

Credits

Thanks to everyone who gave feedback on this blogpost, especially Arash Vahdat, Karsten Kreis and the rest of the GenAIR team, as well as members of the Baker lab for interesting discussions about this topic. Thanks also to Phillip Isola and the rest of the “Foundations of Computer Vision” team for making their great text openly accessible; I took the title image for this blogpost as well as the representation vs generation figure from it.

References

Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models. NeurIPS. https://arxiv.org/abs/2006.11239 ↩︎ ↩︎²
Song, Y., et al. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR. https://arxiv.org/abs/2011.13456 ↩︎ ↩︎²
Lai, C.-H., Song, Y., Kim, D., Mitsufuji, Y., & Ermon, S. (2025). The principles of diffusion models. arXiv preprint arXiv:2510.21890. https://arxiv.org/abs/2510.21890 ↩︎
Lipman, Y., et al. (2024). Flow Matching for Generative Modeling. ICLR. https://arxiv.org/abs/2210.02747 ↩︎ ↩︎²
Albergo, M. S., & Vanden-Eijnden, E. (2023). Building Normalizing Flows with Stochastic Interpolants. ICLR. https://arxiv.org/abs/2209.15571 ↩︎
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. https://arxiv.org/abs/2103.00020 ↩︎ ↩︎² ↩︎³
Caron, M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. ICCV. https://arxiv.org/abs/2104.14294 ↩︎ ↩︎² ↩︎³
He, K., et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR. https://arxiv.org/abs/2111.06377 ↩︎ ↩︎² ↩︎³
Kadkhodaie, Z., Mallat, S., & Simoncelli, E. (2025). Unconditional CNN denoisers contain sparse semantic representation of images. arXiv preprint arXiv:2506.01912. https://arxiv.org/abs/2506.01912 ↩︎
Liang, Q., Liu, Z., Ostrow, M., & Fiete, I. (2024). How Diffusion Models Learn to Factorize and Compose. arXiv preprint arXiv:2408.13256. https://arxiv.org/abs/2408.13256 ↩︎
Jaini, P., Clark, K., & Geirhos, R. (2024). Intriguing properties of generative classifiers. arXiv preprint arXiv:2309.16779. https://arxiv.org/abs/2309.16779 ↩︎
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2256–2265. https://arxiv.org/abs/1503.03585 ↩︎
Song, Y., & Ermon, S. (2020). Improved techniques for training score-based generative models. Advances in neural information processing systems, 33, 12438–12448. ↩︎
Vahdat, A., & Kautz, J. (2020). NVAE: A Deep Hierarchical Variational Autoencoder. NeurIPS. https://arxiv.org/abs/2007.03898 ↩︎
Vahdat, A., et al. (2021). Score-based generative modeling in latent space. NeurIPS. https://arxiv.org/abs/2106.05931 ↩︎ ↩︎²
Rombach, R., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR. https://arxiv.org/abs/2112.10752 ↩︎ ↩︎²
Dieleman, S. (2025). Generative modelling in latent space. https://sander.ai/2025/04/15/latents.html ↩︎ ↩︎²
Blattmann, A., et al. (2023). Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. CVPR. https://arxiv.org/abs/2304.08818 ↩︎
Brooks, T., et al. (2024). Video generation models as world simulators. OpenAI Blog. https://openai.com/index/video-generation-models-as-world-simulators/ ↩︎ ↩︎² ↩︎³
Podell, D., et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. ICLR. https://arxiv.org/abs/2307.01952 ↩︎
Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV. https://arxiv.org/abs/2212.09748 ↩︎
Gao, R., Hoogeboom, E., Heek, J., Bortoli, V. D., Murphy, K. P., & Salimans, T. (2024). Diffusion meets flow matching: Two sides of the same coin. arXiv preprint arXiv:2401.08740. https://arxiv.org/abs/2401.08740 ↩︎
Esser, P., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML. https://arxiv.org/abs/2403.03206 ↩︎ ↩︎²
Oord, A. v. d., Li, Y., & Vinyals, O. (2019). Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748. https://arxiv.org/abs/1807.03748 ↩︎ ↩︎² ↩︎³
Chen, X., & He, K. (2020). Exploring Simple Siamese Representation Learning. CVPR. https://arxiv.org/abs/2011.10566 ↩︎ ↩︎²
Caron, M., et al. (2019). Deep Clustering for Unsupervised Learning of Visual Features. ECCV. https://arxiv.org/abs/1807.05520 ↩︎ ↩︎²
Caron, M., et al. (2021). Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. NeurIPS. https://arxiv.org/abs/2006.09882 ↩︎ ↩︎²
Assran, M., et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR. https://arxiv.org/abs/2301.08243 ↩︎ ↩︎² ↩︎³
Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv preprint arXiv:2405.07987. https://arxiv.org/abs/2405.07987 ↩︎ ↩︎² ↩︎³ ↩︎⁴ ↩︎⁵
Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. https://arxiv.org/abs/2002.05709 ↩︎ ↩︎²
Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193. https://arxiv.org/abs/2304.07193 ↩︎ ↩︎² ↩︎³
Simeoni, O., et al. (2025). DINOv3. arXiv preprint arXiv:2508.10104. https://arxiv.org/abs/2508.10104 ↩︎ ↩︎²
Assran, M., et al. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985. https://arxiv.org/abs/2506.09985 ↩︎
Zhai, X., et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV. https://arxiv.org/abs/2303.15343 ↩︎ ↩︎²
Tschannen, M., et al. (2025). SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786. https://arxiv.org/abs/2502.14786 ↩︎
Grill, J.-B., et al. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS. https://arxiv.org/abs/2006.07733 ↩︎
Richemond, P. H., et al. (2020). BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241. https://arxiv.org/abs/2010.10241 ↩︎
Skorokhodov, I., et al. (2025). Improving the Diffusability of Autoencoders. arXiv preprint arXiv:2502.14831. https://arxiv.org/abs/2502.14831 ↩︎ ↩︎² ↩︎³
Chen, H., et al. (2025). Masked Autoencoders Are Effective Tokenizers for Diffusion Models. arXiv preprint arXiv:2502.03444. https://arxiv.org/abs/2502.03444 ↩︎ ↩︎² ↩︎³
Zhou, J., et al. (2021). iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR. https://arxiv.org/abs/2111.07832 ↩︎
Gupta, S., Sundaram, S., Wang, C., Jegelka, S., & Isola, P. (2025). Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models. arXiv preprint arXiv:2510.08492. https://arxiv.org/abs/2510.08492 ↩︎
Wang, S. L., Isola, P., & Cheung, B. (2025). Words That Make Language Models Perceive. arXiv preprint arXiv:2510.02425. https://arxiv.org/abs/2510.02425 ↩︎ ↩︎²
Bi, T., Zhang, X., Lu, Y., & Zheng, N. (2025). Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models. arXiv preprint arXiv:2510.18457. https://arxiv.org/abs/2510.18457 ↩︎ ↩︎²
Yu, S., et al. (2025). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think. arXiv preprint arXiv:2410.06940. https://arxiv.org/abs/2410.06940 ↩︎ ↩︎² ↩︎³
Wu, G., et al. (2025). Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think. arXiv preprint arXiv:2507.01467. https://arxiv.org/abs/2507.01467 ↩︎ ↩︎²
Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., & Zheng, L. (2025). REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers. arXiv preprint arXiv:2504.10483. https://arxiv.org/abs/2504.10483 ↩︎ ↩︎²
Yao, J., Yang, B., & Wang, X. (2025). Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. arXiv preprint arXiv:2501.01423. https://arxiv.org/abs/2501.01423 ↩︎ ↩︎²
Wang, Z., et al. (2025). REPA Works Until It Doesn’t: Early-Stopped, Holistic Alignment Supercharges Diffusion Training. arXiv preprint arXiv:2505.16792. https://arxiv.org/abs/2505.16792 ↩︎ ↩︎²
Wang, R., & He, K. (2025). Diffuse and Disperse: Image Generation with Representation Regularization. arXiv preprint arXiv:2506.09027. https://arxiv.org/abs/2506.09027 ↩︎ ↩︎²
Xiang, W., et al. (2023). Denoising Diffusion Autoencoders are Unified Self-Supervised Learners. ICCV. https://arxiv.org/abs/2303.09769 ↩︎
Chen, X., Liu, Z., Xie, S., & He, K. (2024). Deconstructing denoising diffusion models for self-supervised learning. arXiv preprint arXiv:2401.14404. https://arxiv.org/abs/2401.14404 ↩︎
Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., & Xie, S. (2025). What matters for Representation Alignment: Global Information or Spatial Structure? arXiv preprint arXiv:2512.10794. https://arxiv.org/abs/2512.10794 ↩︎ ↩︎²
Shechtman, E., & Irani, M. (2007). Matching local self-similarities across images and videos. 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). IEEE. https://ieeexplore.ieee.org/document/4270170 ↩︎
Dieleman, S. (2023). The geometry of diffusion guidance. https://sander.ai/2023/08/28/geometry.html ↩︎
Goodfellow, I., et al. (2014). Generative Adversarial Networks. NeurIPS. https://arxiv.org/abs/1406.2661 ↩︎
Chen, B., et al. (2025). Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models. arXiv preprint arXiv:2509.25162. https://arxiv.org/abs/2509.25162 ↩︎
Bolya, D., et al. (2025). Perception Encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181. https://arxiv.org/abs/2504.13181 ↩︎
Shi, M., et al. (2025). Latent Diffusion Model without Variational Autoencoder. arXiv preprint arXiv:2510.15301. https://arxiv.org/abs/2510.15301 ↩︎
Zheng, B., Ma, N., Tong, S., & Xie, S. (2025). Diffusion Transformers with Representation Autoencoders. arXiv preprint arXiv:2510.11690. https://arxiv.org/abs/2510.11690 ↩︎
Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M. A., Jaitly, N., & Susskind, J. (2024). Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329. https://arxiv.org/abs/2412.06329 ↩︎
Gao, Y., Chen, C., Chen, T., & Gu, J. (2025). One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation. arXiv preprint arXiv:2512.07829. https://arxiv.org/abs/2512.07829 ↩︎
Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun, Y., & Xie, S. (2026). Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders. arXiv preprint arXiv:2601.16208. https://arxiv.org/abs/2601.16208 ↩︎
Kumar, A., & Patel, V. M. (2026). Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders. arXiv preprint arXiv:2602.10099. https://arxiv.org/abs/2602.10099 ↩︎
Mathieu, E., & Nickel, M. (2020). Riemannian Continuous Normalizing Flows. Advances in Neural Information Processing Systems, 33, 2503-2515. https://arxiv.org/abs/2006.10605 ↩︎
Zaghen, O., Eijkelboom, F., Pouplin, A., & Bekkers, E. J. (2025). Towards Variational Flow Matching on General Geometries. arXiv preprint arXiv:2502.12981. https://arxiv.org/abs/2502.12981 ↩︎
Bamberger, J., Jones, I., Duncan, D., Bronstein, M. M., Vandergheynst, P., & Gosztolai, A. (2025). Carré du champ flow matching: better quality-generalisation tradeoff in generative models. arXiv preprint arXiv:2510.05930. https://arxiv.org/abs/2510.05930 ↩︎
Zhang, S. (2025). Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing. arXiv preprint arXiv:2509.25162. https://arxiv.org/abs/2509.25162 ↩︎
Fan, W., Diao, H., Wang, Q., Lin, D., & Liu, Z. (2025). The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding. arXiv preprint arXiv:2512.19693. https://arxiv.org/abs/2512.19693 ↩︎
Falck, F., et al. (2025). Spectral Autoencoders. arXiv preprint arXiv:2505.11278. https://arxiv.org/abs/2505.11278 ↩︎
Chu, X., Li, R., & Wang, Y. (2025). USP: Unified Self-Supervised Pretraining for Image Generation and Understanding. arXiv preprint arXiv:2503.06132. https://arxiv.org/abs/2503.06132 ↩︎
Black Forest Labs. (n.d.). FLUX. https://bfl.ai/research/representation-comparison ↩︎
Li, T., & He, K. (2025). Back to Basics: Let Denoising Generative Models Denoise. arXiv preprint arXiv:2511.13720. https://arxiv.org/abs/2511.13720 ↩︎
Yu, Y., Xiong, W., Nie, W., Sheng, Y., Liu, S., & Luo, J. (2025). PixelDiT: Pixel Diffusion Transformers for Image Generation. arXiv preprint arXiv:2511.20645. https://arxiv.org/abs/2511.20645 ↩︎
Ma, Z., Wei, L., Wang, S., Zhang, S., & Tian, Q. (2025). DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation. arXiv preprint arXiv:2511.19365. https://arxiv.org/abs/2511.19365 ↩︎
Chen, Z., et al. (2025). DiP: Taming Diffusion Models in Pixel Space. arXiv preprint arXiv:2511.18822. https://arxiv.org/abs/2511.18822 ↩︎
Hoogeboom, E., Mensink, T., Heek, J., Lamerigts, K., Gao, R., & Salimans, T. (2025). Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324. https://arxiv.org/abs/2410.19324 ↩︎
Lei, J., Liu, K., Berner, J., Yu, H., Zheng, H., Wu, J., & Chu, X. (2025). Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training. arXiv preprint arXiv:2510.12586. https://arxiv.org/abs/2510.12586 ↩︎ ↩︎²
El-Nouby, A., Klein, M., Zhai, S., Bautista, M. A., Toshev, A., Shankar, V., Susskind, J. M., & Joulin, A. (2024). Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541. https://arxiv.org/abs/2401.08541 ↩︎
Chen, M., et al. (2020). Generative Pretraining from Pixels. ICML. https://arxiv.org/abs/2009.14794 ↩︎
Ren, S., Wang, Z., Zhu, H., Xiao, J., Yuille, A., & Xie, C. (2023). Rejuvenating image-gpt as strong visual representation learners. arXiv preprint arXiv:2312.02147. https://arxiv.org/abs/2312.02147 ↩︎
Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., da Costa, V. G. T., Béthune, L., Gan, Z., et al. (2025). Multimodal autoregressive pre-training of large vision encoders. Proceedings of the Computer Vision and Pattern Recognition Conference, 9641-9654. https://arxiv.org/abs/2411.14402 ↩︎
Xu, S., Ma, Z., Chai, W., Chen, X., Jin, W., Chai, J., Xie, S., & Yu, S. X. (2025). Next-Embedding Prediction Makes Strong Vision Learners. arXiv preprint arXiv:2512.16922. https://arxiv.org/abs/2512.16922 ↩︎
Lei, J., Berner, J., Wang, J., Chen, Z., Ba, Z., Ren, K., Zhu, J., & Anandkumar, A. (2025). Robust Representation Consistency Model via Contrastive Denoising. arXiv preprint arXiv:2501.13094. https://arxiv.org/abs/2501.13094 ↩︎
Batatia, I., et al. (2025). A foundation model for atomistic materials chemistry. The Journal of Chemical Physics, 163(18). https://pubs.aip.org/aip/jcp/article/163/18/184110/3372267/A-foundation-model-for-atomistic-materials ↩︎
Bernstein, N. (2024). From GAP to ACE to MACE. arXiv preprint arXiv:2410.06354. https://arxiv.org/abs/2410.06354 ↩︎
Wedig, S., Elijošius, R., Schran, C., & Schaaf, L. L. (2025). REM3DI: Learning smooth, chiral 3D molecular representations from equivariant atomistic foundation models. NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations. https://openreview.net/forum?id=jOmZsvXoK5 ↩︎ ↩︎²
Bojan, M., Vedula, S., Maddipatla, A., Sellam, N. B., Napoli, F., Schanda, P., & Bronstein, A. M. (2025). Representing local protein environments with atomistic foundation models. arXiv preprint arXiv:2505.23354. https://arxiv.org/abs/2505.23354 ↩︎ ↩︎²
Edamadaka, S., Yang, S., Li, J., & Gómez-Bombarelli, R. (2025). Universally Converging Representations of Matter Across Scientific Foundation Models. arXiv preprint arXiv:2512.03750. https://arxiv.org/abs/2512.03750 ↩︎
Li, Z., & Walsh, A. (2025). Platonic representation of foundation machine learning interatomic potentials. arXiv preprint arXiv:2512.05349. https://arxiv.org/abs/2512.05349 ↩︎
Pinede, L., Yang, S., Nam, J., & Gomez-Bombarelli, R. (2025). Unifying Force Prediction and Molecular Conformation Generation Through Representation Alignment. ICML 2025 Generative AI and Biology (GenBio) Workshop. https://openreview.net/pdf?id=yzkHGHvC74 ↩︎
Corley, N., Mathis, S., Krishna, R., Bauer, M. S., Thompson, T. R., Ahern, W., …, Didi, K., …, Baker, D., & DiMaio, F. (2025). Accelerating Biomolecular Modeling with AtomWorks and RF3. bioRxiv. https://www.biorxiv.org/content/10.1101/2025.08.14.670328v2 ↩︎
Chen, Y., Lu, T., Zhao, C., Wayment-Steele, H., & Huang, P. (2025). SLAE: Strictly Local All-atom Environment for Protein Representation. bioRxiv. https://www.biorxiv.org/content/10.1101/2025.10.03.680398v1 ↩︎
Geffner, T., Didi, K., Zhang, Z., Reidenbach, D., Cao, Z., Yim, J., Geiger, M., Dallago, C., Kucukbenli, E., Vahdat, A., & others. (2025). Proteina: Scaling flow-based protein structure generative models. arXiv preprint arXiv:2503.00710. https://arxiv.org/abs/2503.00710 ↩︎
Schneuing, A., Harris, C., Du, Y., Didi, K., Jamasb, A., Igashov, I., Du, W., Gomes, C., Blundell, T. L., Lio, P., & others. (2024). Structure-based drug design with equivariant diffusion models. Nature Computational Science, 4(12), 899–909. https://www.nature.com/articles/s43588-024-00737-x ↩︎
Geffner, T., Didi, K., Cao, Z., Reidenbach, D., Zhang, Z., Dallago, C., Kucukbenli, E., Kreis, K., & Vahdat, A. (2025). La-proteina: Atomistic protein generation via partially latent flow matching. arXiv preprint arXiv:2507.09466. https://arxiv.org/abs/2507.09466 ↩︎

Dealing with the flood of protein structures

2024-04-07T00:00:00+00:00

With the explosion of protein structure prediction and the sheer number of predicted protein structures available in databases nowadays, we can ask exciting new questions that would have been unanswerable only a few years ago. However, we need new tools in order to answer these questions and deal with the flood of structural data. In this post, I describe a few of these new tools and the reasoning behind them.

How protein structure prediction changed the game
FoldComp: compressing protein structures to manageable sizes
MMseqs2: sequence alignment in speed-mode
FoldSeek: structural clustering of the protein universe
Applications: clustering the protein universe

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{didi2024proteinstructureuniverse,
  author = {Didi, Kieran},
  title = {Dealing with the flood of protein structures},
  url = {https://kdidi.netlify.app/blog/proteins/2024-04-07-protein-structure-universe/},
  year = {2024}
}

How protein structure prediction changed the game

The PDB as a database of experimental protein structures keeps growing, currently standing at nearly 218k entries. However, it seems small compared to the AlphaFoldDB (>200m) and ESMAtlas (772m structures), powered by the recent advances in protein structure prediction via methods like AlphaFold2 and ESMFold.

This development changed the game in protein biology. While until recently the gap between available protein sequences and structures widened further and further, we suddenly have a wealth of structural information that was unimaginable a decade ago. This quote from Mohammed AlQuraishi (Columbia University) sums up this paradigm shift well:

Everything we did with protein sequences we can now do with protein structures

While that is theoretically true and a very exciting prospect, there is one big problem: we do not have tools to deal with such amounts of structural data. Here is a visual comparison between the size of the PDB and the AFDB:

Visual comparison of the size of the PDB vs the AFDB. Source: YouTube

You can see that we deal with a different order of magnitude in data here. This brings up a plethora of issues, starting from pure memory usage (the storage for AFDB is 23 TB) to questions of how we move these enormous amounts of data and also process them.

Many groups have developed tools in recent years to tackle this issue. The Steinegger lab in particular has produced some fantastic tools in this space, three of which I want to present in this blogpost: FoldComp for structure compression, FoldSeek for structure clustering and MMseqs2 for sequence clustering (also very important in this context for generating both input MSAs and training splits).

Tools from the Steinegger Lab. Source: YouTube

FoldComp: compressing protein structures to manageable sizes

Paper
Talk
The trouble with compression

A perfect compression format satisfies all three of these conditions:

The compressed files are small.
The compression and decompression algorithms are fast.
The reconstruction is either lossless or (if lossy) has minimal reconstruction error.

Fulfilling all of these at the same time is hard, so one always has to think about how to balance between them.

As described in the first section of this post, there have been efforts for compressed protein structure formats such as MMTF or binaryCIF. However, given the sheer amount of predicted protein structures, the authors decided that more efficient algorithms are needed.

People have tried this in the past by taking inspiration from image compression algorithms such as PNG and JPEG, as in the example of the PIC algorithm. These lossless formats are great since they reconstruct your data perfectly, but by focusing on reconstruction quality they often leave some performance on the table in terms of both speed and size.

Therefore, looking into lossy compression formats often pays off if you are fine with paying a small penalty in terms of reconstruction error. Since our measurements of protein structures contain measurement errors anyway, we can often pay this penalty and still get great results for our biological problems, for example energy calculations from MD trajectories.

The FoldComp compression scheme

In this spirit, Kim et al. from the Steinegger lab decided to build a lossy compression format that converts the nearly 100 bytes of 3D coordinates per residue into only 13 bytes of compressed internal coordinates (in this case torsion angles).

FoldComp compression scheme. Source: YouTube

As you can see in that graphic, they do not only save the backbone and side-chain torsion angles, but also bond angles. This should in theory not be necessary since one should be able to reconstruct the full-atom structure by just using torsion angles. However, this theory assumes an idealised protein backbone geometry with constant bond angles, which is a bit too simplistic in practice to get very low reconstruction error. Encoding these bond angles improves the reconstruction a lot.

To keep the space occupied by both torsion and bond angles from becoming too demanding, they employ a quantisation step in which both of these quantities are saved as discretised, pre-defined values. This procedure is also commonly known as binning and has been used extensively in machine learning for weights and activations as well as for optimiser states, up to the extreme of recent 1-bit LLMs.
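A minimal sketch of such a binning step is shown below. The bit width of 11, the uniform bins over $[-\pi, \pi)$ and the function names are assumptions for illustration only; the actual bit allocation used by FoldComp is not specified here.

```python
import numpy as np

def quantise(angles, n_bits):
    """Map angles in [-pi, pi) to integer bin indices."""
    n_bins = 2 ** n_bits
    step = 2 * np.pi / n_bins
    return np.floor((angles + np.pi) / step).astype(np.int64) % n_bins

def dequantise(codes, n_bits):
    """Map bin indices back to the centre of each bin."""
    step = 2 * np.pi / (2 ** n_bits)
    return -np.pi + (codes + 0.5) * step

n_bits = 11                                   # hypothetical bit width
angles = np.linspace(-np.pi, np.pi, 1000, endpoint=False)
codes = quantise(angles, n_bits)
max_error = np.max(np.abs(dequantise(codes, n_bits) - angles))
half_step = np.pi / 2 ** n_bits               # worst case: half a bin width
```

Each extra bit halves the worst-case angular error, which is the knob that trades file size against reconstruction accuracy.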

NeRF and the lever-arm effect

Saving the actual bond angles helped to lower the reconstruction error for the first few residues that were reconstructed. However, the longer the polymer chain gets, the bigger the reconstruction error becomes down the line. This problem is related to a phenomenon known as the lever-arm effect in engineering. It describes the propagation of an error in a rotation measurement through a series of successive measurements, with the error magnitude increasing with the distance between the original measurement and the reconstruction.

To understand this in the context of proteins, let’s look at the method FoldComp and others in the field use to convert the stored torsion angles back into 3D coordinates: the NeRF (Natural Extension Reference Frame) method (unrelated to the NeRF in machine learning which stands for Neural Radiance Fields).

There have been multiple versions of NeRF such as pNeRF and MP-NeRF that make it more efficient via parallelisation, but the basic algorithmic ideas stay the same:

We can place our first backbone atom wherever we want and define this as our origin: $A_1(0,0,0)$
Given a first backbone atom ( $A_1$ ), we can place the second one arbitrarily in space and just constrain its position by the known bond distance $d_1$ : $A_2(0,0,d_1)$
Given the first two backbone atoms ( $A_1, A_2$ ), we can place the third one in space by using the literature bond distance $d_2$ and angle $\theta_1$ : $A_3(0, \sin(\theta_1) * d_2, d_1 - \cos(\theta_1) * d_2)$
Given the first three backbone atoms ( $A_1, A_2, A_3$ ), we can place the fourth one in space by using the literature bond distance ( $d_3$ ), the literature (or saved in the case of FoldComp) angle $\theta_2$ and the stored torsion angle $\tau_1$ . To do this, we first define a new coordinate system called the specialised reference frame, centered at $A_3$ , using spherical coordinates and place $A_4^*$ there:

\begin{aligned} A_4^* &= (d_3 \cos(\theta_2), d_3 \cos(\tau_1) \sin(\theta_2), d_3 \sin(\tau_1) \sin(\theta_2)) \end{aligned}

Calculation of $A_4^*$ in the specialised reference frame.

We then rototranslate $A_4^*$ from that specialised reference frame back to our original coordinate system via $A_4 = RA_4^* + A_3$ with

\begin{aligned} R &= [\hat{A}_{2-3}, \hat{n} \times \hat{A}_{2-3}, \hat{n}] \\ \hat{A}_{2-3} &= \frac{A_2 A_3}{\lvert A_2 A_3 \rvert} \\ \hat{n} &= \frac{A_1 A_2 \times \hat{A}_{2-3}}{\lvert A_1 A_2 \times \hat{A}_{2-3} \rvert} \end{aligned}

Rototranslation of $A_4^*$ back to the original coordinate system to form $A_4$ .

We can repeat step 4 for all subsequent atoms until we reach the end of the polymer chain.

Reconstruction of the side chains works in a similar way, just using different values for bond distances, bond angles and torsion angles.
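The placement steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the FoldComp code: the function names and all numeric values (bond lengths, angles) are placeholders. Note that with the formula for $A_4^*$ as written, $\theta_2$ is measured between the extended $A_2 \to A_3$ direction and the new bond, so the interior bond angle at $A_3$ is $\pi - \theta_2$.

```python
import numpy as np

def place_atom(a, b, c, bond_length, theta, tau):
    """Place the next atom after a, b, c following the NeRF formulas in the text."""
    bc_hat = (c - b) / np.linalg.norm(c - b)
    n = np.cross(b - a, bc_hat)
    n_hat = n / np.linalg.norm(n)
    # Position in the specialised reference frame centred at c.
    local = bond_length * np.array([
        np.cos(theta),
        np.cos(tau) * np.sin(theta),
        np.sin(tau) * np.sin(theta),
    ])
    # Columns of R are the frame axes expressed in the original coordinates.
    R = np.column_stack([bc_hat, np.cross(n_hat, bc_hat), n_hat])
    return R @ local + c

def dihedral(a, b, c, d):
    """Signed torsion angle defined by four consecutive atoms."""
    b1, b2, b3 = b - a, c - b, d - c
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    y = np.dot(np.cross(n1, n2), b2 / np.linalg.norm(b2))
    return np.arctan2(y, np.dot(n1, n2))

# Illustrative values only (roughly peptide-like, not taken from the paper).
d1, d2, d3 = 1.46, 1.52, 1.33
theta1, theta2, tau1 = np.deg2rad(111.0), np.deg2rad(64.0), np.deg2rad(-60.0)

A1 = np.zeros(3)                                                      # step 1
A2 = np.array([0.0, 0.0, d1])                                         # step 2
A3 = np.array([0.0, np.sin(theta1) * d2, d1 - np.cos(theta1) * d2])   # step 3
A4 = place_atom(A1, A2, A3, d3, theta2, tau1)                         # step 4
```

Calling `place_atom` repeatedly on the last three placed atoms extends the chain one atom at a time, which is exactly why every new position depends on all previous ones.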

NeRF algorithm. Source: Structural Bioinformatics Library

As a fun fact, and another anecdote about how small the world of science is: Charlie Strauss, the senior author of the original NeRF paper from 2005, is from Seattle and did a summer job as a high schooler with Prof Tillman at the University of Washington, working on Mars meteorology. That gave him both the inspiration and the grit to go into science, and he ended up at the Los Alamos National Laboratory, where he supervised the NeRF paper. In another unexpected twist, in the 90s he took a year-long sabbatical at UW, working in the lab of David Baker and improving their Rosetta algorithm for protein structure prediction. Yes, you heard right, the David Baker whose lab became synonymous with protein design and is still a pioneer in that field. What a funny world we live in.

Now with this NeRF algorithm at hand, we can go about reconstructing our 3D Cartesian coordinates from our internal coordinates represented as torsion and bond angles. There is only one problem: the previously mentioned lever-arm effect.

We will get pretty accurate reconstruction for the first few residues, but small errors will accumulate since every reconstruction step is only relative to the previous ones. You can imagine this with an analogy: let’s say you want to follow a route on Google Maps. The routing instructions are successive relative statements (“turn left in 100 meters”, “go straight for 1km”, …), similar to how the torsion and bond angles during NeRF reconstruction are relative reconstruction steps. What you want in the end however is the full path to your correct destination; you are therefore “reconstructing” the correct path from these relative instructions.

Now if you follow the instructions very carefully at the start and only turn left instead of right close to the destination you will still be very close to your actual destination; your reconstruction error is low. However, if you start off your journey by taking the wrong exit in a roundabout and just keep following the instructions (ignoring that Google Maps will try to course-correct you), you will end up god-knows where! The error you made at the beginning propagates down to all of your successive steps and will accumulate, leading to a massive reconstruction error at the end.

The same is happening for the NeRF algorithm: a small reconstruction error at the start will lead to a large reconstruction error later along the peptide backbone, leaving you with a poorly reconstructed protein.
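A toy 2D chain makes the effect easy to see. In this sketch (chain length, curvature and the one-degree error are arbitrary choices, and `chain_2d` is a hypothetical helper), the same angular error is introduced once at the first joint and once at the last joint, and the resulting displacement of every atom is measured.

```python
import numpy as np

def chain_2d(turns, bond=1.0):
    """Build a 2D chain from relative turn angles (radians), one per bond."""
    headings = np.cumsum(turns)                  # absolute direction of each bond
    steps = bond * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return np.vstack([[0.0, 0.0], np.cumsum(steps, axis=0)])

n_bonds = 100
turns = np.full(n_bonds, 0.01)                   # gently curved reference chain
eps = np.deg2rad(1.0)                            # a one-degree angular error

early = turns.copy()
early[0] += eps                                  # error at the first joint
late = turns.copy()
late[-1] += eps                                  # same error at the last joint

ref = chain_2d(turns)
err_early = np.linalg.norm(chain_2d(early) - ref, axis=1)
err_late = np.linalg.norm(chain_2d(late) - ref, axis=1)
```

An error at the first joint rotates the whole chain around the origin, so each atom moves by `2 * sin(eps / 2)` times its distance from the start, whereas the same error at the last joint only moves the final atom by one bond length times that factor.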

This phenomenon is not new by any means, not even in the protein community: while the first protein structure prediction methods like RGN used recurrent networks based on torsion angle prediction for reconstructing protein backbones, later methods like AlphaFold2 instead leveraged transformer-based architectures that utilise parallel reconstruction directly in Cartesian space instead of sequential reconstruction via internal coordinates. Similar observations were made in protein structure generation: FoldingDiff, a diffusion model by Microsoft Research, leveraged a torsion-angle based representation to generate protein backbones, and while that worked well for relatively short proteins, they note on page 4 of the SI that for larger proteins lever-arm effects play a role (although the model seems to be relatively robust in some cases).

The lever-arm solution: bidirectional NeRF and anchoring

While some machine learning algorithms like Int2Cart were developed to ameliorate the lever-arm problem, the FoldComp authors decided to stick with good-old NeRF and instead give it a boost via two approaches:

Bidirectionality: They start NeRF from both the N- and the C-terminus of the polypeptide chain and use a weighted average of both reconstructions at each position to get a better consensus position. This requires us to save the position of the first and last residue in Cartesian coordinates, since now we cannot place them arbitrarily in space, but need them to be at the correct distance and orientation to each other. This helps a lot with lowering the reconstruction error at the start and the end of the protein backbone, but leaves the center still relatively vulnerable to lever-arm effects.
Anchoring: if we now saved the first and the last amino acid, why stop there? Of course we do not want to save the 3D coordinates of every residue; if we do that we do not need a NeRF reconstruction to begin with. But the authors found that even doing that for every 25th amino acid in the backbone improved results dramatically, landing in a sweet spot where both memory requirements are still reduced a lot but reconstruction error is also way below experimental resolution accuracy (around 0.1 Angstrom for the backbone and around 0.15 for all-atom RMSD).
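The two ideas can be sketched with a 1D toy analogue, where a "structure" is just a list of positions and the stored internal coordinates are noisy step sizes. This is only an illustration under simplifying assumptions: the linear blending weights, the function names and the noise model are my own choices, not taken from FoldComp.

```python
import random

def forward(start, steps):
    """Positions rebuilt left to right from a stored start position."""
    out = [start]
    for s in steps:
        out.append(out[-1] + s)
    return out

def backward(end, steps):
    """Positions rebuilt right to left from a stored end position."""
    out = [end]
    for s in reversed(steps):
        out.append(out[-1] - s)
    return out[::-1]

def bidirectional(start, end, steps):
    """Blend both reconstructions (linear weights are an assumption)."""
    n = len(steps)
    fwd, bwd = forward(start, steps), backward(end, steps)
    return [((n - i) * f + i * b) / n for i, (f, b) in enumerate(zip(fwd, bwd))]

def anchored(anchors, steps, every):
    """Blend separately between consecutive anchors; anchors[k] is the true position at index k * every."""
    out = [anchors[0]]
    for k in range(len(anchors) - 1):
        segment = steps[k * every:(k + 1) * every]
        out.extend(bidirectional(anchors[k], anchors[k + 1], segment)[1:])
    return out

def max_err(reconstruction, truth):
    return max(abs(r - t) for r, t in zip(reconstruction, truth))

rng = random.Random(0)
true_steps = [1.0] * 100
noisy_steps = [s + rng.gauss(0, 0.01) for s in true_steps]  # lossy stored steps

truth = forward(0.0, true_steps)
fwd_only = forward(truth[0], noisy_steps)
both = bidirectional(truth[0], truth[-1], noisy_steps)
anch = anchored(truth[::25], noisy_steps, every=25)  # anchor every 25th position

print(max_err(fwd_only, truth), max_err(both, truth), max_err(anch, truth))
```

By construction the blended reconstruction is exact at both ends, and the anchored version is additionally exact at every anchor, so errors can only build up over at most half a segment before the other direction takes over.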

With these two improvements, they managed to strike a good balance: they are as fast as gzip when decompressing, a lot faster than other tools when compressing (10% of gzip), and reduce the storage requirements by a lot (2.9 GB vs the original 23 TB for the AFDB), all of this while maintaining very low reconstruction errors, making it a very useful tool for large-scale structural bioinformatics.

MMseqs2: sequence alignment in speed-mode

Paper
Talk

Wait, you might say, you promised tools for large-scale protein structure analysis; why are we discussing a sequence alignment method?

Bear with me, for I have my reasons:

Sequence alignment and clustering is one of the most-studied topics in bioinformatics and underpins many of the technologies and scientific discoveries made in the last decades, so it is generally something to be aware of
Even as part of machine learning approaches for protein structure, sequence alignment and clustering is often used to create meaningful splits for training and test datasets (for more info I gave this lecture about that topic)
We will see later that structure alignment tools like FoldSeek reuse many of the components and ideas from MMseqs2, so it is useful to have it in the back of your mind.

Why do we need fast sequence alignment?

With that out of the way, what is MMseqs2 and which problem does it solve?

MMseqs2 (Many-against-Many sequence searching) is a tool that allows you to align and search protein sequences in a high-throughput manner while still retaining sensitivity. One application of this is metagenomics, where we get billions of possible ORFs (Open Reading Frames) from cheap DNA sequencing, but then need to search for potential hits in massive online databases like UniProt or KEGG to confirm that these potential ORFs are actually real genes. The exponential growth of sequencing data leads to a rare situation here where the cost of the computational analysis by far exceeds the actual sequencing cost, making the sequence search part of the pipeline the real bottleneck.

Another application that might be a bit closer to home is MSA generation. Algorithms like AlphaFold heavily rely on MSAs as input to extract coevolutionary information and predict the structure of the input sequence. While the original AlphaFold2 used tools like JackHMMER and HHBlits for MSA generation, these profile-HMM-based tools are still relatively slow (although a lot faster than the original Viterbi or Forward algorithms that are classically used for scoring in hidden Markov models). By using MMseqs2 instead for this particular application, ColabFold achieved a 40-60 times faster search and enabled everyone to predict protein structures via Google Colaboratory.

Prefiltering is key

How does it get this massive speed-up? The gold standard for sequence alignment is still dynamic programming in the form of the Needleman-Wunsch algorithm for global sequence alignment or the Smith-Waterman algorithm for local sequence alignment. These algorithms give the optimal alignment, but take $O(nm)$ time for aligning two sequences of lengths $n$ and $m$ and are therefore impractical for many applications.
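To see where the $O(nm)$ cost comes from, here is a minimal Smith-Waterman sketch that only computes the best local alignment score. The scoring scheme (match +2, mismatch -1, linear gap penalty -1) is a toy assumption; real tools use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):          # n rows ...
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # ... times m columns
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, hence the 0
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Every cell of the $n \times m$ matrix is visited once, which is fine for a pair of sequences but hopeless if you have to do it for billions of query-target pairs.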

Many new tools still use these algorithms in the backend, but put a harsh prefilter before them so that the search space is reduced by multiple orders of magnitude while discarding as few true positives as possible, passing only the most promising candidates for alignment to the expensive dynamic programming algorithms. MMseqs2 is no different: its biggest selling point is the strong prefilter that is based on k-mers; to be more precise, it looks for 2 consecutive 7-mers on a diagonal, and we will now spend some time to try and understand that statement.

The MMSeqs2 prefilter is divided into 4 different stages that correspond to nested for-loops:

As a preprocessing step, we take all our target sequences we might align query sequences to and create a precomputed index table of 7-mers that will allow fast 7-mer lookup. Each kmer acts in this index table as a key, and the corresponding value contains an index for the target sequence and an index for the position in that target sequence, uniquely identifying the position of that k-mer in the target sequence database.
We now enter our first for-loop by processing each query sequence one by one, producing all possible 7-mers in a sliding-window fashion.
For each of these k-mers, we now produce a list of similar k-mers, where similarity is judged by some score threshold, either via a BLOSUM score or some profile score that judges how similar the generated k-mer is to the query k-mer.
For each k-mer in that list, we now query the precomputed index table and see if we find a hit via our k-mer lookup. If we find a hit, we proceed to our fourth and last nested for-loop.
We now check if the offset between the position of that k-mer in the query and in the target sequence is the same one we observed the last time we checked. If that is the case, it means we already found two k-mers that match between the two sequences at the same offset, which is a sign that these two sequences have a high chance of having a good alignment. This process is often visualised by plotting the query position on the x axis and the target position on the y axis and checking whether both k-mers occur on the same diagonal. If we find two such hits as just described, MMseqs2 calls this a double diagonal hit and saves that sequence for more detailed analysis later.
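The stages above can be sketched in a few lines of Python. This is a heavily simplified toy version, not the MMseqs2 implementation: it uses 3-mers instead of 7-mers to keep the example small, and it skips the generation of similar k-mers by only looking up the exact query k-mer. The function names are made up for this sketch.

```python
from collections import defaultdict

def build_index(targets, k):
    """Preprocessing: map every k-mer to the (target id, position) pairs where it occurs."""
    index = defaultdict(list)
    for tid, seq in targets.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((tid, pos))
    return index

def double_diagonal_hits(query, index, k):
    """Return target ids with two consecutive k-mer matches on the same diagonal."""
    last_diagonal = {}  # target id -> diagonal of its most recent k-mer match
    hits = set()
    for qpos in range(len(query) - k + 1):            # all query k-mers
        kmer = query[qpos:qpos + k]
        for similar in [kmer]:                        # similar k-mers (exact only here)
            for tid, tpos in index.get(similar, ()):  # index lookup
                diagonal = qpos - tpos
                if last_diagonal.get(tid) == diagonal:  # same offset as last time?
                    hits.add(tid)
                last_diagonal[tid] = diagonal
    return hits

targets = {"t1": "MKTAYIAKQR", "t2": "GGGGGGGGGG", "t3": "AAATAYAAA"}
index = build_index(targets, k=3)
query = "TAYIAK"
print(double_diagonal_hits(query, index, k=3))
```

Here `t3` shares a single 3-mer with the query, which is not enough; only `t1` produces consecutive matches on the same diagonal and survives the prefilter.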

MMSeqs2 Prefilter algorithm (a lot going on in that figure, but hopefully the description helps). Source: MMSeqs2 Paper

This prefilter already cuts down the number of hits by a lot. However, the result is still too expensive for a full Smith-Waterman alignment. Part of what makes this dynamic programming algorithm very expensive is the possibility to include gaps in the alignment. Therefore, as an additional filter, the sequences that gave double diagonal hits undergo an ungapped alignment that is relatively fast (although slower than the prefilter). If the best diagonal of that alignment has a score above a predefined threshold, we finally do a proper gapped alignment and get our final result out.
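A rough sketch of what an ungapped alignment along a single diagonal could look like is shown below. The scoring values and the threshold are arbitrary toy assumptions, and the function name is hypothetical; the point is only that scoring one diagonal is a single linear pass instead of filling a whole matrix.

```python
def ungapped_diagonal_score(query, target, diagonal, match=2, mismatch=-1):
    """Best-scoring gap-free segment along one diagonal (diagonal = qpos - tpos)."""
    best = running = 0
    for qpos in range(len(query)):
        tpos = qpos - diagonal
        if 0 <= tpos < len(target):
            running += match if query[qpos] == target[tpos] else mismatch
            running = max(running, 0)  # restart the segment once the score drops below zero
            best = max(best, running)
    return best

query, target = "TAYIAK", "MKTAYIAKQR"
threshold = 8  # arbitrary cut-off for this toy example
score = ungapped_diagonal_score(query, target, diagonal=-2)
if score >= threshold:
    print("pass on to the gapped alignment, score", score)
```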

MMSeqs2 progressively filters out hits and passes them to more and more expensive alignment stages. Source: YouTube

In addition to that, the authors play all tricks in the hardware book to be fast, from AVX2 instructions that allow 32 1-byte operations like add/mult/max to be computed in parallel per CPU clock cycle to optimising CPU cache allocation in the double diagonal hit matching stage and vectorizing both the ungapped and gapped alignment stages.

Use the prefilter for clustering

The prefiltering algorithm is not only useful for alignments, but also for sequence clustering, which is useful, for example, for creating biologically relevant train-test splits in machine learning. To cluster a sequence set with MMseqs2, we run it either just through the prefiltering or optionally also through the alignment module and then use the output similarity graph as an input to a clustering algorithm of our choice.
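As an illustration of the last step, here is a small sketch of one possible clustering algorithm that takes such a similarity graph as input: a greedy set cover that repeatedly picks the sequence with the most still-unassigned neighbours as cluster representative. I use it purely as an example of "a clustering algorithm of our choice"; the graph and the function name are made up.

```python
def greedy_set_cover(neighbours):
    """Cluster a similarity graph given as {sequence id: set of similar sequence ids}."""
    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # pick the sequence covering the most unassigned sequences (ties: alphabetical)
        rep = max(sorted(unassigned),
                  key=lambda s: len((neighbours[s] | {s}) & unassigned))
        members = (neighbours[rep] | {rep}) & unassigned
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters

graph = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
print(greedy_set_cover(graph))
```

For a train-test split you would then assign whole clusters, not individual sequences, to either side of the split.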

If we choose the easy-cluster mode of MMSeqs2, it will just pass that similarity graph to a classic cascaded clustering algorithm. If we want to cluster large datasets, we can instead use the easy-linclust command that leverages the Linclust algorithm to cluster sequence sets in linear time, again using k-mer based analysis workflows.

Another cool property of MMseqs2 clustering is the possibility to add new sequences to an existing clustering while maintaining stable cluster identifiers, eliminating the need to recluster the entire sequence set.

FoldSeek: structural clustering of the protein universe

Paper
Talk

Sequence alignment as described before is one of the main pillars in bioinformatics and useful for a variety of applications, from detecting homology to creating training splits for machine learning models.

However, when talking about protein structure, sequence alignments do not always tell the full story: in many cases, proteins may have very different sequences but very similar structures. This could be due to remote homology such as in the case of ubiquitin and its mysterious cousin SUMO, which have been separated by more than 1 billion years of evolutionary history but are still strikingly similar structurally despite a sequence identity of only 16%.

This makes the idea of structural alignment and structural clustering very appealing: with this, you could detect these very remote homologies while also preventing your machine learning models that deal with protein structures from cheating via such examples.

However, structure alignment is quite complex: as described before, we can find an optimal solution for aligning a sequence of length $n$ to a sequence of length $m$ via dynamic programming in $O(nm)$ time since we need $n*m$ operations to populate the whole dynamic programming matrix.

For structure alignment, the problem is a lot more complicated due to the absence of natural local bounds: if we change a sequence alignment at some position, a previously aligned segment somewhere else stays unchanged. Since structural alignment operates via concerted 3D rototranslations, the introduction of gaps outside an aligned region might still affect the already aligned region due to residues that are close in 3D but far in sequence space.

Therefore, structural alignment algorithms like Dali based on distance matrices and TM-align based on the TM-score are relatively slow, preventing their application on the new scale of data we face (TM-align would need around a year to search through the AFDB on a single CPU core). Foldseek, on the other hand, is 4-5 orders of magnitude faster and therefore suitable for such large-scale searches.

Structure to Sequence: the 3Di alphabet

How is that done? The main idea is to translate structural information into some kind of sequence-based representation that allows the use of fast sequence alignment tools. This has been tried before with tools like CLePAPS and mulPBA, but has not found widespread use due to them only describing secondary backbone structure. These tools build on the three-letter code of helix, sheet and coil and refine it further by describing the backbone around a single residue by one of 10-20 letters. This increases the speed by reducing the problem to sequence alignment, but only captures helical and sheet-like regions well, while the large amount of information in loop regions is not captured well due to the structure there being mostly determined by interactions between different residues. In addition, neighboring residues are highly correlated (helices or sheets stretch for quite a bit in a protein), making that encoding even less informative.

FoldSeek does away with this and instead describes the tertiary instead of the backbone secondary structure via a 20-letter alphabet called 3Di. More specifically you do the following:

Select a residue to encode and its nearest 3D neighbor. They started defining “nearest” as “smallest CB-CB distance”, but then replaced that with the concept of a virtual center for reasons explained later.
Get the CA atoms of these two residues as well as the CA atoms of the residues before and after them in the sequence (in total 6), extract distance- and angle-based features from this 6-atom constellation and collect them in a 10D-descriptor.
Discretise this information into one of the 20 letters from the 3Di alphabet.

FoldSeek stages in part b of the figure. We will come back to part a. Source: FoldSeek Paper

We will talk in more detail about steps 1 and 3 of this process, but you can see how the resulting 3Di sequence can be fed into any sequence-based program to get a structural alignment or clustering. In the paper, the authors show that they can do that with similar sensitivity as actual structural alignment programs, but at a fraction of the computational cost.

Virtual centers optimise conservation of interactions and tertiary vs. local interactions

The virtual center described above is determined by a pre-specified procedure described in the SI (Suppl. Fig. 1):

It lies on the plane defined by N, CA and CB
CB, CA and the virtual center form a 90 degree angle
The CA-virtual center distance is twice the CA-CB distance

Construction of the virtual center in FoldSeek. Source: (Suppl. Fig. 1)

In the case of glycine, a virtual CB is approximated by idealising the backbone geometry as a tetrahedron.
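The three geometric constraints listed above are enough to write down a small sketch of the construction. Note that they still leave two candidate points (one on either side of the CA-CB axis within the plane); which one is used is not stated here, so the `away_from_n` flag below is a hypothetical parameter of this sketch, and the coordinates are made up.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def norm(a):
    return math.sqrt(dot(a, a))

def virtual_center(n, ca, cb, away_from_n=True):
    """Point in the N-CA-CB plane, at 90 degrees to CA->CB, at twice the CA-CB distance."""
    u = sub(cb, ca)
    length = norm(u)
    u = [x / length for x in u]
    v = sub(n, ca)
    # remove the component along CA->CB to get the in-plane perpendicular direction
    proj = dot(v, u)
    p = [x - proj * y for x, y in zip(v, u)]
    p_len = norm(p)
    sign = -1.0 if away_from_n else 1.0  # side of the CA-CB axis: an assumption
    return [c + sign * 2.0 * length * x / p_len for c, x in zip(ca, p)]

# made-up coordinates in Angstrom, roughly backbone-like
n, ca, cb = [1.46, 0.0, 0.0], [0.0, 0.0, 0.0], [-0.53, 1.43, 0.2]
vc = virtual_center(n, ca, cb)
print(vc, norm(sub(vc, ca)) / norm(sub(cb, ca)))
```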

Why is this better than just taking the CB-CB distance? Two reasons:

Conservation of Interactions: we want to make sure that in the case of structurally aligning two homologs, the nearest neighbor of residue $i$ in structure 1 is the same as for residue $i$ in structure 2. If this were not the case and we chose a different nearest neighbor, the extracted 10D descriptor would look different, we would assign the residue different 3Di letters in the two structures and the structural alignment would fail. Empirically, they found that the CB-CB distance is not a great criterion for that and therefore came up with the virtual center definition that fulfills this desideratum more often.
Tertiary vs. local interactions: One of the downsides of the previous alphabets such as CLePAPS and mulPBA was that they have a lot of repeated information encoded by only describing local interactions as part of the secondary structure description (e.g. “these 10 residues all are in a helix”). If our 3Di alphabet ends up encoding mainly local interactions between neighbors in sequence (as would often be the case if we choose the CB-CB distance as criterion for nearest neighbor) then we end up in the same spot of mainly describing redundant local interactions. One can think about it from an information theoretic perspective in terms of mutual information: in the case of only encoding the amino acid identity, the mutual information between structurally aligned residues is the same no matter if we correct by correlation between neighbouring letters to account for local interactions or not. Other structural alphabets show a higher mutual information than pure amino acid encoding (i.e. performing only classic sequence alignment), but that difference shrinks a lot when we correct for the neighbor letter correlation. FoldSeek therefore aims to minimise the amount of local interactions it encodes and maximise the amount of tertiary interaction that is encoded. By moving the virtual center further away from the backbone and orienting it into a different direction than the CB, we achieve this goal of often encoding interactions between residues that are not neighbors in sequence.

Learning the 3Di alphabet via a VQ-VAE

Given the 10-dimensional descriptor that encodes distance- and angle-based features from the residue and its nearest neighbour as judged by the virtual center, how do we actually decide which of the 20 letters of the alphabet we assign this residue to? Well, one could do something simple like k-means clustering (which the authors started out with), but you can be smarter than that by considering the fact that our 3Di alphabet should learn maximally conserved structural states between homologs.

Therefore, the authors leverage a VQ-VAE (vector-quantized variational autoencoder) to learn the 3Di alphabet by first encoding the 10D descriptor via a 3-layer neural network encoder into a bottlenecked representation, then mapping it to one of the 20 discrete 3Di states (that is where the VQ part comes in) and then reconstructing the 10D descriptor again via a 3-layer neural network decoder. The crucial part here is that the reconstruction target is not the input as it is for classic VAEs. Just reconstructing the exact 10D descriptor could lead to overfitting on the exact values instead of encoding features that allow us to identify conserved states between structures. Therefore, our reconstruction target is the 10D descriptor of a structurally aligned homolog. The structural alignment in this case was part of training dataset preparation via one of the more expensive classical tools.

By targeting not the same 10D descriptor but the descriptor of a homolog, the VQ-VAE is forced to encode a discretised representation that is useful for identifying homologs, exactly the use case we are building this algorithm for. This procedure is quite clever and can be seen as similar to denoising autoencoders, where instead of swapping out the output, the input is corrupted with some noise in order for the network to learn a useful representation and avoid overfitting.
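The "VQ part" itself is simple once the model is trained: each residue's bottleneck vector is assigned to the closest of 20 codebook vectors, and the index of that vector becomes the letter. Here is a minimal sketch of that lookup. The 2D bottleneck, the grid-shaped codebook (a stand-in for learned centroids) and the labelling of states with the 20 amino acid letters are assumptions made for this illustration.

```python
LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # 20 state labels (assumed labelling)

# stand-in for a learned codebook: 20 points on a 5 x 4 grid
codebook = [[float(i % 5), float(i // 5)] for i in range(20)]

def quantize(z, codebook):
    """Index of the codebook vector closest to z (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
    return dists.index(min(dists))

def encode_structure(latents, codebook):
    """Turn one bottleneck vector per residue into a string of state letters."""
    return "".join(LETTERS[quantize(z, codebook)] for z in latents)

# three "residues": two exact centroids and one slightly perturbed copy
latents = [codebook[3], codebook[17], [c + 1e-3 for c in codebook[3]]]
print(encode_structure(latents, codebook))
```

The perturbed vector still lands on the same letter as its unperturbed version, which is the behaviour we want for small structural differences between homologs.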

Speeding things up by building on MMseqs2

We have now trained our VQ-VAE and can use it to encode a protein structure into a 3Di sequence. We could just leave it there and leverage good-old dynamic programming via Smith-Waterman to get local alignments. But the authors were aiming for speed, so they did not stop there and took inspiration from their MMseqs2 sequence aligner described above. In fact, they use exactly the same pipeline!

In part a, we can see that Foldseek uses the same prefilter and alignment modules as MMseqs2. Source: FoldSeek Paper

Since the 3Di representation is just a sequence, we can plug that sequence into the MMseqs2 prefilter and alignment modules and get ultra-fast structural alignment. We can benefit from the clever prefilter design as well as the hardware optimisations like AVX2 instructions, optimised CPU cache, vectorisation and so on.

Applications: clustering the protein universe

Using the turbo tandem of MMSeqs2 and FoldSeek as well as integrating these advancements into structure prediction methods via ColabFold has led to a flurry of new research directions.

For one, both sequence and structure clustering is now possible on scales that were not imaginable before. The Uniclust databases were created by sequence-similarity based clustering via MMseqs2 at 90%, 50% and 30% pairwise sequence similarity. The resulting databases showed better consistency of functional annotations than the corresponding UniRef databases, arguably due to the better clustering algorithms.

Using a combination of MMSeqs2 and Foldseek, it was possible to perform clustering on the whole AlphaFold database, identifying new putative homologs that demonstrate the value of such a resource for studying protein evolution and function on such a large scale.

Other applications opened up in phylogenetics, the study of evolutionary relationships among biological entities such as species or individuals: the use of Foldseek enabled fast homology detection via structural phylogenetics for proteins in the twilight zone, meaning that their sequence similarity is already very low but remote homology via structural similarity is still possible. In another study, a combination of MMSeqs2, ColabFold and FoldSeek enabled cross-phyla protein annotation, a task considered very challenging. Even more, protein structure prediction methods themselves were improved by applying MMSeqs2 to the Sequence Reads Archive (SRA), resulting in petabase-scale homology search and the construction of better MSAs (seems like in protein structure prediction we are now back to the old game of “who has the bigger MSA”).

While the tools as they stand right now are amazing, the algorithms behind them can still be improved. This includes making the last Smith-Waterman alignment more efficient via algorithms such as BlockAligner that uses adaptive dynamic programming with SIMD-acceleration, or even making it differentiable in order to backpropagate through the MSA construction step and enable full end-to-end-learning.

At the same time, it is still worthwhile looking for other approaches to these challenges. Some of them include SWAMPNN structure alignment via ProteinMPNN that is more sensitive than FoldSeek while still being faster than many of the classical algorithms, as well as language models used to perform protein search and annotation. All in all, one can say that we can now indeed do many of the things we can do with sequences also with structures, and it will be exciting to see the scientific discoveries that result from that endeavour!

How to accelerate PyTorch on your GPU

2024-03-04T00:00:00+00:00

Recently the CUDA MODE lecture series started with some amazing talks about how you can use tools like CUDA or Triton to speed up your PyTorch programs (join the Discord in case you are interested to learn more). Here I want to summarise and review some of the concepts and tools from the lecture and write them together in a coherent blog post.

1. Profiling
2. Integrating CUDA kernels into PyTorch
- 2.1 load_inline function
- 2.2 Numba
3. Integrate Triton kernels into PyTorch
4. torch.compile
Credits

1. Profiling

Profiling is the process of measuring the time and resources that a program uses. It is a crucial step in the development of any software, as it allows you to identify bottlenecks and areas for improvement. In the context of GPU programming, profiling is especially important, as the performance of a GPU program can be highly dependent on factors such as memory access patterns, kernel launch configurations, and the specific hardware being used. It is also not trivial to profile GPU code, as the operations are executed asynchronously on the GPU and we cannot simply measure execution time like we would with CPU code. In the following sections are a few tools to get you started on that for PyTorch code (for this you need to have access to a GPU, e.g. via Google Colab or a local machine with a CUDA-enabled GPU).

1.1 `torch.cuda.Event`

To profile the time a torch operation takes, you can use torch.cuda.Event. We cannot use the time module for this, because the operations are executed asynchronously on the GPU. Let us write a short function to profile the time a function call takes:

import torch
def time_pytorch_function(func, input):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    # Warmup
    for _ in range(5):
        func(input)
    start.record()
    func(input)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

We do a few warmup steps at the start to make sure that things like memory allocation calls, PyTorch’s JIT fuser and other things are not included in the timing. Then we record the start and end of the function call and synchronize the GPU to make sure that the timing is correct. For more details on these things see this blog post.

Let’s try this with a simple toy example:

# 10000 x 10000 float32 values take about 400 MB; 100000 x 100000 would need about 40 GB
b = torch.randn(10000, 10000).cuda()

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

print(time_pytorch_function(torch.square, b))
print(time_pytorch_function(square_2, b))
print(time_pytorch_function(square_3, b))
#output:
# 3.2753279209136963
# 3.272671937942505
# 3.2755520343780518

We can see that the multiplication a * a is slightly faster than the power operation a ** 2. However, we have no idea why this is happening; it is the same operation, so are they using different CUDA kernels? We can use the torch.autograd.profiler to find out.

1.2 `torch.autograd.profiler`

Fortunately, we do not have to write all profiling tools ourselves: PyTorch has a built-in profiler. Let us look again at the same operations:

print("=============")
print("Profiling torch.square")
print("=============")

# Now profile each function using pytorch profiler
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.square(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

print("=============")
print("Profiling a * a")
print("=============")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_2(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

print("=============")
print("Profiling a ** 2")
print("=============")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_3(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This gives us the following output:

We can see that a * a calls the faster aten::mul operation, while a ** 2 calls the slower aten::pow operation, explaining our previous results.

ATen is a C++ library that is part of the PyTorch C++ API. It is the foundational tensor and math library on which PyTorch is built and exposes the Tensor operations in PyTorch directly in C++. ATen is a very creative name, as it stands for “A tensor library”. You can hear more about the differences between the torch API and the ATen API in this podcast episode.

Let us now profile a simple neural network forward pass:

import torch
import torch.nn as nn
import torch.nn.functional as F

data = torch.randn(1, 1, 32, 32).cuda()
with torch.autograd.profiler.profile(use_cuda=True) as prof:
  output = torch.nn.Linear(32, 32).cuda()(data)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Which gives us the following output:

We can see that the aten::linear and the aten::addmm operations are the most time-consuming operations in this forward pass. In another post I dig into how one can find the actual implementation of these functions in the PyTorch codebase to understand what they actually do, but for now it is enough to know that aten::linear is the operation that applies a linear transformation to the input data and aten::addmm is the operation that performs a matrix multiplication of the input data with the weight matrix and adds a bias term.

1.3 `torch.profiler`

Another, more visual way to profile your code is to use torch.profiler. This is a more high-level interface to the profiler and allows you to export the profiling data to a Chrome trace file. Here is an example of how to use it:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,],    
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    ) as p:
        for iter in range(10):
            output = torch.nn.Linear(32, 32).cuda()(torch.randn(1, 1, 32, 32).cuda())
            # send a signal to the profiler that the next iteration has started
            p.step()

We still get a terminal output:

However, we also get a Chrome trace file that we can open in Chrome to visualize the profiling data:

We can see that the majority of the time is actually spent on the CPU, moving data to the GPU, whereas the actual matrix multiplication is quite fast (and uses a special CUDA kernel called volta_sgemm_32x32_sliced1x4_tn).

1.4 `ncu` profiler

The ncu profiler is a command-line tool that comes with the CUDA toolkit. It is a very powerful tool that allows you to profile your CUDA kernels in great detail. You invoke it by running ncu python script.py. It will then run your script, profile all the CUDA kernels that are called, and generate a report in the form of an ncu_logs file that contains helpful numbers and recommendations on how to optimize your code.

A similar tool from the CUDA toolkit is nsys, which also allows you to profile your code. It is however less focused on detailed CUDA kernel performance analysis, but more the overall system-wide performance, as well as understanding how the communication between CPU and GPU impacts performance.

We can mark code we want to profile via the torch.cuda.nvtx API that allows us to start capturing via range_push() and stop capturing via range_pop(). In the code example below, we profile a single linear layer; we also delay the start of profiling until iteration 10 to allow for warm-up time.

import torch
import torch.nn as nn
import torch.nn.functional as F

for i in range(20):
    if i == 10: torch.cuda.cudart().cudaProfilerStart()
    if i >= 10: torch.cuda.nvtx.range_push(f"Iteration {i}")
    data = torch.randn(1, 1, 32, 32).cuda()
    output = torch.nn.Linear(32, 32).cuda()(data)
    if i >= 10: torch.cuda.nvtx.range_pop()

The call to torch.cuda.cudart().cudaProfilerStart() indicates to NSys to only care about profiling from this iteration on.

To get the profiling output now, we need to install and use the nsys toolkit. There are many CLI options you can choose for it, but one of the simplest calls might be nsys profile -o output_profile python script.py. This will produce a file called output_profile.nsys-rep which you can then open in the NSight Systems UI (if you run your profiling on a remote machine, transfer the report file to your local machine so that you can run the GUI application). For a simple linear layer it will look something like this:

NSys Profiling report for a single linear layer in PyTorch.

We can see that the actual computation only takes a bit of time, while there is a long time before that gets spent on data transfer via calls to the CUDA API like MemCopy. Only at the end is the ampere_sgemm_32x32_sliced kernel called that performs the actual matrix multiplication in tiles of 32 by 32.

To profile more complex code like a whole ResNet for example, we can either set the profiling points still manually as described in this community post or we can use tools such as autonvtx that just wrap our model and deal with the profiling setup for us. Doing this for a simple ResNet results in the following profiler output:

NSys Profiling report for a ResNet in PyTorch.

In this case we can see that a way bigger chunk of time is spent on CUDA calls and actual computation. We also see that there are calls to the cuDNN backend for operations such as batch normalization.

NSys can seem overwhelming and is a bit more overhead to get set up compared to the options presented before, but it can give you some detailed insights as well as suggestions what kind of things to improve in your code.

2. Integrating CUDA kernels into PyTorch

CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia, and CUDA kernels are written in C++. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing. Since kernels are written in C++, it is not immediately obvious how to integrate them into our ML code, which is normally written in Python with libraries like PyTorch. However, there are several options.

2.1 `load_inline` function

The easiest way to integrate CUDA kernels into PyTorch is to use the torch.utils.cpp_extension module. This module allows you to compile C++ code into a shared library and then load it into Python. Here is an example of how to do this via the load_inline function for a simple element-wise matrix squaring operation:

import torch
from torch.utils.cpp_extension import load_inline

# Define the CUDA kernel and C++ wrapper
cuda_source = '''
__global__ void square_matrix_kernel(const float* matrix, float* result, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);

    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);

    return result;
}
'''

cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# Load the CUDA kernel as a PyTorch extension
square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    # build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))

# Output:
# tensor([[ 1.,  4.,  9.],
#         [16., 25., 36.]], device='cuda:0')

We see that the output is the same as if we used a PyTorch function. If we want to inspect the generated code, we can set the build_directory argument of the load_inline function to see the generated code in the specified directory.
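To make the index arithmetic inside the kernel more concrete, here is a minimal pure-Python sketch that emulates the launch on the CPU. The helper names (ceil_div, emulate_square_launch) are made up for illustration and are not part of CUDA or PyTorch, and the nested loops only stand in for threads that the GPU would run in parallel.

```python
def ceil_div(a, b):
    """Integer ceiling division, as used to size the grid."""
    return (a + b - 1) // b


def emulate_square_launch(matrix, block=(16, 16)):
    """Emulate the 2D grid/block indexing of square_matrix_kernel on the CPU."""
    height, width = len(matrix), len(matrix[0])
    flat = [v for row in matrix for v in row]  # row-major, like a contiguous tensor
    result = [0.0] * len(flat)
    grid = (ceil_div(width, block[0]), ceil_div(height, block[1]))
    idle_threads = 0
    for block_y in range(grid[1]):
        for block_x in range(grid[0]):
            for thread_y in range(block[1]):
                for thread_x in range(block[0]):
                    row = block_y * block[1] + thread_y
                    col = block_x * block[0] + thread_x
                    if row < height and col < width:  # same bounds guard as the kernel
                        idx = row * width + col
                        result[idx] = flat[idx] * flat[idx]
                    else:
                        idle_threads += 1
    return result, grid, idle_threads


result, grid, idle = emulate_square_launch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(result, grid, idle)
```

For the 2 by 3 example, a single 16 by 16 block is launched and most of its threads fall outside the matrix, which is why the bounds check in the kernel is needed.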

2.2 Numba

Another way to integrate CUDA kernels into PyTorch is to use the numba library. This is a just-in-time (JIT) compiler that translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. It can also be used to generate CUDA kernels.

from numba import cuda

# CUDA kernel
@cuda.jit
def square_matrix_kernel(matrix, result):
    # Calculate the row and column index for each thread
    row, col = cuda.grid(2)

    # Check if the thread's indices are within the bounds of the matrix
    if row < matrix.shape[0] and col < matrix.shape[1]:
        # Perform the square operation
        result[row, col] = matrix[row, col] ** 2

# Example usage
import numpy as np

# Create a sample matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

# Allocate memory on the device
d_matrix = cuda.to_device(matrix)
d_result = cuda.device_array(matrix.shape, dtype=np.float32)

# Configure the blocks
threads_per_block = (16, 16)
blocks_per_grid_x = int(np.ceil(matrix.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(np.ceil(matrix.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

# Launch the kernel
square_matrix_kernel[blocks_per_grid, threads_per_block](d_matrix, d_result)

# Copy the result back to the host
result = d_result.copy_to_host()

# Result is now in 'result' array
print(matrix)
print(result)

3. Integrate Triton kernels into PyTorch

3.1 Using Triton

Triton is both a domain-specific language (DSL) and a compiler for writing highly efficient GPU code. It actually does not generate CUDA code, but PTX code, which is a lower-level intermediate representation of the CUDA code (basically the assembly language of CUDA). Newer features in PyTorch like torch.compile actually leverage Triton kernels under the hood, so it is worth understanding how it works. Since Triton kernels are written in Python, they are easy to integrate with PyTorch. Here is an example of how to use Triton to write a simple element-wise matrix squaring operation:

# Adapted straight from https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html
import triton
import triton.language as tl
import torch

@triton.jit
def square_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # The rows are independent, so we parallelize across those
    row_idx = tl.program_id(0)
    # The stride represents how much we need to increase the pointer to advance 1 row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # The block size is the next power of two >= n_cols, so we can fit each
    # row in a single block
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    # Load the row into SRAM, using a mask since BLOCK_SIZE may be larger than n_cols;
    # masked-out lanes get a filler value that is never written back
    row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=0.0)

    square_output = row * row
    
    # Write back output to DRAM
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, square_output, mask=col_offsets < n_cols)


def square(x):
    n_rows, n_cols = x.shape
    # The block size is the smallest power of two greater than the number of columns in x
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Another trick we can use is to ask the compiler to use more threads per row by
    # increasing the number of warps (num_warps) over which each row is distributed.
    # The thresholds below are a manual heuristic carried over from the tutorial.
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    # Allocate output
    y = torch.empty_like(x)
    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row
    # of the input matrix
    square_kernel[(n_rows, )](
        y,
        x,
        x.stride(0),
        y.stride(0),
        n_cols,
        num_warps=num_warps,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y


torch.manual_seed(0)
x = torch.randn(1823, 781, device='cuda')
y_triton = square(x)
y_torch = torch.square(x)
assert torch.allclose(y_triton, y_torch), (y_triton, y_torch)

We see that the output of the Triton kernel is the same as the output of the PyTorch function.
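The masking logic is easy to miss in the kernel, so here is a minimal pure-Python sketch of what one kernel instance does with a single row. Note that next_power_of_2 below is a hand-written stand-in for triton.next_power_of_2 and masked_square_row is a hypothetical helper; nothing here runs on the GPU.

```python
def next_power_of_2(n):
    """Smallest power of two >= n (pure-Python stand-in for triton.next_power_of_2)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def masked_square_row(row, block_size):
    """Emulate one kernel instance: masked load, square, masked store."""
    n_cols = len(row)
    col_offsets = range(block_size)
    mask = [c < n_cols for c in col_offsets]
    # Masked-out lanes receive a filler value on load and are skipped on store.
    loaded = [row[c] if m else 0.0 for c, m in zip(col_offsets, mask)]
    squared = [v * v for v in loaded]
    return [v for v, m in zip(squared, mask) if m]


n_cols = 781
block_size = next_power_of_2(n_cols)
row = [float(i) for i in range(n_cols)]
out = masked_square_row(row, block_size)
print(block_size, block_size - n_cols, out[:4])
```

With 781 columns the block covers 1024 lanes, so 243 lanes per row do no useful work; the mask keeps them from reading or writing out of bounds.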

3.2 Debugging Triton

Once we go to compiled code, we hopefully gain speed, but lose some of the flexibility that comes with eager execution, e.g. easy debugging via pdb and other Python debuggers or simple print statements.

Fortunately, Triton has a debugger now: we can invoke it by changing the triton.jit decorator to triton.jit(interpret=True). This will allow you to set normal Python breakpoints and step through the kernel line by line.

The interpret=True option was recently deprecated, so you can instead use os.environ["TRITON_INTERPRET"] = "1".

When doing that, you will see that most objects in the kernel are of the type WrappedTensor. So if you want to inspect a variable, you have to access its .tensor attribute.

Let’s look at this in action with a simple vector addition kernel from the Triton Docs.

If you do not have a GPU available, you can run this code in a Google Colab by first choosing a GPU runtime and then executing the following lines to get the latest Triton version and set up your CUDA libraries correctly:

!ldconfig /usr/lib64-nvidia
!ldconfig -p | grep libcud
!pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

Let us implement a simple vector addition kernel together with a helper function to call the kernel as well as some code to generate data and call that function:

import triton
import triton.language as tl
import torch

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    breakpoint()
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    torch.cuda.synchronize()
    return output

torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
output_torch = x + y
output_triton = add(x, y)
print(f'The maximum difference between torch and triton is '
      f'{torch.max(torch.abs(output_torch - output_triton))}')

This does not seem too complicated, but what are all these built-in Triton variables like tl.program_id? What do the offsets look like? And what values do my data pointers have? If we try to set breakpoint() to enter the PDB debugger, we get a NameError.

To answer these questions, import os and set the interpret flag for Triton to true: os.environ["TRITON_INTERPRET"] = "1". Now our breakpoint() works like a charm and we can interactively debug our Triton kernel (even inside a notebook!).

Via this, we learn for example that our pid is 0 in the first iteration, 1 in the second and so on! These iterations correspond to the workgroup/tile id, similar to the blockIdx in CUDA.

We also see that the offsets are a contiguous array of indices that are later used to access the vectors. We can also see that x_ptr and y_ptr contain memory addresses. So in x = tl.load(x_ptr + offsets, mask=mask), Triton loads the whole block of memory starting at x_ptr and covering all the offset locations. The compiler makes sure that these memory accesses are efficient, e.g. via memory coalescing.
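If you do not want to step through the debugger, the same quantities can be reproduced with a few lines of plain Python. This is only a sketch: cdiv is a stand-in for triton.cdiv and program_view is a hypothetical helper that mirrors the first lines of add_kernel for a given pid.

```python
def cdiv(a, b):
    """Ceiling division (stand-in for triton.cdiv)."""
    return (a + b - 1) // b


def program_view(pid, n_elements, block_size):
    """Offsets and mask that the program with index `pid` would compute."""
    block_start = pid * block_size
    offsets = [block_start + i for i in range(block_size)]
    mask = [o < n_elements for o in offsets]
    return offsets, mask


n_elements, block_size = 98432, 1024
n_programs = cdiv(n_elements, block_size)
first_offsets, first_mask = program_view(0, n_elements, block_size)
last_offsets, last_mask = program_view(n_programs - 1, n_elements, block_size)
print(n_programs, first_offsets[:3], sum(first_mask), last_offsets[0], sum(last_mask))
```

For the vector of size 98432 used above, this gives 97 programs; the first one handles a full block of 1024 elements, while the last one only has 128 valid elements and relies on the mask for the rest.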

3.3 Triton Deep-Dive

What does Triton do under the hood? It first converts the Python code into a custom Triton IR and then, via the Triton compiler, into the well-known LLVM-IR. From there PTX code is generated. Basically, Triton leverages LLVM heavily and (quote from the paper) “just a few data- and control-flow extensions to LLVM-IR could enable various tile-level optimization passes which jointly lead to performance on-par with vendor libraries.” These extensions allow Triton to do things like shared memory allocation or memory coalescing, things that in CUDA the GPU programmer has to handle manually.

From this news article

We can look at all these different intermediate representations by saving the compiled kernel to a variable and then accessing the asm field that contains the IRs for the various levels.

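# x and y are the input tensors created above; the remaining arguments mirror add()
output = torch.empty_like(x)
n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
compiled = add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
print("TTIR", compiled.asm['ttir'])
print("TTGIR", compiled.asm['ttgir'])
print("LLIR", compiled.asm['llir'])
print("PTX", compiled.asm['ptx'])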

TTIR (Triton Intermediate Representation): This IR is what people generally refer to when they say Triton IR. Inspecting it, you see that it looks relatively similar to the original Triton code, just with many of the operations split up into more fundamental steps like initialising constants, loading and broadcasting data and finally (after computation) storing it again. We see that our original kernel is now wrapped in an IR module as a tt.func public @kernel_name.
TTGIR (Triton GPU Intermediate Representation): Triton can be used for different accelerators, and the GPU is one of them. In that case, Triton will lower TTIR into TTGIR, where GPU-specific operations like thread synchronization, memory coalescing and shared memory allocation are performed.
LLIR (Low-Level Intermediate Representation): After TTGIR, the code is transformed into LLIR, the lowest level of IR. If we inspect the LLIR, we can see at the start that we use an LLVMDialectModule. This indicates that the IR in question is LLVM IR, part of the larger collection of modular and reusable compiler technologies that make up the LLVM project. The idea is that no matter which IR we lower from, once we are in LLVM IR we can translate the code to different backends (for example into machine code for NVIDIA or AMD GPUs). The "dialect" in LLVMDialectModule hints that we leverage not only LLVM but also MLIR, a successor project that tries to unify the tooling not only for the step from IR to backend, but also for creating these IRs in the first place. You can read more about MLIR in the original paper, this developer presentation or this blogpost.
PTX (Parallel Thread Execution, also NVPTX): PTX is the virtual ISA (instruction set architecture) for NVIDIA GPUs. When we write CUDA kernels, the NVCC compiler translates CUDA C++ code into PTX; here, Triton ends up at the same destination via a different route through Triton IR and LLVM IR. PTX is a proper assembly language, represented in ASCII text and specific to NVIDIA GPUs. The graphics driver contains a compiler that translates PTX into SASS, the assembly language specific to each GPU architecture, which enables device-specific optimisations. This code is then finally transformed into binary code and executed by the GPU.

Triton Compiler Pipeline (Link)

Looking at the Triton Compiler Pipeline from Triton IR to LLVM IR, we see that many of the optimizations we specify by hand in CUDA are performed in this transformation process, for example memory coalescing, matmul acceleration and layout adaptations.

The interesting part about Triton is that it is not limited to a specific set of hardware architectures, but can in principle be used for a variety of ISAs (Instruction Set Architectures).

Triton Compiler Ecosystem (Link)

While most programs targeted for GPUs will probably end up in LLVM IR and then get translated into the vendor-specific ISAs, code for CPUs, FPGAs and other hardware can get translated into other compiler backends, making the ecosystem modular.

3.4 Benchmarking Triton

We want to benchmark our Triton kernels similar to our CUDA kernels, of course; if they do not give us speed-ups we would not have needed to deal with them in the first place!

For profiling, we use the decorator triton.testing.perf_report to get a performance report of our kernel.

@triton.testing.perf_report(
        triton.testing.Benchmark(
            x_names=['size'],
            x_vals=[2**i for i in range(12, 28, 1)],
            x_log=True,
            line_arg='provider',
            line_vals=['triton', 'torch'],
            line_names=['Triton', 'Torch'],
            styles=[('blue', '-'), ('green', '-')],  # Line styles.
            ylabel='GB/s',  # Label name for the y-axis.
            plot_name='matrix-square-performance',  # Name for the plot. Used also as a file name for saving the plot.
            args={},  # Values for function arguments not in x_names and y_name.
        ))

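def benchmark(size, provider):
    # square() expects a 2D tensor, so lay the elements out as rows of 1024 columns
    # (every size in x_vals is a multiple of 1024)
    x = torch.rand(size // 1024, 1024, device='cuda', dtype=torch.float32)
    quantiles = [0.5, 0.2, 0.8]
    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: x**2, quantiles=quantiles)
    if provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: square(x), quantiles=quantiles)
    # One float32 read and one float32 write per element: 2 * 4 bytes = 8 bytes
    gbps = lambda ms: 8 * size / ms * 1e-6
    return gbps(ms), gbps(max_ms), gbps(min_ms)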

benchmark.run(show_plots=True, print_data=True)

With this benchmark, we get both a print-out of our data as well as a graphical representation.

Triton benchmark plot.

We can see that in this case, there is no significant speed-up over PyTorch; however looking at more complex examples on the Triton page, there are speed-ups to be achieved.
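Since the y-axis of the plot is in GB/s, it is worth spelling out how such an effective-bandwidth number is derived from a runtime in milliseconds. The sketch below assumes a memory-bound elementwise kernel on float32 data; the function name and the timing value are hypothetical and only serve to make the arithmetic easy to follow.

```python
def effective_bandwidth_gbps(n_elements, ms, bytes_per_element=4, accesses_per_element=2):
    """Effective bandwidth in GB/s for a memory-bound elementwise kernel."""
    bytes_moved = n_elements * bytes_per_element * accesses_per_element
    seconds = ms * 1e-3
    return bytes_moved / seconds * 1e-9


# Hypothetical timing of 2 ms for 2**27 float32 elements (one read, one write each).
bandwidth = effective_bandwidth_gbps(n_elements=2**27, ms=2.0)
print(bandwidth)
```

An elementwise square reads each element once and writes it once, so it moves 8 bytes per float32 element; a kernel with two inputs and one output, such as vector addition, would move 12 bytes per element instead (accesses_per_element=3).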

To read more about Triton, you can have a look at the original research paper, a video by the author Philippe Tillet and a Reddit discussion where he himself gave some useful perspectives on the project.

4. `torch.compile`

To get a feel for how Triton fits into the PyTorch 2 compilation stack, we can leverage the fact that torch.compile actually uses Triton under the hood. We can just write a simple function and then call torch.compile on it. Then, when running the script, we set the environment variable os.environ["TORCH_LOGS"] to different values (depending on which stage of the PyTorch compilation process we want to investigate) or set these values directly in PyTorch via torch._logging.set_logs(argument) with different arguments.

Stage	Value for TORCH_LOGS (Env. variable)	Argument to `set_logs` (Python function)
Dynamo Tracing	`+dynamo`	`dynamo=logging.DEBUG`
Traced Graph	`graph`	`graph=True`
Fusion Detections	`fusion`	`fusion=True`
Triton Output Code	`output_code`	`output_code=True`

Triton DL Stack (Link)

Looking at this, we can see that Triton leverages some heuristics to enable autotuning and other efficiency improvements. For example, it infers data types and element numbers and then uses this information to optimize the kernel.

Credits

Thanks to the CUDA MODE lecture series for the inspiration for this post and the community around that for interesting discussions!

How to represent protein structures in ML

2024-02-03T00:00:00+00:00

Machine Learning approaches empower a new suite of algorithms and applications in structural biology and protein engineering/design. However, there is quite a gap between how protein structure data is classically stored in databases and how machine learning algorithms deal with data. Here, I want to bridge that gap and show how current algorithms such as AlphaFold2 make use of protein structure data in practice.

Protein Structure File Formats: PDB vs PDBx/mmCIF vs MMTF vs BinaryCIF
Input Data for Machine Learning Algorithms
Reference Systems: Local reference frames vs reference-free methods
Batching: Padded versus sparse
AFDB, ESMAtlas & co: how to deal with large databases
Summary

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{didi2024proteinrepresentations,
  author = {Didi, Kieran},
  title = {How to represent protein structures in ML},
  url = {https://kdidi.netlify.app/blog/proteins/2024-02-03-protein-representations/},
  year = {2024}
}

Protein Structure File Formats: PDB vs PDBx/mmCIF vs MMTF vs BinaryCIF

Before we turn to machine learning algorithms such as AlphaFold2, let’s briefly discuss how these coordinates are stored in the PDB to start with.

Over the years there has been quite an evolution with respect to data formats for protein structures.

PDB format (legacy)

The original PDB format introduced in 1976 was intended as a human-readable file that would allow researchers to exchange data easily. While very successful, it is a very wasteful format by today’s standards in terms of whitespace and indentation, making automatic parsing relatively difficult.

Here is an excerpt of the PDB file of a lysozyme structure:

# file: "168l.pdb"
HEADER    HYDROLASE (O-GLYCOSYL)                  24-MAR-95   168L              
TITLE     PROTEIN FLEXIBILITY AND ADAPTABILITY SEEN IN 25 CRYSTAL FORMS OF T4   
TITLE    2 LYSOZYME                                                             
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: T4 LYSOZYME;                                               
COMPND   3 CHAIN: A, B, C, D, E;                                                
COMPND   4 EC: 3.2.1.17;                                                        
COMPND   5 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: ENTEROBACTERIA PHAGE T4;                        
SOURCE   3 ORGANISM_TAXID: 10665;                                               
SOURCE   4 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID;                              
SOURCE   5 EXPRESSION_SYSTEM_PLASMID: M13                                       
KEYWDS    HYDROLASE (O-GLYCOSYL)                                                
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR    X.-J.ZHANG,B.W.MATTHEWS                                               
REVDAT   5   07-FEB-24 168L    1       REMARK SEQADV                            
REVDAT   4   29-NOV-17 168L    1       REMARK HELIX                             
REVDAT   3   24-FEB-09 168L    1       VERSN                                    
REVDAT   2   01-APR-03 168L    1       JRNL                                     
REVDAT   1   10-JUL-95 168L    0                                                
JRNL        AUTH   X.J.ZHANG,J.A.WOZNIAK,B.W.MATTHEWS                           
JRNL        TITL   PROTEIN FLEXIBILITY AND ADAPTABILITY SEEN IN 25 CRYSTAL      
JRNL        TITL 2 FORMS OF T4 LYSOZYME.                                        
JRNL        REF    J.MOL.BIOL.                   V. 250   527 1995              
JRNL        REFN                   ISSN 0022-2836                               
JRNL        PMID   7616572                                                      
JRNL        DOI    10.1006/JMBI.1995.0396                                       
REMARK   1                                                                      
REMARK   1 REFERENCE 1                                                          
REMARK   1  AUTH   L.H.WEAVER,B.W.MATTHEWS                                      
REMARK   1  TITL   STRUCTURE OF BACTERIOPHAGE T4 LYSOZYME REFINED AT 1.7        
REMARK   1  TITL 2 ANGSTROMS RESOLUTION                                         
REMARK   1  REF    J.MOL.BIOL.                   V. 193   189 1987              
REMARK   1  REFN                   ISSN 0022-2836                               
REMARK   2                                                                      
REMARK   2 RESOLUTION.    2.90 ANGSTROMS.
...
SEQRES   1 A  164  MET ASN ILE PHE GLU MET LEU ARG ILE ASP GLU GLY LEU          
SEQRES   2 A  164  ARG LEU LYS ILE TYR LYS ASP THR GLU GLY TYR TYR THR          
SEQRES   3 A  164  ILE GLY ILE GLY HIS LEU LEU THR LYS SER PRO SER LEU          
SEQRES   4 A  164  ASN ALA ALA LYS SER GLU LEU ASP LYS ALA ILE GLY ARG          
SEQRES   5 A  164  ASN CYS ASN GLY VAL ILE THR LYS ASP GLU ALA GLU LYS
...
HELIX    1  A1 ILE A    3  GLU A   11  1                                   9    
HELIX    2  A2 LEU A   39  ILE A   50  1                                  12    
HELIX    3  A3 LYS A   60  ARG A   80  1                                  21    
HELIX    4  A4 ALA A   82  SER A   90  1                                   9    
HELIX    5  A5 ALA A   93  MET A  106  1                                  14    
...
ATOM      1  N   MET A   1      74.851  69.339  -6.260  1.00 37.97           N  
ATOM      2  CA  MET A   1      75.137  68.258  -5.357  1.00 38.78           C  
ATOM      3  C   MET A   1      73.896  67.665  -4.750  1.00 40.36           C  
ATOM      4  O   MET A   1      72.862  68.348  -4.627  1.00 40.50           O  
ATOM      5  CB  MET A   1      76.039  68.696  -4.203  1.00 40.16           C      

You can imagine how parsing something like the resolution automatically from this might be quite a pain. The main structure of such a PDB file is as follows:

it starts with a HEADER and some additional metadata such as the authors and the journal where the structure was published
then there are many REMARK records that give additional information like the resolution of the structure and the experimental method by which it was acquired
what follows are the SEQRES records, which list the residue sequence of each chain for quick parsing (more information on this here)
then comes information about the assigned secondary structure, indicated via HELIX or SHEET records
finally, we have the actual structure information, prefaced with the ATOM qualifier, describing the atom name, the residue name, the chain it belongs to and of course the coordinates, as well as additional metadata such as the B-factor
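Because the legacy PDB format is column-based rather than delimiter-based, ATOM records are best parsed by slicing fixed character ranges instead of splitting on whitespace. Here is a minimal sketch for a single ATOM line taken from the excerpt above; parse_atom_record is a hypothetical helper, it only covers the fields discussed here, and in practice you would rather use an established parser.

```python
def parse_atom_record(line):
    """Parse one ATOM record using the fixed column layout of the PDB format."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


line = "ATOM      2  CA  MET A   1      75.137  68.258  -5.357  1.00 38.78           C  "
atom = parse_atom_record(line)
print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["x"], atom["b_factor"])
```

Splitting on whitespace would work for this particular line, but it breaks as soon as neighbouring fields touch (for example for large coordinates or four-character atom names), which is one reason the format is awkward to parse.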

Two important things to note at this point:

Counterintuitively, the SEQRES information does not always align with the sequence contained in the structure itself via the ATOM fields. This is a problem that plagues later data formats as well and can be attributed to a variety of reasons, mostly that flexible loops and chain ends are often not resolved in experimental structures but nonetheless present in the SEQRES representation. That is the reason why models like AlphaFold2 and OpenFold require tools like KAlign to align the sequence representation to the structure representation in cases where they do not match during template search (see for example this file in the OpenFold codebase or section 1.2.3 in the AlphaFold 2 SI (page 5)).
The atom names are not just the chemical elements (C, N, O, …), but carry additional descriptors depending on where in the amino acid the atom occurs (a carbon can be C, CA, CB, CG, …). How each of these atoms is named exactly is described in the PDB Chemical Component Dictionary, but in general you can keep in mind that many atoms are enumerated with Greek letters after the element symbol; CG then stands for “Carbon Gamma”, i.e. the third carbon atom along the side chain, counting from the alpha carbon.
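To illustrate the first point, here is a small sketch of how one could map the residues observed in the ATOM records back onto the SEQRES sequence. It uses difflib from the Python standard library purely as a stand-in; this is not KAlign and not what AlphaFold2 or OpenFold do internally, and the sequences are a toy example in which two N-terminal residues and a three-residue loop are assumed to be unresolved.

```python
from difflib import SequenceMatcher


def map_observed_to_seqres(seqres, observed):
    """Return, for each observed residue index, its index in the SEQRES sequence."""
    matcher = SequenceMatcher(None, seqres, observed, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.b + k] = block.a + k
    return mapping


# Toy example: the first two residues and a three-residue loop are unresolved.
seqres = "MNIFEMLRIDEGLRLKIYKD"
observed = "IFEMLRIDELKIYKD"
mapping = map_observed_to_seqres(seqres, observed)
missing = [i for i in range(len(seqres)) if i not in mapping.values()]
print(missing)
```

A proper sequence aligner handles ambiguous gaps and mismatches far more carefully than this longest-common-block heuristic, but the output already shows the idea: positions that appear in SEQRES without any matching coordinates.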

The PDB format does not support Greek characters, so the atom names are translated into the most similar Latin letters:

Greek letter	Pronunciation	PDB name
α	alpha	A
β	beta	B
γ	gamma	G
δ	delta	D
ε	epsilon	E
ζ	zeta	Z
η	eta	H

C $\alpha$ is thus called CA, O $\gamma$ is called OG and so on. Sometimes (e.g. in Asp) there are two atoms of the same element at the same position along the side chain, in which case they are numbered 1 and 2; for example, the two carboxylate oxygen atoms in Asp are called OD1 and OD2. Later in this article we will see a representation of these atoms for all amino acids, but for now we can use the PDBeChem interface to look up this representation for the amino acid (or, in fact, any chemical component in the PDB) that we are interested in.
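The naming scheme can be captured in a few lines of Python. The sketch below is deliberately limited: it assumes heavy atoms with single-letter element symbols (C, N, O, S) as they occur in the standard amino acids, and describe_atom_name is a hypothetical helper, not part of any PDB library. Names such as OXT or two-letter elements are out of scope.

```python
# Remoteness letters used in PDB atom names and the Greek letters they stand for.
REMOTENESS = {
    "A": "alpha", "B": "beta", "G": "gamma", "D": "delta",
    "E": "epsilon", "Z": "zeta", "H": "eta",
}


def describe_atom_name(atom_name):
    """Split a heavy-atom name such as 'OD1' into element, position and branch."""
    element, rest = atom_name[0], atom_name[1:]
    if not rest:
        return element, None, None  # backbone N, C and O carry no remoteness letter
    position = REMOTENESS[rest[0]]
    branch = int(rest[1:]) if len(rest) > 1 else None
    return element, position, branch


for name in ["CA", "CG", "OG", "OD1", "OD2", "N"]:
    print(name, describe_atom_name(name))
```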

If you insert SER for the amino acid serine in the “Code” search box, hit the Search button and upon getting the result click the Atoms tab on the left-hand side of the page, you will get all the atoms in that specific amino acid. We will see later that the representation in models such as AlphaFold2 is a bit shorter since a) they do not include hydrogens in the model and b) one oxygen atom is lost in the condensation of the individual amino acids into the backbone (one water molecule per bond formed to be precise).

PDBx/mmCIF format

As mentioned, the PDB format has quite a few limitations when it comes to supporting large structures as well as complex chemistries. To improve on this, a new format called PDBx/mmCIF was introduced and is currently the default format in the PDB. It uses the ASCII character set and is a tabular data format in which data items have a name of the format _categoryname.attributename, for example _citation_author.name. If there is only one value for a data item, it is displayed on the same line as a key-value pair. If there are multiple values, a loop_ token prefaces the list of item names, followed by rows of data in which the different values are separated by whitespace.

Compared to the legacy PDB format where a structure is just described as a list of atoms and amino acids, PDBx/mmCIF has more semantics in its representation. One example of this is the concept of an entity, which is defined as a chemically distinct part of a structure as represented in the PDBx/mmCIF data file. For example, a chemical ligand would be an entity, as would chains in a protein. Importantly, these entities can be present multiple times: a homodimer will have one entity since the same chain is present twice.

With this background, let us look at the PDBx/mmCIF file for the same lysozyme structure we looked at before:

# file: "168l.cif"
data_168L
# 
_entry.id   168L 
# 
_audit_conform.dict_name       mmcif_pdbx.dic 
_audit_conform.dict_version    5.385 
_audit_conform.dict_location   http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic 
# 
loop_
_database_2.database_id 
_database_2.database_code 
_database_2.pdbx_database_accession 
_database_2.pdbx_DOI 
PDB   168L         pdb_0000168l 10.2210/pdb168l/pdb 
WWPDB D_1000170153 ?            ?                   
# 
...
_entity.id                         1 
_entity.type                       polymer 
_entity.src_method                 man 
_entity.pdbx_description           'T4 LYSOZYME' 
_entity.formula_weight             18373.139 
_entity.pdbx_number_of_molecules   5 
_entity.pdbx_ec                    3.2.1.17 
_entity.pdbx_mutation              ? 
_entity.pdbx_fragment              ? 
_entity.details                    ? 
# 
_entity_poly.entity_id                      1 
_entity_poly.type                           'polypeptide(L)' 
_entity_poly.nstd_linkage                   no 
_entity_poly.nstd_monomer                   no 
_entity_poly.pdbx_seq_one_letter_code       
;MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILR
NAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDAAAAALAAAAWYNQTPNRAKRVITTFRTGTWDA
YKNL
;
_entity_poly.pdbx_seq_one_letter_code_can   
;MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILR
NAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDAAAAALAAAAWYNQTPNRAKRVITTFRTGTWDA
YKNL
;
_entity_poly.pdbx_strand_id                 A,B,C,D,E 
_entity_poly.pdbx_target_identifier         ? 
# 
loop_
_entity_poly_seq.entity_id 
_entity_poly_seq.num 
_entity_poly_seq.mon_id 
_entity_poly_seq.hetero 
1 1   MET n 
1 2   ASN n 
1 3   ILE n 
1 4   PHE n 
1 5   GLU n 
1 6   MET n 
1 7   LEU n 
1 8   ARG n 
...
loop_
_chem_comp.id 
_chem_comp.type 
_chem_comp.mon_nstd_flag 
_chem_comp.name 
_chem_comp.pdbx_synonyms 
_chem_comp.formula 
_chem_comp.formula_weight 
ALA 'L-peptide linking' y ALANINE         ? 'C3 H7 N O2'     89.093  
ARG 'L-peptide linking' y ARGININE        ? 'C6 H15 N4 O2 1' 175.209 
ASN 'L-peptide linking' y ASPARAGINE      ? 'C4 H8 N2 O3'    132.118 
ASP 'L-peptide linking' y 'ASPARTIC ACID' ? 'C4 H7 N O4'     133.103 
CYS 'L-peptide linking' y CYSTEINE        ? 'C3 H7 N O2 S'   121.158 
GLN 'L-peptide linking' y GLUTAMINE       ? 'C5 H10 N2 O3'   146.144 
GLU 'L-peptide linking' y 'GLUTAMIC ACID' ? 'C5 H9 N O4'     147.129 
...
loop_
_atom_site.group_PDB 
_atom_site.id 
_atom_site.type_symbol 
_atom_site.label_atom_id 
_atom_site.label_alt_id 
_atom_site.label_comp_id 
_atom_site.label_asym_id 
_atom_site.label_entity_id 
_atom_site.label_seq_id 
_atom_site.pdbx_PDB_ins_code 
_atom_site.Cartn_x 
_atom_site.Cartn_y 
_atom_site.Cartn_z 
_atom_site.occupancy 
_atom_site.B_iso_or_equiv 
_atom_site.pdbx_formal_charge 
_atom_site.auth_seq_id 
_atom_site.auth_comp_id 
_atom_site.auth_asym_id 
_atom_site.auth_atom_id 
_atom_site.pdbx_PDB_model_num 
ATOM 1    N N   . MET A 1 1   ? 74.851  69.339  -6.260  1.00 37.97  ? 1   MET A N   1 
ATOM 2    C CA  . MET A 1 1   ? 75.137  68.258  -5.357  1.00 38.78  ? 1   MET A CA  1 
ATOM 3    C C   . MET A 1 1   ? 73.896  67.665  -4.750  1.00 40.36  ? 1   MET A C   1 
ATOM 4    O O   . MET A 1 1   ? 72.862  68.348  -4.627  1.00 40.50  ? 1   MET A O   1 
ATOM 5    C CB  . MET A 1 1   ? 76.039  68.696  -4.203  1.00 40.16  ? 1   MET A CB  1 
ATOM 6    C CG  . MET A 1 1   ? 76.921  67.555  -3.776  1.00 41.09  ? 1   MET A CG  1 
ATOM 7    S SD  . MET A 1 1   ? 77.902  67.038  -5.191  1.00 40.98  ? 1   MET A SD  1 
ATOM 8    C CE  . MET A 1 1   ? 78.748  65.645  -4.424  1.00 41.39  ? 1   MET A CE  1 
ATOM 9    N N   . ASN A 1 2   ? 74.139  66.409  -4.302  1.00 41.77  ? 2   ASN A N   1 
...
ATOM 6442 C CG  . LEU E 1 164 ? 95.884  25.834  -10.740 0.00 85.05  ? 164 LEU E CG  1 
ATOM 6443 C CD1 . LEU E 1 164 ? 96.110  27.302  -11.107 0.00 85.07  ? 164 LEU E CD1 1 
ATOM 6444 C CD2 . LEU E 1 164 ? 94.874  25.202  -11.694 0.00 85.06  ? 164 LEU E CD2 1 
ATOM 6445 O OXT . LEU E 1 164 ? 98.129  21.647  -9.779  0.00 84.32  ? 164 LEU E OXT 1 
# 

MMTF format (legacy)

PDBx/mmCIF is now the standard format for storing macromolecular data. Its extensible and verbose design gives it rich metadata and makes it well suited for archival purposes, but it is not the best format for transmitting large amounts of structural data because of the redundant annotations and repetitive information you have seen above. Storing coordinates as whitespace-separated text to keep them human-readable is another hurdle for fast data transmission.

Due to these limitations, the MMTF format (Macromolecular transmission format) was introduced. It does not contain all data present in the PDBx/mmCIF files, but all the data necessary for most visualisation and structural analysis programs. The main pros of MMTF are its compact encoding and fast parsing due to binary instead of string representations.

Overview of the MMTF compression pipeline. Source: UCSD Presentation

We can see that after some data preparation, the main steps in the MMTF pipeline are various ways of encoding to reduce the file size:

After these encodings, the file size is reduced further by packing the data into the MessagePack format. Its slogan reads “It’s like JSON, but fast and small”, indicating its flexibility in storing data, e.g. as key-value pairs, but in a binary format.

MMTF is great for fast transmission of data and rethought quite a lot of things in clever ways. However, it deviated quite a bit from the mmCIF standard and therefore never really caught on in the community. This has now been confirmed, with MMTF being deprecated from July 2024 onward.

BinaryCIF format

There was a need for a binarized efficient format for protein structure information transfer that was more aligned with the PDBx/mmCIF file format specification. Enter Binary CIF, a newer format that is easier to interconvert with the now standard PDBx/mmCIF. The Binary CIF specification is actually quite readable, so I recommend checking it out.

BinaryCIF was heavily inspired by MMTF, with many people working on both formats. This is visible in the usage of MessagePack and the different encodings employed.

Encodings employed for BinaryCIF. Source: BinaryCIF paper

There are a few additional ones that you can read up on in the specification on GitHub, but mostly the same encodings were used as in MMTF.
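To make these encodings a bit more tangible, here is a minimal pure-Python sketch of two of them, delta and run-length encoding, which both the MMTF and the BinaryCIF specification describe. The function names are my own and not part of any library, and real implementations additionally pack the resulting integers into typed byte arrays.

```python
def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse runs of equal values into flat (value, count) pairs."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1
        else:
            out.extend([v, 1])
    return out


def run_length_decode(pairs):
    """Invert run_length_encode."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# Atom serial numbers such as _atom_site.id simply count upwards ...
atom_ids = list(range(1, 10))
# ... so delta encoding turns them into a run of ones, which run-length
# encoding then collapses into a single (value, count) pair.
print(run_length_encode(delta_encode(atom_ids)))  # [1, 9]

# Coordinates can be stored as fixed-point integers (here 3 decimals)
# before delta encoding, which yields small numbers for neighbouring atoms.
x_coords = [74.851, 75.137, 73.896]
print(delta_encode([round(x * 1000) for x in x_coords]))  # [74851, 286, -1241]
```

Chaining simple, lossless transformations like these is what makes the binary formats so much smaller than the whitespace-separated text above.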

Input Data for Machine Learning Algorithms

We’ve discussed how protein structures are stored in databases; with that done, let us talk about how they are represented in machine learning algorithms.

Amino acid encodings

Encoding the sequence information into a numerical format should not be too hard; our vocabulary size is only 20 and we do not have to deal with symmetries as we will see later with geometric information like coordinates.

However, if you actually look into different code bases, you will soon find a decade-old problem revived again:

The old ordeal of standardisation. Source: xkcd.com

Depending on which codebase you use, the ordering of amino acids used to encode them into numerical format might be different, introducing the possibility of silent but horrible bugs later down the line. Some alphabets even have a different vocabulary size since they deal with post-translational modifications, non-canonical amino acids or other phenomena you encounter in the wild west of structural biology.

For many applications, people use a de-facto standard by adapting the encoding defined by AlphaFold2. If we look at the OpenFold codebase, we can see that their ordering includes the 20 canonical amino acids together with an unknown residue token represented by X:

# This is the standard residue order when coding AA type as a number.
# Reproduce it by taking 3-letter AA codes and sorting them alphabetically.
restypes = [
    "A",
    "R",
    "N",
    "D",
    "C",
    "Q",
    "E",
    "G",
    "H",
    "I",
    "L",
    "K",
    "M",
    "F",
    "P",
    "S",
    "T",
    "W",
    "Y",
    "V",
]
restype_order = {restype: i for i, restype in enumerate(restypes)}
restype_num = len(restypes)  # := 20.
unk_restype_index = restype_num  # Catch-all index for unknown restypes.

restypes_with_x = restypes + ["X"]
restype_order_with_x = {restype: i for i, restype in enumerate(restypes_with_x)}

OpenFold amino acid encoding.

However, some other models/frameworks use an amino acid encoding that is created by sorting the 1-letter codes instead of the 3-letter codes alphabetically. If in doubt, check which encoding your data uses to avoid confusion.
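To see how silent such a bug can be, here is a small sketch that encodes the same sequence with both orderings. The helper names `encode` and `decode` are made up for this illustration and do not come from OpenFold or any other codebase.

```python
AF2_ORDER = "ARNDCQEGHILKMFPSTWYV"        # 3-letter codes sorted alphabetically
ALPHA_ORDER = "".join(sorted(AF2_ORDER))  # 1-letter codes sorted alphabetically


def encode(seq, alphabet):
    """Map each residue to its index; unknown residues get the last index."""
    lookup = {aa: i for i, aa in enumerate(alphabet)}
    return [lookup.get(aa, len(alphabet)) for aa in seq]


def decode(indices, alphabet):
    """Map indices back to 1-letter codes, using X for the unknown index."""
    return "".join(alphabet[i] if i < len(alphabet) else "X" for i in indices)


seq = "MNIFE"
af2_indices = encode(seq, AF2_ORDER)
print(af2_indices)                       # [12, 2, 9, 13, 6]
print(encode(seq, ALPHA_ORDER))          # [10, 11, 7, 4, 3]

# Decoding with the wrong alphabet raises no error, it just returns
# a different (and perfectly valid-looking) sequence.
print(decode(af2_indices, ALPHA_ORDER))  # PDLQH
```

Nothing crashes here, which is exactly the problem: the only defence is to check the alphabet explicitly whenever data crosses from one codebase into another.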

Coordinates: Atom14 vs Atom37

When looking at either the original AlphaFold codebase or the open-source reproduction in PyTorch called OpenFold, many people trip over how the coordinates from the file formats discussed earlier are represented inside the neural network. This confusion is enhanced by there being two different network-internal representations which are converted into each other depending on the use case scenario.

The documentation on these two representations is sparse, with one being available on a HuggingFace docstring:

Generally we employ two different representations for all atom coordinates, one is atom37 where each heavy atom corresponds to a given position in a 37 dimensional array, This mapping is non amino acid specific, but each slot corresponds to an atom of a given name, for example slot 12 always corresponds to ‘C delta 1’, positions that are not present for a given amino acid are zeroed out and denoted by a mask. The other representation we employ is called atom14, this is a more dense way of representing atoms with 14 slots. Here a given slot will correspond to a different kind of atom depending on amino acid type, for example slot 5 corresponds to ‘N delta 2’ for Aspargine, but to ‘C delta 1’ for Isoleucine. 14 is chosen because it is the maximum number of heavy atoms for any standard amino acid. The order of slots can be found in ‘residue_constants.residue_atoms’. Internally the model uses the atom14 representation because it is computationally more efficient. The internal atom14 representation is turned into the atom37 at the output of the network to facilitate easier conversion to existing protein datastructures.

What does this mean in practice? Let’s look at the code. When looking at residue_constants.residue_atoms, we get the following description for the atom14 representation:

# file: "residue_constants.py"
# A list of atoms (excluding hydrogen) for each AA type. PDB naming convention.
residue_atoms = {
    "ALA": ["C", "CA", "CB", "N", "O"],
    "ARG": ["C", "CA", "CB", "CG", "CD", "CZ", "N", "NE", "O", "NH1", "NH2"],
    "ASP": ["C", "CA", "CB", "CG", "N", "O", "OD1", "OD2"],
    "ASN": ["C", "CA", "CB", "CG", "N", "ND2", "O", "OD1"],
    "CYS": ["C", "CA", "CB", "N", "O", "SG"],
    "GLU": ["C", "CA", "CB", "CG", "CD", "N", "O", "OE1", "OE2"],
    "GLN": ["C", "CA", "CB", "CG", "CD", "N", "NE2", "O", "OE1"],
    "GLY": ["C", "CA", "N", "O"],
    "HIS": ["C", "CA", "CB", "CG", "CD2", "CE1", "N", "ND1", "NE2", "O"],
    "ILE": ["C", "CA", "CB", "CG1", "CG2", "CD1", "N", "O"],
    "LEU": ["C", "CA", "CB", "CG", "CD1", "CD2", "N", "O"],
    "LYS": ["C", "CA", "CB", "CG", "CD", "CE", "N", "NZ", "O"],
    "MET": ["C", "CA", "CB", "CG", "CE", "N", "O", "SD"],
    "PHE": ["C", "CA", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "N", "O"],
    "PRO": ["C", "CA", "CB", "CG", "CD", "N", "O"],
    "SER": ["C", "CA", "CB", "N", "O", "OG"],
    "THR": ["C", "CA", "CB", "CG2", "N", "O", "OG1"],
    "TRP": ["C", "CA", "CB", "CG", "CD1", "CD2", "CE2", "CE3", "CZ2", "CZ3", "CH2", "N", "NE1", "O"],
    "TYR": ["C", "CA", "CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "N", "O", "OH"],
    "VAL": ["C", "CA", "CB", "CG1", "CG2", "N", "O"]}

atom14 ordering.

We see that depending on which amino acid we have present, a certain position in a residue array can represent a different atom (for example, counting from zero, position 3 is CG2 for Threonine, CG1 for Valine and N for Serine). This makes storing this information very efficient, but can be cumbersome if we need to retrieve the coordinates of a certain atom like N from our data structure.

On the other hand, the atom37 representation has a fixed atom data size for every residue. This ordering can be found in residue_constants.atom_types:

# file: "residue_constants.py"
# This mapping is used when we need to store atom data in a format that requires
# fixed atom data size for every residue (e.g. a numpy array).
atom_types = [
    "N",
    "CA",
    "C",
    "CB",
    "O",
    "CG",
    "CG1",
    "CG2",
    "OG",
    "OG1",
    "SG",
    "CD",
    "CD1",
    "CD2",
    "ND1",
    "ND2",
    "OD1",
    "OD2",
    "SD",
    "CE",
    "CE1",
    "CE2",
    "CE3",
    "NE",
    "NE1",
    "NE2",
    "OE1",
    "OE2",
    "CH2",
    "NH1",
    "NH2",
    "OH",
    "CZ",
    "CZ2",
    "CZ3",
    "NZ",
    "OXT",
]
atom_order = {atom_type: i for i, atom_type in enumerate(atom_types)}
atom_type_num = len(atom_types)  # := 37.

atom37 ordering.

Here we can see that the ordering is always the same no matter which residue is represented; however, most of the fields will always be empty since the largest amino acid (tryptophan) has only 14 heavy atoms. We therefore trade efficiency for standardisation, which explains why internally AF2 often uses atom14, but when it interfaces with other programs at I/O it often uses atom37.
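As a rough sketch of how the two layouts relate, the following snippet scatters the atoms of a single residue from a compact per-residue ordering into the fixed 37-slot layout and builds the accompanying mask. It assumes the per-residue ordering from the residue_atoms listing shown above (only two residues are copied here); the function name `to_atom37` is hypothetical, and the real codebases use precomputed index tables rather than a Python loop over names.

```python
import numpy as np

ATOM_TYPES = (
    "N CA C CB O CG CG1 CG2 OG OG1 SG CD CD1 CD2 ND1 ND2 OD1 OD2 SD "
    "CE CE1 CE2 CE3 NE NE1 NE2 OE1 OE2 CH2 NH1 NH2 OH CZ CZ2 CZ3 NZ OXT"
).split()
ATOM_ORDER = {name: i for i, name in enumerate(ATOM_TYPES)}

# Subset of the per-residue listing, copied from above for illustration.
RESIDUE_ATOMS = {
    "SER": ["C", "CA", "CB", "N", "O", "OG"],
    "THR": ["C", "CA", "CB", "CG2", "N", "O", "OG1"],
}


def to_atom37(resname, coords):
    """Scatter (n_atoms, 3) coordinates into a (37, 3) array plus a (37,) mask."""
    names = RESIDUE_ATOMS[resname]
    coords = np.asarray(coords, dtype=float)
    assert coords.shape == (len(names), 3)
    slots = [ATOM_ORDER[name] for name in names]
    atom37 = np.zeros((len(ATOM_TYPES), 3))
    mask = np.zeros(len(ATOM_TYPES), dtype=bool)
    atom37[slots] = coords
    mask[slots] = True
    return atom37, mask


ser_coords = np.arange(18, dtype=float).reshape(6, 3)  # dummy coordinates
atom37, mask = to_atom37("SER", ser_coords)
print(int(mask.sum()))          # 6 of the 37 slots are filled
print(atom37[ATOM_ORDER["N"]])  # N is always slot 0, wherever it was before
```

In the fixed layout, looking up a named atom is a constant index, at the price of 31 empty slots for serine.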

If we think about our example of Ser again, we can see how the machine representations map to the actual amino acid (again with the caveat that hydrogens are omitted and the carboxyl oxygen is not counted since in a peptide backbone it will have left as water).

Category	Atom14	Atom37
Memory Requirements	Efficient	Wasteful
Data Layout	Varying Shape	Fixed Shape
Sequence Dependence	Yes	No

Boundary Conditions: OXT

I have been talking before about the oxygen atom of the carboxy group that is lost when two amino acids combine to form a peptide bond. Well, that is true for all amino acids except the last one at the C-terminus since there the carboxy group will still be free and has two oxygen atoms. At physiological pH the carboxy group will be deprotonated so both oxygen atoms are chemically equivalent with equal bond lengths (as opposed to the single and double bond image we always draw on paper), but our file formats still require us to name one of the oxygens at the terminus as a “normal” oxygen, i.e. O and the other one OXT. You will therefore often see the last atom in a protein structure being OXT, such as in this Biopandas tutorial. When I say “often”, I mean “not always”; the termini of proteins are known to often be too flexible to crystallise, therefore the structure in our PDB files will often end prior to the C-terminus and not contain an OXT. This is not super problematic since given the planarity of the delocalised carboxy electron system, one can place the OXT easily given the carbon and the other oxygen atom. Predicted structures such as the ones from AlphaFold2 on the other hand will always contain the OXT atom since they do not have to battle experimental resolution problems.
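A minimal sketch of such a placement could look as follows. It assumes ideal planar geometry with 120 degree angles around the carboxy carbon, a C-O bond length of 1.25 Å, and that the CA position is used in addition to C and O to define the plane; the function name `place_oxt` is my own.

```python
import numpy as np


def place_oxt(ca, c, o, bond_length=1.25):
    """Place OXT in the CA-C-O plane, opposite to the bisector of CA and O.

    For an ideal trigonal planar carbon the three unit bond vectors sum to
    zero, so the missing direction is minus the sum of the other two.
    (Assumes CA, C and O are not collinear.)
    """
    ca, c, o = (np.asarray(v, dtype=float) for v in (ca, c, o))
    u = (ca - c) / np.linalg.norm(ca - c)
    v = (o - c) / np.linalg.norm(o - c)
    w = -(u + v)
    w /= np.linalg.norm(w)
    return c + bond_length * w


# Idealised test geometry in the xy-plane, carbon at the origin.
c = np.zeros(3)
ca = np.array([1.52, 0.0, 0.0])
o = 1.25 * np.array([-0.5, np.sqrt(3) / 2, 0.0])
print(place_oxt(ca, c, o))  # mirror image of O with respect to the CA-C axis
```

For real structures the angles deviate slightly from 120 degrees, so this is a starting point rather than a refined position.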

Example: Lysozyme atom numbering

Let us now visualise the concepts we looked at so far (atom names and atom representations) with a concrete example, again based on the lysozyme structure with the PDB code 168l. Install PyMol (either the commercial or the open-source version) and open the program.

If you have not used PyMol before, you can either skip this section or look at this lesson from my Structural Bioinformatics course that goes over this in detail.

Then, execute the following commands via the integrated terminal:

fetch 168lA # get first chain of lysozyme assembly
select selection, resi 11-15 # select a subset of residues for simplicity
hide everything # hide the whole structure for clarity
show sticks, selection # show stick representation for the selected subset; carbon is green, oxygen is red, nitrogen is blue
color yellow, (name CG) # color all CG atoms yellow
color orange,  (name NH1) # color the single NH1 atom orange

After doing this, you should see something like this:

We can compare this to a schematic sketch of this protein segment, similar to what we did before with serine:

Schematic representation of the selection from our protein, with the coloring imitating our color scheme in PyMol.

We can see that PyMol knows about the atom naming convention we discussed and can select and color residues accordingly. It does this by parsing the information it gets from the PDB file and storing this inside the structure object it displays.

We can do the same thing programmatically by using a library such as Biotite.

import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx
import biotite.database.rcsb as rcsb

# MMTF has been deprecated (see above), so we fetch the BinaryCIF file instead
bcif_file = pdbx.BinaryCIFFile.read(rcsb.fetch("168l", "bcif"))
structure = pdbx.get_structure(bcif_file, model=1)

# keep only the protein atoms of chain A (no ligands or water)
chain_A = structure[
    (structure.chain_id == "A") & ~structure.hetero
]
print(chain_A.res_id) # array([  1,   1,   1, ..., 164, 164, 164])
selection = chain_A[(chain_A.res_id > 10) & (chain_A.res_id <= 15)]
print(selection.res_id) # [ 11 11 ... 15 15 ]
print(selection.array_length()) # 40

We see that our selection contains 40 atoms. We can check if that corresponds to the amino acids we wanted to select by checking how many non-hydrogen atoms each of these amino acids have and by subtracting on average 1 oxygen atom per amino acid for the formation of the peptide bond.

Proteinogenic amino acids and some of their properties. Source: Wikipedia

\begin{align} E + G + 2L + R - 5 &= 10 + 5 + 2 \cdot 9 + 12 - 5 \\ &= 40 \end{align}
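The same count can be cross-checked against the residue_atoms listing from earlier. Since that listing already describes residues inside a peptide chain (no hydrogens, no OXT), summing the list lengths for the composition used in the equation above should give the same number directly. Only the four residue types needed here are copied over.

```python
# Heavy atoms per residue, copied from the residue_atoms listing above.
residue_atoms = {
    "GLU": ["C", "CA", "CB", "CG", "CD", "N", "O", "OE1", "OE2"],
    "GLY": ["C", "CA", "N", "O"],
    "LEU": ["C", "CA", "CB", "CG", "CD1", "CD2", "N", "O"],
    "ARG": ["C", "CA", "CB", "CG", "CD", "CZ", "N", "NE", "O", "NH1", "NH2"],
}

# Composition of the selection: E + G + 2L + R
composition = ["GLU", "GLY", "LEU", "ARG", "LEU"]

total = sum(len(residue_atoms[res]) for res in composition)
print(total)  # 40
```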

Reference Systems: Local reference frames vs reference-free methods

We now have covered how we go from the database formats for protein structures (PDBx/mmCIF, MMTF and BinaryCIF) to the formats commonly used as inputs for machine learning models (atom14, atom37). The question now is: what do the machine learning models do with this input information?

Given that we deal with geometric quantities such as coordinates of protein structures, considerations like invariance and equivariance come into play. There is a whole field called Geometric Deep Learning that deals with these considerations. For the usage of machine learning models for protein structure, it is important to understand the distinction between reference-free and reference-based methods.

To learn more about geometric deep learning, you can either check out the protobook by Bronstein et al., the Hitchhiker’s guide to geometric GNNs or this lecture I gave on the topic.

If we predict some molecular property (such as binding affinity, solubility or immunogenicity) it is quite obvious to a human that rotations or translations of the protein should not change the prediction of these quantities. A neural network, however, just sees different numbers when a protein is translated and therefore needs to learn that these different inputs correspond to the same protein. This can be done via data augmentation, but this can become data-inefficient. Therefore, people looked for ways to build this inductive bias of invariance or equivariance to SE(3) group actions (i.e. rotations and translations) into the model.

Local reference-based methods

On one hand, some models leverage reference-based methods, largely following the example of the original AlphaFold2 model. Here, a local reference frame for each residue is defined based on the backbone geometry, with the translational component being equal to the CA position and the rotational component originating from a Gram-Schmidt orthogonalisation with respect to the CA-C and the CA-N bond vector.
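As a minimal sketch of this construction for a single residue (the function name `residue_frame` is my own, and the convention of taking CA-C as the first axis is an assumption about which of the two bond vectors is normalised first):

```python
import numpy as np


def residue_frame(n, ca, c):
    """Build a local frame (R, t) from backbone atoms via Gram-Schmidt.

    The columns of R are the orthonormal basis vectors, t is the CA position.
    """
    n, ca, c = (np.asarray(v, dtype=float) for v in (n, ca, c))
    v1 = c - ca                       # first bond vector, defines the x-axis
    v2 = n - ca                       # second bond vector
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - e1 * np.dot(e1, v2)     # remove the component along e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)             # right-handed third axis
    return np.stack([e1, e2, e3], axis=-1), ca


# Backbone atoms of MET 1 from the mmCIF excerpt above.
n_xyz = [74.851, 69.339, -6.260]
ca_xyz = [75.137, 68.258, -5.357]
c_xyz = [73.896, 67.665, -4.750]

R, t = residue_frame(n_xyz, ca_xyz, c_xyz)
print(np.round(R.T @ R, 6))              # identity: the basis is orthonormal
print(np.linalg.det(R))                  # +1: a proper rotation, no reflection
print(R.T @ (np.asarray(c_xyz) - t))     # C lies on the local x-axis
```

Because the third axis comes from a cross product, the frame is always right-handed, which is what makes frame-based quantities sensitive to reflections.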

Here is a paragraph from the Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems that summarizes the current state in this field of research:

Canonical frame-based invariant GNNs. Canonical frame-based GNNs [Liu et al., 2022, Wang et al., 2022a] use a local or global frame of reference to scalarise geometric quantities into invariant features which are used for message passing, offering an alternative technique when canonical reference frames can be defined. Most notably, the Invariant Point Attention layer (IPA) from AlphaFold2 [Jumper et al., 2021] defines canonical local reference frames at each residue in the protein backbone centred at the alpha Carbon atom and using the Nitrogen and adjacent Carbon atoms. Other invariant GNNs for protein structure modelling also process similar local reference frames [Ingraham et al., 2019, Wang et al., 2023b]. IPA is an invariant message passing layer operating on an all-to-all graph of protein residues. In each IPA layer, each node creates a geometric feature (position) in its local reference frame via a learnable linear transformation of its invariant features. To aggregate features from neighbours, neighbouring nodes’ positions are first rotated into a global reference frame where they can be composed with their invariant features (via an invariant attention mechanism), followed by rotating the aggregated features back into local reference frames at each node and projecting back to update the invariant features.

These canonical local reference frames $T = (r, x) \in \text{SE(3)}$ can be used to deal with quantities in a SE(3)-invariant way. Importantly, the orientational nature of the frame allows us to be SE(3)-invariant but not E(3)-invariant, i.e. reflections are still accounted for. This is important for biological applications since chirality plays a huge role in biomolecular interactions.

Why SE(3) instead of E(3) equivariance can be important

As an example of why this is important, we can look at the task of protein structure prediction that AlphaFold2 tackled.

To learn more about AlphaFold2 and the problem of protein structure prediction, you can either check out the 3-part lecture series about AF2 by Nazim Bouatta or this lecture I gave on the topic.

Here, one important metric for measuring prediction accuracy is the GDT score. To get good at maximising this score, a natural way to think about it is to take your predicted coordinates, compare them to the ground-truth coordinates and compute something like an RMSD loss. However, a plain RMSD on raw coordinates is not invariant to rototranslations: the same structure in a different pose would be penalised. We can remedy that by calculating a dRMSD loss, i.e. an RMSD loss on all pairwise distances in the structure. By using these internal coordinates, we are invariant to rototranslations.

However, we are also invariant to reflections! When training AlphaFold2, the team at DeepMind tested what would happen if they used this dRMSD loss for training a model.
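This is easy to verify numerically. The sketch below uses random points instead of a real protein and my own helper names; the signed volume of a tetrahedron spanned by four points serves as a simple chirality indicator.

```python
import numpy as np


def pairwise_distances(x):
    """All pairwise Euclidean distances for points of shape (N, 3)."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)


def drmsd(x, y):
    """RMSD between the upper triangles of the two distance matrices."""
    iu = np.triu_indices(len(x), k=1)
    return np.sqrt(np.mean((pairwise_distances(x)[iu] - pairwise_distances(y)[iu]) ** 2))


def signed_volume(x):
    """Triple product of the first four points; changes sign under reflection."""
    a, b, c = x[1] - x[0], x[2] - x[0], x[3] - x[0]
    return np.dot(a, np.cross(b, c))


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
mirror = x * np.array([-1.0, 1.0, 1.0])   # reflect through the yz-plane

print(drmsd(x, mirror))                    # 0.0: the loss cannot tell them apart
print(signed_volume(x), signed_volume(mirror))  # same magnitude, opposite sign
```

A model trained on this loss therefore gets no signal about which of the two mirror images is the right one.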

Results if AF2 is trained with dRMSD instead of FAPE as loss. Source: AF2 SI, page 36, section 1.9.3.

You can see that while the local structure of predicted proteins (as measured by the lDDT-CA score) seems very good, the global structure (as measured by the GDT score) seems to follow a bimodal distribution, with half the predictions performing well and the other half faring badly. Could this be due to the reflection invariance of the dRMSD loss? When calculating the GDT score with respect to the mirror image structure, the team observed a reversal of the distribution! Finally, when looking at the maximum of these two scores (one calculated with respect to the ground truth structure and one with respect to its mirror image), the model shows strong performance, indicating that the issue was indeed the reflection-invariant dRMSD loss.

Here frames come to our rescue and allow the definition of the so-called FAPE loss (frame-aligned point error, minimal implementation here). With their help, we can compute distance-like quantities, but in a reflection-aware way. How do we do that? We can take a predicted position $x_j$ and compute its position relative to the predicted frame of a different residue $T_i$. This effectively gives us a displacement vector that is reflection-aware due to the rotational component of the frame transformation.

We can do the same thing for the ground-truth positions and frames that can be computed for the same combination of residues and score the difference as an RMSD-like quantity. This is what the FAPE loss amounts to.
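In code, a stripped-down version of this idea might look as follows. This is a sketch under several simplifying assumptions and not the AF2 implementation: it only scores CA atoms, it leaves out the clamping and length scaling that the published loss uses, and it runs on random points rather than a real backbone. All function names are my own.

```python
import numpy as np


def backbone_frames(n, ca, c):
    """Gram-Schmidt frames for all residues; n, ca, c have shape (L, 3)."""
    e1 = (c - ca) / np.linalg.norm(c - ca, axis=-1, keepdims=True)
    v2 = n - ca
    u2 = v2 - e1 * np.sum(e1 * v2, axis=-1, keepdims=True)
    e2 = u2 / np.linalg.norm(u2, axis=-1, keepdims=True)
    e3 = np.cross(e1, e2)
    rot = np.stack([e1, e2, e3], axis=-1)  # (L, 3, 3), basis vectors as columns
    return rot, ca


def to_local(rot, trans, x):
    """local[i, j] = R_i^T (x_j - t_i), i.e. point j seen from frame i."""
    diff = x[None, :, :] - trans[:, None, :]
    return np.einsum("iab,ija->ijb", rot, diff)


def fape(n_pred, ca_pred, c_pred, n_true, ca_true, c_true, eps=1e-8):
    """Mean distance between predicted and true CA positions in all local frames."""
    rot_p, t_p = backbone_frames(n_pred, ca_pred, c_pred)
    rot_t, t_t = backbone_frames(n_true, ca_true, c_true)
    err = to_local(rot_p, t_p, ca_pred) - to_local(rot_t, t_t, ca_true)
    return np.mean(np.sqrt(np.sum(err ** 2, axis=-1) + eps))


rng = np.random.default_rng(0)
n, ca, c = (rng.normal(size=(8, 3)) for _ in range(3))
flip = np.array([-1.0, 1.0, 1.0])

print(fape(n, ca, c, n, ca, c))                        # close to zero (only eps remains)
print(fape(n * flip, ca * flip, c * flip, n, ca, c))   # clearly non-zero for the mirror image
```

Unlike dRMSD, the mirror image is now penalised: the local z-coordinate of every point changes sign when the structure is reflected, because the frames themselves stay right-handed.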

FAPE loss visualisation for a single pair of residues. Source: YouTube talk at HMS.

An equivalent way of visualising this involves not looking at a single pair of residues, but considering it in the context of the whole structure. Here, we align the predicted and the target structure based on frame $T_i$ and then calculate the L2 norm of all the other residues with respect to this specific alignment. We can then repeat this for all residues in the sequence to calculate the overall FAPE loss.

FAPE loss in the context of the whole structure. Source: AF2Seq paper.

Note that there are different versions of the FAPE loss used in different parts of the model; while the final FAPE loss computes these L2 norms for all atoms, the intermediate FAPE loss only considers the CA positions.

This type of frame definition is by no means the only way you can construct frames; RGN2, another model for protein structure prediction instead uses Frenet–Serret frames to model the protein backbone.

Ambivalent mappings from frames to coordinates

At the end of AlphaFold2, the algorithm has to map the frame-based representation back into 3D coordinates. This should not be a problem since we have our backbone frames that allow us to reconstruct the backbone positions, and we predict the torsion angles of the rigid groups in the side chains so that we can place all atoms correctly according to the following table.

Rigid groups for constructing all atoms from given torsion angles. Source: AF2 SI, Table 2

However, you can see a few boxed atoms in the table. These atoms are symmetric under 180 degree rotations, such as five of the six atoms in the phenyl ring of phenylalanine (PHE) and tyrosine (TYR) or the terminal carboxyl oxygens in aspartate (ASP) and glutamate (GLU).

Some of these atoms are on the rotation axis such as the terminal carbon atom in the phenyl rings and are therefore invariant to the rotation; some of the other atoms however swap positions due to the 180 degree rotation symmetry and their atom names are therefore ambiguous.

AlphaFold deals with this by renaming the atoms in a globally consistent way via lDDT loss computations (see algorithm 26 in the SI and this part of the OpenFold codebase).

Renaming convention for ambivalent atom placements. Source: AF2 SI, Table 3

Another problem that comes from this ambiguity is that the network can in theory predict two valid values for the torsion angle of these rigid groups, $\chi$ as well as $\chi + \pi$. AlphaFold therefore accepts both angles by comparing the prediction against both the ground-truth angle and its possible alternative (in the case of non-symmetric configurations, both are set to the ground-truth value). In this way, the network is allowed to learn either of the two valid values.
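A minimal sketch of such a symmetry-aware torsion loss is shown below. It reflects my reading of the AF2 supplement, where angles are represented as points on the unit circle and the loss takes the smaller of the two distances, to the ground truth and to its alternative; the function names and the omission of all other loss terms are my own simplifications.

```python
import numpy as np


def angle_to_vec(chi):
    """Represent angles as (sin, cos) points on the unit circle."""
    return np.stack([np.sin(chi), np.cos(chi)], axis=-1)


def torsion_loss(chi_pred, chi_true, is_symmetric):
    """Squared distance to the closer of the true and the alternative angle."""
    chi_pred = np.asarray(chi_pred, dtype=float)
    chi_true = np.asarray(chi_true, dtype=float)
    is_symmetric = np.asarray(is_symmetric, dtype=bool)
    # For symmetric groups chi + pi is equally valid, otherwise only chi is.
    chi_alt = np.where(is_symmetric, chi_true + np.pi, chi_true)
    pred = angle_to_vec(chi_pred)
    d_true = np.sum((pred - angle_to_vec(chi_true)) ** 2, axis=-1)
    d_alt = np.sum((pred - angle_to_vec(chi_alt)) ** 2, axis=-1)
    return np.mean(np.minimum(d_true, d_alt))


chi_true = [1.0, 1.0]
chi_pred = [1.0 + np.pi, 1.0 + np.pi]   # both predictions are off by 180 degrees
print(torsion_loss(chi_pred, chi_true, [True, True]))    # ~0: fine for symmetric groups
print(torsion_loss(chi_pred, chi_true, [False, False]))  # ~4: maximal error otherwise
```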

Reference-free methods: Invariant and Equivariant Update Functions

We do not necessarily need to represent our structures as frames where we define a local reference coordinate system, but can also directly operate on our coordinates as long as we update our representation at every layer in a way that properly leverages these symmetries (e.g. by SE(3) invariance or equivariance).

Examples that leverage this approach include GVP-GNN which defines equivariant update functions as well as SchNet and DimeNet that leverage invariant update functions (message passing functions in GNN-speak).
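To illustrate the difference between the two kinds of update functions, here is a toy example. It is deliberately not the GVP-GNN, SchNet or DimeNet architecture, just two hand-written functions: one that produces invariant features (pairwise distances) and one that produces an equivariant coordinate update (displacement vectors weighted by an invariant scalar).

```python
import numpy as np


def invariant_features(x):
    """Pairwise distances: unchanged by any rotation or translation of x."""
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)


def equivariant_update(x):
    """Move each point along distance-weighted displacement vectors.

    The weights only depend on distances (invariant), the displacement
    vectors rotate with the input, so the output rotates with the input too.
    """
    diff = x[:, None] - x[None, :]                       # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)  # (N, N, 1)
    return x + np.sum(np.exp(-dist) * diff, axis=1) / len(x)


rng = np.random.default_rng(1)
x = rng.normal(size=(6, 3))

# An arbitrary rotation about the z-axis plus a translation.
angle = 0.5
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
shift = np.array([3.0, -1.0, 2.0])
x_moved = x @ rot.T + shift

# Invariance: f(Rx + t) = f(x)
print(np.allclose(invariant_features(x_moved), invariant_features(x)))
# Equivariance: g(Rx + t) = R g(x) + t
print(np.allclose(equivariant_update(x_moved), equivariant_update(x) @ rot.T + shift))
```

Both checks print True here; stacking layers with these properties is what lets a network respect the symmetry without ever defining a reference frame.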

To learn more about how these different approaches can be classified, I recommend both this paper as well as the Hitchhiker’s guide to geometric GNNs.

Leaving the GNN camp for a bit, Ophiuchus showed that one can use hierarchical autoencoders to operate over protein structures which are represented by CA atoms and geometric features attached to them that describe the other atomic positions. They employ SE(3)-equivariant convolutions to operate on this representation and demonstrate its usage for compression and structure generation.

Screw these symmetries: data augmentation and other strategies

Frame-based representations have been successful in AlphaFold and have since been used in many other models, both supervised and generative, for example RFDiffusion and Chroma. However, defining things like diffusion processes over these frames becomes quite a bit harder, and if you additionally deal with sidechains and other details, frames might be too cumbersome for your use case.

Other models therefore do not use frames, but some kind of internal coordinates that can be used without explicitly considering these symmetry constraints. Some examples of this include RGN and FoldingDiff that leverage torsion angles or ProteinSGM that leverages a mixture of torsion angles and backbone distances.

Another strategy that does not involve dealing with symmetries is - well, not dealing with symmetries. Protpardelle is a protein diffusion model that operates on pure coordinate representations via a vision transformer and does some rotational and translational data augmentation to account for these symmetries. Finally, in the small molecule world, the Molecular Conformer Fields paper showed that empirically, not enforcing these symmetry constraints explicitly can still lead to SOTA performance, sparking quite a discussion on Twitter.
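Such an augmentation fits in a few lines. The sketch below is a generic version and not taken from the Protpardelle codebase; in particular, centring the structure first and drawing the translation uniformly from a small box are my own assumptions.

```python
import numpy as np


def random_rotation(rng):
    """Random 3x3 rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of the decomposition
    if np.linalg.det(q) < 0:      # make sure we rotate and do not reflect
        q[:, 0] *= -1
    return q


def augment(coords, rng, max_shift=1.0):
    """Centre the structure, rotate it randomly and shift it by a small offset."""
    centered = coords - coords.mean(axis=0, keepdims=True)
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return centered @ random_rotation(rng).T + shift


rng = np.random.default_rng(0)
coords = rng.normal(size=(12, 3))
augmented = augment(coords, rng)
print(augmented.shape)  # (12, 3): same structure, different pose
```

Applied freshly in every epoch, this shows the network the same protein in many poses, so it has to learn the symmetry from data instead of getting it for free from the architecture.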

Batching: Padded versus sparse

We’ve now covered the whole pipeline, starting from database formats over input formats to network-internal representations to properly handle symmetries. A final consideration comes into play when we think about batching, a commonly used technique in machine learning where you do not pass your samples one by one into the network, but combine them into a bigger tensor to achieve better hardware utilisation and therefore training performance.

There are many subtleties to choosing your batch size since generally we perform a gradient update step after each of these batches; therefore, the batch size not only influences training performance but also accuracy by changing the dynamics of our gradient descent procedure. I won’t go into detail here on that, but recommend Andrej Karpathy’s blog on general recipes for training neural networks.

The batching pain with variable-length input

This batching of tensors is trivial in many computer vision use cases since often all your images are of the same size; you can therefore just stack them along a new dimension and ready is your batch.

For protein structures, it is a bit more complicated due to variable length. One strategy to deal with this involves padding and truncation. Here, we choose some maximum length for our batch and pad structures that are shorter than this via padding tokens (for coordinates this can be 0 or a small value that is unlikely to occur exactly like this in the data) and truncate structures that are longer than this (either randomly or via some biologically defined domain boundaries). This solves our issue, but introduces new ones: often, we do not want to truncate data since we may lose important information. If we instead always choose the longest structure in a batch as the maximum length, we may end up with very inefficient training if there are very short sequences in the batch and padding tokens begin to represent a significant part of our batch.
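A bare-bones version of padding and truncation for CA coordinates could look like this. The function name is hypothetical, and for simplicity the sketch truncates at the end of the chain rather than cropping randomly or at domain boundaries. The boolean mask is what later lets the model ignore the padded positions.

```python
import numpy as np


def pad_or_truncate(coords_list, max_len, pad_value=0.0):
    """Stack variable-length (L_i, 3) arrays into a (B, max_len, 3) batch plus mask."""
    batch = np.full((len(coords_list), max_len, 3), pad_value, dtype=float)
    mask = np.zeros((len(coords_list), max_len), dtype=bool)
    for i, coords in enumerate(coords_list):
        n = min(len(coords), max_len)  # longer structures are cut off at max_len
        batch[i, :n] = coords[:n]
        mask[i, :n] = True
    return batch, mask


structures = [np.ones((n, 3)) for n in (3, 5, 8)]
batch, mask = pad_or_truncate(structures, max_len=6)
print(batch.shape)               # (3, 6, 3)
print(mask.sum(axis=1).tolist()) # [3, 5, 6]: the last structure lost two residues
```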

Efficient padding via length batching

To circumvent this, people took inspiration from NLP. In the transformer paper, for example, it is stated that to circumvent the inefficient padding issue, sentence pairs were batched together by approximate sequence length, resulting in more optimal padding. This has been replicated for example in generative models for protein structure. This change might influence training dynamics since now the model sees similarly-sized inputs inside every batch, but empirically still seems to work fine.
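The effect is easy to quantify with a toy example. The helper names below are my own, and the sketch leaves out details that real data loaders usually add, such as shuffling the order of the batches between epochs.

```python
def length_bucketed_batches(lengths, batch_size):
    """Sort sample indices by length and cut the sorted list into batches."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_fraction(lengths, batches):
    """Fraction of all positions that is padding when padding to the batch maximum."""
    padded = sum(len(b) * max(lengths[i] for i in b) for b in batches)
    return 1 - sum(lengths) / padded


lengths = [50, 400, 60, 380, 55, 390]
naive = [[0, 1, 2], [3, 4, 5]]                   # batches in dataset order
bucketed = length_bucketed_batches(lengths, 3)   # [[0, 4, 2], [3, 5, 1]]

print(padding_fraction(lengths, naive))     # about 0.44
print(padding_fraction(lengths, bucketed))  # about 0.03
```

In this admittedly extreme example, sorting by length cuts the wasted positions from almost half of the batch to a few percent.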

Sparse batching

In the previous section we talked about the usage of GNNs (graph neural networks) for protein structures. A popular library in the field of GNNs is PyG (PyTorch Geometric) that can be used for all kinds of graph-structure data.

In contrast to the padding-and-truncation approach I mentioned before, they opt for a sparse batching procedure they term advanced mini-batching.

Here, we treat the graph data points in a batch as one single datapoint and use pointers to tell us about the boundaries between them. In practice, we concatenate all our node features along an existing dimension instead of stacking them along a new dimension, making padding and truncation obsolete.

Advanced mini-batching in PyTorch Geometric. Source: PyG Docs

We do something similar for the adjacency matrix which indicates the connectivity in the graphs. Stacking these in a block-diagonal fashion allows us to reuse existing algorithms for GNNs such as message-passing without having to change implementations. In addition, since the majority of elements in this matrix will be zero, we can use sparse representations that allow us to deal with this in a memory-efficient way.

If you inspect protein structures represented in this PyG format (such as in the ProteinWorkshop project we recently published), you can see that a graph will look like this:

DataBatch(
  coords=[7241, 37, 3],
  residues=[32], 
  residue_id=[7241], 
  chains=[7241], 
  seq_pos=[7241, 1], 
  batch=[7241], 
  ptr=[33])

In contrast, this same batch in the “dense” format that uses padding would look like this:

DataBatch(
  coords=[32, 385, 37, 3],
  residues=[32],
  residue_id=[32, 385],
  chains=[32, 385],
  seq_pos=[32, 385, 1])

We can notice several differences:

the dense format represents the batch as an explicit tensor dimension (first dimension of size 32) in all attributes. This dimension is not apparent in the PyG batch except for the attributes that are graph-level attributes and therefore do not change with the size of the graph (residues is an example here, for each graph it is a single list).
we can see in the dense batch that the longest protein structure in this batch is 385 residues (apparent in for example the residue_id attribute, a numerical encoding of the amino acid type). In the PyG batch, we can see that stacked together all amino acids in the batch sum to 7241. If you compare 7241 to 32*385 = 12320, we can see that about 41% of the padded batch consists of padding, i.e. the dense format needs roughly 70% more memory than the efficiently batched representation.
the PyG batch stores the batching information not in a separate dimension, but in separate attributes: batch indicates for each node in the batch to which graph in the batch it belongs, and ptr contains pointers to the boundaries between all the graphs in the batch to enable efficient indexing and information retrieval.
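To show what batch and ptr contain, the following sketch builds them by hand with plain NumPy. It only emulates the layout for illustration and does not use the PyG API, which constructs these attributes for you when collating a batch.

```python
import numpy as np


def sparse_batch(coords_list):
    """Concatenate per-graph arrays and build the batch vector and pointers."""
    lengths = np.array([len(c) for c in coords_list])
    coords = np.concatenate(coords_list, axis=0)             # no new batch dimension
    batch = np.repeat(np.arange(len(coords_list)), lengths)  # graph index per node
    ptr = np.concatenate([[0], np.cumsum(lengths)])          # graph boundaries
    return coords, batch, ptr


# Three toy proteins with 3, 5 and 2 residues in atom37 layout.
coords_list = [np.full((n, 37, 3), float(i)) for i, n in enumerate((3, 5, 2))]
coords, batch, ptr = sparse_batch(coords_list)

print(coords.shape)    # (10, 37, 3)
print(batch.tolist())  # [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
print(ptr.tolist())    # [0, 3, 8, 10]

# ptr gives direct access to a single graph ...
second = coords[ptr[1]:ptr[2]]
print(second.shape)    # (5, 37, 3)
# ... while batch enables per-graph reductions, here the number of nodes per graph.
print(np.bincount(batch).tolist())  # [3, 5, 2]
```

Note that ptr has one entry more than there are graphs, which matches the ptr=[33] for 32 structures in the DataBatch shown above.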

Interconversion from dense to PyG format and back is easy to do if all of the graphs are the same size: we can use the PyG DenseDataLoader for that.

In the padded case, there is no such functionality yet, but there might soon be a DensePaddingDataLoader that does exactly that.

AFDB, ESMAtlas & co: how to deal with large databases

Everything we did with protein sequences we can now do with protein structures

Visual comparison of the size of the PDB vs the AFDB. Source: YouTube

Many groups have developed tools in the last years to tackle this issue. Especially the Steinegger lab has produced some fantastic tools in that space. If you want to read more about these tools, I have a separate blogpost describing three of them in detail: Foldcomp for structure compression, Foldseek for structure clustering and mmseqs for sequence clustering (also very important in that context for generating both input MSAs and training splits).

Tools from the Steinegger Lab. Source: YouTube

Summary

In this post we discussed four different levels of information representation:

We started with the data formats in which protein structures are stored and transmitted and the evolution they underwent in the last decades.
After that we looked at how both sequence and structure information can be converted into a format that can be used by machine learning algorithms, specifically the atom14 and atom37 format.
Once inside the network, we discussed how different methods leverage this information differently, either via reference-based or reference-free methods, looking at how we can deal with geometric information while respecting the symmetries inherent to it.
Finally, we looked at how different frameworks deal with the variable length of protein structures and how this affects batching behaviour.

I hope that this post can shine some light not only on which representations are used in which circumstances but also why. If you have feedback let me know!

How does PyTorch implement a linear layer?

2024-01-10T00:00:00+00:00

PyTorch is the deep learning library. It is used by researchers and practitioners alike to build and train neural networks. It is also open source, which means that we can look at the source code to understand how it works. This is especially useful if we want to understand how a specific operation is implemented.

In my post about GPU programming in PyTorch, we saw that calling a linear layer in PyTorch via torch.nn.Linear results in a call to the aten::addmm function. The ATen library is part of the PyTorch C++ API and is responsible for the tensor operations in PyTorch. So if we want to understand how the linear layer is implemented in PyTorch, we need to dig into C++ code and understand how the aten::addmm function is implemented. This is a bit of a convoluted process, but I hope that in the process you learn as much about the PyTorch codebase as I did when I went down this rabbit hole.

PyTorch Docs and the Dispatcher
Native functions and the codegen pipeline
Navigating the at::native namespace
Structured Kernels
Where are the actual implementations?
Conclusion
Credits

PyTorch Docs and the Dispatcher

To get an idea of what these operations do, we can look at the PyTorch at Namespace docs and look for these functions. Via this we see that the aten::addmm function is defined in build/aten/src/ATen/Functions.h. Looking at the program listing, we can see that it calls at::_ops::addmm_out::call(self, mat1, mat2, beta, alpha, out).

We can look at the respective Python API to learn more about the different arguments of the addmm function. The addmm function is a matrix multiplication followed by a matrix addition of the following form:

\text{out} = \beta \text{input} + \alpha (\text{mat1} @ \text{mat2})

The mat1 and mat2 arguments are the input matrices, beta is a scaling factor for the input matrix input, alpha is a scaling factor for the matrix multiplication and out is the output tensor.
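As a reference point for the rest of the post, the formula can be written out in a few lines of plain Python. This is only a sketch of the arithmetic: matmul and addmm_reference are hypothetical helper names, matrices are lists of rows, and input is assumed to already have the output shape (the real operator also broadcasts it).

```python
def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def addmm_reference(inp, mat1, mat2, beta=1, alpha=1):
    """out = beta * inp + alpha * (mat1 @ mat2), element by element."""
    prod = matmul(mat1, mat2)
    return [[beta * inp[i][j] + alpha * prod[i][j]
             for j in range(len(prod[0]))]
            for i in range(len(prod))]


inp = [[1, 1], [1, 1]]
mat1 = [[1, 2], [3, 4]]
mat2 = [[5, 6], [7, 8]]
out = addmm_reference(inp, mat1, mat2, beta=2, alpha=3)
```

For a linear layer with a 2D input x, weight W and bias b, the output x W^T + b matches this form with input = b, mat1 = x, mat2 = W^T and beta = alpha = 1.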

Simply browsing the PyTorch GitHub repo for the implementation of a function is unfortunately quite a pain. One of the main reasons is that, depending on your backend (CPU, NVIDIA GPU, Apple M-series chips, …), the PyTorch dispatcher dynamically dispatches to the correct kernel for your setup.
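The idea of dispatching can be pictured with a toy lookup table. This is a deliberately simplified model and not how the PyTorch dispatcher is implemented; KERNELS, register and dispatch are hypothetical names invented for this sketch.

```python
# Toy model of backend dispatch: one table entry per (operator, backend) pair.
KERNELS = {}


def register(op, backend):
    """Decorator that records a kernel for an (operator, backend) pair."""
    def wrap(fn):
        KERNELS[(op, backend)] = fn
        return fn
    return wrap


@register("addmm", "CPU")
def addmm_cpu(*args):
    return "ran the CPU kernel"


@register("addmm", "CUDA")
def addmm_cuda(*args):
    return "ran the CUDA kernel"


def dispatch(op, backend, *args):
    """Look up the kernel for this backend at call time and run it."""
    try:
        kernel = KERNELS[(op, backend)]
    except KeyError:
        raise NotImplementedError(f"no {op} kernel registered for {backend}")
    return kernel(*args)
```

The consequence for code reading is the same as in the real codebase: the function you call is not the function that does the work, so you have to find out which kernel is registered for your backend.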

Native functions and the codegen pipeline

Another complication is that many operations are not fully implemented by hand in the PyTorch codebase, but are partly generated during the PyTorch build process via a code-generation pipeline (more on this in this podcast episode). This is sensible: while many operations in PyTorch are in principle quite simple (element-wise additions, activation functions, …), there is a lot of boilerplate that every operation has to implement (like bindings to Python, autograd support, registering the kernel to the dispatcher, …). The codegen pipeline allows PyTorch to generate this boilerplate code automatically.

What we need to do therefore is to look at the native_functions.yaml file, with “native” functions being the modern mechanism for adding operators and functions to ATen (more details in this podcast episode). This file describes metadata about each operator that gets consumed by the codegen (more details on the different fields in this yaml file here).

If we search in the native_functions.yaml file for addmm, we find the following entry:

# file: "native_functions.yaml"
- func: addmm(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1) -> Tensor
  structured_delegate: addmm.out
  variants: function, method
  dispatch:
    SparseCPU: addmm_sparse_dense_cpu
    SparseCUDA: addmm_sparse_dense_cuda
    SparseCsrCPU, SparseCsrCUDA: addmm_sparse_compressed_dense
  tags: core

Entry for the addmm function

We see the structured_delegate field, which tells us that the actual implementation of the addmm function is in the addmm.out function (more on this later). We can find the entry for this function in the same native_functions.yaml file:

# file: "native_functions.yaml"
- func: addmm.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  structured: True
  dispatch:
    CPU: addmm_out_cpu
    CUDA: addmm_out_cuda
    MPS: addmm_out_mps
    SparseCPU: addmm_out_sparse_dense_cpu
    SparseCUDA: addmm_out_sparse_dense_cuda
    SparseCsrCPU: addmm_out_sparse_compressed_cpu
    SparseCsrCUDA: addmm_out_sparse_compressed_cuda

Entry for the addmm.out function

Ignoring the structured field for now, we see multiple things:

We have multiple entries for the addmm function, addmm and addmm.out (exposed as addmm_out in C++). There are in fact three different versions of most PyTorch operators (however, we only see the addmm and addmm_out functions in the codebase since the in-place version is generated automatically). The examples below use torch.add to illustrate the calling conventions:
- addmm: the functional version that performs the operation without modifying the original tensor and returns a new tensor, for example output = torch.add(input, other)
- addmm_: the in-place version that modifies the original tensor, for example input.add_(other)
- addmm_out: the out version that takes an additional tensor as an argument and writes the result to this tensor, for example torch.add(input, other, out=output)
We see that for each backend (CPU, CUDA, MPS, …) there is a separate implementation of the addmm function. This is because the implementation of the addmm function can be highly dependent on the specific hardware and memory layout of the input tensors. For example, the addmm function for sparse tensors is implemented differently than the addmm function for dense tensors.
In summary, this means that we need (#variants × #backends) kernel implementations for each operator. This is a lot of boilerplate code that the codegen pipeline can generate for us.
The variants field tells us that the addmm function can be called as a namespace function (at::addmm()) or as a Tensor method (t.addmm()). This is because PyTorch supports both functional and method-based APIs. To qualify as a Tensor method, there must be a Tensor self argument in the function signature, since otherwise the function could not be called as a method on a tensor. In the method variant this argument is removed from the function signature. A function variant is always generated by ATen, but when should you also generate a method variant? From the PyTorch native README:

Tensor operations as methods are appropriate for “core” Tensor operations (e.g., add, sub, etc.), but not for more complicated neural network layers (e.g., conv2d) and internal functions designed specifically for binding (e.g., cudnn_convolution).

Navigating the `at::native` namespace

If we want to look for where a specific implementation of the addmm function is, we just need to look for the name of the function in the at::native namespace. This still does not bring us to the actual implementation of the function easily because there are more than 2000 PyTorch operators which can be grouped into various categories. We can see in the post linked in the last sentence that addmm is counted as one of the 13 composite matmul operators. There are different ways to categorize the operators (for example by shape behavior), but the point is that there are a lot of them.

To find our addmm needle in the at::native namespace haystack, we can either directly open a codespace on GitHub or we can clone the PyTorch repo. Both options give us access to a terminal where we can find the implementation of the addmm function by running git grep "addmm". This will give us a list of all files in the current folder of the PyTorch repo that contain the string addmm. We can then look through these files to find the actual implementation of the addmm function. So we do the following in summary:

git clone https://github.com/pytorch/pytorch 
cd pytorch/aten/src/ATen/native
git grep "addmm"

This gives us a lot of output, but we can see that there are two kinds of function declarations in LinearAlgebra.cpp that look promising:

A meta function called TORCH_META_FUNC(addmm)
Multiple implementation functions: TORCH_IMPL_FUNC(addmm_out_cpu), but also the CUDA implementation in the cuda/Blas.cpp file called TORCH_IMPL_FUNC(addmm_out_cuda)

This insight leads us to another new term we have to understand in order to make sense of the codebase: Structured Kernels.

Structured Kernels

Structured Kernels is a new (i.e. from 2021) way to define PyTorch operators. It abstracts away even more of the boilerplate code that has to be written for each operator and backend than native functions alone, to the extent that you only need to write a shape-checking function (meta function) and a kernel implementation function for the out-kernel and the structured kernel will take care of the rest.

This now explains the structured and structured_delegate fields in the native_functions.yaml file. The structured field tells us that addmm.out is a structured kernel, and the structured_delegate field tells us that addmm delegates its actual implementation to addmm.out.
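The division of labour can be sketched in plain Python. This is a toy illustration of the idea only (the real pipeline generates C++ from the yaml entries); all function names below are hypothetical and matrices are lists of rows. Only the meta function and the out kernel are written by hand, and the functional and in-place variants are thin wrappers around them.

```python
def addmm_meta(self_shape, mat1_shape, mat2_shape):
    """'Meta' step: validate the shapes and return the output shape."""
    assert len(mat1_shape) == 2 and len(mat2_shape) == 2
    assert mat1_shape[1] == mat2_shape[0], "shapes cannot be multiplied"
    return (mat1_shape[0], mat2_shape[1])


def addmm_out_impl(self, mat1, mat2, beta, alpha, out):
    """'Impl' step: the only hand-written kernel; it writes into `out`."""
    for i in range(len(out)):
        for j in range(len(out[0])):
            acc = sum(mat1[i][k] * mat2[k][j] for k in range(len(mat2)))
            out[i][j] = beta * self[i][j] + alpha * acc
    return out


def shape(m):
    return (len(m), len(m[0]))


def addmm(self, mat1, mat2, beta=1, alpha=1):
    """Functional variant (derived): allocate a fresh output, then call impl."""
    rows, cols = addmm_meta(shape(self), shape(mat1), shape(mat2))
    out = [[0] * cols for _ in range(rows)]
    return addmm_out_impl(self, mat1, mat2, beta, alpha, out)


def addmm_(self, mat1, mat2, beta=1, alpha=1):
    """In-place variant (derived): reuse `self` as the output buffer."""
    addmm_meta(shape(self), shape(mat1), shape(mat2))
    return addmm_out_impl(self, mat1, mat2, beta, alpha, self)
```

Adding a new backend in this picture means writing one more out kernel, not three more functions.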

Pre structured kernels, entries in the native_functions.yaml file looked like this:

# file: "native_functions.yaml"
- func: addmm(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1) -> Tensor
  # structured_delegate: addmm.out removed!
  variants: function, method
  dispatch:
    #CPU, CUDA and MPS kernels added!
    CPU: addmm_cpu
    CUDA: addmm_cuda
    MPS: addmm_mps
    SparseCPU: addmm_sparse_dense_cpu
    SparseCUDA: addmm_sparse_dense_cuda
    SparseCsrCPU, SparseCsrCUDA: addmm_sparse_compressed_dense
  tags: core

# file: "native_functions.yaml"
- func: addmm.out(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  # structured: True removed!
  dispatch:
    CPU: addmm_out_cpu
    CUDA: addmm_out_cuda
    MPS: addmm_out_mps
    SparseCPU: addmm_out_sparse_dense_cpu
    SparseCUDA: addmm_out_sparse_dense_cuda
    SparseCsrCPU: addmm_out_sparse_compressed_cpu
    SparseCsrCUDA: addmm_out_sparse_compressed_cuda

You can see that before structured kernels, both the addmm and addmm.out entries had a dispatch field that specified all the backends for which the function had to be implemented. The CPU, CUDA and MPS kernels then had to be implemented separately for the addmm and addmm.out functions. This is a lot of boilerplate code that structured kernels can now generate for us.

In the structured version of the yaml file, by contrast, the addmm entry has a structured_delegate field that points to addmm.out. Only the out variant, marked with structured: True, is implemented by hand (for the CPU in the LinearAlgebra.cpp file), and the functional variant is derived from it.

In the ideal case of a structured kernel, the addmm function would not need any dispatch field because addmm.out as the structured delegate would provide all the kernel implementations. This can be seen in the example from the RFC for structured kernels:

In the addmm function, however, we still see the dispatch field. This is because the addmm function is a composite matmul operator, and the implementation can be highly specific in the sparse case. Therefore we cannot rely on the structured kernel to generate the correct implementation for us, and we have to specify the dispatch field manually. If you want to learn more about how all this is implemented under the hood, check out this slide deck.

Where are the actual implementations?

We are already quite deep down the rabbit hole and have tracked the addmm function to the LinearAlgebra.cpp and cuda/Blas.cpp files. These files contain the meta function TORCH_META_FUNC(addmm) and the implementation functions TORCH_IMPL_FUNC(addmm_out_cpu) and TORCH_IMPL_FUNC(addmm_out_cuda). The TORCH_META_FUNC function checks the shapes and data types of the input tensors and sets up the output tensor. The TORCH_IMPL_FUNC functions are the actual implementations of the addmm function for the CPU and CUDA backends.

Let us look at these in turn now.

Shape checking: `TORCH_META_FUNC(addmm)`

The TORCH_META_FUNC(addmm) function is a wrapper around ADDMM_META(). Why another wrapper, you may ask? Well, the shape checking done in this function is transferable to other cases such as TORCH_META_FUNC(_addmm_activation), so the wrapper promotes reusability.

Looking at the implementation of ADDMM_META(), we see that it is actually not a function but a preprocessor macro:

#define ADDMM_META() \
  TORCH_CHECK(self.scalar_type() == mat2.scalar_type(), "self and mat2 must have the same dtype, but got ", self.scalar_type(), " and ", mat2.scalar_type()); \
  TORCH_CHECK(mat1.scalar_type() == mat2.scalar_type(), "mat1 and mat2 must have the same dtype, but got ", mat1.scalar_type(), " and ", mat2.scalar_type()); \
  TORCH_CHECK(mat1.dim() == 2, "mat1 must be a matrix, got ", mat1.dim(), "-D tensor"); \
  TORCH_CHECK(mat2.dim() == 2, "mat2 must be a matrix, got ", mat2.dim(), "-D tensor"); \
  TORCH_CHECK( \
      mat1.sizes()[1] == mat2.sizes()[0], "mat1 and mat2 shapes cannot be multiplied (", \
      mat1.sizes()[0], "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); \
 \
  auto names = at::namedinference::propagate_names_for_addmm(mat1, mat2, self); \
  set_output_raw_strided(0, {mat1.sizes()[0], mat2.sizes()[1]}, {}, mat1.options(), names);

As expected, it performs a lot of checks on the input tensors. It checks that the input tensors have the same data type, that mat1 and mat2 are both 2D tensors (i.e. matrices), and that the shapes of mat1 and mat2 are compatible for matrix multiplication. It then calls the at::namedinference::propagate_names_for_addmm function to propagate the names of the input tensors to the output tensor. Finally, it sets the output tensor to the correct shape.
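The same sequence of checks can be replayed on plain shapes and dtype strings. The sketch below mirrors the order and messages of the macro shown above; addmm_meta_checks is a hypothetical name, the use of ValueError is my choice for the sketch, and name propagation and output allocation are left out.

```python
def addmm_meta_checks(self_dtype, mat1_shape, mat1_dtype, mat2_shape, mat2_dtype):
    """Replay the ADDMM_META checks on plain values; return the output shape."""
    if self_dtype != mat2_dtype:
        raise ValueError(
            f"self and mat2 must have the same dtype, but got {self_dtype} and {mat2_dtype}")
    if mat1_dtype != mat2_dtype:
        raise ValueError(
            f"mat1 and mat2 must have the same dtype, but got {mat1_dtype} and {mat2_dtype}")
    if len(mat1_shape) != 2:
        raise ValueError(f"mat1 must be a matrix, got {len(mat1_shape)}-D tensor")
    if len(mat2_shape) != 2:
        raise ValueError(f"mat2 must be a matrix, got {len(mat2_shape)}-D tensor")
    if mat1_shape[1] != mat2_shape[0]:
        raise ValueError(
            "mat1 and mat2 shapes cannot be multiplied "
            f"({mat1_shape[0]}x{mat1_shape[1]} and {mat2_shape[0]}x{mat2_shape[1]})")
    # The macro would now propagate names and set up the output tensor;
    # this sketch only reports the shape it would use.
    return (mat1_shape[0], mat2_shape[1])


out_shape = addmm_meta_checks("float32", (2, 3), "float32", (3, 4), "float32")
```

Note that the output shape depends only on mat1 and mat2, which is why the meta function can run before any data is touched.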

CPU implementation: `TORCH_IMPL_FUNC(addmm_out_cpu)`

If we look at the TORCH_IMPL_FUNC(addmm_out_cpu) function, we see that it is again a wrapper! It first expands the output tensor to the correct shape (rows = mat1.sizes()[0], columns = mat2.sizes()[1]) and then calls the addmm_impl_cpu_() function.

Fortunately, this time we do not have to search long for the actual implementation of the addmm_impl_cpu_() function. It is in the same file and longer than the previous wrapper function (which makes sense since it is the actual implementation of the addmm function).

Looking at the function signature, we see the following:

static void addmm_impl_cpu_(
    Tensor &result, const Tensor &self, Tensor m1, Tensor m2, const Scalar& beta, const Scalar& alpha)

We see that the function does not return anything, but takes a reference to the output tensor result and the input tensors self, m1 and m2 as well as the scaling factors beta and alpha. It starts with some shape asserts and data type checks. It then assigns the sizes of the different matrices to auto variables, since accessing these arrays is faster than calling the size() method multiple times (we will need these sizes for the matrix multiplication). After some additional checks and resizings we get to the core of the function.

// Some paths in the code below do not handle multiplications of the form [a, 0] x [0, b]
  if (m1_sizes[1] == 0) {
    if (beta.toComplexDouble() == 0.0) {
      result.zero_();
    } else {
      if (!self.is_same(result)) {
        result.copy_(self);
      }
      result.mul_(beta);
    }
    return;
  }

Checks for the $\beta$ value. Link

As the comment tells us, the code after the excerpt cannot handle multiplications of the form $[a, 0] \times [0, b]$, so it checks for this case and handles it separately. We can see that if the input scaling factor $\beta$ is zero, the output tensor is zeroed out. If the input scaling factor $\beta$ is not zero, the output tensor copies the entries from the self tensor and is scaled by $\beta$. The function then returns.
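Why is the zero case a separate branch instead of simply multiplying by $\beta$? A short plain-Python sketch of the same logic shows the difference; addmm_empty_inner is a hypothetical name and matrices are lists of rows.

```python
import math


def addmm_empty_inner(self, beta, rows, cols):
    """Result of addmm when mat1 is [rows, 0] and mat2 is [0, cols]."""
    if beta == 0:
        # self is ignored entirely, so NaN or inf entries do not leak through.
        return [[0.0] * cols for _ in range(rows)]
    # The matmul term is an empty sum (zero), leaving only beta * self.
    return [[beta * self[i][j] for j in range(cols)] for i in range(rows)]


self_ = [[math.nan, 1.0], [2.0, 3.0]]
zeroed = addmm_empty_inner(self_, beta=0, rows=2, cols=2)
scaled = addmm_empty_inner(self_, beta=2, rows=2, cols=2)

# Without the explicit branch, a NaN in self would survive even for beta = 0.
naive = 0 * math.nan
```

In floating-point arithmetic 0 * NaN is NaN, so only the explicit zeroing guarantees that values in self are truly ignored when $\beta$ is zero.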

After that, the tensors result, m1 and m2 are prepared as the matrices c, a and b. We do this to get the shapes and memory layouts right for the matrix multiplication.

Finally, we get to the matrix multiplication itself. Depending on which CPU hardware we have, we can still dispatch to two different implementations.

On AArch64 we can call the mkldnn_matmul function that is faster in case certain shape considerations are fulfilled:

  bool dispatched = false;
#if defined(__aarch64__) && AT_MKLDNN_ACL_ENABLED()
  // On AArch64 if LHS matrix in BLAS routine is transposed but RHS is not then
  // it is faster to call oneDNN matrix multiplication primitive with RHS*LHS
  // that will call then into Arm® Compute Library (ACL) GEMM kernel and also
  // additionally have support for running kernel with BF16 instructions
  if (transpose_c) {
    bool apply_heur = apply_mkldnn_matmul_heur(b.sizes()[0], b.sizes()[1], a.sizes()[1]);
    if (apply_heur && transpose_a && !transpose_b && result.scalar_type() == at::ScalarType::Float) {
      try {
        mkldnn_matmul(b, a, c, beta.to<float>(), alpha.to<float>());
        // We have dispatched to ACL GEMM for single precision float
        // so do not need to dispatch to BLAS GEMM below
        dispatched = true;
      } catch (const std::exception& e) {
        TORCH_WARN("mkldnn_matmul failed, switching to BLAS gemm:", e.what());
        at::globalContext().setUserEnabledMkldnn(false);
      }
    }
  }
#endif

AArch64 matrix multiplication dispatch. Link

If this option is not enabled (or if the heuristic check for the matrix shapes fails), we fall back to the gemm function from the BLAS library:

  if(!dispatched) {
    // Apply BLAS routine
    _AT_DISPATCH_ADDMM_TYPES(result.scalar_type(), "addmm_impl_cpu_", [&]{
          using opmath_t = at::opmath_type<scalar_t>;
          at::native::cpublas::gemm(
              transpose_a ? a.is_conj() ? TransposeType::ConjTranspose : TransposeType::Transpose : TransposeType::NoTranspose,
              transpose_b ? b.is_conj() ? TransposeType::ConjTranspose : TransposeType::Transpose : TransposeType::NoTranspose,
              m, n, k,
              alpha.to<opmath_t>(),
              a.const_data_ptr<scalar_t>(), lda,
              b.const_data_ptr<scalar_t>(), ldb,
              beta.to<opmath_t>(),
              c.mutable_data_ptr<scalar_t>(), ldc);
        });
  }

CPU BLAS dispatch to the GEMM function. Link

With this, we have the actual implementation of the addmm function for the CPU backend.

CUDA implementation: `TORCH_IMPL_FUNC(addmm_out_cuda)`

The CUDA implementation is quite similar at first sight: we again call the actual implementation function addmm_out_cuda_impl(), which is reused in multiple other functions.

The actual implementation of the addmm_out_cuda_impl() function is in the same file and again starts with some shape asserts and data type checks. We again have a check for the case where mat1 is empty, and within it the case where the input scaling factor $\beta$ is zero is handled separately:

  if (mat1.numel() == 0) {
    // By definition, when beta==0, values in self should be ignored. nans and infs
    // should not propagate
    if (beta.toComplexDouble() == 0.) {
      return result.zero_();
    }
    // TODO: We could squeeze some perf by calling at::cuda::mul_out here instead, to bypass the dispatcher.
    // That requires some fixing some internal build dependencies though.
    return at::mul_out(
        result,
        self.expand(result.sizes()),
        at::native::scalar_tensor(
            beta,
            self.scalar_type(),
            c10::nullopt /* layout */,
            at::kCPU,
            c10::nullopt /* pin_memory */));
  }

$\beta$ checks of the addmm CUDA implementation. Link

After that, we again dispatch to different kernels (this time CUDA kernels) depending on the hardware we have. The CUDA implementation is more complex than the CPU implementation since we have to take into account the different CUDA hardware architectures and the different CUDA libraries that are available. Here is one example:

    // If batch is 1 call gemm rather than bgemm
    if (num_batches == 1) {
      at::cuda::blas::gemm<scalar_t>(
          transa, transb,
          m, n, k,
          alpha_val,
          batch1_ptr, lda,
          batch2_ptr, ldb,
          beta_val,
          result_ptr, ldc);
    } else {
      at::cuda::blas::bgemm<scalar_t>(
        transa, transb,
        m, n, k,
        alpha_val,
        batch1_ptr, lda, batch1_->strides()[0],
        batch2_ptr, ldb, batch2_->strides()[0],
        beta_val,
        result_ptr, ldc, result_->strides()[0],
        num_batches
      );
   }

CUDA GEMM dispatch. Link

You can see that depending on the number of batches, we call either the gemm or the bgemm function from the CUDA BLAS wrappers. The bgemm function is a batched version of the gemm function that performs many independent matrix multiplications in a single call. This is useful if we have a whole batch of matrix pairs to multiply. To learn more about the different CUDA BLAS functions, you can look at the cuBLAS documentation and the matrix multiplication user guide.

Conclusion

In this post, we went on a whirlwind tour of the PyTorch codebase to understand how the addmm function is implemented. We saw that the addmm function is not only a PyTorch native function specified in the native_functions.yaml file, but also a structured kernel, and that the actual implementation of the addmm function is in the addmm.out function. We then looked at the addmm.out function and realised that it is a wrapper around the addmm_impl_cpu_() and addmm_out_cuda_impl() functions. Upon inspecting these two functions it became clear that they are the actual implementations of the addmm function for the CPU and CUDA backends. They look quite complicated due to different dispatch conditions, shape checks and data type checks, but the core of each function (the matrix multiplication) is in the end again a call to a kernel from a library.

I hope that this post gave you a good overview of how to find the implementation of a PyTorch operator and how to navigate the PyTorch codebase. If you have a better way to do that, let me know!

Credits

There is an amazing blog post about PyTorch internals by Edward Z. Yang as well as his PyTorch developer podcast that helped me immensely in understanding the PyTorch codebase. Also shoutout to Christian Perone for his slides on PyTorch 2 internals that shine some light on the recent developments connected with the PyTorch 2 release.

PyTorch Logo taken from this post.

(GER) What are Diffusion Models?

2023-05-15T00:00:00+00:00

(The German version begins below!)

This post is a rather unusual one since it is in German. I have always been involved in making content available in other languages to allow more people to enjoy it, such as when I did translations for Khan Academy. After translating the posts on normalising flows by Eric Jang, I have the pleasure of now translating Lilian Weng’s excellent post on diffusion models. I hope you enjoy it!

Introduction
What are Diffusion Models
Forward Diffusion Process
Connection to Stochastic Gradient Langevin Dynamics
Reverse Diffusion Process
Credits

Introduction

GANs, VAEs and normalising flows are three types of machine learning models for generative purposes. All three have been very successful at generating high-quality samples, but each of the three families has its own problems. GANs are known for unstable training and for lower diversity in the generated samples as a result of how they are trained. VAEs rely on a so-called “surrogate loss”. Normalising flows have to use specialised architectures to construct reversible transformations.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds random noise to the data, and then learn to reverse the diffusion process in order to construct the desired data samples from the noise. Unlike VAEs or normalising flows, diffusion models are learned with a fixed procedure, and the latent variable has high dimensionality (the same as the original data).

What are Diffusion Models

Several diffusion-based generative models with similar ideas have been proposed, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score networks (NCSN; Song & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al. 2020).

Forward Diffusion Process

Suppose we have a data point from a real data distribution, $x_0 \sim q(x)$. We can then define a forward diffusion process in which we add small amounts of Gaussian noise to the data point over $T$ steps, producing a sequence $x_1, ..., x_T$ of corrupted (so-called noised) data points. We control the step size between these data points with the so-called variance schedule $\{\beta_t \in (0,1)\}^T_{t=1}$.

\begin{aligned} q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t \textbf{I}) \hspace{10px} q(x_{1:T} | x_0) = \prod^T_{t=1} q(x_t | x_{t-1}) \end{aligned}

Our data point $x_0$ thus loses its recognisable features as $t$ grows. As $T \to \infty$, the distribution of $x_T$ approaches an isotropic normal distribution.

Fig. 2. The Markov chain of the forward (reverse) diffusion process, in which a sample is generated by slowly adding (removing) noise. Source: Ho et al. 2020 with a few additional annotations.

A useful property of this process is that we can sample $x_t$ at any time step $t$ in closed form, using the reparameterisation trick. Let $\alpha_t = 1 - \beta_t$ and $\overline{\alpha_t} = \prod^t_{i=1} \alpha_i$:

\begin{aligned} x_t &= \sqrt{\alpha_t}x_{t-1} + \sqrt{1-\alpha_t} \epsilon_{t-1}; \hspace{10px} \epsilon_{t-1}, \epsilon_{t-2}, ... \sim \mathcal{N}(0,\textbf{I}) \\[1em] &= \sqrt{\alpha_t \alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}} \overline{\epsilon_{t-2}} \hspace{10px} (*) \\[1em] &= ... \\[1em] &= \sqrt{\overline{\alpha_t}}x_0 + \sqrt{1-\overline{\alpha_t}}\epsilon \\[1em] q(x_t | x_{0}) &= \mathcal{N}(x_t; \sqrt{\overline{\alpha_t}} x_{0}, (1-\overline{\alpha_t}) \textbf{I}) \end{aligned}

(*) When we combine two normal distributions with different variances, the new normal distribution has the sum of the variances as its variance: $\mathcal{N}(0, \sigma^2_1\textbf{I}) + \mathcal{N}(0, \sigma^2_2\textbf{I}) = \mathcal{N}(0, (\sigma^2_1 + \sigma^2_2)\textbf{I})$. In our case the combined standard deviation is $\sqrt{(1-\alpha_t) + \alpha_t(1-\alpha_{t-1})} = \sqrt{1-\alpha_t \alpha_{t-1}}$.

Usually we can afford larger update steps when our sample contains more noise, so we set the variance schedule such that $\beta_t$ grows with $t$: $\beta_1 < \beta_2 < ... < \beta_T$ and therefore $\overline{\alpha_1} > \overline{\alpha_2} > ... > \overline{\alpha_T}$.
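The closed form can be checked numerically by tracking the mean scale and noise variance of $x_t$ step by step and comparing them with $\sqrt{\overline{\alpha_t}}$ and $\sqrt{1-\overline{\alpha_t}}$. The sketch below is a sanity check under assumptions of my own: the linear schedule and its endpoint values are illustrative, and all function names are hypothetical.

```python
import math


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing variance schedule (endpoint values are illustrative)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def closed_form_coefficients(betas):
    """Return (sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t)) for t = 1..T."""
    coeffs, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        coeffs.append((math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)))
    return coeffs


def stepwise_coefficients(betas):
    """Track mean scale and noise std by applying q(x_t | x_{t-1}) repeatedly."""
    coeffs, scale, var = [], 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        scale *= math.sqrt(alpha)   # the mean is multiplied by sqrt(alpha_t)
        var = alpha * var + beta    # variances of independent Gaussians add
        coeffs.append((scale, math.sqrt(var)))
    return coeffs


betas = linear_beta_schedule(T=1000)
closed = closed_form_coefficients(betas)
stepwise = stepwise_coefficients(betas)
max_gap = max(max(abs(c[0] - s[0]), abs(c[1] - s[1]))
              for c, s in zip(closed, stepwise))
```

Both routes agree up to floating-point error, and at the final step the signal coefficient is close to zero, so $x_T$ is almost pure noise.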

Connection to Stochastic Gradient Langevin Dynamics

Langevin dynamics is a concept from physics that was developed for the statistical modelling of molecular systems. Combined with stochastic gradient descent, it yields stochastic gradient Langevin dynamics (Welling & Teh 2011). This procedure can draw samples from a probability distribution $p(x)$ and only requires the gradients of the log-probability $\nabla_x \log p(x)$ to do so. The gradients are combined with a noise term to produce the samples. The gradient is then evaluated at the new sample, and the process is repeated. This iterative process can be described as a Markov chain of updates:

x_t = x_{t-1} + \frac{\delta}{2} \nabla_x \log p(x_{t-1}) + \sqrt{\delta} \epsilon_t; \hspace{10px} \epsilon_t \sim \mathcal{N}(0, \textbf{I})

where $\delta$ is the step size of the updates. As $T \to \infty$ and $\delta \to 0$, the distribution of $x_T$ converges to the true probability distribution $p(x)$.

Compared with standard gradient descent methods, which only use the gradients of the log-probability, we add a noise term here. This prevents the samples from collapsing into local minima of the probability distribution.
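A minimal sketch of this update rule, assuming a one-dimensional standard normal target so that the score is known exactly ($\nabla_x \log p(x) = -x$). The step size, chain length, burn-in and function names are illustrative choices, not values from the original post.

```python
import math
import random


def score_standard_normal(x):
    """Gradient of log p(x) for p = N(0, 1): d/dx (-x^2 / 2) = -x."""
    return -x


def langevin_chain(score, x0, step, n_steps, rng):
    """Iterate x_t = x_{t-1} + step/2 * score(x_{t-1}) + sqrt(step) * eps_t."""
    x, samples = x0, []
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


rng = random.Random(0)
samples = langevin_chain(score_standard_normal, x0=5.0, step=0.1,
                         n_steps=50_000, rng=rng)
kept = samples[1_000:]  # discard burn-in from the far-off starting point
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

With a finite step size the chain is only approximately distributed according to the target, so the empirical mean and variance land near 0 and 1 rather than exactly on them.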

Reverse Diffusion Process

If we could reverse the forward diffusion process described above and thus draw samples from $q(x_{t-1} \vert x_{t})$, we could draw samples from the data distribution $q(x)$ starting from Gaussian noise $x_T \sim \mathcal{N}(0, \textbf{I})$. This process is called the reverse diffusion process.

If $\beta_t$ is small enough, $q(x_{t-1} \vert x_{t})$ will also follow a normal distribution.

Unfortunately, we would need to know the entire data distribution $q(x)$ to compute $q(x_{t-1} \vert x_{t})$. This is not possible in practice. However, we can learn a model $p_{\theta}$ that approximates these conditional probabilities. Using this model, we can then run the reverse diffusion process and draw approximate samples from $q(x)$:

p_{\theta}(x_{0:T}) = p(x_T) \prod^T_{t=1} p_{\theta}(x_{t-1} | x_{t}) \hspace{10px} p_{\theta}(x_{t-1} | x_{t}) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t,t))

Fig. 3. Example of training a diffusion model to model 2D swiss roll data. (Source: Sohl-Dickstein et al. 2015)

It is noteworthy that the reverse conditional probability becomes tractable when it is additionally conditioned on $x_0$:

q(x_{t-1} \vert x_{t}, x_0) = \mathcal{N}(x_{t-1}; \color{blue} \tilde{\mu}(x_t, x_0), \color{red} \tilde{\beta}_t \mathbf{I})

Using Bayes' theorem we obtain the following:

\begin{aligned} q(x_{t-1} \vert x_{t}, x_0) &= q(x_t \vert x_{t-1}, x_0) \frac{q(x_{t-1} \vert x_0)}{q(x_t \vert x_0)} \\[1em] q(x_t \vert x_{0}) &= \mathcal{N}(x_t; \sqrt{\overline{\alpha_t}} x_{0}, (1-\overline{\alpha_t}) \textbf{I}) \end{aligned}
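The post stops before carrying this calculation through. For reference, the standard result from the DDPM literature (Ho et al. 2020), which is not derived here, is $\tilde{\beta}_t = \frac{1-\overline{\alpha_{t-1}}}{1-\overline{\alpha_t}} \beta_t$ and $\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\overline{\alpha_{t-1}})}{1-\overline{\alpha_t}} x_t + \frac{\sqrt{\overline{\alpha_{t-1}}}\beta_t}{1-\overline{\alpha_t}} x_0$. The sketch below checks these expressions numerically in the scalar case by multiplying the two Gaussians from Bayes' rule directly; function names and input values are hypothetical.

```python
import math


def posterior_closed_form(x_t, x_0, alpha_t, abar_prev):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0), scalar case."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    mean = (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t
            + math.sqrt(abar_prev) * beta_t / (1.0 - abar_t) * x_0)
    return mean, var


def posterior_via_bayes(x_t, x_0, alpha_t, abar_prev):
    """Same posterior from Bayes' rule: multiply likelihood and prior Gaussians."""
    beta_t = 1.0 - alpha_t
    # Likelihood q(x_t | x_{t-1}) viewed as a function of x_{t-1}.
    lik_precision = alpha_t / beta_t
    lik_linear = math.sqrt(alpha_t) * x_t / beta_t
    # Prior q(x_{t-1} | x_0) = N(sqrt(abar_prev) * x_0, 1 - abar_prev).
    prior_precision = 1.0 / (1.0 - abar_prev)
    prior_linear = math.sqrt(abar_prev) * x_0 / (1.0 - abar_prev)
    # Precisions add; the mean is the precision-weighted combination.
    var = 1.0 / (lik_precision + prior_precision)
    mean = var * (lik_linear + prior_linear)
    return mean, var


closed = posterior_closed_form(x_t=0.3, x_0=1.2, alpha_t=0.98, abar_prev=0.5)
bayes = posterior_via_bayes(x_t=0.3, x_0=1.2, alpha_t=0.98, abar_prev=0.5)
```

The denominator $q(x_t \vert x_0)$ does not depend on $x_{t-1}$, so it only acts as a normalising constant and does not appear in the check.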

Credits

Many thanks to Lilian Weng for her cool blog posts and for the opportunity to translate this one!

Photo by JJ Ying on Unsplash

Annoying Errors and how to fix them

2023-03-23T00:00:00+00:00

We all know how annoying fixing bugs can be. One of the few things that is even more annoying is facing an error you have encountered before but no longer remember the solution for. Therefore, this post is an ongoing mental log of the errors I have encountered at least twice, to avoid repeating the bugfix process all over again from the start.

CUDA Error: Failed to initialize NVML: Driver/library version mismatch
pdb debugging error: if self.quitting: raise BdbQuit (bdb.BdbQuit)

CUDA Error: `Failed to initialize NVML: Driver/library version mismatch`

I encountered this one after having some PyTorch/CUDA errors, trying to reinstall some GPU drivers and failing miserably. Fortunately, after some digging I found this discussion on the NVIDIA forum which cleared things up.

In summary, this error occurs when the NVIDIA kernel module that is currently loaded and the user-space driver libraries installed on your machine have different versions. Sometimes the version is already part of the error message and you can rectify it based on that. If not, follow these steps:

Run sudo nvidia-bug-report.sh
Extract the bug report via gzip -d nvidia-bug-report.log.gz
Open the extracted nvidia-bug-report.log and search for “API Mismatch”. Note down which version your client reports.
Run sudo apt install nvidia-driver-470, replacing 470 with whatever version your client reported in the bug report.

pdb debugging error: `if self.quitting: raise BdbQuit (bdb.BdbQuit)`

This one I got when I was debugging an ML program. I made a local editable install of my ML repo via pip install -e ., had a breakpoint() in my dataloader and ran my model with WandB logging.

What happens is that the dataloader was executed in the background and waited for a signal to continue, step or something else, but waited to no avail and finally quit with BdbQuit. After reading up on this here, I managed to fix it by just running the dataloader itself locally in debug mode.

De novo proteins and where to find them - RosettaCon 2022

2022-11-17T00:00:00+00:00

There has been a lot happening recently in protein design, and it is easy to get lost in the daily flood of new papers and exciting ideas (for an overview of current models and approaches see this review which includes a great model table).

In such situations, it often helps to step back for a second, look at the bigger picture and chat to people about what is happening and what the future might hold. Luckily, RosettaCon was happening this August and provided a venue for exactly that: chatting to people about protein design and fascinating ideas! In this post I want to highlight some presentations from the conference that I think are representative of some broader directions in the field.

Scaffolding protein motifs using Deep Learning - (Baker Lab, IPD, UW)
Manifold Sampling for function-guided antibody design - Vladimir Gligorijevic (Prescient Design)
Designing epitope-specific binders in silico - Possu Huang (Stanford)
Bringing de novo proteins into the clinic - Javier Castellanos (Neoleukin)
Closing thoughts
Credits

Scaffolding protein motifs using Deep Learning - (Baker Lab, IPD, UW)

While complete de novo design of proteins is still the holy grail of the field, in practice you often want to incorporate a motif known from nature into a new custom-made scaffold. This motif-scaffolding problem can be quite tricky, since the desired motif is often energetically unfavourable, which means that the scaffold is required to balance this out in order to form a stable folded protein. In this Science publication, a team from the IPD in Seattle approached this problem via two different deep learning methods: hallucination and inpainting.

Both rely on the impressive advances in protein structure prediction in recent years. More explicitly, hallucination describes an iterative procedure in which a protein structure prediction network repeatedly predicts the structure for a given input sequence. After each prediction, a loss is calculated based on both the general quality of the structure and the recapitulation of the desired motif. This loss is then used to update the sequence (the network weights stay fixed), and the process is repeated until a desired performance threshold is achieved.

On the other hand, inpainting treats the scaffolding problem as an information recovery task. Here, part of the sequence input is masked and the model is asked to predict these missing residues, generating novel stable proteins.

This paper is not the only one to tackle this problem. There have been efforts to use diffusion models for the scaffolding problem as well as graph neural networks. All these approaches sound promising; time will tell which of these will be applicable to which design goal.

Manifold Sampling for function-guided antibody design - Vladimir Gligorijevic (Prescient Design)

Another exciting talk by Vladimir Gligorijevic showcased some of the work that has been going on at Prescient Design and that has been published in this Manifold Design paper. The general idea of the approach (as you can see from the name) builds upon the manifold hypothesis, i.e. your high-dimensional data is normally not widely spread out in these dimensions, but is often restricted to a lower-dimensional manifold.

Since functional proteins only occupy a small fraction of overall sequence space, thinking about protein design in terms of manifold sampling sounds like a reasonable idea and is what drove the recent advances in protein language models, which basically learn to generate sequences that lie on this manifold of natural sequences (one of the early examples of explicitly formulating the problem in this way was this paper by Hie et al. from Stanford).

But this team tackled the problem via a different approach: they built a Denoising Auto-Encoder (DAE) that takes as input a protein sequence, perturbs it and then generates a new protein sequence from there. The cool thing about this unsupervised approach is a separate supervised function classifier that predicts the function of the newly generated sequence based on Gene Ontology (GO) terms and therefore serves as a guide for the DAE to generate sequences with the desired function.

The team shows some pretty cool applications of their approach, from generating new calcium-binding proteins to an ion transporter with a novel all-alpha fold (no pun intended), and they did so starting from a protein with an all-beta fold! In a follow-up paper they describe an approach to only design certain regions of the sequence, enabling workflows similar to the RFDesign pipeline mentioned above.

All in all this seems like a promising way to generate very diverse sequences conditioned on function.

Designing epitope-specific binders in silico - Possu Huang (Stanford)

Generating custom antibodies binding to a specific target is already quite a feat, but doing it not only target- but even epitope-specific would be impressive. Nothing less is what Possu Huang presented at RosettaCon. They published a lot of work on protein design and specifically backbone generation over the last couple of years, using diverse approaches ranging from Generative Adversarial Networks (GANs) to language models.

In 2020 they published a preprint on Ig-VAE, a Variational Autoencoder for generating antibody structures, and published the final version in PLOS CompBio this June. This model is inspired by how nature uses a single antibody scaffold and adapts it to the problem at hand. They wanted to do something similar in silico, so they chose to create a model with three important properties:

rotational and translational invariance should be maintained
the model should be aware of torsion angles since these are very important for protein structure and function
the output should directly be 3D structures and not an intermediate output such as distance maps.

With this new generative model, they were able to embed structures in a latent space and then sample from this latent space to generate novel structures. The direct output of structures made sure that there are no inconsistencies due to an overdetermined system, and the rotational and translational invariance meant that via this inductive bias they were able to circumvent the hassle of data augmentation for different rotational poses and learn a meaningful protein representation directly.

Part of the team behind the paper has continued their work on protein structure generation since then and recently published a preprint in which they show that diffusion models can be used in a similar way to generate protein structures, showing again that novel advances in ML are often transferable to applications in biology. It will be exciting to see what the next breakthrough at this interface will be!

Bringing de novo proteins into the clinic - Javier Castellanos (Neoleukin)

Neoleukin is one of several protein design companies originating in the IPD in Seattle. Their particular focus is therapeutic design with applications in e.g. cancer immunotherapy.

Their lead candidate NL-201 is based on this publication in Nature in which they showed that this de novo protein is an effective mimic of IL-2 and IL-15: it acts as an agonist, which means it can activate cells that express the receptor for these signalling molecules, for example NK or T cells that are important in immunotherapy against cancer. Other IL-2 based therapeutic approaches have been challenging in the past since they bind strongly to CD25, a subunit of another subtype of the IL-2 receptor that is expressed in off-target cell types and greatly enhances both affinity and activity of IL-2. Therefore, these off-target cells experience an even higher activation than the intended cell targets, leading to a plethora of potential side effects. Via computational protein design, they were able to create a protein that selectively binds to the intended subpopulation of IL-2 receptors, showing effective responses in immunotherapy studies.

They now use several of the available open-source tools to optimize their designed sequences, which shows that protein design with machine learning is not a far-fetched dream, but actually already here!

Closing thoughts

The talks mentioned above show the breadth and depth of work going on in the protein design field, from combining established methods with new methods to the development of new algorithmic ideas. It was especially fascinating to see that de novo proteins are moving into the clinic and to the patient now, since advancing disease therapy is the goal of many protein design projects.

Finally, the collaborative and friendly atmosphere at the conference was exceptional; everybody took time to answer questions, help others out and explain new advances in order to move the whole field forward. For me personally, attending RosettaCon remotely was a great experience, and I hope to repeat it in person soon!

Credits

Thanks a lot to the organisers of the RosettaCon conference, both for making the conference a great experience and for allowing me to post this summary on their website and use their logo for the post on my website.

sed, awk & co - master the shell

2022-11-08T00:00:00+00:00

I recently saw two great videos regarding the command line tools sed and awk and thought it might be a good idea to put the commands and varieties explained in these videos here in order to have a quick reference for myself and others in case one struggles again to find the right pattern or syntax for using one of these tools. If you do not know awk and sed yet, I highly recommend watching these videos and getting familiar with them; using them for text manipulation and quick processing is often way quicker than writing a Python or R script for this kind of job. For those who know the two tools already, I hope that this provides a good reference for their usage!

sed - your search and replace function
awk - the allrounder
getting help - man/tldr
Closing thoughts

sed - your search and replace function

sed stands for stream editor, and you can imagine it as your automated search and replace function: with it you can look for patterns and replace them with other patterns. In this part of the post we will use the following text file as an example:

# file: balance.txt
- 25,13 EUR Mon Supermarket -------

+ 13,40 EUR Tue Pizza/Drinks -
- 05,00 EUR Tue Bus --

+ 40,00 EUR Wed Refund ----

Examples:

sed s/,/./ < balance.txt > balance_int.txt: read text from balance.txt, substitute the first comma on each line with a full stop and write the output to balance_int.txt
sed s/,/./g < balance.txt > balance_int.txt: same as above, but with the global option /g sed substitutes every comma with a full stop, not just the first one in each line
echo "15,3" | sed s/,/./: pipe input from other commands
Important: sed is searching for strings, not for words!
sed -i s/,/./g balance.txt: the -i flag makes sed read from and write to the same file; input/output redirection is not needed in this case
sed '/+/s/,/./g' balance.txt: look for lines in balance.txt that contain a + and substitute , with . in these lines.
sed '/-/d' balance.txt: look for lines in balance.txt that contain a - and delete these lines
sed -e 's/Mon/Monday/g' -e 's/Tue/Tuesday/g' balance.txt: normally, sed takes its first argument as the expression and the following ones as input files. If we want to use multiple expressions, we can make this explicit with one -e flag per expression (-f, in contrast, reads the expressions from a script file).
sed 's/Pizza\/Drinks/Party/g' balance.txt: if the search pattern itself contains a /, we can escape it with a backslash (quote the expression so that the shell does not swallow the backslash).
sed 's#Pizza/Drinks#Party#g' balance.txt: another possibility to circumvent this problem: just use other separators! sed is not very picky about which separators you use and is smart enough to understand what you are trying to do.
sed -n /-/p balance.txt: print only the lines from balance.txt that have a - in them. By default sed prints every line it processes except for deleted ones. -n suppresses this output, and the print command p prints the lines that match our pattern.
sed -i 's/-*$//' balance.txt: find a regex pattern in each line (here an arbitrary number (*) of dashes (-) at the end of the line ($)) and substitute it with nothing (//).
sed '/^$/d' balance.txt: find every empty line (nothing in between start (^) and end ($) of line) and delete it.
sed 's/[A-Z]/\L&/g' balance.txt: find every uppercase letter and make it lowercase (\L and \U are GNU sed extensions). To do it the other way around, replace [A-Z] with [a-z] and \L with \U.
sed 10q balance.txt: print the first ten lines and then quit, which makes it a replacement for the head command (without any flags, head balance.txt also gives you the first ten lines of a file).
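If you want to double-check what some of these substitutions do, the following sketch spells out a few of them in Python on the contents of balance.txt. It is only meant as an illustration of the logic, not as a replacement for sed, and the variable names are made up for this example.

```python
import re

# Contents of balance.txt from the example above.
BALANCE = """\
- 25,13 EUR Mon Supermarket -------

+ 13,40 EUR Tue Pizza/Drinks -
- 05,00 EUR Tue Bus --

+ 40,00 EUR Wed Refund ----
"""

lines = BALANCE.splitlines()

# sed 's/,/./': replace only the first comma on each line
first_only = [line.replace(",", ".", 1) for line in lines]

# sed '/+/s/,/./g': substitute only on lines that contain a +
plus_only = [line.replace(",", ".") if "+" in line else line for line in lines]

# sed '/^$/d': drop empty lines
non_empty = [line for line in lines if line != ""]

# sed 's/-*$//': strip any run of dashes at the end of the line
no_trailing = [re.sub(r"-*$", "", line) for line in lines]

# sed 's/[A-Z]/\L&/g': lowercase every uppercase letter
lowered = [line.lower() for line in lines]

print("\n".join(plus_only))
```

Note that the dashes at the end of each entry also count for the /-/ pattern, so a filter on - matches every non-empty line of this particular file.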



It is important to use single quotes for the sed pattern instead of double quotes. If you use single quotes, sed gets exactly the pattern that you write. But when you use double quotes, the string is first interpreted by the shell, which can be problematic in case of special symbols and variable/command names. It can also be beneficial, but only if you know what you are doing; otherwise, stick to single quotes (see this thread for a more detailed discussion).

awk - the allrounder

awk is another very powerful command line tool. Most people use it for text manipulation (similar to sed), but being a full scripting language, it can do a whole bunch more! Fun fact: it got its name from its three creators who wrote the tool at AT&T Bell Labs in 1977: Alfred Aho, Peter Weinberger, and Brian Kernighan. It is especially useful if your text has some structure in it (like a tsv/csv file for example). Here are some examples of what to do with it:


awk '{print $2}' balance.txt: print the second field/column of each line. By default, whitespace separates columns in awk (can be customized).
awk '{print $0}' balance.txt: print whole lines (equivalent to cat); same output if you just use '{print}' as command for awk.
awk -F ":" '{print $1}' /etc/passwd: use colons instead of spaces as field separator to get all users on a Linux system
awk -F ":" '{print $1"\t"$6","$7}' /etc/passwd: print several columns with a tab between the first and second column and a comma between the second and third
awk 'BEGIN{FS=":"; OFS="-"} {print $1,$6,$7}' /etc/passwd: set the input field separator (FS) and the output field separator (OFS) inside the awk program itself
awk -F "/" '/^\// {print $NF}' /etc/shells: set / as the field separator for the contents of /etc/shells. Then, search for the regex pattern between slashes (^\/), which looks for lines that start with a slash (\ is needed to escape / since it is normally recognised as a special character). Then, print the last field of each line (i.e. the name of the corresponding shells).
awk -F "/" '/^\// {print $NF}' /etc/shells | sort | uniq: output from above, just alphabetically sorted and with the duplicates removed (uniq only removes adjacent duplicates, so sort has to come first)
df | awk '/\/dev\/loop/ {print $1"\t"$2+$3}': for every line of the df output that matches /dev/loop, print the first column followed by a tab and the sum of the second and third columns
awk 'length($0) > 10' /etc/shells: only print lines that are longer than 10 characters
ps -ef | awk '{ if($NF == "/bin/zsh") print $0}': print all processes that are currently running and have /bin/zsh as the last field of the line
ps -ef | awk 'BEGIN { for(i=1; i<=10; i++) print "Process ", i, ": ", $0}': a for loop inside a BEGIN block. Note that BEGIN runs before any input is read, so $0 is still empty here and only the labels "Process 1:" to "Process 10:" are printed.
awk '$1 ~ /^[bc]/ {print $0}' .bashrc: look at the content of .bashrc, check if the first column matches the regular expression ^[bc] (i.e. does the first column start with b or c). If yes, print the line.
awk '{print substr($0, 4)}' /etc/passwd: look at the content of passwd and print every line from the fourth character on
awk 'match($0, /,/) {print $1 " has \",\" character at " RSTART}' file.txt: look at the content of file.txt and look for all lines that match the pattern ,. Then, print the first field of that line, followed by a string that contains the position at which , appeared in the line (RSTART).
df | awk 'NR%2 == 0 {print "Even"}; NR%2 !=0 {print "Odd"}': NR gives you the line number. Here, take the output of df and print “Even” if the line number is even and “Odd” if the line number is odd.
awk 'END {print NR}' /etc/shells /etc/passwd: combined line count of the given files
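To see what awk's built-in variables correspond to, here is a small Python sketch that mimics a few of the commands above on two made-up lines in the /etc/passwd layout. The sample entries and variable names are hypothetical and only serve this illustration.

```python
# Two made-up records in the colon-separated /etc/passwd layout.
PASSWD = """\
root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/zsh
"""

records = PASSWD.splitlines()

# awk -F ":" '{print $1}': first field of every record
users = [line.split(":")[0] for line in records]

# awk -F ":" '{print $NF}': last field (NF is the number of fields)
shells = [line.split(":")[-1] for line in records]

# awk 'BEGIN{FS=":"; OFS="-"} {print $1,$6,$7}': re-join selected fields
joined = [
    "-".join([fields[0], fields[5], fields[6]])
    for fields in (line.split(":") for line in records)
]

# awk 'length($0) > 35': keep only lines longer than a threshold
long_lines = [line for line in records if len(line) > 35]

# awk 'END {print NR}': number of records read
nr = len(records)

print(users, shells, joined, long_lines, nr)
```

The awk versions are of course much shorter, which is exactly why they are handy for quick jobs on the command line.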

getting help - man/tldr
  


It is often easy to get lost with all the varieties of tools out there, so here are some pointers to resources to look for help:


  man sed: gives you the (long) manual page of sed, explaining the different options
  tldr sed: gives a more concise summary of the sed command, similar to a cheat sheet
  online man page often a bit easier to read than terminal version
  great YouTube channels such as DistroTube explaining many of the tricks for shell commands; many of the example commands from this article are inspired by his videos!
  as always, StackOverflow is often the best place to visit if you try to solve a specific problem and need inspiration for how to tackle it.


Closing thoughts

As with many things, working with shell tools feels very cumbersome and inefficient at the start. But once you get past this initial struggle, you will see how convenient these tools really are (especially since they are present on virtually any Linux machine) and how quickly you can get stuff done with them!



Python for Data Science
2022-11-03T00:00:00+00:00
This post is intended as recommended reading for the participants of the first part of the lecture series “Python for Data Science” at Heidelberg University which was conceptualised and organised by Lukas Jarosch and me, but should be interesting to anyone who wants to start working with Python.

Computers seem to be everywhere today: in our offices, our kitchens and increasingly in our labs as well. As a scientist in the natural sciences, you have better and better tools at your disposal that generate more and more data. And while back in the day an Excel table or even a lab notebook would have been sufficient, nowadays you often need software to process your data. While there is a growing amount of no-code software available that you can use without programming yourself, programming will probably form a growing part of your day-to-day job. This is why we are holding this course to get you started, with this post as your initial overview of what we are going to cover!


Python: the swiss army knife
Python libraries: your specialist tools
- Pandas: The scissors to change data the way you like
- Seaborn: your magnifying glass
Visual Studio Code: an editor you will learn to love
Notebooks: a quick way to get started
GitHub: collaboration is key
StackOverflow: where you will spend most of your days
Closing thoughts


Python: the swiss army knife

Many beginners who want to learn programming are confused about which language to start with: should I learn Python? R? The newest and coolest language like Julia and Rust? Or a classic like Java or C++?

First of all, it is important to note that the choice of your first programming language is actually not hugely important. If you continue with coding you will learn more than one language anyway, and once you have mastered one language the ideas and concepts can often be easily transferred to another.

Nevertheless, I want to give you an intuition of what programming languages are out there and why we choose to teach you Python.

At the end of the day, programming languages are just another tool to help you get your work done, similar to programs like Excel or Word, or even physical tools like a hammer or a drilling machine. So, as in real life, it does not make sense to do everything with a hammer; otherwise, every problem will look like a nail to you. So you should choose a tool or a set of tools that can do multiple things for you.

In addition, you do not want to become a craftsperson and work with complicated and specialised equipment only to fix your new picture on the wall. So, the tools you choose should not only be flexible, but also easy to handle.

With this metaphor in mind, let’s transfer these insights to coding:

First, let’s talk about versatility. Although there are many ways in which you can classify programming languages, for the purpose of this post we will keep it simple: there are general-purpose programming languages (GPL) and domain-specific programming languages (DSL). As the names suggest, the former are languages that are used for all kinds of applications, whereas the latter were designed with a specific application in mind.
This does not mean that DSLs cannot perform calculations that GPLs can (most programming languages are Turing-complete anyway), but their syntax and structure are optimised for a specific purpose, which may make it harder to adapt them for others. 
The division is quite blurry in real life, but I like to keep it in the back of my head to keep my thoughts organised. FORTRAN for example was originally designed for numeric computation, and although some people used it for other purposes it stayed mostly in that area. SQL is another example of a language that was designed for querying databases and is nearly exclusively used for that purpose.


  


A power drill is useful for drilling holes, but not very useful for anything else.

While these languages might be great for their respective domain, they are not suited as a first programming language since you want to learn a breadth of applications before specialising later in something you want to work on with more focus.

Therefore, we will teach you a GPL. There are many out there, from C++ over Java to Python or Julia. So, which one to choose?

Well, now our second consideration comes into play: ease of use.

Generally, people often refer to high-level and low-level languages in this area. What they mean by that is how close the language you write is to the machine code your computer reads in the end and how many of the steps in between are abstracted away by your programming language. Assembly is an example of a language that is nearly at machine level: it gives you a lot of power and insight into the machine, but this closeness to the hardware makes it impractical for day-to-day tasks.

C++ is an example of a fairly low-level language. While you can also work on a higher level with language features and libraries that give you access to object-oriented programming and other abstractions, you can still mess up your programs by playing around with low-level constructs such as pointers. In terms of our previous metaphor, you can think of it as a toolbox with an extensive number of complicated tools: sure, now you are not limited to one tool and are flexible, but each individual one of these is still quite difficult to work with.


  


A toolbox offers you a lot of flexibility, but requires quite some expertise to be used correctly.

C++ and Java are great languages, don’t get me wrong: while learning them I learnt a lot about programming itself and the different choices you have as a programmer in how to put an abstract project specification into practice. But the course we are teaching is not primarily for programmers; it is for scientists. You do not only want to write programs, but do a lot of other things as well like experimenting in the lab and generating the data that you will analyse via your code in the end. So although I would recommend also learning a lower-level language to anyone with a deeper interest in computer science, in our course we will focus on Python.

Python is what I would call the swiss army knife of programming languages. It is easy to learn, quick to prototype with and versatile in what it can be applied to.


  


A swiss army knife combines the best of both worlds: it is versatile and straightforward to use.

Similar to the Swiss army knife, there are situations in which Python is not the most efficient tool. If you want to write a program doing efficient numerical calculations, Python itself will not be your saviour (but maybe one of its libraries, as we will see later). In that case, a lower-level language like C++ might be more suitable. However, for the purposes of a scientist, Python is a great way into coding, both from a didactic and a practical point of view. Plus there is a large community using Python already out there, so if you get stuck, there is a high probability that someone out there had the problem before you and posted a solution!

Python libraries: your specialist tools

In this course, we will teach you two Python libraries that will come in very handy when you analyse data: Pandas and Seaborn.

Pandas: The scissors to change data the way you like

As a scientist, you will often find yourself doing an experiment and generating large amounts of data from it that you want to analyse. Doing it by hand is impossible due to the number of data points, and tools like Excel start to freeze already when opening your data file. So, what do you do?

Enter Pandas, a library for data analysis in Python. With it, many of the tasks for which you would need to write your own functions in base Python (opening Excel sheets, calculating summary statistics, filtering/combining data) are just a one-liner. It comes with its own data structure called a Pandas DataFrame, which you will learn a lot more about during the course. You can imagine it as a table storing your data in a convenient and efficient format.
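To give a first impression of these one-liners, here is a minimal sketch using a small made-up table. The column names and values are hypothetical; in practice you would load your own data, for example from an Excel or CSV file.

```python
import pandas as pd

# Hypothetical measurements; the column names are made up for this sketch.
df = pd.DataFrame(
    {
        "sample": ["A", "A", "B", "B"],
        "replicate": [1, 2, 1, 2],
        "signal": [0.5, 0.7, 1.5, 1.1],
    }
)

# Summary statistics for one column in a single line.
summary = df["signal"].describe()

# Filtering: keep only measurements with a signal above 0.6.
strong = df[df["signal"] > 0.6]

# Combining rows: mean signal per sample.
mean_per_sample = df.groupby("sample")["signal"].mean()

print(summary)
print(strong)
print(mean_per_sample)
```

Each of these steps would need a loop and some bookkeeping in base Python, which is why a DataFrame is such a convenient starting point for data analysis.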


  


Similar to a pair of scissors, Pandas can slice and dice your data the way you want it, reshape it and transform it so that it fits your needs.

The Pandas website has some great tutorials on how to get started with Pandas and a cookbook on how to use it for specific cases, so if you find a use case after the course that you did not encounter before or you do not remember how to handle, these are great places to start looking!

Pandas builds on another library called NumPy. NumPy is a library optimised for efficiency via vectorization and is often used in scientific applications. However, often you do not need all the flexibility that NumPy offers you and want a more concise and pre-structured formulation of your code. Here Pandas can shine: it sits in between the simplicity of base Python and the efficiency of NumPy.

Seaborn: your magnifying glass

Often, transforming your data via Pandas is only the first step of your data analysis. Numbers can only show so much, and it is often more efficient to visualise your data via graphics, both for your own understanding of your data as well as for communicating your insights to others. Here Seaborn comes to your rescue: it is a data visualisation library that allows you to directly take your data from a DataFrame and plot it easily via a myriad of formats, with all kinds of styles and customisations available. It integrates very well with Pandas and makes creating graphs dead-easy. In case you are looking for inspiration, there is an example gallery showing plots with the corresponding code so that you can easily adapt them for your own purposes.





Like a magnifying glass, Seaborn allows you to see things in your data that you cannot see by just staring at it, and it allows you to show these insights to others.

Similar to Pandas, Seaborn did not come from nowhere: it is based on the library matplotlib, which is the go-to library for visualisation in Python. Again, similar to NumPy it offers you a lot of flexibility, but often you will prefer readable over extremely flexible code when analysing your data. Especially the strong integration with Pandas gives you a good reason to use Seaborn. That being said, in many circumstances I find myself switching back and forth between Pandas/Seaborn and NumPy/matplotlib; since the former two are based on the latter two, using them together often works quite well!

Visual Studio Code: an editor you will learn to love

Pandas and Seaborn and all the rest of it are great, but where do you actually write the code containing all these libraries? While you could do that just via the command line, there are way better tools available nowadays that make your life a lot easier. These are often called integrated development environments (IDEs), and if you participated in the course Data Analysis with R by Jannik Buhr, you already met one of these: RStudio.

While RStudio is a nice IDE, it is very R-centric in many of its design decisions. For our purposes, we want a general-purpose IDE that can act as our workbench: we bring whatever tools we want to work with (programming language, packages, data etc.) and our IDE should support our work with these. That is why we decided to teach this course using VS Code. It is versatile, open-source and has a massive library of extensions that make your life as a programmer easier. To get started, see this amazing guide which will take you through the different steps of installing and setting up VS Code (there is also a short walkthrough-guide available on the VS Code website).


  


All your tools at the right place: VS Code is your workbench, making it easy to access everything that you need and navigate between different tasks.

To learn more about the cool things you can do with VS Code (support for R, connecting to remote machines, keyboard shortcuts etc) you can have a look at this post which I published some time ago.

Notebooks: a quick way to get started

Using the tools we mentioned so far you can create great programs for analysing your data. But how to get started? And how to document what you have done? And how to put your work into a format that is suitable for presenting at e.g. a lab meeting?

This is where notebooks come in (more specifically, Jupyter notebooks for Python with the file ending .ipynb). You can just power up your notebook and get started coding; it shows you the output directly in the document, no matter if it is a number, a plot or an image.

And once you finish your analysis, you can just add some text cells in which you describe your code more eloquently via Markdown syntax than inline-comments in your code could ever do.


  


Notebooks help you to get a quick draft of your program into code, similar to how a pencil lets you quickly draft something on paper which can be refined afterwards.

These notebooks can serve both as a quick start to some data exploration (e.g. visualising your data for quality control) and as a documentation of your work you can show to colleagues and collaborators. In fact, the lecture slides we will deliver as part of the Python course are made from notebooks!

GitHub: collaboration is key

Notebooks are a great way to share your results once you are done with analysing your data, but what if you want to share it with your colleagues who want to work on the analysis together with you? The most straightforward way would be to just send them the .ipynb file which they can then use to work on the data. This however has some caveats: first of all, your colleague needs to have the exact same format of the data as you in order to make the analysis reproducible. Second, if your analysis becomes more complicated, you may want to split your analysis into different Python files and notebooks, making exchanging these more complicated. In addition, you cannot work on the analysis at the same time, since the changes you and your colleague introduce might not be compatible and therefore merging these changes into a consistent program in the end might just be impossible. Even worse, every time you change something in the code you have to send your colleague a new version of your code and vice versa, a huge waste of time and effort. That is why GitHub (and Git) were created.


  


Coding is teamwork, and GitHub helps you discuss ideas with others and show your work to the world.

GitHub is an online service for software development and version control. It builds on Git, a version control system that runs locally on your machine, and hosts your repositories online so that you can work together with people all over the world. In addition, it provides some nice features that make the software development process more structured and organised: wikis, pull requests, issues, task management, continuous integration, basically the whole software shebang you could wish for.

Git is a great system, though it takes some time to get used to. But I can assure you that this time will be well-spent since it is the de facto standard for publishing and sharing software with the world. In case you like gamified learning, here is the link to a game that makes the dive into Git a bit more fun.

StackOverflow: where you will spend most of your days

When you think about programming, you might still have the image of a hacker in your mind, sitting in his hoodie in a dark room, relentlessly maltreating his keyboard without rhyme or reason. That is far from reality: you will spend most of your time looking up things which you either have no clue about or had a clue about at some point but lost in the flood of other important information, such as Shakira lyrics and the results of last weekend’s football match. Here, StackOverflow is your friend.


  


In case you are stuck on how to use a tool, nice people on StackOverflow can show you how to use it.

On StackOverflow, you can ask questions about basically everything related to programming and will often receive high-quality answers (given that you formulated your question well). But in most cases, you won’t even have to ask a question since another person had the same or a very similar problem before you and solved it with help from StackOverflow. Using this tool right is a skill that cannot be overestimated since you will spend a significant amount of time using this and other similar resources. There are many guides out there on how to use it best; for me, learning by doing has helped the most. And once you are advanced enough, answering questions there can be a great way to give back and stay sharp on your coding skills at the same time!

Closing thoughts

There are many great resources out there for scientists who want to learn coding for their research, such as this workshop webpage. We hope that our lecture series gives you both the skills to apply coding to basic tasks in your research and the enthusiasm to keep learning new things and improve even further!
