Science - ado9336 SM
Sequence modeling and design from molecular to genome scale with Evo
Supplementary Text
Figs. S1 to S28
Tables S1 to S7
References
a loss of biodiversity or the emergence of new, potentially harmful species (158). Although the ecological impacts of training whole-genome foundation models remain unknown, more immediately, it is also important to consider the carbon footprint associated with increasing infrastructure and computational demands (159). The capabilities of tools such as Evo, alongside other technologies for genome editing and ecological engineering, add to complex debates about the extent to which science should intervene in evolution. As we push the boundaries of scientific capabilities with tools such as Evo, it becomes imperative to reflect on the interactions and boundaries between our inventions and natural evolutionary processes, aiming to preserve ecological balance, maintain environmental sustainability, and uphold ethical standards.
The path forward for the responsible use and development of tools like Evo is anchored in the establishment of clear, comprehensive guidelines that delineate ethical practices. These guidelines serve as a responsible AI framework, ensuring that all stakeholders, including researchers, developers, and users, have a common understanding of the safety and ethical dimensions inherent in genetic engineering. Coupled with robust oversight mechanisms, this approach aims to monitor and manage the application of Evo to prevent misuse and ensure its alignment with ethical standards. Furthermore, promoting transparency regarding the use of these technologies and fostering open dialogue among all parties will enhance trust and collaboration within the scientific community and beyond.
To address disparities in access and capabilities, particularly in low-income countries, the strategy includes
forging community partnerships and international collaborations. By offering targeted training and support,
these partnerships can democratize access to advanced tools like Evo, enabling a broader spectrum of scientists
and researchers to contribute to and benefit from genetic engineering innovations. At the policy level, investing
in education and capacity building emerges as a pivotal element, equipping the next generation of scientists
with the ethical acumen and technical skills to navigate the complexities of genetic research responsibly.
Central to sustaining ethical innovation is the creation of a dynamic feedback loop that engages all stakeholders in a continuous dialogue. By setting up mechanisms to collect and integrate feedback from those involved in or impacted by Evo’s applications, the process ensures that guidelines, policies, and practices are regularly refined in response to evolving ethical challenges and societal expectations. Collaborating with organizations such as the Global Alliance for Genomics and Health (GA4GH) (86) to develop and update genetic engineering guidelines further solidifies this commitment to ethical excellence. This multifaceted approach not only addresses immediate concerns but also lays the groundwork for a future where genetic engineering advances in harmony with ethical principles and societal values.
Fig. S1 | Pretraining data statistics. (A) A pie chart depicting the composition of viral realms in the IMG/VR
subset of the pretraining dataset. (B) A pie chart depicting the composition of host kingdoms in the IMG/VR
subset of the pretraining dataset. We excluded viruses that are likely to infect eukaryotic hosts (Materials and
Methods). (C) The distribution of genome lengths in the IMG/VR subset of the pretraining dataset. (D) The
distribution of plasmid sequence lengths in the IMG/PR subset of the pretraining dataset. (E) The distribution
of contig lengths in GTDB.
[Figure S2 plot: "Perplexity scaling with context size"; x-axis, context size (2k to 165k); y-axis, eval. PPL over the last 2k tokens (approximately 2.82 to 2.96); curves by training context size.]
Fig. S2 | Perplexity scaling in context length. Perplexity on a subset of the OpenGenome validation set with
Evo 131k as a function of sequence length, or context length. The perplexity is computed over the last 2048
nucleotides of each sequence, with increasing lengths of the prefix and thus of the context available to the
model. We observe that perplexity continues to decrease beyond the training context length of 131k, indicated by the vertical dashed line.
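As an illustrative sketch only (not the exact evaluation code used in this work), the metric can be computed from per-token log-probabilities as follows; the synthetic log-probabilities below stand in for scores that would, in practice, come from scoring real sequences with the model.

import numpy as np

def last_window_perplexity(token_logprobs, window=2048):
    """Perplexity over the final `window` tokens of a sequence, given
    per-token log-probabilities (natural log) from an autoregressive
    model conditioned on all preceding tokens."""
    tail = np.asarray(token_logprobs[-window:], dtype=float)
    return float(np.exp(-tail.mean()))

# Synthetic example for increasing context sizes.
rng = np.random.default_rng(0)
for context in (2_048, 16_384, 32_768, 65_536, 131_072):
    fake_logprobs = rng.normal(loc=-1.05, scale=0.1, size=context)  # placeholder scores
    print(context, round(last_window_perplexity(fake_logprobs), 3))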
[Figure S3 plots: "Scaling Rates for Compute-Suboptimal Model Sizes"; five panels with offsets of 5%, 8%, 11%, 14%, and 17%; x-axis, FLOPS (10^19 to 10^20); y-axis, eval. PPL (3.0 to 5.0); curves for Transformer++, Mamba, Hyena, and StripedHyena.]
Fig. S3 | Scaling rates for compute-suboptimal model sizes by architecture. We quantify the effect on
perplexity scaling caused by a suboptimal allocation of compute budget to model or dataset size (e.g. training
a smaller model for more tokens). We estimate the compute-optimal model size (Figure S5) for each compute
budget, then reduce it by a percentage (the offset). The corresponding perplexity is obtained via the IsoFLOP
curves (Figure 1F). Transformer++ perplexity scaling rapidly degrades outside the compute-optimal frontier,
in contrast to Mamba, Hyena, and StripedHyena.
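The offset procedure can be sketched as follows, assuming an IsoFLOP curve is available as (model size, perplexity) pairs at a fixed compute budget; the curve values in this example are illustrative placeholders, not measured results.

import numpy as np

def perplexity_at_offset(model_sizes, ppls, offset_frac):
    """Locate the compute-optimal model size (the minimum of an IsoFLOP
    curve) and return the interpolated perplexity after shrinking that
    size by `offset_frac` (e.g., 0.05 for a 5% offset)."""
    sizes = np.asarray(model_sizes, dtype=float)
    ppls = np.asarray(ppls, dtype=float)
    order = np.argsort(sizes)
    sizes, ppls = sizes[order], ppls[order]
    optimal_size = sizes[np.argmin(ppls)]
    offset_size = optimal_size * (1.0 - offset_frac)
    return optimal_size, float(np.interp(offset_size, sizes, ppls))

# Illustrative IsoFLOP curve for a single compute budget.
sizes = [4e7, 8e7, 1.6e8, 3.2e8, 6.4e8]
ppls = [3.90, 3.60, 3.50, 3.55, 3.80]
print(perplexity_at_offset(sizes, ppls, offset_frac=0.05))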
Fig. S4 | Direct comparison of training loss curves during hyperparameter tuning for 7B models. We tune several Transformer++ and Mamba models as baselines, sweeping learning rate, batch size, sequence length, and model depth-to-width ratio. In all settings, Hyena and StripedHyena outperformed Transformer++ and Mamba, and both baselines experienced instability during training.
[Figure S5 plots: compute-optimal model size (left) and compute-optimal number of tokens (right) vs. FLOPS (10^19 to 10^20) for Transformer++, Mamba, Hyena, and StripedHyena.]
Fig. S5 | Compute-optimal tokens and model size by model. Compute-optimal allocation to dataset size
(number of tokens) and model size (number of parameters) for each compute budget, measured in FLOPS.
[Figure S6 plot: StripedHyena train loss (approximately 1.0 to 1.6) vs. FLOPs (10^17 to 10^20) for model sizes of 6.90e+07, 8.40e+07, 9.90e+07, and 1.35e+08 parameters.]
Fig. S6 | Loss vs. FLOPs. StripedHyena training loss vs. FLOPs for near-compute-optimal runs from 8e18 to 8e19 FLOPs.
[Figure S7 plot: "PPL vs % Hyena layers (10 total layers)"; x-axis, % Hyena layers (0 to 100); y-axis, perplexity (approximately 3.34 to 3.48).]
Fig. S7 | Perplexity comparison by percentage of Hyena layers. We progressively replace attention layers with Hyena layers and observe improved perplexity, with an optimal hybrid ratio of 90% Hyena and 10% attention. All models use 10 layers and a width of 768. Hyperparameters and FLOPs are controlled.
Fig. S8 | Performance of Evo on out-of-distribution mutational effect prediction. (A) Predictive perfor-
mance of nucleotide and protein language models on mutational effect prediction for human proteins, mea-
sured via Spearman correlation. Bar height indicates the mean; each dot indicates a different DMS study. LM:
language model; Nucl. Trans.: Nucleotide Transformer. Related to Fig. 2B. (B) Relationship between the Evo
perplexity of the wildtype nucleotide sequence (horizontal axis) and the ability of Evo to perform zero-shot mutational effect prediction for that protein, measured via Spearman correlation (vertical axis). Each dot corresponds to a different protein; dots are colored according to whether the protein is from E. coli (blue), another prokaryote (green), or human (purple). We observed a significant negative correlation (Spearman 𝑟 = −0.79, two-sided 𝑡-distributed 𝑃 = 8.7 × 10⁻⁴) between perplexity and zero-shot function prediction performance.
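As a minimal sketch of the zero-shot setup (not the exact scoring pipeline used here), each variant is scored by its change in model log-likelihood relative to wildtype and compared to the experimental measurements with a Spearman correlation; the values below are synthetic placeholders rather than Evo outputs.

from scipy.stats import spearmanr

# Zero-shot score per variant: log P(variant) - log P(wildtype), where both
# sequences would be scored by the autoregressive language model.
variant_scores = [-0.4, -1.2, 0.1, -2.3, -0.7]   # hypothetical model scores
dms_fitness = [0.8, 0.3, 1.1, 0.0, 0.5]          # matched experimental values

rho, pval = spearmanr(variant_scores, dms_fitness)
print(f"Spearman r = {rho:.2f}, P = {pval:.2g}")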
Fig. S9 | Filtering and selection of Evo-generated Cas9 candidates. 2,079,750 nucleotide sequences were
generated from a finetuned Evo model. All ORFs were extracted from generated sequences and processed
with a Cas9 pHMM to identify generated loci containing potential Cas9 proteins, yielding 817,049 generations
containing a Cas9 ORF longer than 600 residues and a valid CRISPR array (as detected with MinCED). Ex-
tracted Cas9 amino acid sequences were aligned with Cas9 protein sequences found in the finetuning dataset.
We retained 132,079 generations that contained ORFs that bore less than 90% identity to the nearest training
sequence. The remaining generations were scored based on the quality of the pairwise sequence alignments
between the extracted Cas9 protein sequence and its nearest match in the training set. Alignments with an
even distribution of mismatches across the Cas9 ORFs were scored highly; alignments with an imbalanced dis-
tribution of mismatches were down-weighted. The Cas9 ORFs from the top-ranking 2,000 generations were
folded with AlphaFold2. From the predicted structures, 268 generations were selected based on pLDDT, radius
of gyration, the presence of a detected tracrRNA sequence, and the presence of RuvC and HNH domains in
the Cas9 ORF. The final 11 Evo-generated Cas9 candidates were selected from this subset through manual
inspection of predicted Cas9 structure and predicted sgRNA secondary structure.
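The identity filter and alignment-evenness ranking can be sketched as follows; this is one plausible formulation under stated assumptions (equal-length pairwise alignments, windowed mismatch counts), not necessarily the exact scoring used in the pipeline, and the example sequences are hypothetical.

import numpy as np

def percent_identity(aln_a, aln_b):
    """Percent identity between two equal-length aligned sequences."""
    matches = sum(a == b for a, b in zip(aln_a, aln_b) if a != "-" and b != "-")
    return 100.0 * matches / len(aln_a)

def mismatch_evenness(aln_a, aln_b, window=50):
    """Score how evenly mismatches are spread along the alignment: lower
    variance of per-window mismatch counts gives a higher (less negative)
    score, so evenly distributed mismatches rank above clustered ones."""
    mismatches = np.array([a != b for a, b in zip(aln_a, aln_b)], dtype=float)
    n_windows = max(1, len(mismatches) // window)
    counts = [mismatches[i * window:(i + 1) * window].sum() for i in range(n_windows)]
    return -float(np.var(counts))

# Hypothetical aligned ORF fragments (placeholders).
a = "ATGGCTAAGCGTACGATCGGA"
b = "ATGGCAAAGCGTACGTTCGGA"
print(round(percent_identity(a, b), 1), round(mismatch_evenness(a, b, window=7), 2))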
Fig. S10 | CRISPR-Cas finetuning dataset composition. Breakdown of the ORFs found in the 72,813 Cas
loci used to finetune the base 8k Evo model. The 72,813 Cas loci were initially categorized as Cas9, Cas12a, Cas12b, Cas12g, Cas13a, Cas13b, or Cas13d sequences by extracting the ORFs using Prodigal and
searching amino acid sequences against pHMMs for each Cas subtype. We then clustered each categorized
subset of loci at 90% and 50% identity using MMSeqs2 before plotting the distributions of subtypes across
clustering thresholds.
Fig. S11 | Sequence identities of Cas9 ORFs in finetuning dataset. For every Cas9 ORF in our finetuning dataset, we computed the sequence identity to the nearest Cas ORF in the dataset (excluding the query sequence). Percentages are plotted on both linear (A) and log (B) scales. The percent identity of EvoCas9-1 to the nearest Cas ORF in the finetuning dataset is marked on both plots as a dashed red line.
[Figure S12 plots: histograms of identity to CRISPR-Cas ORFs in OpenGenome (0 to 1) for Cas9, Cas12, and Cas13 generations; top row, not pre-trained; bottom row, pre-trained; counts shown on a log scale.]
Fig. S12 | Sequence identities of all generated Cas ORFs to fine-tuning dataset. Sequence identities of
ORFs sampled from pretrained and non-pretrained Evo models. 400,000 generations were analyzed for each
Cas-model combination.
[Figure S13 plots: histograms of degeneracy score (0 to 1) for Cas9, Cas12, and Cas13 generations; top row, not pre-trained; bottom row, pre-trained; counts shown on a log scale.]
Fig. S13 | Comparison of generation quality between samples drawn from pre-trained and non-pre-
trained models. A degeneracy score was calculated for samples drawn from Evo by computing the coverage
of each sequence by repetitive substrings longer than 9 nucleotides (Materials and Methods). According to this degeneracy metric, prompting with a Cas9 special token yields higher-quality samples when sampling from a pre-trained Evo model that was finetuned on CRISPR-Cas sequences.
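One way to implement such a score (a sketch that may differ in detail from the definition in Materials and Methods) is to mark every position covered by a 10-mer that occurs more than once in the sequence and report the covered fraction.

from collections import defaultdict

def degeneracy_score(seq, k=10):
    """Fraction of positions covered by k-mers (here k = 10, i.e., repetitive
    substrings longer than 9 nt) that occur more than once in the sequence."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    covered = set()
    for starts in positions.values():
        if len(starts) > 1:
            for s in starts:
                covered.update(range(s, s + k))
    return len(covered) / len(seq) if seq else 0.0

print(degeneracy_score("ATGCATGCATGCATGCATGC"))      # highly repetitive -> 1.0
print(degeneracy_score("ATGCGTACGTTAGCCTAGATCCGA"))  # no repeated 10-mers -> 0.0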
[Figure S14 gel image: lanes for candidates 1 through 11, each with (+) and without (−) sgRNA, plus the IVTT template; size markers 100 to 2000 bp.]
Fig. S14 | CRISPR-Cas9 experimental validation screen. 11 CRISPR-Cas9 candidates along with their co-
generated guides were incubated at 37°C in a 20-hour cleavage reaction with a DNA target containing a shared
protospacer sequence and an NGG PAM (other PAM sequences were not tested). Candidate 4 demonstrated
robust DNA cleavage activity that was guide RNA-specific and was named EvoCas9-1.
[Figure S15 gel images. (A) SpCas9 + SpCas9 sgRNA vs. EvoCas9-1 + EvoCas9-1 sgRNA. (B) NEB SpCas9 + SpCas9 sgRNA vs. purified SpCas9 + SpCas9 sgRNA. (C) SpCas9 + SpCas9 sgRNA vs. EvoCas9-1 + SpCas9 sgRNA. (D) SpCas9 + EvoCas9-1 sgRNA vs. EvoCas9-1 + EvoCas9-1 sgRNA. Timepoints: NTG, 5 min, 15 min, 1 hr, 3 hr; size markers 100 to 2000 bp.]
Fig. S15 | Additional analysis of EvoCas9-1 effector and sgRNA performance. Cleavage reactions were performed at 37°C with a 10:10:1 molar ratio of Cas9:sgRNA:target and a target concentration of 1 nM. (A) Full
cleavage gel corresponding to Fig. 3F. (B) Comparison of in-house purified SpCas9 with commercially avail-
able SpCas9 (NEB M0386T). (C) Timecourse comparison of in-house purified SpCas9 with purified EvoCas9-1,
both with an SpCas9 sgRNA. (D) Comparison of in-house purified SpCas9 with purified EvoCas9-1, both with
an EvoCas9-1 sgRNA.
[Figure S16 structure images: (A) EvoCas9-1 (AlphaFold 3 prediction) with its sgRNA and DNA target; (B) electrostatic potential; SpCas9 shown with its sgRNA and DNA target for comparison; each view includes a 180° rotation.]
Fig. S16 | Predicted structures of EvoCas9-1. (A) Surface view of EvoCas9-1. (B) EvoCas9-1 electrostatics
calculated using the coulombic function in ChimeraX. (C) Surface view of SpCas9 (PDB: 4OO8). (D) SpCas9
electrostatics.
[Figure S17: (A) SDS-PAGE gel of SpCas9 and EvoCas9-1 with molecular weight markers (15 to 200 kDa); (B) SEC trace, absorbance (mAU, 0 to 1000) vs. elution volume (0 to 40 mL).]
Fig. S17 | Purification of EvoCas9-1 and SpCas9. (A) SDS-PAGE analysis of final concentrated stocks of
EvoCas9-1 and SpCas9 protein. (B) Size exclusion chromatography (SEC) trace from the purification of
EvoCas9-1. 0.5 mL fractions between 11.5 mL and 14 mL were pooled, concentrated, and used for biochemical assays.
[Figure S18 plots. (A) Percentage of generated loci with TnpA only, TnpB only, TnpA + TnpB, or no detected protein, by prompt (<is200>, <is605>). (B) ESMFold pLDDT vs. % identity to training for TnpA − IS200, TnpA − IS605, and TnpB − IS605 (n = 1,800 each). (C) Probability density of protein length (aa) for TnpA and TnpB, training vs. generated. (D) Per-position entropy tracks for ISDge10 (1,059 bp), ISHp608 (2,377 bp), ISEvo2 (2,400 bp), ISSpn6 (1,074 bp), ISStin10 (1,065 bp), IS200G, and ISEvo1. (E) Average entropy from −200 to +200 bp around the LE and RE of TnpA and TnpA/TnpB elements.]
Fig. S18 | Additional analysis of generated IS200/IS605 sequences. (A) The detected TnpA/TnpB proteins
as a percentage of all loci generated with each special token. Colors indicate which prompt token was used.
“None” indicates that no TnpA or TnpB proteins were detected in the generated sequence. Includes 2,373,150
sequences generated with <is200> token prompt, and 2,104,350 generated with <is605> token prompt.
(B) Relationship between ESMFold pLDDT and the percent identity of each generated protein to its nearest training example. Shown for generated sequences in which only a TnpA was generated (TnpA - IS200), or in which both a TnpA and a TnpB were detected (TnpA - IS605, TnpB - IS605). 1,800 generated protein sequences were randomly selected per category, with 200 selected from each equal-width bin from 10% to 100%. (C) Protein length distributions of the training examples and the generated proteins. The TnpA distribution was cut off at 450 amino acids to improve visualization. (D) The entropy of the conditional probabilities at each position across IS200/IS605 loci. Shown are 2 natural IS605 elements (ISDge10 and ISHp608), 3 natural IS200 elements (ISSpn6, ISStin10, IS200G), and 2 Evo-generated elements (ISEvo1, ISEvo2). TnpA (short) and TnpB (long) coding sequences are shown in blue, and the start and end of the complete element are shown
with black bars. (E) The average entropy within 250 bp of the 5′ and 3′ ends of IS200/IS605 coding sequences,
including 50 bp of the CDS itself. The entropy was calculated at each position across natural IS200 (top) or
IS605 (middle, bottom) sequences in the training set (N = 228,763). Sequence positions were aligned with
respect to the beginning and end of each respective CDS. (Top) All sequences with only a detected TnpA
(IS200-like). (Middle) All sequences with a TnpA followed by a TnpB (IS605-like). (Bottom) Sequences
where the TnpB precedes the TnpA on the forward strand (IS605-like). Gray ribbon indicates the standard
deviation of the entropy values.
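A minimal sketch of the per-position entropy calculation in panels (D) and (E), assuming the model's conditional probability distribution over the nucleotide vocabulary is available at each position; the probabilities below are synthetic placeholders rather than Evo outputs.

import numpy as np

def positional_entropy(prob_matrix):
    """Shannon entropy (in nats) of the next-token distribution at each
    position; prob_matrix has shape (sequence_length, vocab_size) and
    each row sums to 1."""
    p = np.clip(np.asarray(prob_matrix, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Synthetic example: a well-constrained position followed by an unconstrained one.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximum entropy, ln(4) ~= 1.39
])
print(positional_entropy(probs))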
Fig. S19 | IS605 categorical Jacobian analysis. Categorical Jacobian matrices (70) were computed for
natural IS605 systems and plotted as heatmaps thresholded between 0 and 4. Regions corresponding to the
LE, TnpA, TnpB, and RE elements are annotated in the heatmaps.
Fig. S20 | Filtering and selection of Evo-generated IS200/IS605-like candidates. (A, B) Downstream
computational metrics, computed on both the TnpA protein and DNA sequences, were used to filter Evo-
generated sequences to obtain a high-confidence set of IS200-like and IS605-like candidates. (C) Scatter plot
of max stem length of the longest perfect hairpin (with ≤ 5 bp loop) vs. the start position of the longest
perfect hairpin for the sequences around the tnpA CDS in Evo IS200 generations and scrambled sequences
preserving the same base frequencies. Black dashed lines depict the threshold used. (D) Schematic describing
calculation of upstream base pair propensity vector from sequences flanking the tnpA CDS in Evo IS200-like
element generations. (E-G) Hierarchical clustering of upstream base pair propensity vectors of Evo-generated
IS200-like candidate sequences after TnpA protein filtering and other DNA sequence filtering steps. The scale bar is the same as in panel (D). Selected sequences and natural IS200 sequences are labeled.
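A brute-force sketch of the hairpin search underlying panel (C); the minimum loop length of 3 bp used here is an assumption, and the exact hairpin definition in Materials and Methods may differ.

def longest_perfect_hairpin(seq, max_loop=5, min_loop=3):
    """Return (stem_length, start_position) of the longest perfect hairpin:
    a stem whose 5' arm is the reverse complement of its 3' arm, separated
    by a loop of min_loop to max_loop nucleotides."""
    pair = {"A": "T", "T": "A", "G": "C", "C": "G"}
    best_stem, best_start = 0, -1
    n = len(seq)
    for j in range(n):                      # candidate loop starts at position j
        for loop in range(min_loop, max_loop + 1):
            stem = 0
            # extend the stem outward from the loop while the arms stay paired
            while (j - 1 - stem >= 0 and j + loop + stem < n
                   and pair.get(seq[j - 1 - stem]) == seq[j + loop + stem]):
                stem += 1
            if stem > best_stem:
                best_stem, best_start = stem, j - stem
    return best_stem, best_start

# Toy sequence with a 5-bp stem and 4-nt loop; the hairpin starts at position 3.
print(longest_perfect_hairpin("AAAGGGCCTTTTGGCCCAAA"))  # -> (5, 3)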
Fig. S21 | Activity of 12 other active Evo-generated IS200/IS605-like systems. 2% agarose gel with SYBR Safe showing in vitro activity on ssDNA substrates of TnpA proteins from Evo-generated IS200-like and IS605-like elements ISEvo3 through ISEvo14, alongside mutants in which the putative catalytic tyrosine (identified by an amino acid motif match for YXXXQ) is mutated to alanine. Compare to Figs. 4F and 4I.
Fig. S22 | Validation of excision and insertion via nanopore sequencing. (A) Schematic describing
nanopore jump map computation procedure. (B) Nanopore jump map from nanopore sequencing of PCR
products for natural IS605 element ISHp608. (C) Nanopore jump map from nanopore sequencing of PCR
products for natural IS200 element ISSpn6, revealing multiple LEs and REs. (D-E) Nanopore jump maps from
nanopore sequencing of PCR products from in vitro TnpA excision/insertion assay of Evo-generated IS200-like
element ISEvo1 and IS605-like element ISEvo2.
Fig. S23 | Nanopore sequencing jump maps for experimentally validated Evo-generated IS200-like and
IS605-like systems. Jump maps computed as in Fig. S22 for nanopore sequencing of PCR products from in
vitro TnpA excision/insertion assay.
Fig. S24 | ISSpn6 tnpA ORF and hairpins. ISSpn6 can mobilize and insert with multiple LEs and REs, which
are illustrated above.
Fig. S25 | Evo-generated IS200-like systems. Illustrations of 10 additional active Evo-generated IS200-like
systems, including element annotations and relevant DNA and protein features. Compare to Fig. 4E.
Fig. S26 | Evo-generated IS605-like systems. Illustrations of two additional active Evo-generated IS605-like
systems, including element annotations and relevant DNA, RNA, and protein features. Compare to Fig. 4H.
Fig. S27 | Gene essentiality prediction under different settings. (A) Scatterplots that compare the AUROC
values between different models or different context windows. Axis labels are the same as the horizontal
labels in Fig. 5C. Each dot corresponds to a different whole-genome essentiality study. (B) Gene essentiality
prediction performance as in Fig. 5C except plotted according to the average precision, where essential genes are considered the “positive” examples. (C) Gene essentiality prediction performance across 58 studies (each dot corresponds to a different study). We performed in silico mutagenesis of each coding sequence in a genome and computed the change in Evo likelihood, which we used to predict gene essentiality. “Evo (8k context, multi-stop)” indicates a mutagenesis strategy that inserts multiple stop codons at the beginning of each coding sequence. “Evo (8k context, single stop)” indicates a mutagenesis strategy that inserts a single stop codon at the beginning of each coding sequence. “Evo (8k context, deletion)” indicates a mutagenesis strategy that
deletes the entire sequence of the gene. “Position-based prediction” indicates a prediction strategy (not using
Evo) in which we use the position of a gene in the reference genome annotation as the predictor variable for
gene essentiality. See Materials and Methods for more details.
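A minimal sketch of the multi-stop in silico mutagenesis, assuming a forward-strand coding sequence and a sequence-scoring function; score_loglik and the toy scorer below are hypothetical placeholders, not the Evo API.

def insert_stop_codons(genome, cds_start, n_stops=3, stop="TAA"):
    """Replace the codons immediately after the start codon of a
    forward-strand CDS with n_stops stop codons (premature termination)."""
    edit_start = cds_start + 3                   # keep the start codon
    edit_end = edit_start + 3 * n_stops
    return genome[:edit_start] + stop * n_stops + genome[edit_end:]

def essentiality_score(genome, cds_start, score_loglik, n_stops=3):
    """Change in model log-likelihood after disrupting the gene; a larger
    drop is taken as evidence that the gene is essential."""
    mutant = insert_stop_codons(genome, cds_start, n_stops=n_stops)
    return score_loglik(mutant) - score_loglik(genome)

# Toy usage with a stand-in scorer (counts a motif instead of a real likelihood).
toy_genome = "CCCATGAAAGGGTTTCCCTGA" * 3
toy_score = lambda s: s.count("ATGAAA")
print(essentiality_score(toy_genome, cds_start=3, score_loglik=toy_score))  # -> -1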
Fig. S28 | Additional analysis of genome-scale sequence generations. (A-B) GC content (A) and dinu-
cleotide frequencies (B) of Evo-generated, natural, and random sequences. (C) Codon usage of Evo-generated
sequences compared to natural reference genomes. The sequences used in this analysis were selected from
the TUD analysis in Fig. 6. (D) Mean stop codon ratios in all three reading frames of Evo-generated, natural,
and random ORFs. (E) Histograms representing the distribution of statistics computed on ESMFold-predicted
structures. These structures correspond to coding sequences found on sixteen Evo-generated sequences, each of length ∼1 Mb. These statistics are, from left to right: the percentage of residues in 𝛼-helices, the percentage of residues in 𝛽-sheets, the mean backbone pLDDT, and the TM-score to the closest UniRef50 structure in the AlphaFold Protein Structure Database as determined by Foldseek easy-search.
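The sequence composition statistics compared in panels (A) and (B) can be sketched as follows for plain A/C/G/T sequences (an illustration, not the analysis code used in this work).

from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def dinucleotide_frequencies(seq):
    """Frequencies of all 16 overlapping dinucleotides."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {a + b: counts.get(a + b, 0) / total for a, b in product("ACGT", repeat=2)}

seq = "ATGCGCGATATCGCGGCTAA"
print(round(gc_content(seq), 3))
print({k: round(v, 3) for k, v in dinucleotide_frequencies(seq).items() if v > 0})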
Stage 1
sequence len. 8k
batch size 4M tokens
scheduler cosine decay
learning rate 3e-4
min. learning rate 3e-5
position emb. RoPE (29)
weight decay 0.1
weight decay in hyena layers 0.0
precision mixed (BF16, FP32 on long convolution parameters)
optimizer Adam
optimizer betas 0.9, 0.95
eps 1e-8
norm RMSnorm (93)
dropout 0.0
warmup 1% total steps
checkpoint activation True
model_width 4096
num_attn_heads 32
hyena short conv. length 3
GLU width 4/3 model_width
vocab size 512
Stage 2
sequence len. 131k
batch size 1M tokens
learning rate 1e-4
min. learning rate 1e-5
position emb. linearly interpolated RoPE
interp. factor 16
Dataset Name | Source | Subset | Total Genomes/Loci/Plasmids | Total Bases (M) | Avg. Length (bases)
Bacterial and Archaeal Genomes | GTDB | – | 85,205 | 273,865 | 3,214,184
Prokaryotic Viruses | IMG/VR | – | 2,653,046 | 36,236 | 13,658
Plasmids | IMG/PR | – | 214,950 | 5,827 | 27,106
CRISPR/Cas Loci | Custom Database | Cas9 | 5,566 | 43 | 7,798
CRISPR/Cas Loci | Custom Database | Cas12 | 5,069 | 35 | 6,911
CRISPR/Cas Loci | Custom Database | Cas13 | 498 | 4 | 7,559
IS200/IS605 Loci | Custom Database | IS200 Loci | 219,866 | 239 | 1,085
IS200/IS605 Loci | Custom Database | IS605 Loci | 10,720 | 26 | 2,445
Table S2 | Summary statistics for the OpenGenome datasets. See Materials and Methods for further details on the dataset sources and curation process.
Params (M) d_model glu_size kv_size n_heads n_layers learning rate
1 128 336 64 2 4 9.77E-04
6 320 848 64 5 5 9.57E-04
17 448 1200 64 7 7 9.36E-04
29 512 1360 64 8 9 9.15E-04
40 576 1536 64 8 10 8.95E-04
59 640 1696 64 10 12 8.70E-04
69 640 1712 64 10 14 8.56E-04
84 704 1872 64 11 14 8.37E-04
99 768 2048 64 12 14 8.18E-04
114 768 2048 64 12 16 8.00E-04
121 768 2048 64 12 17 7.75E-04
135 768 2048 64 12 19 7.50E-04
158 832 2224 64 13 19 7.25E-04
175 832 2224 64 13 21 7.00E-04
203 896 2384 64 14 21 6.75E-04
232 896 2384 64 14 24 6.50E-04
266 960 2560 64 15 24 6.25E-04
303 1024 2736 64 16 24 6.00E-04
383 1152 3072 64 18 24 5.66E-04
473 1280 3408 64 20 24 5.33E-04
572 1408 3760 128 11 24 5.00E-04
680 1536 4096 128 12 24 4.75E-04
798 1664 4432 128 13 24 4.55E-04
926 1792 4784 128 14 24 4.33E-04
1063 1920 5120 128 15 24 4.15E-04
1209 1920 5120 128 15 25 4.11E-04
Table S3 | Scaling laws model settings for Transformer++, Hyena, StripedHyena and Mamba. Layer
number for Mamba is doubled (a single block corresponds to two Mamba layers and the dedicated channel
mixer layer is removed, as described in reference (40)). Parameter counts vary slightly for each architecture.
Model Adkar Chen Firnberg Jacquier Kelsic Melnikov Rockah Tsuboyama Weeks Average
Evo 0.25 0.57 0.55 0.47 0.60 0.45 0.32 0.42 0.50 0.46
GenSLM 0.02 0.08 0.41 0.35 0.57 0.01 0.19 0.17 0.37 0.24
Nucleotide Transformer 0.08 0.02 0.03 0.00 0.30 0.00 0.08 0.10 0.02 0.07
RNA-FM 0.16 0.03 0.03 0.01 0.45 0.03 0.00 0.19 0.07 0.11
CARP 640M 0.06 0.67 0.40 0.40 0.56 0.56 0.48 0.28 0.30 0.41
ESM-1v 0.28 0.61 0.72 0.68 0.46 0.56 0.30 0.27 0.62 0.50
ESM-2 3B 0.37 0.66 0.65 0.61 0.49 0.58 0.24 0.24 0.61 0.49
ESM-2 650M 0.32 0.69 0.72 0.68 0.48 0.54 0.24 0.37 0.61 0.52
ProGen2 large 0.07 0.62 0.56 0.52 0.49 0.49 0.30 0.47 0.60 0.46
ProGen2 xlarge 0.23 0.67 0.40 0.38 0.48 0.56 0.32 0.46 0.59 0.46
Table S4 | Zero-shot prokaryotic protein fitness prediction. Spearman correlations between language-model likelihoods and experimental fitness across nine datasets.
Model Findlay Garvie Giacomelli Kotler Silverstein Sun Average
Evo 0.06 0.15 0.09 0.13 0.15 0.29 0.14
GenSLM 0.01 0.11 0.04 0.02 0.03 0.23 0.07
Nucleotide Transformer 0.01 0.10 0.02 0.06 0.03 0.03 0.04
RNA-FM 0.04 0.02 0.03 0.09 0.02 0.07 0.05
CARP-640M 0.12 0.13 0.64 0.72 0.49 0.36 0.41
ESM-1v 0.19 0.18 0.53 0.58 0.53 0.44 0.41
ESM-2 650M 0.21 0.16 0.32 0.60 0.49 0.39 0.36
ESM-2 3B 0.21 0.13 0.37 0.62 0.50 0.38 0.37
ProGen2 large 0.19 0.04 0.54 0.49 0.49 0.34 0.35
ProGen2 xlarge 0.22 0.05 0.18 0.41 0.48 0.37 0.29
Model Kobori Andreasson Domingo Guy Hayden Pitt Zhang Average
Evo 0.17 0.14 0.45 0.24 0.13 0.14 0.60 0.27
GenSLM 0.11 0.10 0.29 0.05 0.19 0.18 0.12 0.15
Nucleotide Transformer 0.20 0.07 0.06 0.05 0.24 0.01 0.20 0.12
RNA-FM 0.03 0.16 0.20 0.05 0.11 0.13 0.56 0.18
Model Hossain Kosuri Urtecho Yu Average
Evo zero-shot 0.34 0.66 0.25 0.47 0.43
Evo ridge regression 0.49 0.58 0.25 0.53 0.46
Evo CNN 0.60 0.56 0.40 0.66 0.56
One-hot ridge regression 0.18 0.07 0.13 0.23 0.15
One-hot CNN 0.35 0.44 0.38 0.57 0.44
GC content 0.39 0.47 0.27 0.29 0.35
GenSLM 0.10 0.07 0.08 0.09 0.09
Promoter Calculator 0.60 0.53 0.69 0.66 0.62
Data S1 (separate file) | Sequences of primers and generated sequences experimentally validated, with
additional statistics.
References and Notes
1. T. H. Morgan, Sex limited inheritance in Drosophila. Science 32, 120–122 (1910).
doi:10.1126/science.32.812.120 Medline
2. J. D. Watson, F. H. C. Crick, Molecular structure of nucleic acids: A structure for deoxyribose
nucleic acid. Nature 171, 737–738 (1953). doi:10.1038/171737a0 Medline
3. M. W. Nirenberg, J. H. Matthaei, The dependence of cell-free protein synthesis in E. coli upon
naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. U.S.A. 47,
1588–1602 (1961). doi:10.1073/pnas.47.10.1588 Medline
4. T. Dobzhansky, Genetics and the Origin of Species (Columbia Univ. Press, 1951).
5. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard,
A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D.
Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S.
Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, D. Hassabis,
Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589
(2021). doi:10.1038/s41586-021-03819-2 Medline
6. A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma, R.
Fergus, Biological structure and function emerge from scaling unsupervised learning to
250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021).
doi:10.1073/pnas.2016239118 Medline
7. C. Outeiral, C. M. Deane, Codon language embeddings provide strong signals for use in
protein engineering. Nat. Mach. Intell. 6, 170–179 (2024). doi:10.1038/s42256-024-
00791-0
8. S. Li, S. Moayedpour, R. Li, M. Bailey, S. Riahi, L. Kogler-Anele, M. Miladi, J. Miner, F.
Pertuy, D. Zheng, J. Wang, A. Balsubramani, K. Tran, M. Zacharia, M. Wu, X. Gu, R.
Clinton, C. Asquith, J. Skaleski, L. Boeglin, S. Chivukula, A. Dias, T. Strugnell, F. U.
Montoya, V. Agarwal, Z. Bar-Joseph, S. Jager, CodonBERT large language model for
mRNA vaccines. Genome Res. 34, 1027–1035 (2024). doi:10.1101/gr.278870.123
Medline
9. Ž. Avsec, V. Agarwal, D. Visentin, J. R. Ledsam, A. Grabska-Barwinska, K. R. Taylor, Y.
Assael, J. Jumper, P. Kohli, D. R. Kelley, Effective gene expression prediction from
sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
doi:10.1038/s41592-021-01252-x Medline
10. J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A.
J. Borst, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, N. Hanikel, S. J. Pellock, A.
Courbet, W. Sheffler, J. Wang, P. Venkatesh, I. Sappington, S. Vázquez Torres, A.
Lauko, V. De Bortoli, E. Mathieu, S. Ovchinnikov, R. Barzilay, T. S. Jaakkola, F.
DiMaio, M. Baek, D. Baker, De novo design of protein structure and function with
RFdiffusion. Nature 620, 1089–1100 (2023). doi:10.1038/s41586-023-06415-8 Medline
11. A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos
Jr., C. Xiong, Z. Z. Sun, R. Socher, J. S. Fraser, N. Naik, Large language models generate
functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106
(2023). doi:10.1038/s41587-022-01618-2 Medline
12. J. B. Ingraham, M. Baranov, Z. Costello, K. W. Barber, W. Wang, A. Ismail, V. Frappier, D.
M. Lord, C. Ng-Thow-Hing, E. R. Van Vlack, S. Tie, V. Xue, S. C. Cowles, A. Leung, J.
V. Rodrigues, C. L. Morales-Perez, A. M. Ayoub, R. Green, K. Puentes, F. Oplinger, N.
V. Panwar, F. Obermeyer, A. R. Root, A. L. Beam, F. J. Poelwijk, G. Grigoryan,
Illuminating protein space with a programmable generative model. Nature 623, 1070–
1078 (2023). doi:10.1038/s41586-023-06728-8 Medline
13. L. F. DaSilva, S. Senan, Z. M. Patel, A. J. Reddy, S. Gabbita, Z. Nussbaum, C. M. V.
Córdova, A. Wenteler, N. Weber, T. M. Tunjic, T. A. Khan, Z. Li, C. Smith, M. Bejan, L.
K. Louis, P. Cornejo, W. Connell, E. S. Wong, W. Meuleman, L. Pinello, DNA-
Diffusion: Leveraging Generative Models for Controlling Chromatin Accessibility and
Gene Expression via Synthetic Regulatory Elements. bioRxiv 2024.02.01.578352
[Preprint] (2024); https://doi.org/10.1101/2024.02.01.578352.
14. A. Lal, D. Garfield, T. Biancalani, G. Eraslan, regLM: Designing realistic regulatory DNA
with autoregressive language models. bioRxiv 2024.02.14.580373 [Preprint] (2024);
https://doi.org/10.1101/2024.02.14.580373.
15. M. Zvyagin, A. Brace, K. Hippe, Y. Deng, B. Zhang, C. O. Bohorquez, A. Clyde, B. Kale, D.
Perez-Rivera, H. Ma, C. M. Mann, M. Irvin, D. G. Ozgulbas, N. Vassilieva, J. G.
Pauloski, L. Ward, V. Hayot-Sasson, M. Emani, S. Foreman, Z. Xie, D. Lin, M. Shukla,
W. Nie, J. Romero, C. Dallago, A. Vahdat, C. Xiao, T. Gibbs, I. Foster, J. J. Davis, M. E.
Papka, T. Brettin, R. Stevens, A. Anandkumar, V. Vishwanath, A. Ramanathan,
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.
Int. J. High Perform. Comput. Appl. 37, 683–705 (2023).
doi:10.1177/10943420231201154
16. H. Dalla-Torre, L. Gonzalez, J. Mendoza Revilla, N. L. Carranza, A. H. Grzywaczewski, F.
Oteri, C. Dallago, E. Trop, H. Sirelkhatim, G. Richard, M. Skwark, K. Beguir, M. Lopez,
T. Pierrot, The Nucleotide Transformer: Building and Evaluating Robust Foundation
Models for Human Genomics. bioRxiv 2023.01.11.523679 [Preprint] (2023);
https://doi.org/10.1101/2023.01.11.523679.
17. Z. Zhou, Y. Ji, W. Li, P. Dutta, R. Davuluri, H. Liu, DNABERT-2: Efficient foundation
model and benchmark for multi-species genome. arXiv:2306.15006 [q-bio.GN] (2023).
18. Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C.
Yu, D. Metzler, Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization. arXiv:2106.12672 [cs.CL] (2022).
19. S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models
via positional interpolation. arXiv:2306.15595 [cs.CL] (2023).
20. H. Liu, M. Zaharia, P. Abbeel, Ring attention with blockwise transformers for near-infinite
context. arXiv:2310.01889 [cs.CL] (2023).
21. V. Fishman, Y. Kuratov, A. Shmelev, M. Petrov, D. Penzar, D. Shepelin, N. Chekanov, O.
Kardymon, M. Burtsev, GENA-LM: A family of open-source foundational DNA
language models for long sequences. bioRxiv 2023.06.12.544594 [Preprint] (2024);
https://doi.org/10.1101/2023.06.12.544594.
22. Y. Ji, Z. Zhou, H. Liu, R. V. Davuluri, DNABERT: Pre-trained Bidirectional Encoder
Representations from Transformers model for DNA-language in genome. Bioinformatics
37, 2112–2120 (2021). doi:10.1093/bioinformatics/btab083 Medline
23. Y. Hwang, A. L. Cornman, E. H. Kellogg, S. Ovchinnikov, P. R. Girguis, Genomic language
model predicts protein co-regulation and function. Nat. Commun. 15, 2880 (2024).
doi:10.1038/s41467-024-46947-9 Medline
24. M. Poli, J. Wang, S. Massaroli, J. Quesnelle, R. Carlow, E. Nguyen, A. Thomas,
StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models,
GitHub (2023); https://github.com/togethercomputer/stripedhyena.
25. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, A. Anandkumar,
Fourier neural operator for parametric partial differential equations. arXiv:2010.08895
[cs.LG] (2021).
26. A. Gu, K. Goel, C. Ré, Efficiently modeling long sequences with structured state spaces.
arXiv:2111.00396 [cs.LG] (2022).
27. A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, S. De, Resurrecting
Recurrent Neural Networks for Long Sequences. arXiv:2303.06349 [cs.LG] (2023).
28. S. Massaroli, M. Poli, D. Fu, H. Kumbong, R. Parnichkun, D. Romero, A. Timalsina, Q.
McIntyre, B. Chen, A. Rudra, C. Zhang, C. Ré, S. Ermon, Y. Bengio, “Laughing Hyena
Distillery: Extracting Compact Recurrences From Convolutions” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 17072–17116.
29. J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu, RoFormer: Enhanced transformer with
rotary position embedding. Neurocomputing 568, 127063 (2024).
doi:10.1016/j.neucom.2023.127063
30. X. Ma, C. Zhou, X. Kong, J. He, L. Gui, G. Neubig, J. May, L. Zettlemoyer, Mega: Moving
average equipped gated attention. arXiv:2209.10655 [cs.LG] (2023).
31. D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, C. Ré, Hungry hungry hippos:
Towards language modeling with state space models. arXiv:2212.14052 [cs.LG] (2023).
32. J. Pilault, M. Fathi, O. Firat, C. Pal, P.-L. Bacon, R. Goroshin, “Block-state transformers” in
Advances in Neural Information Processing Systems, vol. 36, A. Oh, T. Naumann, A.
Globerson, K. Saenko, M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp.
7311–7329.
33. E. Nguyen, M. Poli, M. Faizi, A. Thomas, M. Wornow, C. Birch-Sykes, S. Massaroli, A.
Patel, C. Rabideau, Y. Bengio, S. Ermon, C. Ré, S. Baccus, “HyenaDNA: Long-Range
Genomic Sequence Modeling at Single Nucleotide Resolution” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 43177–43201.
34. M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré,
Hyena Hierarchy: Towards Larger Convolutional Language Models. arXiv:2302.10866
[cs.LG] (2023).
35. D. H. Parks, M. Chuvochina, C. Rinke, A. J. Mussig, P.-A. Chaumeil, P. Hugenholtz, GTDB:
An ongoing census of bacterial and archaeal diversity through a phylogenetically
consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res.
50, D785–D794 (2022). doi:10.1093/nar/gkab776
36. A. P. Camargo, S. Nayfach, I.-M. A. Chen, K. Palaniappan, A. Ratner, K. Chu, S. J. Ritter, T.
B. K. Reddy, S. Mukherjee, F. Schulz, L. Call, R. Y. Neches, T. Woyke, N. N. Ivanova,
E. A. Eloe-Fadrosh, N. C. Kyrpides, S. Roux, IMG/VR v4: An expanded database of
uncultivated virus genomes within a framework of extensive functional, taxonomic, and
ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
doi:10.1093/nar/gkac1037
37. A. P. Camargo, L. Call, S. Roux, S. Nayfach, M. Huntemann, K. Palaniappan, A. Ratner, K.
Chu, S. Mukherjeep, T. B. K. Reddy, I. A. Chen, N. N. Ivanova, E. A. Eloe-Fadrosh, T.
Woyke, D. A. Baltrus, S. Castañeda-Barba, F. de la Cruz, B. E. Funnell, J. P. J. Hall, A.
Mukhopadhyay, E. P. C. Rocha, T. Stalder, E. Top, N. C. Kyrpides, IMG/PR: A database
of plasmids from genomes and metagenomes with rich annotations and metadata. Nucleic
Acids Res. 52, D164–D173 (2024). doi:10.1093/nar/gkad964 Medline
38. J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las
Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van
den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O.
Vinyals, L. Sifre, Training Compute-Optimal Large Language Models. arXiv:2203.15556
[cs.CL] (2022).
39. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A.
Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models.
arXiv:2001.08361 [cs.LG] (2020).
40. A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces.
arXiv:2312.00752 [cs.LG] (2024).
41. J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, A. Rives, “Language models enable zero-shot
prediction of the effects of mutations on protein function” in Advances in Neural
Information Processing Systems, vol. 34, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S.
Liang, J. Wortman Vaughan, Eds. (Curran Associates, Inc., 2021), pp. 29287–29303.
42. P. Notin, M. Dias, J. Frazer, J. Marchena-Hurtado, A. N. Gomez, D. Marks, Y. Gal,
“Tranception: Protein Fitness Prediction with Autoregressive Transformers and
Inference-time Retrieval” in Proceedings of the 39th International Conference on
Machine Learning, vol. 162, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S.
Sabato, Eds. (PMLR, 2022), pp. 16990–17017.
43. G. Benegas, C. Albors, A. J. Aw, C. Ye, Y. S. Song, GPN-MSA: an alignment-based DNA
language model for genome-wide variant effect prediction. bioRxiv 2023.10.10.561776
[Preprint] (2024); https://doi.org/10.1101/2023.10.10.561776.
44. P. Notin, A. W. Kollasch, D. Ritter, L. van Niekerk, S. Paul, H. Spinner, N. Rollins, A.
Shaw, R. Weitzman, J. Frazer, M. Dias, D. Franceschi, R. Orenbuch, Y. Gal, D. S.
Marks, ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction.
bioRxiv 2023.12.07.570727 [Preprint] (2023);
https://doi.org/10.1101/2023.12.07.570727.
45. B. J. Livesey, J. A. Marsh, Updated benchmarking of variant effect predictors using deep
mutational scanning. Mol. Syst. Biol. 19, e11474 (2023). doi:10.15252/msb.202211474
Medline
46. K. K. Yang, N. Fusi, A. X. Lu, Convolutions are competitive with transformers for protein
sequence pretraining. Cell Syst. 15, 286–294.e2 (2024). doi:10.1016/j.cels.2024.01.008
Medline
47. Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y.
Shmueli, A. Dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, A. Rives,
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science 379, 1123–1130 (2023). doi:10.1126/science.ade2574 Medline
48. E. Nijkamp, J. A. Ruffolo, E. N. Weinstein, N. Naik, A. Madani, ProGen2: Exploring the
boundaries of protein language models. Cell Syst. 14, 968–978.e3 (2023).
doi:10.1016/j.cels.2023.10.002 Medline
49. F.-Z. Li, A. P. Amini, Y. Yue, K. K. Yang, A. X. Lu, Feature Reuse and Scaling:
Understanding Transfer Learning with Protein Language Models. bioRxiv
2024.02.05.578959 [Preprint] (2024); https://doi.org/10.1101/2024.02.05.578959.
50. J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, I. King,
Y. Li, Interpretable RNA foundation model from unannotated data for highly accurate
RNA structure and function predictions. arXiv:2204.00300 [q-bio.QM] (2022).
51. Z. D. Zhang, M. Nayar, D. Ammons, J. Rampersad, G. E. Fox, Rapid in vivo exploration of a
5S rRNA neutral network. J. Microbiol. Methods 76, 181–187 (2009).
doi:10.1016/j.mimet.2008.10.010
52. T. L. LaFleur, A. Hossain, H. M. Salis, Automated model-predictive design of synthetic
promoters to control transcriptional profiles in bacteria. Nat. Commun. 13, 5159 (2022).
doi:10.1038/s41467-022-32829-5 Medline
53. G. Urtecho, A. D. Tripp, K. D. Insigne, H. Kim, S. Kosuri, Systematic dissection of sequence
elements controlling σ70 promoters using a genomically encoded multiplexed reporter
assay in Escherichia coli. Biochemistry 58, 1539–1551 (2018).
doi:10.1021/acs.biochem.7b01069 Medline
54. A. Hossain, E. Lopez, S. M. Halper, D. P. Cetnar, A. C. Reis, D. Strickland, E. Klavins, H.
M. Salis, Automated design of thousands of nonrepetitive parts for engineering stable
genetic systems. Nat. Biotechnol. 38, 1466–1475 (2020). doi:10.1038/s41587-020-0584-2
Medline
55. T. C. Yu, W. L. Liu, M. S. Brinck, J. E. Davis, J. Shek, G. Bower, T. Einav, K. D. Insigne, R.
Phillips, S. Kosuri, G. Urtecho, Multiplexed characterization of rationally designed
promoter architectures deconstructs combinatorial logic for IPTG-inducible systems. Nat.
Commun. 12, 325 (2021). doi:10.1038/s41467-020-20094-3 Medline
56. S. Kosuri, D. B. Goodman, G. Cambray, V. K. Mutalik, Y. Gao, A. P. Arkin, D. Endy, G. M.
Church, Composability of regulatory sequences controlling transcription and translation
in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 110, 14024–14029 (2013).
doi:10.1073/pnas.1301301110 Medline
57. H. M. Salis, E. A. Mirsky, C. A. Voigt, Automated design of synthetic ribosome binding sites
to control protein expression. Nat. Biotechnol. 27, 946–950 (2009). doi:10.1038/nbt.1568
58. A. C. Reis, H. M. Salis, An automated model test system for systematic development and
improvement of gene expression models. ACS Synth. Biol. 9, 3145–3156 (2020).
doi:10.1021/acssynbio.0c00394 Medline
59. J. Y. Wang, P. Pausch, J. A. Doudna, Structural biology of CRISPR–Cas immunity and
genome editing enzymes. Nat. Rev. Microbiol. 20, 641–656 (2022). doi:10.1038/s41579-
022-00739-4
60. P. D. Hsu, E. S. Lander, F. Zhang, Development and applications of CRISPR-Cas9 for
genome engineering. Cell 157, 1262–1278 (2014). doi:10.1016/j.cell.2014.05.010
Medline
61. E. V. Koonin, K. S. Makarova, Origins and evolution of CRISPR-Cas systems. Phil. Trans.
R. Soc. B 374, 20180087 (2019). doi:10.1098/rstb.2018.0087
62. M. Jinek, K. Chylinski, I. Fonfara, M. Hauer, J. A. Doudna, E. Charpentier, A programmable
dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–
821 (2012). doi:10.1126/science.1225829 Medline
63. P. D. Hsu, D. A. Scott, J. A. Weinstein, F. A. Ran, S. Konermann, V. Agarwala, Y. Li, E. J.
Fine, X. Wu, O. Shalem, T. J. Cradick, L. A. Marraffini, G. Bao, F. Zhang, DNA
targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827–832
(2013). doi:10.1038/nbt.2647 Medline
64. J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L.
Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M.
O’Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Žemgulytė, E. Arvaniti, C. Beattie,
O. Bertolli, A. Bridgland, A. Cherepanov, M. Congreve, A. I. Cowen-Rivers, A. Cowie,
M. Figurnov, F. B. Fuchs, H. Gladman, R. Jain, Y. A. Khan, C. M. R. Low, K. Perlin, A.
Potapenko, P. Savy, S. Singh, A. Stecula, A. Thillaisundaram, C. Tong, S. Yakneen, E.
D. Zhong, M. Zielinski, A. Žídek, V. Bapst, P. Kohli, M. Jaderberg, D. Hassabis, J. M.
Jumper, Accurate structure prediction of biomolecular interactions with AlphaFold 3.
Nature 630, 493–500 (2024). doi:10.1038/s41586-024-07487-w Medline
65. N. L. Craig, M. Chandler, M. Gellert, A. M. Lambowitz, P. A. Rice, S. B. Sandmeyer, Eds.,
Mobile DNA III (Wiley, ed. 3, 2020).
66. C. Meers, H. C. Le, S. R. Pesari, F. T. Hoffmann, M. W. G. Walker, J. Gezelle, S. Tang, S.
H. Sternberg, Transposon-encoded nucleases use guide RNAs to promote their selfish
spread. Nature 622, 863–871 (2023). doi:10.1038/s41586-023-06597-1 Medline
67. T. Karvelis, G. Druteika, G. Bigelyte, K. Budre, R. Zedaveinyte, A. Silanskas, D.
Kazlauskas, Č. Venclovas, V. Siksnys, Transposon-associated TnpB is a programmable
RNA-guided DNA endonuclease. Nature 599, 692–696 (2021). doi:10.1038/s41586-021-
04058-1 Medline
68. H. Altae-Tran, S. Kannan, F. E. Demircioglu, R. Oshiro, S. P. Nety, L. J. McKay, M. Dlakić,
W. P. Inskeep, K. S. Makarova, R. K. Macrae, E. V. Koonin, F. Zhang, The widespread
IS200/IS605 transposon family encodes diverse programmable RNA-guided
endonucleases. Science 374, 57–65 (2021). doi:10.1126/science.abj6856
69. O. Barabas, D. R. Ronning, C. Guynet, A. B. Hickman, B. Ton-Hoang, M. Chandler, F.
Dyda, Mechanism of IS200/IS605 family DNA transposases: Activation and transposon-
directed target site selection. Cell 132, 208–220 (2008). doi:10.1016/j.cell.2007.12.029
Medline
70. Z. Zhang, H. K. Wayment-Steele, G. Brixi, H. Wang, D. Kern, S. Ovchinnikov, Protein
language models learn evolutionary statistics of interacting sequence motifs. Proc. Natl.
Acad. Sci. U.S.A. 121, e2406285121 (2024). doi:10.1073/pnas.2406285121
71. P. Siguier, J. Pérochon, L. Lestrade, J. Mahillon, M. Chandler, ISfinder: The reference centre
for bacterial insertion sequences. Nucleic Acids Res. 34, D32–D36 (2006).
doi:10.1093/nar/gkj014 Medline
72. E. P. C. Rocha, A. Danchin, Gene essentiality determines chromosome organisation in
bacteria. Nucleic Acids Res. 31, 6570–6577 (2003). doi:10.1093/nar/gkg859 Medline
73. R. Zhang, H.‐Y. Ou, C.‐T. Zhang, DEG: A database of essential genes. Nucleic Acids Res.
32, D271–D272 (2004). doi:10.1093/nar/gkh024
74. D. Piya, N. Nolan, M. L. Moore, L. A. Ramirez Hernandez, B. F. Cress, R. Young, A. P.
Arkin, V. K. Mutalik, Systematic and scalable genome-wide essentiality mapping to
identify nonessential genes in phages. PLOS Biol. 21, e3002416 (2023).
doi:10.1371/journal.pbio.3002416 Medline
75. K. H. Turner, A. K. Wessel, G. C. Palmer, J. L. Murray, M. Whiteley, Essential genome of
Pseudomonas aeruginosa in cystic fibrosis sputum. Proc. Natl. Acad. Sci. U.S.A. 112,
4110–4115 (2015). doi:10.1073/pnas.1419677112 Medline
76. A. Blanchard, C. Bébéar, The evolution of Mycoplasma genitalium. Ann. N. Y. Acad. Sci.
1230, E61–E64 (2011). doi:10.1111/j.1749-6632.2011.06418.x Medline
77. D. H. Parks, M. Imelfort, C. T. Skennerton, P. Hugenholtz, G. W. Tyson, CheckM:
Assessing the quality of microbial genomes recovered from isolates, single cells, and
metagenomes. Genome Res. 25, 1043–1055 (2015). doi:10.1101/gr.186072.114
78. D. T. Pride, R. J. Meinersmann, T. M. Wassenaar, M. J. Blaser, Evolutionary implications of
microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–158 (2003).
doi:10.1101/gr.335003
79. L. Xu, J. Kuo, J.-K. Liu, T.-Y. Wong, Bacterial phylogenetic tree construction based on
genomic translation stop signals. Microb. Inform. Exp. 2, 6 (2012). doi:10.1186/2042-
5783-2-6
80. G. Korkmaz, M. Holm, T. Wiens, S. Sanyal, Comprehensive analysis of stop codon usage in
bacteria and its correlation with release factor abundance. J. Biol. Chem. 289, 30334–
30342 (2014). doi:10.1074/jbc.M114.606632 Medline
81. T. Seemann, barrnap, GitHub (2018); https://github.com/tseemann/barrnap.
82. N. Goldman, J. L. Thorne, D. T. Jones, Assessing the impact of secondary structure and
solvent accessibility on protein evolution. Genetics 149, 445–458 (1998).
doi:10.1093/genetics/149.1.445 Medline
83. J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le,
Finetuned language models are zero-shot learners. arXiv:2109.01652 [cs.CL] (2022).
84. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S.
Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A.
Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to
follow instructions with human feedback. arXiv:2203.02155 [cs.CL] (2022).
85. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, “Direct Preference
Optimization: Your Language Model is Secretly a Reward Model” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 53728–53741.
86. H. L. Rehm, A. J. H. Page, L. Smith, J. B. Adams, G. Alterovitz, L. J. Babb, M. P. Barkley,
M. Baudis, M. J. S. Beauvais, T. Beck, J. S. Beckmann, S. Beltran, D. Bernick, A.
Bernier, J. K. Bonfield, T. F. Boughtwood, G. Bourque, S. R. Bowers, A. J. Brookes, M.
Brudno, M. H. Brush, D. Bujold, T. Burdett, O. J. Buske, M. N. Cabili, D. L. Cameron,
R. J. Carroll, E. Casas-Silva, D. Chakravarty, B. P. Chaudhari, S. H. Chen, J. M. Cherry,
J. Chung, M. Cline, H. L. Clissold, R. M. Cook-Deegan, M. Courtot, F. Cunningham, M.
Cupak, R. M. Davies, D. Denisko, M. J. Doerr, L. I. Dolman, E. S. Dove, L. J. Dursi, S.
O. M. Dyke, J. A. Eddy, K. Eilbeck, K. P. Ellrott, S. Fairley, K. A. Fakhro, H. V. Firth,
M. S. Fitzsimons, M. Fiume, P. Flicek, I. M. Fore, M. A. Freeberg, R. R. Freimuth, L. A.
Fromont, J. Fuerth, C. L. Gaff, W. Gan, E. M. Ghanaim, D. Glazer, R. C. Green, M.
Griffith, O. L. Griffith, R. L. Grossman, T. Groza, J. M. Guidry Auvil, R. Guigó, D.
Gupta, M. A. Haendel, A. Hamosh, D. P. Hansen, R. K. Hart, D. M. Hartley, D. Haussler,
R. M. Hendricks-Sturrup, C. W. L. Ho, A. E. Hobb, M. M. Hoffman, O. M. Hofmann, P.
Holub, J. S. Hsu, J.-P. Hubaux, S. E. Hunt, A. Husami, J. O. Jacobsen, S. S. Jamuar, E. L.
Janes, F. Jeanson, A. Jené, A. L. Johns, Y. Joly, S. J. M. Jones, A. Kanitz, K. Kato, T. M.
Keane, K. Kekesi-Lafrance, J. Kelleher, G. Kerry, S. S. Khor, B. M. Knoppers, M. A.
Konopko, K. Kosaki, M. Kuba, J. Lawson, R. Leinonen, S. Li, M. F. Lin, M. Linden, X.
Liu, I. U. Liyanage, J. Lopez, A. M. Lucassen, M. Lukowski, A. L. Mann, J. Marshall,
M. Mattioni, A. Metke-Jimenez, A. Middleton, R. J. Milne, F. Molnár-Gábor, N. Mulder,
M. C. Munoz-Torres, R. Nag, H. Nakagawa, J. Nasir, A. Navarro, T. H. Nelson, A.
Niewielska, A. Nisselle, J. Niu, T. H. Nyrönen, B. D. O’Connor, S. Oesterle, S.
Ogishima, V. Ota Wang, L. A. D. Paglione, E. Palumbo, H. E. Parkinson, A. A.
Philippakis, A. D. Pizarro, A. Prlic, J. Rambla, A. Rendon, R. A. Rider, P. N. Robinson,
K. W. Rodarmer, L. L. Rodriguez, A. F. Rubin, M. Rueda, G. A. Rushton, R. S. Ryan, G.
I. Saunders, H. Schuilenburg, T. Schwede, S. Scollen, A. Senf, N. C. Sheffield, N.
Skantharajah, A. V. Smith, H. J. Sofia, D. Spalding, A. B. Spurdle, Z. Stark, L. D. Stein,
M. Suematsu, P. Tan, J. A. Tedds, A. A. Thomson, A. Thorogood, T. L. Tickle, K.
Tokunaga, J. Törnroos, D. Torrents, S. Upchurch, A. Valencia, R. Valls Guimera, J.
Vamathevan, S. Varma, D. F. Vears, C. Viner, C. Voisin, A. H. Wagner, S. E. Wallace,
B. P. Walsh, M. S. Williams, E. C. Winkler, B. J. Wold, G. M. Wood, J. P. Woolley, C.
Yamasaki, A. D. Yates, C. K. Yung, L. J. Zass, K. Zaytseva, J. Zhang, P. Goodhand, K.
North, E. Birney, GA4GH: International policies and standards for data sharing across
genomic research and healthcare. Cell Genomics 1, 100029 (2021).
doi:10.1016/j.xgen.2021.100029
87. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, “Large Language Models are Zero-
Shot Reasoners” in Advances in Neural Information Processing Systems, vol. 35, S.
Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates,
Inc., 2022), pp. 22199–22213.
88. B. L. Hie, V. R. Shanker, D. Xu, T. U. J. Bruun, P. A. Weidenbacher, S. Tang, W. Wu, J. E.
Pak, P. S. Kim, Efficient evolution of human antibodies from general protein language
models. Nat. Biotechnol. 42, 275–283 (2024). doi:10.1038/s41587-023-01763-2
89. V. R. Shanker, T. U. J. Bruun, B. L. Hie, P. S. Kim, Unsupervised evolution of protein and
antibody complexes with a structure-informed language model. Science 385, 46–53
(2024). doi:10.1126/science.adk8946 Medline
90. M. G. Durrant, N. T. Perry, J. J. Pai, A. R. Jangid, J. S. Athukoralage, M. Hiraizumi, J. P.
McSpedon, A. Pawluk, H. Nishimasu, S. Konermann, P. D. Hsu, Bridge RNAs direct
programmable recombination of target and donor DNA. Nature 630, 984–993 (2024).
doi:10.1038/s41586-024-07552-4 Medline
91. Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, “Language Modeling with Gated
Convolutional Networks” in Proceedings of the 34th International Conference on
Machine Learning, vol. 70, D. Precup, Y. W. Teh, Eds. (PMLR, 2017), pp. 933–941.
92. N. Shazeer, GLU variants improve Transformer. arXiv:2002.05202 [cs.LG] (2020).
93. B. Zhang, R. Sennrich, “Root Mean Square Layer Normalization” in Advances in Neural
Information Processing Systems, vol. 32, H. Wallach, H. Larochelle, A. Beygelzimer, F.
d’Alché-Buc, E. Fox, R. Garnett, Eds. (Curran Associates, Inc., 2019).
94. D. Fu, S. Arora, J. Grogan, I. Johnson, E. S. Eyuboglu, A. Thomas, B. Spector, M. Poli, A.
Rudra, C. Ré, “Monarch Mixer: A simple sub-quadratic GEMM-based architecture” in
Advances in Neural Information Processing Systems, vol. 36, A. Oh, T. Naumann, A.
Globerson, K. Saenko, M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp.
77546–77603.
95. S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, C. Ré, Zoology:
Measuring and improving recall in efficient language models. arXiv:2312.04927 [cs.CL]
(2023).
96. S. Bhattamishra, A. Patel, P. Blunsom, V. Kanade, Understanding in-context learning in
transformers and LLMs by learning to learn discrete functions. arXiv:2310.03016
[cs.LG] (2023).
97. D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, M. Hoogendoorn, CKConv:
Continuous kernel convolution for sequential data. arXiv:2102.02611 [cs.LG] (2022).
98. A. Gupta, A. Gu, J. Berant, “Diagonal State Spaces are as Effective as Structured State
Spaces” in Advances in Neural Information Processing Systems, vol. 35, S. Koyejo, S.
Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates, Inc.,
2022), pp. 22982–22994.
99. A. Gu, K. Goel, A. Gupta, C. Ré, “On the Parameterization and Initialization of Diagonal
State Space Models” in Advances in Neural Information Processing Systems, vol. 35, S.
Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates,
Inc., 2022), pp. 35971–35983.
100. M. Zhang, K. K. Saab, M. Poli, T. Dao, K. Goel, C. Ré, Effectively modeling time series
with simple discrete state spaces. arXiv:2303.09489 [cs.LG] (2023).
101. J. Wei, P. Lotfy, K. Faizi, S. Baungaard, E. Gibson, E. Wang, H. Slabodkin, E. Kinnaman,
S. Chandrasekaran, H. Kitano, M. G. Durrant, C. V. Duffy, A. Pawluk, P. D. Hsu, S.
Konermann, Deep learning and CRISPR-Cas13d ortholog discovery for optimized RNA
targeting. Cell Syst. 14, 1087–1102.e13 (2023). doi:10.1016/j.cels.2023.11.006 Medline
102. N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput,
B. Robbertse, B. Smith-White, D. Ako-Adjei, A. Astashyn, A. Badretdin, Y. Bao, O.
Blinkova, V. Brover, V. Chetvernin, J. Choi, E. Cox, O. Ermolaeva, C. M. Farrell, T.
Goldfarb, T. Gupta, D. Haft, E. Hatcher, W. Hlavina, V. S. Joardar, V. K. Kodali, W. Li,
D. Maglott, P. Masterson, K. M. McGarvey, M. R. Murphy, K. O’Neill, S. Pujar, S. H.
Rangwala, D. Rausch, L. D. Riddick, C. Schoch, A. Shkeda, S. S. Storz, H. Sun, F.
Thibaud-Nissen, I. Tolstoy, R. E. Tully, A. R. Vatsan, C. Wallin, D. Webb, W. Wu, M. J.
Landrum, A. Kimchi, T. Tatusova, M. DiCuccio, P. Kitts, T. D. Murphy, K. D. Pruitt,
Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion,
and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
doi:10.1093/nar/gkv1189
103. A. Almeida, S. Nayfach, M. Boland, F. Strozzi, M. Beracochea, Z. J. Shi, K. S. Pollard, E.
Sakharova, D. H. Parks, P. Hugenholtz, N. Segata, N. C. Kyrpides, R. D. Finn, A unified
catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol.
39, 105–114 (2021). doi:10.1038/s41587-020-0603-3 Medline
104. I.-M. A. Chen, K. Chu, K. Palaniappan, A. Ratner, J. Huang, M. Huntemann, P. Hajek, S.
Ritter, N. Varghese, R. Seshadri, S. Roux, T. Woyke, E. A. Eloe-Fadrosh, N. N. Ivanova,
N. C. Kyrpides, The IMG/M data management and analysis system v.6.0: New tools and
advanced capabilities. Nucleic Acids Res. 49, D751–D763 (2021).
doi:10.1093/nar/gkaa939
105. L. F. Camarillo-Guerrero, A. Almeida, G. Rangel-Pineros, R. D. Finn, T. D. Lawley,
Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e9
(2021). doi:10.1016/j.cell.2021.01.029
106. S. C. Forster, N. Kumar, B. O. Anonye, A. Almeida, E. Viciani, M. D. Stares, M. Dunn, T.
T. Mkandawire, A. Zhu, Y. Shao, L. J. Pike, T. Louie, H. P. Browne, A. L. Mitchell, B.
A. Neville, R. D. Finn, T. D. Lawley, A human gut bacterial genome and culture
collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186–192 (2019).
doi:10.1038/s41587-018-0009-7
107. A. L. Mitchell, A. Almeida, M. Beracochea, M. Boland, J. Burgin, G. Cochrane, M. R.
Crusoe, V. Kale, S. C. Potter, L. J. Richardson, E. Sakharova, M. Scheremetjew, A.
Korobeynikov, A. Shlemov, O. Kunyavskaya, A. Lapidus, R. D. Finn, MGnify: The
microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2020).
doi:10.1093/nar/gkz1035
108. N. D. Youngblut, J. de la Cuesta-Zuluaga, G. H. Reischer, S. Dauser, N. Schuster, C.
Walzer, G. Stalder, A. H. Farnleitner, R. E. Ley, Large-scale metagenome assembly
reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other
genetic diversity. mSystems 5, e01045-20 (2020). doi:10.1128/msystems.01045-20
109. F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A.
Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards, The metagenomics RAST
server: A public resource for the automatic phylogenetic and functional analysis of
metagenomes. BMC Bioinformatics 9, 386 (2008). doi:10.1186/1471-2105-9-386
Medline
110. S. Sunagawa, L. P. Coelho, S. Chaffron, J. R. Kultima, K. Labadie, G. Salazar, B.
Djahanschiri, G. Zeller, D. R. Mende, A. Alberti, F. M. Cornejo-Castillo, P. I. Costea, C.
Cruaud, F. d’Ovidio, S. Engelen, I. Ferrera, J. M. Gasol, L. Guidi, F. Hildebrand, F.
Kokoszka, C. Lepoivre, G. Lima-Mendez, J. Poulain, B. T. Poulos, M. Royo-Llonch, H.
Sarmento, S. Vieira-Silva, C. Dimier, M. Picheral, S. Searson, S. Kandels-Lewis, Tara
Oceans Coordinators, C. Bowler, C. de Vargas, G. Gorsky, N. Grimsley, P. Hingamp, D.
Iudicone, O. Jaillon, F. Not, H. Ogata, S. Pesant, S. Speich, L. Stemmann, M. B.
Sullivan, J. Weissenbach, P. Wincker, E. Karsenti, J. Raes, S. G. Acinas, P. Bork,
Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
doi:10.1126/science.1261359 Medline
111. R. D. Finn, J. Clements, S. R. Eddy, HMMER web server: Interactive sequence similarity
searching. Nucleic Acids Res. 39, W29–W37 (2011). doi:10.1093/nar/gkr367
112. J. Russel, R. Pinilla-Redondo, D. Mayo-Muñoz, S. A. Shah, S. J. Sørensen,
CRISPRCasTyper: Automated identification, annotation, and classification of CRISPR-
Cas loci. CRISPR J. 3, 462–469 (2020). doi:10.1089/crispr.2020.0059 Medline
113. H. Altae-Tran, S. A. Shmakov, K. S. Makarova, Y. I. Wolf, S. Kannan, F. Zhang, E. V.
Koonin, Diversity, evolution, and classification of the RNA-guided nucleases TnpB and
Cas12. Proc. Natl. Acad. Sci. U.S.A. 120, e2308224120 (2023).
doi:10.1073/pnas.2308224120 Medline
114. M. Steinegger, J. Söding, MMseqs2 enables sensitive protein sequence searching for the
analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
doi:10.1038/nbt.3988 Medline
115. K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: A novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066
(2002). doi:10.1093/nar/gkf436 Medline
116. W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A.
Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, K. Malik, A. Fan,
S. Bhosale, S. Edunov, M. Lewis, S. Wang, H. Ma, Effective long-context scaling of
foundation models. arXiv:2309.16039 [cs.CL] (2023).
117. J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, S. Sanghai, GQA:
Training generalized multi-query transformer models from multi-head checkpoints.
arXiv:2305.13245 [cs.CL] (2023).
118. E. Firnberg, J. W. Labonte, J. J. Gray, M. Ostermeier, A comprehensive, high-resolution
map of a gene’s fitness landscape. Mol. Biol. Evol. 31, 1581–1592 (2014).
doi:10.1093/molbev/msu081 Medline
119. H. Jacquier, A. Birgy, H. Le Nagard, Y. Mechulam, E. Schmitt, J. Glodt, B. Bercot, E. Petit,
J. Poulain, G. Barnaud, P.-A. Gros, O. Tenaillon, Capturing the mutational landscape of
the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. U.S.A. 110, 13067–13072 (2013).
doi:10.1073/pnas.1215206110
120. B. V. Adkar, A. Tripathi, A. Sahoo, K. Bajaj, D. Goswami, P. Chakrabarti, M. K. Swarnkar,
R. S. Gokhale, R. Varadarajan, Protein model discrimination using mutational sensitivity
derived from deep sequencing. Structure 20, 371–381 (2012).
doi:10.1016/j.str.2011.11.021
121. K. Tsuboyama, J. Dauparas, J. Chen, E. Laine, Y. Mohseni Behbahani, J. J. Weinstein, N.
M. Mangan, S. Ovchinnikov, G. J. Rocklin, Mega-scale experimental analysis of protein
folding stability in biology and design. Nature 620, 434–444 (2023). doi:10.1038/s41586-
023-06328-6 Medline
122. E. D. Kelsic, H. Chung, N. Cohen, J. Park, H. H. Wang, R. Kishony, RNA structural
determinants of optimal codons revealed by MAGE-Seq. Cell Syst. 3, 563–571.e6 (2016).
doi:10.1016/j.cels.2016.11.004
123. R. Weeks, M. Ostermeier, Fitness and functional landscapes of the E. coli RNase III gene
rnc. Mol. Biol. Evol. 40, msad047 (2023). doi:10.1093/molbev/msad047 Medline
124. L. Rockah-Shmuel, Á. Tóth-Petróczy, D. S. Tawfik, Systematic mapping of protein
mutational space by prolonged drift reveals the deleterious effects of seemingly neutral
mutations. PLOS Comput. Biol. 11, e1004421 (2015). doi:10.1371/journal.pcbi.1004421
Medline
125. J. Z. Chen, D. M. Fowler, N. Tokuriki, Comprehensive exploration of the translocation,
stability and substrate recognition requirements in VIM-2 lactamase. eLife 9, e56707
(2020). doi:10.7554/eLife.56707 Medline
126. A. Melnikov, P. Rogov, L. Wang, A. Gnirke, T. S. Mikkelsen, Comprehensive mutational
scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids
Res. 42, e112 (2014). doi:10.1093/nar/gku511 Medline
127. S. Sun, J. Weile, M. Verby, Y. Wu, Y. Wang, A. G. Cote, I. Fotiadou, J. Kitaygorodsky, M.
Vidal, J. Rine, P. Ješina, V. Kožich, F. P. Roth, A proactive genotype-to-patient-
phenotype map for cystathionine beta-synthase. Genome Med. 12, 13 (2020).
doi:10.1186/s13073-020-0711-1
128. R. A. Silverstein, S. Sun, M. Verby, J. Weile, Y. Wu, M. Gebbia, I. Fotiadou, J.
Kitaygorodsky, F. P. Roth, A systematic genotype-phenotype map for missense variants
in the human intellectual disability-associated gene GDI1. bioRxiv 2021.10.06.463360
[Preprint] (2022); https://doi.org/10.1101/2021.10.06.463360.
129. C. W. Garvie, X. Wu, M. Papanastasiou, S. Lee, J. Fuller, G. R. Schnitzler, S. W. Horner,
A. Baker, T. Zhang, J. P. Mullahoo, L. Westlake, S. H. Hoyt, M. Toetzl, M. J. Ranaghan,
L. de Waal, J. McGaunn, B. Kaplan, F. Piccioni, X. Yang, M. Lange, A. Tersteegen, D.
Raymond, T. A. Lewis, S. A. Carr, A. D. Cherniack, C. T. Lemke, M. Meyerson, H.
Greulich, Structure of PDE3A-SLFN12 complex reveals requirements for activation of
SLFN12 RNase. Nat. Commun. 12, 4375 (2021). doi:10.1038/s41467-021-24495-w
130. E. Kotler, O. Shani, G. Goldfeld, M. Lotan-Pompan, O. Tarcic, A. Gershoni, T. A. Hopf, D.
S. Marks, M. Oren, E. Segal, A systematic p53 mutation library links differential
functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71,
178–190.e8 (2018). doi:10.1016/j.molcel.2018.06.012 Medline
131. A. O. Giacomelli, X. Yang, R. E. Lintner, J. M. McFarland, M. Duby, J. Kim, T. P.
Howard, D. Y. Takeda, S. H. Ly, E. Kim, H. S. Gannon, B. Hurhula, T. Sharpe, A.
Goodale, B. Fritchman, S. Steelman, F. Vazquez, A. Tsherniak, A. J. Aguirre, J. G.
Doench, F. Piccioni, C. W. M. Roberts, M. Meyerson, G. Getz, C. M. Johannessen, D. E.
Root, W. C. Hahn, Mutational processes shape the landscape of TP53 mutations in
human cancer. Nat. Genet. 50, 1381–1387 (2018). doi:10.1038/s41588-018-0204-y
132. G. M. Findlay, R. M. Daza, B. Martin, M. D. Zhang, A. P. Leith, M. Gasperini, J. D.
Janizek, X. Huang, L. M. Starita, J. Shendure, Accurate classification of BRCA1 variants
with saturation genome editing. Nature 562, 217–222 (2018). doi:10.1038/s41586-018-
0461-z
133. S. Kobori, Y. Nomura, A. Miu, Y. Yokobayashi, High-throughput assay and engineering of
self-cleaving ribozymes by sequencing. Nucleic Acids Res. 43, e85 (2015).
doi:10.1093/nar/gkv265 Medline
134. J. O. L. Andreasson, A. Savinov, S. M. Block, W. J. Greenleaf, Comprehensive sequence-
to-function mapping of cofactor-dependent RNA catalysis in the glmS ribozyme. Nat.
Commun. 11, 1663 (2020). doi:10.1038/s41467-020-15540-1
135. J. Domingo, G. Diss, B. Lehner, Pairwise and higher-order genetic interactions during the
evolution of a tRNA. Nature 558, 117–121 (2018). doi:10.1038/s41586-018-0170-7
Medline
136. M. P. Guy, D. L. Young, M. J. Payea, X. Zhang, Y. Kon, K. M. Dean, E. J. Grayhack, D. H.
Mathews, S. Fields, E. M. Phizicky, Identification of the determinants of tRNA function
and susceptibility to rapid tRNA decay by high-throughput in vivo analysis. Genes Dev.
28, 1721–1732 (2014). doi:10.1101/gad.245936.114
137. E. J. Hayden, E. Ferrada, A. Wagner, Cryptic genetic variation promotes rapid evolutionary
adaptation in an RNA enzyme. Nature 474, 92–95 (2011). doi:10.1038/nature10083
Medline
138. J. N. Pitt, A. R. Ferré-D’Amaré, Rapid construction of empirical RNA fitness landscapes.
Science 330, 376–379 (2010). doi:10.1126/science.1192001 Medline
139. T. A. Chang, B. K. Bergen, Language model behavior: A comprehensive survey.
arXiv:2303.11504 [cs.CL] (2023).
140. D. Hyatt, G.-L. Chen, P. F. Locascio, M. L. Land, F. W. Larimer, L. J. Hauser, Prodigal:
Prokaryotic gene recognition and translation initiation site identification. BMC
Bioinformatics 11, 119 (2010). doi:10.1186/1471-2105-11-119 Medline
141. C. Bland, T. L. Ramsey, F. Sabree, M. Lowe, K. Brown, N. C. Kyrpides, P. Hugenholtz,
CRISPR recognition tool (CRT): A tool for automatic detection of clustered regularly
interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007). doi:10.1186/1471-
2105-8-209 Medline
142. P. Kunzmann, T. D. Müller, M. Greil, J. H. Krumbach, J. M. Anter, D. Bauer, F. Islam, K.
Hamacher, Biotite: New tools for a versatile Python bioinformatics library. BMC
Bioinformatics 24, 236 (2023). doi:10.1186/s12859-023-05345-6
143. A. Mitrofanov, M. Ziemann, O. S. Alkhnbashi, W. R. Hess, R. Backofen,
CRISPRtracrRNA: Robust approach for CRISPR tracrRNA detection. Bioinformatics 38,
ii42–ii48 (2022). doi:10.1093/bioinformatics/btac466 Medline
144. R. Lorenz, S. H. Bernhart, C. Höner Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, I.
L. Hofacker, ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
doi:10.1186/1748-7188-6-26 Medline
145. E. P. Nawrocki, S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches.
Bioinformatics 29, 2933–2935 (2013). doi:10.1093/bioinformatics/btt509 Medline
146. C. Zhang, M. Shine, A. M. Pyle, Y. Zhang, US-align: Universal structure alignments of
proteins, nucleic acids, and macromolecular complexes. Nat. Methods 19, 1109–1115
(2022). doi:10.1038/s41592-022-01585-1 Medline
147. Schrödinger LLC, The PyMOL Molecular Graphics System, version 1.8 (2015).
148. A. R. Gruber, R. Lorenz, S. H. Bernhart, R. Neuböck, I. L. Hofacker, The Vienna RNA
Websuite. Nucleic Acids Res. 36, W70–W74 (2008). doi:10.1093/nar/gkn188
149. W. B. Langdon, J. Petke, R. Lorenz, in Genetic Programming, M. Castelli, L. Sekanina, M.
Zhang, S. Cagnoni, P. García-Sánchez, Eds. (Springer, 2018), pp. 220–236.
150. Z. Weinberg, R. R. Breaker, R2R: Software to speed the depiction of aesthetic consensus
RNA secondary structures. BMC Bioinformatics 12, 3 (2011). doi:10.1186/1471-2105-
12-3 Medline
151. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment
search tool. J. Mol. Biol. 215, 403–410 (1990). doi:10.1016/S0022-2836(05)80360-2
152. M. Mirdita, K. Schütze, Y. Moriwaki, L. Heo, S. Ovchinnikov, M. Steinegger, ColabFold:
Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
doi:10.1038/s41592-022-01488-1 Medline
153. B. Hie, Code for paper “Sequence modeling and design from molecular to genome scale
with Evo,” Zenodo (2024); https://doi.org/10.5281/zenodo.12693561.
154. K. Badal, C. M. Lee, L. J. Esserman, Guiding principles for the responsible development of
artificial intelligence tools for healthcare. Commun. Med. 3, 47 (2023).
doi:10.1038/s43856-023-00279-9 Medline
155. D. Baker, G. Church, Protein design meets biosecurity. Science 383, 349 (2024).
doi:10.1126/science.ado1671 Medline
156. S. Morin, G. Segafredo, M. Piccolis, A. Das, M. Das, N. Loffredi, A. Larbi, K. Mwamelo,
E. Villanueva, S. Nobre, E. Burrone, Expanding access to biotherapeutics in low-income
and middle-income countries through public health non-exclusive voluntary intellectual
property licensing: Considerations, requirements, and opportunities. Lancet Glob. Health
11, e145–e154 (2023). doi:10.1016/S2214-109X(22)00460-0 Medline
157. M. E. Peek, By any means necessary: Why lowering insulin prices is relevant to racial
health equity. Lancet 398, 1783–1784 (2021). doi:10.1016/S0140-6736(21)02315-1
Medline
158. N. B. W. Macfarlane, J. Adams, E. L. Bennett, T. M. Brooks, J. A. Delborne, H.
Eggermont, D. Endy, K. M. Esvelt, B. Kolodziejczyk, T. Kuiken, M. J. Oliva, S. Peña
Moreno, L. Slobodian, R. B. Smith, D. Thizy, D. M. Tompkins, W. Wei, K. H. Redford,
Direct and indirect impacts of synthetic biology on biodiversity conservation. iScience
25, 105423 (2022). doi:10.1016/j.isci.2022.105423 Medline
159. The carbon footprint of computational research. Nat. Comput. Sci. 3, 659 (2023).
doi:10.1038/s43588-023-00506-2