Science - ado9336 SM
Sequence modeling and design from molecular to genome scale with Evo
Supplementary Text
Figs. S1 to S28
Tables S1 to S7
References
a loss of biodiversity or the emergence of new, potentially harmful species (158). Although the ecological impacts of training whole-genome foundation models remain unknown, more immediately, it is also important to consider the carbon footprint associated with increasing infrastructure and computational demands (159). The capabilities of tools such as Evo, alongside other technologies for genome editing and ecological engineering, add to complex debates about the extent to which science should intervene in evolution. As we push the boundaries of scientific capabilities with tools such as Evo, it becomes imperative to reflect on the interactions and boundaries between our inventions and natural evolutionary processes, aiming to preserve ecological balance, maintain environmental sustainability, and uphold ethical standards.
The path forward for the responsible use and development of tools like Evo is anchored in the establishment of clear, comprehensive guidelines that delineate ethical practices. These guidelines serve as a responsible AI framework, ensuring that all stakeholders, including researchers, developers, and users, have a common understanding of the safety and ethical dimensions inherent in genetic engineering. Coupled with robust oversight mechanisms, this approach aims to monitor and manage the application of Evo to prevent misuse and ensure its alignment with ethical standards. Furthermore, promoting transparency regarding the use of these technologies and fostering open dialogue among all parties will enhance trust and collaboration within the scientific community and beyond.
To address disparities in access and capabilities, particularly in low-income countries, the strategy includes
forging community partnerships and international collaborations. By offering targeted training and support,
these partnerships can democratize access to advanced tools like Evo, enabling a broader spectrum of scientists
and researchers to contribute to and benefit from genetic engineering innovations. At the policy level, investing
in education and capacity building emerges as a pivotal element, equipping the next generation of scientists
with the ethical acumen and technical skills to navigate the complexities of genetic research responsibly.
Central to sustaining ethical innovation is the creation of a dynamic feedback loop that engages all stakeholders in a continuous dialogue. By setting up mechanisms to collect and integrate feedback from those involved in or impacted by Evo’s applications, the process ensures that guidelines, policies, and practices are regularly refined in response to evolving ethical challenges and societal expectations. Collaborating with organizations such as the Global Alliance for Genomics and Health (GA4GH) (86) to develop and update genetic engineering guidelines further solidifies this commitment to ethical excellence. This multifaceted approach not only addresses immediate concerns but also lays the groundwork for a future where genetic engineering advances in harmony with ethical principles and societal values.
Fig. S1 | Pretraining data statistics. (A) A pie chart depicting the composition of viral realms in the IMG/VR
subset of the pretraining dataset. (B) A pie chart depicting the composition of host kingdoms in the IMG/VR
subset of the pretraining dataset. We excluded viruses that are likely to infect eukaryotic hosts (Materials and
Methods). (C) The distribution of genome lengths in the IMG/VR subset of the pretraining dataset. (D) The
distribution of plasmid sequence lengths in the IMG/PR subset of the pretraining dataset. (E) The distribution
of contig lengths in GTDB.
[Figure S2 plot: "Perplexity scaling with context size"; x-axis, context size (2k to 165k); y-axis, eval. PPL over the last 2k tokens (approximately 2.82 to 2.96); curves by training context size.]
Fig. S2 | Perplexity scaling in context length. Perplexity on a subset of the OpenGenome validation set with
Evo 131k as a function of sequence length, or context length. The perplexity is computed over the last 2048
nucleotides of each sequence, with increasing lengths of the prefix and thus of the context available to the
model. We observe that perplexity continues to decrease beyond the training context length of 131k, indicated by the vertical dashed line.
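As an illustrative sketch only (not the exact evaluation code used in this work), the metric can be computed from per-token log-probabilities as follows; the synthetic log-probabilities below stand in for scores that would, in practice, come from scoring real sequences with the model.

import numpy as np

def last_window_perplexity(token_logprobs, window=2048):
    """Perplexity over the final `window` tokens of a sequence, given
    per-token log-probabilities (natural log) from an autoregressive
    model conditioned on all preceding tokens."""
    tail = np.asarray(token_logprobs[-window:], dtype=float)
    return float(np.exp(-tail.mean()))

# Synthetic example for increasing context sizes.
rng = np.random.default_rng(0)
for context in (2_048, 16_384, 32_768, 65_536, 131_072):
    fake_logprobs = rng.normal(loc=-1.05, scale=0.1, size=context)  # placeholder scores
    print(context, round(last_window_perplexity(fake_logprobs), 3))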
[Figure S3 plots: "Scaling Rates for Compute-Suboptimal Model Sizes"; five panels with offsets of 5%, 8%, 11%, 14%, and 17%; x-axis, FLOPS (10^19 to 10^20); y-axis, eval. PPL (3.0 to 5.0); curves for Transformer++, Mamba, Hyena, and StripedHyena.]
Fig. S3 | Scaling rates for compute-suboptimal model sizes by architecture. We quantify the effect on
perplexity scaling caused by a suboptimal allocation of compute budget to model or dataset size (e.g. training
a smaller model for more tokens). We estimate the compute-optimal model size (Figure S5) for each compute
budget, then reduce it by a percentage (the offset). The corresponding perplexity is obtained via the IsoFLOP
curves (Figure 1F). Transformer++ perplexity scaling rapidly degrades outside the compute-optimal frontier,
in contrast to Mamba, Hyena, and StripedHyena.
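The offset procedure can be sketched as follows, assuming an IsoFLOP curve is available as (model size, perplexity) pairs at a fixed compute budget; the curve values in this example are illustrative placeholders, not measured results.

import numpy as np

def perplexity_at_offset(model_sizes, ppls, offset_frac):
    """Locate the compute-optimal model size (the minimum of an IsoFLOP
    curve) and return the interpolated perplexity after shrinking that
    size by `offset_frac` (e.g., 0.05 for a 5% offset)."""
    sizes = np.asarray(model_sizes, dtype=float)
    ppls = np.asarray(ppls, dtype=float)
    order = np.argsort(sizes)
    sizes, ppls = sizes[order], ppls[order]
    optimal_size = sizes[np.argmin(ppls)]
    offset_size = optimal_size * (1.0 - offset_frac)
    return optimal_size, float(np.interp(offset_size, sizes, ppls))

# Illustrative IsoFLOP curve for a single compute budget.
sizes = [4e7, 8e7, 1.6e8, 3.2e8, 6.4e8]
ppls = [3.90, 3.60, 3.50, 3.55, 3.80]
print(perplexity_at_offset(sizes, ppls, offset_frac=0.05))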
Fig. S4 | Direct comparison of training loss curves during hyperparameter tuning for 7B models. We tune several Transformer++ and Mamba models as baselines, sweeping learning rate, batch size, sequence length, and model depth-to-width ratio. In all settings, Hyena and StripedHyena outperformed Transformer++ and Mamba, and both baselines experienced instability during training.
[Figure S5 plots: compute-optimal model size (left) and compute-optimal number of tokens (right) vs. FLOPS (10^19 to 10^20) for Transformer++, Mamba, Hyena, and StripedHyena.]
Fig. S5 | Compute-optimal tokens and model size by model. Compute-optimal allocation to dataset size
(number of tokens) and model size (number of parameters) for each compute budget, measured in FLOPS.
[Figure S6 plot: StripedHyena train loss (approximately 1.0 to 1.6) vs. FLOPs (10^17 to 10^20) for model sizes of 6.90e+07, 8.40e+07, 9.90e+07, and 1.35e+08 parameters.]
Fig. S6 | Loss vs. FLOPs. StripedHyena training loss vs. FLOPs for near-compute-optimal runs from 8e18 to 8e19 FLOPs.
[Figure S7 plot: "PPL vs % Hyena layers (10 total layers)"; x-axis, % Hyena layers (0 to 100); y-axis, perplexity (approximately 3.34 to 3.48).]
Fig. S7 | Perplexity comparison by percentage of Hyena layers. We progressively replace attention layers with Hyena layers and observe improved perplexity, with an optimal hybrid ratio of 90% Hyena and 10% attention. All models use 10 layers and a width of 768. Hyperparameters and FLOPs are controlled.
Fig. S8 | Performance of Evo on out-of-distribution mutational effect prediction. (A) Predictive perfor-
mance of nucleotide and protein language models on mutational effect prediction for human proteins, mea-
sured via Spearman correlation. Bar height indicates the mean; each dot indicates a different DMS study. LM:
language model; Nucl. Trans.: Nucleotide Transformer. Related to Fig. 2B. (B) Relationship between the Evo
perplexity of the wildtype nucleotide sequence (horizontal axis) and the ability of Evo to perform zero-shot mutational effect prediction for that protein, measured via Spearman correlation (vertical axis). Each dot corresponds to a different protein; dots are colored according to whether the protein is from E. coli (blue), another prokaryote (green), or human (purple). We observed a significant negative correlation (Spearman 𝑟 = −0.79, two-sided 𝑡-distributed 𝑃 = 8.7 × 10⁻⁴) between perplexity and zero-shot function prediction performance.
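As a minimal sketch of the zero-shot setup (not the exact scoring pipeline used here), each variant is scored by its change in model log-likelihood relative to wildtype and compared to the experimental measurements with a Spearman correlation; the values below are synthetic placeholders rather than Evo outputs.

from scipy.stats import spearmanr

# Zero-shot score per variant: log P(variant) - log P(wildtype), where both
# sequences would be scored by the autoregressive language model.
variant_scores = [-0.4, -1.2, 0.1, -2.3, -0.7]   # hypothetical model scores
dms_fitness = [0.8, 0.3, 1.1, 0.0, 0.5]          # matched experimental values

rho, pval = spearmanr(variant_scores, dms_fitness)
print(f"Spearman r = {rho:.2f}, P = {pval:.2g}")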
Fig. S9 | Filtering and selection of Evo-generated Cas9 candidates. 2,079,750 nucleotide sequences were
generated from a finetuned Evo model. All ORFs were extracted from generated sequences and processed
with a Cas9 pHMM to identify generated loci containing potential Cas9 proteins, yielding 817,049 generations
containing a Cas9 ORF longer than 600 residues and a valid CRISPR array (as detected with MinCED). Ex-
tracted Cas9 amino acid sequences were aligned with Cas9 protein sequences found in the finetuning dataset.
We retained 132,079 generations that contained ORFs that bore less than 90% identity to the nearest training
sequence. The remaining generations were scored based on the quality of the pairwise sequence alignments
between the extracted Cas9 protein sequence and its nearest match in the training set. Alignments with an
even distribution of mismatches across the Cas9 ORFs were scored highly; alignments with an imbalanced dis-
tribution of mismatches were down-weighted. The Cas9 ORFs from the top-ranking 2,000 generations were
folded with AlphaFold2. From the predicted structures, 268 generations were selected based on pLDDT, radius
of gyration, the presence of a detected tracrRNA sequence, and the presence of RuvC and HNH domains in
the Cas9 ORF. The final 11 Evo-generated Cas9 candidates were selected from this subset through manual
inspection of predicted Cas9 structure and predicted sgRNA secondary structure.
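The identity filter and alignment-evenness ranking can be sketched as follows; this is one plausible formulation under stated assumptions (equal-length pairwise alignments, windowed mismatch counts), not necessarily the exact scoring used in the pipeline, and the example sequences are hypothetical.

import numpy as np

def percent_identity(aln_a, aln_b):
    """Percent identity between two equal-length aligned sequences."""
    matches = sum(a == b for a, b in zip(aln_a, aln_b) if a != "-" and b != "-")
    return 100.0 * matches / len(aln_a)

def mismatch_evenness(aln_a, aln_b, window=50):
    """Score how evenly mismatches are spread along the alignment: lower
    variance of per-window mismatch counts gives a higher (less negative)
    score, so evenly distributed mismatches rank above clustered ones."""
    mismatches = np.array([a != b for a, b in zip(aln_a, aln_b)], dtype=float)
    n_windows = max(1, len(mismatches) // window)
    counts = [mismatches[i * window:(i + 1) * window].sum() for i in range(n_windows)]
    return -float(np.var(counts))

# Hypothetical aligned ORF fragments (placeholders).
a = "ATGGCTAAGCGTACGATCGGA"
b = "ATGGCAAAGCGTACGTTCGGA"
print(round(percent_identity(a, b), 1), round(mismatch_evenness(a, b, window=7), 2))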
Fig. S10 | CRISPR-Cas finetuning dataset composition. Breakdown of the ORFs found in the 72,813 Cas
loci used to finetune the base 8k Evo model. The 72,813 Cas loci were initially categorized as Cas9, Cas12a, Cas12b, Cas12g, Cas13a, Cas13b, or Cas13d sequences by extracting the ORFs using Prodigal and
searching amino acid sequences against pHMMs for each Cas subtype. We then clustered each categorized
subset of loci at 90% and 50% identity using MMSeqs2 before plotting the distributions of subtypes across
clustering thresholds.
Fig. S11 | Sequence identities of Cas9 ORFs in finetuning dataset. For every Cas9 ORF in our finetuning dataset, we computed the sequence identity to the nearest Cas ORF in the dataset (excluding the query sequence). Percentages are plotted on both linear (A) and log (B) scales. The percent identity of EvoCas9-1 to the nearest Cas ORF in the finetuning dataset is marked on both plots as a dashed red line.
[Figure S12 plots: histograms of identity to CRISPR-Cas ORFs in OpenGenome (0 to 1) for Cas9, Cas12, and Cas13 generations; top row, not pre-trained; bottom row, pre-trained; counts shown on a log scale.]
Fig. S12 | Sequence identities of all generated Cas ORFs to fine-tuning dataset. Sequence identities of
ORFs sampled from pretrained and non-pretrained Evo models. 400,000 generations were analyzed for each
Cas-model combination.
[Figure S13 plots: histograms of degeneracy score (0 to 1) for Cas9, Cas12, and Cas13 generations; top row, not pre-trained; bottom row, pre-trained; counts shown on a log scale.]
Fig. S13 | Comparison of generation quality between samples drawn from pre-trained and non-pre-
trained models. A degeneracy score was calculated for samples drawn from Evo by computing the coverage
of each sequence by repetitive substrings longer than 9 nucleotides (Materials and Methods). According to this degeneracy metric, prompting with a Cas9 special token yields higher-quality samples when sampling from a pre-trained Evo model that was finetuned on CRISPR-Cas sequences.
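One way to implement such a score (a sketch that may differ in detail from the definition in Materials and Methods) is to mark every position covered by a 10-mer that occurs more than once in the sequence and report the covered fraction.

from collections import defaultdict

def degeneracy_score(seq, k=10):
    """Fraction of positions covered by k-mers (here k = 10, i.e., repetitive
    substrings longer than 9 nt) that occur more than once in the sequence."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    covered = set()
    for starts in positions.values():
        if len(starts) > 1:
            for s in starts:
                covered.update(range(s, s + k))
    return len(covered) / len(seq) if seq else 0.0

print(degeneracy_score("ATGCATGCATGCATGCATGC"))      # highly repetitive -> 1.0
print(degeneracy_score("ATGCGTACGTTAGCCTAGATCCGA"))  # no repeated 10-mers -> 0.0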
[Figure S14 gel image: lanes for candidates 1 through 11, each with (+) and without (−) sgRNA, plus the IVTT template; size markers 100 to 2000 bp.]
Fig. S14 | CRISPR-Cas9 experimental validation screen. 11 CRISPR-Cas9 candidates along with their co-
generated guides were incubated at 37°C in a 20-hour cleavage reaction with a DNA target containing a shared
protospacer sequence and an NGG PAM (other PAM sequences were not tested). Candidate 4 demonstrated
robust DNA cleavage activity that was guide RNA-specific and was named EvoCas9-1.
[Figure S15 gel images. (A) SpCas9 + SpCas9 sgRNA vs. EvoCas9-1 + EvoCas9-1 sgRNA. (B) NEB SpCas9 + SpCas9 sgRNA vs. purified SpCas9 + SpCas9 sgRNA. (C) SpCas9 + SpCas9 sgRNA vs. EvoCas9-1 + SpCas9 sgRNA. (D) SpCas9 + EvoCas9-1 sgRNA vs. EvoCas9-1 + EvoCas9-1 sgRNA. Timepoints: NTG, 5 min, 15 min, 1 hr, 3 hr; size markers 100 to 2000 bp.]
Fig. S15 | Additional analysis of EvoCas9-1 effector and sgRNA performance. Cleavage reactions were performed at 37°C with a 10:10:1 molar ratio of Cas9:sgRNA:target and a target concentration of 1 nM. (A) Full
cleavage gel corresponding to Fig. 3F. (B) Comparison of in-house purified SpCas9 with commercially avail-
able SpCas9 (NEB M0386T). (C) Timecourse comparison of in-house purified SpCas9 with purified EvoCas9-1,
both with an SpCas9 sgRNA. (D) Comparison of in-house purified SpCas9 with purified EvoCas9-1, both with
an EvoCas9-1 sgRNA.
[Figure S16 structure images: (A) EvoCas9-1 (AlphaFold 3 prediction) with its sgRNA and DNA target; (B) electrostatic potential; SpCas9 shown with its sgRNA and DNA target for comparison; each view includes a 180° rotation.]
Fig. S16 | Predicted structures of EvoCas9-1. (A) Surface view of EvoCas9-1. (B) EvoCas9-1 electrostatics
calculated using the coulombic function in ChimeraX. (C) Surface view of SpCas9 (PDB: 4OO8). (D) SpCas9
electrostatics.
[Figure S17: (A) SDS-PAGE gel of SpCas9 and EvoCas9-1 with molecular weight markers (15 to 200 kDa); (B) SEC trace, absorbance (mAU, 0 to 1000) vs. elution volume (0 to 40 mL).]
Fig. S17 | Purification of EvoCas9-1 and SpCas9. (A) SDS-PAGE analysis of final concentrated stocks of
EvoCas9-1 and SpCas9 protein. (B) Size exclusion chromatography (SEC) trace from the purification of
EvoCas9-1. 0.5 mL fractions between 11.5 mL and 14 mL were pooled, concentrated, and used for biochemical assays.
[Figure S18 plots. (A) Percentage of generated loci with TnpA only, TnpB only, TnpA + TnpB, or no detected protein, by prompt (<is200>, <is605>). (B) ESMFold pLDDT vs. % identity to training for TnpA − IS200, TnpA − IS605, and TnpB − IS605 (n = 1,800 each). (C) Probability density of protein length (aa) for TnpA and TnpB, training vs. generated. (D) Per-position entropy tracks for ISDge10 (1,059 bp), ISHp608 (2,377 bp), ISEvo2 (2,400 bp), ISSpn6 (1,074 bp), ISStin10 (1,065 bp), IS200G, and ISEvo1. (E) Average entropy from −200 to +200 bp around the LE and RE of TnpA and TnpA/TnpB elements.]
Fig. S18 | Additional analysis of generated IS200/IS605 sequences. (A) The detected TnpA/TnpB proteins
as a percentage of all loci generated with each special token. Colors indicate which prompt token was used.
“None” indicates that no TnpA or TnpB proteins were detected in the generated sequence. Includes 2,373,150
sequences generated with <is200> token prompt, and 2,104,350 generated with <is605> token prompt.
(B) Relationship between ESMFold pLDDT and the percent identity of each generated protein to its nearest training example. Shown for generated sequences in which only a TnpA was generated (TnpA - IS200), or in which both a TnpA and a TnpB were detected (TnpA - IS605, TnpB - IS605). 1,800 generated protein sequences were randomly selected per category, with 200 selected from each equal-width bin from 10% to 100%. (C) Protein length distributions of the training examples and the generated proteins. The TnpA distribution was cut off at 450 amino acids to improve visualization. (D) The entropy of the conditional probabilities at each position across IS200/IS605 loci. Shown are 2 natural IS605 elements (ISDge10 and ISHp608), 3 natural IS200 elements (ISSpn6, ISStin10, IS200G), and 2 Evo-generated elements (ISEvo1, ISEvo2). TnpA (short) and TnpB (long) coding sequences are shown in blue, and the start and end of the complete element are shown
with black bars. (E) The average entropy within 250 bp of the 5′ and 3′ ends of IS200/IS605 coding sequences,
including 50 bp of the CDS itself. The entropy was calculated at each position across natural IS200 (top) or
IS605 (middle, bottom) sequences in the training set (N = 228,763). Sequence positions were aligned with
respect to the beginning and end of each respective CDS. (Top) All sequences with only a detected TnpA
(IS200-like). (Middle) All sequences with a TnpA followed by a TnpB (IS605-like). (Bottom) Sequences
where the TnpB precedes the TnpA on the forward strand (IS605-like). Gray ribbon indicates the standard
deviation of the entropy values.
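A minimal sketch of the per-position entropy calculation in panels (D) and (E), assuming the model's conditional probability distribution over the nucleotide vocabulary is available at each position; the probabilities below are synthetic placeholders rather than Evo outputs.

import numpy as np

def positional_entropy(prob_matrix):
    """Shannon entropy (in nats) of the next-token distribution at each
    position; prob_matrix has shape (sequence_length, vocab_size) and
    each row sums to 1."""
    p = np.clip(np.asarray(prob_matrix, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Synthetic example: a well-constrained position followed by an unconstrained one.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],   # low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximum entropy, ln(4) ~= 1.39
])
print(positional_entropy(probs))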
Fig. S19 | IS605 categorical Jacobian analysis. Categorical Jacobian matrices (70) were computed for
natural IS605 systems and plotted as heatmaps thresholded between 0 and 4. Regions corresponding to the
LE, TnpA, TnpB, and RE elements are annotated in the heatmaps.
Fig. S20 | Filtering and selection of Evo-generated IS200/IS605-like candidates. (A, B) Downstream
computational metrics, computed on both the TnpA protein and DNA sequences, were used to filter Evo-
generated sequences to obtain a high-confidence set of IS200-like and IS605-like candidates. (C) Scatter plot
of max stem length of the longest perfect hairpin (with ≤ 5 bp loop) vs. the start position of the longest
perfect hairpin for the sequences around the tnpA CDS in Evo IS200 generations and scrambled sequences
preserving the same base frequencies. Black dashed lines depict the threshold used. (D) Schematic describing
calculation of upstream base pair propensity vector from sequences flanking the tnpA CDS in Evo IS200-like
element generations. (E-G) Hierarchical clustering of upstream base pair propensity vectors of Evo-generated
IS200-like candidate sequences after TnpA protein filtering and other DNA sequence filtering steps. The scale bar is the same as in panel (D). Selected sequences and natural IS200 sequences are labeled.
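A brute-force sketch of the hairpin search underlying panel (C); the minimum loop length of 3 bp used here is an assumption, and the exact hairpin definition in Materials and Methods may differ.

def longest_perfect_hairpin(seq, max_loop=5, min_loop=3):
    """Return (stem_length, start_position) of the longest perfect hairpin:
    a stem whose 5' arm is the reverse complement of its 3' arm, separated
    by a loop of min_loop to max_loop nucleotides."""
    pair = {"A": "T", "T": "A", "G": "C", "C": "G"}
    best_stem, best_start = 0, -1
    n = len(seq)
    for j in range(n):                      # candidate loop starts at position j
        for loop in range(min_loop, max_loop + 1):
            stem = 0
            # extend the stem outward from the loop while the arms stay paired
            while (j - 1 - stem >= 0 and j + loop + stem < n
                   and pair.get(seq[j - 1 - stem]) == seq[j + loop + stem]):
                stem += 1
            if stem > best_stem:
                best_stem, best_start = stem, j - stem
    return best_stem, best_start

# Toy sequence with a 5-bp stem and 4-nt loop; the hairpin starts at position 3.
print(longest_perfect_hairpin("AAAGGGCCTTTTGGCCCAAA"))  # -> (5, 3)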
Fig. S21 | Activity of 12 other active Evo-generated IS200/IS605-like systems. 2% agarose gel with SYBR Safe showing in vitro activity on ssDNA substrates of TnpA proteins from Evo-generated IS200-like and IS605-like elements ISEvo3 through ISEvo14, alongside mutants in which the putative catalytic tyrosine (identified by an amino acid motif match for YXXXQ) is mutated to alanine. Compare to Figs. 4F and 4I.
Fig. S22 | Validation of excision and insertion via nanopore sequencing. (A) Schematic describing
nanopore jump map computation procedure. (B) Nanopore jump map from nanopore sequencing of PCR
products for natural IS605 element ISHp608. (C) Nanopore jump map from nanopore sequencing of PCR
products for natural IS200 element ISSpn6, revealing multiple LEs and REs. (D-E) Nanopore jump maps from
nanopore sequencing of PCR products from in vitro TnpA excision/insertion assay of Evo-generated IS200-like
element ISEvo1 and IS605-like element ISEvo2.
Fig. S23 | Nanopore sequencing jump maps for experimentally validated Evo-generated IS200-like and
IS605-like systems. Jump maps computed as in Fig. S22 for nanopore sequencing of PCR products from in
vitro TnpA excision/insertion assay.
Fig. S24 | ISSpn6 tnpA ORF and hairpins. ISSpn6 can mobilize and insert with multiple LEs and REs, which
are illustrated above.
Fig. S25 | Evo-generated IS200-like systems. Illustrations of 10 additional active Evo-generated IS200-like
systems, including element annotations and relevant DNA and protein features. Compare to Fig. 4E.
Fig. S26 | Evo-generated IS605-like systems. Illustrations of two additional active Evo-generated IS605-like
systems, including element annotations and relevant DNA, RNA, and protein features. Compare to Fig. 4H.
Fig. S27 | Gene essentiality prediction under different settings. (A) Scatterplots that compare the AUROC
values between different models or different context windows. Axis labels are the same as the horizontal
labels in Fig. 5C. Each dot corresponds to a different whole-genome essentiality study. (B) Gene essentiality
prediction performance as in Fig. 5C except plotted according to the average precision, where essential genes are considered the “positive” examples. (C) Gene essentiality prediction performance across 58 studies (each dot corresponds to a different study). We performed in silico mutagenesis of each coding sequence in a genome and computed the change in Evo likelihood, which we used to predict gene essentiality. “Evo (8k context, multi-stop)” indicates a mutagenesis strategy that inserts multiple stop codons at the beginning of each coding sequence. “Evo (8k context, single stop)” indicates a mutagenesis strategy that inserts a single stop codon at the beginning of each coding sequence. “Evo (8k context, deletion)” indicates a mutagenesis strategy that
deletes the entire sequence of the gene. “Position-based prediction” indicates a prediction strategy (not using
Evo) in which we use the position of a gene in the reference genome annotation as the predictor variable for
gene essentiality. See Materials and Methods for more details.
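A minimal sketch of the multi-stop in silico mutagenesis, assuming a forward-strand coding sequence and a sequence-scoring function; score_loglik and the toy scorer below are hypothetical placeholders, not the Evo API.

def insert_stop_codons(genome, cds_start, n_stops=3, stop="TAA"):
    """Replace the codons immediately after the start codon of a
    forward-strand CDS with n_stops stop codons (premature termination)."""
    edit_start = cds_start + 3                   # keep the start codon
    edit_end = edit_start + 3 * n_stops
    return genome[:edit_start] + stop * n_stops + genome[edit_end:]

def essentiality_score(genome, cds_start, score_loglik, n_stops=3):
    """Change in model log-likelihood after disrupting the gene; a larger
    drop is taken as evidence that the gene is essential."""
    mutant = insert_stop_codons(genome, cds_start, n_stops=n_stops)
    return score_loglik(mutant) - score_loglik(genome)

# Toy usage with a stand-in scorer (counts a motif instead of a real likelihood).
toy_genome = "CCCATGAAAGGGTTTCCCTGA" * 3
toy_score = lambda s: s.count("ATGAAA")
print(essentiality_score(toy_genome, cds_start=3, score_loglik=toy_score))  # -> -1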
Fig. S28 | Additional analysis of genome-scale sequence generations. (A-B) GC content (A) and dinu-
cleotide frequencies (B) of Evo-generated, natural, and random sequences. (C) Codon usage of Evo-generated
sequences compared to natural reference genomes. The sequences used in this analysis were selected from
the TUD analysis in Fig. 6. (D) Mean stop codon ratios in all three reading frames of Evo-generated, natural,
and random ORFs. (E) Histograms representing the distribution of statistics computed on ESMFold-predicted
structures. These structures correspond to coding sequences found on sixteen Evo-generated sequences, each of length ∼1 Mb. These statistics are, from left to right: the percentage of residues in 𝛼-helices, the percentage of residues in 𝛽-sheets, the mean backbone pLDDT, and the TM-score to the closest UniRef50 structure in the AlphaFold Protein Structure Database as determined by Foldseek easy-search.
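The sequence composition statistics compared in panels (A) and (B) can be sketched as follows for plain A/C/G/T sequences (an illustration, not the analysis code used in this work).

from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def dinucleotide_frequencies(seq):
    """Frequencies of all 16 overlapping dinucleotides."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {a + b: counts.get(a + b, 0) / total for a, b in product("ACGT", repeat=2)}

seq = "ATGCGCGATATCGCGGCTAA"
print(round(gc_content(seq), 3))
print({k: round(v, 3) for k, v in dinucleotide_frequencies(seq).items() if v > 0})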
Stage 1
sequence len. 8k
batch size 4M tokens
scheduler cosine decay
learning rate 3e-4
min. learning rate 3e-5
position emb. RoPE (29)
weight decay 0.1
weight decay in hyena layers 0.0
precision mixed (BF16, FP32 on long convolution parameters)
optimizer Adam
optimizer betas 0.9, 0.95
eps 1e-8
norm RMSnorm (93)
dropout 0.0
warmup 1% total steps
checkpoint activation True
model_width 4096
num_attn_heads 32
hyena short conv. length 3
GLU width 4/3 model_width
vocab size 512
Stage 2
sequence len. 131k
batch size 1M tokens
learning rate 1e-4
min. learning rate 1e-5
position emb. linearly interpolated RoPE
interp. factor 16
Dataset Name | Source | Subset | Total Genomes/Loci/Plasmids | Total Bases (M) | Avg. Length (bases)
Bacterial and Archaeal Genomes | GTDB | – | 85,205 | 273,865 | 3,214,184
Prokaryotic Viruses | IMG/VR | – | 2,653,046 | 36,236 | 13,658
Plasmids | IMG/PR | – | 214,950 | 5,827 | 27,106
CRISPR/Cas Loci | Custom Database | Cas9 | 5,566 | 43 | 7,798
CRISPR/Cas Loci | Custom Database | Cas12 | 5,069 | 35 | 6,911
CRISPR/Cas Loci | Custom Database | Cas13 | 498 | 4 | 7,559
IS200/IS605 Loci | Custom Database | IS200 Loci | 219,866 | 239 | 1,085
IS200/IS605 Loci | Custom Database | IS605 Loci | 10,720 | 26 | 2,445
Table S2 | Summary statistics for the OpenGenome datasets. See Materials and Methods for further details on the dataset sources and curation process.
Params (M) d_model glu_size kv_size n_heads n_layers learning rate
1 128 336 64 2 4 9.77E-04
6 320 848 64 5 5 9.57E-04
17 448 1200 64 7 7 9.36E-04
29 512 1360 64 8 9 9.15E-04
40 576 1536 64 8 10 8.95E-04
59 640 1696 64 10 12 8.70E-04
69 640 1712 64 10 14 8.56E-04
84 704 1872 64 11 14 8.37E-04
99 768 2048 64 12 14 8.18E-04
114 768 2048 64 12 16 8.00E-04
121 768 2048 64 12 17 7.75E-04
135 768 2048 64 12 19 7.50E-04
158 832 2224 64 13 19 7.25E-04
175 832 2224 64 13 21 7.00E-04
203 896 2384 64 14 21 6.75E-04
232 896 2384 64 14 24 6.50E-04
266 960 2560 64 15 24 6.25E-04
303 1024 2736 64 16 24 6.00E-04
383 1152 3072 64 18 24 5.66E-04
473 1280 3408 64 20 24 5.33E-04
572 1408 3760 128 11 24 5.00E-04
680 1536 4096 128 12 24 4.75E-04
798 1664 4432 128 13 24 4.55E-04
926 1792 4784 128 14 24 4.33E-04
1063 1920 5120 128 15 24 4.15E-04
1209 1920 5120 128 15 25 4.11E-04
Table S3 | Scaling laws model settings for Transformer++, Hyena, StripedHyena and Mamba. Layer
number for Mamba is doubled (a single block corresponds to two Mamba layers and the dedicated channel
mixer layer is removed, as described in reference (40)). Parameter counts vary slightly for each architecture.
Model Adkar Chen Firnberg Jacquier Kelsic Melnikov Rockah Tsuboyama Weeks Average
Evo 0.25 0.57 0.55 0.47 0.60 0.45 0.32 0.42 0.50 0.46
GenSLM 0.02 0.08 0.41 0.35 0.57 0.01 0.19 0.17 0.37 0.24
Nucleotide Transformer 0.08 0.02 0.03 0.00 0.30 0.00 0.08 0.10 0.02 0.07
RNA-FM 0.16 0.03 0.03 0.01 0.45 0.03 0.00 0.19 0.07 0.11
CARP 640M 0.06 0.67 0.40 0.40 0.56 0.56 0.48 0.28 0.30 0.41
ESM-1v 0.28 0.61 0.72 0.68 0.46 0.56 0.30 0.27 0.62 0.50
ESM-2 3B 0.37 0.66 0.65 0.61 0.49 0.58 0.24 0.24 0.61 0.49
ESM-2 650M 0.32 0.69 0.72 0.68 0.48 0.54 0.24 0.37 0.61 0.52
ProGen2 large 0.07 0.62 0.56 0.52 0.49 0.49 0.30 0.47 0.60 0.46
ProGen2 xlarge 0.23 0.67 0.40 0.38 0.48 0.56 0.32 0.46 0.59 0.46
Table S4 | Zero-shot prokaryotic protein fitness prediction. Spearman correlations between language-model likelihoods and experimental fitness across nine datasets.
Model Findlay Garvie Giacomelli Kotler Silverstein Sun Average
Evo 0.06 0.15 0.09 0.13 0.15 0.29 0.14
GenSLM 0.01 0.11 0.04 0.02 0.03 0.23 0.07
Nucleotide Transformer 0.01 0.10 0.02 0.06 0.03 0.03 0.04
RNA-FM 0.04 0.02 0.03 0.09 0.02 0.07 0.05
CARP-640M 0.12 0.13 0.64 0.72 0.49 0.36 0.41
ESM-1v 0.19 0.18 0.53 0.58 0.53 0.44 0.41
ESM-2 650M 0.21 0.16 0.32 0.60 0.49 0.39 0.36
ESM-2 3B 0.21 0.13 0.37 0.62 0.50 0.38 0.37
ProGen2 large 0.19 0.04 0.54 0.49 0.49 0.34 0.35
ProGen2 xlarge 0.22 0.05 0.18 0.41 0.48 0.37 0.29
Model Kobori Andreasson Domingo Guy Hayden Pitt Zhang Average
Evo 0.17 0.14 0.45 0.24 0.13 0.14 0.60 0.27
GenSLM 0.11 0.10 0.29 0.05 0.19 0.18 0.12 0.15
Nucleotide Transformer 0.20 0.07 0.06 0.05 0.24 0.01 0.20 0.12
RNA-FM 0.03 0.16 0.20 0.05 0.11 0.13 0.56 0.18
Model Hossain Kosuri Urtecho Yu Average
Evo zero-shot 0.34 0.66 0.25 0.47 0.43
Evo ridge regression 0.49 0.58 0.25 0.53 0.46
Evo CNN 0.60 0.56 0.40 0.66 0.56
One-hot ridge regression 0.18 0.07 0.13 0.23 0.15
One-hot CNN 0.35 0.44 0.38 0.57 0.44
GC content 0.39 0.47 0.27 0.29 0.35
GenSLM 0.10 0.07 0.08 0.09 0.09
Promoter Calculator 0.60 0.53 0.69 0.66 0.62
Data S1 (separate file) | Sequences of primers and generated sequences experimentally validated, with
additional statistics.
References and Notes
1. T. H. Morgan, Sex limited inheritance in Drosophila. Science 32, 120–122 (1910).
doi:10.1126/science.32.812.120 Medline
2. J. D. Watson, F. H. C. Crick, Molecular structure of nucleic acids: A structure for deoxyribose
nucleic acid. Nature 171, 737–738 (1953). doi:10.1038/171737a0 Medline
3. M. W. Nirenberg, J. H. Matthaei, The dependence of cell-free protein synthesis in E. coli upon
naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. U.S.A. 47,
1588–1602 (1961). doi:10.1073/pnas.47.10.1588 Medline
4. T. Dobzhansky, Genetics and the Origin of Species (Columbia Univ. Press, 1951).
5. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard,
A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D.
Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S.
Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, D. Hassabis,
Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589
(2021). doi:10.1038/s41586-021-03819-2 Medline
6. A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma, R.
Fergus, Biological structure and function emerge from scaling unsupervised learning to
250 million protein sequences. Proc. Natl. Acad. Sci. U.S.A. 118, e2016239118 (2021).
doi:10.1073/pnas.2016239118 Medline
7. C. Outeiral, C. M. Deane, Codon language embeddings provide strong signals for use in
protein engineering. Nat. Mach. Intell. 6, 170–179 (2024). doi:10.1038/s42256-024-
00791-0
8. S. Li, S. Moayedpour, R. Li, M. Bailey, S. Riahi, L. Kogler-Anele, M. Miladi, J. Miner, F.
Pertuy, D. Zheng, J. Wang, A. Balsubramani, K. Tran, M. Zacharia, M. Wu, X. Gu, R.
Clinton, C. Asquith, J. Skaleski, L. Boeglin, S. Chivukula, A. Dias, T. Strugnell, F. U.
Montoya, V. Agarwal, Z. Bar-Joseph, S. Jager, CodonBERT large language model for
mRNA vaccines. Genome Res. 34, 1027–1035 (2024). doi:10.1101/gr.278870.123
Medline
9. Ž. Avsec, V. Agarwal, D. Visentin, J. R. Ledsam, A. Grabska-Barwinska, K. R. Taylor, Y.
Assael, J. Jumper, P. Kohli, D. R. Kelley, Effective gene expression prediction from
sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
doi:10.1038/s41592-021-01252-x Medline
10. J. L. Watson, D. Juergens, N. R. Bennett, B. L. Trippe, J. Yim, H. E. Eisenach, W. Ahern, A.
J. Borst, R. J. Ragotte, L. F. Milles, B. I. M. Wicky, N. Hanikel, S. J. Pellock, A.
Courbet, W. Sheffler, J. Wang, P. Venkatesh, I. Sappington, S. Vázquez Torres, A.
Lauko, V. De Bortoli, E. Mathieu, S. Ovchinnikov, R. Barzilay, T. S. Jaakkola, F.
DiMaio, M. Baek, D. Baker, De novo design of protein structure and function with
RFdiffusion. Nature 620, 1089–1100 (2023). doi:10.1038/s41586-023-06415-8 Medline
11. A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos
Jr., C. Xiong, Z. Z. Sun, R. Socher, J. S. Fraser, N. Naik, Large language models generate
functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106
(2023). doi:10.1038/s41587-022-01618-2 Medline
12. J. B. Ingraham, M. Baranov, Z. Costello, K. W. Barber, W. Wang, A. Ismail, V. Frappier, D.
M. Lord, C. Ng-Thow-Hing, E. R. Van Vlack, S. Tie, V. Xue, S. C. Cowles, A. Leung, J.
V. Rodrigues, C. L. Morales-Perez, A. M. Ayoub, R. Green, K. Puentes, F. Oplinger, N.
V. Panwar, F. Obermeyer, A. R. Root, A. L. Beam, F. J. Poelwijk, G. Grigoryan,
Illuminating protein space with a programmable generative model. Nature 623, 1070–
1078 (2023). doi:10.1038/s41586-023-06728-8 Medline
13. L. F. DaSilva, S. Senan, Z. M. Patel, A. J. Reddy, S. Gabbita, Z. Nussbaum, C. M. V.
Córdova, A. Wenteler, N. Weber, T. M. Tunjic, T. A. Khan, Z. Li, C. Smith, M. Bejan, L.
K. Louis, P. Cornejo, W. Connell, E. S. Wong, W. Meuleman, L. Pinello, DNA-
Diffusion: Leveraging Generative Models for Controlling Chromatin Accessibility and
Gene Expression via Synthetic Regulatory Elements. bioRxiv 2024.02.01.578352
[Preprint] (2024); https://doi.org/10.1101/2024.02.01.578352.
14. A. Lal, D. Garfield, T. Biancalani, G. Eraslan, regLM: Designing realistic regulatory DNA
with autoregressive language models. bioRxiv 2024.02.14.580373 [Preprint] (2024);
https://doi.org/10.1101/2024.02.14.580373.
15. M. Zvyagin, A. Brace, K. Hippe, Y. Deng, B. Zhang, C. O. Bohorquez, A. Clyde, B. Kale, D.
Perez-Rivera, H. Ma, C. M. Mann, M. Irvin, D. G. Ozgulbas, N. Vassilieva, J. G.
Pauloski, L. Ward, V. Hayot-Sasson, M. Emani, S. Foreman, Z. Xie, D. Lin, M. Shukla,
W. Nie, J. Romero, C. Dallago, A. Vahdat, C. Xiao, T. Gibbs, I. Foster, J. J. Davis, M. E.
Papka, T. Brettin, R. Stevens, A. Anandkumar, V. Vishwanath, A. Ramanathan,
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.
Int. J. High Perform. Comput. Appl. 37, 683–705 (2023).
doi:10.1177/10943420231201154
16. H. Dalla-Torre, L. Gonzalez, J. Mendoza Revilla, N. L. Carranza, A. H. Grzywaczewski, F.
Oteri, C. Dallago, E. Trop, H. Sirelkhatim, G. Richard, M. Skwark, K. Beguir, M. Lopez,
T. Pierrot, The Nucleotide Transformer: Building and Evaluating Robust Foundation
Models for Human Genomics. bioRxiv 2023.01.11.523679 [Preprint] (2023);
https://doi.org/10.1101/2023.01.11.523679.
17. Z. Zhou, Y. Ji, W. Li, P. Dutta, R. Davuluri, H. Liu, DNABERT-2: Efficient foundation
model and benchmark for multi-species genome. arXiv:2306.15006 [q-bio.GN] (2023).
18. Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C.
Yu, D. Metzler, Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization. arXiv:2106.12672 [cs.CL] (2022).
19. S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models
via positional interpolation. arXiv:2306.15595 [cs.CL] (2023).
20. H. Liu, M. Zaharia, P. Abbeel, Ring attention with blockwise transformers for near-infinite
context. arXiv:2310.01889 [cs.CL] (2023).
21. V. Fishman, Y. Kuratov, A. Shmelev, M. Petrov, D. Penzar, D. Shepelin, N. Chekanov, O.
Kardymon, M. Burtsev, GENA-LM: A family of open-source foundational DNA
language models for long sequences. bioRxiv 2023.06.12.544594 [Preprint] (2024);
https://doi.org/10.1101/2023.06.12.544594.
22. Y. Ji, Z. Zhou, H. Liu, R. V. Davuluri, DNABERT: Pre-trained Bidirectional Encoder
Representations from Transformers model for DNA-language in genome. Bioinformatics
37, 2112–2120 (2021). doi:10.1093/bioinformatics/btab083 Medline
23. Y. Hwang, A. L. Cornman, E. H. Kellogg, S. Ovchinnikov, P. R. Girguis, Genomic language
model predicts protein co-regulation and function. Nat. Commun. 15, 2880 (2024).
doi:10.1038/s41467-024-46947-9 Medline
24. M. Poli, J. Wang, S. Massaroli, J. Quesnelle, R. Carlow, E. Nguyen, A. Thomas,
StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models,
GitHub (2023); https://github.com/togethercomputer/stripedhyena.
25. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, A. Anandkumar,
Fourier neural operator for parametric partial differential equations. arXiv:2010.08895
[cs.LG] (2021).
26. A. Gu, K. Goel, C. Ré, Efficiently modeling long sequences with structured state spaces.
arXiv:2111.00396 [cs.LG] (2022).
27. A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, S. De, Resurrecting
Recurrent Neural Networks for Long Sequences. arXiv:2303.06349 [cs.LG] (2023).
28. S. Massaroli, M. Poli, D. Fu, H. Kumbong, R. Parnichkun, D. Romero, A. Timalsina, Q.
McIntyre, B. Chen, A. Rudra, C. Zhang, C. Ré, S. Ermon, Y. Bengio, “Laughing Hyena
Distillery: Extracting Compact Recurrences From Convolutions” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 17072–17116.
29. J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu, RoFormer: Enhanced transformer with
rotary position embedding. Neurocomputing 568, 127063 (2024).
doi:10.1016/j.neucom.2023.127063
30. X. Ma, C. Zhou, X. Kong, J. He, L. Gui, G. Neubig, J. May, L. Zettlemoyer, Mega: Moving
average equipped gated attention. arXiv:2209.10655 [cs.LG] (2023).
31. D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, C. Ré, Hungry hungry hippos:
Towards language modeling with state space models. arXiv:2212.14052 [cs.LG] (2023).
32. J. Pilault, M. Fathi, O. Firat, C. Pal, P.-L. Bacon, R. Goroshin, “Block-state transformers” in
Advances in Neural Information Processing Systems, vol. 36, A. Oh, T. Naumann, A.
Globerson, K. Saenko, M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp.
7311–7329.
33. E. Nguyen, M. Poli, M. Faizi, A. Thomas, M. Wornow, C. Birch-Sykes, S. Massaroli, A.
Patel, C. Rabideau, Y. Bengio, S. Ermon, C. Ré, S. Baccus, “HyenaDNA: Long-Range
Genomic Sequence Modeling at Single Nucleotide Resolution” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 43177–43201.
34. M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré,
Hyena Hierarchy: Towards Larger Convolutional Language Models. arXiv:2302.10866
[cs.LG] (2023).
35. D. H. Parks, M. Chuvochina, C. Rinke, A. J. Mussig, P.-A. Chaumeil, P. Hugenholtz, GTDB:
An ongoing census of bacterial and archaeal diversity through a phylogenetically
consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res.
50, D785–D794 (2022). doi:10.1093/nar/gkab776
36. A. P. Camargo, S. Nayfach, I.-M. A. Chen, K. Palaniappan, A. Ratner, K. Chu, S. J. Ritter, T.
B. K. Reddy, S. Mukherjee, F. Schulz, L. Call, R. Y. Neches, T. Woyke, N. N. Ivanova,
E. A. Eloe-Fadrosh, N. C. Kyrpides, S. Roux, IMG/VR v4: An expanded database of
uncultivated virus genomes within a framework of extensive functional, taxonomic, and
ecological metadata. Nucleic Acids Res. 51, D733–D743 (2023).
doi:10.1093/nar/gkac1037
37. A. P. Camargo, L. Call, S. Roux, S. Nayfach, M. Huntemann, K. Palaniappan, A. Ratner, K.
Chu, S. Mukherjeep, T. B. K. Reddy, I. A. Chen, N. N. Ivanova, E. A. Eloe-Fadrosh, T.
Woyke, D. A. Baltrus, S. Castañeda-Barba, F. de la Cruz, B. E. Funnell, J. P. J. Hall, A.
Mukhopadhyay, E. P. C. Rocha, T. Stalder, E. Top, N. C. Kyrpides, IMG/PR: A database
of plasmids from genomes and metagenomes with rich annotations and metadata. Nucleic
Acids Res. 52, D164–D173 (2024). doi:10.1093/nar/gkad964 Medline
38. J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las
Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van
den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O.
Vinyals, L. Sifre, Training Compute-Optimal Large Language Models. arXiv:2203.15556
[cs.CL] (2022).
39. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A.
Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models.
arXiv:2001.08361 [cs.LG] (2020).
40. A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces.
arXiv:2312.00752 [cs.LG] (2024).
41. J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, A. Rives, “Language models enable zero-shot
prediction of the effects of mutations on protein function” in Advances in Neural
Information Processing Systems, vol. 34, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S.
Liang, J. Wortman Vaughan, Eds. (Curran Associates, Inc., 2021), pp. 29287–29303.
42. P. Notin, M. Dias, J. Frazer, J. Marchena-Hurtado, A. N. Gomez, D. Marks, Y. Gal,
“Tranception: Protein Fitness Prediction with Autoregressive Transformers and
Inference-time Retrieval” in Proceedings of the 39th International Conference on
Machine Learning, vol. 162, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S.
Sabato, Eds. (PMLR, 2022), pp. 16990–17017.
43. G. Benegas, C. Albors, A. J. Aw, C. Ye, Y. S. Song, GPN-MSA: an alignment-based DNA
language model for genome-wide variant effect prediction. bioRxiv 2023.10.10.561776
[Preprint] (2024); https://doi.org/10.1101/2023.10.10.561776.
44. P. Notin, A. W. Kollasch, D. Ritter, L. van Niekerk, S. Paul, H. Spinner, N. Rollins, A.
Shaw, R. Weitzman, J. Frazer, M. Dias, D. Franceschi, R. Orenbuch, Y. Gal, D. S.
Marks, ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction.
bioRxiv 2023.12.07.570727 [Preprint] (2023);
https://doi.org/10.1101/2023.12.07.570727.
45. B. J. Livesey, J. A. Marsh, Updated benchmarking of variant effect predictors using deep
mutational scanning. Mol. Syst. Biol. 19, e11474 (2023). doi:10.15252/msb.202211474
Medline
46. K. K. Yang, N. Fusi, A. X. Lu, Convolutions are competitive with transformers for protein
sequence pretraining. Cell Syst. 15, 286–294.e2 (2024). doi:10.1016/j.cels.2024.01.008
Medline
47. Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y.
Shmueli, A. Dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, A. Rives,
Evolutionary-scale prediction of atomic-level protein structure with a language model.
Science 379, 1123–1130 (2023). doi:10.1126/science.ade2574 Medline
48. E. Nijkamp, J. A. Ruffolo, E. N. Weinstein, N. Naik, A. Madani, ProGen2: Exploring the
boundaries of protein language models. Cell Syst. 14, 968–978.e3 (2023).
doi:10.1016/j.cels.2023.10.002 Medline
49. F.-Z. Li, A. P. Amini, Y. Yue, K. K. Yang, A. X. Lu, Feature Reuse and Scaling:
Understanding Transfer Learning with Protein Language Models. bioRxiv
2024.02.05.578959 [Preprint] (2024); https://doi.org/10.1101/2024.02.05.578959.
50. J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, I. King,
Y. Li, Interpretable RNA foundation model from unannotated data for highly accurate
RNA structure and function predictions. arXiv:2204.00300 [q-bio.QM] (2022).
51. Z. D. Zhang, M. Nayar, D. Ammons, J. Rampersad, G. E. Fox, Rapid in vivo exploration of a
5S rRNA neutral network. J. Microbiol. Methods 76, 181–187 (2009).
doi:10.1016/j.mimet.2008.10.010
52. T. L. LaFleur, A. Hossain, H. M. Salis, Automated model-predictive design of synthetic
promoters to control transcriptional profiles in bacteria. Nat. Commun. 13, 5159 (2022).
doi:10.1038/s41467-022-32829-5 Medline
53. G. Urtecho, A. D. Tripp, K. D. Insigne, H. Kim, S. Kosuri, Systematic dissection of sequence
elements controlling σ70 promoters using a genomically encoded multiplexed reporter
assay in Escherichia coli. Biochemistry 58, 1539–1551 (2018).
doi:10.1021/acs.biochem.7b01069 Medline
54. A. Hossain, E. Lopez, S. M. Halper, D. P. Cetnar, A. C. Reis, D. Strickland, E. Klavins, H.
M. Salis, Automated design of thousands of nonrepetitive parts for engineering stable
genetic systems. Nat. Biotechnol. 38, 1466–1475 (2020). doi:10.1038/s41587-020-0584-2
Medline
55. T. C. Yu, W. L. Liu, M. S. Brinck, J. E. Davis, J. Shek, G. Bower, T. Einav, K. D. Insigne, R.
Phillips, S. Kosuri, G. Urtecho, Multiplexed characterization of rationally designed
promoter architectures deconstructs combinatorial logic for IPTG-inducible systems. Nat.
Commun. 12, 325 (2021). doi:10.1038/s41467-020-20094-3 Medline
56. S. Kosuri, D. B. Goodman, G. Cambray, V. K. Mutalik, Y. Gao, A. P. Arkin, D. Endy, G. M.
Church, Composability of regulatory sequences controlling transcription and translation
in Escherichia coli. Proc. Natl. Acad. Sci. U.S.A. 110, 14024–14029 (2013).
doi:10.1073/pnas.1301301110 Medline
57. H. M. Salis, E. A. Mirsky, C. A. Voigt, Automated design of synthetic ribosome binding sites
to control protein expression. Nat. Biotechnol. 27, 946–950 (2009). doi:10.1038/nbt.1568
58. A. C. Reis, H. M. Salis, An automated model test system for systematic development and
improvement of gene expression models. ACS Synth. Biol. 9, 3145–3156 (2020).
doi:10.1021/acssynbio.0c00394 Medline
59. J. Y. Wang, P. Pausch, J. A. Doudna, Structural biology of CRISPR–Cas immunity and
genome editing enzymes. Nat. Rev. Microbiol. 20, 641–656 (2022). doi:10.1038/s41579-
022-00739-4
60. P. D. Hsu, E. S. Lander, F. Zhang, Development and applications of CRISPR-Cas9 for
genome engineering. Cell 157, 1262–1278 (2014). doi:10.1016/j.cell.2014.05.010
Medline
61. E. V. Koonin, K. S. Makarova, Origins and evolution of CRISPR-Cas systems. Phil. Trans.
R. Soc. B 374, 20180087 (2019). doi:10.1098/rstb.2018.0087
62. M. Jinek, K. Chylinski, I. Fonfara, M. Hauer, J. A. Doudna, E. Charpentier, A programmable
dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337, 816–
821 (2012). doi:10.1126/science.1225829 Medline
63. P. D. Hsu, D. A. Scott, J. A. Weinstein, F. A. Ran, S. Konermann, V. Agarwala, Y. Li, E. J.
Fine, X. Wu, O. Shalem, T. J. Cradick, L. A. Marraffini, G. Bao, F. Zhang, DNA
targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827–832
(2013). doi:10.1038/nbt.2647 Medline
64. J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L.
Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M.
O’Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Žemgulytė, E. Arvaniti, C. Beattie,
O. Bertolli, A. Bridgland, A. Cherepanov, M. Congreve, A. I. Cowen-Rivers, A. Cowie,
M. Figurnov, F. B. Fuchs, H. Gladman, R. Jain, Y. A. Khan, C. M. R. Low, K. Perlin, A.
Potapenko, P. Savy, S. Singh, A. Stecula, A. Thillaisundaram, C. Tong, S. Yakneen, E.
D. Zhong, M. Zielinski, A. Žídek, V. Bapst, P. Kohli, M. Jaderberg, D. Hassabis, J. M.
Jumper, Accurate structure prediction of biomolecular interactions with AlphaFold 3.
Nature 630, 493–500 (2024). doi:10.1038/s41586-024-07487-w Medline
65. N. L. Craig, M. Chandler, M. Gellert, A. M. Lambowitz, P. A. Rice, S. B. Sandmeyer, Eds.,
Mobile DNA III (Wiley, ed. 3, 2020).
66. C. Meers, H. C. Le, S. R. Pesari, F. T. Hoffmann, M. W. G. Walker, J. Gezelle, S. Tang, S.
H. Sternberg, Transposon-encoded nucleases use guide RNAs to promote their selfish
spread. Nature 622, 863–871 (2023). doi:10.1038/s41586-023-06597-1 Medline
67. T. Karvelis, G. Druteika, G. Bigelyte, K. Budre, R. Zedaveinyte, A. Silanskas, D.
Kazlauskas, Č. Venclovas, V. Siksnys, Transposon-associated TnpB is a programmable
RNA-guided DNA endonuclease. Nature 599, 692–696 (2021). doi:10.1038/s41586-021-
04058-1 Medline
68. H. Altae-Tran, S. Kannan, F. E. Demircioglu, R. Oshiro, S. P. Nety, L. J. McKay, M. Dlakić,
W. P. Inskeep, K. S. Makarova, R. K. Macrae, E. V. Koonin, F. Zhang, The widespread
IS200/IS605 transposon family encodes diverse programmable RNA-guided
endonucleases. Science 374, 57–65 (2021). doi:10.1126/science.abj6856
69. O. Barabas, D. R. Ronning, C. Guynet, A. B. Hickman, B. Ton-Hoang, M. Chandler, F.
Dyda, Mechanism of IS200/IS605 family DNA transposases: Activation and transposon-
directed target site selection. Cell 132, 208–220 (2008). doi:10.1016/j.cell.2007.12.029
Medline
70. Z. Zhang, H. K. Wayment-Steele, G. Brixi, H. Wang, D. Kern, S. Ovchinnikov, Protein
language models learn evolutionary statistics of interacting sequence motifs. Proc. Natl.
Acad. Sci. U.S.A. 121, e2406285121 (2024). doi:10.1073/pnas.2406285121
71. P. Siguier, J. Pérochon, L. Lestrade, J. Mahillon, M. Chandler, ISfinder: The reference centre
for bacterial insertion sequences. Nucleic Acids Res. 34, D32–D36 (2006).
doi:10.1093/nar/gkj014 Medline
72. E. P. C. Rocha, A. Danchin, Gene essentiality determines chromosome organisation in
bacteria. Nucleic Acids Res. 31, 6570–6577 (2003). doi:10.1093/nar/gkg859 Medline
73. R. Zhang, H.‐Y. Ou, C.‐T. Zhang, DEG: A database of essential genes. Nucleic Acids Res.
32, D271–D272 (2004). doi:10.1093/nar/gkh024
74. D. Piya, N. Nolan, M. L. Moore, L. A. Ramirez Hernandez, B. F. Cress, R. Young, A. P.
Arkin, V. K. Mutalik, Systematic and scalable genome-wide essentiality mapping to
identify nonessential genes in phages. PLOS Biol. 21, e3002416 (2023).
doi:10.1371/journal.pbio.3002416 Medline
75. K. H. Turner, A. K. Wessel, G. C. Palmer, J. L. Murray, M. Whiteley, Essential genome of
Pseudomonas aeruginosa in cystic fibrosis sputum. Proc. Natl. Acad. Sci. U.S.A. 112,
4110–4115 (2015). doi:10.1073/pnas.1419677112 Medline
76. A. Blanchard, C. Bébéar, The evolution of Mycoplasma genitalium. Ann. N. Y. Acad. Sci.
1230, E61–E64 (2011). doi:10.1111/j.1749-6632.2011.06418.x Medline
77. D. H. Parks, M. Imelfort, C. T. Skennerton, P. Hugenholtz, G. W. Tyson, CheckM:
Assessing the quality of microbial genomes recovered from isolates, single cells, and
metagenomes. Genome Res. 25, 1043–1055 (2015). doi:10.1101/gr.186072.114
78. D. T. Pride, R. J. Meinersmann, T. M. Wassenaar, M. J. Blaser, Evolutionary implications of
microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–158 (2003).
doi:10.1101/gr.335003
79. L. Xu, J. Kuo, J.-K. Liu, T.-Y. Wong, Bacterial phylogenetic tree construction based on
genomic translation stop signals. Microb. Inform. Exp. 2, 6 (2012). doi:10.1186/2042-
5783-2-6
80. G. Korkmaz, M. Holm, T. Wiens, S. Sanyal, Comprehensive analysis of stop codon usage in
bacteria and its correlation with release factor abundance. J. Biol. Chem. 289, 30334–
30342 (2014). doi:10.1074/jbc.M114.606632 Medline
81. T. Seemann, barrnap, GitHub (2018); https://github.com/tseemann/barrnap.
82. N. Goldman, J. L. Thorne, D. T. Jones, Assessing the impact of secondary structure and
solvent accessibility on protein evolution. Genetics 149, 445–458 (1998).
doi:10.1093/genetics/149.1.445 Medline
83. J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le,
Finetuned language models are zero-shot learners. arXiv:2109.01652 [cs.CL] (2022).
84. L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S.
Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A.
Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to
follow instructions with human feedback. arXiv:2203.02155 [cs.CL] (2022).
85. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, “Direct Preference
Optimization: Your Language Model is Secretly a Reward Model” in Advances in Neural
Information Processing Systems, vol. 36, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp. 53728–53741.
86. H. L. Rehm, A. J. H. Page, L. Smith, J. B. Adams, G. Alterovitz, L. J. Babb, M. P. Barkley,
M. Baudis, M. J. S. Beauvais, T. Beck, J. S. Beckmann, S. Beltran, D. Bernick, A.
Bernier, J. K. Bonfield, T. F. Boughtwood, G. Bourque, S. R. Bowers, A. J. Brookes, M.
Brudno, M. H. Brush, D. Bujold, T. Burdett, O. J. Buske, M. N. Cabili, D. L. Cameron,
R. J. Carroll, E. Casas-Silva, D. Chakravarty, B. P. Chaudhari, S. H. Chen, J. M. Cherry,
J. Chung, M. Cline, H. L. Clissold, R. M. Cook-Deegan, M. Courtot, F. Cunningham, M.
Cupak, R. M. Davies, D. Denisko, M. J. Doerr, L. I. Dolman, E. S. Dove, L. J. Dursi, S.
O. M. Dyke, J. A. Eddy, K. Eilbeck, K. P. Ellrott, S. Fairley, K. A. Fakhro, H. V. Firth,
M. S. Fitzsimons, M. Fiume, P. Flicek, I. M. Fore, M. A. Freeberg, R. R. Freimuth, L. A.
Fromont, J. Fuerth, C. L. Gaff, W. Gan, E. M. Ghanaim, D. Glazer, R. C. Green, M.
Griffith, O. L. Griffith, R. L. Grossman, T. Groza, J. M. Guidry Auvil, R. Guigó, D.
Gupta, M. A. Haendel, A. Hamosh, D. P. Hansen, R. K. Hart, D. M. Hartley, D. Haussler,
R. M. Hendricks-Sturrup, C. W. L. Ho, A. E. Hobb, M. M. Hoffman, O. M. Hofmann, P.
Holub, J. S. Hsu, J.-P. Hubaux, S. E. Hunt, A. Husami, J. O. Jacobsen, S. S. Jamuar, E. L.
Janes, F. Jeanson, A. Jené, A. L. Johns, Y. Joly, S. J. M. Jones, A. Kanitz, K. Kato, T. M.
Keane, K. Kekesi-Lafrance, J. Kelleher, G. Kerry, S. S. Khor, B. M. Knoppers, M. A.
Konopko, K. Kosaki, M. Kuba, J. Lawson, R. Leinonen, S. Li, M. F. Lin, M. Linden, X.
Liu, I. U. Liyanage, J. Lopez, A. M. Lucassen, M. Lukowski, A. L. Mann, J. Marshall,
M. Mattioni, A. Metke-Jimenez, A. Middleton, R. J. Milne, F. Molnár-Gábor, N. Mulder,
M. C. Munoz-Torres, R. Nag, H. Nakagawa, J. Nasir, A. Navarro, T. H. Nelson, A.
Niewielska, A. Nisselle, J. Niu, T. H. Nyrönen, B. D. O’Connor, S. Oesterle, S.
Ogishima, V. Ota Wang, L. A. D. Paglione, E. Palumbo, H. E. Parkinson, A. A.
Philippakis, A. D. Pizarro, A. Prlic, J. Rambla, A. Rendon, R. A. Rider, P. N. Robinson,
K. W. Rodarmer, L. L. Rodriguez, A. F. Rubin, M. Rueda, G. A. Rushton, R. S. Ryan, G.
I. Saunders, H. Schuilenburg, T. Schwede, S. Scollen, A. Senf, N. C. Sheffield, N.
Skantharajah, A. V. Smith, H. J. Sofia, D. Spalding, A. B. Spurdle, Z. Stark, L. D. Stein,
M. Suematsu, P. Tan, J. A. Tedds, A. A. Thomson, A. Thorogood, T. L. Tickle, K.
Tokunaga, J. Törnroos, D. Torrents, S. Upchurch, A. Valencia, R. Valls Guimera, J.
Vamathevan, S. Varma, D. F. Vears, C. Viner, C. Voisin, A. H. Wagner, S. E. Wallace,
B. P. Walsh, M. S. Williams, E. C. Winkler, B. J. Wold, G. M. Wood, J. P. Woolley, C.
Yamasaki, A. D. Yates, C. K. Yung, L. J. Zass, K. Zaytseva, J. Zhang, P. Goodhand, K.
North, E. Birney, GA4GH: International policies and standards for data sharing across
genomic research and healthcare. Cell Genomics 1, 100029 (2021).
doi:10.1016/j.xgen.2021.100029
87. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, “Large Language Models are Zero-
Shot Reasoners” in Advances in Neural Information Processing Systems, vol. 35, S.
Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates,
Inc., 2022), pp. 22199–22213.
88. B. L. Hie, V. R. Shanker, D. Xu, T. U. J. Bruun, P. A. Weidenbacher, S. Tang, W. Wu, J. E.
Pak, P. S. Kim, Efficient evolution of human antibodies from general protein language
models. Nat. Biotechnol. 42, 275–283 (2024). doi:10.1038/s41587-023-01763-2
89. V. R. Shanker, T. U. J. Bruun, B. L. Hie, P. S. Kim, Unsupervised evolution of protein and
antibody complexes with a structure-informed language model. Science 385, 46–53
(2024). doi:10.1126/science.adk8946 Medline
90. M. G. Durrant, N. T. Perry, J. J. Pai, A. R. Jangid, J. S. Athukoralage, M. Hiraizumi, J. P.
McSpedon, A. Pawluk, H. Nishimasu, S. Konermann, P. D. Hsu, Bridge RNAs direct
programmable recombination of target and donor DNA. Nature 630, 984–993 (2024).
doi:10.1038/s41586-024-07552-4 Medline
91. Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, “Language Modeling with Gated
Convolutional Networks” in Proceedings of the 34th International Conference on
Machine Learning, vol. 70, D. Precup, Y. W. Teh, Eds. (PMLR, 2017), pp. 933–941.
92. N. Shazeer, GLU variants improve Transformer. arXiv:2002.05202 [cs.LG] (2020).
93. B. Zhang, R. Sennrich, “Root Mean Square Layer Normalization” in Advances in Neural
Information Processing Systems, vol. 32, H. Wallach, H. Larochelle, A. Beygelzimer, F.
d’Alché-Buc, E. Fox, R. Garnett, Eds. (Curran Associates, Inc., 2019).
94. D. Fu, S. Arora, J. Grogan, I. Johnson, E. S. Eyuboglu, A. Thomas, B. Spector, M. Poli, A.
Rudra, C. Ré, “Monarch Mixer: A simple sub-quadratic GEMM-based architecture” in
Advances in Neural Information Processing Systems, vol. 36, A. Oh, T. Naumann, A.
Globerson, K. Saenko, M. Hardt, S. Levine, Eds. (Curran Associates, Inc., 2023), pp.
77546–77603.
95. S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, C. Ré, Zoology:
Measuring and improving recall in efficient language models. arXiv:2312.04927 [cs.CL]
(2023).
96. S. Bhattamishra, A. Patel, P. Blunsom, V. Kanade, Understanding in-context learning in
transformers and LLMs by learning to learn discrete functions. arXiv:2310.03016
[cs.LG] (2023).
97. D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, M. Hoogendoorn, CKConv:
Continuous kernel convolution for sequential data. arXiv:2102.02611 [cs.LG] (2022).
98. A. Gupta, A. Gu, J. Berant, “Diagonal State Spaces are as Effective as Structured State
Spaces” in Advances in Neural Information Processing Systems, vol. 35, S. Koyejo, S.
Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates, Inc.,
2022), pp. 22982–22994.
99. A. Gu, K. Goel, A. Gupta, C. Ré, “On the Parameterization and Initialization of Diagonal
State Space Models” in Advances in Neural Information Processing Systems, vol. 35, S.
Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh, Eds. (Curran Associates,
Inc., 2022), pp. 35971–35983.
100. M. Zhang, K. K. Saab, M. Poli, T. Dao, K. Goel, C. Ré, Effectively modeling time series
with simple discrete state spaces. arXiv:2303.09489 [cs.LG] (2023).
101. J. Wei, P. Lotfy, K. Faizi, S. Baungaard, E. Gibson, E. Wang, H. Slabodkin, E. Kinnaman,
S. Chandrasekaran, H. Kitano, M. G. Durrant, C. V. Duffy, A. Pawluk, P. D. Hsu, S.
Konermann, Deep learning and CRISPR-Cas13d ortholog discovery for optimized RNA
targeting. Cell Syst. 14, 1087–1102.e13 (2023). doi:10.1016/j.cels.2023.11.006 Medline
102. N. A. O’Leary, M. W. Wright, J. R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput,
B. Robbertse, B. Smith-White, D. Ako-Adjei, A. Astashyn, A. Badretdin, Y. Bao, O.
Blinkova, V. Brover, V. Chetvernin, J. Choi, E. Cox, O. Ermolaeva, C. M. Farrell, T.
Goldfarb, T. Gupta, D. Haft, E. Hatcher, W. Hlavina, V. S. Joardar, V. K. Kodali, W. Li,
D. Maglott, P. Masterson, K. M. McGarvey, M. R. Murphy, K. O’Neill, S. Pujar, S. H.
Rangwala, D. Rausch, L. D. Riddick, C. Schoch, A. Shkeda, S. S. Storz, H. Sun, F.
Thibaud-Nissen, I. Tolstoy, R. E. Tully, A. R. Vatsan, C. Wallin, D. Webb, W. Wu, M. J.
Landrum, A. Kimchi, T. Tatusova, M. DiCuccio, P. Kitts, T. D. Murphy, K. D. Pruitt,
Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion,
and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
doi:10.1093/nar/gkv1189
103. A. Almeida, S. Nayfach, M. Boland, F. Strozzi, M. Beracochea, Z. J. Shi, K. S. Pollard, E.
Sakharova, D. H. Parks, P. Hugenholtz, N. Segata, N. C. Kyrpides, R. D. Finn, A unified
catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol.
39, 105–114 (2021). doi:10.1038/s41587-020-0603-3 Medline
104. I.-M. A. Chen, K. Chu, K. Palaniappan, A. Ratner, J. Huang, M. Huntemann, P. Hajek, S.
Ritter, N. Varghese, R. Seshadri, S. Roux, T. Woyke, E. A. Eloe-Fadrosh, N. N. Ivanova,
N. C. Kyrpides, The IMG/M data management and analysis system v.6.0: New tools and
advanced capabilities. Nucleic Acids Res. 49, D751–D763 (2021).
doi:10.1093/nar/gkaa939
105. L. F. Camarillo-Guerrero, A. Almeida, G. Rangel-Pineros, R. D. Finn, T. D. Lawley,
Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e9
(2021). doi:10.1016/j.cell.2021.01.029
106. S. C. Forster, N. Kumar, B. O. Anonye, A. Almeida, E. Viciani, M. D. Stares, M. Dunn, T.
T. Mkandawire, A. Zhu, Y. Shao, L. J. Pike, T. Louie, H. P. Browne, A. L. Mitchell, B.
A. Neville, R. D. Finn, T. D. Lawley, A human gut bacterial genome and culture
collection for improved metagenomic analyses. Nat. Biotechnol. 37, 186–192 (2019).
doi:10.1038/s41587-018-0009-7
107. A. L. Mitchell, A. Almeida, M. Beracochea, M. Boland, J. Burgin, G. Cochrane, M. R.
Crusoe, V. Kale, S. C. Potter, L. J. Richardson, E. Sakharova, M. Scheremetjew, A.
Korobeynikov, A. Shlemov, O. Kunyavskaya, A. Lapidus, R. D. Finn, MGnify: The
microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2020).
doi:10.1093/nar/gkz1035
108. N. D. Youngblut, J. de la Cuesta-Zuluaga, G. H. Reischer, S. Dauser, N. Schuster, C.
Walzer, G. Stalder, A. H. Farnleitner, R. E. Ley, Large-scale metagenome assembly
reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other
genetic diversity. mSystems 5, e01045-20 (2020). doi:10.1128/msystems.01045-20
109. F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A.
Rodriguez, R. Stevens, A. Wilke, J. Wilkening, R. A. Edwards, The metagenomics RAST
server: A public resource for the automatic phylogenetic and functional analysis of
metagenomes. BMC Bioinformatics 9, 386 (2008). doi:10.1186/1471-2105-9-386
Medline
110. S. Sunagawa, L. P. Coelho, S. Chaffron, J. R. Kultima, K. Labadie, G. Salazar, B.
Djahanschiri, G. Zeller, D. R. Mende, A. Alberti, F. M. Cornejo-Castillo, P. I. Costea, C.
Cruaud, F. d’Ovidio, S. Engelen, I. Ferrera, J. M. Gasol, L. Guidi, F. Hildebrand, F.
Kokoszka, C. Lepoivre, G. Lima-Mendez, J. Poulain, B. T. Poulos, M. Royo-Llonch, H.
Sarmento, S. Vieira-Silva, C. Dimier, M. Picheral, S. Searson, S. Kandels-Lewis, Tara
Oceans Coordinators, C. Bowler, C. de Vargas, G. Gorsky, N. Grimsley, P. Hingamp, D.
Iudicone, O. Jaillon, F. Not, H. Ogata, S. Pesant, S. Speich, L. Stemmann, M. B.
Sullivan, J. Weissenbach, P. Wincker, E. Karsenti, J. Raes, S. G. Acinas, P. Bork,
Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
doi:10.1126/science.1261359 Medline
111. R. D. Finn, J. Clements, S. R. Eddy, HMMER web server: Interactive sequence similarity
searching. Nucleic Acids Res. 39, W29–W37 (2011). doi:10.1093/nar/gkr367
112. J. Russel, R. Pinilla-Redondo, D. Mayo-Muñoz, S. A. Shah, S. J. Sørensen,
CRISPRCasTyper: Automated identification, annotation, and classification of CRISPR-
Cas loci. CRISPR J. 3, 462–469 (2020). doi:10.1089/crispr.2020.0059 Medline
113. H. Altae-Tran, S. A. Shmakov, K. S. Makarova, Y. I. Wolf, S. Kannan, F. Zhang, E. V.
Koonin, Diversity, evolution, and classification of the RNA-guided nucleases TnpB and
Cas12. Proc. Natl. Acad. Sci. U.S.A. 120, e2308224120 (2023).
doi:10.1073/pnas.2308224120 Medline
114. M. Steinegger, J. Söding, MMseqs2 enables sensitive protein sequence searching for the
analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
doi:10.1038/nbt.3988 Medline
115. K. Katoh, K. Misawa, K. Kuma, T. Miyata, MAFFT: A novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066
(2002). doi:10.1093/nar/gkf436 Medline
116. W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A.
Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, K. Malik, A. Fan,
S. Bhosale, S. Edunov, M. Lewis, S. Wang, H. Ma, Effective long-context scaling of
foundation models. arXiv:2309.16039 [cs.CL] (2023).
117. J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, S. Sanghai, GQA:
Training generalized multi-query transformer models from multi-head checkpoints.
arXiv:2305.13245 [cs.CL] (2023).
118. E. Firnberg, J. W. Labonte, J. J. Gray, M. Ostermeier, A comprehensive, high-resolution
map of a gene’s fitness landscape. Mol. Biol. Evol. 31, 1581–1592 (2014).
doi:10.1093/molbev/msu081 Medline
119. H. Jacquier, A. Birgy, H. Le Nagard, Y. Mechulam, E. Schmitt, J. Glodt, B. Bercot, E. Petit,
J. Poulain, G. Barnaud, P.-A. Gros, O. Tenaillon, Capturing the mutational landscape of
the beta-lactamase TEM-1. Proc. Natl. Acad. Sci. U.S.A. 110, 13067–13072 (2013).
doi:10.1073/pnas.1215206110
120. B. V. Adkar, A. Tripathi, A. Sahoo, K. Bajaj, D. Goswami, P. Chakrabarti, M. K. Swarnkar,
R. S. Gokhale, R. Varadarajan, Protein model discrimination using mutational sensitivity
derived from deep sequencing. Structure 20, 371–381 (2012).
doi:10.1016/j.str.2011.11.021
121. K. Tsuboyama, J. Dauparas, J. Chen, E. Laine, Y. Mohseni Behbahani, J. J. Weinstein, N.
M. Mangan, S. Ovchinnikov, G. J. Rocklin, Mega-scale experimental analysis of protein
folding stability in biology and design. Nature 620, 434–444 (2023). doi:10.1038/s41586-
023-06328-6 Medline
122. E. D. Kelsic, H. Chung, N. Cohen, J. Park, H. H. Wang, R. Kishony, RNA structural
determinants of optimal codons revealed by MAGE-Seq. Cell Syst. 3, 563–571.e6 (2016).
doi:10.1016/j.cels.2016.11.004
123. R. Weeks, M. Ostermeier, Fitness and functional landscapes of the E. coli RNase III gene
rnc. Mol. Biol. Evol. 40, msad047 (2023). doi:10.1093/molbev/msad047 Medline
124. L. Rockah-Shmuel, Á. Tóth-Petróczy, D. S. Tawfik, Systematic mapping of protein
mutational space by prolonged drift reveals the deleterious effects of seemingly neutral
mutations. PLOS Comput. Biol. 11, e1004421 (2015). doi:10.1371/journal.pcbi.1004421
Medline
125. J. Z. Chen, D. M. Fowler, N. Tokuriki, Comprehensive exploration of the translocation,
stability and substrate recognition requirements in VIM-2 lactamase. eLife 9, e56707
(2020). doi:10.7554/eLife.56707 Medline
126. A. Melnikov, P. Rogov, L. Wang, A. Gnirke, T. S. Mikkelsen, Comprehensive mutational
scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids
Res. 42, e112 (2014). doi:10.1093/nar/gku511 Medline
127. S. Sun, J. Weile, M. Verby, Y. Wu, Y. Wang, A. G. Cote, I. Fotiadou, J. Kitaygorodsky, M.
Vidal, J. Rine, P. Ješina, V. Kožich, F. P. Roth, A proactive genotype-to-patient-
phenotype map for cystathionine beta-synthase. Genome Med. 12, 13 (2020).
doi:10.1186/s13073-020-0711-1
128. R. A. Silverstein, S. Sun, M. Verby, J. Weile, Y. Wu, M. Gebbia, I. Fotiadou, J.
Kitaygorodsky, F. P. Roth, A systematic genotype-phenotype map for missense variants
in the human intellectual disability-associated gene GDI1. bioRxiv 2021.10.06.463360
[Preprint] (2022); https://doi.org/10.1101/2021.10.06.463360.
129. C. W. Garvie, X. Wu, M. Papanastasiou, S. Lee, J. Fuller, G. R. Schnitzler, S. W. Horner,
A. Baker, T. Zhang, J. P. Mullahoo, L. Westlake, S. H. Hoyt, M. Toetzl, M. J. Ranaghan,
L. de Waal, J. McGaunn, B. Kaplan, F. Piccioni, X. Yang, M. Lange, A. Tersteegen, D.
Raymond, T. A. Lewis, S. A. Carr, A. D. Cherniack, C. T. Lemke, M. Meyerson, H.
Greulich, Structure of PDE3A-SLFN12 complex reveals requirements for activation of
SLFN12 RNase. Nat. Commun. 12, 4375 (2021). doi:10.1038/s41467-021-24495-w
130. E. Kotler, O. Shani, G. Goldfeld, M. Lotan-Pompan, O. Tarcic, A. Gershoni, T. A. Hopf, D.
S. Marks, M. Oren, E. Segal, A systematic p53 mutation library links differential
functional impact to cancer mutation pattern and evolutionary conservation. Mol. Cell 71,
178–190.e8 (2018). doi:10.1016/j.molcel.2018.06.012 Medline
131. A. O. Giacomelli, X. Yang, R. E. Lintner, J. M. McFarland, M. Duby, J. Kim, T. P.
Howard, D. Y. Takeda, S. H. Ly, E. Kim, H. S. Gannon, B. Hurhula, T. Sharpe, A.
Goodale, B. Fritchman, S. Steelman, F. Vazquez, A. Tsherniak, A. J. Aguirre, J. G.
Doench, F. Piccioni, C. W. M. Roberts, M. Meyerson, G. Getz, C. M. Johannessen, D. E.
Root, W. C. Hahn, Mutational processes shape the landscape of TP53 mutations in
human cancer. Nat. Genet. 50, 1381–1387 (2018). doi:10.1038/s41588-018-0204-y
132. G. M. Findlay, R. M. Daza, B. Martin, M. D. Zhang, A. P. Leith, M. Gasperini, J. D.
Janizek, X. Huang, L. M. Starita, J. Shendure, Accurate classification of BRCA1 variants
with saturation genome editing. Nature 562, 217–222 (2018). doi:10.1038/s41586-018-
0461-z
133. S. Kobori, Y. Nomura, A. Miu, Y. Yokobayashi, High-throughput assay and engineering of
self-cleaving ribozymes by sequencing. Nucleic Acids Res. 43, e85 (2015).
doi:10.1093/nar/gkv265 Medline
134. J. O. L. Andreasson, A. Savinov, S. M. Block, W. J. Greenleaf, Comprehensive sequence-
to-function mapping of cofactor-dependent RNA catalysis in the glmS ribozyme. Nat.
Commun. 11, 1663 (2020). doi:10.1038/s41467-020-15540-1
135. J. Domingo, G. Diss, B. Lehner, Pairwise and higher-order genetic interactions during the
evolution of a tRNA. Nature 558, 117–121 (2018). doi:10.1038/s41586-018-0170-7
Medline
136. M. P. Guy, D. L. Young, M. J. Payea, X. Zhang, Y. Kon, K. M. Dean, E. J. Grayhack, D. H.
Mathews, S. Fields, E. M. Phizicky, Identification of the determinants of tRNA function
and susceptibility to rapid tRNA decay by high-throughput in vivo analysis. Genes Dev.
28, 1721–1732 (2014). doi:10.1101/gad.245936.114
137. E. J. Hayden, E. Ferrada, A. Wagner, Cryptic genetic variation promotes rapid evolutionary
adaptation in an RNA enzyme. Nature 474, 92–95 (2011). doi:10.1038/nature10083
Medline
138. J. N. Pitt, A. R. Ferré-D’Amaré, Rapid construction of empirical RNA fitness landscapes.
Science 330, 376–379 (2010). doi:10.1126/science.1192001 Medline
139. T. A. Chang, B. K. Bergen, Language model behavior: A comprehensive survey.
arXiv:2303.11504 [cs.CL] (2023).
140. D. Hyatt, G.-L. Chen, P. F. Locascio, M. L. Land, F. W. Larimer, L. J. Hauser, Prodigal:
Prokaryotic gene recognition and translation initiation site identification. BMC
Bioinformatics 11, 119 (2010). doi:10.1186/1471-2105-11-119 Medline
141. C. Bland, T. L. Ramsey, F. Sabree, M. Lowe, K. Brown, N. C. Kyrpides, P. Hugenholtz,
CRISPR recognition tool (CRT): A tool for automatic detection of clustered regularly
interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007). doi:10.1186/1471-
2105-8-209 Medline
142. P. Kunzmann, T. D. Müller, M. Greil, J. H. Krumbach, J. M. Anter, D. Bauer, F. Islam, K.
Hamacher, Biotite: New tools for a versatile Python bioinformatics library. BMC
Bioinformatics 24, 236 (2023). doi:10.1186/s12859-023-05345-6
143. A. Mitrofanov, M. Ziemann, O. S. Alkhnbashi, W. R. Hess, R. Backofen,
CRISPRtracrRNA: Robust approach for CRISPR tracrRNA detection. Bioinformatics 38,
ii42–ii48 (2022). doi:10.1093/bioinformatics/btac466 Medline
144. R. Lorenz, S. H. Bernhart, C. Höner Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, I.
L. Hofacker, ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
doi:10.1186/1748-7188-6-26 Medline
145. E. P. Nawrocki, S. R. Eddy, Infernal 1.1: 100-fold faster RNA homology searches.
Bioinformatics 29, 2933–2935 (2013). doi:10.1093/bioinformatics/btt509 Medline
146. C. Zhang, M. Shine, A. M. Pyle, Y. Zhang, US-align: Universal structure alignments of
proteins, nucleic acids, and macromolecular complexes. Nat. Methods 19, 1109–1115
(2022). doi:10.1038/s41592-022-01585-1 Medline
147. Schrödinger LLC, The PyMOL Molecular Graphics System, version 1.8 (2015).
148. A. R. Gruber, R. Lorenz, S. H. Bernhart, R. Neuböck, I. L. Hofacker, The Vienna RNA
Websuite. Nucleic Acids Res. 36, W70–W74 (2008). doi:10.1093/nar/gkn188
149. W. B. Langdon, J. Petke, R. Lorenz, in Genetic Programming, M. Castelli, L. Sekanina, M.
Zhang, S. Cagnoni, P. García-Sánchez, Eds. (Springer, 2018), pp. 220–236.
150. Z. Weinberg, R. R. Breaker, R2R: Software to speed the depiction of aesthetic consensus
RNA secondary structures. BMC Bioinformatics 12, 3 (2011). doi:10.1186/1471-2105-
12-3 Medline
151. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment
search tool. J. Mol. Biol. 215, 403–410 (1990). doi:10.1016/S0022-2836(05)80360-2
152. M. Mirdita, K. Schütze, Y. Moriwaki, L. Heo, S. Ovchinnikov, M. Steinegger, ColabFold:
Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
doi:10.1038/s41592-022-01488-1 Medline
153. B. Hie, Code for paper “Sequence modeling and design from molecular to genome scale
with Evo,” Zenodo (2024); https://doi.org/10.5281/zenodo.12693561.
154. K. Badal, C. M. Lee, L. J. Esserman, Guiding principles for the responsible development of
artificial intelligence tools for healthcare. Commun. Med. 3, 47 (2023).
doi:10.1038/s43856-023-00279-9 Medline
155. D. Baker, G. Church, Protein design meets biosecurity. Science 383, 349 (2024).
doi:10.1126/science.ado1671 Medline
156. S. Morin, G. Segafredo, M. Piccolis, A. Das, M. Das, N. Loffredi, A. Larbi, K. Mwamelo,
E. Villanueva, S. Nobre, E. Burrone, Expanding access to biotherapeutics in low-income
and middle-income countries through public health non-exclusive voluntary intellectual
property licensing: Considerations, requirements, and opportunities. Lancet Glob. Health
11, e145–e154 (2023). doi:10.1016/S2214-109X(22)00460-0 Medline
157. M. E. Peek, By any means necessary: Why lowering insulin prices is relevant to racial
health equity. Lancet 398, 1783–1784 (2021). doi:10.1016/S0140-6736(21)02315-1
Medline
158. N. B. W. Macfarlane, J. Adams, E. L. Bennett, T. M. Brooks, J. A. Delborne, H.
Eggermont, D. Endy, K. M. Esvelt, B. Kolodziejczyk, T. Kuiken, M. J. Oliva, S. Peña
Moreno, L. Slobodian, R. B. Smith, D. Thizy, D. M. Tompkins, W. Wei, K. H. Redford,
Direct and indirect impacts of synthetic biology on biodiversity conservation. iScience
25, 105423 (2022). doi:10.1016/j.isci.2022.105423 Medline
159. The carbon footprint of computational research. Nat. Comput. Sci. 3, 659 (2023).
doi:10.1038/s43588-023-00506-2