DESCRIPTION

This repository contains all code described in our recent manuscript:

Yekaterina Shulgina*, Marena I. Trinidad*, Conner J. Langeberg*, Hunter Nisonoff*, Seyone Chithrananda*, Petr Skopintsev*, Amos J. Nissley*, Jaymin Patel, Ron S. Boger, Honglue Shi, Peter H. Yoon, Erin E Doherty, Tara Pande, Aditya M. Iyer, Jennifer A. Doudna, Jamie H. D. Cate*. RNA language models predict mutations that improve RNA function. bioRxiv 2024.04.05.588317; doi: https://doi.org/10.1101/2024.04.05.588317

* Contributed equally

It includes the curation of diverse RNA datasets from the Genome Taxonomy Database (GTDB), prediction of optimal growth temperature (OGT) phenotypes, and the application of LM and GNN models for the development of thermostable ribosomes. These resources include examples for dataset preprocessing, the generation of training and test sets utilizing hierarchical clustering with CD-HIT, model validation, and sequence generation. This project is a community effort coordinated by The Innovative Genomics Institute and Department of Electrical Engineering and Computer Sciences at University of California Berkeley.

The GARNET database (GTDB-Acquired RNA with Environmental Temperatures) is freely available here: https://doi.org/10.5281/zenodo.12541208

Software Requirements and Installation with Anaconda

All software requirements are specified in the following yml files. Dependencies may be configured with Anaconda as detailed below.

Dependencies for Rfam curation, data preprocessing, figures and OGT prediction:
- Conda_Environments/data_processing_env.yml
Dependencies for GNN Model:
- Conda_Environments/gnn_environment.yml
Dependencies for LM Model:
- Conda_Environments/lm_environment.yml

Example Installation:

conda env create --file data_processing_env.yml --name data_processing_env

Dataset Curation: Extraction of Rfams from GTDB

Extended directions are available under GTDB_RNA_Curation/rna_alignment_methodology.md

Directory:
- GTDB_RNA_Curation/*

Phenotype Annotation: OGT Prediction and Figure 2

The workflow for predicting optimal growth temperatures with TOME and the generation of Figure 2 and S2 are provided in the notebook below.

Directory:
- GTDB_RNA_Curation/Figure_2_OGT.ipynb

Train/Test Split: Hierarchical Clustering with CD-HIT

Train-test splits were performed using hierarchical clustering with CD-HIT-EST, as described in the manuscript. Step-by-step commands for splitting the rRNA datasets, 228 Rfam sequences, aggregation of the 231 Rfam superset and generation of the thermophile finetuning sets are described in the following notebook: Train_Test_Split/Train_Test_Splits.ipynb.

Directory:
- train_Test_Split/
Dependency Files (Zenodo):
- RNA Sequences in fasta format (*fa.gz)
- hyperthermophiles_60dC_gtdb_ids.txt

Dataset Preprocessing: Create Contact Maps for GNN Model

Directory:
- Contact_Maps/*

Language Model:

An RNA language model built on a GPT framework, starting from the nanoGPT code at https://github.com/karpathy/nanoGPT. Modifications include rotary positional embedding and rms normalization. The optimal tokenization scheme involves overlapping nucleotide triplets, i.e. triplets with 1-nt steps for each token. Models were pretrained (PT) using all RNA sequences or finetuned (FT) on hyperthermophile sequences. Models include those trained on 23S sequences only, or on the 231 RFAM RNA set. Config files for training and logs provided in respective subdirectories.

Directory:
- LM_Model/
Dependency Files:
- configurator.py
- model_RNA_rot.py

Tokenization Example:

Generate Training Tokens:

python prepare_RNA_train.py -i 23S_train.fa.gz -o 23S_triples --type triples

Generate Validation Tokens:

python prepare_RNA_val.py -i 23S_test.fa.gz -o 23S_triples --type triples

Outputs:
- 23S_triples_train.bin
- 23S_triples_val.bin, 23S_triples_meta.pkl

Training Example:

python -u train_RNA_rot_flash.py ./Config/train_23S_triples_0_18_6_300_rot_flash.py &> ./Logs/train_23S_triples_0_18_6_300_rot_flash.log
python -u resume_RNA_rot_flash.py ./Config/resume_23S_triples_0_18_6_300_rot_flash.py &> ./Logs/resume_23S_triples_0_18_6_300_rot_flash.log

Output: out/23S_triples_resume_0_18_6_300_rot_flash.pt

Finetuning Example:

python -u finetune_RNA_rot_flash.py ./Config/finetune_23S_Hyperthermophiles_triples_0_18_6_300_rot_flash.py &> ./Logs/finetune_23S_Hyperthermophiles_triples_0_18_6_300_rot_flash.log

Output:
- out/23S_Hyperthermophiles_triples_finetune_0_18_6_300_rot_flash.pt

GNN Model: Graph-Based RNA Sequence Modeling

Directory:
- GNN_Model/*

Generated Sequences: Sequences Generated by LM and GM Models

23S rRNA sequences (or 16S sequences) generated using a seed from the 5’ end of the corresponding E. coli 23S rRNA or 16S rRNA.

Directories:
- Generated_Sequences/*
- LM_Model/*
Dependency Files:
- model *.pt files
- model_RNA_rot.py
- configurator.py
- Ecoli_23S_7K00.fasta
- Ecoli_16S_rRNA.fasta
Example:

For the RNA LM’s, sequences generated using either the pretrained (PT) or hyperthermophile finetuned (FT) models. Models include those trained on 23S sequences only, or on the 231 RFAM RNA set (231 or Superset in file names).
```
python -u sample_RNA.py &> sample_23S_1.log
```

Validation: Likelihoods of Generated Sequence

Calculation of the probability of generating a particular sequence from a model. Input sequences should all be of the same length, in multifasta format.

Directory:
- Validation/*
Dependency Files:
- Pretrained (PT) model .pt file or hyperthermophile finetuned (FT) model .pt file
- configurator.py
- model_RNA_rot_v2.py

Example:

python -u seqprob_23S_PT_v2_sweep.py &> seqprob_23S_EcoliWT_sweep_231RNAsmodel_PT.log
python -u seqprob_23S_FT_v2_sweep.py &> seqprob_23S_EcoliWT_sweep_231RNAsmodel_FT.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DESCRIPTION

Software Requirements and Installation with Anaconda

Dataset Curation: Extraction of Rfams from GTDB

Phenotype Annotation: OGT Prediction and Figure 2

Train/Test Split: Hierarchical Clustering with CD-HIT

Dataset Preprocessing: Create Contact Maps for GNN Model

Language Model:

GNN Model: Graph-Based RNA Sequence Modeling

Generated Sequences: Sequences Generated by LM and GM Models

Validation: Likelihoods of Generated Sequence

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
Conda_Environments		Conda_Environments
Contact_Maps		Contact_Maps
GNN_Model		GNN_Model
GTDB_RNA_Curation		GTDB_RNA_Curation
Generated_Sequences		Generated_Sequences
LM_Model		LM_Model
Train_Test_Split		Train_Test_Split
Validation		Validation
LICENSE		LICENSE
README.md		README.md

License

Doudna-lab/GARNET_DL

Folders and files

Latest commit

History

Repository files navigation

DESCRIPTION

Software Requirements and Installation with Anaconda

Dataset Curation: Extraction of Rfams from GTDB

Phenotype Annotation: OGT Prediction and Figure 2

Train/Test Split: Hierarchical Clustering with CD-HIT

Dataset Preprocessing: Create Contact Maps for GNN Model

Language Model:

GNN Model: Graph-Based RNA Sequence Modeling

Generated Sequences: Sequences Generated by LM and GM Models

Validation: Likelihoods of Generated Sequence

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages