MultiNano: A Deep Learning Framework for m6A Prediction


MultiNano is a deep learning framework for predicting m6A RNA modifications from the raw electrical signals of Oxford Nanopore sequencing. It provides high accuracy across species and conditions and offers a user-friendly pipeline for researchers. MultiNano supports both training from scratch and direct prediction, as detailed below.

You can read our paper here for more details.

1. Environment Setup

First, create the Conda environment using the provided MultiNano.yml file. This will install all necessary dependencies.

conda env create -f demo/MultiNano.yml
Option Purpose
conda env create Creates a new Conda environment.
-f <FILE> Specifies the path to the .yml file that lists all required packages for the environment.

Tip: After creation, activate the new environment with conda activate MultiNano (the environment name is defined within the .yml file).
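
The tip above covers activation; as a further sanity check, the sketch below confirms the deep learning backend imports cleanly (assuming PyTorch is among the .yml dependencies, which is consistent with the .pt model files used later in this README):

conda activate MultiNano
python -c "import torch; print(torch.__version__)"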


Data Pre-processing Pipeline

This section details the steps required to convert raw Nanopore data into a feature format suitable for model training or prediction.

2. Convert multi-fast5 to single-fast5

The first step is to convert the default multi-fast5 files from the sequencer into single-fast5 format, as required by Tombo.

multi_to_single_fast5 \
  -i demo/input_fast5 \
  -s demo/output_fast5_single \
  --recursive \
  -t 30
Option Purpose
-i <DIR> Input directory containing multi-fast5 files.
-s <DIR> Output directory to save single-fast5 files.
--recursive Recursively searches for files in subdirectories.
-t <INT> Number of parallel threads to use for conversion.
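
As a quick sanity check after conversion (using the demo paths from the command above), you can count the single-fast5 files that were written; the total should match the number of reads:

find demo/output_fast5_single -name "*.fast5" | wc -l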

3. Base-calling with Guppy

Next, perform base-calling on the single-fast5 files. This step generates base calls and writes them back into the fast5 files, which is essential for the subsequent re-squiggling step.

guppy_basecaller \
  -i /input/single_fast5 \
  -s /output/basecall \
  --num_callers 30 \
  --recursive \
  --fast5_out \
  --config rna_r9.4.1_70bps_hac.cfg
Option Purpose
-i <DIR> Input directory of single-fast5 files.
-s <DIR> Output directory for FASTQ files and the updated fast5 files.
--num_callers <INT> Number of parallel base-calling threads to use.
--recursive Traverses input subdirectories to find all fast5 files.
--fast5_out Crucial flag. Ensures base-calling results are written back to fast5 files.
--config <FILE> Specifies the configuration for high-accuracy RNA base-calling.
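
Note that with --fast5_out, Guppy typically writes the updated single-fast5 files to a workspace subdirectory of the output directory; that subdirectory is what Tombo re-squiggles in the next step. A quick look (paths taken from the command above):

ls /output/basecall/workspace | head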

4. Re-squiggle with Tombo

Tombo's resquiggle command aligns the raw electrical signal to a reference genome or transcriptome. This step is critical for accurately mapping signals to specific genomic positions.

tombo resquiggle \
  /path/to/workspace \
  /path/to/reference.fa \
  --rna \
  --overwrite \
  --processes 50 \
  --corrected-group RawGenomeCorrected_001 \
  --basecall-group Basecall_1D_001
Option Purpose
Positional 1 Workspace directory containing the base-called single-fast5 files from the previous step.
Positional 2 The reference transcriptome in FASTA format.
--rna Informs Tombo to use RNA-specific models and expectations for signal alignment.
--overwrite If the command was run before, this will overwrite the previous Tombo output within the fast5 files.
--processes <INT> Number of worker processes for parallel execution.
--corrected-group <STR> Important. The name of the group within the fast5 file where Tombo will store the re-squiggled alignment data. This name is needed for feature extraction.
--basecall-group <STR> The name of the group within the fast5 file where Guppy stored its base calls. This must match the output from Guppy. Basecall_1D_001 is a common default.
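
To confirm the re-squiggle succeeded, you can inspect a single fast5 with the HDF5 command-line tools (h5ls ships with the HDF5 utilities and is an assumed extra here; the file name is a placeholder):

h5ls -r /path/to/workspace/example_read.fast5 | grep RawGenomeCorrected_001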

5. Map Reads and Generate Error Profiles

This series of commands aligns the base-called reads to the reference, filters them, and generates a detailed TSV file that describes matches, mismatches, and indels for each read.

# 1. Concatenate all FASTQ files into one
cat /path/to/fast5_guppy/*.fastq > test.fastq

# 2. Align reads, convert to BAM, and sort
minimap2 -t 30 -ax map-ont ref.transcript.fa test.fastq \
  | samtools view -hSb \
  | samtools sort -@ 30 -o test.bam

# 3. Index the BAM file
samtools index test.bam

# 4. Generate a detailed error profile in TSV format
samtools view -h -F 3844 test.bam \
  | java -jar sam2tsv.jar -r ref.transcript.fa > test.tsv
  • minimap2: A fast aligner optimized for noisy long reads; -ax map-ont is the preset for mapping Oxford Nanopore reads.
  • samtools view -F 3844: Removes alignments flagged as unmapped (4), secondary (256), QC-fail (512), duplicate (1024), or supplementary (2048); these flags sum to 3844, leaving only high-quality primary alignments (see the flagstat check below).
  • sam2tsv.jar: A tool that converts SAM/BAM to a detailed per-base TSV, which is used in the next step to guide feature extraction.
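
Before filtering, samtools flagstat gives a quick summary of how many alignments are primary versus secondary or supplementary, which shows what the -F 3844 filter will discard:

samtools flagstat test.bam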

6. Split Error Profiles for Parallel Processing

This awk command splits the master test.tsv file into smaller files, one for each read. This allows for massive parallelization in the feature extraction step.

mkdir tmp
awk 'NR==1{ h=$0 } NR>1 {
  print (!a[$1]++ ? h ORS $0 : $0) > "tmp/"$1".txt"
}' test.tsv
  • How it works: The script reads test.tsv line by line, saving the header (h=$0). For each data line, it appends the line to a file named after the read ID in column 1 (tmp/$1.txt). The !a[$1]++ test is true only the first time a read ID is seen, so the header is written exactly once per file. A quick verification sketch follows below.
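
A quick check that the split behaved as expected (assuming the tmp/ directory created above):

ls tmp | wc -l                          # one .txt file per read
head -n 3 "tmp/$(ls tmp | head -n 1)"   # header line followed by that read's first rows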

7. Feature Extraction

This is the core script that extracts signal features, k-mers, and quality information for each potential m6A site (defined by the DRACH motif).

python scripts/extract.py \
  -i /path/to/workspace \
  -o test/features/ \
  --errors_dir test/tmp/ \
  --corrected_group RawGenomeCorrected_001 \
  --w_is_dir yes \
  -k 5 \
  -s 65 \
  -n 30
Option Purpose
-i <DIR> Path to the directory containing the re-squiggled fast5 files.
-o <PATH> Output path. Since --w_is_dir is set, this will be a directory where feature files are saved.
--errors_dir <DIR> Path to the directory containing the per-read error profiles (.txt files) created in the previous step.
--corrected_group <STR> Must match the group name used in the tombo resquiggle command (RawGenomeCorrected_001). This tells the script where to find the signal data.
--w_is_dir <yes/no> If yes, treats the output path as a directory and saves features in batches to multiple files, which is efficient for large datasets.
-k <INT> K-mer length: the number of bases extracted around each candidate site (e.g., -k 5 extracts the 5-mer centered on the A of the DRACH motif). Must be an odd number.
-s <INT> Signal length. The number of raw signal values to extract for each base. The script will pad or sample the signals to meet this fixed length.
-n <INT> Number of parallel processes to use for feature extraction.
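
To confirm that extraction produced output, a rough check is (treat this as a sketch, since the batch file names are whatever extract.py chooses in --w_is_dir mode):

ls test/features/ | head            # batch feature files
wc -l test/features/* | tail -n 1   # total read-level feature records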

8. Aggregate Read-Level to Site-Level Features

The final pre-processing step aggregates the features from all individual reads into a single entry for each genomic site.

# 1. Sort all feature files by genomic location (chromosome, position)
sort -k1,1 -k2,2n -k5,5 -T./ --parallel=30 -S30G test/features/* > test.features.sort.tsv

# 2. Aggregate the sorted features
python scripts/aggre_features_to_site_level.py \
  --input test.features.sort.tsv \
  --output test.features.sort.aggre.tsv

# 3. Filter for sites with a minimum read coverage of 20
awk 'BEGIN{ FS=OFS="\t" } $4 >= 20 { print $0 }' test.features.sort.aggre.tsv > test.features.sort.aggre.c20.tsv
  • sort: Sorting is essential to group all reads covering the same site together. -k1,1 -k2,2n sorts by chromosome and then numerically by position.
  • aggre_features_to_site_level.py: This script iterates through the sorted file and combines all feature lines corresponding to the same site into a single, aggregated line.
  • awk '$4>=20': A common quality-control step that keeps only sites with sufficient evidence (here, at least 20 reads). The snippet below sketches how to inspect the coverage distribution before settling on a cutoff.
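
If you are unsure whether 20 is the right cutoff for your data, a quick look at the coverage distribution can help (assuming, as the filter above implies, that per-site read coverage is in column 4):

cut -f4 test.features.sort.aggre.tsv | sort -n | uniq -c | sort -k2,2n | head -n 20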

Prediction Workflow

MultiNano offers several modes for prediction depending on your needs.

9. Site-Level Prediction

This is the standard mode for predicting m6A status directly from aggregated site-level feature files.

python scripts/predict_sl.py \
  --model /path/to/your/trained_site_model.pt \
  --input_file /path/to/your/site_features_for_prediction.txt \
  --output_file /path/to/your/site_predictions.tsv

10. Read-Level Prediction

If you need to determine the modification status of individual reads, use this mode.

python scripts/predict_rl.py \
  --model /path/to/your/trained_read_model.pt \
  --input_file /path/to/your/read_features_for_prediction.txt \
  --output_file /path/to/your/read_predictions.tsv

11. Two-Stage Prediction (Site-level with False Positive Control)

This advanced workflow first generates per-read probabilities and then uses a second model to reduce false positives. This is the recommended approach for the highest accuracy.

Step 1: Generate Primary Predictions

This script uses a read-level model to generate per-read probabilities for each site and then calculates a preliminary site-level score. The output file contains all the per-read probabilities, which are needed for the next step.

python scripts/predict_sl_with_rl.py \
  --model /path/to/your/trained_read_model.pt \
  --input_file /path/to/your/site_features_for_prediction.txt \
  --output_file /path/to/primary_site_predictions.csv
Option Purpose
--model Path to a read-level trained model (.pt).
--input_file Aggregated site-level feature file from step 8.
--output_file Output .csv file containing the site ID, all per-read probabilities, and a primary prediction.

Step 2: Refine Predictions and Control False Positives

This script uses a trained XGBoost model to analyze the distribution of read probabilities from the previous step. It acts as a second-stage filter to differentiate true positives from likely false positives.

python FP_control/predict.py \
  --differentiator-model /path/to/xgboost_differentiator.json \
  --primary-predictions /path/to/primary_site_predictions.csv \
  --output /path/to/final_refined_predictions.tsv
Option Purpose
--differentiator-model Path to a pre-trained XGBoost model (.json) designed to identify false positive signatures.
--primary-predictions The output file from the previous predict_sl_with_rl.py step, containing the per-read probability distributions.
--output The final, refined prediction file containing site identifiers and a final binary prediction (1 for modified, 0 for unmodified).
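
As a simple downstream check (assuming the final binary call is the last column, per the description above), you can count how many sites were called modified:

awk -F'\t' '$NF == 1' /path/to/final_refined_predictions.tsv | wc -l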

Training Workflow

You can also train MultiNano's models from scratch using your own labeled data.

12. Train a Site-Level Model

This trains the model that predicts modification status directly from aggregated site-level features.

python scripts/train.py \
  --train_file /path/to/your/site_train_data.txt \
  --valid_file /path/to/your/site_val_data.txt \
  --save_dir results/site_model_checkpoints/ \
  --model_type comb \
  --epochs 100 \
  --batch_size 64
Option Purpose
--train_file Path to the training dataset (aggregated site-level features).
--valid_file Path to the validation dataset used for monitoring training and early stopping.
--save_dir Directory where model checkpoints will be saved after each epoch.
--model_type Specifies the input features to use: raw_signals, basecall (k-mer), or comb (both).
--epochs <INT> Number of training epochs (100 in the example above).
--batch_size <INT> Number of samples per training batch (64 in the example above).
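
As training runs, checkpoints accumulate in --save_dir; a sketch for picking up the most recent one (assuming checkpoints are saved with a .pt extension, consistent with the prediction commands above):

ls -t results/site_model_checkpoints/*.pt | head -n 1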

13. Train a Read-Level Model

This trains the model that predicts modification status for each individual read. This model is required for the two-stage prediction workflow.

python scripts/train_rl.py \
  --train_file /path/to/your/read_train_data.txt \
  --valid_file /path/to/your/read_val_data.txt \
  --save_dir results/read_model_checkpoints/ \
  --model_type comb \
  --epochs 100 \
  --batch_size 256

14. Train a False Positive Differentiator Model

This trains the second-stage XGBoost classifier used to reduce false positives. To train it, you need primary prediction results (from predict_sl_with_rl.py) on sites that you know are true positives and true negatives.

python FP_control/train.py \
  --positive-samples /path/to/primary_preds_on_true_positives.csv \
  --negative-samples /path/to/primary_preds_on_true_negatives.csv \
  --model-out /path/to/save/differentiator.json
Option Purpose
--positive-samples Path to the primary prediction results generated by running predict_sl_with_rl.py (Step 1 of Section 11) on a dataset where all sites are known to be modified (label=1).
--negative-samples Path to the primary prediction results generated by running predict_sl_with_rl.py (Step 1 of Section 11) on a dataset where all sites are known to be unmodified (label=0).
--model-out Output path where the trained XGBoost model will be saved in .json format.
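
To verify the saved differentiator loads correctly before using it in the two-stage workflow, you can open it with XGBoost's native loader (a sketch; the xgboost Python package is assumed to be available in the MultiNano environment):

python -c "import xgboost as xgb; b = xgb.Booster(); b.load_model('/path/to/save/differentiator.json'); print('differentiator loaded OK')"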
