Paula Ruiz-Rodriguez1
and Mireia Coscolla1
1. Institute for Integrative Systems Biology, I2SysBio, University of Valencia-CSIC, Valencia, Spain
get_MNV is a tool designed to identify Multi-Nucleotide Variants (MNVs) within the same codon in genomic sequences. MNVs occur when multiple Single Nucleotide Variants (SNVs) are present within the same codon, leading to the translation of a different amino acid. This tool addresses limitations in current annotation programs like ANNOVAR or SnpEff, which are primarily designed to work with individual SNVs and might overlook the actual amino acid changes resulting from MNVs.
get_MNV seeks to address this issue, enhancing the comprehensiveness of genetic variant interpretation.
IMPORTANT this script works with SNV against a reference, insertions and deletions modifiying reading frame are not currently supported
- MNV Identification: Detects SNVs occurring within the same codon and reclassifies them as MNVs.
- Accurate Amino Acid Change Calculation: Computes the resulting amino acid changes based on genomic reads.
- Integration with BAM and VCF Files: Supports input from VCF files for variants and optional BAM files for aligned reads.
- Quality Analysis: Allows setting a minimum Phred quality threshold to filter out low-quality reads.
You can install get_MNV via conda, mamba (for unix/mac) or downloading the binary file (unix):
conda install -c bioconda get_mnv
mamba install -c bioconda get_mnv
wget https://github.com/PathoGenOmics-Lab/get_MNV/releases/download/1.0.0/get_mnv
get_mnv [OPTIONS] --vcf <VCF_FILE> --fasta <FASTA_FILE> (--genes <GENES_FILE> | --gff <GFF_FILE>)
- -v, --vcf <VCF_FILE>: VCF file containing the SNVs. (Required)
- -b, --bam <BAM_FILE>: BAM file with aligned reads. (Optional)
- -f, --fasta <FASTA_FILE>: FASTA file with the reference sequence. (Required)
- -g, --genes <GENES_FILE>: Gene annotation file in TSV format. (Required if
--gffis not used) - --gff <GFF_FILE>: Gene annotation file in GFF/GFF3 format. (Required if
--genesis not used) - --sample : Sample name used to read original FORMAT metrics (
DP/AF/FREQ) from multi-sample VCF. Default: first sample.- Use
--sample allto process every sample in the VCF and write one output per sample.
- Use
- --chrom : Optional contig selection. Accepts one or multiple comma-separated contigs (e.g.
--chrom chr1,chr2). Default: all contigs in VCF. - -q, --quality : Minimum Phred quality score (default: 20).
- --mapq : Minimum mapping quality MAPQ (default: 0).
- --threads : Number of worker threads for parallel processing (default: Rayon auto).
- --normalize-alleles: Normalize REF/ALT alleles (trim shared prefix/suffix) before processing.
- --split-multiallelic: Split multiallelic VCF records (
ALT=A,C,...) into independent ALT alleles. - --dry-run: Validate inputs and print per-contig summary without writing output files.
- --strict: Fail if original input metrics (
ODP/OFREQ) cannot be recovered from VCF FORMAT/INFO fields. - --vcf-gz: Write VCF output as BGZF-compressed
.MNV.vcf.gz(requires--convertor--both). - --index-vcf-gz: Build Tabix index (
.tbi) for.MNV.vcf.gzoutput. - --bcf: Also write BCF output (
.MNV.bcf) converted from generated VCF (requires--convertor--both). - --min-snp-strand : Require at least
Nsupporting reads in forward and reverse strand for SNP records. - --min-mnv-strand : Require at least
Nsupporting reads in forward and reverse strand for MNV records. - --min-strand-bias-p
: Filter variants with Fisher exact strand-bias p-value
< P(0 <= P <= 1). - --emit-filtered: Keep threshold-failing variants in VCF and mark them with FILTER tags instead of skipping them.
- --strand-bias-info: Add Fisher exact strand-bias p-values (
SBPfor SNP,MSBPfor MNV) to VCF INFO. - --summary-json <JSON_FILE>: Write structured run summary in JSON format.
- --error-json <JSON_FILE>: Write structured JSON error details when the command fails.
- --run-manifest <JSON_FILE>: Write reproducibility manifest with command line, summary and checksums.
- --convert: Output only VCF (
.MNV.vcfor.MNV.vcf.gzwith--vcf-gz) instead of TSV. - --both: Output both TSV and VCF in one run.
get_mnv \
--vcf variants.vcf \
--bam reads.bam \
--fasta reference.fasta \
--gff genes.gff3 \
--quality 30 \
--mapq 20
- VCF File: Should contain the identified SNVs.
- BAM File: (Optional) Genomic reads aligned to the reference sequence.
- FASTA File: Reference genomic sequence.
- Gene annotation file:
- TSV: tab-delimited text file with one entry per line
(GeneName, GeneStart, GeneEnd, Strand). - GFF/GFF3: features of type
geneare used (columns 4/5/7 for start/end/strand).
- TSV: tab-delimited text file with one entry per line
Contig naming contract:
- VCF contigs, FASTA record IDs, and GFF sequence IDs must match exactly (case-sensitive).
- Input VCF should declare all used contigs in the header (
##contig=<ID=...>). - If contig names do not match, normalize them before running
get_mnv.
TSV example:
Rv0007_Rv0007 9914 10828 +
ileT_Rvnt01 10887 10960 +
alaT_Rvnt02 11112 11184 +
Rv0008c_Rv0008c 11874 12311 -
ppiA_Rv0009 12468 13016 +
Rv0010c_Rv0010c 13133 13558 -The program generates a TSV file named <vcf_filename>.MNV.tsv containing the following information:
- Chromosome: Contig/chromosome name.
- Gene: Name of the gene.
- Positions: Positions of the variants.
- Reference Bases: Reference nucleotide bases at variant positions.
- Base Changes: Nucleotide base changes.
- AA Changes: Resulting amino acid changes.
- SNP AA Changes: Amino acid changes if considering individual SNVs.
- Variant Type: Type of variant (SNP, MNV, or SNP/MNV).
- Change Type: Type of change at the protein level (Synonymous, Non-synonymous, Stop gained).
- SNP Reads: (If BAM provided) Count of reads supporting each SNP.
- SNP Forward Reads / SNP Reverse Reads: (If BAM provided) Strand-specific SNP support counts.
- MNV Reads: (If BAM provided) Count of reads supporting the MNV.
- MNV Forward Reads / MNV Reverse Reads: (If BAM provided) Strand-specific MNV support counts.
When --convert or --both is used, the VCF INFO field includes:
ODP/OFREQ: Original depth and frequency from the input VCF (parsed from FORMAT/INFODP+AF/FREQ, with caller-compatible fallback fromADorAO/ROwhen needed).DP/FREQ: Recalculated depth and frequency from BAM reads.SRF/SRR: SNP supporting reads on forward/reverse strand.MRF/MRR: MNV supporting reads on forward/reverse strand.
When --emit-filtered is used, VCF FILTER can contain:
LowSupport: fails minimum read support thresholds (--snp/--mnv).StrandSupport: fails minimum per-strand support (--min-snp-strand/--min-mnv-strand).StrandBias: fails Fisher exact p-value threshold (--min-strand-bias-p).
When --bcf is enabled, an additional .MNV.bcf output is written.
VCF header metadata includes:
##get_mnv_version##get_mnv_command##get_mnv_min_quality##get_mnv_min_mapq##get_mnv_min_snp_reads##get_mnv_min_mnv_reads
Notes:
- For MNV records, recalculated
DP/FREQuse the depth of reads that span all SNP positions in the MNV haplotype. - Recalculated frequencies (
FREQ, SNP/MNV frequencies in TSV) are printed with 4 decimal places. --dry-runprints parsing/annotation summaries and skips TSV/VCF creation.--dry-runalso skips.bcfcreation.--summary-jsonstores per-contig and global counters (variants, genes, cache hits/misses).summary-jsonincludesschema_versionfor stable downstream parsing.summary-jsonincludes per-phase timings in milliseconds (parse_inputs_ms,process_ms,emit_ms,total_ms).summary-jsonincludes SHA-256 checksums for VCF/FASTA/annotation/BAM inputs.--run-manifestincludes summary plus output checksums (VCF/TSV/BCF when present).--sample allwrites one output set per sample using suffix.sample_<sample_name>.- If
--chromis provided, GFF/GFF3 gene features are filtered to those contigs. - Multi-contig processing without
--chromis supported with GFF/GFF3 annotation files. - TSV annotation files do not contain contig information, so for multi-contig VCF inputs use
--gffor restrict with--chrom. --strictis useful for QC pipelines that require original depth/frequency traceability in every output record.- Multiallelic records can be processed with
--split-multiallelic; default behavior remains strict fail.
Example:
Chromosome Gene Positions Base Changes AA Changes SNP AA Changes Variant Type Change Type SNP Reads SNP Forward Reads SNP Reverse Reads MNV Reads MNV Forward Reads MNV Reverse Reads
MTB_anc Rv0095c_Rv0095c 104838 T Asp126Glu Asp126Glu SNP Non-synonymous 16 8 8 0 0 0
MTB_anc Rv0095c_Rv0095c 104941,104942 T,G Gly92Gln Gly92Glu; Gly92Arg SNP/MNV Non-synonymous 0,3 0,1 0,2 22 11 11
MTB_anc esxL_Rv1198 1341102,1341103 T,C Arg33Ser Arg33Cys; Arg33Pro MNV Non-synonymous 0,0 0,0 0,0 9 5 4
Run a baseline benchmark for codon variant generation:
cargo run --release --bin bench_variants -- --warmup 5 --iters 30Benchmark against the bundled example dataset and export CSV:
cargo run --release --bin bench_variants -- --dataset example --warmup 5 --iters 20 --threads 4 --csv benchmark.csv --max-avg-ms 200This appends one line per run to benchmark.csv.
Larger synthetic stress benchmark:
cargo run --release --bin bench_variants -- --warmup 3 --iters 20 --threads 4 --synthetic-scale 4For reproducible benchmark batches and example-run artifact generation, see scripts in /analysis:
/analysis/run_reproducible_benchmark.sh/analysis/reproduce_example_run.sh
[E002] Contig validation failed ... Missing in VCF/FASTA: contig names do not match between files. Normalize names in VCF/FASTA/GFF and re-run.[E002] VCF REF/FASTA mismatch at <contig>:<pos>: REF in VCF does not match the FASTA base. Normalize/fix the VCF against the same reference.[E002] Invalid base 'X' in REF/ALT allele ...: non-IUPAC base detected. Clean invalid alleles before running.Sample '<name>' not found in VCF header: check sample name spelling or omit--sampleto use the first sample.TSV annotation does not include contig names: for multi-contig VCF use--gffor restrict with--chrom.--strict enabled, but original VCF metrics are missing ...: ensure each selected record has recoverableDPandAF/FREQin FORMAT or INFO.--index-vcf-gz requires --vcf-gz: enable BGZF VCF output before requesting Tabix index generation.--bcf requires --convert or --both: enable VCF output mode before requesting BCF conversion.--min-strand-bias-p must be between 0 and 1: provide a probability threshold inside[0, 1].Requested --sample all but input VCF has no sample columns: use a VCF with FORMAT/sample columns or select a single sample mode.Multiallelic VCF record ... is not supported: split multiallelic sites first (bcftools norm -m -) or run with--split-multiallelic.
Exit status codes:
1: generic error (E000)2: config/CLI error (E001)3: input/validation error (E002)10-14: I/O/CSV/HTSlib/UTF-8/parse specific errors
- The script currently works only with SNVs compared against a reference sequence.
- Insertions and deletions that modify the reading frame are not supported in this version.
- Multiallelic VCF records require either pre-splitting (e.g.
bcftools norm -m -) or enabling--split-multiallelic. - VCF contig names must be properly declared in the VCF header and present in the FASTA file.
|
Paula Ruiz-Rodriguez π» π¬ π€ π£ π¨ π§ |
Mireia Coscolla π π€ π§βπ« π¬ π |
This project follows the all-contributors specification (emoji key).
Click for the stl file