This repository contains the computational and analytical framework used to characterize toxin–antitoxin (TA) systems in the oral microbiome using metatranscriptomic data from two publicly available cohorts: the Dieguez et al. and Ev datasets.
The analysis integrates differential expression, curated functional annotations, and detailed visualizations to identify and interpret the transcriptional activity of TA gene pairs in healthy, caries-associated, and caries-treated oral microbiomes. All scripts, intermediate data files, and processed outputs are included to enable reproducibility, transparency, and downstream reuse.
- Toxin–Antitoxin Systems in the Oral Microbiome
- Repository Overview & Structure
- Installation instructions
- Metatranscriptomic Data Processing Pipeline
- Functional Profiling (HUMAnN)
- Functional annotations
- Downstream Analysis: Differential Expression and Visualization
- Functional Evidences
- Validated TA Gene Clusters
- Directory structure
- Packages to install
- Authors and Maintainers
- Questions and Feedback
All analyses were developed and executed on a Linux-based HPC cluster and assume working familiarity with common bioinformatics tools plus basic R/Python scripting.
git clone https://github.com/biocoms/ta_systems_oral_mt.git
cd ta_systems_oral_mt
To streamline setup, we provide a single bash script that installs all dependencies in one step.
bash scripts/setup.sh
If any large files are missing or stubbed out, ensure Git LFS is installed. You can also install Git LFS via conda and activate it in Git as shown below.
git lfs install
git lfs pull
git lfs track "filename or file extension"
git add .gitattributes
The following section details the entire preprocessing workflow using publicly available tools, from raw sequencing reads to functional gene and pathway profiles.
Enter these BioProject numbers in SRA-Explorer, copy the FTP links of the paired-end FASTQ files to a .txt file, and download them using wget:
- Dieguez: BioProject PRJNA712952
- Ev: BioProject PRJNA930965
wget -i dieguez.txt
wget -i ev.txt
Make sure dieguez.txt and ev.txt each contain full FTP links for both R1 and R2 reads, e.g.:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR233/020/SRR23351020/SRR23351020_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR233/020/SRR23351020/SRR23351020_2.fastq.gz
.......
.......
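Before starting a long download, it can help to confirm that every accession in a link file has both mates. The following is a minimal sketch, assuming the links follow the ENA-style naming shown above (`<accession>_1.fastq.gz` and `<accession>_2.fastq.gz`); the function name `find_unpaired` is illustrative and not part of the repository.

```python
import re
from collections import defaultdict

# Matches ENA-style paired FASTQ names such as .../SRR23351020_1.fastq.gz
FASTQ_RE = re.compile(r"/([A-Z]{3}\d+)_([12])\.fastq\.gz$")

def find_unpaired(links):
    """Return accessions that lack either the _1 or the _2 FASTQ link."""
    mates = defaultdict(set)
    for link in links:
        match = FASTQ_RE.search(link.strip())
        if match:
            mates[match.group(1)].add(match.group(2))
    return sorted(acc for acc, seen in mates.items() if seen != {"1", "2"})

links = [
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR233/020/SRR23351020/SRR23351020_1.fastq.gz",
    "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR233/020/SRR23351020/SRR23351020_2.fastq.gz",
]
print(find_unpaired(links))      # [] because both mates are present
print(find_unpaired(links[:1]))  # ['SRR23351020'] because the _2 link is missing
```

To check a real file, pass its lines instead, for example `find_unpaired(open("dieguez.txt"))`.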
Trim Galore is a wrapper tool that combines FastQC and Cutadapt to perform quality filtering and adapter trimming.
Outputs:
- Trimmed paired FASTQ files (e.g., _1_trimmed.fastq.gz, _2_trimmed.fastq.gz)
- FastQC quality reports (_fastqc.html, _fastqc.zip)
Command Example:
trim_galore --paired --fastqc --gzip --phred33 --length 50 \
--output_dir dieguez/trimmed_reads sample_1.fastq.gz sample_2.fastq.gz
- --paired: Indicates paired-end reads.
- --fastqc: Runs FastQC before and after trimming for QC reports.
- --gzip: Compresses the output files in .gz format.
- --phred33: Specifies base quality encoding (standard for Illumina).
- --length 50: Discards reads shorter than 50 bp after trimming.
- --output_dir: Output folder for trimmed reads and FastQC reports.
This step is applied to both the Dieguez and Ev datasets. FastQC reports are generated to evaluate sequence quality before and after trimming.
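To apply the same command to every sample, the read pairs can be discovered and the command assembled programmatically. This is a rough sketch rather than the repository's own batch logic: the helper names and the `dieguez/raw_reads` input folder are assumptions, while the Trim Galore flags are exactly those listed above.

```python
from pathlib import Path

def trim_galore_cmd(r1, r2, out_dir, min_length=50):
    """Build the Trim Galore argument list for one read pair."""
    return [
        "trim_galore", "--paired", "--fastqc", "--gzip", "--phred33",
        "--length", str(min_length),
        "--output_dir", str(out_dir),
        str(r1), str(r2),
    ]

def pair_reads(fastq_dir):
    """Yield (R1, R2) paths for every *_1.fastq.gz with a matching *_2 file."""
    for r1 in sorted(Path(fastq_dir).glob("*_1.fastq.gz")):
        r2 = r1.with_name(r1.name.replace("_1.fastq.gz", "_2.fastq.gz"))
        if r2.exists():
            yield r1, r2

# Hypothetical input folder; prints one command line per complete pair.
for r1, r2 in pair_reads("dieguez/raw_reads"):
    print(" ".join(trim_galore_cmd(r1, r2, "dieguez/trimmed_reads")))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` to execute the trimming instead of printing it.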
SortMeRNA filters out unwanted rRNA and host (human) reads using curated databases.
Databases Used:
- SILVA rRNA databases (16S, 23S, 5S, etc.)
- Rfam (non-coding RNAs)
- GRCh38 (human genome and transcriptome)
All reference FASTA files are stored in: dbs/sortmerna_databases/
Inputs:
Trimmed paired FASTQ files (trimmed_reads/_1_trimmed.fastq.gz and trimmed_reads/_2_trimmed.fastq.gz)
Outputs:
- *_aligned.fastq.gz — reads matching rRNA/host
- *_unaligned.fastq.gz — filtered reads for downstream analysis
Command Template:
sortmerna --ref dbs/ref_sortmerna/silva-bac-16s-id90.fasta \
--reads sample_1_trimmed.fastq.gz \
--reads sample_2_trimmed.fastq.gz \
--fastx \
--aligned output_dir/sample_aligned \
--other output_dir/sample_unaligned \
--threads 32 #flexible as per available cores
- --ref: Specifies each reference database for filtering.
- --reads: Input FASTQ files (both forward and reverse).
- --fastx: Ensures FASTQ format is preserved for input/output.
- --aligned: Output prefix for reads that matched references.
- --other: Output prefix for reads that did not match references.
- --threads: Number of threads to use for parallel processing.
Filtered reads are saved into:
- dieguez/trimmed_reads/sortmerna_unaligned/
- ev/trimmed_reads/sortmerna_unaligned/
These unaligned reads are used as input for HUMAnN.
HUMAnN3 (The HMP Unified Metabolic Analysis Network) is used to profile gene families and metabolic pathways from host-filtered microbial reads.
Databases Required:
- ChocoPhlAn: nucleotide-level mapping
- UniRef90 or UniRef50: translated protein database for function
Inputs:
- .fq.gz files from sortmerna_unaligned/ directory
Outputs (for each sample):
- {sample}_genefamilies.tsv
- {sample}_pathabundance.tsv
- {sample}_pathcoverage.tsv
Command Template:
humann -i sample_unaligned.fq.gz \
-o humann_output/sample_name \
--threads 32 \
--nucleotide-database humann/dbs/chocophlan \
--protein-database humann/dbs/uniref \
--verbose
- -i: Input FASTQ file (filtered reads).
- -o: Output directory for results.
- --threads: Number of parallel threads for speed.
- --nucleotide-database: Path to ChocoPhlAn nucleotide database.
- --protein-database: Path to UniRef protein database.
- --verbose: Prints detailed progress during execution.
Run separately for each dataset (Dieguez and Ev). This step generates functional profiles that can be used for downstream gene analysis and comparison against TA gene clusters.
ta_systems_oral_mt/
├── dieguez/
│ ├── trimmed_reads/
│ │ ├── *_1_trimmed.fastq.gz
│ │ ├── *_2_trimmed.fastq.gz
│ │ ├── *_fastqc.html
│ │ ├── *_fastqc.zip
│ │ ├── sortmerna/
│ │ │ ├── <sample>/
│ │ │ │ ├── <sample>_aligned.fastq.gz
│ │ │ │ ├── <sample>_unaligned.fastq.gz
│ │ └── sortmerna_unaligned/
│ │ └── *.fq.gz # Filtered (host & rRNA removed) reads
│ └── humann/
│ └── <sample>/
│ ├── <sample>_genefamilies.tsv
│ ├── <sample>_pathabundance.tsv
│ └── <sample>_pathcoverage.tsv
│
├── ev/
│ ├── trimmed_reads/
│ │ ├── *_1_trimmed.fastq.gz
│ │ ├── *_2_trimmed.fastq.gz
│ │ ├── *_fastqc.html
│ │ ├── *_fastqc.zip
│ │ ├── sortmerna/
│ │ │ ├── <sample>/
│ │ │ │ ├── <sample>_aligned.fastq.gz
│ │ │ │ ├── <sample>_unaligned.fastq.gz
│ │ └── sortmerna_unaligned/
│ │ └── *.fq.gz # Filtered reads
│ └── humann/
│ └── <sample>/
│ ├── <sample>_genefamilies.tsv
│ ├── <sample>_pathabundance.tsv
│ └── <sample>_pathcoverage.tsv
│
├── dbs/
│ └── sortmerna_databases/ # SortMeRNA reference databases
│ ├── *.fasta
│ └── humann_databases/
│ ├── chocophlan/ # HUMAnN nucleotide database
│ └── uniref/ # HUMAnN protein database
│
├── dieguez.txt # FTP links for Dieguez dataset
├── ev.txt # FTP links for Ev dataset
├── ta_process.sh # Main preprocessing script
└── log/
└── processing_pipeline.log # Cluster job log output
We first extracted unique UniRef90 gene family IDs from the *_genefamilies.tsv outputs of HUMAnN. These IDs were then used to fetch FASTA sequences for downstream functional annotation using InterProScan, eggNOG-mapper, TADB3, and VFDB.
Inputs:
- Directory of *_genefamilies.tsv files from HUMAnN for each sample (raw_data/dieguez_genefamilies.tsv, raw_data/ev_genefamilies.tsv)
Steps
- Extract UniRef90 IDs
  - Parsed each genefamilies file to collect only those gene families starting with UniRef90_.
  - Saved as one ID list per sample in: uniref_ids/extracted_ids/*_uniref90_ids.txt
- Download FASTA Sequences
  - Queried the UniProt REST API in chunks (default = 500 IDs/query).
  - Fetched .fasta sequences corresponding to UniRef90 IDs per sample.
  - Logs were saved for each sample to monitor download status.
Script:
scripts/uniref_idmapping.py
python scripts/uniref_idmapping.py \
--input_dir raw_data/dieguez_genefamilies.tsv \
--output_dir dieguez_uniref_mapped/
python scripts/uniref_idmapping.py \
--input_dir raw_data/ev_genefamilies.tsv \
--output_dir ev_uniref_mapped/
Outputs:
- uniref_ids/extracted_ids/*.txt – Sample-wise UniRef90 ID lists
- uniref_ids/fastas/*.fasta – Fetched amino acid sequences
- uniref_ids/logs/*.log – Chunk-wise log of downloads per sample
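The ID-extraction and chunking steps can be illustrated with a short sketch. It is not the code in scripts/uniref_idmapping.py; it assumes the standard HUMAnN gene-family layout (a `#` header line, tab-separated values, and optional `|species` stratification after the family ID), and the IDs in the example are made-up placeholders.

```python
def extract_uniref90_ids(lines):
    """Collect unique UniRef90 IDs from gene-family rows, in first-seen order."""
    seen = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line
        feature = line.split("\t", 1)[0]
        family = feature.split("|", 1)[0]  # drop per-species stratification
        if family.startswith("UniRef90_"):
            seen.setdefault(family, None)
    return list(seen)

def chunked(items, size=500):
    """Yield consecutive slices of at most `size` items (500 IDs per query)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

rows = [
    "# Gene Family\tsample_Abundance-RPKs",
    "UNMAPPED\t100.0",
    "UniRef90_A0A000\t12.5",
    "UniRef90_A0A000|g__Streptococcus.s__Streptococcus_mutans\t12.5",
    "UniRef90_B1B111\t3.0",
]
ids = extract_uniref90_ids(rows)
print(ids)                                     # ['UniRef90_A0A000', 'UniRef90_B1B111']
print([len(c) for c in chunked(ids, size=1)])  # [1, 1]
```

Each chunk would then be submitted as one query, which keeps individual requests small and makes a failed chunk easy to retry from the per-sample log.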
The shell script database_annotations.sh automates batch functional annotation using four major tools: VFDB, TADB3, eggNOG-mapper, and InterProScan, applied to UniRef90-mapped FASTA sequences from the Dieguez and Ev datasets.
Inputs
- FASTA Sequences: Protein-coding genes extracted using UniRef90 IDs from significant TA genes.
  - Located in:
    - dieguez_uniref_mapped/fastas/
    - ev_uniref_mapped/fastas/
- Databases:
  - vfdb/VFDB_prot.dmnd
  - tadb3/tadb3_combined.fasta/*.dmnd
  - dbs/eggnog_data/eggnog_proteins.dmnd
  - dbs/interproscan-5.72-103.0/interproscan.sh
- Threads: VFDB/TADB3 (64), eggNOG (80); InterProScan uses all available CPUs.
Steps
- VFDB (DIAMOND blastp)
  - Matches each UniRef90 FASTA file to the VFDB protein database.
  - Output format: tab-delimited DIAMOND blastp (.tsv)
- TADB3 (DIAMOND blastp to multiple DBs)
  - Iteratively matches each FASTA to all *.dmnd toxin/antitoxin databases from TADB3.
- eggNOG-mapper
  - Runs emapper.py with:
    - Minimum 50% identity (--pident)
    - Minimum 80% query coverage (--query_cover)
    - DIAMOND mode, using a custom eggNOG data directory
- InterProScan
  - Annotates each FASTA using InterProScan with GO term and IPR lookups.
  - Output format: tab-separated .tsv with all domain and functional annotations.
Outputs
Tool | Dieguez Output Directory | EV Output Directory |
---|---|---|
VFDB | vfdb/dieguez/ | vfdb/ev/ |
TADB3 | tadb3/dieguez/ | tadb3/ev/ |
eggNOG | eggnog/dieguez/ | eggnog/ev/ |
InterProScan | interproscan/dieguez/ | interproscan/ev/ |
Execution
To run all annotations:
bash database_annotations.sh
All downstream analyses were conducted in R using the ta_systems_oral_mt.Rproj environment, with strict reproducibility ensured via renv.
Activate the project-local R environment using:
# From within the R project root
install.packages("renv")
renv::restore()
This installs all required packages at pinned versions as declared in renv.lock.
Alternatively, for manual setup, install the following CRAN and Bioconductor packages:
# CRAN packages
install.packages(c(
"tidyverse", "dplyr", "ggplot2", "gridExtra", "patchwork",
"VennDiagram", "RColorBrewer", "pheatmap", "UpSetR"
))
# Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"phyloseq", "ANCOMBC", "ComplexHeatmap", "circlize", "microbiome"
))
Raw gene family expression files (*_genefamilies.tsv) generated by HUMAnN were cleaned and normalized as follows:
- Retained only gene families prefixed with UniRef90_
- Removed unmapped or low-abundance entries (e.g., UNMAPPED)
- Standardized sample identifiers by removing .RPKs suffixes
- Ensured uniqueness of gene family rows
Output Files:
processed_data/processed_dieguez_genes.csv
processed_data/processed_ev_genes.csv
We intersected processed gene family tables with the curated UniRef90 toxin cluster list (uniref90_toxin_toxic.tsv) to identify shared and unique transcriptional signals across datasets.
- Computed overlaps between Dieguez and EV gene families with UniRef90 toxin genes
- Separated results into:
  - Dieguez-only overlaps
  - Ev-only overlaps
  - Shared UniRef90-positive genes
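The three-way split above reduces to plain set operations. The sketch below is illustrative only: the function name and the short example IDs are placeholders, and the repository performs this step in R.

```python
def partition_overlaps(dieguez_ids, ev_ids, toxin_ids):
    """Split toxin-list hits into Dieguez-only, Ev-only, and shared sets."""
    toxin = set(toxin_ids)
    dieguez_hits = set(dieguez_ids) & toxin
    ev_hits = set(ev_ids) & toxin
    return {
        "dieguez_only": dieguez_hits - ev_hits,
        "ev_only": ev_hits - dieguez_hits,
        "shared": dieguez_hits & ev_hits,
    }

result = partition_overlaps(
    dieguez_ids=["U1", "U2", "U3"],
    ev_ids=["U2", "U3", "U4", "U9"],
    toxin_ids=["U1", "U2", "U4"],
)
print(sorted(result["dieguez_only"]))  # ['U1']
print(sorted(result["ev_only"]))       # ['U4']
print(sorted(result["shared"]))        # ['U2']
```

Families absent from the curated list (here `U3` and `U9`) drop out before the datasets are compared, so the three sets sum to the number of toxin-list families detected in either dataset.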
Output Matrices:
uniref_tox_abundance/dieguez_final_uniref90_mapped.csv
uniref_tox_abundance/ev_final_uniref90_mapped.csv
uniref_tox_abundance/intersection_dieguez_final_uniref90_mapped.csv
uniref_tox_abundance/intersection_ev_final_uniref90_mapped.csv
Venn Diagram
venn_diagrams/venn_dieguez_ev_uniref.png
Differential abundance testing was conducted using ANCOM-BC independently within each dataset.
Workflow:
- Constructed phyloseq objects from UniRef90-filtered matrices
- Enumerated all valid pairwise condition comparisons
- Applied ancombc() with the model formula ~ condition
- Exported full and filtered (significant) result tables
Output Tables:
DA_results/da_complete/*.csv
DA_results/da_sig/*.csv
For each ANCOM-BC comparison, volcano plots were generated to highlight gene-level statistical significance and fold-change magnitude.
- x-axis: $\log_{2}$ fold-change (lfc)
- y-axis: $-\log_{10}$(p-value)
- Color scheme: red (significant), gray (non-significant)
- Vertical and horizontal thresholds marked at ±1 $\log_{2}$(FC) and p = 0.05
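As a rough illustration of how one point is placed and colored, the sketch below converts a p-value to the y-axis scale and applies the two thresholds. It assumes a gene is drawn in red only when it passes both cutoffs, which is a common convention but is not confirmed by this document; the function name is hypothetical.

```python
import math

def volcano_point(lfc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, color) for one gene on a volcano plot."""
    neg_log10_p = -math.log10(p_value)  # undefined for p_value == 0
    significant = abs(lfc) >= lfc_cutoff and p_value < p_cutoff
    return lfc, neg_log10_p, "red" if significant else "gray"

for lfc, p in [(2.0, 0.001), (0.4, 0.001), (-1.5, 0.2)]:
    x, y, color = volcano_point(lfc, p)
    print(f"lfc={x:+.1f}  -log10(p)={y:.2f}  {color}")
```

P-values reported as exactly zero need to be clipped to a small positive value before the transform.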
Output Directory:
DA_results/volcano_plot/*.png
Significantly differentially expressed TA gene clusters were visualized using scaled heatmaps.
- Expression values were $\log_{2}$-transformed with pseudocounts
- Row-wise scaling was applied to highlight relative expression shifts
- Columns were annotated using condition and host_subject_id
- Color palettes were customized to reflect biological and sample group distinctions
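The transform and scaling can be sketched for a single gene (one heatmap row). Two assumptions are made here for illustration: a pseudocount of 1, since the value is not stated, and row-wise scaling as a z-score using the sample standard deviation, which is what R's `scale()` computes.

```python
import math
import statistics

def log2_row_zscores(values, pseudocount=1.0):
    """Log2-transform one gene's expression values, then z-score the row."""
    logged = [math.log2(v + pseudocount) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample SD; needs at least two values
    if sd == 0:
        return [0.0 for _ in logged]  # a flat row carries no relative signal
    return [(x - mean) / sd for x in logged]

print(log2_row_zscores([0, 1, 3]))  # [-1.0, 0.0, 1.0]
```

After scaling, every row has mean 0, so colors reflect shifts relative to that gene's own average rather than absolute abundance.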
Output Directory:
DA_results/heatmaps/*.png
To summarize and visualize shared transcriptional activity of UniRef90-mapped toxin–antitoxin genes across conditions, we generated both UpSet plots and Venn diagrams.
The UpSet plot captures the intersection of all significantly expressed UniRef90 genes across validated pairwise comparisons.
Highlights:
- Rows: Gene clusters
- Columns: Comparisons (Dieguez and EV)
- Bars: Number of shared genes across comparison sets
- Queries:
  - Red: genes annotated as toxins
  - Green: genes annotated as antitoxins
Color Mapping:
- Dieguez comparisons: #E82561
- Ev comparisons: purple
Output:
DA_results/upset_plot/upset_plot.png
To ensure consistency in comparison labels across datasets, condition names were programmatically cleaned and standardized. Reverse contrasts (e.g., B vs A and A vs B) were collapsed into a single comparison.
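One simple way to collapse reverse contrasts is to sort the two condition names inside each label so that both orderings map to the same string. This is a sketch under the assumption that labels use a literal " vs " separator; the function name is hypothetical.

```python
def canonical_contrast(label, sep=" vs "):
    """Return the label with its two conditions in alphabetical order."""
    left, right = (part.strip() for part in label.split(sep, 1))
    return sep.join(sorted((left, right)))

labels = ["Healthy-CF vs Caries-CAa", "Caries-CAa vs Healthy-CF"]
print({canonical_contrast(label) for label in labels})
# {'Caries-CAa vs Healthy-CF'}
```

Only the label is normalized here; if a contrast is flipped, the sign of its log fold-change has to be flipped as well. A label without the separator raises `ValueError`, which surfaces malformed names early.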
Nine thematic panels (B to J) were designed to illustrate 3-way overlap across clinically meaningful comparisons.
Examples:
- Panel B: Caries–CA* vs Healthy–CF (CAa, CAi, CAs)
- Panel H: Healthy-only timepoint comparisons
- Panel J: Healthy vs Caries across all stages
Each Venn diagram shows:
- Size-labeled sets with word-wrapped condition labels
- Color-coded ellipses from RColorBrewer::Set2
- Gene counts in intersections and unique sections
Output Directory:
DA_results/venn_panels/
File Naming Convention:
Venn_Panel_<Letter>.png
Supporting Tables:
- All_Venn_Sets_Combined.csv: All individual sets per panel
- All_Venn_Intersections_3way.csv: Genes shared by all three comparisons per panel
Panel | Comparisons |
---|---|
B | Caries-CAa vs Healthy-CF, Caries-CAi vs Healthy-CF, Caries-CAs vs Healthy-CF |
C | Caries-CAa vs Healthy-CI, Caries-CAi vs Healthy-CI, Caries-CAs vs Healthy-CI |
D | Healthy-CF vs Healthy-CI, Caries-CAa vs Healthy-CF, Caries-CAa vs Healthy-CI |
E | Healthy-CF vs Healthy-CI, Caries-CAi vs Healthy-CF, Caries-CAi vs Healthy-CI |
F | Healthy-CF vs Healthy-CI, Caries-CAs vs Healthy-CF, Caries-CAs vs Healthy-CI |
G | Caries-CAa vs Caries-CAi, Caries-CAa vs Caries-CAs, Caries-CAi vs Caries-CAs |
H | Healthy-Baseline vs Healthy-Fl, Healthy-Baseline vs Healthy-Fl-Ar, Healthy-Fl vs Healthy-Fl-Ar |
I | Caries-Baseline vs Caries-Fl, Caries-Baseline vs Caries-Fl-Ar, Caries-Fl vs Caries-Fl-Ar |
J | Caries-Baseline vs Healthy-Baseline, Caries-Fl vs Healthy-Fl, Caries-Fl-Ar vs Healthy-Fl-Ar |
tadb3_master_table.py parses DIAMOND alignment outputs against the TADB3 database and generates a consolidated annotation table for all matched genes across the Dieguez and Ev datasets.
Inputs:
- DIAMOND outputs:
  - tadb3/diamond_dieguez/*.tsv
  - tadb3/diamond_ev/*.tsv
- Reference FASTA:
  - tadb3/tadb3_combined.fasta
Steps:
- Parse TADB3 FASTA to extract Protein_Accession, Genome_Accession, and Organism.
- Read DIAMOND result files per sample; each file contains UniRef90 hits to TADB3 proteins.
- Merge DIAMOND hits with metadata extracted from the FASTA.
- Annotate each hit with:
  - Toxin_Antitoxin: inferred from ID patterns (_tox_, _antitox_)
  - Validation_Type: Experimental or Computational based on UniRef90 ID
- Record Sample_Name and dataset (dieguez or ev) for each entry.
Output:
tadb3/tadb3_tox_antitox_main_table_1.tsv: master table with annotation metadata for all TADB3-aligned genes across both datasets.
tadb3_sig_genes.py filters and annotates TADB3-aligned genes that are differentially expressed in the Dieguez or Ev datasets. This script integrates DIAMOND-based hits with metadata from the TADB3 master table and augments entries with protein descriptions retrieved from NCBI.
Inputs:
- DA_results/sig_genes.txt: List of significant UniRef90 gene IDs
- tadb3/master_table_tadb3.txt: Annotated master table from tadb3_master_table.py
Steps:
- Load significant UniRef90 IDs.
- Subset TADB3 hits for these IDs and retain only high-confidence matches (Bit Score ≥ 90).
- Standardize organism names to species-level.
- Query NCBI Entrez for full protein descriptions using Protein_Accession and clean the output.
- For each gene, perform annotation collapse (e.g., most common Organism, validation type, aggregated annotations).
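The subsetting and bit-score filter can be sketched with the standard library alone. This example assumes the DIAMOND results use the default 12-column tabular layout (outfmt 6); the repository's files may carry different columns, and the IDs and values below are invented for illustration.

```python
import csv
import io

# Default DIAMOND/BLAST tabular (outfmt 6) column order.
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def filter_hits(handle, significant_ids, min_bitscore=90.0):
    """Keep hits whose query is significant and whose bit score passes the cutoff."""
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_COLUMNS, delimiter="\t")
    return [
        row for row in reader
        if row["qseqid"] in significant_ids
        and float(row["bitscore"]) >= min_bitscore
    ]

# Made-up alignment rows: only the first passes both filters.
example = io.StringIO(
    "UniRef90_A0A000\thit_1\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-40\t150.0\n"
    "UniRef90_A0A000\thit_2\t40.0\t60\t36\t0\t1\t60\t1\t60\t1e-05\t45.5\n"
    "UniRef90_B1B111\thit_3\t90.0\t120\t12\t0\t1\t120\t1\t120\t1e-60\t210.0\n"
)
kept = filter_hits(example, significant_ids={"UniRef90_A0A000"})
print([(row["sseqid"], row["bitscore"]) for row in kept])  # [('hit_1', '150.0')]
```

Filtering on bit score rather than e-value keeps the cutoff independent of database size, which matters when the same threshold is applied across several TADB3 databases.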
Output:
tadb3/sig_genes_annotations.csv: Final annotation table for significant TA genes enriched with curated metadata and Entrez descriptions.
vfdb_sig_genes.py filters DIAMOND alignments against the VFDB (Virulence Factor Database) to extract and annotate significant genes identified from the Dieguez and Ev datasets.
Inputs:
- DA_results/sig_genes.txt: List of significant UniRef90 gene IDs.
- DIAMOND outputs:
  - vfdb/dieguez/*_vs_VFDB.tsv
  - vfdb/ev/*_vs_VFDB.tsv
Steps:
- Load significant UniRef90 IDs and subset the DIAMOND VFDB alignments to retain only matches with these genes.
- Extract GenBank accession numbers from Query_ID using regex.
- Query NCBI Entrez using Bio.Entrez to fetch gene names and protein descriptions for each accession.
- Append sample name, dataset, gene name, and description to each annotated entry.
- Collapse annotations by gene using custom aggregation logic.
Output:
vfdb/sig_genes_annotations.csv: Annotated table of significant genes aligned to VFDB with NCBI-derived metadata.
eggnog_sig_genes.py extracts and summarizes functional annotations for significant UniRef90 gene families using EggNOG-mapper outputs from the Dieguez and Ev datasets.
Inputs:
- DA_results/sig_genes.txt: List of significant UniRef90 gene IDs (dot-truncated)
- Annotated .emapper.annotations files from EggNOG-mapper:
  - eggnog/dieguez/*_eggnog.emapper.annotations
  - eggnog/ev/*_eggnog.emapper.annotations
Steps:
- Parse each .emapper.annotations file while ignoring comment lines and correctly identifying the column header.
- Filter rows where the query matches a significant UniRef90 gene ID.
- Annotate each record with dataset and sample origin.
- Collapse annotations per UniRef90 ID by computing the most frequent (mode) entry for each functional field (e.g., KEGG_ko, EC, eggNOG_OGs, GO_terms).
- Track how many and which samples contributed to each gene’s annotation.
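The mode-based collapse can be sketched as follows. This is not the code in eggnog_sig_genes.py: the record keys `query` and `sample` and the example values are assumptions made for illustration, and only two of the functional fields are shown.

```python
from collections import Counter

def collapse_by_mode(records, fields=("KEGG_ko", "EC")):
    """Collapse per-sample records to one row per gene using each field's mode."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["query"], []).append(rec)
    collapsed = {}
    for gene, recs in grouped.items():
        summary = {
            field: Counter(r[field] for r in recs).most_common(1)[0][0]
            for field in fields
        }
        samples = sorted({r["sample"] for r in recs})
        summary["n_samples"] = len(samples)
        summary["samples"] = ";".join(samples)
        collapsed[gene] = summary
    return collapsed

records = [
    {"query": "UniRef90_A0A000", "sample": "S1", "KEGG_ko": "ko:K00001", "EC": "1.1.1.1"},
    {"query": "UniRef90_A0A000", "sample": "S2", "KEGG_ko": "ko:K00001", "EC": "-"},
    {"query": "UniRef90_A0A000", "sample": "S3", "KEGG_ko": "-", "EC": "1.1.1.1"},
]
print(collapse_by_mode(records)["UniRef90_A0A000"])
# {'KEGG_ko': 'ko:K00001', 'EC': '1.1.1.1', 'n_samples': 3, 'samples': 'S1;S2;S3'}
```

When two values tie, `Counter.most_common` returns the one seen first, so the result depends on input order; a deterministic tie-break may be preferable in practice.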
Output:
eggnog/sig_genes_annotations.csv: Collapsed annotation table of significant genes, including sample metadata and representative functional terms.
interproscan_sig_genes.py parses and annotates InterProScan .tsv results for significant UniRef90 gene families, enriching each entry with GO term descriptions and ontology categories.
Inputs:
- DA_results/sig_genes.txt: List of significant UniRef90 gene IDs.
- InterProScan output files:
  - interproscan/dieguez/*.tsv
  - interproscan/ev/*.tsv
Steps:
- Load .tsv files and assign dataset/sample labels.
- Filter records based on significant gene IDs.
- Extract GO identifiers (GO:XXXXXXX) from the GO column.
- Query the EBI QuickGO API to retrieve term names and ontology categories (BP, MF, CC).
- Annotate each record with:
  - Full GO term descriptions
  - Separated Biological Process, Molecular Function, Cellular Component
- Aggregate and collapse annotations for each gene across samples and datasets.
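The GO identifier extraction step amounts to a regular-expression match on the `GO:` prefix followed by seven digits. A minimal sketch, with a hypothetical function name and an example string in the pipe-separated style InterProScan writes to its GO column:

```python
import re

GO_ID_RE = re.compile(r"GO:\d{7}")

def extract_go_ids(go_field):
    """Return the unique GO identifiers found in one GO column value."""
    return sorted(set(GO_ID_RE.findall(go_field or "")))

print(extract_go_ids("GO:0005524(InterPro)|GO:0016887(InterPro)|GO:0005524(PANTHER)"))
# ['GO:0005524', 'GO:0016887']
print(extract_go_ids("-"))  # []
```

Deduplicating before the lookup keeps the number of QuickGO requests proportional to the number of distinct terms rather than the number of annotation rows.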
Output:
interproscan/sig_genes_annotations.csv: Clean annotation table with GO terms and structured categories for significant genes.
This section summarizes and visualizes the expression behavior of experimentally validated or computationally predicted toxin–antitoxin (TA) system genes based on curated annotations from TADB3. Two components are included:
The script valid_gene_summary.R compiles condition-wise expression summaries for all validated TA system genes across the Dieguez and Ev datasets.
Inputs:
processed_data/processed_dieguez_genes.csv
processed_data/processed_ev_genes.csv
raw_data/metadata_dieguez.csv
raw_data/metadata_ev.csv
tadb3/tadb3_tox_antitox_main_table.tsv
Steps:
- Subset expression matrices to retain only TA genes annotated in the TADB3 table.
- Merge with metadata to assign each sample to a condition group.
- Apply a pseudocount (0.01) and compute log10(expression + 0.01) per sample.
- For each gene × condition × dataset, compute:
  - n, non-zero sample count
  - min, max, mean, median
  - mean_log10, median_log10
  - Gene label (Toxin or Antitoxin)
- Merge Dieguez and EV summaries into a combined summary table.
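For a single gene within one condition, the summary statistics listed above can be sketched as below. The repository computes them in R (valid_gene_summary.R); this Python version only illustrates the arithmetic, and the dictionary keys other than `mean_log10` and `median_log10` are illustrative names.

```python
import math
import statistics

def summarize(values, pseudocount=0.01):
    """Summarize one gene's expression values within one condition."""
    logged = [math.log10(v + pseudocount) for v in values]
    return {
        "n": len(values),
        "n_nonzero": sum(v > 0 for v in values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mean_log10": statistics.mean(logged),
        "median_log10": statistics.median(logged),
    }

summary = summarize([0.0, 0.99, 9.99])
print(summary["n"], summary["n_nonzero"])  # 3 2
print(round(summary["median_log10"], 6))
```

The pseudocount keeps zero-expression samples in the log-scale summaries (a zero maps to log10(0.01) = -2) instead of dropping them.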
Output:
DA_results/debug_gene_condition_summary_combined.csv
The script barplots.R creates annotated barplots for each significant TA gene identified in sig_genes_combined.csv, stratified by dataset and gene type.
Inputs:
processed_data/processed_dieguez_genes.csv
processed_data/processed_ev_genes.csv
raw_data/metadata_dieguez.csv
raw_data/metadata_ev.csv
DA_results/sig_genes_combined.csv
tadb3/tadb3_tox_antitox_main_table.tsv
Steps:
- Subset significant genes by type (Toxin, Antitoxin).
- Plot per-gene barplots using:
  - Y-axis: log1p(expression)
  - X-axis: condition group (ordered)
  - Mean ± standard error
- Overlay ANCOM-BC adjusted p-values as:
  - Brackets for significant comparisons only
  - Significance stars in black
- Customize plot aesthetics:
  - Red text for toxins, green text for antitoxins
  - One dot per condition per gene (no jitter)
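The bar height and error bar for one gene in one condition can be sketched as the mean and standard error of the log1p-transformed values. This is an illustration in Python of the arithmetic only (the plotting itself is done in barplots.R), and it assumes the standard error is the sample standard deviation divided by the square root of n.

```python
import math
import statistics

def mean_se_log1p(values):
    """Return (mean, standard error) of log1p-transformed expression values."""
    logged = [math.log1p(v) for v in values]
    mean = statistics.mean(logged)
    if len(logged) < 2:
        return mean, 0.0  # standard error is undefined for a single sample
    se = statistics.stdev(logged) / math.sqrt(len(logged))
    return mean, se

mean, se = mean_se_log1p([0.0, 2.5, 10.0, 4.0])
print(f"bar height = {mean:.3f}, error bar = ±{se:.3f}")
```

log1p maps zero expression to zero, so conditions with no detected transcripts still produce a bar at the baseline rather than an undefined value.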
Outputs:
DA_results/barplots/toxin_barplots/*.png
DA_results/barplots/antitoxin_barplots/*.png
ta_systems_oral_mt/
├── DA_results/ # Differential expression results - tables and visualizations
├── dieguez_uniref_mapped/ # Sample-wise mapping of the functional profiled UniRef90 Ids and Fastas
├── eggnog/ # Functional annotations from eggNOG-mapper
├── ev_uniref_mapped/ # Sample-wise mapping of the functional profiled UniRef90 Ids and Fastas
├── interproscan/ # InterProScan annotations for significant genes
├── processed_data/ # Preprocessed and cleaned expression matrices
├── raw_data/ # Raw input files (metadata, UniRef90 toxin-antitoxin files, functional gene tables)
├── scripts/ # All analysis scripts (R/Python)
├── tadb3/ # TADB3 toxin–antitoxin annotation tables and sequences
├── uniref_tox_abundance/ # Abundance tables intersecting Dieguez, ev and UniRef90 toxin–antitoxin gene clusters
├── venn_diagrams/ # Venn diagrams used in the manuscript
├── vfdb/ # VFDB annotations (dieguez/ev-specific)
├── README.md # Project description and documentation
├── LICENSE # License for usage
└── ta_systems_oral_mt.Rproj # R project file for downstream analysis
Tool | Version | Install Command | Documentation Link |
---|---|---|---|
FastQC | 0.12.1 | mamba install -c bioconda fastqc=0.12.1 | Documentation |
Cutadapt | 5.0 | mamba install -c bioconda cutadapt=5.0 | Documentation |
Trim Galore | 0.6.10 | mamba install -c bioconda trim-galore=0.6.10 | Documentation |
SortMeRNA | 4.3.7 | mamba install -c bioconda sortmerna=4.3.7 | Documentation |
HUMAnN | 3.9 | pip install humann==3.9 | Documentation |
Bowtie2 | 2.5.4 | mamba install -c bioconda bowtie2=2.5.4 | Documentation |
DIAMOND | 2.1.10 | mamba install -c bioconda diamond=2.1.10 | Documentation |
eggNOG-mapper | 2.1.12 | mamba install -c bioconda eggnog-mapper=2.1.12 | Documentation |
HMMER | 3.4 | mamba install -c bioconda hmmer=3.4 | Documentation |
MMseqs2 | latest | mamba install -c bioconda mmseqs2 | Documentation |
Prodigal | 2.6.3 | mamba install -c bioconda prodigal=2.6.3 | Documentation |
MultiQC | 1.26 | pip install multiqc==1.26 | Documentation |
OpenJDK | 11.0.25 | mamba install -c conda-forge openjdk=11.0.25 | Documentation |
InterProScan | 5.72-103.0 | wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/interproscan-5.72-103.0-64-bit.tar.gz | Documentation |
git-lfs | latest | mamba install -c conda-forge git-lfs | Documentation |
biopython | 1.85 | pip install biopython==1.85 | Documentation |
pandas | 2.3.1 | pip install pandas==2.3.1 | Documentation |
tqdm | 4.67.1 | pip install tqdm==4.67.1 | Documentation |
requests | 2.32.4 | pip install requests==2.32.4 | Documentation |
lxml | 6.0.0 | pip install lxml==6.0.0 | Documentation |
openpyxl | 3.1.5 | pip install openpyxl==3.1.5 | Documentation |
matplotlib | 3.10.3 | pip install matplotlib==3.10.3 | Documentation |
seaborn | 0.13.2 | pip install seaborn==0.13.2 | Documentation |
scikit-learn | 1.7.1 | pip install scikit-learn==1.7.1 | Documentation |
scipy | 1.16.1 | pip install scipy==1.16.1 | Documentation |
statsmodels | 0.14.5 | pip install statsmodels==0.14.5 | Documentation |
tidyverse | 2.0.0 | install.packages("tidyverse") | Documentation |
ggplot2 | 3.5.2 | install.packages("ggplot2") | Documentation |
patchwork | 1.3.1 | install.packages("patchwork") | Documentation |
ggpubr | 0.6.1 | install.packages("ggpubr") | Documentation |
ggtext | 0.1.2 | install.packages("ggtext") | Documentation |
ggrepel | 0.9.6 | install.packages("ggrepel") | Documentation |
ggsignif | 0.6.4 | install.packages("ggsignif") | Documentation |
pheatmap | 1.0.13 | install.packages("pheatmap") | Documentation |
ComplexHeatmap | 2.22.0 | BiocManager::install("ComplexHeatmap") | Documentation |
circlize | 0.4.16 | BiocManager::install("circlize") | Documentation |
gridExtra | 2.3 | install.packages("gridExtra") | Documentation |
RColorBrewer | 1.1-3 | install.packages("RColorBrewer") | Documentation |
VennDiagram | 1.7.3 | install.packages("VennDiagram") | Documentation |
UpSetR | 1.4.0 | install.packages("UpSetR") | Documentation |
phyloseq | 1.50.0 | BiocManager::install("phyloseq") | Documentation |
ANCOMBC | 2.8.1 | BiocManager::install("ANCOMBC") | Documentation |
microbiome | 1.28.0 | BiocManager::install("microbiome") | Documentation |
Shri Vishalini Rajaram, Priyanka Singh and Erliang Zeng
Shri Vishalini Rajaram, Priyanka Singh, & Erliang Zeng. (2025). biocoms/ta_systems_oral_mt: Bacterial Toxin-Antitoxin Systems in Oral Metatranscriptome (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.16763308
This repository is maintained by the Bioinformatics and Computational Systems Biology Lab at the University of Iowa.
If you notice any errors, have suggestions for improvement, or simply want to discuss aspects of the analysis, we genuinely welcome your feedback. Please feel free to open an issue or contact us. We are always happy to engage with the community.