A comprehensive Snakemake workflow for analyzing codon usage patterns from genomic and transcriptomic data.
This project provides a scalable, reproducible workflow for analyzing codon usage bias and patterns. The workflow can process:
- Raw sequencing data (FASTQ)
- Assembled transcriptomes (FASTA)
- CDS sequences (FASTA)
- Pre-computed codon usage tables (from CodonW or other tools)
The workflow integrates multiple analysis steps into a unified pipeline with standardized output formats.
- Raw sequencing data: FASTQ files (paired-end or single-end)
- Assembled transcripts: FASTA files
- CDS sequences: FASTA files with coding sequences
- Pre-computed tables: CodonW output files or custom tables
- Codon Usage Calculation: Compute RSCU, CAI, ENC, and other metrics
- Codon Bias Analysis: Identify preferred and avoided codons
- tRNA Adaptation Index: Calculate tAI for translation efficiency
- Correlation Analysis: Relate codon usage to gene expression
- Visualization: Generate publication-ready plots and heatmaps
- Works with any organism for which genomic or transcriptomic data are available
- Built-in genetic code tables for different organisms
- Customizable codon tables for non-standard genetic codes
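
To make the codon usage calculation step listed above more concrete: RSCU for a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid. The sketch below is not taken from the workflow's scripts. The function name `rscu` and the small two-amino-acid codon table are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical subset of the standard genetic code (Phe and Leu only),
# kept small so the example is easy to follow.
CODON_TO_AA = {
    "TTT": "F", "TTC": "F",
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
}

def rscu(cds, codon_to_aa=CODON_TO_AA):
    """Return {codon: RSCU} for an in-frame coding sequence."""
    # Split into codons, dropping any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in codon_to_aa)

    # Group synonymous codons by amino acid.
    families = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent, so RSCU is undefined for it
        mean = total / len(family)
        for c in family:
            values[c] = counts[c] / mean
    return values

result = rscu("TTTTTCTTCCTGCTGCTGTTA")
for codon in sorted(result):
    print(codon, round(result[codon], 3))
```

With equal usage every codon in a family scores 1.0, so values above 1 mark codons used more often than expected and values below 1 mark codons used less often.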
git clone https://github.com/autosomal/codon_usage_analyzer.git
cd codon_usage_analyzer
conda env create -f environment.yml
conda activate codon_analyzer

Edit the config.yaml file to set your parameters:
# Input data configuration
input:
  type: "fasta"  # Options: fastq, fasta, cds, codonw
  files: "data/transcripts.fasta"
  # For FASTQ input
  # fastq_files: ["data/sample1_R1.fastq.gz", "data/sample1_R2.fastq.gz"]
  # reference: "reference/genome.fa"
  # gtf: "reference/annotation.gtf"

# Genetic code configuration
genetic_code:
  table: 1  # Standard genetic code
  # For custom genetic code, provide a dictionary:
  # custom_table:
  #   TTT: "F"
  #   TTC: "F"
  #   ...

# Analysis parameters
analysis:
  calculate_rscu: true
  calculate_cai: true
  calculate_enc: true
  calculate_tai: true
  correlation_analysis: true

# Output configuration
output:
  base_dir: "results"
  plots_dir: "results/plots"
  tables_dir: "results/tables"
  reports_dir: "results/reports"

# Resource configuration
resources:
  threads: 16
  memory: "32G"

snakemake --cores 16 --use-conda

snakemake --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}MB" \
  --jobs 10 --use-conda

codon_usage_analyzer/
├── README.md # Project documentation
├── environment.yml # Conda environment configuration
├── config.yaml # Workflow configuration
├── Snakefile # Main Snakemake workflow
├── rules/ # Snakemake rules
│ ├── preprocessing.smk # Data preprocessing rules
│ ├── codon_calculation.smk # Codon usage calculation rules
│ ├── bias_analysis.smk # Codon bias analysis rules
│ ├── tai_analysis.smk # tAI calculation rules
│ ├── visualization.smk # Visualization rules
│ └── reporting.smk # Report generation rules
├── scripts/ # Analysis scripts
│ ├── *.py # Python analysis scripts
│ └── *.R # R visualization scripts
├── envs/ # Conda environment files
│ ├── base.yaml # Base environment
│ ├── bioinformatics.yaml # Bioinformatics tools environment
│ └── r_analysis.yaml # R analysis environment
├── reference/ # Reference data directory
│ ├── genetic_codes/ # Genetic code tables
│ └── trna_data/ # tRNA gene data
├── data/ # Input data directory
└── results/ # Output results directory
- Quality Control: FastQC for sequencing data
- Alignment: STAR or HISAT2 for read alignment
- Assembly: StringTie for transcript assembly
- CDS Extraction: Extract coding sequences from annotations
- Relative Synonymous Codon Usage (RSCU)
- Codon Adaptation Index (CAI)
- Effective Number of Codons (ENC)
- Codon Bias Index (CBI)
- Frequency of Optimal Codons (Fop)
- Preferred/Avoided Codons: Identify optimal codons
- Codon Pair Bias: Analyze codon pair usage
- Correlation with Gene Expression: Relate codon usage to expression levels
- Cross-Species Comparison: Compare codon usage across organisms
- tRNA Gene Copy Number: Use known tRNA gene counts
- tRNA Adaptation Calculation: Compute tAI for each gene
- Translation Efficiency Prediction: Predict translation rates
- Codon Usage Heatmaps: Visualize codon preferences
- Correlation Plots: Relate codon usage to other metrics
- PCA Analysis: Dimensionality reduction of codon usage
- Phylogenetic Trees: Based on codon usage patterns
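
The tRNA adaptation calculation listed above can be sketched in a simplified form. This is not the workflow's implementation: the function name `simplified_tai`, the copy numbers, and the weighting scheme are assumptions. In particular, this version derives each codon's weight directly from a per-codon tRNA gene copy number and ignores wobble pairing, which a full tAI calculation accounts for.

```python
import math

def simplified_tai(cds, trna_copies):
    """Geometric mean of per-codon weights w = copies / max(copies).

    Simplifications (assumptions of this sketch): wobble pairing is
    ignored, and codons without a positive copy number are skipped.
    """
    max_copies = max(trna_copies.values())
    weights = {c: n / max_copies for c, n in trna_copies.items() if n > 0}

    # Split into codons, dropping any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]

    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon in the sequence has a tRNA weight")
    return math.exp(sum(logs) / len(logs))

# Made-up copy numbers, for illustration only.
copies = {"TTC": 10, "CTG": 5, "TTA": 2}
score = simplified_tai("TTCCTGTTA", copies)
print(round(score, 4))
```

A gene built only from the best-supported codon scores 1.0 under this scheme, and scores fall as codons with fewer matching tRNA genes are used.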
- results/tables/codon_usage_table.tsv: Comprehensive codon usage statistics
- results/tables/rscu_values.tsv: RSCU values for all codons
- results/tables/cai_values.tsv: CAI values for all genes
- results/tables/enc_values.tsv: ENC values for all genes
- results/tables/preferred_codons.tsv: Identified preferred codons
- results/plots/codon_usage_heatmap.pdf: Heatmap of codon usage
- results/reports/final_report.html: Comprehensive analysis report
Codon Usage Table:
Codon  AminoAcid  Count  Frequency  RSCU  CAI   ENC
TTT    F          1234   0.056      0.89  0.76  45.2
TTC    F          1567   0.071      1.11  0.89  45.2
TTA    L          890    0.040      0.32  0.23  52.1
...
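
A table in this layout can be read with Python's standard `csv` module. The sketch below reuses the three example rows above and picks the codon with the highest RSCU per amino acid. It assumes the file is tab-separated with the header shown (as the .tsv extension suggests); the function name `top_codon_per_amino_acid` is hypothetical.

```python
import csv
import io

# The three example rows from the table above, assumed tab-separated.
TABLE = (
    "Codon\tAminoAcid\tCount\tFrequency\tRSCU\tCAI\tENC\n"
    "TTT\tF\t1234\t0.056\t0.89\t0.76\t45.2\n"
    "TTC\tF\t1567\t0.071\t1.11\t0.89\t45.2\n"
    "TTA\tL\t890\t0.040\t0.32\t0.23\t52.1\n"
)

def top_codon_per_amino_acid(handle):
    """Return {amino_acid: codon with the highest RSCU} from a TSV handle."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        aa, value = row["AminoAcid"], float(row["RSCU"])
        if aa not in best or value > best[aa][1]:
            best[aa] = (row["Codon"], value)
    return {aa: codon for aa, (codon, _) in best.items()}

preferred = top_codon_per_amino_acid(io.StringIO(TABLE))
print(preferred)  # {'F': 'TTC', 'L': 'TTA'}
```

For a real run, the same function could be given `open("results/tables/codon_usage_table.tsv", newline="")` instead of the in-memory string.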
input:
  # Input type: fastq, fasta, cds, or codonw
  type: "fasta"
  # For FASTQ input
  # fastq_files: ["data/sample1_R1.fastq.gz", "data/sample1_R2.fastq.gz"]
  # reference: "reference/genome.fa"
  # gtf: "reference/annotation.gtf"
  # For FASTA input
  files: "data/transcripts.fasta"
  # For CDS input
  # cds_files: "data/cds_sequences.fasta"
  # For CodonW input
  # codonw_files: "data/codonw_output.blk"

genetic_code:
  # Use standard genetic code table (1-25)
  table: 1
  # For custom genetic code
  # custom_table:
  #   TTT: "F"
  #   TTC: "F"
  #   TTA: "L"
  #   TTG: "L"
  #   ...

analysis:
  # Basic codon usage metrics
  calculate_rscu: true
  calculate_cai: true
  calculate_enc: true
  calculate_cbi: true
  calculate_fop: true
  # Advanced analysis
  codon_pair_analysis: true
  correlation_analysis: true
  phylogenetic_analysis: false
  # tAI calculation
  calculate_tai: true
  trna_data: "reference/trna_data/human_trna_counts.tsv"
  # Visualization
  generate_heatmaps: true
  generate_correlation_plots: true
  generate_pca: true

- Create a custom genetic code file in reference/genetic_codes/
- Update the config file to use your custom table:
genetic_code:
  custom_table: "reference/genetic_codes/my_custom_code.tsv"

- Prepare a tRNA gene count file with columns: Codon, tRNA_Copies
- Update the config file:
analysis:
  calculate_tai: true
  trna_data: "reference/trna_data/my_organism_trna.tsv"

For processing multiple samples, use wildcards in your config:
input:
  type: "fastq"
  fastq_files: "data/{sample}_R{read}.fastq.gz"
  samples: ["sample1", "sample2", "sample3"]
  reads: [1, 2]

# Recreate the conda environment
conda env create -f environment.yml
conda activate codon_analyzer
# Check if all prerequisites are installed
snakemake --list-conda-envs

- Reduce the number of threads in config.yaml
- Increase the memory limit
- Use cluster execution with appropriate resource allocation
- Ensure you're using the correct genetic code table for your organism
- For custom genetic codes, verify the codon to amino acid mappings
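
One way to verify a custom codon-to-amino-acid mapping before running the workflow is a small completeness check like the one below. It is a sketch, not part of the project: the function name `check_codon_table` is hypothetical, and it assumes the table is available as a Python dictionary with DNA codons as keys, one-letter amino acid codes as values, and `*` marking stop codons.

```python
from itertools import product

# Assumption: 20 standard one-letter amino acid codes plus "*" for stop.
VALID_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY*")

def check_codon_table(table):
    """Return a list of problems; an empty list means the table looks complete."""
    problems = []
    expected = {"".join(p) for p in product("TCAG", repeat=3)}
    missing = sorted(expected - set(table))
    extra = sorted(set(table) - expected)
    if missing:
        problems.append("missing codons: " + ", ".join(missing))
    if extra:
        problems.append("unexpected keys: " + ", ".join(extra))
    for codon, aa in sorted(table.items()):
        if aa not in VALID_SYMBOLS:
            problems.append(f"{codon} maps to invalid symbol {aa!r}")
    if "*" not in table.values():
        problems.append("no stop codon defined")
    return problems

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G at each position.
CODONS = ["".join(p) for p in product("TCAG", repeat=3)]
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
standard = dict(zip(CODONS, AMINO_ACIDS))

broken = dict(standard)
del broken["TTT"]      # simulate a missing entry
broken["ATG"] = "?"    # simulate a typo in the amino acid column

print(check_codon_table(standard))
for problem in check_codon_table(broken):
    print(problem)
```

The same check applies to a table loaded from a file once it has been read into a dictionary.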
This project is licensed under the MIT License - see the LICENSE file for details.
- Original authors of codon usage analysis methods
- Contributors to the bioinformatics community
- Snakemake development team for the workflow management system
For questions or issues, please:
- Check the GitHub Issues page
- Contact the maintainer at [email protected]
- Refer to the documentation for detailed parameter explanations