🧬 GenoFlow: Modular De Novo Genome Assembly & Annotation Pipeline

A comprehensive, modular, and user-friendly pipeline for de novo genome assembly and annotation that runs in Jupyter Notebook and Google Colab.

🚀 Quick Start

# Clone the repository
git clone https://github.com/engkinandatama/genoflow.git
cd genoflow

# Launch Jupyter Notebook
jupyter notebook notebooks/GenoFlow_Complete_Pipeline.ipynb

📋 Overview

GenoFlow is a modular bioinformatics pipeline designed to make genome assembly and annotation accessible. Whether you're new to the field or an experienced bioinformatician, GenoFlow provides:

🔧 Modular Design: Mix and match tools based on your needs
🌐 Cloud-Ready: Run on Google Colab without local installation
📊 Interactive Visualizations: Rich plots and reports
🤖 Smart Automation: Auto-detection of optimal parameters
📚 Educational: Detailed explanations for learning
🔄 Reproducible: Complete provenance tracking

✨ Key Features

📥 Flexible Data Input

Manual file upload (drag & drop)
Direct download from NCBI SRA/ENA
API integration for batch processing
Cloud storage integration (Google Drive, AWS S3)
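
The input options above mean the pipeline has to tell an SRA/ENA run accession apart from a local file path. A minimal sketch of that check follows. The `classify_input` helper is hypothetical and not part of GenoFlow's documented API, and the pattern assumes run accessions of the form SRR, ERR, or DRR followed by at least six digits.

```python
import re

# Run accessions in the INSDC archives (NCBI SRA, ENA, DDBJ) start with
# SRR, ERR or DRR followed by digits; six or more digits is assumed here.
RUN_ACCESSION = re.compile(r"^[SED]RR\d{6,}$")


def classify_input(source: str) -> str:
    """Return 'accession' for an SRA/ENA run ID, otherwise 'file'."""
    if RUN_ACCESSION.match(source.strip()):
        return "accession"
    return "file"


print(classify_input("SRR12345678"))    # accession
print(classify_input("genome1.fastq"))  # file
```

Anything that is not recognized as an accession is treated as a file path, so a mistyped accession fails later with a clear "file not found" error rather than triggering a download.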

🔬 Comprehensive Analysis

Quality Control: FastQC, fastp, MultiQC
Assembly: SPAdes, Flye, Unicycler, Canu
Annotation: Prokka, Bakta, PGAP, antiSMASH
Comparative Genomics: Pan-genome, phylogeny, ANI
Visualization: Interactive plots, genome browsers

🎯 Multiple Use Cases

Bacterial/Archaeal genomes
Fungal genomes
Plasmid analysis
Metagenome assembly
Comparative genomics studies

📚 Pipeline Modules

Module	Description	Tools Available
🔍 Data Input	Fetch/upload sequencing data	SRA-tools, wget, manual upload
✅ Quality Control	Assess and improve read quality	FastQC, fastp, Trimmomatic
🧬 Assembly	De novo genome assembly	SPAdes, Flye, Unicycler, Canu
📊 Assembly QC	Evaluate assembly quality	QUAST, BUSCO, CheckM
🏷️ Annotation	Gene prediction and annotation	Prokka, Bakta, PGAP
🔬 Functional Analysis	Protein function prediction	eggNOG, InterPro, antiSMASH
🌳 Comparative	Multi-genome comparisons	Roary, FastANI, Mauve
📈 Visualization	Interactive plots and reports	Plotly, Bokeh, custom plots
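
Because each module wraps external tools, it can help to confirm which ones are installed before starting a run. The sketch below is one way to do that with the standard library. The executable names are assumptions based on each tool's usual command-line entry point, and `missing_tools` is a hypothetical helper rather than a GenoFlow function.

```python
import shutil

# Executable names are assumptions based on each tool's usual command-line
# entry point; adjust them to match your installation.
MODULE_TOOLS = {
    "Quality Control": ["fastqc", "fastp"],
    "Assembly": ["spades.py", "flye", "unicycler", "canu"],
    "Assembly QC": ["quast.py", "busco", "checkm"],
    "Annotation": ["prokka", "bakta"],
}


def missing_tools(module_tools, which=shutil.which):
    """Map each module to the executables that are not found on PATH."""
    missing = {}
    for module, tools in module_tools.items():
        absent = [tool for tool in tools if which(tool) is None]
        if absent:
            missing[module] = absent
    return missing


for module, tools in missing_tools(MODULE_TOOLS).items():
    print(f"{module}: not found on PATH: {', '.join(tools)}")
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be replaced in tests.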

🎮 Usage Examples

Quick Bacterial Assembly

# Minimal setup for a standard bacterial genome
from genoflow import GenoFlow  # import path assumed from the src/genoflow/ layout

pipeline = GenoFlow()
pipeline.set_mode("bacterial_quick")
pipeline.input_data("SRR12345678")  # SRA accession
pipeline.run_assembly("spades")
pipeline.annotate("prokka")
pipeline.generate_report()

Comprehensive Analysis

# Full pipeline with comparative analysis
from genoflow import GenoFlow  # import path assumed from the src/genoflow/ layout

pipeline = GenoFlow()
pipeline.set_mode("comprehensive")
pipeline.input_data(["genome1.fastq", "genome2.fastq"])
pipeline.run_qc(tools=["fastqc", "fastp"])
pipeline.run_assembly("unicycler", polish=True)
pipeline.annotate(["prokka", "antismash"])
pipeline.comparative_analysis()
pipeline.generate_interactive_report()

Metagenome Assembly

# Specialized for metagenomic data
from genoflow import GenoFlow  # import path assumed from the src/genoflow/ layout

pipeline = GenoFlow()
pipeline.set_mode("metagenome")
pipeline.input_data("metagenome_reads.fastq")
pipeline.run_assembly("metaspades")
pipeline.run_binning("metabat2")
pipeline.taxonomic_classification()

📁 Repository Structure

genoflow/
├── 📁 notebooks/
│   ├── GenoFlow_Complete_Pipeline.ipynb    # Main pipeline notebook
│   ├── GenoFlow_Quick_Start.ipynb          # Beginner-friendly version
│   ├── GenoFlow_Advanced.ipynb             # Advanced features
│   └── GenoFlow_Examples/                  # Example analyses
├── 📁 src/
│   ├── genoflow/                           # Core pipeline modules
│   ├── utils/                              # Utility functions
│   └── visualization/                      # Plotting functions
├── 📁 data/
│   ├── examples/                           # Example datasets
│   └── references/                         # Reference databases
├── 📁 docs/
│   ├── user_guide.md                       # Detailed user guide
│   ├── api_reference.md                    # API documentation
│   └── tutorials/                          # Step-by-step tutorials
├── 📁 tests/                               # Unit tests
├── requirements.txt                        # Python dependencies
└── environment.yml                         # Conda environment

💻 Installation

Option 1: Google Colab (Recommended for Beginners)

No installation required! Open the main pipeline notebook in Google Colab.

Option 2: Local Installation

# Clone repository
git clone https://github.com/engkinandatama/genoflow.git
cd genoflow

# Create conda environment
conda env create -f environment.yml
conda activate genoflow

# Install Python packages
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook

Option 3: Docker

docker pull engkinandatama/genoflow:latest
docker run -p 8888:8888 engkinandatama/genoflow:latest

🔧 System Requirements

Minimum Requirements

Python 3.8+
4 GB RAM
10 GB free disk space
Internet connection (for tool downloads)
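
The minimum requirements above can be checked from Python before launching the notebook. The following is a rough sketch using only the standard library. `check_minimums` and `total_ram_gib` are hypothetical helpers, the thresholds are compared in GiB, and the RAM check is skipped on platforms where `os.sysconf` cannot report physical memory.

```python
import os
import shutil
import sys

GIB = 1024 ** 3


def total_ram_gib():
    """Return physical RAM in GiB, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        return None  # e.g. Windows, where os.sysconf is unavailable


def check_minimums(python_version, ram_gib, free_disk_gib):
    """Compare measured values with the minimum requirements."""
    problems = []
    if tuple(python_version[:2]) < (3, 8):
        problems.append("Python 3.8+ required")
    if ram_gib is not None and ram_gib < 4:
        problems.append("at least 4 GB RAM required")
    if free_disk_gib < 10:
        problems.append("at least 10 GB free disk space required")
    return problems


free_gib = shutil.disk_usage(".").free / GIB
issues = check_minimums(sys.version_info, total_ram_gib(), free_gib)
print("OK" if not issues else "\n".join(issues))
```

A machine sold as "4 GB" often reports slightly less usable memory, so the RAM threshold may need loosening in practice.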

Recommended Specifications

16+ GB RAM
100+ GB free disk space
Multi-core CPU (4+ cores)
High-speed internet

Cloud Computing

Google Colab Pro (for large datasets)
AWS/Azure instances (for production use)

📖 Documentation

📚 User Guide: Comprehensive usage instructions
🔧 API Reference: Function documentation
🎓 Tutorials: Step-by-step learning materials
❓ FAQ: Frequently asked questions
🐛 Troubleshooting: Common issues and solutions

🌟 Example Outputs

Assembly Statistics

N50: 2.5 Mbp
Total length: 4.2 Mbp
Contigs: 25
GC content: 42.3%
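
For readers new to these metrics, the sketch below shows how they are conventionally computed from a set of contig sequences. It illustrates the standard definitions and is not GenoFlow's own implementation (the pipeline lists QUAST for assembly QC). The `assembly_stats` function and the toy sequences are made up for the example.

```python
def assembly_stats(contigs):
    """Compute basic statistics for a list of contig sequences (strings)."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # N50: length of the contig at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": round(100 * gc / total, 1) if total else 0.0,
    }


stats = assembly_stats(["ATGCGC" * 5, "ATAT" * 3, "GGCC"])
print(stats)
```

In the toy input the longest contig (30 bp) already covers more than half of the 46 bp total, so N50 equals 30.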

Annotation Results

Protein-coding genes: 3,847
rRNA genes: 12
tRNA genes: 67
CRISPR arrays: 2
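
Counts like these can be derived from an annotation file by tallying feature types. The sketch below assumes the annotation is available as GFF3 text, where the third tab-separated column holds the feature type. `count_features` is a hypothetical helper, and the example records are invented.

```python
from collections import Counter


def count_features(gff_text):
    """Count feature types (column 3) in GFF3 text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # Sequence section follows; no more feature lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Invented records for illustration only.
example = "\n".join([
    "##gff-version 3",
    "contig_1\tprodigal\tCDS\t1\t900\t.\t+\t0\tID=gene_1",
    "contig_1\tprodigal\tCDS\t1000\t1900\t.\t-\t0\tID=gene_2",
    "contig_1\taragorn\ttRNA\t2000\t2075\t.\t+\t.\tID=gene_3",
    "##FASTA",
    ">contig_1",
])
print(count_features(example))
```

Which feature type represents a CRISPR array depends on the annotation tool, so check its output format before relying on a specific type name.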

Visualizations

Interactive genome browser
Assembly quality plots
Phylogenetic trees
Functional category pie charts

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines.

Ways to Contribute

🐛 Report bugs and issues
💡 Suggest new features
📝 Improve documentation
🔧 Add new tools/modules
🧪 Contribute example datasets

📊 Performance Benchmarks

Dataset Size	Assembly Time	Memory Usage	Colab Compatible
Small (< 500 MB)	15-30 min	2-4 GB	✅ Yes
Medium (500 MB-2 GB)	1-3 hours	4-8 GB	✅ Yes (Pro)
Large (> 2 GB)	3-12 hours	8-32 GB	❌ Local/Cloud
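
The table above can be turned into a simple lookup that suggests where to run a dataset based on its size. This is only a sketch. `recommend_environment` is a hypothetical helper, the size boundaries assume 1 GB = 1000 MB, and "Yes (Pro)" is read as Google Colab Pro.

```python
def recommend_environment(size_mb):
    """Map input size in MB to the matching row of the benchmark table."""
    if size_mb < 500:
        return {"tier": "small", "time": "15-30 min", "memory_gb": (2, 4),
                "runs_on": "Google Colab"}
    if size_mb <= 2000:
        return {"tier": "medium", "time": "1-3 hours", "memory_gb": (4, 8),
                "runs_on": "Google Colab Pro"}
    return {"tier": "large", "time": "3-12 hours", "memory_gb": (8, 32),
            "runs_on": "local or cloud machine"}


for size in (120, 1500, 6000):
    row = recommend_environment(size)
    print(f"{size} MB -> {row['tier']}: {row['time']}, run on {row['runs_on']}")
```

The input size could be obtained with `os.path.getsize(path) / 1e6` for a local FASTQ file.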

🏆 Citation

If you use GenoFlow in your research, please cite:

@software{genoflow2025,
  title={GenoFlow: Modular De Novo Genome Assembly and Annotation Pipeline},
  author={Nandatama, Engki},
  year={2025},
  url={https://github.com/engkinandatama/genoflow},
  version={1.0.0}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Tools Integration: SPAdes, Prokka, QUAST, and all incorporated tools
Communities: Bioinformatics Stack Exchange, Galaxy Project
Contributors: See CONTRIBUTORS.md

📞 Contact & Support

🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

🔗 Related Projects

Prokka - Rapid prokaryotic genome annotation
SPAdes - Genome assembly toolkit
QUAST - Quality assessment tool
Roary - Pan genome pipeline

🌟 Star this repository if you find it helpful! 🌟
