A comprehensive, modular, and user-friendly pipeline for de novo genome assembly and annotation that runs seamlessly on Jupyter Notebook and Google Colab
# Clone the repository
git clone https://github.com/engkinandatama/genoflow.git
cd genoflow
# Launch Jupyter Notebook
jupyter notebook notebooks/GenoFlow_Complete_Pipeline.ipynb

GenoFlow is a modular bioinformatics pipeline designed to make genome assembly and annotation accessible. Whether you're new to genomics or an experienced bioinformatician, GenoFlow provides:
- 🔧 Modular Design: Mix and match tools based on your needs
- 🌐 Cloud-Ready: Run on Google Colab without local installation
- 📊 Interactive Visualizations: Rich plots and reports
- 🤖 Smart Automation: Auto-detection of optimal parameters
- 📚 Educational: Detailed explanations for learning
- 🔄 Reproducible: Complete provenance tracking
- Manual file upload (drag & drop)
- Direct download from NCBI SRA/ENA
- API integration for batch processing
- Cloud storage integration (Google Drive, AWS S3)
- Quality Control: FastQC, fastp, MultiQC
- Assembly: SPAdes, Flye, Unicycler, Canu
- Annotation: Prokka, Bakta, PGAP, antiSMASH
- Comparative Genomics: Pan-genome, phylogeny, ANI
- Visualization: Interactive plots, genome browsers
- Bacterial/Archaeal genomes
- Fungal genomes
- Plasmid analysis
- Metagenome assembly
- Comparative genomics studies
| Module | Description | Tools Available |
|---|---|---|
| 🔍 Data Input | Fetch/upload sequencing data | SRA-tools, wget, manual upload |
| ✅ Quality Control | Assess and improve read quality | FastQC, fastp, Trimmomatic |
| 🧬 Assembly | De novo genome assembly | SPAdes, Flye, Unicycler, Canu |
| 📊 Assembly QC | Evaluate assembly quality | QUAST, BUSCO, CheckM |
| 🏷️ Annotation | Gene prediction and annotation | Prokka, Bakta, PGAP |
| 🔬 Functional Analysis | Protein function prediction | eggNOG, InterPro, antiSMASH |
| 🌳 Comparative | Multi-genome comparisons | Roary, fastANI, Mauve |
| 📈 Visualization | Interactive plots and reports | Plotly, Bokeh, custom plots |
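
The Data Input module lists SRA-tools as one way to fetch reads. As a rough sketch of what that step involves outside the notebook, the helper below validates a run accession and builds the `prefetch` and `fasterq-dump` command lines. The function name `sra_fetch_commands`, the default output directory, and the thread count are illustrative assumptions, not part of GenoFlow's API. The commands are only constructed here, not executed.

```python
import re
from typing import List

# Run accessions from SRA, ENA and DDBJ start with SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def sra_fetch_commands(accession: str, outdir: str = "reads", threads: int = 4) -> List[List[str]]:
    """Return the SRA-tools commands that would download one run as FASTQ.

    Hypothetical helper for illustration; GenoFlow's own data-input code may differ.
    """
    if not RUN_ACCESSION.match(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return [
        # Download the run archive into the local SRA cache.
        ["prefetch", accession],
        # Convert it to FASTQ, writing paired reads to separate files.
        ["fasterq-dump", accession, "--split-files",
         "--threads", str(threads), "--outdir", outdir],
    ]


if __name__ == "__main__":
    for cmd in sra_fetch_commands("SRR12345678"):
        print(" ".join(cmd))
```

Once SRA-tools is installed, each returned command can be passed to `subprocess.run(cmd, check=True)`.
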
# Minimal setup for standard bacterial genome
pipeline = GenoFlow()
pipeline.set_mode("bacterial_quick")
pipeline.input_data("SRR12345678") # SRA accession
pipeline.run_assembly("spades")
pipeline.annotate("prokka")
pipeline.generate_report()

# Full pipeline with comparative analysis
pipeline = GenoFlow()
pipeline.set_mode("comprehensive")
pipeline.input_data(["genome1.fastq", "genome2.fastq"])
pipeline.run_qc(tools=["fastqc", "fastp"])
pipeline.run_assembly("unicycler", polish=True)
pipeline.annotate(["prokka", "antismash"])
pipeline.comparative_analysis()
pipeline.generate_interactive_report()

# Specialized for metagenomic data
pipeline = GenoFlow()
pipeline.set_mode("metagenome")
pipeline.input_data("metagenome_reads.fastq")
pipeline.run_assembly("metaspades")
pipeline.run_binning("metabat2")
pipeline.taxonomic_classification()

genoflow/
├── 📁 notebooks/
│ ├── GenoFlow_Complete_Pipeline.ipynb # Main pipeline notebook
│ ├── GenoFlow_Quick_Start.ipynb # Beginner-friendly version
│ ├── GenoFlow_Advanced.ipynb # Advanced features
│ └── GenoFlow_Examples/ # Example analyses
├── 📁 src/
│ ├── genoflow/ # Core pipeline modules
│ ├── utils/ # Utility functions
│ └── visualization/ # Plotting functions
├── 📁 data/
│ ├── examples/ # Example datasets
│ └── references/ # Reference databases
├── 📁 docs/
│ ├── user_guide.md # Detailed user guide
│ ├── api_reference.md # API documentation
│ └── tutorials/ # Step-by-step tutorials
├── 📁 tests/ # Unit tests
├── requirements.txt # Python dependencies
└── environment.yml # Conda environment
No installation required! Just click the Colab badge above.
# Clone repository
git clone https://github.com/engkinandatama/genoflow.git
cd genoflow
# Create conda environment
conda env create -f environment.yml
conda activate genoflow
# Install Python packages
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook

# Alternatively, run the Docker image
docker pull engkinandatama/genoflow:latest
docker run -p 8888:8888 engkinandatama/genoflow:latest

Minimum:
- Python 3.8+
- 4 GB RAM
- 10 GB free disk space
- Internet connection (for tool downloads)

Recommended:
- 16+ GB RAM
- 100+ GB free disk space
- Multi-core CPU (4+ cores)
- High-speed internet

Cloud options:
- Google Colab Pro (for large datasets)
- AWS/Azure instances (for production use)
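
Before a long run it can help to compare the machine against the minimum requirements listed above (Python 3.8+, 4 GB RAM, 10 GB free disk space). The following is a minimal sketch using only the standard library. The function names are hypothetical and not part of GenoFlow, and the thresholds use decimal gigabytes, which is an assumption about how the figures above are meant.

```python
import os
import shutil
import sys
from typing import Dict, Optional

GB = 10 ** 9  # decimal gigabyte (assumption; the README does not say GB vs GiB)


def total_ram_bytes() -> Optional[int]:
    """Return physical RAM in bytes, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some platforms lack these keys.
        return None


def check_minimum_requirements(path: str = ".") -> Dict[str, Optional[bool]]:
    """Compare this machine with the minimum requirements.

    A value of None means the property could not be measured.
    """
    ram = total_ram_bytes()
    free_disk = shutil.disk_usage(path).free
    return {
        "python_3_8_plus": sys.version_info >= (3, 8),
        "ram_4_gb": None if ram is None else ram >= 4 * GB,
        "disk_10_gb_free": free_disk >= 10 * GB,
    }


if __name__ == "__main__":
    for name, ok in check_minimum_requirements().items():
        print(f"{name}: {'unknown' if ok is None else ok}")
```

Machines sold with 4 GB of memory often report slightly less than that to the operating system, so a failed RAM check near the threshold is worth checking by hand.
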
- 📚 User Guide: Comprehensive usage instructions
- 🔧 API Reference: Function documentation
- 🎓 Tutorials: Step-by-step learning materials
- ❓ FAQ: Frequently asked questions
- 🐛 Troubleshooting: Common issues and solutions
- N50: 2.5 Mbp
- Total length: 4.2 Mbp
- Contigs: 25
- GC content: 42.3%
- Protein-coding genes: 3,847
- rRNA genes: 12
- tRNA genes: 67
- CRISPR arrays: 2
- Interactive genome browser
- Assembly quality plots
- Phylogenetic trees
- Functional category pie charts
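
For readers unfamiliar with the assembly metrics in the list above (N50, total length, contig count, GC content), the sketch below computes them from a list of contig sequences. It is a definitional illustration only. In the pipeline these figures would normally come from an assembly QC tool such as QUAST, and the function name `assembly_stats` is hypothetical.

```python
from typing import Dict, List


def assembly_stats(contigs: List[str]) -> Dict[str, float]:
    """Compute contig count, total length, N50 and GC content for contig sequences."""
    lengths = sorted((len(c) for c in contigs), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("no sequence data given")

    # N50: the length of the contig at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the assembly size.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # GC content: share of G and C bases over all bases, case-insensitive.
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "gc_percent": 100.0 * gc / total,
    }


if __name__ == "__main__":
    print(assembly_stats(["ATATATAT", "GCGCGC", "ATGC", "AT"]))
```
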
We welcome contributions! Please see our Contributing Guidelines.
- 🐛 Report bugs and issues
- 💡 Suggest new features
- 📝 Improve documentation
- 🔧 Add new tools/modules
- 🧪 Contribute example datasets
| Dataset Size | Assembly Time | Memory Usage | Colab Compatible |
|---|---|---|---|
| Small (< 500 MB) | 15-30 min | 2-4 GB | ✅ Yes |
| Medium (500 MB to 2 GB) | 1-3 hours | 4-8 GB | ✅ Yes (Pro) |
| Large (> 2 GB) | 3-12 hours | 8-32 GB | ❌ Local/Cloud |
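
The table above can be turned into a quick lookup when deciding where to run a dataset. This is a sketch only: `resource_estimate` is a hypothetical helper, the size boundaries are read as decimal megabytes and gigabytes (an assumption), and the returned values simply repeat the table rows rather than measuring anything.

```python
from typing import Dict

MB = 10 ** 6
GB = 10 ** 9


def resource_estimate(input_bytes: int) -> Dict[str, str]:
    """Map total input size to the matching row of the performance table."""
    if input_bytes < 0:
        raise ValueError("input size cannot be negative")
    if input_bytes < 500 * MB:
        return {"tier": "small", "assembly_time": "15-30 min",
                "memory": "2-4 GB", "colab": "yes"}
    if input_bytes <= 2 * GB:
        return {"tier": "medium", "assembly_time": "1-3 hours",
                "memory": "4-8 GB", "colab": "yes (Pro)"}
    return {"tier": "large", "assembly_time": "3-12 hours",
            "memory": "8-32 GB", "colab": "no (local or cloud)"}


if __name__ == "__main__":
    print(resource_estimate(750 * MB))
```

For a real input, the size can be taken from `os.path.getsize("genome1.fastq")`, summed over all read files.
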
If you use GenoFlow in your research, please cite:
@software{genoflow2025,
title={GenoFlow: Modular De Novo Genome Assembly and Annotation Pipeline},
author={Nandatama, Engki},
year={2025},
url={https://github.com/engkinandatama/genoflow},
version={1.0.0}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Integrated Tools: The developers of SPAdes, Prokka, QUAST, and the other tools GenoFlow incorporates
- Communities: Bioinformatics Stack Exchange, Galaxy Project
- Contributors: See CONTRIBUTORS.md
- 🐛 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- Prokka - Rapid prokaryotic genome annotation
- SPAdes - Genome assembly toolkit
- QUAST - Quality assessment tool
- Roary - Pan genome pipeline