A Snakemake workflow for genome assembly.
Why colora? 🐍 Colora means "snake" in the Sardinian language 🐍
Input reads: HiFi reads, optionally ONT reads, and Hi-C reads. Other inputs: oatk database, NCBI FCS database (optional), BUSCO database (to be implemented).
The usage of this workflow is described in the Snakemake Workflow Catalog.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).
- place raw HiFi reads in resources/raw_hifi
- place the oatk database of interest from the OatkDB repository (https://github.com/c-zhou/OatkDB) in resources/oatkDB
- place raw Hi-C reads in resources/raw_hic
- place the NCBI database for FCS-GX in resources/gx_db (optional; this needs ~500 GB of disk space and a large amount of RAM)
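To create this layout in one go, something along these lines should work (a sketch: it only makes the directories listed above, and gx_db is needed only if you plan to run FCS-GX):

# run from the colora root directory
mkdir -p resources/raw_hifi resources/raw_hic resources/oatkDB resources/gx_db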
How to run colora:
# full run:
snakemake --software-deployment-method conda --snakefile workflow/Snakefile --cores all
# dry run, to preview the jobs without executing anything:
snakemake --software-deployment-method conda --snakefile workflow/Snakefile --cores all --dry-run
# for the cluster:
snakemake --software-deployment-method conda --conda-frontend mamba --snakefile workflow/Snakefile --cores 100
Before executing the command, ensure you have adjusted your config.yaml appropriately.
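Clusters can also be targeted through a Snakemake profile (Slurm integration is still on the roadmap below, so treat this as a sketch; profiles/slurm is a hypothetical directory holding your scheduler settings):

# profiles/slurm is a placeholder for your own profile directory
snakemake --profile profiles/slurm --software-deployment-method conda --snakefile workflow/Snakefile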
Test the pipeline:
- Download test data
- Download oatk DB
git clone https://github.com/c-zhou/OatkDB.git
cd colora/resources
mkdir oatkDB
cp path/to/where/you/cloned/OatkDB/v20230921/dikarya* oatkDB/
- Download FCS-GX test database
You can skip this step if you are not going to run the decontamination step with FCS-GX.
mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gx_test_db
cd gx_test_db
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/test-only/test-only.manifest --dir ./test-only
- Run the test pipeline
snakemake --configfile config/config_test.yaml --software-deployment-method conda --snakefile workflow/Snakefile --cores 4
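As for the main workflow, a dry run first previews the test jobs without executing anything (the same flags as above, plus --dry-run):

snakemake --configfile config/config_test.yaml --software-deployment-method conda --snakefile workflow/Snakefile --cores 4 --dry-run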
- The workflow will appear in the snakemake-workflow-catalog once it has been made public. Then the link under "Usage" will point to the usage instructions if <owner> and <repo> were correctly set.
- Rule for NanoPlot
- Rule for fastp
- Rules for the Arima pipeline (split into several rules)
- Rule for YaHS
- integrate the snakemake report in the workflow: not necessary
- input / output: hardcoded is okay
- test dataset
- test config file
- test the possibility to add ONT reads as an optional param in hifiasm
- test the possibility to add Hi-C reads as optional params in hifiasm: the output file names change in this case and need more study; this probably needs a separate rule (see the hifiasm sketch after this list)
- package versions: create stable yaml files with conda export
- add singularity and docker as option for environment management
- implement NCBI FCS (decontamination) as an optional rule (orange path in the scheme above)
- make purging steps optional
- slurm integration (profile)
- setting of resources for each rule
- Rules purge_dups.smk and purge_dups_alt.smk: redirecting outputs
- implement assemblyQC: waiting for a new Merqury release to make a new conda recipe (light green path above)
- formatting and linting to be fixed according to Snakemake requirements
- log files: some of them are empty because stderr and stdout cannot be redirected to the log file for some tools
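For the two hifiasm items above, a sketch of how optional ONT and Hi-C reads could be passed (the --ul, --h1, and --h2 flags are from hifiasm's CLI; all file names and the thread count are hypothetical placeholders):

# HiFi only: primary contigs end up in asm.bp.p_ctg.gfa
hifiasm -o asm -t 16 resources/raw_hifi/hifi.fastq.gz
# with optional ONT reads (--ul):
hifiasm -o asm -t 16 --ul ont.fastq.gz resources/raw_hifi/hifi.fastq.gz
# with Hi-C reads (--h1/--h2): outputs are renamed to asm.hic.*.gfa,
# which is why this case probably needs a separate rule
hifiasm -o asm -t 16 --h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz resources/raw_hifi/hifi.fastq.gz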
Notes:
- Arima pipeline: changes compared to the original pipeline:
  - conda environments with the needed tools are created, so there is no need to specify the tools' paths
  - the PREFIX line and the -p $PREFIX option were removed from the bwa command; they are not necessary and cause problems when reading files
  - the -M flag was added to the bwa mem command in steps 1.A and 1.B (see the sketch below)
  - the pipeline was split into several rules
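For reference, a sketch of the modified mapping command of steps 1.A/1.B (file names and thread count are placeholders; -M is bwa's flag to mark shorter split hits as secondary, for Picard compatibility):

# map one Hi-C read end at a time, as in the Arima pipeline
bwa mem -M -t 16 reference.fasta hic_R1.fastq.gz | samtools view -b -o aligned_R1.bam -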