https://docs.google.com/presentation/d/1D19ZAHnwLBRGX1Cvk_spNj84Ww_miAjjMLMp3z_Ix3M/edit?usp=sharing
While metagenomic assembly has improved significantly since the early days of the Human Microbiome Project (HMP), it remains confounded by intragenomic and intergenomic repetitive sequences. Individual reads that span microbial strains (either via long-read technology or clever techniques for generating synthetic long reads) are needed to fully resolve variation within a given community. Recent benchmarking studies report error rates in metagenomic assembly of between 1 and 10 major structural variant (SV) errors per Mb of assembled sequence. Thus, a major concern is that detected structural variants could result from misassembled data rather than from actual strain-specific variation. The goal of this project is to identify errors in metagenomic assemblies based on short- and long-read mapping, in the hope of eliminating some of the uncertainty and error in metagenomic studies. Examples of errors we are MASQing are: inversions, chimeras (translocations), indels (<50 bp), and replacements (large substitutions). We are creating a containerized quality control pipeline called MASQ.
MasQ uses Python 3.x.
Requirements and dependencies for each major step:
- ZymoBIOMICS Microbial Community Standards Assemblies (10 genomes; 5 gram positive, 3 gram negative, 2 yeast)
  - Has a known reference
  - Organisms in the synthetic community: Bacillus subtilis, Cryptococcus neoformans, Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Pseudomonas aeruginosa, Saccharomyces cerevisiae, Salmonella enterica, Staphylococcus aureus
  - Illumina paired-end short reads from IMMSA dataset
    - Sequencing depth:
    - Average read length:
  - Oxford Nanopore long reads
    - Sequencing depth:
    - Average read length:
- Shakya Assembly (assembled using MEGAHIT and metaSPAdes)
  - Has a known reference
  - Organisms in the synthetic community (bacteria and archaea): Acidobacterium capsulatum, Aciduliprofundum boonei, Akkermansia muciniphila, Archaeoglobus fulgidus, Bacteroides thetaiotaomicron, Bacteroides vulgatus, Bordetella bronchiseptica, Burkholderia xenovorans LB400, Caldicellulosiruptor bescii, Caldicellulosiruptor saccharolyticus, Chlorobium limicola, Chlorobium phaeobacteroides, Chlorobium phaeovibrioides, Chlorobium tepidum, Chloroflexus aurantiacus, Clostridium thermocellum, Deinococcus radiodurans, Desulfovibrio piger, Desulfovibrio vulgaris, Dictyoglomus turgidum, Enterococcus faecalis, Fusobacterium nucleatum, Gemmatimonas aurantiaca, Geobacter sulfurreducens, Haloferax volcanii, Herpetosiphon aurantiacus, Hydrogenobaculum, Ignicoccus hospitalis, Leptothrix cholodnii, Methanocaldococcus jannaschii, Methanococcus maripaludis C5, Methanococcus maripaludis S2, Methanopyrus kandleri, Methanosarcina acetivorans C2A, Nanoarchaeum equitans, Nitrosomonas europaea, Nostoc sp. PCC 7120, Pelodictyon phaeoclathratiforme, Persephonella marina EX-H1, Porphyromonas gingivalis, Pyrobaculum aerophilum IM2, Pyrobaculum arsenaticum, Pyrobaculum calidifontis, Pyrococcus furiosus, Pyrococcus horikoshii, Rhodopirellula baltica, Ruegeria pomeroyi, Salinispora arenicola, Salinispora tropica, Shewanella baltica OS185, Shewanella baltica OS223, Sulfitobacter sp. EE-36, Sulfitobacter sp. NAS-14.1, Sulfolobus tokodaii, Sulfurihydrogenibium sp. YO3AOP1, Sulfurihydrogenibium yellowstonense, Thermoanaerobacter pseudethanolicus, Thermotoga neapolitana DSM 4359, Thermotoga petrophila RKU-1, Thermotoga sp. RQ2, Thermus thermophilus HB8, Treponema denticola, Zymomonas mobilis
  - From Illumina paired-end short reads
    - Sequencing depth:
    - Number of sequences: 54814748
    - Read length: 101 bp
We downloaded the FASTQ files and ran FastQC to check the quality of the sequencing reads; the reads passed quality control. We then used MEGAHIT and metaSPAdes to build assemblies from the long and short reads.
MEGAHIT and metaSPAdes are used for short-read metagenomic assembly in the pipeline. They assemble the short reads into representative contigs for the given metagenomic sample.
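The short-read assembly step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `build_assembly_commands`, the file names, and the output directory naming are assumptions. It only uses the paired-end input (`-1`, `-2`), output (`-o`), and thread (`-t`) options of the `megahit` and `metaspades.py` command-line tools, and it builds the commands without executing them.

```python
import shlex


def build_assembly_commands(reads_1, reads_2, out_prefix, threads=4):
    """Return argv lists for MEGAHIT and metaSPAdes on one paired-end library.

    Hypothetical helper: the output directory names are illustrative only.
    """
    megahit_cmd = [
        "megahit",
        "-1", reads_1,
        "-2", reads_2,
        "-o", f"{out_prefix}_megahit",
        "-t", str(threads),
    ]
    metaspades_cmd = [
        "metaspades.py",
        "-1", reads_1,
        "-2", reads_2,
        "-o", f"{out_prefix}_metaspades",
        "-t", str(threads),
    ]
    return {"megahit": megahit_cmd, "metaspades": metaspades_cmd}


if __name__ == "__main__":
    # Placeholder file names; substitute the real FASTQ paths.
    commands = build_assembly_commands(
        "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample"
    )
    for name, argv in commands.items():
        # Print the command line; pass argv to subprocess.run(argv, check=True)
        # to execute it on a machine where the assembler is installed.
        print(name, "->", shlex.join(argv))
```

Keeping the commands as argument lists avoids shell quoting problems when paths contain spaces.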
Correction of the metagenomic assemblies is implemented in the verification workflow. Here, an `interpreter.py` script uses a VCF parser to obtain regions of structural variation between the assembly and the short reads. These structural variations are then "fixed" (or reversed) in the assembly. In future editions, the number of mapped reads before and after correction will be compared, which should yield further information on the type and impact of each misassembled region.
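The idea behind the correction step can be sketched for one error type, an inversion. This is not the contents of `interpreter.py`; the function names, the toy VCF record, and the toy contig are all hypothetical. The sketch assumes each SV record carries `SVTYPE` and `END` in its INFO column and, as a simplification, treats `POS` and `END` as the 1-based inclusive bounds of the inverted segment, which may not match every SV caller's convention.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'SVTYPE=INV;END=6' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value  # flag entries without '=' map to an empty string
    return info


def sv_regions(vcf_lines):
    """Yield (contig, start, end, svtype) for records with SVTYPE and END."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        contig, pos = fields[0], int(fields[1])
        info = parse_info(fields[7])
        if "SVTYPE" in info and "END" in info:
            yield contig, pos, int(info["END"]), info["SVTYPE"]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revert_inversion(sequence, start, end):
    """Reverse-complement the 1-based inclusive interval [start, end]."""
    segment = sequence[start - 1:end]
    flipped = segment.translate(_COMPLEMENT)[::-1]
    return sequence[:start - 1] + flipped + sequence[end:]


# Toy input, made up for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "contig_1\t3\tsv1\tN\t<INV>\t.\tPASS\tSVTYPE=INV;END=6\n",
]
assembly = {"contig_1": "AACTGATT"}

for contig, start, end, svtype in sv_regions(example_vcf):
    if svtype == "INV":
        assembly[contig] = revert_inversion(assembly[contig], start, end)

print(assembly["contig_1"])  # AATCAGTT
```

Other SV types (insertions, deletions, translocations) would need their own handlers, and coordinates shift after length-changing edits, so a real implementation would likely apply corrections from the end of each contig toward the start.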
We use Truvari to evaluate the calls in the VCF files obtained from different SV callers.
The MasQ pipeline locates assembly errors. Some examples (visualized with IGV) are shown below:
The MasQ pipeline detected an insertion (labeled Unk570) in the Zymo long-read assembly, which appears in IGV as a region with very few mapped reads.

This is an example of an assembly error that MasQ will fix during the correction step. The validation script checks the percentage of mapped reads in the corrected assembly and shows a ______ increase in mapped reads from the original assembly. Thus, MasQ was successful in correcting assembly errors.
Here we can see the corrected assembly mapping of the same locus (previously Unk570) visualized with IGV. (Insert picture once finished running.)
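The comparison made by the validation step can be sketched as a simple percentage calculation. This is an illustration, not the validation script itself: the function names are hypothetical and the read counts below are invented placeholders, not results from MasQ.

```python
def percent_mapped(mapped_reads, total_reads):
    """Percentage of reads that mapped to an assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def mapping_change(before, after):
    """Change in mapped-read percentage, in percentage points.

    `before` and `after` are (mapped_reads, total_reads) tuples for the
    original and corrected assemblies.
    """
    return percent_mapped(*after) - percent_mapped(*before)


# Invented counts for illustration only; these are not MasQ results.
original = (900, 1000)
corrected = (950, 1000)
print(f"{mapping_change(original, corrected):+.1f} percentage points")
```

A positive value means a larger share of reads maps to the corrected assembly than to the original.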
All relevant parameters can be found documented here.
- Test by running on simulated data
- Use a machine learning model such as random forest, SVM, or CNN to classify types of assembly errors
A random forest classifier could be used to classify the type of each assembly error that is detected by the MASQ pipeline.
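A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split; the trees' votes are combined to make a prediction. The model's feature importance scores could be used to extract the features that are most informative for distinguishing error types, and the trained model could then predict the type of each newly detected error.
It could be implemented with the scikit-learn package for Python.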
Other options for classification models are SVM (Support Vector Machine) or CNN (Convolutional Neural Network).
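A random forest classifier of this kind could look roughly like the sketch below, using scikit-learn's `RandomForestClassifier`. Everything about the data is an assumption made for illustration: the three features, their values, and the class labels are invented, and MasQ does not currently define a feature set for this purpose.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features for each candidate region (all values invented):
# relative read depth, fraction of discordant pairs, fraction of clipped reads.
feature_names = ["relative_depth", "discordant_fraction", "clipped_fraction"]

X_train = [
    [0.10, 0.05, 0.80], [0.15, 0.10, 0.75], [0.05, 0.08, 0.85], [0.12, 0.04, 0.90],
    [0.95, 0.70, 0.20], [1.00, 0.65, 0.25], [0.90, 0.75, 0.15], [1.05, 0.80, 0.10],
    [0.50, 0.90, 0.50], [0.55, 0.85, 0.45], [0.45, 0.95, 0.55], [0.60, 0.88, 0.52],
]
y_train = (
    ["insertion"] * 4
    + ["inversion"] * 4
    + ["chimera"] * 4
)

# Fixed random_state so repeated runs build the same forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify one new (invented) region and report per-feature importances.
new_region = [[0.08, 0.06, 0.82]]
predictions = clf.predict(new_region)
importances = dict(zip(feature_names, clf.feature_importances_))

print("predicted error type:", predictions[0])
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

In practice the training labels would have to come from assemblies with a known reference, such as the mock communities listed above, and the model would need to be evaluated on held-out data.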
The MasQ pipeline is freely available for download and use on DockerHub.

How to run the workflow in DNAnexus
Once all of the inputs for all of the apps in the workflow have been satisfied (indicated by the black app box showing a green runnable label instead of orange set inputs), you can run the workflow by clicking on the green Start Analysis button in the upper right of the workflow.
All relevant parameters can be found documented here.
- Todd Treangen
- Michael Jochum
- Adam English
- Shakuntala Mitra
- Junzhou Wang
- Yongze Yin
- Advait Balaji
- Dreycey Albin






