Summary
Investigate and quantify how Wessim generates duplicates to determine if the current implementation produces biologically realistic PCR duplicate patterns or merely sampling duplicates.
Hypothesis: Wessim generates sampling duplicates (the same fragment selected multiple times during coverage sampling) rather than PCR amplification duplicates (one fragment undergoing clonal expansion into many copies), which may not accurately reflect wet-lab library preparation artifacts.
Background
Wessim is a well-established tool for modeling exome capture bias via probe hybridization, but its duplicate generation mechanism relies on external read simulators (GemSim/ART). The resulting duplicates arise from:
- Sampling duplicates: Probabilistic selection picks the same fragment coordinates repeatedly
- Not: PCR amplification clonality where a single molecule is exponentially amplified
This distinction matters for benchmarking variant callers that use duplicate information for quality assessment, error correction, or consensus calling.
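The two mechanisms can be contrasted with a toy model. The following is a minimal sketch, not a description of what Wessim does: the function names, the uniform fragment pool, and the per-cycle `efficiency` parameter are all illustrative assumptions. It returns, for each mechanism, how many fragments were observed once, twice, three times, and so on.

```python
import random
from collections import Counter


def sampling_family_sizes(n_fragments, n_reads, rng):
    """Sampling duplicates: each read picks a fragment uniformly, with replacement."""
    picks = Counter(rng.randrange(n_fragments) for _ in range(n_reads))
    # Map family size -> number of fragments observed that many times.
    return Counter(picks.values())


def pcr_family_sizes(n_fragments, n_reads, cycles, efficiency, rng):
    """PCR duplicates: amplify every fragment, then sequence without replacement."""
    pool = []
    for fragment_id in range(n_fragments):
        copies = 1
        for _ in range(cycles):
            # Each existing copy is duplicated with probability `efficiency`
            # (a simplified branching model, assumed for illustration).
            copies += sum(rng.random() < efficiency for _ in range(copies))
        pool.extend([fragment_id] * copies)
    if n_reads > len(pool):
        raise ValueError("cannot sequence more molecules than the amplified pool holds")
    picks = Counter(rng.sample(pool, n_reads))
    return Counter(picks.values())
```

Running both functions with the same `n_fragments` and `n_reads` and a seeded `random.Random` gives two histograms that can be compared directly. How far they diverge depends on the assumed cycle count and efficiency.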
Impact
- Affected component: muc_one_up/read_simulator/wrappers/wessim_wrapper.py
- Downstream impact: May affect benchmarking accuracy for tools that rely on duplicate patterns (e.g., UMI-aware callers, consensus-based error correction)
- Relevance: The MucOneUp manuscript claims realistic simulation, so duplicate behavior needs to be understood and documented
Proposed Investigation
Step 1: Generate test data
# Generate BAMs with known coverage using Wessim pipeline
muconeup run --simulator wessim --coverage 100x --output test_wessim/
Step 2: Quantify duplicates with Picard
# On input BAM (if applicable) and output BAM
picard MarkDuplicates \
I=test_wessim/output.bam \
O=marked_duplicates.bam \
M=duplicate_metrics.txt \
VALIDATION_STRINGENCY=LENIENT
# Extract key metrics
grep -A1 "LIBRARY" duplicate_metrics.txt
Step 3: Analyze duplicate distribution
Key metrics to compare:
- PERCENT_DUPLICATION: Overall duplicate rate
- READ_PAIR_DUPLICATES and READ_PAIR_OPTICAL_DUPLICATES: Duplicate read pairs and the subset of them flagged as optical duplicates
- Duplicate family size distribution: Are families mostly pairs (2 copies), or is there a heavy tail where one fragment appears in many copies (up to ~100)?
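The per-library metrics above can be pulled out of the Picard metrics file programmatically instead of with `grep`. This is a minimal sketch that assumes the usual tab-separated layout: a header row starting with `LIBRARY`, one row per library, and a blank line before the histogram section. The function names are hypothetical. Values are kept as strings except for `PERCENT_DUPLICATION`, because Picard can leave some fields (such as the estimated library size) empty.

```python
def parse_duplication_metrics(text):
    """Return one dict per library row of a MarkDuplicates metrics file."""
    header = None
    rows = []
    for line in text.splitlines():
        if header is None:
            # Skip comment lines until the column header is found.
            if line.startswith("LIBRARY\t"):
                header = line.split("\t")
            continue
        if not line.strip():
            break  # a blank line ends the metrics table
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def duplication_rates(text):
    """Map library name -> PERCENT_DUPLICATION as a float (a fraction, not a percentage)."""
    return {
        row["LIBRARY"]: float(row["PERCENT_DUPLICATION"])
        for row in parse_duplication_metrics(text)
    }
```

For example, `duplication_rates(open("duplicate_metrics.txt").read())` would give one rate per library, which makes it simple to tabulate runs at different coverages.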
Step 4: Compare to empirical data
If available, compare metrics to real exome data from:
- Published benchmarking datasets (GIAB, etc.)
- Literature on expected PCR duplicate rates (~10-30% depending on library prep)
Expected Outcomes
| Scenario | Duplicate Family Sizes | Implication |
|---|---|---|
| Sampling duplicates | Mostly pairs (2 copies) | Unrealistic for PCR artifacts |
| PCR duplicates | Power-law distribution | Biologically plausible |
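
To place a given BAM in one of the two rows above, duplicate family sizes can be tallied directly. The sketch below assumes each sequenced fragment has already been reduced to a hashable key such as `(chrom, start, end, strand)`, for example by iterating over read pairs with a BAM library. That key is a simplification of Picard's own duplicate definition, and the function names and the `min_size=3` threshold are illustrative assumptions.

```python
from collections import Counter


def family_size_histogram(fragment_keys):
    """Map family size -> number of families; identical keys form one family."""
    families = Counter(fragment_keys)
    return Counter(families.values())


def fraction_in_large_families(histogram, min_size=3):
    """Fraction of all fragments that belong to families of at least `min_size`."""
    total = sum(size * count for size, count in histogram.items())
    if total == 0:
        return 0.0
    large = sum(size * count for size, count in histogram.items() if size >= min_size)
    return large / total
```

A histogram dominated by sizes 1 and 2 would point toward the first row of the table. A noticeable share of fragments in larger families would point toward the second.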
Potential Actions Based on Findings
If duplicates are sampling-based (likely)
- Document limitation in manuscript and README
- Consider integration with ReSeq's duplicate model (already used for Illumina errors)
- Evaluate if this affects MucOneUp's validation claims
If duplicates show PCR-like distribution (unlikely)
- Document as feature
- Add metrics to output for transparency
References
- Wessim paper: Wessim: a whole-exome sequencing simulator
- ReSeq duplicate model: https://github.com/schmeing/ReSeq
- Picard MarkDuplicates: https://gatk.broadinstitute.org/hc/en-us/articles/360037052812
Acceptance Criteria
- Run Picard MarkDuplicates on Wessim-generated BAMs at multiple coverages (50x, 100x, 150x)
- Extract and document duplicate family size distribution
- Compare to expected PCR duplicate patterns from literature
- Update documentation based on findings
- If significant: add duplicate metrics to MucOneUp output summary