Investigation: Wessim duplicate generation model (sampling vs PCR amplification) #51

@berntpopp

Summary

Investigate and quantify how Wessim generates duplicates to determine if the current implementation produces biologically realistic PCR duplicate patterns or merely sampling duplicates.

Hypothesis: Wessim generates sampling duplicates (the same fragment coordinates drawn multiple times during coverage sampling) rather than PCR amplification duplicates (a single fragment undergoing clonal expansion into many copies), and therefore may not accurately reflect wet-lab library preparation artifacts.
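The two mechanisms predict very different duplicate family size distributions, which can be illustrated with a toy simulation (pure Python, not Wessim's actual code; all counts and rates below are illustrative assumptions):

```python
import random
from collections import Counter

random.seed(0)

# Sampling duplicates: 10,000 reads drawn with replacement from 8,000
# possible fragment coordinates.  Collisions are coincidental, so
# duplicate families stay small (mostly singletons and pairs).
draws = [random.randrange(8_000) for _ in range(10_000)]
sampling_families = Counter(Counter(draws).values())

# PCR duplicates: each of 2,000 template molecules is amplified for
# 10 cycles at 60% per-cycle efficiency (a branching process), then
# the amplified pool is subsampled to 10,000 reads.  Family sizes
# become heavy-tailed.
def pcr_copies(cycles=10, efficiency=0.6):
    n = 1
    for _ in range(cycles):
        n += sum(random.random() < efficiency for _ in range(n))
    return n

pool = [mol for mol in range(2_000) for _ in range(pcr_copies())]
reads = random.sample(pool, 10_000)
pcr_families = Counter(Counter(reads).values())

print("largest sampling family:", max(sampling_families))
print("largest PCR family:", max(pcr_families))
```

Under the sampling model the largest family stays in the single digits, while the branching model routinely produces families an order of magnitude larger; that contrast is the signature the investigation steps look for.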

Background

Wessim is a well-established tool for modeling exome capture bias via probe hybridization, but its duplicate generation mechanism relies on external read simulators (GemSim/ART). The resulting duplicates arise from:

  • Sampling duplicates: probabilistic fragment selection draws the same coordinates more than once
  • Not PCR amplification: no step models a single molecule being exponentially amplified into a clonal family

This distinction matters for benchmarking variant callers that use duplicate information for quality assessment, error correction, or consensus calling.

Impact

  • Affected component: muc_one_up/read_simulator/wrappers/wessim_wrapper.py
  • Downstream impact: May affect benchmarking accuracy for tools that rely on duplicate patterns (e.g., UMI-aware callers, consensus-based error correction)
  • Relevance: the MucOneUp manuscript claims realistic simulation, so understanding duplicate behavior is important for documentation

Proposed Investigation

Step 1: Generate test data

# Generate BAMs with known coverage using Wessim pipeline
muconeup run --simulator wessim --coverage 100x --output test_wessim/

Step 2: Quantify duplicates with Picard

# On input BAM (if applicable) and output BAM
picard MarkDuplicates \
  I=test_wessim/output.bam \
  O=marked_duplicates.bam \
  M=duplicate_metrics.txt \
  VALIDATION_STRINGENCY=LENIENT

# Extract key metrics
grep -A1 "LIBRARY" duplicate_metrics.txt
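Rather than grepping, the metrics table can be parsed programmatically; a minimal sketch, assuming Picard's standard tab-separated layout (a `## METRICS CLASS` line followed by a column header row and data rows) and the `duplicate_metrics.txt` filename from the command above:

```python
import tempfile

def parse_picard_metrics(path):
    """Parse the first metrics table in a Picard metrics file into a
    list of {column: value} dicts (one per library)."""
    with open(path) as fh:
        lines = fh.read().splitlines()
    rows = []
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            for data in lines[i + 2:]:
                if not data.strip():
                    break
                rows.append(dict(zip(header, data.split("\t"))))
            break
    return rows

# Demo on a tiny synthetic metrics file (values are made up):
sample = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib1\t1234\t0.12\n"
    "\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)
    metrics_path = fh.name

metrics = parse_picard_metrics(metrics_path)
print(metrics[0]["LIBRARY"], metrics[0]["PERCENT_DUPLICATION"])
```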

Step 3: Analyze duplicate distribution

Key metrics to compare:

  • PERCENT_DUPLICATION: Overall duplicate rate
  • READ_PAIR_DUPLICATES vs READ_PAIR_OPTICAL_DUPLICATES: total versus optical duplicate pairs (Picard reports these in separate columns)
  • Duplicate family size distribution: are duplicates mostly pairs (2 copies) or heavy-tailed (1→100)?
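Picard does not report the family size distribution directly, so it has to be reconstructed by grouping reads on position and orientation (e.g. from `samtools view` output or via pysam). A sketch of the grouping step on pre-extracted coordinate tuples; the (chrom, start, strand) tuple format here is an assumption for illustration:

```python
from collections import Counter

def family_size_distribution(fragments):
    """Map (chrom, start, strand) fragment tuples to a histogram
    {family_size: number_of_families}.  Fragments sharing coordinates
    and orientation form one duplicate family, mirroring Picard's
    positional duplicate definition."""
    return Counter(Counter(fragments).values())

# Synthetic example: one family of 3, one of 2, one singleton.
frags = (
    [("chr1", 100, "+")] * 3
    + [("chr1", 100, "-")] * 2
    + [("chr2", 50, "+")]
)
dist = family_size_distribution(frags)
print(dist)  # one family each of sizes 3, 2 and 1
```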

Step 4: Compare to empirical data

If available, compare metrics to real exome data from:

  • Published benchmarking datasets (GIAB, etc.)
  • Literature on expected PCR duplicate rates (~10-30% depending on library prep)

Expected Outcomes

| Scenario | Duplicate family sizes | Implication |
| --- | --- | --- |
| Sampling duplicates | Mostly pairs (2 copies) | Unrealistic for PCR artifacts |
| PCR duplicates | Power-law distribution | Biologically plausible |
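The sampling scenario has a closed-form null model: if reads land uniformly on the captured positions, per-site read counts are approximately Poisson, so the expected number of families of each size can be computed and compared against the observed histogram. A sketch (the read and site counts are illustrative):

```python
import math

def poisson_expected_families(n_reads, n_sites, max_k=10):
    """Expected number of duplicate families of size k (k >= 1) when
    n_reads fall uniformly at random on n_sites positions -- the
    null model for pure sampling duplicates."""
    lam = n_reads / n_sites
    return {
        k: n_sites * math.exp(-lam) * lam**k / math.factorial(k)
        for k in range(1, max_k + 1)
    }

expected = poisson_expected_families(10_000, 8_000)
# The factorial in the denominator makes large families essentially
# impossible under pure sampling; an observed heavy tail therefore
# points at clonal (PCR-like) amplification instead.
print(f"size 2: {expected[2]:.0f} families, size 10: {expected[10]:.4f}")
```

Comparing this expectation to the observed family-size histogram (e.g. with a chi-squared test on the tail) discriminates the two scenarios in the table above.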

Potential Actions Based on Findings

If duplicates are sampling-based (likely)

  1. Document limitation in manuscript and README
  2. Consider integration with ReSeq's duplicate model (already used for Illumina errors)
  3. Evaluate if this affects MucOneUp's validation claims

If duplicates show PCR-like distribution (unlikely)

  1. Document as feature
  2. Add metrics to output for transparency

Acceptance Criteria

  • Run Picard MarkDuplicates on Wessim-generated BAMs at multiple coverages (50x, 100x, 150x)
  • Extract and document duplicate family size distribution
  • Compare to expected PCR duplicate patterns from literature
  • Update documentation based on findings
  • If significant: add duplicate metrics to MucOneUp output summary
