Investigation: Wessim duplicate generation model (sampling vs PCR amplification)

## Summary

Investigate and quantify how Wessim generates duplicates to determine if the current implementation produces biologically realistic PCR duplicate patterns or merely sampling duplicates.

**Hypothesis**: Wessim generates **sampling duplicates** (the same fragment selected multiple times during coverage sampling) rather than **PCR amplification duplicates** (one fragment undergoing clonal expansion into many copies), which may not accurately reflect wet-lab library preparation artifacts.

## Background

Wessim is a well-established tool for modeling exome capture bias via probe hybridization, but its duplicate generation mechanism relies on external read simulators (GemSim/ART). The duplicates in its output are hypothesized to be:

- **Sampling duplicates**: probabilistic selection picks the same fragment coordinates more than once
- **Not PCR duplicates**: no single molecule is clonally amplified into many copies

This distinction matters for benchmarking variant callers that use duplicate information for quality assessment, error correction, or consensus calling.
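The difference between the two mechanisms can be illustrated with a toy simulation. This is a minimal sketch, not Wessim's code: `sampling_model`, `pcr_model`, `cycles`, and `efficiency` are hypothetical names, fragments are assumed equally likely to be selected, and PCR is modeled as a simple branching process in which each copy is duplicated with a fixed probability per cycle.

```python
import random
from collections import Counter


def family_sizes(reads):
    """Return {family_size: n_families} given one fragment ID per read."""
    per_fragment = Counter(reads)
    return Counter(per_fragment.values())


def sampling_model(n_fragments, n_reads, rng):
    """Each read independently picks a fragment, uniformly with replacement."""
    return [rng.randrange(n_fragments) for _ in range(n_reads)]


def pcr_model(n_fragments, n_reads, cycles, efficiency, rng):
    """Amplify every molecule for several cycles, then sequence from the pool."""
    copies = []
    for _fragment in range(n_fragments):
        count = 1
        for _cycle in range(cycles):
            # Each existing copy is duplicated with probability `efficiency`.
            count += sum(rng.random() < efficiency for _ in range(count))
        copies.append(count)
    # Reads are drawn in proportion to each fragment's copy number.
    return rng.choices(range(n_fragments), weights=copies, k=n_reads)


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampling:", sorted(family_sizes(sampling_model(5000, 1000, rng)).items()))
    print("pcr:     ", sorted(family_sizes(pcr_model(5000, 1000, 8, 0.5, rng)).items()))
```

Under these assumptions the PCR model adds molecule-to-molecule variation in copy number on top of the sampling step, so its family-size distribution would be expected to have a longer tail than the pure sampling model at the same depth.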

## Impact

- **Affected component**: `muc_one_up/read_simulator/wrappers/wessim_wrapper.py`
- **Downstream impact**: May affect benchmarking accuracy for tools that rely on duplicate patterns (e.g., UMI-aware callers, consensus-based error correction)
- **Relevance**: The MucOneUp manuscript claims realistic simulation, so duplicate behavior needs to be understood and documented

## Proposed Investigation

### Step 1: Generate test data
```bash
# Generate BAMs with known coverage using Wessim pipeline
muconeup run --simulator wessim --coverage 100x --output test_wessim/
```

### Step 2: Quantify duplicates with Picard
```bash
# Mark duplicates in the Wessim output BAM (must be coordinate- or queryname-sorted)
picard MarkDuplicates \
  I=test_wessim/output.bam \
  O=marked_duplicates.bam \
  M=duplicate_metrics.txt \
  VALIDATION_STRINGENCY=LENIENT

# Extract the metrics header row and the first data row
grep -A1 "^LIBRARY" duplicate_metrics.txt
```
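For scripted comparisons across runs, the metrics table can also be read programmatically. The sketch below assumes the usual Picard metrics layout: a line starting with `## METRICS CLASS`, followed by a tab-separated header row and data rows, terminated by a blank line. The function name `parse_picard_metrics` is illustrative only.

```python
def parse_picard_metrics(text):
    """Return the rows of the first metrics table as a list of dicts.

    Values are kept as strings; convert them (e.g. with float()) as needed.
    """
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if not line.startswith("## METRICS CLASS"):
            continue
        if i + 1 >= len(lines):
            break  # Truncated file: no header row after the class line.
        header = lines[i + 1].split("\t")
        for data in lines[i + 2:]:
            if not data.strip():
                break  # A blank line ends the metrics table.
            rows.append(dict(zip(header, data.split("\t"))))
        break
    return rows
```

A call such as `parse_picard_metrics(open("duplicate_metrics.txt").read())` would then give one dict per library, from which `PERCENT_DUPLICATION` and the duplicate counts can be collected for each coverage level.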

### Step 3: Analyze duplicate distribution
Key metrics to compare:
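- **PERCENT_DUPLICATION**: Overall duplicate rate
- **READ_PAIR_DUPLICATES** and **READ_PAIR_OPTICAL_DUPLICATES**: Number of read pairs marked as duplicates, and the subset classified as optical. Optical detection relies on tile coordinates in read names, so this subset may be empty for simulated reads
- **Duplicate family size distribution**: Are families mostly pairs (2 copies), or is the distribution heavy-tailed, with some fragments present in many copies?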
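Picard's summary metrics do not directly give the family-size shape, so it may need to be computed from alignment coordinates. The sketch below assumes each read pair has already been reduced to a hashable key such as `(chrom, start, end, strand)`, for example by iterating over the BAM with a reader library (not shown). The function names and the `pair_share` statistic are illustrative choices, not an established metric.

```python
from collections import Counter


def family_size_histogram(fragments):
    """Count duplicate families by size.

    `fragments` yields one hashable key per read pair, for example
    (chrom, start, end, strand). Pairs sharing a key form one family.
    """
    per_family = Counter(fragments)
    return Counter(per_family.values())


def summarize(histogram):
    """Reduce a {family_size: n_families} histogram to a few shape statistics."""
    n_pairs = sum(size * n for size, n in histogram.items())
    n_families = sum(histogram.values())
    n_duplicated = sum(n for size, n in histogram.items() if size > 1)
    return {
        # Fraction of read pairs that are redundant copies of another pair.
        "duplicate_fraction": (n_pairs - n_families) / n_pairs if n_pairs else 0.0,
        "max_family_size": max(histogram, default=0),
        # Among families with duplicates, the share that are exactly pairs.
        "pair_share": histogram.get(2, 0) / n_duplicated if n_duplicated else 0.0,
    }
```

A `pair_share` close to 1 with a small `max_family_size` would point toward sampling duplicates, while a low `pair_share` and large maximum would be more consistent with clonal amplification.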

### Step 4: Compare to empirical data
If available, compare metrics to real exome data from:
- Published benchmarking datasets (GIAB, etc.)
- Literature on expected PCR duplicate rates (~10-30% depending on library prep)

## Expected Outcomes

| Scenario | Duplicate Family Sizes | Implication |
|----------|----------------------|-------------|
| Sampling duplicates | Mostly pairs (2 copies) | Unrealistic for PCR artifacts |
| PCR duplicates | Power-law distribution | Biologically plausible |
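As a baseline for the sampling scenario, the expected family sizes can be approximated with a Poisson model. This sketch assumes N reads drawn uniformly from M equally likely fragments, so that the copies per fragment are roughly Poisson with mean λ = N/M; Wessim's probe-based selection is not uniform, so this is only a rough reference, and the function names are hypothetical.

```python
import math


def poisson_family_fractions(lam, max_size=6):
    """Fraction of observed families at each size (zero-truncated Poisson)."""
    p_observed = 1.0 - math.exp(-lam)  # Probability a fragment is seen at all.
    return {
        k: math.exp(-lam) * lam ** k / math.factorial(k) / p_observed
        for k in range(1, max_size + 1)
    }


def expected_duplicate_fraction(lam):
    """Expected share of reads that are redundant copies under the model."""
    # Unique fragments observed per read: (1 - exp(-lam)) / lam.
    return 1.0 - (1.0 - math.exp(-lam)) / lam
```

For λ well below 1, this model puts most duplicated families at size 2, which matches the "mostly pairs" expectation in the table. Observed families much larger than the model predicts at the measured λ would be hard to explain by uniform sampling alone.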

## Potential Actions Based on Findings

### If duplicates are sampling-based (likely)
1. **Document limitation** in manuscript and README
2. Consider integration with ReSeq's duplicate model (already used for Illumina errors)
3. Evaluate if this affects MucOneUp's validation claims

### If duplicates show PCR-like distribution (unlikely)
1. Document as feature
2. Add metrics to output for transparency

## References

- Wessim paper: [Wessim: a whole-exome sequencing simulator based on in silico exome capture](https://doi.org/10.1093/bioinformatics/btt074)
- ReSeq duplicate model: https://github.com/schmeing/ReSeq
- Picard MarkDuplicates: https://gatk.broadinstitute.org/hc/en-us/articles/360037052812

## Acceptance Criteria

- [ ] Run Picard MarkDuplicates on Wessim-generated BAMs at multiple coverages (50x, 100x, 150x)
- [ ] Extract and document duplicate family size distribution
- [ ] Compare to expected PCR duplicate patterns from literature
- [ ] Update documentation based on findings
- [ ] If significant: add duplicate metrics to MucOneUp output summary
