Description
Request type
Extension of existing module
Request description
This feature lets users supply an existing metadata file and download only the sequences and metadata that are not already in it, determined by comparing accession IDs. It is also useful when an API fetch fails partway through and needs to be resumed from a checkpoint.
Feature: Baseline Metadata Deduplication with Merge Control
Problem Statements
Problem 1 (Pagination Failure Recovery): When API fetches fail mid-pagination, users lose data and must restart. Solution: Save partial metadata → user resumes with --baseline + --merge-results to combine new + old results.
Problem 2 (Incremental Updates): Users want to re-query without re-downloading data they already have. Solution: Provide --baseline with --no-merge to get only new sequences.
Proposed Solution
Single unified feature: baseline metadata deduplication
- Accepts baseline file (CSV/JSON/text) with accessions to skip
- User controls output: --merge-results (combined) or --no-merge (new only)
- On API failure: save partial metadata, suggest resume command in summary
- Works for both recovery AND incremental updates with same mechanism
Implementation Details
Data Flow
User Query + Optional --baseline file
↓
Fetch Metadata from API
↓
Deduplicate vs Baseline (skip accessions in file)
↓
Download Sequences for New Accessions Only
↓
Output: --merge-results → Combined | --no-merge → New Only
On API Failure
When pagination fails, the system:
- Saves all metadata fetched so far to {virus}_partial_{timestamp}.csv
- Saves this path in command summary
- Suggests recovery command: gget virus {virus} --baseline {partial_file} --merge-results -o output/
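The failure path above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_page`, `fetch_with_partial_save`, `save_partial_metadata`, and `recovery_command` are hypothetical names, not existing gget functions, and `--baseline` / `--merge-results` are the flags proposed in this issue rather than options that exist today.

```python
import csv
import shlex
from datetime import datetime
from pathlib import Path


def save_partial_metadata(records, virus, outdir, now=None):
    """Write the records fetched so far to {virus}_partial_{timestamp}.csv."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    path = outdir / f"{virus}_partial_{timestamp}.csv"

    # Collect every column seen across records, keeping first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames or ["accession"])
        writer.writeheader()
        writer.writerows(records)
    return path


def recovery_command(virus, partial_file, outdir):
    """Build the resume command that the summary would suggest."""
    return (
        f"gget virus {shlex.quote(virus)} "
        f"--baseline {shlex.quote(str(partial_file))} "
        f"--merge-results -o {shlex.quote(str(outdir))}"
    )


def fetch_with_partial_save(fetch_page, total_pages, virus, outdir):
    """Fetch pages in order; on a network error, save progress and stop.

    Returns (records, suggested_command); the command is None on success.
    """
    records = []
    for page in range(1, total_pages + 1):
        try:
            records.extend(fetch_page(page))
        except OSError:
            # Stand-in for whatever transient error the real fetch raises.
            partial = save_partial_metadata(records, virus, outdir)
            return records, recovery_command(virus, partial, outdir)
    return records, None
```

Returning the suggested command instead of printing it keeps the sketch easy to test; an actual implementation would presumably hand the path and command to `save_command_summary()`.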
Baseline File Format
Users provide metadata (CSV/JSON/text) with accession column:
accession,source,date
NC_045512.2,NCBI,2025-12-01
MN908947.3,local,2025-12-01
Function Signature
def virus(
    virus,
    baseline_metadata=None,  # Optional: accessions to skip
    merge_results=True,  # Default: merged output
    ...
):
Processing Flow
- Fetch: Get all metadata from API (with retries on transient failures)
- Deduplicate: Skip accessions already in baseline file
- Download: Only new accessions
- Merge:
  - --merge-results: Single output file with new + baseline combined
  - --no-merge: Separate files (new.csv, baseline_provided.csv) with clear labeling
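A minimal sketch of the deduplicate and merge steps, assuming each metadata record is a dict with an `accession` key. The helper names `split_new_records` and `build_outputs` are hypothetical, and the output file names follow the acceptance criteria in this issue.

```python
def split_new_records(api_records, baseline_accessions):
    """Return (new_records, skipped_count), comparing by accession ID.

    skipped_count covers baseline hits and accessions repeated
    within the API response itself.
    """
    seen = set(baseline_accessions)
    new_records = []
    skipped = 0
    for record in api_records:
        accession = record["accession"]
        if accession in seen:
            skipped += 1
            continue
        seen.add(accession)
        new_records.append(record)
    return new_records, skipped


def build_outputs(virus, new_records, baseline_records, merge_results=True):
    """Map each output file name to the rows it would contain."""
    if merge_results:
        # Merged mode: one file with the baseline rows followed by the new rows.
        return {f"{virus}_merged.csv": baseline_records + new_records}
    # No-merge mode: new rows and a reference copy of the baseline, kept apart.
    return {
        f"{virus}_new.csv": new_records,
        f"{virus}_baseline_provided.csv": baseline_records,
    }
```

Only the accessions in `new_records` would then be passed on to the sequence download step, which is where the bandwidth saving comes from.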
Acceptance Criteria
- New parameter baseline_metadata accepts CSV/JSON/text files
- New parameter merge_results (default True) controls output format
- Baseline accessions correctly extracted and deduplicated
- On API failure: save partial metadata automatically
- Command summary shows recovery command with baseline file path
- Only new accessions downloaded (bandwidth savings)
- Output files properly labeled when --no-merge:
  - {virus}_new.csv (sequences from API)
  - {virus}_baseline_provided.csv (reference copy of baseline used)
  - {virus}_merged.csv (when using --merge-results)
- Comprehensive logging: accessions skipped, downloaded, merged
- Unit tests: baseline parsing (CSV/JSON/text), deduplication, merge logic
Related Code Sections
- virus() - lines 4668-5637
- fetch_virus_metadata() - lines 830-1320
- save_command_summary() - where recovery suggestion added
- download_sequences_by_accessions() - skips baseline accessions
Benefits
✅ Single Solution, Two Problems - Same feature handles both failures and incremental updates
✅ Simple Recovery - One suggested command in error summary
✅ User Control - Merge option for flexible output
✅ Bandwidth Efficient - Only download new sequences
✅ Transparent - Clear logging of deduplication
✅ Minimal Code - No special checkpoint logic, just baseline deduplication
Potential Concerns & Mitigations
| Concern | Mitigation |
|---|---|
| Baseline file not found | Report a clear error; allow continuing without a baseline |
| Stale baseline (outdated) | Show file date in logs; user responsibility to verify |
| Accession format mismatch | Normalize accessions (case-insensitive, spacing) |
| Separate files confusion | Clear naming: virus_new.csv, virus_baseline_provided.csv |
| Missing merge parameter | Default to merged output, merge_results=True (most intuitive) |
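To illustrate the accession normalization and baseline parsing mentioned above, here is a hedged sketch. The function names are hypothetical, the file type is guessed from the extension, and the JSON layout (a list of records or bare accession strings) is an assumption, since the issue does not specify one.

```python
import csv
import json
from pathlib import Path


def normalize_accession(value):
    """Strip surrounding whitespace and upper-case for case-insensitive matching."""
    return str(value).strip().upper()


def load_baseline_accessions(path, column="accession"):
    """Read accessions from a CSV, JSON, or plain-text baseline file."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".csv":
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if column not in (reader.fieldnames or []):
                raise ValueError(f"{path} has no '{column}' column")
            values = [row[column] for row in reader if row.get(column)]
    elif suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumed layout: a list of records or a list of bare accession strings.
        values = [item[column] if isinstance(item, dict) else item for item in data]
    else:
        # Plain text: one accession per line; blank lines are ignored.
        lines = path.read_text(encoding="utf-8").splitlines()
        values = [line for line in lines if line.strip()]

    return {normalize_accession(value) for value in values}
```

Accessions returned by the API would need to pass through the same `normalize_accession` call before comparison, otherwise the case and spacing fixes apply to only one side of the match.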
Usage Examples
Scenario 1: API Failure → Recovery
# Run 1: Starts fetching, fails at page 28 of 35
$ gget virus SARS-CoV-2 -o output/
ERROR: Connection lost at page 28
Partial metadata saved: output/SARS-CoV-2_partial_20260205_120530.csv
Recovery command: gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/
# Run 2: User runs suggested command
$ gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/
Baseline loaded: 28,000 accessions from partial fetch
New accessions from API: 22,000
✅ output/SARS-CoV-2_virus_sequences.csv (merged: 50,000 total)
Scenario 2: Incremental Update (Merged)
$ gget virus Zika --baseline previous_zika.csv --merge-results -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_sequences.csv (merged: 2,500 total)
Scenario 3: Incremental Update (New Only)
$ gget virus Zika --baseline previous_zika.csv --no-merge -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_new.csv (500 sequences)
✅ output/zika_virus_baseline_provided.csv (reference)
Example command
`gget virus {virus} --baseline {partial_file} --merge-results -o output/`
Example return value
No response