gget virus: add function to only download new sequences #204

@ferbsx

Request type

Extension of existing module

Request description

This feature lets users supply an existing metadata file and download only the sequences and metadata that are not already in it. New records are identified by comparing accession IDs. It is also useful when the API fails partway through and the query needs to be resumed from a checkpoint.

Feature: Baseline Metadata Deduplication with Merge Control

Problem Statements

Problem 1 (Pagination Failure Recovery): When API fetches fail mid-pagination, users lose data and must restart. Solution: Save partial metadata → user resumes with --baseline + --merge-results to combine new + old results.

Problem 2 (Incremental Updates): Users want to re-query without re-downloading data they already have. Solution: Provide --baseline with --no-merge to get only new sequences.

Proposed Solution

Single unified feature: baseline metadata deduplication

  • Accepts baseline file (CSV/JSON/text) with accessions to skip
  • User controls output: --merge-results (combined) or --no-merge (new only)
  • On API failure: save partial metadata, suggest resume command in summary
  • Works for both recovery AND incremental updates with same mechanism

Implementation Details

Data Flow

User Query + Optional --baseline file
    ↓
Fetch Metadata from API
    ↓
Deduplicate vs Baseline (skip accessions in file)
    ↓
Download Sequences for New Accessions Only
    ↓
Output: --merge-results → Combined | --no-merge → New Only
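The deduplication step in the flow above can be sketched as follows. This is a minimal illustration, not the gget implementation: the function name `split_new_records` and the record shape (a dict with an `"accession"` key) are assumptions made for the example.

```python
def split_new_records(api_records, baseline_accessions):
    """Return (new_records, skipped_count), preserving API order.

    Hypothetical helper: `api_records` is assumed to be a list of dicts
    that each carry an "accession" key.
    """
    seen = set(baseline_accessions)
    new_records = []
    skipped = 0
    for record in api_records:
        accession = record["accession"]
        if accession in seen:
            # Already in the baseline, or repeated within the API response.
            skipped += 1
            continue
        seen.add(accession)
        new_records.append(record)
    return new_records, skipped
```

Using a set keeps each lookup constant-time, which matters when the baseline holds tens of thousands of accessions.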

On API Failure

When pagination fails, the system:

  1. Saves all metadata fetched so far to {virus}_partial_{timestamp}.csv
  2. Records this path in the command summary
  3. Suggests recovery command: gget virus {virus} --baseline {partial_file} --merge-results -o output/
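A rough sketch of these three steps, using only the standard library. The helper names `save_partial_metadata` and `recovery_command` are hypothetical, and `--baseline` / `--merge-results` are the flags proposed in this issue rather than existing gget options.

```python
import csv
from datetime import datetime
from pathlib import Path


def save_partial_metadata(records, virus, outdir, now=None):
    """Write the metadata fetched so far to {virus}_partial_{timestamp}.csv."""
    now = now or datetime.now()
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    path = outdir / f"{virus}_partial_{now:%Y%m%d_%H%M%S}.csv"
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record}) or ["accession"]
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return path


def recovery_command(virus, partial_path, outdir):
    """Build the suggested resume command (flags as proposed in this issue)."""
    return f"gget virus {virus} --baseline {partial_path} --merge-results -o {outdir}"
```

The returned path would then be passed to `recovery_command` and written into the command summary.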

Baseline File Format

Users provide a metadata file (CSV, JSON, or plain text) that contains an accession column:

accession,source,date
NC_045512.2,NCBI,2025-12-01
MN908947.3,local,2025-12-01
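One way the three baseline formats could be parsed is sketched below. The function name `load_baseline_accessions`, the extension-based format detection, and the assumed JSON shape (a list of objects or of plain strings) are illustrative choices, not part of the proposal.

```python
import csv
import json
from pathlib import Path


def load_baseline_accessions(path, column="accession"):
    """Return the set of accessions found in a CSV, JSON, or text file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix == ".json":
        data = json.loads(text)
        # Assumed shape: a list of objects with an accession key, or a list of strings.
        values = [item[column] if isinstance(item, dict) else item for item in data]
    elif suffix == ".csv":
        # Simple case only: quoted fields with embedded newlines are not handled.
        values = [row[column] for row in csv.DictReader(text.splitlines())]
    else:
        # Plain text: one accession per line.
        values = text.splitlines()
    return {value.strip() for value in values if value and value.strip()}
```

A CSV without the expected column raises `KeyError` here; a real implementation would probably turn that into a clearer error message.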

Function Signature

def virus(
    virus,
    baseline_metadata=None,  # Optional: accessions to skip
    merge_results=True,      # Default: merged output
    ...
):

Processing Flow

  1. Fetch: Get all metadata from API (with retries on transient failures)
  2. Deduplicate: Skip accessions already in baseline file
  3. Download: Only new accessions
  4. Merge:
    • --merge-results: Single output file with new + baseline combined
    • --no-merge: Separate files (new.csv, baseline_provided.csv) with clear labeling
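The merge step could look roughly like this for the metadata files. The helper name `write_outputs` is hypothetical, and the file names follow the acceptance criteria in this issue ({virus}_merged.csv, {virus}_new.csv, {virus}_baseline_provided.csv); sequence files are left out of the sketch.

```python
import csv
from pathlib import Path


def _write_csv(path, rows):
    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)


def write_outputs(new_rows, baseline_rows, virus, outdir, merge_results=True):
    """Write one merged file, or separate new/baseline files; return the paths."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    if merge_results:
        targets = [(outdir / f"{virus}_merged.csv", list(baseline_rows) + list(new_rows))]
    else:
        targets = [
            (outdir / f"{virus}_new.csv", list(new_rows)),
            (outdir / f"{virus}_baseline_provided.csv", list(baseline_rows)),
        ]
    for path, rows in targets:
        _write_csv(path, rows)
    return [path for path, _ in targets]
```

Because deduplication has already removed baseline accessions from the new rows, concatenating the two lists does not reintroduce duplicates.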

Acceptance Criteria

  • New parameter baseline_metadata accepts CSV/JSON/text files
  • New parameter merge_results (default True) controls output format
  • Baseline accessions correctly extracted and deduplicated
  • On API failure: save partial metadata automatically
  • Command summary shows recovery command with baseline file path
  • Only new accessions downloaded (bandwidth savings)
  • Output files properly labeled:
    • {virus}_new.csv (sequences from API), when using --no-merge
    • {virus}_baseline_provided.csv (reference copy of baseline used), when using --no-merge
    • {virus}_merged.csv, when using --merge-results
  • Comprehensive logging: accessions skipped, downloaded, merged
  • Unit tests: baseline parsing (CSV/JSON/text), deduplication, merge logic

Related Code Sections

  • virus() - lines 4668-5637
  • fetch_virus_metadata() - lines 830-1320
  • save_command_summary() - where recovery suggestion added
  • download_sequences_by_accessions() - skips baseline accessions

Benefits

Single Solution, Two Problems - Same feature handles both failures and incremental updates
Simple Recovery - One suggested command in error summary
User Control - Merge option for flexible output
Bandwidth Efficient - Only download new sequences
Transparent - Clear logging of deduplication
Minimal Code - No special checkpoint logic, just baseline deduplication

Potential Concerns & Mitigations

| Concern | Mitigation |
|---|---|
| Baseline file not found | Graceful error; allow continue without baseline |
| Stale baseline (outdated) | Show file date in logs; user responsibility to verify |
| Accession format mismatch | Normalize accessions (case-insensitive, spacing) |
| Separate files confusion | Clear naming: virus_new.csv, virus_baseline_provided.csv |
| Missing merge parameter | Default to --merge-results=True (most intuitive) |
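The accession normalization mentioned in the mitigations might be as small as the sketch below. The name `normalize_accession` is hypothetical, and whether the version suffix (the `.2` in `NC_045512.2`) should also be ignored is an open question this sketch does not settle.

```python
def normalize_accession(raw):
    """Strip all whitespace and upper-case, so 'nc_045512.2 ' matches 'NC_045512.2'."""
    return "".join(raw.split()).upper()
```

Applying the same function to both the baseline accessions and the API accessions before comparing them keeps the two sides consistent.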

Usage Examples

Scenario 1: API Failure → Recovery

# Run 1: Starts fetching, fails at page 28 of 35
$ gget virus SARS-CoV-2 -o output/
ERROR: Connection lost at page 28
Partial metadata saved: output/SARS-CoV-2_partial_20260205_120530.csv
Recovery command: gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/

# Run 2: User runs suggested command
$ gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/
Baseline loaded: 28,000 accessions from partial fetch
New accessions from API: 22,000
✅ output/SARS-CoV-2_virus_sequences.csv (merged: 50,000 total)

Scenario 2: Incremental Update (Merged)

$ gget virus Zika --baseline previous_zika.csv --merge-results -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_sequences.csv (merged: 2,500 total)

Scenario 3: Incremental Update (New Only)

$ gget virus Zika --baseline previous_zika.csv --no-merge -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_new.csv (500 sequences)
✅ output/zika_virus_baseline_provided.csv (reference)

Example command

`gget virus {virus} --baseline {partial_file} --merge-results -o output/`

Example return value

No response
