gget virus: add function to only download new sequences

### Request type

Extension of existing module

### Request description

This function lets users supply a metadata file and download only the sequences/data not already included in it, identified by comparing accession IDs. It is also useful when the API breaks and the run needs to be resumed from a certain checkpoint.

# Feature: Baseline Metadata Deduplication with Merge Control

## Problem Statements

**Problem 1 (Pagination Failure Recovery):** When API fetches fail mid-pagination, users lose data and must restart. Solution: Save partial metadata → user resumes with `--baseline` + `--merge-results` to combine new + old results.

**Problem 2 (Incremental Updates):** Users want to re-query without re-downloading data they already have. Solution: Provide `--baseline` with `--no-merge` to get only new sequences.

## Proposed Solution

Single unified feature: **baseline metadata deduplication**
- Accepts baseline file (CSV/JSON/text) with accessions to skip
- User controls output: `--merge-results` (combined) or `--no-merge` (new only)
- On API failure: save partial metadata, suggest resume command in summary
- Works for both recovery AND incremental updates with same mechanism

## Implementation Details

### Data Flow
```
User Query + Optional --baseline file
    ↓
Fetch Metadata from API
    ↓
Deduplicate vs Baseline (skip accessions in file)
    ↓
Download Sequences for New Accessions Only
    ↓
Output: --merge-results → Combined | --no-merge → New Only
```

### On API Failure
When pagination fails, the system:
1. Saves all metadata fetched so far to `{virus}_partial_{timestamp}.csv`
2. Records this path in the command summary
3. Suggests recovery command: `gget virus {virus} --baseline {partial_file} --merge-results -o output/`
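
The failure-handling steps above could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `save_partial_metadata` and `recovery_command` are hypothetical helper names, and the records are assumed to be a list of dictionaries that each carry an `accession` key.

```python
import csv
from datetime import datetime
from pathlib import Path


def save_partial_metadata(records, virus, outdir, now=None):
    """Write the records fetched so far to {virus}_partial_{timestamp}.csv."""
    now = now or datetime.now()
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    path = outdir / f"{virus}_partial_{now:%Y%m%d_%H%M%S}.csv"
    # Use the union of keys so records with differing fields still serialize.
    fieldnames = sorted({key for record in records for key in record})
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return path


def recovery_command(virus, partial_path, outdir):
    """Build the resume command to show in the command summary."""
    return (
        f"gget virus {virus} --baseline {partial_path} "
        f"--merge-results -o {outdir}"
    )
```

The timestamp format matches the file name shown in Scenario 1 (`20260205_120530`); passing `now` explicitly is only there to make the helper easy to test.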

### Baseline File Format
Users provide a metadata file (CSV/JSON/text) with an `accession` column:
```csv
accession,source,date
NC_045512.2,NCBI,2025-12-01
MN908947.3,local,2025-12-01
```
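
One possible way to read such a baseline file into a set of accessions is sketched below. The function name `load_baseline_accessions` is hypothetical, and two layouts are assumptions not specified in this issue: JSON is taken to be a list of objects with an `accession` key, and plain text is taken to hold one accession per line.

```python
import csv
import json
from pathlib import Path


def load_baseline_accessions(path):
    """Return the set of accessions listed in a CSV, JSON, or text file."""
    path = Path(path)
    suffix = path.suffix.lower()
    text = path.read_text()
    if suffix == ".csv":
        rows = csv.DictReader(text.splitlines())
        values = [row.get("accession") for row in rows]
    elif suffix == ".json":
        # Assumed layout: a list of objects, each with an "accession" key.
        values = [item.get("accession") for item in json.loads(text)]
    else:
        # Assumed layout: plain text with one accession per line.
        values = text.splitlines()
    # Drop missing or blank entries and trim surrounding whitespace.
    return {value.strip() for value in values if value and value.strip()}
```

Returning a set keeps the later "is this accession already known?" check constant-time, which matters for baselines with tens of thousands of entries.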

### Function Signature
```python
def virus(
    virus,
    baseline_metadata=None,  # Optional: accessions to skip
    merge_results=True,      # Default: merged output
    ...
):
```

### Processing Flow
1. **Fetch:** Get all metadata from API (with retries on transient failures)
2. **Deduplicate:** Skip accessions already in baseline file
3. **Download:** Only new accessions
4. **Merge:**
   - `--merge-results`: Single output file with new + baseline combined
   - `--no-merge`: Separate files (new.csv, baseline_provided.csv) with clear labeling
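
The deduplicate and merge steps above might look roughly like this. It is a sketch under stated assumptions: `split_new_records` and `build_outputs` are hypothetical names, records are dictionaries with an `accession` key, and the output labels (`merged`, `new`, `baseline_provided`) simply mirror the file names proposed in this issue.

```python
def split_new_records(api_records, baseline_accessions):
    """Return the API records whose accession is not in the baseline."""
    seen = set(baseline_accessions)
    new_records = []
    for record in api_records:
        accession = record["accession"]
        if accession not in seen:
            # Adding to `seen` also drops duplicates within the API response.
            seen.add(accession)
            new_records.append(record)
    return new_records


def build_outputs(new_records, baseline_records, merge_results=True):
    """Map output labels to record lists for the two output modes."""
    if merge_results:
        # Merged mode: one combined list, baseline first, then new records.
        return {"merged": baseline_records + new_records}
    # No-merge mode: keep new records and the baseline copy separate.
    return {"new": new_records, "baseline_provided": baseline_records}
```

Only the records returned by `split_new_records` would then be passed on to the sequence download step.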

## Acceptance Criteria

- [ ] New parameter `baseline_metadata` accepts CSV/JSON/text files
- [ ] New parameter `merge_results` (default True) controls output format
- [ ] Baseline accessions correctly extracted and deduplicated
- [ ] On API failure: save partial metadata automatically
- [ ] Command summary shows recovery command with baseline file path
- [ ] Only new accessions downloaded (bandwidth savings)
- [ ] Output files properly labeled:
  - `{virus}_new.csv` (new sequences from the API, with `--no-merge`)
  - `{virus}_baseline_provided.csv` (reference copy of the baseline used, with `--no-merge`)
  - `{virus}_merged.csv` (with `--merge-results`)
- [ ] Comprehensive logging: accessions skipped, downloaded, merged
- [ ] Unit tests: baseline parsing (CSV/JSON/text), deduplication, merge logic
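
For the logging criterion, a small summary helper along these lines might be enough. The logger name `gget.virus` and the function `log_dedup_summary` are assumptions for illustration only, not existing gget identifiers.

```python
import logging

logger = logging.getLogger("gget.virus")  # hypothetical logger name


def log_dedup_summary(api_accessions, baseline_accessions, merge_results=True):
    """Log and return counts of skipped, downloaded, and merged accessions."""
    api = set(api_accessions)
    baseline = set(baseline_accessions)
    summary = {
        "skipped": len(api & baseline),     # returned by the API, already known
        "downloaded": len(api - baseline),  # new accessions only
        "merged": len(api | baseline) if merge_results else 0,
    }
    logger.info(
        "Baseline: %d | skipped: %d | downloaded: %d | merged total: %d",
        len(baseline),
        summary["skipped"],
        summary["downloaded"],
        summary["merged"],
    )
    return summary
```

Returning the counts as a dictionary, in addition to logging them, makes the same numbers easy to assert on in unit tests and to reuse in the command summary.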

## Related Code Sections

- `virus()` - lines 4668-5637
- `fetch_virus_metadata()` - lines 830-1320  
- `save_command_summary()` - where recovery suggestion added
- `download_sequences_by_accessions()` - skips baseline accessions

## Benefits

✅ **Single Solution, Two Problems** - Same feature handles both failures and incremental updates  
✅ **Simple Recovery** - One suggested command in error summary  
✅ **User Control** - Merge option for flexible output  
✅ **Bandwidth Efficient** - Only download new sequences  
✅ **Transparent** - Clear logging of deduplication  
✅ **Minimal Code** - No special checkpoint logic, just baseline deduplication  

## Potential Concerns & Mitigations

| Concern | Mitigation |
|---------|-----------|
| Baseline file not found | Graceful error; allow the run to continue without a baseline |
| Stale baseline (outdated) | Show file date in logs; user responsibility to verify |
| Accession format mismatch | Normalize accessions (case-insensitive, spacing) |
| Separate files confusion | Clear naming: `virus_new.csv`, `virus_baseline_provided.csv` |
| Missing merge parameter | Default to `--merge-results` (`merge_results=True`; most intuitive) |
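
The accession normalization mentioned in the table could be as simple as the sketch below. `normalize_accession` is a hypothetical helper, and treating accessions as case-insensitive with whitespace ignored is the assumption taken from the mitigation column, not confirmed behavior.

```python
def normalize_accession(raw):
    """Remove all whitespace and upper-case an accession for comparison."""
    return "".join(raw.split()).upper()


def normalize_all(accessions):
    """Normalize an iterable of accessions into a set, skipping blanks."""
    return {normalize_accession(a) for a in accessions if a and a.strip()}
```

Applying the same normalization to both the baseline and the API results before comparing them avoids re-downloading a sequence only because one file wrote `nc_045512.2` and the other `NC_045512.2`.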

## Usage Examples

### Scenario 1: API Failure → Recovery
```bash
# Run 1: Starts fetching, fails at page 28 of 50
$ gget virus SARS-CoV-2 -o output/
ERROR: Connection lost at page 28
Partial metadata saved: output/SARS-CoV-2_partial_20260205_120530.csv
Recovery command: gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/

# Run 2: User runs suggested command
$ gget virus SARS-CoV-2 --baseline output/SARS-CoV-2_partial_20260205_120530.csv --merge-results -o output/
Baseline loaded: 28,000 accessions from partial fetch
New accessions from API: 22,000
✅ output/SARS-CoV-2_virus_sequences.csv (merged: 50,000 total)
```

### Scenario 2: Incremental Update (Merged)
```bash
$ gget virus Zika --baseline previous_zika.csv --merge-results -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_sequences.csv (merged: 2,500 total)
```

### Scenario 3: Incremental Update (New Only)
```bash
$ gget virus Zika --baseline previous_zika.csv --no-merge -o output/
Baseline loaded: 2,000 accessions
API returned: 2,500 records
New accessions: 500 (downloaded)
✅ output/zika_virus_new.csv (500 sequences)
✅ output/zika_virus_baseline_provided.csv (reference)
```


### Example command

```shell
gget virus {virus} --baseline {partial_file} --merge-results -o output/
```

### Example return value

_No response_
