This pipeline automates the process of checking for available proteomes for a list of species, downloading them, creating a local BLAST database, and running a BLASTp search against it using your query sequences.
- Conda/Mamba: Ensure you have Miniconda, Anaconda, or Mamba installed on your cluster user account.
- Snakemake: Install Snakemake in your base environment:
conda install snakemake
# or: mamba install snakemake
- NCBI API Key (recommended): Get a free API key from your NCBI Account Settings for enhanced access (10 requests per second vs. 5 without a key).
The workflow will automatically create isolated conda environments for each rule using the environment definitions in workflow/envs/. No manual environment setup is required.
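If you prefer to build these environments up front (for example, on a login node with internet access) before any jobs are submitted, Snakemake can create them without running any rules. This is a sketch and assumes the provided profile already enables conda-based software deployment:

# Build all rule-specific conda environments, then exit without running any jobs
snakemake --profile profiles/slurm --conda-create-envs-only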
To avoid rate limiting when querying NCBI databases, set up an API key:
- Get your API key: Visit NCBI Account Settings and create a free API key.
- Create a .env file: Copy env.example to .env and add your API key:

cp env.example .env
# Edit .env and replace 'your_api_key_here' with your actual API key

- Alternative: Set the key as an environment variable instead:
export NCBI_API_KEY=your_actual_api_key_here
The pipeline automatically uses the API key when it is available, raising the allowed NCBI request rate from 5 to 10 requests per second.
For detailed information about NCBI API keys, see the official NCBI Datasets API documentation.
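If you also want the key from .env available to ad-hoc commands in your current shell, one way to load it (a sketch, assuming .env contains plain KEY=value lines) is:

# Export every variable assigned while 'allexport' is on
set -a
source .env
set +a
# Confirm the key is set without printing it
[ -n "$NCBI_API_KEY" ] && echo "NCBI_API_KEY is set" || echo "NCBI_API_KEY is NOT set"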
If your species data is in an Excel file (.xlsx), you'll need to convert it to CSV format. The workflow expects a CSV file with these columns:
- cell_line: Unique identifier for each cell line/culture
- Accepted name: The accepted scientific name (used for NCBI queries)
- Legacy Name: Alternative/legacy name (backup for queries)
- Genus: Taxonomic genus
# Install required dependency (if not already installed)
pip install openpyxl
# Run the conversion script
python convert_excel_to_csv.py your_file.xlsx species.csv

Alternatively, convert manually with pandas:

# Install required dependency (if not already installed)
pip install openpyxl
# Convert Excel to CSV with proper column mapping
python3 -c "
import pandas as pd
# Read the Excel file
df = pd.read_excel('your_file.xlsx')
# Create a simplified dataframe with the columns we need
species_df = pd.DataFrame({
'cell_line': df['Culture ID'], # Use Culture ID as cell line identifier
'Accepted name': df['Accepted Name (link)'],
'Legacy Name': df['Legacy Name'],
'Genus': df['Genus']
})
# Remove duplicates based on all taxonomic fields
species_df = species_df.drop_duplicates(subset=['Accepted name', 'Legacy Name', 'Genus'])
print(f'Created species CSV with {len(species_df)} unique species (duplicates removed based on Accepted name, Legacy Name, and Genus)')
# Clean up formatting (remove tabs and extra whitespace)
species_df['Genus'] = species_df['Genus'].str.strip().str.replace('\t', '')
species_df['Accepted name'] = species_df['Accepted name'].str.strip()
species_df['Legacy Name'] = species_df['Legacy Name'].str.strip()
# Save to CSV
species_df.to_csv('species.csv', index=False)
print('Saved to species.csv')
"Note: Adjust the column names ('Culture ID', 'Accepted Name (link)', etc.) to match your Excel file's actual column headers. The standalone script (convert_excel_to_csv.py) includes error checking and helpful output messages.
Organize your files in a single directory like this:
/your/project/directory/
├── species.csv                 # Your input file with species names (CSV format)
├── query_sequences.fasta       # Your input file with query sequences
├── convert_excel_to_csv.py     # Excel to CSV conversion script (provided)
├── env.example                 # Environment variables template (provided)
├── config/
│   └── config.yaml             # Configuration file (provided)
├── workflow/
│   ├── Snakefile               # The main Snakemake workflow (provided)
│   ├── envs/                   # Conda environment definitions (provided)
│   └── scripts/                # Helper scripts (provided)
├── profiles/
│   └── slurm/                  # SLURM executor profile (provided)
│       └── config.v8+.yaml     # SLURM configuration (provided)
├── resources/                  # Intermediate files (created by workflow)
├── results/                    # Final outputs (created by workflow)
└── logs/                       # Workflow logs (created by workflow)
- species.csv: This file must contain a header with these exact columns (see the example rows after this list):
  - cell_line: Unique identifier for each cell line/culture
  - Accepted name: The accepted scientific name (used for NCBI queries)
  - Legacy Name: Alternative/legacy name (backup for queries)
  - Genus: Taxonomic genus
- query_sequences.fasta: This should be a standard FASTA file containing your query sequences (e.g., proteins, peptides, or any sequences you want to search against the species proteomes).
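For reference, a minimal species.csv could look like the following (hypothetical rows; substitute your own culture identifiers and names):

cell_line,Accepted name,Legacy Name,Genus
CULT-001,Chlamydomonas reinhardtii,Chlamydomonas reinhardii,Chlamydomonas
CULT-002,Thalassiosira pseudonana,Cyclotella nana,Thalassiosira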
Edit config/config.yaml to adjust:
- species_csv: Path to your species CSV file (default: "resources/species.csv")
- query_fasta: Path to your query sequences FASTA file (default: "resources/amp.fasta")
- num_shards: Number of query shards for parallel BLAST (default: 8)
- threads_per_blast: CPUs per BLAST job (default: 8)
- resolve_accessions_threads: Parallel workers for species resolution (default: 5)
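For reference, config/config.yaml is a flat key-value file; with the defaults listed above it would look roughly like this (a sketch; the provided file is authoritative):

species_csv: "resources/species.csv"
query_fasta: "resources/amp.fasta"
num_shards: 8
threads_per_blast: 8
resolve_accessions_threads: 5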
From your project root directory, run:
snakemake --profile profiles/slurm

The workflow will:
- Automatically create conda environments for each rule
- Submit jobs to SLURM using your configured profile
- Run BLAST jobs in parallel across multiple nodes
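Before submitting anything to the cluster, you can preview the planned jobs with a dry run (Snakemake's -n/--dry-run flag shows what would be executed without running it):

# Dry run: list the jobs that would be executed, without submitting anything
snakemake -n --profile profiles/slurm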
To run the workflow interactively while ensuring it continues after disconnection, use one of these methods:
Note for Tufts HPC users: On Tufts HPC login nodes, screen sessions may be
terminated when you disconnect. tmux sessions persist across disconnects and
are recommended. See Tufts RT Guides: Tmux
(https://rtguides.it.tufts.edu/hpc/application/30-tmux.html).
Method 1 (tmux):

# Start a new tmux session
tmux new-session -s snakemake_workflow
# Run the workflow
snakemake --profile profiles/slurm
# Detach from tmux: Press Ctrl+B, then D
# To reattach later: tmux attach-session -t snakemake_workflow

Method 2 (screen):

# Start a new screen session
screen -S snakemake_workflow
# Run the workflow
snakemake --profile profiles/slurm
# Detach from screen: Press Ctrl+A, then D
# To reattach later: screen -r snakemake_workflow

Reminder: on Tufts HPC, screen sessions can be killed when you disconnect; prefer tmux if you need persistence across network drops (see the Tufts RT Guides link above).
Method 3 (nohup):

# Run with nohup to prevent termination on disconnect
nohup snakemake --profile profiles/slurm > workflow.log 2>&1 &
# Monitor progress
tail -f workflow.log

Method 4 (SLURM job): create a job script run_workflow.sh:
#!/bin/bash
#SBATCH --job-name=snakemake_workflow
#SBATCH --partition=batch
#SBATCH --account=your_account
#SBATCH --time=72:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
# Load modules if needed
module load conda
# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate snakemake
# Run the workflow
snakemake --profile profiles/slurm

Then submit:
sbatch run_workflow.sh

Recommendation: Use Method 1 (tmux) or Method 4 (SLURM job) for long-running workflows. tmux allows interactive monitoring and persists across disconnects, while a SLURM job is more robust for very long runs.
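However you launch the controller process, the individual workflow rules run as SLURM jobs, so you can check on them with standard SLURM tools:

# List your pending and running jobs (workflow rules appear as separate SLURM jobs)
squeue -u $USER
# If you used Method 4, also follow the controller job's log
# (SLURM writes it to slurm-<jobid>.out by default; replace <jobid> with the ID printed by sbatch)
tail -f slurm-<jobid>.out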
Edit profiles/slurm/config.v8+.yaml to adjust:
- slurm_partition: SLURM partition name (default: "batch")
- slurm_account: SLURM account name (default: "default")
- runtime: Job time limit in minutes (default: 1440)
- mem_mb: Memory per job in MB (default: 8000)
- cpus_per_task: CPUs per job (default: 4)
- slurm_extra: Additional SLURM flags (email notifications, etc.)
- jobs: Maximum concurrent jobs (default: 100)
- latency_wait: Wait time for files in seconds (default: 120)
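For orientation, a Snakemake v8+ SLURM profile carrying these settings generally looks something like the sketch below; the provided profiles/slurm/config.v8+.yaml is the authoritative version, and key names and layout can differ slightly between Snakemake releases:

executor: slurm
jobs: 100                      # maximum concurrent jobs
latency-wait: 120              # seconds to wait for output files on shared filesystems
software-deployment-method: conda
default-resources:
  slurm_partition: batch
  slurm_account: default
  runtime: 1440                # minutes
  mem_mb: 8000
  cpus_per_task: 4
  # slurm_extra can hold additional sbatch flags (e.g., mail notifications)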
The workflow will produce several output files in the results/ directory:
- blast_results.tsv: The raw, tab-separated output from the BLASTp search (outfmt 6).
- species_with_hits.csv: A list of species from your input that had at least one significant BLASTp hit against your query sequences.
- analysis_summary.txt: A human-readable summary of the key statistics from the analysis.
Additional intermediate files are created in the resources/ directory:
- species_status.csv: A log detailing which of your species had a reference proteome available for download on NCBI.
- accessions.txt: List of NCBI genome accessions for species with available proteomes.
- download_info.csv: Detailed download information for each species.
- proteomes/: Directory containing downloaded proteome files (compressed FASTA format).
- blast_db/: Directory containing the BLAST database files.
- query_shards/: Directory containing split query files for parallel processing.
- blast_results/: Directory containing individual BLAST result files (temporary).
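Once the run finishes, a few shell one-liners give a quick overview of the main outputs (a sketch, assuming BLAST's default outfmt 6 column order, where column 1 is the query ID and column 2 is the subject ID):

# Peek at the first few BLAST hits
head results/blast_results.tsv
# Count distinct query sequences with at least one hit
cut -f1 results/blast_results.tsv | sort -u | wc -l
# Count species with hits (assumes species_with_hits.csv has a one-line header)
tail -n +2 results/species_with_hits.csv | wc -l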
This setup allows the entire job to run independently on the cluster, and you will be notified by email when it completes (if email notifications are configured in your SLURM profile).