CRISPRAligner

A Python tool for identifying, extracting, and aligning CRISPR spacer sequences from bacterial genomes, with a focus on CRISPR repeat regions (CRRs) in different bacteria.

Overview

CRISPRAligner processes bacterial genome FASTA files to:

Identify CRISPR repeat regions (like CRR1, CRR2, and when available, CRR4)
Group CRISPR arrays by spatial proximity to detect separate arrays with letter suffixes (e.g., CRR1A, CRR1B)
Extract spacer sequences between repeats
Align spacers across different genomes
Generate comparative analyses in multiple formats (FASTA, CSV, and binary phylip)

The tool supports both individual analyses of specific CRISPR cassettes and combined analyses for phylogenetic applications.

Installation

Prerequisites

Python 3.6 or higher
Biopython
NumPy

Setup

# Clone the repository
git clone https://github.com/Ffloriel/CRISPRAligner.git
cd CRISPRAligner

# Install the package
pip install .

Usage

Command Line Interface

# Basic usage with a directory containing FASTA files
crispr_aligner PREFIX [--input_dir /path/to/fasta/files] [--bacteria SPECIES] [--custom_config FILE]

Where:

PREFIX is used for naming output files
--input_dir (optional) specifies the directory containing input FASTA files (defaults to current directory)
--bacteria (optional) species name to load built-in configurations (defaults to "erwinia")
--custom_config (optional) path to a custom JSON configuration file
--min_crr1_score, --min_crr2_score, --min_crr4_score (optional) minimum match scores
--max_frame_size (optional) maximum size in bp for a CRISPR array frame (default: 20000)

Examples

# Process FASTA files in the current directory with prefix "Erwinia"
crispr_aligner Erwinia

# Use built-in Phytobacter configuration (no CRR4)
crispr_aligner Phytobacter --bacteria phytobacter

# Use a custom configuration file
crispr_aligner MyBacteria --custom_config my_species_config.json

# Adjust minimum match scores
crispr_aligner Erwinia --min_crr1_score 20 --min_crr4_score 18

# Change max frame size for spatially separating CRISPR arrays
crispr_aligner Erwinia --max_frame_size 15000

Output

Results are organized in the following directory structure:

Results/
├── Erwinia.Results.csv       # Summary of CRISPR spacers by genome
├── Erwinia.Error.fasta       # Potential assembly errors
├── AllCRR/                   # Combined CRR results
│   ├── Erwinia.AllCRRAligned.csv
│   ├── Erwinia.AllCRRAligned.fasta
│   └── Erwinia.AllCRRBinary.phy
├── CRR1A/                    # First spatial frame of CRR1 results
│   ├── Erwinia.CRR1A.fasta
│   ├── Erwinia.CRR1AAligned.csv
│   ├── Erwinia.CRR1AAligned.fasta
│   ├── Erwinia.CRR1ABinary.phy
│   └── Erwinia.CRR1AUnique.fasta
├── CRR1B/                    # Second spatial frame of CRR1 (if found)
│   └── [Similar files for CRR1B]
├── CRR2A/                    # First spatial frame of CRR2
│   └── [Similar files for CRR2A]
└── CRR4A/                    # First spatial frame of CRR4 (if available)
    └── [Similar files for CRR4A]

Configuration System

CRISPRAligner supports different bacterial species through a flexible configuration system.

Built-in Configurations

The tool includes built-in configurations for several bacteria:

erwinia (default): Erwinia amylovora (CRR1, CRR2, CRR4)
phytobacter: Phytobacter species (CRR1, CRR2 only, no CRR4)
And others (use --bacteria parameter to specify)

Custom Configuration Files

For other species, you can create a custom JSON configuration file:

{
  "consensus_sequences": {
    "crr1": "GTGTTCCCCGCGTATGCGGGGATAAACCG",
    "crr2": "GTGCTTCACTGCATGCGTTGATAAACCG",
    "crr4": "GTTCACTGCCGTACAGGCAGCTTAGAAA"
  },
  "short_checks": {
    "crr1_crr2_start": "GTGTTC",
    "crr1_crr2_end": "ATAAACC",
    "crr4_start": "GTTCAC",
    "crr4_middle": "GTACAGG"
  },
  "available_crrs": ["crr1", "crr2", "crr4"]
}

The configuration contains:

consensus_sequences: The CRR consensus sequences to search for
short_checks: Short sequences used for initial pattern matching
available_crrs: List of CRR types available for this bacteria (omit unavailable ones)

Programmatic Usage

You can also use the package in your own Python scripts:

from crispr_aligner import find_crispr_spacers, align_spacers
from crispr_aligner.combiner.crr_combiner import combine_fasta, combine_csv, combine_binary
from crispr_aligner.utils import load_bacteria_config

# Load a configuration for a specific bacteria
config = load_bacteria_config("phytobacter")

# Find CRISPR spacers in genomes with specific configuration
find_crispr_spacers(
    "combined.fasta", 
    "MyPrefix.", 
    results_dir="Results",
    consensus_sequences=config["consensus_sequences"],
    short_checks=config["short_checks"],
    available_crrs=config["available_crrs"],
    min_scores={"crr1": 22, "crr2": 22, "crr4": 21},
    max_frame_size=20000
)

# Get all frame directories created by find_crispr_spacers
frame_dirs = []
for item in Path("Results").iterdir():
    if item.is_dir() and item.name not in ["AllCRR"]:
        frame_dirs.append(item.name)

# Align spacers for each frame
for frame in frame_dirs:
    align_spacers(
        f"Results/{frame}/MyPrefix.{frame}.fasta",
        f"Results/{frame}/MyPrefix.{frame}Aligned.csv",
        f"Results/{frame}/MyPrefix.{frame}Aligned.fasta",
        f"Results/{frame}/MyPrefix.{frame}Unique.fasta",
        f"Results/{frame}/MyPrefix.{frame}Binary.phy",
        frame
    )

# Collect files for combination
fasta_files = [f"Results/{frame}/MyPrefix.{frame}Aligned.fasta" for frame in frame_dirs]
csv_files = [f"Results/{frame}/MyPrefix.{frame}Aligned.csv" for frame in frame_dirs]
phy_files = [f"Results/{frame}/MyPrefix.{frame}Binary.phy" for frame in frame_dirs]

# Combine alignments from different frames
combine_fasta(fasta_files, "Results/AllCRR/MyPrefix.AllCRRAligned.fasta")
combine_csv(csv_files, "Results/AllCRR/MyPrefix.AllCRRAligned.csv")
combine_binary(phy_files, "Results/AllCRR/MyPrefix.AllCRRBinary.phy")

Project Structure

CRISPRAligner/
├── crispr_aligner/
│   ├── __init__.py                # Package initialization
│   ├── cli.py                     # Command-line interface
│   ├── sequences/                 # FASTA file handling
│   │   ├── __init__.py
│   │   └── fasta_utils.py
│   ├── finder/                    # CRISPR spacer identification
│   │   ├── __init__.py
│   │   └── crr_finder.py
│   ├── aligner/                   # Spacer alignment
│   │   ├── __init__.py
│   │   └── crr_aligner.py
│   ├── combiner/                  # Result combination
│   │   ├── __init__.py
│   │   └── crr_combiner.py
│   └── utils/                     # Utility functions
│       ├── __init__.py
│       └── config.py              # Configuration management
├── tests/                         # Test suite
│   └── ...
├── README.md
└── pyproject.toml                 # Package configuration

Technical Details

CRISPR Repeat Regions

CRISPRAligner identifies different CRISPR repeat regions depending on the bacteria:

For Erwinia amylovora:

CRR1: Consensus sequence GTGTTCCCCGCGTGAGCGGGGATAAACCG
CRR2: Consensus sequence GTGTTCCCCGCGTATGCGGGGATAAACCG
CRR4: Consensus sequence GTTCACTGCCGTACAGGCAGCTTAGAAA

For Phytobacter species:

CRR1: Consensus sequence GTGTTCCCCGCGCGAGCGGGGATAAACCG
CRR2: Consensus sequence GTGTTCCCCGCGCCAGCGGGGATAAACCG (CRR4 is not present in Phytobacter)

Other bacteria will have different consensus sequences, which can be specified through the configuration system.

Frame-Based CRISPR Detection

CRISPRAligner now groups CRISPR arrays into frames based on spatial proximity:

CRISPR repeats of the same type (e.g., CRR1) are grouped into a frame if they are within a maximum distance (default: 20,000 bp)
Repeats separated by more than the maximum distance are considered separate arrays and assigned different letter suffixes (e.g., CRR1A, CRR1B)
Each frame is processed separately for alignment and analysis
Results can be combined across frames for comprehensive analysis

This approach preserves the biological reality that spatially separate CRISPR arrays of the same type are functionally distinct units.

Analysis Process

Finding: Sequences are scanned for matches to CRISPR repeat consensus sequences
Framing: Repeats are grouped into spatial frames (e.g., CRR1A, CRR1B)
Extraction: Spacers between repeats are extracted
Alignment: Spacers are aligned across different genomes, separately for each frame
Combination: Results from different frames are combined for comparative analysis

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please reference:

Parcey M, Gayder S, Castle AJ, Svircev AM. 2022. Function and Application of the CRISPR-Cas System in the Plant Pathogen Erwinia amylovora. Appl Envir Micro. e02513-21. doi: 10.1128/aem.02513-21

Credits

Originally developed by Michael Parcey at Brock University.
Reference: Parcey M, Gayder S, Castle AJ, Svircev AM. 2022. Function and Application of the CRISPR-Cas System in the Plant Pathogen Erwinia amylovora. Appl Envir Micro. e02513-21. doi: 10.1128/aem.02513-21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CRISPRAligner

Overview

Installation

Prerequisites

Setup

Usage

Command Line Interface

Examples

Output

Configuration System

Built-in Configurations

Custom Configuration Files

Programmatic Usage

Project Structure

Technical Details

CRISPR Repeat Regions

Frame-Based CRISPR Detection

Analysis Process

License

Citation

Credits

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
crispr_aligner		crispr_aligner
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

License

Ffloriel/CRISPRAligner

Folders and files

Latest commit

History

Repository files navigation

CRISPRAligner

Overview

Installation

Prerequisites

Setup

Usage

Command Line Interface

Examples

Output

Configuration System

Built-in Configurations

Custom Configuration Files

Programmatic Usage

Project Structure

Technical Details

CRISPR Repeat Regions

Frame-Based CRISPR Detection

Analysis Process

License

Citation

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages