Salmonella serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST. Mash MinHash can also be used for serovar prediction.
Latest stable version
Don't want to use a command-line app? Try the SISTR web interface deployed on the Galaxy and Render.com online platforms (see the Web application section).
If you find this tool useful, please cite as:
The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. Catherine Yoshida, Peter Kruczkiewicz, Chad R. Laing, Erika J. Lingohr, Victor P.J. Gannon, John H.E. Nash, Eduardo N. Taboada. PLoS ONE 11(1): e0147101. doi: 10.1371/journal.pone.0147101
Paper link: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147101
@article{Yoshida2016,
        doi = {10.1371/journal.pone.0147101},
        url = {http://dx.doi.org/10.1371/journal.pone.0147101},
        year  = {2016},
        month = {jan},
        publisher = {Public Library of Science ({PLoS})},
        volume = {11},
        number = {1},
        pages = {e0147101},
        author = {Catherine E. Yoshida and Peter Kruczkiewicz and Chad R. Laing and Erika J. Lingohr and Victor P. J. Gannon and John H. E. Nash and Eduardo N. Taboada},
        editor = {Michael Hensel},
        title = {The Salmonella In Silico Typing Resource ({SISTR}): An Open Web-Accessible Tool for Rapidly Typing and Subtyping Draft Salmonella Genome Assemblies},
        journal = {{PLOS} {ONE}}
}
You can install sistr_cmd using Conda from the BioConda channel:
# Install conda. Miniconda is recommended if you don't have Conda installed already
# see https://conda.io/miniconda.html
# Add Bioconda channel and other channels https://bioconda.github.io/#set-up-channels
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
# Install sistr_cmd and its dependencies
conda install sistr_cmd
# sistr_cmd should be installed in your $PATH
sistr --help
Installing sistr_cmd with Conda is recommended for the least amount of headache, since Conda ensures that all necessary external dependencies (i.e. blast+, mafft, mash) are installed along with sistr_cmd. This also helps you get sistr_cmd running on older systems (e.g. CentOS 5) or where you have limited user privileges.
You can install sistr_cmd using pip:
pip install sistr_cmd
sistr_cmd is available from PyPI at https://pypi.python.org/pypi/sistr-cmd
NOTE: You will need to ensure that external dependencies are installed (i.e. blast+, mafft, mash [optionally])
You can install sistr_cmd directly from the source code:
python setup.py install
NOTE: If you run into pycurl library installation issues, you can try to resolve them by installing system dependencies via a package manager such as apt (apt install python3-pycurl libcurl4-openssl-dev libssl-dev)
SISTR can be publicly accessed as a web application via:
- Galaxy EU instance at https://usegalaxy.eu/root?tool_id=sistr_cmd 
- Render.com Cloud Hosting Platform-as-a-Service (PaaS) hosts a DEMO SISTR web application https://sistr-app.onrender.com/ 
NOTE 1: The SISTR web application hosted on Render.com might take up to 20 seconds to load on the first run and will shut down after 15 min of inactivity.
NOTE 2: The SISTR web application source code is available at https://github.com/phac-nml/sistr-web-app, allowing easy deployment of the web interface on any type of infrastructure (on-premises, cloud/remote).
SISTR automatically initializes its database of Salmonella serovar determination antigens, cgMLST profiles and Mash sketch of reference genomes by downloading it from a remote location. The SISTR database v1.3 received minor updates that collapse some of the serovars with O24/O25 antigens, as detailed in the CHANGELOG.md file.
- SISTR v1.1 database is available at https://zenodo.org/records/13618515 or via a direct URL https://zenodo.org/records/13618515/files/SISTR_V_1.1_db.tar.gz?download=1 (used with SISTR < 1.1.3)
- SISTR v1.3 database is available at https://zenodo.org/records/15078586 or via a direct URL https://zenodo.org/records/14270992/files/SISTR_V_1.1.3_db.tar.gz?download=1 (used with SISTR >= 1.1.3)
These are the external dependencies required for sistr_cmd:
- Python (>= v2.7 OR >= v3.4)
- BLAST+ (>= v2.2.30)
- MAFFT (>=v7.271 (2016/1/6))
- Mash v2.0+ [optional]
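Before running sistr_cmd outside of Conda, it can help to confirm that the external tools are reachable on your $PATH. The following is a minimal sketch, not part of sistr_cmd itself; the executable names (blastn, makeblastdb, mafft, mash) are assumptions about how these packages are usually installed, and the helper name find_missing is hypothetical.

```python
import shutil

# Executable names are assumptions about how the tools are usually installed;
# adjust them if your system uses different names.
REQUIRED_TOOLS = ["blastn", "makeblastdb", "mafft"]
OPTIONAL_TOOLS = ["mash"]


def find_missing(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


missing_required = find_missing(REQUIRED_TOOLS)
missing_optional = find_missing(OPTIONAL_TOOLS)
print("Missing required tools:", missing_required or "none")
print("Missing optional tools:", missing_optional or "none")
```

This only checks that the executables exist; it does not check that their versions meet the minimums listed above.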
sistr_cmd requires the following Python libraries:
- numpy (>=1.11.1)
- pandas (>=0.18.1)
You can run the following commands to get up-to-date versions of numpy and pandas
pip install --upgrade pip
pip install wheel
pip install numpy pandas
If you run sistr -h, you should see the following usage info:
usage: sistr_cmd [-h] [-i fasta_path genome_name] [-f OUTPUT_FORMAT]
                 [-o OUTPUT_PREDICTION] [-p CGMLST_PROFILES]
                 [-n NOVEL_ALLELES] [-a ALLELES_OUTPUT] [-T TMP_DIR] [-K]
                 [--use-full-cgmlst-db] [--no-cgmlst] [-m] [--qc] [-t THREADS]
                 [-v] [-V]
                 [F [F ...]]
SISTR (Salmonella In Silico Typing Resource) Command-line Tool
==============================================================
Serovar predictions from whole-genome sequence assemblies by determination of antigen gene and cgMLST gene alleles using BLAST.
Note about using the "--use-full-cgmlst-db" flag:
The "centroid" allele database is ~10% the size of the full set so analysis is much quicker with the "centroid" vs "full" set of alleles. Results between 2 cgMLST allele sets should not differ.
If you find this program useful in your research, please cite as:
The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies.
Catherine Yoshida, Peter Kruczkiewicz, Chad R. Laing, Erika J. Lingohr, Victor P.J. Gannon, John H.E. Nash, Eduardo N. Taboada.
PLoS ONE 11(1): e0147101. doi: 10.1371/journal.pone.0147101
positional arguments:
  F                     Input genome FASTA file
optional arguments:
  -h, --help            show this help message and exit
  -i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
                        fasta file path to genome name pair
  -f OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
                        Output format (json, csv, pickle)
  -o OUTPUT_PREDICTION, --output-prediction OUTPUT_PREDICTION
                        SISTR serovar prediction output path
  -p CGMLST_PROFILES, --cgmlst-profiles CGMLST_PROFILES
                        Output CSV file destination for cgMLST allelic
                        profiles
  -n NOVEL_ALLELES, --novel-alleles NOVEL_ALLELES
                        Output FASTA file destination of novel cgMLST alleles
                        from input genomes
  -a ALLELES_OUTPUT, --alleles-output ALLELES_OUTPUT
                        Output path of allele sequences and info to JSON
  -T TMP_DIR, --tmp-dir TMP_DIR
                        Base temporary working directory for intermediate
                        analysis files.
  -K, --keep-tmp        Keep temporary analysis files.
  --use-full-cgmlst-db  Use the full set of cgMLST alleles which can include
                        highly similar alleles. By default the smaller
                        "centroid" alleles or representative alleles are used
                        for each marker.
  --no-cgmlst           Do not run cgMLST serovar prediction
  -m, --run-mash        Determine Mash MinHash genomic distances to Salmonella
                        genomes with trusted serovar designations. Mash binary
                        must be in accessible via $PATH (e.g. /usr/bin).
  --qc                  Perform basic QC to provide level of confidence in
                        serovar prediction results.
  -t THREADS, --threads THREADS
                        Number of parallel threads to run sistr_cmd analysis.
  -l LIST_OF_SEROVARS, --list-of-serovars LIST_OF_SEROVARS
                        A path to a single column text file containing list of
                        serovar(s) to check serovar prediction against. Result
                        reported in the "predicted_serovar_in_list"
                        field as Y (present) or N (absent) value.
  -v, --verbose         Logging verbosity level (-v == show warnings; -vvv ==
                        show debug info)
  -V, --version         show program's version number and exit
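If you drive sistr from a script, the options above can be assembled programmatically. The sketch below builds a command line using only flags that appear in the usage text (-f, -o, -t, --qc); the helper names build_sistr_command and run_sistr and the file names are hypothetical, and nothing is executed unless you call run_sistr yourself.

```python
import shutil
import subprocess


def build_sistr_command(fasta_paths, output_path, output_format="json", qc=True, threads=1):
    """Assemble a sistr command line from flags listed in the usage text."""
    cmd = ["sistr", "-f", output_format, "-o", output_path, "-t", str(threads)]
    if qc:
        cmd.append("--qc")
    # Input genome FASTA files are positional arguments and go last.
    cmd.extend(fasta_paths)
    return cmd


def run_sistr(cmd):
    """Run the assembled command, failing early if sistr is not on $PATH."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("sistr executable not found on $PATH")
    return subprocess.run(cmd, check=True)


cmd = build_sistr_command(["LT2.fasta"], "sistr-output.csv", output_format="csv", threads=2)
print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting problems with file paths that contain spaces.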
By running the following command on a FASTA file of Salmonella enterica strain LT2 (https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP014051.1):
sistr --qc -vv --alleles-output allele-results.json --novel-alleles novel-alleles.fasta --cgmlst-profiles cgmlst-profiles.csv -f tab -o sistr-output.tab LT2.fasta
You should see some log messages like so:
<time> INFO: Running sistr_cmd 0.3.4 [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:290]
<time> INFO: Serial single threaded run mode on 1 genomes [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:319]
<time> INFO: Initializing temporary analysis directory "/tmp/20170309104912-SISTR-LT2" and preparing for BLAST searching. [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:175]
<time> INFO: Temporary FASTA file copied to /tmp/20170309104912-SISTR-LT2/LT2_fasta [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:177]
<time> INFO: Running BLAST on serovar predictive cgMLST330 alleles [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:319]
<time> INFO: Reading BLAST output file "/tmp/20170309104912-SISTR-LT2/cgmlst-centroid.fasta-LT2_fasta-2017Mar09_10_49_13.blast" [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:322]
<time> INFO: Found 6525 cgMLST330 allele BLAST results [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:333]
<time> INFO: Marker NZ_AOXE01000081.1_201 | Recovered novel allele with gaps (n=0) of length 477 vs length 477 for ref allele NZ_AOXE01000081.1_201|2823059714. Novel allele name=3250876267 [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:181]
<time> INFO: Type retrieved_marker_alleles <type 'dict'> [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:343]
<time> INFO: Calculating number of matching alleles to serovar predictive cgMLST330 profiles [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:360]
<time> INFO: Top subspecies by cgMLST is "enterica" (min dist=0.00909090909091, Counter={'enterica': 11532}) [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:369]
<time> INFO: Top serovar by cgMLST profile matching: "Typhimurium" with 327 matching alleles, distance=0.9% [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:385]
<time> INFO: cgMLST330 Sequence Type=660408169 [in /usr/lib/python2.7/site-packages/sistr/src/cgmlst/__init__.py:404]
<time> INFO: LT2 | Antigen gene BLAST serovar prediction: "Typhimurium" serogroup=B 1,4,[5],12:i:1,2 [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:207]
<time> INFO: LT2 | Subspecies prediction: enterica [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:210]
<time> INFO: LT2 | Overall serovar prediction: Typhimurium [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:213]
<time> INFO: Genome size=4857473 (within gsize thresholds? True) [in /usr/lib/python2.7/site-packages/sistr/src/qc/__init__.py:13]
<time> INFO: Deleting temporary working directory at /tmp/20170309104912-SISTR-LT2 [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:220]
<time> INFO: Writing output "tab" file to "sistr-output.tab" [in /usr/lib/python2.7/site-packages/sistr/src/writers.py:38]
<time> INFO: cgMLST allelic profiles written to cgmlst-profiles.csv [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:340]
<time> INFO: JSON of allele data written to allele-results.json for 1 cgMLST allele results [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:343]
<time> INFO: Wrote 330 alleles to novel-alleles.fasta [in /usr/lib/python2.7/site-packages/sistr/sistr_cmd.py:346]
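The cgMLST distance reported in the log appears consistent with the proportion of the 330 loci that do not match the closest profile. The arithmetic below is a sketch under that assumption; it is inferred from the logged numbers, not taken from the sistr_cmd source.

```python
TOTAL_LOCI = 330        # size of the cgMLST330 scheme named in the log
matching_alleles = 327  # "327 matching alleles" from the log

# Assumed relationship: distance = 1 - (matching alleles / total loci)
distance = 1 - matching_alleles / TOTAL_LOCI
print(f"distance = {distance:.14f} ({distance:.1%})")
```

Under this reading, 3 mismatching loci out of 330 give the logged "min dist" value and the rounded 0.9% figure.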
sistr_cmd has several output options. The primary output is the serovar prediction and in silico typing results output (e.g. -o sistr-results.tab).
Summary of output options:
- primary results output
  - serovar prediction, cgMLST results, Mash results
  - format (-f <format>): tab, csv, json, pickle
  - -o sistr-results
- cgMLST allele results
  - in-depth allele search results for each input genome for each cgMLST locus (330 loci in total)
  - includes extracted allele sequences, top blastn results and summarized mafft results
  - format: JSON
  - -a allele-results.json
- cgMLST allelic profiles
  - table of allele designations for each genome for each cgMLST locus
  - row names: genome names
  - column names: cgMLST marker names
  - format: CSV
  - --cgmlst-profiles cgmlst-profiles.csv
SISTR supports various text output formats specified by the -f option with json being the default.
[
  {
    "serovar_cgmlst": "Typhimurium",
    "cgmlst_matching_alleles": 327,
    "h1": "i",
    "serovar_antigen": "Typhimurium",
    "cgmlst_distance": 0.009090909090909038,
    "h2": "1,2",
    "cgmlst_genome_match": "LT2",
    "cgmlst_ST": 660408169,
    "serovar": "Typhimurium",
    "fasta_filepath": "/full/path/to/LT2.fasta",
    "genome": "LT2",
    "serogroup": "B",
    "qc_messages": "",
    "qc_status": "PASS",
    "o_antigen": "1,4,[5],12",
    "cgmlst_subspecies": "enterica"
  }
]
cgmlst_ST       cgmlst_distance cgmlst_genome_match     cgmlst_matching_alleles cgmlst_subspecies       fasta_filepath  genome  h1      h2      o_antigen       qc_messages     qc_status       serogroup       serovar serovar_antigen serovar_cgmlst
660408169       0.00909090909091        LT2     327     enterica        /home/peter/Downloads/sistr-LT2-example/LT2.fasta       LT2     i       1,2     1,4,[5],12              PASS    B       Typhimurium     Typhimurium     Typhimurium
Raw csv output results opened in a text editor
cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst
660408169,0.00909090909091,LT2,327,enterica,/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium
The same csv results rendered as a table
| cgmlst_ST | cgmlst_distance | cgmlst_genome_match | cgmlst_matching_alleles | cgmlst_subspecies | fasta_filepath | genome | h1 | h2 | o_antigen | qc_messages | qc_status | serogroup | serovar | serovar_antigen | serovar_cgmlst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 660408169 | 0.00909090909091 | LT2 | 327 | enterica | /home/peter/Downloads/sistr-LT2-example/LT2.fasta | LT2 | i | 1,2 | 1,4,[5],12 | | PASS | B | Typhimurium | Typhimurium | Typhimurium |
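Because antigen fields such as h2 and o_antigen contain commas, the CSV output quotes them, so it is safer to read the file with a CSV parser than to split lines on commas. Here is a small sketch using Python's standard csv module on the two CSV lines shown above; the variable names are illustrative only.

```python
import csv
import io

# The raw CSV header and result line from the example above,
# embedded here so the snippet is self-contained.
raw_csv = (
    "cgmlst_ST,cgmlst_distance,cgmlst_genome_match,cgmlst_matching_alleles,"
    "cgmlst_subspecies,fasta_filepath,genome,h1,h2,o_antigen,qc_messages,"
    "qc_status,serogroup,serovar,serovar_antigen,serovar_cgmlst\n"
    "660408169,0.00909090909091,LT2,327,enterica,"
    "/home/peter/Downloads/sistr-LT2-example/LT2.fasta,LT2,i,"
    '"1,2","1,4,[5],12",,PASS,B,Typhimurium,Typhimurium,Typhimurium\n'
)

# DictReader handles the quoted, comma-containing antigen fields correctly.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
result = rows[0]
print(result["genome"], result["serovar"], result["qc_status"])
print("O antigen:", result["o_antigen"], "| H1:", result["h1"], "| H2:", result["h2"])
```

To read a real results file, replace the io.StringIO object with an open file handle, e.g. open("sistr-output.csv", newline="").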
You can produce in-depth allele search results with the -a/--alleles-output commandline argument.
These results may be useful for understanding unexpected or low confidence serovar predictions.
{
        <genome name>: {
                // for each cgMLST marker
                <cgMLST marker id>: {
                        // top blast result on largest contig
                        blast_result: {
                                // perfect match to a previously identified allele?
                                "is_perfect": boolean,
                                // blastn subject sequence length
                                "slen": integer,
                                // blastn percent identity
                                "pident": numeric,
                                // cgMLST marker name
                                "marker": string,
                                // blastn query sequence id
                                "qseqid": string,
                                // blastn query sequence start index
                                "qstart": integer,
                                // is match truncated by end of sequence?
                                "is_trunc": boolean,
                                // number of MSA gaps in subject sequence
                                "sseq_msa_gaps": integer,
                                // blastn subject sequence
                                "sseq": string,
                                // blastn bitscore
                                "bitscore": numeric,
                                // proportion of subject sequence MSA with gaps
                                "sseq_msa_p_gaps": numeric,
                                // blastn E-value
                                "evalue": numeric,
                                // blastn gap open
                                "gapopen": integer,
                                // blastn subject sequence end index
                                "send": integer,
                                // does this allele have a perfect match?
                                "has_perfect_match": boolean,
                                // matching allele name
                                "allele": integer,
                                // subject sequence start index
                                "sstart": integer,
                                // extracted allele name (CRC32 of subject nucleotide sequence)
                                "allele_name": integer,
                                // adjusted subject sequence start index
                                "start_idx": numeric,
                                // blastn query end index
                                "qend": integer,
                                // did the extracted allele sequence need to be reverse complemented?
                                "needs_revcomp": boolean,
                                // did the extracted allele sequence need to be extended to match the length of the query sequence?
                                "is_extended": boolean,
                                // blastn number of mismatches
                                "mismatch": integer,
                                // extracted allele coverage i.e. (length of extracted allele) / (length of closest matching allele)
                                "coverage": numeric,
                                // too many gaps within the MSA of extracted allele sequence and closest matching allele?
                                "too_many_gaps": boolean,
                                // adjusted subject end index
                                "end_idx": numeric,
                                // is extracted allele truncated by end of sequence?
                                "trunc": boolean,
                                // blastn subject sequence title
                                "stitle": string,
                                // blastn query sequence length
                                "qlen": integer,
                                // valid allele match found?
                                "is_match": boolean,
                                // blastn alignment length
                                "length": integer
                        },
                        // CRC32 unsigned 32-bit integer allele name from allele sequence
                        "name": integer,
                        // extracted allele sequence
                        "seq": string
                }
        }
}
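One way to use this structure when investigating a low confidence prediction is to list the markers whose top BLAST result is not a perfect match, is truncated, or has too many gaps. The sketch below relies only on field names from the schema above; the helper names, the reason labels and the two-marker example data are hypothetical.

```python
import json


def load_allele_results(path):
    """Load the JSON file written by -a/--alleles-output."""
    with open(path) as handle:
        return json.load(handle)


def flag_markers(allele_results):
    """Return (genome, marker, reasons) for markers that deserve a closer look."""
    flagged = []
    for genome, markers in allele_results.items():
        for marker, result in markers.items():
            blast = result.get("blast_result") or {}
            reasons = [key for key in ("is_trunc", "too_many_gaps") if blast.get(key)]
            if not blast.get("is_perfect", False):
                reasons.insert(0, "not_perfect")
            if reasons:
                flagged.append((genome, marker, reasons))
    return flagged


# Hypothetical, heavily trimmed example data following the schema above.
example = {
    "LT2": {
        "marker_A": {"blast_result": {"is_perfect": True, "is_trunc": False, "too_many_gaps": False}},
        "marker_B": {"blast_result": {"is_perfect": False, "is_trunc": True, "too_many_gaps": False}},
    }
}
for genome, marker, reasons in flag_markers(example):
    print(genome, marker, ", ".join(reasons))
```

With a real run you would call flag_markers(load_allele_results("allele-results.json")) instead of using the example dictionary.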
Here is a truncated example of the allele search results in JSON format for the LT2 sample:
{
  "LT2": {
    "NZ_AOXE01000034.1_82": {
      "blast_result": {
        "is_perfect": false,
        "slen": 4857473,
        "pident": 99.479,
        "marker": "NZ_AOXE01000034.1_82",
        "qseqid": "NZ_AOXE01000034.1_82|340989631",
        "qstart": 1,
        "is_trunc": false,
        "sseq_msa_gaps": 0,
        "sseq": "ATGCCAACCAGACCACCTTATCCGCGGGAAGCTTATATCGTCACCATTGAAAAAGGCACGCCGGGCCAGACGGTGACGTGGTATCAGCTACGGGCTGACCATCCGAAACCTGATTCGCTCATCAGCGAGCATCCGACCGCAGAAGAAGCGATGGATGCGAAAAATCGTTACGAAGATCCGGATAAATCATAG",
        "bitscore": 350.0,
        "sseq_msa_p_gaps": 0.0,
        "evalue": 3.289999999999999E-97,
        "gapopen": 0,
        "send": 358277,
        "has_perfect_match": false,
        "allele": 340989631,
        "sstart": 358468,
        "allele_name": 1204520418,
        "start_idx": 358276.0,
        "qend": 192,
        "needs_revcomp": true,
        "is_extended": false,
        "mismatch": 1,
        "coverage": 1.0,
        "too_many_gaps": false,
        "end_idx": 358467.0,
        "trunc": false,
        "stitle": "NZ_CP014051.1 Salmonella enterica strain LT2, complete genome",
        "qlen": 192,
        "is_match": true,
        "length": 192
      },
      "name": 1204520418,
      "seq": "ATGCCAACCAGACCACCTTATCCGCGGGAAGCTTATATCGTCACCATTGAAAAAGGCACGCCGGGCCAGACGGTGACGTGGTATCAGCTACGGGCTGACCATCCGAAACCTGATTCGCTCATCAGCGAGCATCCGACCGCAGAAGAAGCGATGGATGCGAAAAATCGTTACGAAGATCCGGATAAATCATAG"
    },
    // 329 other cgMLST allele results
  },
  "another-genome": { /* allele results */}
}
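The schema describes allele names as the CRC32 of the allele nucleotide sequence, expressed as an unsigned 32-bit integer. A minimal sketch of that kind of naming with Python's zlib module follows; how sistr_cmd normalizes the sequence before hashing (case, encoding) is an assumption here, so this is not guaranteed to reproduce the names in a real output file. The function name and the short sequence are hypothetical.

```python
import zlib


def allele_name_from_sequence(seq):
    """CRC32 of a nucleotide sequence as an unsigned 32-bit integer.

    Assumes the sequence is hashed as plain ASCII text exactly as given.
    """
    return zlib.crc32(seq.encode("ascii")) & 0xFFFFFFFF


# Short hypothetical sequence, used for illustration only.
name = allele_name_from_sequence("ATGCCAACCAGA")
print(name)
```

Naming alleles by a hash of their sequence means identical sequences always receive the same name without needing a central registry, at the cost of a small chance of collisions.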
With the -p/--cgmlst-profiles commandline argument, you can output the 330 loci cgMLST allelic profiles for your input genomes (i.e. the allele designation for each cgMLST locus for each input genome).
You can use these profiles to construct phylogenetic trees with a tool such as Phyloviz Online by uploading the cgMLST profiles data.
This type of analysis may be useful to explore why unexpected serovar prediction results were generated (e.g. your genomes are genetically very different from each other).
Example truncated cgMLST profiles output:
| | NC_003198.1_3005 | NC_006905.1_2841 | NC_011149.1_467 | ... |
| --- | --- | --- | --- | --- |
| LT2 | 419666160 | 2853045644 | 161888011 | ... |
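To get a rough sense of how different two genomes are, you can compare their allelic profiles locus by locus. The sketch below computes the proportion of differing loci between two profiles; the exact CSV layout (empty first header cell, genome names in the first column) is an assumption based on the description above, and the genome_B row is hypothetical data added for illustration.

```python
import csv
import io

# LT2 values come from the truncated example above; genome_B is hypothetical.
profiles_csv = (
    ",NC_003198.1_3005,NC_006905.1_2841,NC_011149.1_467\n"
    "LT2,419666160,2853045644,161888011\n"
    "genome_B,419666160,1111111111,161888011\n"
)


def load_profiles(handle):
    """Map genome name -> {marker name: allele designation}."""
    reader = csv.reader(handle)
    markers = next(reader)[1:]
    return {row[0]: dict(zip(markers, row[1:])) for row in reader if row}


def profile_distance(a, b):
    """Proportion of loci with different alleles, ignoring loci missing in either."""
    shared = [m for m in a if m in b and a[m] and b[m]]
    if not shared:
        return None
    diffs = sum(1 for m in shared if a[m] != b[m])
    return diffs / len(shared)


profiles = load_profiles(io.StringIO(profiles_csv))
print(profile_distance(profiles["LT2"], profiles["genome_B"]))
```

A pairwise matrix of such distances is the kind of input that tree-building tools work from, although dedicated tools handle missing data and clustering more carefully.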
If you are running sistr_cmd with the --qc commandline argument, sistr_cmd will run some basic QC to determine the level of confidence in the serovar prediction.
The qc_status field should contain a value of PASS if your genome passes all QC checks, otherwise, it will be WARNING or FAIL if there are issues with your results and/or input genome sequence.
The qc_messages field contains information about why you may have a low confidence serovar prediction result. Individual QC messages are delimited by the `` | `` symbol.
For example, here are the QC messages for an unusually small Salmonella assembly where the predicted serovar was "-:-:-":
FAIL: Large number of cgMLST330 loci missing (n=272 > 30)
FAIL: Wzx/Wzy genes missing. Cannot determine O-antigen group/serogroup. Cannot accurately predict serovar from antigen genes.
WARNING: H1 antigen gene (fliC) missing. Cannot determine H1 antigen. Cannot accurately predict serovar from antigen genes.
WARNING: Input genome size (699860 bp) not within expected range of 4000000-6000000 (bp) for Salmonella
WARNING: Only matched 57 cgMLST330 loci. Min threshold for confident serovar prediction from cgMLST is 297.0
The QC messages produced by sistr_cmd should help you understand your serovar prediction results.
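When processing many genomes, it can be convenient to split the qc_messages field into individual messages and group them by severity. The sketch below assumes each message starts with a level followed by ": " (as in the FAIL and WARNING examples above) and that messages are joined with " | "; the function name is hypothetical.

```python
def parse_qc_messages(qc_messages):
    """Group QC messages by their leading level (e.g. FAIL, WARNING)."""
    grouped = {}
    for message in qc_messages.split(" | "):
        message = message.strip()
        if not message:
            continue
        # Assumes the "LEVEL: text" layout seen in the examples above.
        level, _, text = message.partition(": ")
        grouped.setdefault(level, []).append(text)
    return grouped


# Two of the example messages above, joined with the documented delimiter.
example = (
    "FAIL: Large number of cgMLST330 loci missing (n=272 > 30)"
    " | "
    "WARNING: Input genome size (699860 bp) not within expected range "
    "of 4000000-6000000 (bp) for Salmonella"
)
for level, texts in parse_qc_messages(example).items():
    print(level, len(texts), texts)
```

An empty qc_messages value, as in the PASS example earlier, yields an empty dictionary.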
The galaxy folder contains Galaxy SISTR workflows that can be readily imported into an existing Galaxy server instance to process WGS samples in large batches, starting from raw reads and finishing with serovar results.
- Galaxy-Workflow-Assembly-Serotyping-withReport-for-SISTR_v1.1.1+galaxy1-recipe.ga
  - Summary: Assembles genomes from raw reads, performs serotyping and generates an overall report
  - Uses tool dependencies: sistr 1.1.1+galaxy1, shovill 1.0.4+galaxy1 and tp_cat 0.1.0
If you encounter any problems or have any questions, feel free to create an issue (anonymously or not) to let us know so we can address it!
Feature requests and pull requests are welcome!
Do you have any Salmonella genomes with trustworthy serovar info? Would you like SISTR to provide better serovar predictions? You can help by contributing those genomes along with their serovar info!
SISTR relies on a database of cgMLST allelic profiles from Salmonella genomes with validated serovar info to make accurate serovar predictions (since antigenic determinations from a handful of genes like wzx or fliC can only get you so far). So the more genomes there are in the SISTR database, the more accurate the serovar predictions, especially if those genomes belong to uncommon or rare serovars or lineages.
Help us improve SISTR serovar predictions! Contribute Salmonella genomes to SISTR!
You can contribute by:
- letting us know here: #15
- linking to your genome on NCBI SRA/BioSample/Assembly
- contacting the authors of SISTR
Getting started
git clone https://github.com/phac-nml/sistr_cmd.git
cd sistr_cmd/
export PYTHONPATH=$(pwd)
# run tests
py.test tests/
Pull requests for feature additions and bug fixes are welcome!
Want to use sistr_cmd directly in your Python application?
Install sistr_cmd using pip or Conda.
You can run SISTR serovar predictions like so:
from sistr.sistr_cmd import sistr_predict

# path to the input genome assembly and the name to report for it (placeholders)
genome_fasta_path = '/path/to/LT2.fasta'
genome_name = 'LT2'

# create mock commandline arguments class
class SistrCmdMockArgs:
    run_mash = True
    no_cgmlst = False
    qc = True
    use_full_cgmlst_db = False

# run SISTR serovar prediction
sistr_results, allele_results = sistr_predict(genome_fasta_path, genome_name, keep_tmp=False, tmp_dir='/tmp/sistr_cmd', args=SistrCmdMockArgs)
# use sistr_cmd results for something
Copyright 2017 Public Health Agency of Canada
Distributed under the Apache 2.0 license.