Lightweight python package to map transposon insertion sequencing (INSeq) data in bacteria.
This section is meant for users who know their way around terminal and conda. To use pyinseq,
create a virtual environment with python 3.7 and install pyinseq using conda.
$ conda install -n base -c conda-forge mamba
$ conda create -n pyinseq-py37 python=3.7
$ conda activate pyinseq-py37
(pyinseq-py37) $ conda install -c bioconda bowtie
(pyinseq-py37) $ pip install pyinseqVerify your installation with --test
(pyinseq-py37) $ pyinseq --testNow you can run pyinseq!
(pyinseq-py37) $ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>- Overview of Command Line Operation
- Output description
- Notes on output
- Background
- User guide
- Installation
- FAQ
- License
Basic operation and a short description of the files are listed here. Below are detailed descriptions and links to example input files.
$ pyinseq -i <input file> -s <sample file> -g <genbank file> -e <experiment name>-i / --input
- Illumina reads file in FASTQ or gzipped-FASTQ format.
-s / --samples
- Sample list where each line is the
sample_name<tab>barcode (4bp).
-g / --genome
- Concatenated GenBank File for all contigs for the genome.
-e / --experiment
- All results files will be created in a subfolder with this name.
--get_default_config
- Creates a default configuration file for running
pyinseq
--config_format
- File format for configuration file (
yamlorjson)
-c/ --config
- Configuration file for running a
pyinseqworkflow. Every other argument will be ignored.
--test
- Runs pytest on installed
pyinseqsoftware.
--gff
- Generate genome output file in gff3 format.
-d / --disruption
- Five-prime fraction of gene (
0.0-1.0) that must be disrupted for the hit to be counted in the summary_gene_table. Often insertions at the 3' end of a gene do not disrupt function so it may be of interest to run the pipeline with a disruption value of0.8or0.9. [default: 0.9]
--min_count
- Minimum number of reads per insertion site. [default: 3]
--max_ratio
- Maximum ratio of left:right or right:left reads per insertion site. [default: 10]
--barcode_length
- Length of barcode index that is expected in samples. [default: 4]
--transposon_seq
- DNA sequence of transposon that flanks reads. [default: 'ACAGGTTG']
-t / --threads
- Number of cores to use for execution [default: the CPU count of the computer]
--snakemake_params
- Additional parameters that will get passed to
snakemake.
| File | Description |
|---|---|
<experiment name>-config.yml |
Configuration file with run parameters |
results/summary_gene_table.txt |
summary for entire experiment (values in counts-per-million, cpm) |
results/<sample>_sites.txt (for each sample) |
Counts of each insertion in each sample |
results/<sample>_genes.txt (for each sample) |
Counts of each insertion mapped to genes |
results/<sample>_bowtie.txt (for each sample) |
Bowtie mapping results |
results/<sample>_trimmed.fastq (for each sample) |
Demultiplexed fastq reads trimmed for the chromosome sequence only |
results/log.txt |
text printed to console |
results/samples_info_yml |
basics stats for each sample |
results/genome_lookup/genome.fna |
genome fasta nucleotide file |
results/genome_lookup/genome.ftt |
genome feature table |
| bowtie indexes | index files created from genome by bowtie |
results/raw_data/<sample>.fastq (for each sample) |
demultiplexed files for each sample/barcode |
results/raw_data/_other.fastq |
demultiplexed files for unrecognized barcodes |
T50
- The minimum number of transposon insertion sites in the sample that account for at least 50% of the samples's reads. Used as a crude measure to detect bottlenecks, when comparing output T50 to the library input T50, or when comparing biological or technical replicates.
samples_info.yml
- Contains basic information of each sample in the experiment.
log.txt
- Log file from pyinseq, including the messages from the
snakemakeexecution. This includes the order of steps taken during the pyinseq execution.
<experiment name>-config.yaml
- Configuration file with run parameters.
pyinseq was inspired by the software published by Goodman et al. (2011). There are a number of differences, some of which are noted here. Most are fairly superficial in that they are intended to increase automation and reproducibility but do not materially affect the results. One exception is the first point, which does affect the resulting output.
-
As was conducted in Brooks et al. (2014) transposon site data are normalized across all replicons (chromosomes/plasmids/contigs) for calculation of a counts-per-million (cpm) per sample.
-
The user's file of reads is demultiplexed into separate .fastq files for each barcoded read.
-
The user provides a GenBank file, and
pyinseqgenerates a fasta nucleotide file (.fna) and gene feature table (.ftt) that are formatted and named properly forbowtie. -
The gene feature table (.ftt) is comparable to the protein feature table (.ptt) but it also includes RNA genes.
-
If
pyinseqis installed fromconda, thenbowtieis installed fromcondaas a dependency. -
At the end of the analysis, results are aggregated into a tab-delimited table and sample info is summarized.
-
pyinseqis written primarily in Python and usessnakemakeas the workflow manager. You probably figured that out already.
Illumina sequencing reads [-i] for the experiment. File can be uncompressed (.fastq or .fq) or gzip-compressed (.fastq.gz or .fq.gz).
@DGL9ZZQ1:720:C6YD0ACXX:2:1101:1246:2185 1:N:0:
GAAGCGACACCGACGACGGTAACAGGTTGGATGATAAGTCCCCGGTCTTCG
+
CCCFFFDDHHHHHCGHHHIJDHIIJJFHHJIJJJJIJJDHIJJHIAGIJJJ
...Sample file [-s] describing the sample names and barcodes. Sample names should be restricted to letters, numbers, dash (-), and underscore (_), with a tab between the sample and the barcode in each row of a text file. It is recommended that the file be prepared in a text editor to ensure that additional hidden characters are not introduced. BBEdit is one option for Mac.
Microsoft Excel can export tab-delimited files (.tsv), but do not use Microsoft Word for this purpose.
E001_01 GAAG
E001_02 CTTTGenBank file [-g] listing the features and DNA sequence for the organism. If the organism has multiple chromosomes/contigs in the sequence the file they should be concatenated into a single file. Ensure that the double slash // at the end of the file remains to separate each contig.
Files from NCBI GenBank often work where the corresponding files from NCBI RefSeq do not. Feel free to contact us with any questions here.
LOCUS CP000020 2897536 bp DNA circular BCT 02-APR-2008
DEFINITION Vibrio fischeri ES114 chromosome I, complete sequence.
ACCESSION CP000020
VERSION CP000020.2 GI:171902228
KEYWORDS .
SOURCE Vibrio fischeri ES114
...
FEATURES Location/Qualifiers
source 1..2897536
/organism="Vibrio fischeri ES114"
/mol_type="genomic DNA"
/strain="ES114"
/db_xref="taxon:312309"
/chromosome="I"
gene complement(313..747)
/gene="mioC"
/locus_tag="VF_0001"
/old_locus_tag="VF0001"
CDS complement(313..747)
/dnas_title="FMN-binding protein MioC"
/gene="mioC"
/locus_tag="VF_0001"
/old_locus_tag="VF0001"
/codon_start=1
/transl_table=11
/product="FMN-binding protein MioC"
/protein_id="AAW84496.1"
/db_xref="GI:59478709"
/translation="MKKVSIITGSTLGGAEYVGDHLADLLEEMDFSTDIHNQPNLDDI
DIDSLWLLVVSTHGAGDYPDNIKPFIQQLESVTQPLSSVEFAVVAIGDSSYDTFCAAG
KSLQNTLKEHGAIEKYPLLEIDVTQNSIPEEPAELWLKQHIC"
...
ORIGIN
1 aagatcactt aatatatata agatctttta aagagatctt ttattagatc tattatatag
61 atcgtcgatc tctgtggata agtgataaat gatcaatagg atcatatact ttagatggat
121 ccaaagttgt tatctttctt tgatcttcga tcggacagct tgaggacaaa agagttagtt
181 atccacaagg ggggagggcg ttagatctta ttcaatggat aactataact tgatcactgg
241 atctttctat agttatccac atagtaggta tcatctattt aataactttt atagatcgga
301 caacactttt tattaacaaa tgtgttgttt tagccacaat tctgctggtt cttcagggat
361 actattttga gttacatcta tctctaatag agggtacttt tcgatagcgc catgctcttt
421 taaggtattt tgaagtgatt ttcctgctgc gcagaaagtg tcataacttg aatcaccgat
481 agcaacaaca gcaaattcaa cgctcgataa tggctgagtc acgctttcta actgttgaat
541 aaatggttta atgttgtcag ggtaatcacc agcaccgtga gttgatacaa cgagtaacca
601 taagctatca atatcaatat catctaaatt tggttgatta tgaatatcgg tggaaaaatc
661 catttcttct aataaatcag caagatggtc accaacatac tcagcaccac ctagagtgct
721 tcctgtaata atagatactt ttttcatgaa tttatcctat aaaaatataa aaaatgggcc
781 tacataggcc cattattaat cttattaata ttggttttat ttaccaatac agaatgaagt
...
//
LOCUS CP000021 1330333 bp DNA circular BCT 02-APR-2008
DEFINITION Vibrio fischeri ES114 chromosome II, complete sequence.
...Pyinseq uses snakemake to execute the following workflow:
pyinseq
- Demultiplex a file of FASTQ reads.
- Write separate trimmed versions of the files (no barcodes, no transposon sequence).
- Map the trimmed reads to the genome.
- Quantify insertions per site and per ORF in the genome.
- Output summary gene table and report files describing the dataset.
The benefit of using snakemake is that it allows for parallel execution if more than one thread is provided.
The modularity of snakemake enables future feature implementation.
pyinseq writes a configuration file (<experiment name>-config.yaml) into the working directory, which holds the parameters for the run.
For the example command below, the pyinseq-config.yaml file will store the parameters used for this run.
$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e pyinseq --threads 4$ cat demo-run-config.yaml
snakemake_params: []
barcode_length: 4
command: pyinseq
config: false
config_format: yaml
disruption: 0.9
experiment: demo-run
genome: genome.gbk
get_default_config: false
gff3: false
input: reads.fastq.gz
max_ratio: 10
min_count: 3
samples: samples.txt
threads: 4
transposon_seq: ACAGGTTGIf you installed conda in your terminal, you can use the option --use-conda which will install bowtie during the executuion of the workflow.
$ pyinseq -i reads.fastq -s samples.txt -g genome.gbk -e demo-run --threads 4 --snakemake_params --use-conda You can also get a default configuration file by using --get_default_config and modify it using a text-editor.
$ pyinseq --get_default_config
$ ls
default-config-pyinseq.yamlMake sure that all file paths in the configuration are correct
These commands are useful when combining samples from multiple Illumina runs in a single pyinseq analysis.
Both commands will also create a configuration file specific to each subcommand.
pyinseq demultiplex
- Demultiplex a file of Illumina reads.
- Writes separate trimmed versions of the files (no barcode, no transposon sequence) unless the optional
--notrimflag is added.
$ pyinseq demultiplex -i <input file> -s <sample file> -e <experiment name>pyinseq genomeprep
- Prepare the fasta nucleotide (.fna) and feature table (.ftt) files from a GenBank file.
- Is a good quick test to run on new GenBank files.
- Generates bowtie indexes unless the
--noindexflag is added. - Optionally create GFF3 file with the
-gffflag (for use in other programs).
$ pyinseq genomeprep -g <genbank file> -e <experiment name>Similar to
pyinseqmain usage, you can run these commands using a configuration file and pass additionalsnakemakeoptions.
pyinseq was written and tested using a MacOS (or Linux-based) operating system.
- MacOS or Linux-based operating systems.
- Python 3.7
- bowtie (v1.0.0)
- snakemake
pyinseqhas not being tested on Windows operating systems, but as of Windows 10 there is support for terminals with Ubuntu
Also note that
pyinsequsesbowtie; not bowtie2, which is a different software.
Conda is a command-line package manager that can create virtual environments with
all the necessary dependencies to run pyinseq. You can acquire conda by installing
Anaconda and all the packages it
brings, or by installing its lightweight version called
miniconda.
We recommend miniconda since it installs a minimal number packages,
and is actually all you need to run pyinseq
Once conda is installed, you can verify it by running:
$ conda --helpmamba is now required for snakemake. Install it system-wide with this command. The subsequent steps are shown using conda since that is more common for the
scientific audience, but can largely be accomplished in mamba if the user desires.
conda install -n base -c conda-forge mamba
A virtual environment is an isolated computational space where you can install dependencies and software without affecting the base operating system's configuration.
We can use conda to create a virtual environment with python 3.7
$ conda create -n pyinseq-py37 python=3.7To activate your environment:
$ conda activate pyinseq-py37You should see the name of your environment surrounded by parentheses in your terminal prompt.
(pyinseq-py37) $ (pyinseq-py37) $ conda install -c bioconda bowtieInstall the stable version:
(pyinseq-py37) $ pip install pyinseqOr, install the most recent version from GitHub:
(pyinseq-py37) $ pip install git+https://github.com/mjmlab/pyinseq
Verify your installation with --test
(pyinseq-py37) $ pyinseq --test
Verify that pyinseq installed correctly by running:
```bash
(pyinseq-py37) $ pyinseq --help
2021-05-26 13:10 - INFO - pyinseq - Process command line arguments
usage: pyinseq [-h] [--get_default_config] [--config_format CONFIG_FORMAT]
[-c CONFIG] [-t THREADS] [--snakemake_params ...] [-v]
[-i INPUT] [-s SAMPLES] [-e EXPERIMENT] [-g GENOME]
[-d DISRUPTION] [--min_count MIN_COUNT] [--max_ratio MAX_RATIO]
[--barcode_length BARCODE_LENGTH]
[--transposon_seq TRANSPOSON_SEQ] [--gff3]
{demultiplex,genomeprep} ...
...Now you are ready to run pyinseq!
You can test your installation of pyinseq by using the option --test.
(pyinseq-py37) $ pyinseq --testIf all tests pass then you are good to go!
Ensure that the file is in GenBank and not RefSeq format.
You can do this two ways:
- uninstalling pyinseq from the virtual environment
pip uninstall pyinseq
- or completely remove the conda virtual environment.
conda env remove -n pyinseq-py37
Note: will need to conda deactivate first to leave the environment.
Please use the GitHub Issues.
Pyinseq is an open-source software licensed under BSD-3.