Promotheus: a bioinformatic pipeline for the analysis of bacterial transcriptomic data deposited on NCBI databases

Description

Promotheus is a data-driven pipeline designed to retrieve, process, and integrate publicly available transcriptomic data from bacterial species and their strains. Its main goal is to support the selection of promoters that exhibit either stable (constitutive) or variable (inducible) expression profiles, facilitating biosensor design in synthetic biology applications. Promotheus performs data retrieval from public repositories, raw data pre-processing, normalization, and batch effect correction. Gene IDs harmonization is performed through orthogroup identification, allowing a multi-strain and/or multi-species comparisons.

To learn more about it, read the manuscript at XXX.

Pipeline Overview

Installation

Promotheus is deployed via Docker. If you do not already have Docker installed, follow the instructions for docker installation at https://docs.docker.com/

Pulling the Docker Image

Once Docker is installed, you can pull the Promotheus image from Docker Hub by running:

docker pull ste40/promotheus

Parameters

Environment Variable	Default	Description	Value	Required?
`THREADS`	8	Number of threads for parallel processing.	Numeric	No
`MEMORY`	4000	Maximum memory allocation (MB) for Rockhopper execution.	Numeric	No
`CORRECTION_DE`	none	Correction method for multiple testing. See p.adjust in R documentation for available options.	Character	No
`PVALUE`	0.05	Significance threshold for identifying differentially expressed genes.	Numeric	No
`OPERONS_DETECTION`	TRUE	Enable or disable operons detection.	Logical	No
`AMINO_DIFF`	3	Defines the orthology matching in terms of protein length differences (amino acids) during orthogoup identification.	Numeric	No
`SIMILARITY`	30	Minimum percentage of amino acid sequence similarity required to group genes into the same orthogroup for ID harmonization	Numeric	No
`ASMcode`	NULL	Assembly accession code for orthofinder mapping.	Character	Yes (required)
`OPERONS_THRESHOLD`	3	Number of GSEs per microorganism used for operon identification.	Character	No
`MULTITAXA`	FALSE	Enable or disable taxon-specific normalization with ComBat.	Character	No

Important notes

ASMcode must be specified. If not provided, the script will terminate. This genome is not used for reads mapping, but rather for orthology-based mapping of GeneIDs across species, reported in the final output table.
If OPERONS_THRESHOLD is greater than the number of GSEs available for a given specie, it will be adjusted to match the number of available GSEs.
When MULTITAXA = TRUE, ComBat normalization is performed separately for each taxon, assigning the GSEs of each taxon to distinct batches.

Input files

The pipeline needs two separate input tables for RNA-Seq and Microarray datasets. Each table must contain the following mandatory columns:

Accession: Accession codes for various datasets (GSEs).
taxon: Microorganism names, based on NCBI taxonomy name.
reference_genome: Assembly accession codes for the reference genome used for reads alignment and gene counts quantification.

The examples of these two tables (RNA-Seq and Microarray) are provided in this repository, see Examples directory.

NOTE: It is reccomended to use our GEO_Search.R script to generate these two tables. See below for more information.

Generating input files with GEO_Search.R R script

We provide GEO_Search.R script to help user to find RNA-Seq and Microarray experiments on NCBI GEO Datasets based on a provided taxonomic ID (taxID).

Features:

Taxonomy Retrieval: Retrieves the scientific name and subspecies for a given taxID.
Genome Assembly: Fetches the reference genome assembly version for the organism.
GEO Datasets Search: Searches for RNA-Seq and Microarray datasets from GEO, filtering by dataset type and minimum sample count (4).
Export: Outputs filtered results to Excel files for RNA-Seq and Microarray experiments.

R (version 3.6 or higher recommended) must be installed in your system.

Required R packages:
- devtools
- rentrez
- openxlsx
- optparse

To install the required packages, run the following in R:

install.packages(c("devtools", "openxlsx", "optparse"))
devtools::install_github("ropensci/rentrez")

Then, the script can be run from command-line, specifying the taxID of interest:

Rscript GEO_Search.R --taxID <taxID>

NOTE: The script attempts to retrieve data from NCBI, so it may fail if the servers are down or experiencing issues.

Usage

The pipeline runs within a Docker container. You can set environment variables using the -e flag. Make sure to mount your input/output directories with the -v flag.

WARNING: the scripts provided in this repository can't be executed without the appropriate environment and softwares/packages, which are included in Docker's image.

Docker Run Example

Below is an example command to run the Promotheus pipeline. Adjust the paths and parameters as needed:

  -e THREADS=4 \
  -e MEMORY=4000 \
  -e CORRECTION_DE=BH \
  -e PVALUE=0.01 \
  -e OPERONS_DETECTION=TRUE \
  -e AMINO_DIFF=5 \
  -e MULTITAXA=TRUE \
  -e SIMILARITY=30 \
  -e ASMcode=ASM317683v1 \
  -e OPERONS_THRESHOLD=3 \
  -v /path/to/input/:/input \
  -v /path/to/output/:/output \
  ste40/promotheus

Output file

The output table contains includes gene expression values, percentiles, coefficient of variation (CV), and Differential expression analysis results.

Columns Description

Gene_name: The name of the gene.
GeneID: The unique identifier for the gene, according to the gene nomenclature of the user-provided ASMcode.
Expression_GSEXXXXX: Gene expression values for each GSE.
Mean_Expression: The average expression value of the gene across all samples in the datasets.
CV: Coefficient of variation, indicating the variability of the gene expression across samples, computed as the standard deviation of log-transformed expression values. A higher CV suggests more variation in gene expression.
Percentile_GSEXXXXX: Percentile values for the gene expression in different datasets.
Mean_Percentile: The average percentile value across all datasets for the gene.
Percentile_SD: Standard deviation of the percentile values across all datasets for the gene.
DE: Differential expression status (YES|NO), in at least one GSE.
Sequence: Contains the DNA sequence upstream of the gene.

An example of output table is provided in this repository (Output.xlsx)

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
Examples		Examples
Scripts		Scripts
GEO_Search.R		GEO_Search.R
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Promotheus: a bioinformatic pipeline for the analysis of bacterial transcriptomic data deposited on NCBI databases

Description

Pipeline Overview

Installation

Pulling the Docker Image

Parameters

Important notes

Input files

Generating input files with GEO_Search.R R script

Usage

Docker Run Example

Output file

Columns Description

About

Uh oh!

Releases

Packages

Languages

License

Ste40/Promotheus

Folders and files

Latest commit

History

Repository files navigation

Promotheus: a bioinformatic pipeline for the analysis of bacterial transcriptomic data deposited on NCBI databases

Description

Pipeline Overview

Installation

Pulling the Docker Image

Parameters

Important notes

Input files

Generating input files with GEO_Search.R R script

Usage

Docker Run Example

Output file

Columns Description

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages