Extensive unexplored goat microbiome diversity revealed by 4075 microbial genomes from metagenomes spanning gastrointestinal tract, age, feeding style and geography
This REPO contains in-house scripts (R, Python), data and detailed instructions that allow users to reproduce many of the analyses performed for our manuscript titled "Extensive unexplored goat microbiome diversity revealed by 4075 microbial genomes from metagenomes spanning gastrointestinal tract, age, feeding style and geography".
If further assistance is required, please do not hesitate to contact me by raising an issue in the "Issues" section of this REPO.
Below is a list of the software and databases required before running our test data. Most of the software can be installed through Conda.
A list of the required software and the URLs for downloading it. Please follow the installation instructions provided at each URL. The versions in parentheses indicate the ones we used for our project.
Note: Please make sure all the software is in your $PATH.
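As a quick way to verify the $PATH requirement, the following minimal Python sketch reports which executables cannot be found. The tool list is an assumption: it contains only the executables of tools named elsewhere in this README (`bwa`, `samtools`, `salmon`, `prokka`, `cd-hit`) and should be extended to match your own installation.

```python
import shutil

# Assumed list: only tools named in this README; extend it as needed.
REQUIRED_TOOLS = ["bwa", "samtools", "salmon", "prokka", "cd-hit"]


def find_missing(tools):
    """Return the tools that cannot be found on the current $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on $PATH.")
```

If a tool is reported as missing, activate the Conda environment in which it was installed and run the check again.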
A list of databases and their URLs for downloads:
| Database | Description | Availability |
|---|---|---|
| Goat genome | Host genome sequences | ARS1 |
| Glycine max | Host foodborne genome sequences | Gmax ZH13 v2.0 |
| Zea mays | Host foodborne genome sequences | B73 |
| Medicago truncatula | Host foodborne genome sequences | MtrunA17r5.0-ANR |
| EggNOG | EggNOG annotation | http://eggnog5.embl.de/#/app/downloads |
| dbCAN | CAZymes annotation | http://bcb.unl.edu/dbCAN2/download/ |
| GTDB-Tk | GTDB-Tk database | https://gtdb.ecogenomic.org/downloads |
Note: The versions listed are the ones we used for our project and might not be the latest.
The example workflow is divided into two folders: 'Pipeline' and 'Scripts'. The 'Pipeline' folder contains details of the public software and the parameters used for our project, while the 'Scripts' folder contains in-house scripts for further data analysis and visualisation. See below for more information.
Pre-processing of the raw sequencing data in FASTQ format, including quality control (removal of low-quality and adaptor sequences) and removal of host-genome and food-derived contaminant sequences.
Metagenomic assembly and binning were divided into individual assembly of each sample and co-assembly of all samples; in the co-assembly step, samples may be divided into groups according to their origin (e.g., body site). The assembled contigs were then merged for binning.
High-quality bins were identified to obtain MAGs. Taxonomic annotation was then performed for all MAGs to determine their taxonomic identities and phylogenetic relationships.
MAGs were processed with tools including PROKKA to identify protein-coding and non-coding genes. Protein-coding genes were clustered using CD-HIT to generate a non-redundant catalog, which was used as input to BLAST against several public databases for functional annotation.
To calculate the coverage of each MAG in each sample, clean reads of each sample were mapped to the 4075 MAGs using BWA-MEM with default parameters. After converting the resulting SAM files to BAM format using Samtools, an in-house Perl script was used to calculate the coverage of each MAG, which is defined as the total bases mapped to a MAG in a sample divided by the length of the MAG. The relative abundance was then calculated from the coverage.
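The coverage definition above can be illustrated with a short Python sketch. This is not the in-house Perl script: the function names are hypothetical, the numbers are toy values, and the normalisation used for relative abundance (each MAG's coverage divided by the summed coverage of all MAGs in the sample) is an assumption, since the exact formula is not stated here.

```python
def mag_coverage(mapped_bases, mag_length):
    """Coverage = total bases mapped to a MAG in a sample / length of the MAG."""
    if mag_length <= 0:
        raise ValueError("MAG length must be positive")
    return mapped_bases / mag_length


def relative_abundance(coverages):
    """Assumed normalisation: coverage of each MAG / summed coverage in the sample."""
    total = sum(coverages.values())
    if total == 0:
        return {mag: 0.0 for mag in coverages}
    return {mag: cov / total for mag, cov in coverages.items()}


# Toy numbers for one sample, for illustration only (not from the study).
mapped = {"MAG_A": 6_000_000, "MAG_B": 2_000_000}   # bases mapped to each MAG
lengths = {"MAG_A": 2_000_000, "MAG_B": 2_000_000}  # MAG lengths in bp

cov = {mag: mag_coverage(mapped[mag], lengths[mag]) for mag in mapped}
abund = relative_abundance(cov)
print(cov)    # {'MAG_A': 3.0, 'MAG_B': 1.0}
print(abund)  # {'MAG_A': 0.75, 'MAG_B': 0.25}
```

In practice, the mapped bases per MAG would be derived from the BAM files, and the same calculation would be repeated for every sample.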
Salmon was used to estimate the coverage of non-redundant genes.
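Salmon writes its per-sample estimates to a tab-separated `quant.sf` file with the columns `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads`. The sketch below shows one way such a file might be read for downstream analysis. Which column was used as the gene coverage measure in our analysis is not specified here, so the choice of `TPM` is an assumption, and the gene names and values are made up for illustration.

```python
import csv
import io


def read_quant_tpm(handle):
    """Read a Salmon quant.sf table and return {gene name: TPM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


# Toy quant.sf content (made-up gene names and values).
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "gene_1\t900\t750.0\t250000.0\t30\n"
    "gene_2\t1200\t1050.0\t750000.0\t126\n"
)

tpm = read_quant_tpm(example)
print(tpm)  # {'gene_1': 250000.0, 'gene_2': 750000.0}
```

For a real run, pass an open file handle for the sample's `quant.sf` instead of the in-memory example.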
All analyzed data were loaded into Perl, R and Python for further analysis and visualisation using several in-house scripts. The in-house scripts are available in the folder 'Scripts'.
This workflow was designed specifically for "Extensive unexplored goat microbiome diversity revealed by 4075 microbial genomes from metagenomes spanning gastrointestinal tract, age, feeding style and geography"; edits and revisions might be required before applying it to other projects.
Note: This project was carried out jointly by Feng Tong, Teng Wang and Na L. Gao.