ADRSM (Ancient DNA Read Simulator for Metagenomics) is a tool designed to simulate the paired-end sequencing of a metagenomic community. ADRSM allows you to control precisely the amount of DNA from each organism in the community, which can be used to benchmark different metagenomics methods.
conda install -c maxibor adrsm
adrsm -d ./data/genomes ./data/short_genome_list.csv
metagenome.{1,2}.fastq: Simulated paired-end reads
stats.csv: Statistics of simulated metagenome (organism, percentage of organism's DNA in metagenome)
$ adrsm --help
usage: ADRSM v0.6 [-h] [-d DIRECTORY] [-r READLENGTH] [-n NBINOM]
[-fwd FWDADAPT] [-rev REVADAPT] [-e ERROR] [-p GEOM_P]
[-m MIN] [-M MAX] [-o OUTPUT] [-q QUALITY] [-s STATS]
[-se SEED] [-t THREADS]
confFile
Ancient DNA Read Simulator for Metagenomics
positional arguments:
confFile path to configuration file
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY path to genome directory. Default = .
-r READLENGTH Average read length. Default = 76
-n NBINOM n parameter for Negative Binomial insert length distribution.
Default = 8
-fwd FWDADAPT Forward adaptor. Default = AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
NNNNNNATCTCGTATGCCGTCTTCTGCTTG
-rev REVADAPT Reverse adaptor. Default =
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
-e ERROR Illumina sequencing error. Default = 0.01
-p GEOM_P Geometric distribution parameter for deamination. Default =
0.5
-m MIN Deamination substitution base frequency. Default = 0.001
-M MAX Deamination substitution max frequency. Default = 0.3
-o OUTPUT Output file basename. Default = ./metagenome.*
-q QUALITY Base quality encoding. Default = d (PHRED+64)
-s STATS Statistic file. Default = stats.csv
-se SEED Seed for random generator. Default = 7357
-t THREADS Number of threads for parallel processing. Default = 2
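When several simulations are scripted, the options listed above can be assembled programmatically. The following is a minimal sketch and not part of ADRSM: `build_adrsm_command` and the `FLAGS` mapping are hypothetical names, and only a handful of the flags from the help text are covered. It builds the argument list without running anything.

```python
# Hypothetical helper: maps keyword names (our own choice) to flags
# taken from the adrsm help text above.
FLAGS = {
    "directory": "-d",
    "read_length": "-r",
    "output": "-o",
    "stats": "-s",
    "seed": "-se",
    "threads": "-t",
}


def build_adrsm_command(conf_file, **options):
    """Return the argument list for an adrsm run.

    Only options passed explicitly are added, so adrsm's own defaults
    apply to everything else.
    """
    cmd = ["adrsm"]
    for name, value in options.items():
        if name not in FLAGS:
            raise ValueError(f"unsupported option: {name}")
        cmd += [FLAGS[name], str(value)]
    cmd.append(str(conf_file))  # the positional confFile goes last
    return cmd


cmd = build_adrsm_command(
    "./data/short_genome_list.csv",
    directory="./data/genomes",
    threads=4,
    seed=42,
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, assuming `adrsm` is installed and on the `PATH`.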
Each genome FASTA file must be named after its organism (see data/genomes for an example).
The configuration file is a .csv file with one line per genome, giving the genome file name, the mean insert size, the expected genome coverage, and whether deamination is simulated.
Example short_genome_list.csv:
| genome (mandatory) | insert_size (mandatory) | coverage (mandatory) | deamination (mandatory) |
|---|---|---|---|
| Agrobacterium_tumefaciens.fa | 47 | 0.1 | yes |
| Bacillus_anthracis.fa | 48 | 0.2 | no |
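A configuration file of this shape can be sanity-checked before a simulation is launched. The sketch below is illustrative only: `parse_config` is a hypothetical helper, and it assumes a comma-separated file with one header row and the four columns in the order shown above. The header text used in the example is a placeholder, since the columns are read by position, and the `yes`/`no` values for deamination are taken from the example table.

```python
import csv
import io


def parse_config(text):
    """Parse configuration text into a list of dicts (columns read by position)."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    entries = []
    for lineno, row in enumerate(reader, start=2):
        if not row or all(not cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(row)}")
        genome, insert_size, coverage, deamination = (c.strip() for c in row)
        flag = deamination.lower()
        if flag not in ("yes", "no"):
            raise ValueError(f"line {lineno}: deamination must be yes or no")
        entries.append({
            "genome": genome,
            "insert_size": float(insert_size),
            "coverage": float(coverage),
            "deamination": flag == "yes",
        })
    return entries


# Placeholder header; the two data rows come from the example table.
example = """genome,insert_size,coverage,deamination
Agrobacterium_tumefaciens.fa,47,0.1,yes
Bacillus_anthracis.fa,48,0.2,no
"""

for entry in parse_config(example):
    print(entry)
```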
Because of sequencing errors and the random choice of inserts, the coverage actually obtained may differ slightly from the target coverage (fig 1).
Figure 1: Coverage plot for simulated sequencing of Elephas maximus mitochondria. Aligned with Bowtie2 (default parameters). Read-length = 76, insert-length = 200.
The deamination is modeled using a geometric distribution. With the default parameters, the substitution frequency is depicted in fig 2:
Figure 2: Substitution frequency.
For each nucleotide, a random number Pu is sampled from a uniform distribution (of support [0, 1]) and compared to the corresponding value Pg of the rescaled geometric PMF at this nucleotide.
If Pg >= Pu, the base is substituted (fig 3).
Figure 3: Substitutions distribution along a DNA insert, with default parameters.
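The sampling rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ADRSM's actual code: the function names are hypothetical, the geometric PMF is indexed by position counted from one end of the insert, and "rescaled" is assumed to mean a linear rescaling so that the largest PMF value maps to the `-M` maximum frequency and the smallest to the `-m` base frequency. The sketch only marks which positions would be substituted; which base replaces which is left out.

```python
import random


def rescaled_geometric_pmf(length, p=0.5, fmin=0.001, fmax=0.3):
    """Geometric PMF over positions 1..length, linearly rescaled to [fmin, fmax].

    Assumption: the rescaling is linear (min-max); the tool may differ.
    """
    pmf = [p * (1 - p) ** k for k in range(length)]  # P(X = k + 1)
    lo, hi = min(pmf), max(pmf)
    if hi == lo:  # single position or flat PMF: nothing to rescale
        return [fmax] * length
    return [fmin + (fmax - fmin) * ((v - lo) / (hi - lo)) for v in pmf]


def substitution_mask(length, p=0.5, fmin=0.001, fmax=0.3, rng=None):
    """Return one boolean per position: True where Pg >= Pu."""
    rng = rng or random.Random(7357)  # 7357 is the default seed in the help text
    pg = rescaled_geometric_pmf(length, p, fmin, fmax)
    mask = []
    for i in range(length):
        pu = rng.random()  # uniform draw in [0, 1)
        mask.append(pg[i] >= pu)
    return mask


# Empirical substitution frequency per position over many simulated inserts.
rng = random.Random(7357)
n_inserts, length = 10000, 10
counts = [0] * length
for _ in range(n_inserts):
    for i, hit in enumerate(substitution_mask(length, rng=rng)):
        counts[i] += hit
print([round(c / n_inserts, 3) for c in counts])
```

With these assumptions the empirical frequencies should start near the maximum frequency at the first position and decay towards the base frequency along the insert.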