Introduction

ADRSM (Ancient DNA Read Simulator for Metagenomics) is a tool designed to simulate the paired-end sequencing of a metagenomic community. ADRSM allows you to control precisely the amount of DNA from each organism in the community, which can be used, for example, to benchmark different metagenomics methods.

Dependencies

Conda

Installation

conda install -c maxibor adrsm

Usage

adrsm ./data/short_genome_list.csv

Output

metagenome.{1,2}.fastq : Simulated paired-end reads
stats.csv : Statistics of simulated metagenome (organism, percentage of organism's DNA in metagenome)

Cite

You can cite ADRSM like this:

Maxime Borry (2018). ADRSM: Ancient DNA Read Simulator for Metagenomics. DOI: 10.5281/zenodo.1462743

Help

$ adrsm --help
usage: ADRSM v0.9 [-h] [-r READLENGTH] [-n NBINOM] [-fwd FWDADAPT]
              [-rev REVADAPT] [-e ERROR] [-p GEOM_P] [-m MIN] [-M MAX]
              [-o OUTPUT] [-s STATS] [-se SEED] [-t THREADS]
              confFile

==================================================

ADRSM: Ancient DNA Read Simulator for Metagenomics

Author: Maxime Borry

Contact: <borry[at]shh.mpg.de>

Homepage & Documentation: github.com/maxibor/adrsm
==================================================


positional arguments:
confFile       path to configuration file

optional arguments:
-h, --help     show this help message and exit
-r READLENGTH  Average read length. Default = 76
-n NBINOM      n parameter for Negative Binomial insert length distribution.
                Default = 8
-fwd FWDADAPT  Forward adaptor. Default = AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                NNNNNNATCTCGTATGCCGTCTTCTGCTTG
-rev REVADAPT  Reverse adaptor. Default =
                AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
-e ERROR       Illumina sequecing error. Default = 0.01
-p GEOM_P      Geometric distribution parameter for deamination. Default =
                0.5
-m MIN         Deamination substitution base frequency. Default = 0.001
-M MAX         Deamination substitution max frequency. Default = 0.3
-o OUTPUT      Output file basename. Default = ./metagenome.*
-s STATS       Statistic file. Default = stats.csv
-se SEED       Seed for random generator. Default = 7357
-t THREADS     Number of threads for parallel processing. Default = 2

Configuration file (`confFile`)

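The configuration file is a .csv file with one line per genome. Each line gives the path to the genome FASTA file, the mean insert size, the expected genome coverage, and whether to simulate deamination (yes/no); the mutation rate and the age are optional. Example short_genome_list.csv: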

genome(mandatory)	insert_size(mandatory)	coverage(mandatory)	deamination(mandatory)	mutation_rate(optional)	age(optional)
./data/genomes/Agrobacterium_tumefaciens.fa	47	0.1	yes	10e-8	10000
./data/genomes/Bacillus_anthracis.fa	48	0.2	no
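
As a rough illustration, a file with this layout could be generated with Python's csv module. This is only a sketch: the comma delimiter and the exact header strings are assumptions taken from the example above, and `write_conf` is a hypothetical helper, not part of ADRSM.

```python
import csv
import sys

# Column names copied from the example above; whether ADRSM requires
# these exact header strings is an assumption.
FIELDS = [
    "genome(mandatory)",
    "insert_size(mandatory)",
    "coverage(mandatory)",
    "deamination(mandatory)",
    "mutation_rate(optional)",
    "age(optional)",
]


def write_conf(handle, genomes):
    """Write one row per genome; missing optional fields are left empty."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for row in genomes:
        # zip() stops at the shorter sequence, so rows without the
        # optional columns simply omit those keys (filled by restval).
        writer.writerow(dict(zip(FIELDS, row)))


if __name__ == "__main__":
    write_conf(sys.stdout, [
        ("./data/genomes/Agrobacterium_tumefaciens.fa", 47, 0.1, "yes", "10e-8", 10000),
        ("./data/genomes/Bacillus_anthracis.fa", 48, 0.2, "no"),
    ])
```

Redirecting the output to a file gives a configuration file that can then be passed to adrsm as `confFile`.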

Note on Coverage

Because of sequencing errors and the random choice of inserts, the real coverage might differ slightly from the target coverage (fig 1).

Figure 1: Coverage plot for simulated sequencing of Elephas maximus mitochondria. Aligned with Bowtie2 (default parameters). Read length = 76, insert length = 200.
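
To get a feel for why the per-position depth fluctuates around the target, here is a minimal sketch of random insert placement. It is not ADRSM's actual procedure: the fixed insert length, the uniform start positions, and the formula for the number of inserts (target coverage × genome length / insert length) are all assumptions made for illustration, and `simulate_depth` is a hypothetical name.

```python
import random


def simulate_depth(genome_length, insert_length, target_coverage, seed=7357):
    """Place inserts uniformly at random and return the depth at each position."""
    rng = random.Random(seed)
    # Assumed relation between target coverage and number of inserts.
    n_inserts = round(target_coverage * genome_length / insert_length)
    depth = [0] * genome_length
    for _ in range(n_inserts):
        start = rng.randrange(genome_length - insert_length + 1)
        for pos in range(start, start + insert_length):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    depth = simulate_depth(genome_length=16_000, insert_length=200,
                           target_coverage=10)
    print("mean depth:", sum(depth) / len(depth))
    print("min depth:", min(depth), "max depth:", max(depth))
```

In this toy model the mean depth matches the target, but individual positions (in particular those near the ends of a linear sequence) can sit well below or above it. Sequencing errors and alignment, which the sketch ignores, add further deviation.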

Note on Deamination simulation

The deamination is modeled using a geometric distribution. With the default parameters, the substitution frequency is depicted in fig 2:

One can try different parameters for deamination using this interactive plot: maxibor.github.io/adrsm

Figure 2: Substitution frequency.

For each nucleotide, a random number Pu is sampled from a uniform distribution (with support [0, 1]) and compared to the corresponding value Pg of the rescaled geometric PMF at this nucleotide.
If Pg >= Pu, the base is substituted (fig 3).

Figure 3: Substitutions distribution along a DNA insert, with default parameters.
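
The comparison of Pu and Pg described above can be sketched as follows. This is a hedged illustration, not ADRSM's code: the linear min-max rescaling of the PMF to [MIN, MAX], and the choice of C→T counted from the 5' end and G→A counted from the 3' end (a common convention for ancient DNA damage), are assumptions; the function names are hypothetical. The defaults mirror the -p, -m, -M and -se options.

```python
import random


def rescaled_geom_pmf(length, p=0.5, fmin=0.001, fmax=0.3):
    """Geometric PMF over positions 1..length, linearly rescaled to [fmin, fmax]."""
    pmf = [(1 - p) ** (k - 1) * p for k in range(1, length + 1)]
    lo, hi = min(pmf), max(pmf)
    if hi == lo:  # e.g. a single-base insert: nothing to rescale
        return [fmax] * length
    return [fmin + (v - lo) * (fmax - fmin) / (hi - lo) for v in pmf]


def deaminate(insert, p=0.5, fmin=0.001, fmax=0.3, rng=None):
    """Substitute a base whenever Pg >= Pu, with Pu drawn uniformly from [0, 1)."""
    rng = rng or random.Random(7357)
    pg = rescaled_geom_pmf(len(insert), p, fmin, fmax)
    bases = list(insert)
    last = len(bases) - 1
    for i, base in enumerate(bases):
        # Assumption: C->T uses the distance from the 5' end,
        # G->A uses the distance from the 3' end.
        if base == "C" and pg[i] >= rng.random():
            bases[i] = "T"
        elif base == "G" and pg[last - i] >= rng.random():
            bases[i] = "A"
    return "".join(bases)


if __name__ == "__main__":
    print([round(v, 4) for v in rescaled_geom_pmf(8)])
    print(deaminate("CCCCACGTACGTACGTGGGG"))
```

With this rescaling, the first position gets the maximum frequency and the frequency decays towards the base frequency further into the insert.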

Note on sequencing error

ADRSM can simulate Illumina sequencing error with a uniform-based model.
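
One possible reading of a uniform error model is sketched below: every base is miscalled independently with the same probability (the -e option, default 0.01) and replaced by one of the three other bases chosen uniformly. Both the interpretation and the helper name `add_uniform_errors` are assumptions, not a description of ADRSM's internals.

```python
import random


def add_uniform_errors(read, error_rate=0.01, rng=None):
    """Miscall each A/C/G/T base independently with probability error_rate."""
    rng = rng or random.Random(7357)
    out = []
    for base in read:
        if base in "ACGT" and rng.random() < error_rate:
            # Pick uniformly among the three other bases.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    read = "ACGT" * 19
    noisy = add_uniform_errors(read, error_rate=0.05)
    print(sum(a != b for a, b in zip(read, noisy)), "miscalled bases")
```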

Note on Illumina base quality score

The base quality score is generated using a Markov chain from fastq template files.
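
The general idea of a Markov chain learned from template quality strings can be illustrated as below. This is a simplified sketch under stated assumptions (a first-order, position-independent chain over quality characters); the actual chain used by ADRSM may be structured differently, and `train_chain` and `sample_quality` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def train_chain(quality_strings):
    """Count first symbols and symbol-to-symbol transitions in the templates."""
    start = Counter()
    transitions = defaultdict(Counter)
    for qual in quality_strings:
        if not qual:
            continue
        start[qual[0]] += 1
        for prev, nxt in zip(qual, qual[1:]):
            transitions[prev][nxt] += 1
    return start, transitions


def sample_quality(start, transitions, length, rng=None):
    """Sample a quality string of the given length from the trained chain."""
    rng = rng or random.Random(7357)
    if length <= 0:
        return ""

    def draw(counter):
        symbols = list(counter)
        return rng.choices(symbols, weights=[counter[s] for s in symbols])[0]

    qual = [draw(start)]
    while len(qual) < length:
        nxt = transitions.get(qual[-1])
        # A symbol only ever seen at the end of a template has no outgoing
        # transition; fall back to the start distribution in that case.
        qual.append(draw(nxt) if nxt else draw(start))
    return "".join(qual)


if __name__ == "__main__":
    templates = ["IIIIIIHHHHFFFF###", "IIIIHHHHHHFF#####"]
    start, transitions = train_chain(templates)
    print(sample_quality(start, transitions, 20))
```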

Note on mutation

ADRSM lets you add mutations to your sequences. This accounts for the evolutionary differences between ancient organisms and their reference genome counterparts present in today's databases.

ADRSM assumes two times more transitions than transversions.

There are two parameters for mutation simulation:

The mutation rate (in bp/year): a good starting point is 10e-7 for bacteria
The age (in years) of the organism
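
How the two parameters could combine is sketched below. The assumptions are flagged in the comments: the rate is read as substitutions per site per year, the number of mutated sites is taken as rate × age × sequence length (rounded), and the 2:1 transition/transversion ratio is applied by drawing a transition with probability 2/3. `mutate` is a hypothetical helper, not ADRSM's implementation.

```python
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def mutate(seq, mutation_rate, age, rng=None):
    """Mutate round(rate * age * len(seq)) distinct sites, 2 transitions : 1 transversion."""
    rng = rng or random.Random(7357)
    # Assumption: expected number of mutations = rate * age * sequence length.
    n_mut = min(len(seq), round(mutation_rate * age * len(seq)))
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mut):
        base = bases[pos]
        if base not in TRANSITION:  # leave N and other symbols untouched
            continue
        if rng.random() < 2 / 3:
            bases[pos] = TRANSITION[base]
        else:
            bases[pos] = rng.choice(TRANSVERSIONS[base])
    return "".join(bases)


if __name__ == "__main__":
    genome = "ACGT" * 250  # 1000 bp toy sequence
    mutated = mutate(genome, mutation_rate=10e-7, age=10000)
    print(sum(a != b for a, b in zip(genome, mutated)), "mutated sites")
```

With a rate of 10e-7 and an age of 10000 years, this model mutates about 1% of the sites.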
