This is another attempt to write a fast enough, experimental, ancient-DNA-damage-aware short-read mapper. This work depends on the rust-bio crate (Köster, 2016).
mapAD uses pure backtracking on top of the bidirectional FMD-index (Li, 2012). Central algorithmic ideas are inspired by BWA-backtrack (Li & Durbin, 2009). Improved algorithms and error models will be incorporated step by step as needed.
Ancient DNA damage models can be included via the SequenceDifferenceModel trait.
The default (and only) implementation is based on Udo Stenzel's ANFO/r-candy.
Besides Rust, no additional dependencies are needed to compile.
- Install/update Rust (locally, to the current user's home directory):
  `curl https://sh.rustup.rs -sSf | sh`
- Git clone mapAD or simply download a release zip file:
  `git clone https://github.com/mpieva/mapAD.git && cd mapAD`
- Build:
  `cargo build --release`
  The resulting binary file `mapad` is now in the subfolder `target/release/`.
- Run!
  `cd target/release/ && ./mapad index --reference /path/to/reference/hg19.fasta`
- For increased performance on modern CPUs the compiler can make use of advanced SIMD instructions if you enable AVX2 and FMA like this (recommended). Please note that the resulting binary will not run on CPUs that don't support these features.
RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
or this (not recommended; reduced portability)
RUSTFLAGS="-C target-cpu=native" cargo build --release
Please note that the resulting binary is not necessarily portable between different CPU architectures.
- The number of `v`s passed to the program determines the level of verbosity:
  `mapad -vv index ...` or `mapad -vv map ...`
- `mapAD` can be built as a fully static binary with `musl` libc:
  - Install target platform requirements: `rustup target add x86_64-unknown-linux-musl`
  - Install `musl-gcc` on the build system (Ubuntu: `sudo apt install musl-tools`)
  - Build: `RUSTFLAGS="-C target-feature=+crt-static -C link-self-contained=yes" cargo build --release --target x86_64-unknown-linux-musl`
The subprograms mapad index and mapad map will index the reference and map reads to it, respectively.
Adding the --help flag will print a list of available and required command line options.
./mapad index --reference /path/to/reference/hg19.fasta will store six index files in the directory of the input
FASTA (/path/to/reference/hg19.fasta{.tbw, .tle, .toc, .tpi, .trt, .tsa}).
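Before mapping, it can be handy to check that all six index files are in place. The following is a minimal sketch, not part of mapAD itself: the helper names `expected_index_files` and `missing_index_files` are hypothetical, and the only thing taken from the text above is the list of suffixes appended to the FASTA path.

```rust
use std::path::PathBuf;

/// Suffixes of the six index files listed above.
const INDEX_SUFFIXES: [&str; 6] = ["tbw", "tle", "toc", "tpi", "trt", "tsa"];

/// Returns the expected index file paths for a reference FASTA path.
/// Each path is the full FASTA path with ".<suffix>" appended.
/// (Hypothetical helper, not a mapAD API.)
fn expected_index_files(reference: &str) -> Vec<PathBuf> {
    INDEX_SUFFIXES
        .iter()
        .map(|suffix| PathBuf::from(format!("{}.{}", reference, suffix)))
        .collect()
}

/// Returns those expected index files that do not exist on disk.
/// (Hypothetical helper, not a mapAD API.)
fn missing_index_files(reference: &str) -> Vec<PathBuf> {
    expected_index_files(reference)
        .into_iter()
        .filter(|path| !path.exists())
        .collect()
}
```

Calling `missing_index_files("/path/to/reference/hg19.fasta")` and getting an empty vector back would suggest that `mapad index` has already been run for that reference.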
The scoring model is derived from Udo Stenzel's ANFO/r-candy (Green et al., 2010; SOM3).
The symbols
Double-stranded library preparation: The probability of a
position
Effective deamination
probabilities:
Sequencing errors and evolution (
| | A | C | G | T |
|---|---|---|---|---|
| A | | | | |
| C | | | | |
| G | | | | |
| T | | | | |
All of the above probabilities are
Tests have shown that we can achieve good sensitivity and specificity allowing -p 0.03 mismatches and relatively high
deamination parameters (see "50% Deamination Parameters" below).
Over-specification of damage parameters does not seem to have a significant negative impact on alignment accuracy.
The following example aligns reads to an existing index of the hg19 reference. These damage settings cause C -> T mismatches on both 5'- and 3'-ends to be free (no penalty). The penalties for those substitutions of course increase as the center of a read is approached.
./mapad -vv map \
--threads 32 `# Number of threads to use (runs on all available cores when set to 0).` \
--library single_stranded `# Library preparation protocol (single- or double-stranded)` \
-p 0.03 `# Allowed mismatches under "-D" base error rate (similar to BWA backtrack)` \
-f 0.5 `# Five-prime overhang parameter (generic overhang parameter when "--library" is set to "double_stranded")` \
-t 0.5 `# Three-prime overhang parameter (not used if "--library" is set to "double_stranded")` \
-d 0.02 `# Deamination rate in double-stranded parts` \
-s 1.0 `# Deamination rate in single-stranded overhangs` \
-i 0.001 `# InDel rate (corresponds to gap open penalty)` \
-x 0.5 `# Gap extension penalty as a fraction of the repr. mismatch penalty` \
--reads "${input_bam}" \
--reference "/path/to/reference/hg19.fasta" `# Prefix of index files` \
--output "${output_bam}"

The following example starts a dispatcher node and then spawns multi-threaded workers on SGE cluster nodes that have more than 30 GB of free RAM. Start the dispatcher:

./mapad -v map \
--dispatcher \
# ... (see local example)

Spawn workers:

qsub -N "mapAD_worker" -pe "smp" 1-32 -t 1-128 -l "h_vmem=30G,s_vmem=30G,virtual_free=30G,mem_free=30G,class=*" -j "y" -R "y" -b "y" ./mapad -vv worker --threads 8 --host $(hostname)

Mapping qualities are comparable with those produced by BWA. However, an alignment that maps equally well to two
positions in the genome would be assigned a MAPQ of 3 by mapAD, whereas BWA would assign a MAPQ of 0. To filter out
reads mapping to multiple positions a MAPQ threshold of > 5-10 roughly corresponds to a BWA-specific threshold of > 0.
Here,
- Unique (best alignment maps to one position): $1$
- Pseudo-unique (best alignment maps to one position, but, with worse score, also to others): $\frac{\text{AS}_\text{best}}{\text{AS}_\text{best} + \sum{\text{AS}_\text{subopt} |\text{subopt}|}}$
- Non-unique (best alignment maps to multiple positions): $\frac{1}{|\text{best}|}$
Mapping quality is defined as the PHRED-scaled probability that an alignment is incorrect. Hence the above probabilities
are PHRED-scaled, and, for better compatibility with BWA, confined to the interval
A recommended equivalent to a mapping quality threshold of 25 for BWA mapped data is 20 for mapAD output.
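To make the PHRED scaling concrete, here is a minimal sketch of the three cases above. It is not mapAD's actual code. The function names, the rounding, and the `max_mapq` cap parameter are assumptions. The AS terms are passed in exactly as they appear in the formula, without restating how they are derived from raw alignment scores.

```rust
/// Probability that the best alignment is correct, following the three
/// cases listed above.
/// `as_best`: AS term of the best alignment, as used in the formula.
/// `n_best`: number of positions the best alignment maps to.
/// `subopt`: (AS term, number of positions) for each suboptimal alignment.
fn alignment_probability(as_best: f64, n_best: usize, subopt: &[(f64, usize)]) -> f64 {
    if n_best > 1 {
        // Non-unique: best alignment maps to multiple positions.
        1.0 / n_best as f64
    } else if subopt.is_empty() {
        // Unique: one position, no suboptimal alignments.
        1.0
    } else {
        // Pseudo-unique: weight each suboptimal term by its number of positions.
        let weighted: f64 = subopt.iter().map(|&(a, n)| a * n as f64).sum();
        as_best / (as_best + weighted)
    }
}

/// PHRED-scales the probability that the alignment is incorrect.
/// `max_mapq` is a hypothetical cap; rounding to the nearest integer
/// is likewise an assumption of this sketch.
fn mapq(p_correct: f64, max_mapq: f64) -> u8 {
    let p_wrong = 1.0 - p_correct;
    let phred = if p_wrong <= 0.0 {
        max_mapq
    } else {
        -10.0 * p_wrong.log10()
    };
    phred.min(max_mapq).max(0.0).round() as u8
}
```

For a best alignment that maps equally well to two positions, `alignment_probability` returns 0.5 and `mapq(0.5, 37.0)` gives 3, in line with the MAPQ of 3 mentioned above. The cap of 37 in that call is an assumed value chosen for illustration.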
mapAD uses BAM auxiliary data fields to report suboptimal alignments in a bwa aln-like fashion. X0: Number of best
hits (multi-mapped alignment), X1: Number of suboptimal alignments, XA: 5 best suboptimal alignments in the format
chr,(+|-)pos,cigar,MD,NM,num_of_hits,AS, XT: Type of mapping ((N|U|R)), XS: Best suboptimal alignment score.
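Downstream tools may want to read the XA field. The sketch below parses the format given above into a struct. The struct and function names are illustrative only. Two things are assumptions rather than facts from this document: multiple entries are taken to be separated by `;` (as in bwa aln output), and AS is parsed as a floating-point number.

```rust
/// One suboptimal alignment from an XA entry (field names are illustrative).
#[derive(Debug, PartialEq)]
struct XaEntry {
    chrom: String,
    reverse_strand: bool,
    pos: u64,
    cigar: String,
    md: String,
    nm: u32,
    num_of_hits: u32,
    alignment_score: f32, // assumption: AS is a floating-point value
}

/// Parses a single entry of the form chr,(+|-)pos,cigar,MD,NM,num_of_hits,AS.
/// Returns None if the entry does not have exactly seven well-formed fields.
fn parse_xa_entry(entry: &str) -> Option<XaEntry> {
    let fields: Vec<&str> = entry.split(',').collect();
    if fields.len() != 7 {
        return None;
    }
    // The strand is encoded as a mandatory sign in front of the position.
    let (reverse_strand, pos) = if let Some(p) = fields[1].strip_prefix('+') {
        (false, p)
    } else if let Some(p) = fields[1].strip_prefix('-') {
        (true, p)
    } else {
        return None;
    };
    Some(XaEntry {
        chrom: fields[0].to_string(),
        reverse_strand,
        pos: pos.parse().ok()?,
        cigar: fields[2].to_string(),
        md: fields[3].to_string(),
        nm: fields[4].parse().ok()?,
        num_of_hits: fields[5].parse().ok()?,
        alignment_score: fields[6].parse().ok()?,
    })
}

/// Parses a whole XA value, assuming ';'-separated entries.
/// Malformed entries are skipped.
fn parse_xa(value: &str) -> Vec<XaEntry> {
    value
        .split(';')
        .filter(|entry| !entry.is_empty())
        .filter_map(parse_xa_entry)
        .collect()
}
```

A made-up entry such as `chr1,-12345,30M,30,0,2,-3.5` would parse into a reverse-strand hit at position 12345 with two hits and an alignment score of -3.5.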
- Memory consumption of both mapping and indexing (see Hardware Requirements)
- No awareness of paired-end sequencing (pairs need to be merged before mapping)
- No seeding (it's not very effective for short reads, but could easily be implemented for longer ones. Probably
with less negative impact on (aDNA-)sensitivity than seeding in
BWA).