All-Food-Seq (AFS): A Quantifiable Screen for Species in Biological Samples by Deep DNA Sequencing
8 authors, including:
Mathias Weber
Universität Regensburg
All content following this page was uploaded by Yongchao Liu on 21 August 2014.
Abstract
Background: DNA-based methods like PCR efficiently identify and quantify the taxon composition of complex
biological materials, but are limited to detecting species targeted by the choice of the primer assay. We show here
how untargeted deep sequencing of foodstuff total genomic DNA, followed by bioinformatic analysis of sequence
reads, facilitates highly accurate identification of species from all kingdoms of life, at the same time enabling
quantitative measurement of the main ingredients and detection of unanticipated food components.
Results: Sequence data simulation and real-case Illumina sequencing of DNA from reference sausages composed
of mammalian (pig, cow, horse, sheep) and avian (chicken, turkey) species are able to quantify material correctly at
the 1% discrimination level via a read counting approach. An additional metagenomic step facilitates identification
of traces from animal, plant and microbial DNA including unexpected species, which is prospectively important for
the detection of allergens and pathogens.
Conclusions: Our data suggest that deep sequencing of total genomic DNA from samples of heterogeneous taxon
composition promises to be a valuable screening tool for reference species identification and quantification in
biosurveillance applications like food testing, potentially alleviating some of the problems in taxon representation
and quantification associated with targeted PCR-based approaches.
Keywords: Illumina, Next-generation sequencing, Real-time PCR, Species identification, Metagenomics
© 2014 Ripp et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited.
Ripp et al. BMC Genomics 2014, 15:639 Page 2 of 11
http://www.biomedcentral.com/1471-2164/15/639
sequencing of amplicons from variable genomic or cell organelle DNA regions (e.g. 16S rDNA,
rDNA-ITS or mitochondrial COI [11]). Barcoding methods have been shown to very efficiently identify
taxa within environmental or food-derived metagenomic samples in a qualitative way [12-15], but
require separate assays to address the different domains of life. In addition, quantification of
components by barcode sequencing has proven problematic due to taxonomic biases induced by the
varying primer binding efficiencies across taxa ([13-16]; and references therein). Species
quantification by sequencing of organellar PCR amplicons is also problematic, as the absolute number
of mitochondrial genomes per cell varies considerably even between different tissues (e.g. eightfold
within different human cell types [17]). In contrast, sequence analysis of total genomic DNA
isolated from food offers in principle the possibility to detect species in a totally unbiased way,
enabling e.g. the detection of fraud through admixture of undeclared ‘exotic’ taxa or the presence
of health risks by microbial contamination [4]. In the field of gene expression analysis, NGS
sequencing facilitates a robust quantitative analysis of RNA molecules through digitally counting
sequence reads obtained from the cDNA population of a tissue [18,19]. The sensitivity and dynamic
range of read counting equals or exceeds that of other quantitative DNA analytical methods like
microarrays or SAGE [20,21]. From a technical perspective, species identification based on
whole-genome sequencing should also be feasible since the large, non-protein-coding part of
eucaryotic genomes evolves rather quickly and strongly conserved gene exons constitute only a minor
proportion, e.g. roughly 1.5% of a mammalian genome [22]. Therefore, even closely related
food-relevant taxa such as goat and sheep or turkey and chicken should be distinguishable in a total
genomic comparison. Intraspecific polymorphism in foodstuff species ranges between 0.5 and 5
nucleotides per 1,000 bp in horse, swine and chicken, respectively [23-26], which should not
substantially affect species discrimination.

Here we show that deep sequencing of total DNA derived from foodstuff material can readily identify
and quantify species components with high accuracy by a single experimental assay. Sequence reads
are assigned to species by mapping [27,28] to publicly available reference genome sequences, which
steadily grow in number, as exemplified by the Genome10k Project (https://genome10k.soe.ucsc.edu).
At the same time, reads of “unexpected” species origin are readily detected by a metagenomic
analysis based on DNA sequence database searching.

Methods
The bioinformatics pipeline
Sequence reads of 100 bp, either obtained by simulation (see below) or by Illumina sequencing of DNA
from sausage meat (see below), were initially mapped against reference genomes using the algorithms
BWA (V 0.7.0; [29]) or CUSHAW [30], resulting in a SAM file for each mapping. Reference genomes in
our pilot analysis comprised the species Bos taurus, Bubalus bubalis, Equus caballus, Escherichia
coli, Gallus gallus, Glycine max, Homo sapiens, Listeria seeligeri, Meleagris gallopavo, Mus
musculus, Neisseria gonorrhoeae, Oryctolagus cuniculus, Oryza sativa, Ovis aries, Rattus norvegicus,
Shigella boydii, Sus scrofa, Triticum aestivum and Zea mays (for details see Additional file 1).
Reference genome taxa were mostly chosen either because of their foodstuff relevance or because of
matches obtained in the metagenomic analysis step of our pipeline. Others (like human or rat) were
primarily included to serve as negative controls to judge the extent of false-positive read
assignments. It is clear that for a broader screening many more reference genomes could have been
used. The practical upper limit for the number of reference genomes clearly depends on computer
power and scales linearly with time. The BWA mappings were executed by allowing 0, 1, 2 or 3
mismatches, depending on the respective approach (see below). For the downstream analysis of the
mapping results we utilized SAMtools (V 0.1.18; [31]) and a set of self-implemented Perl scripts.

After the mapping step, we identified three sets of sequence reads (Figure 1). The first set
contained reads mapping to just one genome (“unique reads”). Assigning these reads to a genome and
quantifying them by counting was a straightforward task. More challenging were reads that covered
conserved sequence regions within genomes and therefore simultaneously hit at least two different
genomes (“multi-mapped reads”), even under conditions of the highest mapping stringency. Since these
conserved reads cannot be assigned with any certainty to one specific genome, we distributed them to
the respective candidate genomes in the proportion previously calculated from the unique reads. By
this means, the multi-mapped reads could additively be used to improve the values of the
quantitative analysis.

A third category, so-called “unmapped reads”, was collected and forwarded to up to three further
rounds of mapping, each of which allows one more mismatch than the previous round (i.e. in round 4
we had a matching stringency of 97%). We then calculated the proportions of species material from
all reads which were unambiguously assigned at this step. To account for the different quality
(i.e. completeness) of the reference genomes, as indicated by different numbers of positions denoted
by Ns in the genome drafts, our initial quantitative estimates were corrected by a genome quality
factor f = (n + c)/c, where n is the number of ambiguous nucleotides and c is the total number of
nucleotides in the reference genome. Further normalization should in principle be necessary to
adjust for largely different genome sizes, e.g. when comparing birds and mammals which differ
roughly 3-fold in DNA content [32,33]. However, our quantification of a sample containing avian
material (Additional file 2) indicated that such normalization might be unnecessary, possibly due to
the correlation of smaller genome size with a smaller nucleus and cell size [34] leading to a
compensatory denser packaging of cells per gram avian tissue.

In our pipeline, we subsequently tried to identify the origin of still unmapped reads by BLASTN
(V 2.2.25) database searching [35] against the NCBI nucleotide database (nr/nt). Since our query
sequences were short 100 bp reads, we used a word size of 11, set the BLAST e-value to 100 according
to MEGAN’s “how to use BLAST” tutorial [36], and accepted the best three hits for further analysis.
Furthermore, we set BLAST’s “-I” option to add the gi number to the default BLAST output files.
Otherwise, default BLASTN settings were used. BLASTN results were then visualized by the metagenomic
analysis tool MEGAN4 (V 4.70.4). This tool parses BLAST output files and assigns the results to
species or, if this is not possible, taxonomic groups of higher rank according to the NCBI taxonomy
database. To filter out false positives caused by low-complexity repeats (e.g. microsatellites) or
highly conserved regions, we set MEGAN’s LCA parameters to Min complexity = 0.44, Min Score = 75.0,
Top percent = 1.0 and turned on the percent identity filter. To limit the analysis to the most
relevant results, taxa were, somewhat arbitrarily, only accepted for visualization in our pilot
study if more than 50 reads were assigned to the taxon. BLAST results were then visualized as a
phylogenetic tree and quantified using Excel. For species attracting more than a threshold number of
the unmapped reads in the BLAST step, a return to the read mapping procedure would be reasonable to
infer more exactly the proportion of this taxon. This, of course, requires the availability of the
respective reference genome, the list of which is gradually increasing.

Dataset simulation and calculation
As a proof of principle, we simulated records of Illumina sequence data by randomly extracting
100 bp long sequences from downloaded genome sequences, which were subsequently tagged by their
origin (Additional file 1).
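The read simulation described in this section can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `simulate_reads`, the `proportions` argument and the toy genomes are hypothetical, and the 1% random errors mentioned in this section are modelled here as simple base substitutions, which is an assumption.

```python
import random

BASES = "ACGT"

def simulate_reads(genomes, proportions, n_reads, read_len=100,
                   error_rate=0.01, seed=0):
    """Draw reads from reference sequences and tag each with its origin.

    genomes:     dict mapping species name -> genome sequence (string)
    proportions: dict mapping species name -> relative share in the mix
    Returns a list of (origin, read) tuples.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    weights = [proportions[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source species according to the desired mixture.
        origin = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[origin]
        start = rng.randrange(len(seq) - read_len + 1)
        read = list(seq[start:start + read_len])
        # Assumption: errors are independent substitutions at a fixed rate.
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append((origin, "".join(read)))
    return reads

if __name__ == "__main__":
    toy = {"sheep": "ACGT" * 100, "cattle": "TTGCA" * 80}  # toy sequences only
    sample = simulate_reads(toy, {"sheep": 60, "cattle": 40}, n_reads=5)
    for origin, read in sample:
        print(origin, read[:20])
```

Keeping the origin tag with every read is what later allows the mapping accuracy to be checked against the known source of each read.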
Random errors were introduced into the simulated reads at a 1% rate. Next, we compiled mixed
datasets for testing the read-mapping pipeline by randomly sampling subsets of these simulated
sequence reads. For simplicity at this testing stage, we did not perform iterative mapping at
different stringencies, but allowed only one mismatch in the mapping process. We also did not apply
the genome quality factor.

Illumina sequencing of DNA from a sausage calibration sample
Total genomic DNA was extracted from 200 mg of the homogenized calibration sausages “KalD” (type
“boiled sausage”) [37] and “KLyoA” (type “Lyoner”) [9] by the Wizard Plus Miniprep DNA purification
system (Promega, Madison, USA). DNA was eluted in 50 μl elution buffer according to the supplier's
manual. Illumina sequencing library preparation was conducted on 1.5 μg of total DNA by StarSEQ
(Mainz, Germany) using the TruSeq DNA Sample Preparation Kit v.2 (Illumina, San Diego, USA).
Sequencing was performed on an Illumina HiSeq 2000 instrument (100 bp paired-end reads) for KalD and
on a MiSeq instrument (50 bp single reads) for KLyoA. We used the FASTX toolkit
(http://hannonlab.cshl.edu/fastx_toolkit/index.html) for adapter clipping and quality filtering.
Reads shorter than 50 bp (KalD) or 20 bp (KLyoA) were discarded.

Hardware for bioinformatical analyses
For the sake of speed, mappings using BWA were preferentially performed on one node (containing
4 CPUs with 16 cores each running at 2.1 GHz) of the Mogon Linux-Cluster at University of Mainz.
Each iterative mapping (4 rounds, 0 to 3 mismatches) with 1 million paired-end reads took about
45 minutes. Mapping on a standard PC (4 × 2.67 GHz, 16 GB RAM) took 3 hours using 12 reference
genomes (30 Gbp size) and 100,000 reads. The BLAST steps of the pipeline were run on the University
of Mainz Central Computing Linux-Cluster Lc2 (Suse Linux Enterprise Server 10 SP2, 132 nodes
containing 2 CPUs with 8 cores each running at 2.7 GHz). BLAST requests (single-threaded) were split
into up to 1000 separate jobs, which reduced runtime to less than 2 hours for 200,000 queries. The
MEGAN program was subsequently run on a standard personal computer (PC) with 8 GB RAM and Windows
OS.

Results and discussion
Read mapping facilitates exact quantification of DNA from diverse species
To test if high-throughput genomic DNA sequencing was able to accurately determine the proportions
of foodstuff components, we initially simulated two sets of sequencing records (Tables 1 and 2). The
simulated sequence data was based on randomly sampled sub-sequences of publicly available genomes
with randomly introduced errors.

Dataset 1 consists of one million reads derived from six mammalian species (Table 1). After running
our pipeline on this dataset, the proportions of reads assigned to the respective reference genomes
mirrored the sample composition with high accuracy. The artificial dataset contained sequences
present in rather high quantities (60% for sheep) and low amounts (1% for human and rat), indicating
that the method worked over a broad range of proportions. We achieved absolute differences between
assigned reads and input read numbers of 0 to 0.19% (Table 1). The maximum relative difference
(% absolute difference/% input DNA) was 1.67%. We also checked the accuracy of the mapping process
by tracing the identity of the uniquely mapped reads (represented by different file paths). The
mapping accuracy turned out to be better than 99.9% (cattle).

Simulation dataset 2 comprises 850 K reads, mixed at uneven proportions from three mammalian species
(cattle, pig, sheep) and the bacterium E. coli (Table 2). This dataset was created to check if the
method was able to detect signals of “unexpected” species, which would possibly not have a reference
genome included in the initial mapping step. The sample was therefore initially run through the
pipeline without mapping against bacterial genomes. As a result, all E. coli reads were passed on to
the metagenomic BLAST/MEGAN step. By this database searching routine, 34,944 of 131,683 unmapped
reads (=26.54%) were identified as possible E. coli signals. According to our pipeline rationale,
the strong E. coli signal prompted us to add the E. coli genome to the mapping process and to run
the pipeline again to achieve a better quantitative estimate of the bacterial reads. In fact, the
proportion of E. coli reads was now determined at high accuracy with only 0.02% deviation from the
real input value (Table 2). Meanwhile, the correct assignment of the E. coli reads improved the
overall quantitative estimates for all the other species components.

These (and others, not reported) results of our simulation study proved the general feasibility of
quantitative species identification through deep sequence analysis of total genomic DNA.
Quantification was most exact for the read mapping process implemented in AFS, which however
requires the availability of a reference genome. Given that vertebrate (and many other) species will
soon be sequenced by the thousands, this requirement will not set a limiting condition on the method
itself.

For identifying unexpected species, the application of a less stringent metagenomic search tool,
based on a BLAST database search followed by visualization via MEGAN, also proved successful.
However, our results for dataset 2 suggested that the mere evaluation of BLAST/MEGAN results would
not facilitate an accurate quantitative measurement of read numbers, which is only possible via the
more
stringently working read mapping algorithms. It should also be stressed that the results of the
metagenomic working step entirely depend on the completeness of the sequence database chosen for
searching and on the representation of a particular species within a database partition. In
addition, one should be cautious of erroneous annotations within public databases [38].

However promising the results of the simulation data analysis appeared, they clearly represented an
idealized situation, since we obviously obtained the simulated reads from the very same genomes to
which they were mapped thereafter. Hence, we conducted an analysis using real data.

Illumina sequencing of a DNA sample from calibration sausage material
Illumina sequencing was performed on DNA obtained from sausage material, which previously had been
designed and produced as calibration sample for qPCR-based approaches to species identification
(KalD, [37], KLyoA [9]). The sample KalD, on which we focused our most detailed analysis, contained
material from four mammalian species (35% cattle, 1% horse, 9% pig, 55% sheep). These mammalian taxa
feature a minimal interspecific nucleotide divergence at the level of synonymous sites within genes
of 7% (for sheep-cattle; [39]), which is most probably exceeded by neutral non-genic sites. In
addition, the sausage sample contained admixtures of 11 different plant allergens at varying amounts
(R. Köppel, unpublished data; Additional file 3).

After quality filtering, we obtained 2 × 16 million 100 bp paired-end reads. Encouraged by the
previous simulations, subsets of only 2 × 500 K (= 1 million) randomly selected paired-end reads
were used for further analysis. To account for a possible trade-off between the specificity of taxon
identification and a maximally exact quantification of reads, we devised two different mapping
strategies. When maximum specificity was the prime goal (“AFS-spec”), we did not allow any
mismatches during read mapping, and thus performed only a single mapping step with the highest
stringency. In addition, we disabled the Smith-Waterman alignment option in BWA because it lowers
the mapping stringency for a paired-end read when rescuing a read from its aligned mate. The second
strategy (“AFS-quant”) aimed at the best quantitative results. To this end, we performed an
iterative read mapping starting with 0 and ending with 3 mismatches.

For both strategies, the n = 3 repetitions produced highly similar results, as evidenced by low
standard deviations (Table 3). The AFS-quant approach delivered a highly accurate quantification of
meat components in sausage KalD, as exemplified by the value of 54.8% DNA versus 55% (w/w) meat for
sheep (Table 3). Absolute differences between the DNA proportions and the meat proportions ranged
from 0.24 to 1.79%, showing that species quantification can be achieved at the 1% discrimination
level. The highest divergence (1.79%) was observed for pig and can be explained by the use of lard
tissue [40], which presumably contains fewer cells and thus less DNA than e.g. muscle tissue. In
this respect, AFS behaves in the same well-known
matrix-dependent way as other DNA-based detection methods [40], and the definition of normalization
values for typical ingredients and production recipes should alleviate this problem also for AFS.

To infer the specificity of the mapping procedure we included the reference genome sequence of water
buffalo, which belongs to the same subfamily (Bovinae) as cattle. AFS-quant detected a
false-positive proportion of 0.64% DNA reads in the buffalo genome (Table 3), which probably
represent sequences strongly conserved between the two bovines. The more stringent AFS-spec approach
was able to reduce the false-positive rate of “buffalo reads” substantially to 0.07%, but only at
the expense of a markedly diminished accuracy for quantification of the real meat components
(Table 3). To demonstrate the broader applicability of the AFS-quant approach we sequenced and
quantified the main ingredients of the KLyoA sausage sample, which contains 0.5% chicken and 5.5%
turkey on a background of pig and cattle meat (Additional file 2). The avian components were
determined as accurately as the mammalian ones.

We conclude that the AFS-quant strategy delivers the most accurate quantitative species
determination. We note that the AFS quantification results are equal to or sometimes even better
than species analyses performed by quantitative PCR on the same sausage material [37]. AFS still
carries a very low risk of obtaining false-positive matches to closely related species. Clearly,
further case studies with other species pairs like horse-donkey, which diverged only 2.4 million
years ago [41], have to be conducted to generalize our conclusions. As a screening procedure, AFS
performance is only limited by the number of reference genomes available. Offering both a
qualitative and a quantitative result, deep sequencing of total genomic DNA appears as an excellent
alternative to microarray-based screening methods for species identification [42] or sequencing of
PCR-based barcode amplicons [12,13,15].

Detection of “unexpected” species via metagenomic analysis
DNA reads which do not map to the selection of reference genomes are passed over to the BLAST/MEGAN
annotation procedure in AFS. The one-million-read datasets obtained from the KalD sausage each
produced more than 200 K unmapped reads (Figure 2). Roughly half of these reads could successfully
be assigned to a species or higher ranked taxon. The other half was represented by two classes:
(i) low-complexity repetitive DNA (e.g. microsatellites) which is present in almost all genomes and
thus cannot be assigned unequivocally; and (ii) un-assignable reads which either did not match an
entry in the chosen database or did not meet the stringent MEGAN criteria applied. Clearly, the
choice of different specialized databases and perhaps less stringent match criteria has the
potential of reducing this portion.

The ca. 100 K reads that were taxonomically assigned by BLAST/MEGAN originated in their vast
majority (98%) from mammals (Figure 2). Of those mammalian hits, 96% were annotated as cattle, sheep,
pig and horse (i.e. those taxa which formed the sausage). Close inspection of these sequences
revealed that they predominantly represented centromeric satellite DNA. This sequence class is
usually not represented in genome reference sequences, which explains why the corresponding reads
could not be assigned in the mapping step. The observed species proportions of the satellite DNA
reads somewhat surprisingly did not match the meat proportions for cattle and sheep. A reason could
be that centromeric DNA, which is an inherently unstable component of eucaryotic genomes [43], is
present in different amounts in the heterochromatin of sheep and cattle chromosomes, making its use
for quantification purposes problematic.

Among the reads of mammalian origin, we further recorded hits to several bovine-related taxa like
the muntjac, goat or whales (Figure 2), which separated from bovines 25, 30 and 56 million years
ago, respectively (www.timetree.org).
Figure 2 Metagenomic analysis of unmapped reads. Results of the metagenomic analysis of sequence reads obtained from the KalD
reference sausage. The global result of the BLAST/MEGAN step is shown in the box (grey frame). A more detailed classification of matches is
displayed for mammals, viruses, bacteria and plants.
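The taxonomic binning behind a summary like Figure 2 can be illustrated with a toy lowest-common-ancestor (LCA) assignment. This is a simplified sketch of the general LCA idea, not MEGAN's actual implementation: the lineage table, function names and score handling are hypothetical, while the default values (minimum score 75.0, top percent 1.0, more than 50 reads per reported taxon) are taken from the Methods.

```python
from collections import Counter

# Hypothetical root-to-species lineages; a real analysis uses the NCBI taxonomy.
LINEAGES = {
    "Bos taurus": ["Mammalia", "Bovidae", "Bos taurus"],
    "Ovis aries": ["Mammalia", "Bovidae", "Ovis aries"],
    "Sus scrofa": ["Mammalia", "Suidae", "Sus scrofa"],
}

def lowest_common_ancestor(taxa):
    """Return the deepest rank shared by all given taxa ('root' if none)."""
    lca = "root"
    for ranks in zip(*(LINEAGES[t] for t in taxa)):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca

def assign_read(hits, min_score=75.0, top_percent=1.0):
    """hits: list of (taxon, bit_score) pairs for one read.

    Keeps hits above min_score and within top_percent of the best score,
    then places the read on the LCA of the remaining taxa.
    """
    kept = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not kept:
        return None
    best = max(score for _, score in kept)
    cutoff = best * (1.0 - top_percent / 100.0)
    return lowest_common_ancestor({t for t, s in kept if s >= cutoff})

def summarize(hits_per_read, min_reads=50):
    """Count reads per taxon and keep taxa with more than min_reads reads."""
    counts = Counter()
    for hits in hits_per_read:
        taxon = assign_read(hits)
        if taxon is not None:
            counts[taxon] += 1
    return {taxon: n for taxon, n in counts.items() if n > min_reads}
```

A read hitting cattle and sheep with nearly identical scores ends up at the family level in this sketch, which mirrors why conserved sequences are reported for higher-ranked taxa rather than for a single species.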
We could show by an analysis using the tool REPEATMASKER (http://www.repeatmasker.org/) that the
reads assigned to these bovine-related taxa most often belonged to transposable elements (MIRs,
LINEs, ERVs, DNA transposons) which show sequence conservation across this clade. Surprisingly, we
also found ~500 matches to human, cercopithecan primates and mouse. Inspection of these BLAST hits
revealed that they also contained interspersed repeats. However, in humans and monkeys, none of
those reads corresponded to the primate-specific Alu element family. We are thus rather sure that
neither goat nor whale DNA, nor traces of human, monkey or mouse DNA, are present in the sample. At
the same time, this issue demonstrates that expert interpretation of BLAST results is required,
which is by no means a simple task.

Beyond Mammalia, BLAST/MEGAN suggested the presence of viral, bacterial and plant sequences in the
sausage DNA (Figure 2). Viral DNA, all belonging to bacteriophage PhiX174, was easily explained
since this DNA is usually applied for technical reasons as a calibrator in Illumina sequencing
(http://res.illumina.com/documents/products/technotes/technote_phixcontrolv3.pdf). Several hundred
bacterial reads were detected, mostly originating from the human-pathogenic species Neisseria
gonorrhoeae (n = 572 reads) or from Pseudomonadales (n = 64 reads), with Pseudomonas fluorescens as
an often annotated species (n = 45 reads). The latter is notably present e.g. in deteriorating milk
and meat products [44]. While the small numbers of P. fluorescens reads can be taken as an indicator
of beginning food spoilage, the finding of Neisseria reads tells a very important cautionary tale in
metagenomic analysis. After adding the respective genome [45] to the mapping process, presumed
N. gonorrhoeae DNA was detected at an amount of 0.04%. Knowing that there should not be any
N. gonorrhoeae material in the sample, we investigated this result further. By mapping all
32 million reads of our initial dataset to the N. gonorrhoeae genome, we obtained matches
exclusively located in ten genomic regions, each
shorter than 700 bp, where read coverage was extremely high (up to 5200-fold). These regions were
extracted from the N. gonorrhoeae genome and analyzed by BLAST against the NCBI nucleotide database,
thereby revealing strong homology of these parts to ruminant sequence entries (data not shown). In
addition, mapping the sausage reads to other available N. gonorrhoeae genome sequences did not
produce any matches. We thus question the quality of the N. gonorrhoeae strain TCDC-NG08107 genome
and recommend using it with high caution. In general, this points out that the annotation quality of
database entries is of prime importance to species diagnosis.

Since meat products often contain plant material, the metagenomic analysis of the plant spectrum is
of special interest. In fact, the sausage contained admixtures of 11 plant species (Additional
file 3) to enable its use in the development of allergen detection methods. The most prominent
spiked-in ingredients were lupine (Lupinus spec.), walnut (Juglans regia), hazelnut (Corylus
avellana) and mustard (Brassica spec.). We detected 661 plant hits, which were assigned to a total
of 33 plant families. Amongst those families, Brassicaceae (mustard) dominated with 449 hits,
followed by Fabaceae (lupine, peanut, soy) with 62 hits (Figure 2; Additional file 3). All other
plant ingredients received only from 1 to 17 BLAST hits. These numbers of database matches did not
correlate with the amount of spiked-in plant material, illustrating that the current BLAST/MEGAN
routine is by no means quantitative. A probable reason is the unbalanced representation of sequence
entries for the different taxa in the database (data not shown). This can be overcome in future by
the production of reference genomes for all major food- and allergenicity-relevant species. In
addition, as expected for a DNA-based method, the quantification result will heavily depend on the
efficiency of DNA recovery from the food matrix. Of all plant allergens tested, only the genome of
soy (Glycine max) is publicly available and was thus included in the AFS read mapping step. We
detected a stable proportion of 0.005% soy DNA in the sample, while the proportion of spiked-in soy
material in the sausage was 0.0316%, suggesting a matrix-dependent underestimation by a factor of 6.
We point out, however, that qualitative detection may be the prime goal in allergen analysis [46].
The limits of AFS for allergen detection clearly have to be investigated further.

Technical considerations and further improvements
Next-generation sequencing methods represent the fastest growing technology worldwide, with ever
decreasing cost per analysis (http://www.genome.gov/sequencingcosts/). Applying novel 96-well format
multiplex methods for Illumina library preparation (NEXTERA®) and a personal sequencer (MiSeq®), we
calculate current sequencing cost (chemistry plus personnel, but excluding the bioinformatic
analysis) at roughly 150–200 Euro per sample, which may already now be interesting and feasible for
routine screening purposes. Although we produced 100 bp paired-end reads for the KalD sample, the
initial results on KLyoA suggest that cost-saving 50 bp single-end reads will probably perform
equally well in read mapping. However,
Figure 3 Determination of the optimal number of sequence reads necessary to obtain accurate quantification results for species
components. The number of sequence reads used in the mapping (x-axis) was plotted against the values of mapping accuracy (y-axis),
calculated as the cumulated absolute deviation in% of mapping results versus expected species proportions.
shorter reads may pose more problems in database searches, unless the BLAST version is optimized for
short query sequences.
An additional cost saving can possibly be achieved by optimizing the number of sequence reads
necessary to obtain stable quantification results. To this end, we mapped different numbers of reads,
starting with 50 K and multiples thereof up to 10 million reads, and calculated the sum of deviations
(in %) of the observed from the expected species proportions (Figure 3). Deviations decreased with
increasing dataset size, but were already close to the optimum at 1 million reads. Even at 50 or 100 K
reads, the sum of deviations was rather moderate, opening the perspective that even very small
datasets can still deliver a reasonable quantification result for the main sample ingredients.
Sample throughput will improve, especially with the MiSeq® instrument, which runs for only 6 h for
50 bp reads. DNA size requirements (>300 bp) and input amounts needed (1 ng) for the NEXTERA XT®
protocol [47] are routinely obtained in current PCR-based foodstuff analytics (reviewed in [48]). The
slight chance of misallocating multiplexed samples, e.g. due to erroneous bioinformatic sorting of
multiplexing tags, will be substantially reduced by the use of two such tags per sample in the NEXTERA
protocol [49,50]. Another practical problem which has to be adequately addressed is the possible
run-to-run carry-over of DNA molecules, e.g. due to incomplete removal of residual DNA when washing
the sequencing device. According to Illumina's technical notes, this detrimental effect is typically
below 0.05% (thus affecting fewer than 500 reads in 1 million) and must be controlled by dedicated
device maintenance procedures.
On the bioinformatic side, the read-mapping process can already be carried out on standard PCs with
4 GB of RAM using commercial software tools, but is still time-consuming when many reference genomes
are inspected. New developments in software programming offer the use of fast and affordable graphics
processing units (GPUs) to analyze massive sequence data in reasonable time. To test whether such
compute unified device architecture (CUDA)-based programs will speed up our pipeline, we compared the
novel mapping tool CUSHAW [30] to the standard tool BWA in terms of the time needed to analyze the
species proportions of the sausage sample. While the accuracy using CUSHAW appeared somewhat lower
than that of BWA, possibly due to algorithmic differences (data not reported), the time improvement
using CUSHAW was substantial, with a 2.0 to 2.6-fold speed-up depending on the number of threads (one
to eight) BWA was allowed to use. CUSHAW could thus cut the time needed for read mapping on a PC
roughly by half.
The biggest limitation in our pipeline in terms of time and costs was set by the massive BLAST
routines (carried out on our University high-performance cluster) necessary for the metagenomic step.
Our ad hoc calculations suggest that additional costs (ca. 100 EUR) have to be considered if access to
a commercial computing facility is needed.
The cautionary tale of the wrongly assembled/annotated Neisseria reference genome in our metagenomic
step illustrates that the correct interpretation of the BLAST/MEGAN results still requires substantial
biological and bioinformatic knowledge. The use of curated sequence database information and/or the
application of dedicated repositories containing validated species-specific sequence data (such as
bar-coding targets; http://www.barcodeoflife.org/) will greatly simplify this step for non-specialists
on the food control side. We wish to point out that a number of highly innovative approaches for the
identification (but not necessarily quantification) of species have recently been established in the
field of bacterial metagenomics, making use of curated taxon-specific sequence databases (e.g.
MetaPhlAn [51]), ultrafast algorithms for sequence pattern recognition (e.g. k-mer based methods;
[52]) or a probabilistic framework for read assignment to very closely related genomes (e.g.
Pathoscope [53]). Integration of these tools is a promising option for further improvement of AFS.

Conclusion
AFS has the potential to be a valuable method for routine testing of food material and other
biosurveillance applications, offering an attractive combination of unbiased screening for all types
of ingredients and the possibility of simultaneously obtaining quantifiable results. Since deep DNA
sequencing has already revolutionized biological and medical research, it may find its way into
routine diagnostics soon. AFS implementation currently requires elaborate knowledge of genomes and
bioinformatics, but several strategies are conceivable to further simplify and standardize the
approach.

Additional files
Additional file 1: Table S1. Reference genomes used in AFS.
Additional file 2: Table S2. Mapping results for the reference sausage KLyoA.
Additional file 3: Table S3. Plant components: spiked-in proportions and respective BLAST hits.

Abbreviations
AFS: All food sequencing; KalD: Kalibrator D; KLyoA: Kalibrator Lyoner A; GPUs: Graphics processing
units; NGS: Next-generation sequencing.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
TH, FR, FK, MW and RK conceived the study. FR, FK, AS and MW evaluated datasets. YL and BS improved
AFS routines and benchmarked datasets. RK contributed DNA material and unpublished food-related data.
TH, FR and FK
drafted the manuscript. YL, BS, RK and MW revised the manuscript. All authors approved the paper in
its final version.

Acknowledgements
We gratefully acknowledge funding by the Johannes Gutenberg University of Mainz Center for
Computational Sciences (SRFN; to TH and BS), the JGU Mainz intramural funding program and the Ministry
of Justice and for Consumer Safety Rhineland-Palatinate, Germany (to TH). The authors also thank
Dr. Steffen Rapp (JGU Mainz) and Dr. Sven Bikar (GENterprise & StarSEQ GmbH, Mainz) for operation of
the Illumina sequencers.

Author details
1Institute of Molecular Genetics, Johannes Gutenberg University Mainz, D55099 Mainz, Germany.
2Institute of Computer Science, Johannes Gutenberg University Mainz, D55099 Mainz, Germany.
3Official Food Control Authority of the Canton Zürich, Zürich, Switzerland.

Received: 23 July 2013 Accepted: 24 July 2014
Published: 31 July 2014

References
1. Das Eidgenössische Departement des Innern: Verordnung über Lebensmittel
tierischer Herkunft. Swiss Food Legislation 2005, 23:11. Art. 8 Abs. 5.
2. Bundesministerium der Justiz und für Verbraucherschutz: Gesetz über den
Verkehr von Arzneimitteln. German Drug Law 2011, 22:12. Art. 10 & 11.
3. Meyer R, Candrian U, Lüthy J: Detection of pork in heated meat products
by the polymerase chain reaction. J AOAC Int 1994, 77(3):617–622.
4. Woolfe M, Primrose S: Food forensics: using DNA technology to combat
misdescription and fraud. Trends Biotechnol 2004, 22(5):222–226.
5. Asensio L, Gonzalez I, Garcia T, Martin R: Determination of food authenticity
by enzyme-linked immunosorbent assay (ELISA). Food Control 2008,
19(1):1–8.
6. Brodmann PD, Moor D: Sensitive and semi-quantitative TaqMan™
real-time polymerase chain reaction systems for the detection of beef
(Bos taurus) and the detection of the family Mammalia in food and feed.
Meat Sci 2003, 65(1):599–607.
7. Zhang CL, Fowler, Scott NW, Lawson G, Slater A: A TaqMan real-time PCR
system for the identification and quantification of bovine DNA in meats,
milks and cheeses. Food Control 2007, 18(9):1149–1158.
8. Köppel R, Ruf J, Zimmerli F, Breitenmoser A: Multiplex real-time PCR for
the detection and quantification of DNA from beef, pork, chicken and
turkey. Eur Food Res Technol 2008, 227(4):1199–1203.
9. Eugster A, Ruf J, Rentsch J, Koppel R: Quantification of beef, pork, chicken
and turkey proportions in sausages: use of matrix-adapted standards
and comparison of single versus multiplex PCR in an interlaboratory trial.
Eur Food Res Technol 2009, 230(1):55–61.
10. Sawyer J, Wood C, Shanahan D, Gout S, McDowell D: Real-time PCR for
quantitative meat species testing. Food Control 2003, 14(8):579–583.
11. Hebert PDN, Ratnasingham S, deWaard JR: Barcoding animal life:
cytochrome c oxidase subunit 1 divergences among closely related
species. P Roy Soc B-Biol Sci 2003, 270:96–99.
12. Coghlan ML, Haile J, Houston J, Murray DC, White NE, Moolhuijzen P,
Bellgard MI, Bunce M: Deep Sequencing of Plant and Animal DNA
Contained within Traditional Chinese Medicines Reveals Legality Issues
and Health Safety Concerns. Plos Genet 2012, 8(4):436–446.
13. Tillmar AO, Dell'Amico B, Welander J, Holmlund G: A Universal Method for
Species Identification of Mammals Utilizing Next Generation Sequencing
for the Analysis of DNA Mixtures. PLoS One 2013, 8(12):e83761.
14. Zhou X, Li Y, Liu S, Yang Q, Su X, Zhou L, Tang M, Fu R, Li J, Huang Q:
Ultra-deep sequencing enables high-fidelity recovery of biodiversity
for bulk arthropod samples without PCR amplification. GigaScience 2013,
2(1):4.
15. Newmaster SG, Grguric M, Shanmughanandhan D, Ramalingam S,
Ragupathy S: DNA barcoding detects contamination and substitution
in North American herbal products. BMC Med 2013, 11:222.
16. Deagle BE, Thomas AC, Shaffer AK, Trites AW, Jarman SN: Quantifying
sequence proportions in a DNA-based diet study using Ion Torrent
amplicon sequencing: which counts count? Mol Ecol Resour 2013,
13(4):620–633.
17. Robin ED, Wong R: Mitochondrial-DNA Molecules and Virtual Number
of Mitochondria Per Cell in Mammalian-Cells. J Cell Physiol 1988,
136(3):507–513.
18. Marguerat S, Bahler J: RNA-seq: from technology to biology. Cell Mol Life
Sci 2010, 67(4):569–579.
19. Ozsolak F, Milos PM: RNA sequencing: advances, challenges and
opportunities. Nat Rev Genet 2011, 12(2):87–98.
20. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an
assessment of technical reproducibility and comparison with gene
expression arrays. Genome Res 2008, 18(9):1509–1517.
21. Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B: Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008,
5(7):621–628.
22. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB,
Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee BK, Pauli F,
Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L,
Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X,
Dunham I: An integrated encyclopedia of DNA elements in the human
genome. Nature 2012, 489(7414):57–74.
23. Wiedmann RT, Smith TPL, Nonneman DJ: SNP discovery in swine by
reduced representation and high throughput pyrosequencing.
BMC Genet 2008, 9:81.
24. Wade CM, Giulotto E, Sigurdsson S, Zoli M, Gnerre S, Imsland F, Lear TL,
Adelson DL, Bailey E, Bellone RR, Blöcker H, Distl O, Edgar RC, Garber M,
Leeb T, Mauceli E, MacLeod JN, Penedo MC, Raison JM, Sharpe T, Vogel J,
Andersson L, Antczak DF, Biagi T, Binns MM, Chowdhary BP, Coleman SJ,
Della Valle G, Fryc S, Guerin G: Genome sequence, comparative
analysis, and population genetics of the domestic horse. Science 2009,
326(5954):865–867.
25. Kerstens HHD, Crooijmans RPMA, Veenendaal A, Dibbits BW, Chin-A-Woeng
TFC, den Dunnen JT, Groenen MAM: Large scale single nucleotide
polymorphism discovery in unsequenced genomes using second
generation high throughput sequencing technology: applied to
turkey. BMC Genomics 2009, 10:479.
26. Kijas JW, Townley D, Dalrymple BP, Heaton MP, Maddox JF, McGrath A,
Wilson P, Ingersoll RG, McCulloch R, McWilliam S, Tang D, McEwan J,
Cockett N, Oddy VH, Nicholas FW, Raadsma H, International Sheep
Genomics Consortium: A genome wide survey of SNP variation reveals
the genetic structure of sheep breeds. PLoS One 2009, 4(3):e4668.
27. Pop M, Salzberg SL: Bioinformatics challenges of new sequencing
technology. Trends Genet 2008, 24(3):142–149.
28. Li H, Homer N: A survey of sequence alignment algorithms for
next-generation sequencing. Brief in Bioinform 2010, 11(5):473–483.
29. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler
transform. Bioinformatics 2010, 26(5):589–595.
30. Liu Y, Schmidt B, Maskell DL: CUSHAW: a CUDA compatible short read
aligner to large genomes based on the Burrows-Wheeler transform.
Bioinformatics 2012, 28(14):1830–1837.
31. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis
G, Durbin R: The Sequence Alignment/Map format and SAMtools.
Bioinformatics 2009, 25(16):2078–2079.
32. Williams RBH, Cotsapas CJ, Cowley MJ, Chan E, Nott DJ, Little PFR:
Normalization procedures and detection of linkage signal in genetical-
genomics experiments. Nat Genet 2006, 38(8):855–856.
33. Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Le Blomberg A, Bouffard P,
Burt DW, Crasta O, Crooijmans RPMA, Cooper K, Coulombe RA, De S, Delany
ME, Dodgson JB, Dong JJ, Evans C, Frederickson KM, Flicek P, Florea L,
Folkerts O, Groenen MA, Harkins TT, Herrero J, Hoffmann S, Megens HJ,
Jiang A, de Jong P, Kaiser P, Kim H: Multi-platform next-generation
sequencing of the domestic turkey (Meleagris gallopavo): genome
assembly and analysis. PLoS Biol 2010, 8(9):e1000475.
34. Gregory TR: A bird's-eye view of the C-value enigma: Genome size, cell
size, and metabolic rate in the class aves. Evolution 2002, 56(1):121–130.
35. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment
search tool. J Mol Biol 1990, 215(3):403–410.
36. Huson DH, Mitra S, Ruscheweyh H-J, Weber N, Schuster SC: Integrative
analysis of environmental sequences using MEGAN4. Genome Res 2011,
21(9):1552–1560.
37. Köppel R, Ruf J, Rentsch J: Multiplex real-time PCR for the detection and
quantification of DNA from beef, pork, horse and sheep. Eur Food Res
Technol 2011, 232(1):151–155.
38. Schnoes AM, Brown SD, Dodevski I, Babbitt PC: Annotation error in public
databases: misannotation of molecular function in enzyme superfamilies.
PLoS Comp Biol 2009, 5(12):e1000605.
39. Kijas JW, Menzies M, Ingham A: Sequence diversity and rates of molecular
evolution between sheep and cattle genes. Anim Genet 2006, 37(2):171–174.
40. Köppel R, Eugster A, Ruf J, Rentsch J: Quantification of meat proportions
by measuring DNA contents in raw and boiled sausages using
matrix-adapted calibrators and multiplex real-time PCR. J AOAC Int
2012, 95(2):494–499.
41. Oakenfull EA, Clegg JB: Phylogenetic relationships within the genus Equus
and the evolution of alpha and theta globin genes. J Mol Evol 1998,
47(6):772–783.
42. Teletchea F, Bernillon J, Duffraisse M, Laudet V, Hanni C: Molecular
identification of vertebrate species by oligonucleotide microarray in
food and forensic samples. J Appl Ecol 2008, 45(3):967–975.
43. Plohl M, Luchetti A, Mestrović N, Mantovani B: Satellite DNAs between
selfishness and functionality: structure, genomics and evolution of tandem
repeats in centromeric (hetero)chromatin. Gene 2008, 409(1–2):72–82.
44. Chiang YC, Tsen HY, Chen HY, Chang YH, Lin CK, Chen CY, Pai WY:
Multiplex PCR and a chromogenic DNA macroarray for the detection of
Listeria monocytogens, Staphylococcus aureus, Streptococcus agalactiae,
Enterobacter sakazakii, Escherichia coli O157:H7, Vibrio parahaemolyticus,
Salmonella spp. and Pseudomonas fluorescens in milk and meat samples.
J Microbiol Methods 2012, 88(1):110–116.
45. Chen C-C, Hsia K-C, Huang C-T, Wong W-W, Yen M-Y, Li L-H, Lin K-Y, Chen
K-W, Li S-Y: Draft genome sequence of a dominant, multidrug-resistant
Neisseria gonorrhoeae strain, TCDC-NG08107, from a sexual group at
high risk of acquiring human immunodeficiency virus infection and
syphilis. J Bacteriol 2011, 193(7):1788–1789.
46. Schubert-Ullrich P, Rudolf J, Ansari P, Galler B, Führer M, Molinelli A,
Baumgartner S: Commercialized rapid immunoanalytical tests for
determination of allergenic food proteins: an overview. Anal Bioanal
Chem 2009, 395(1):69–81.
47. Adey A, Morrison HG, Asan, Xun X, Kitzman JO, Turner EH, Stackhouse B,
MacKenzie AP, Caruccio NC, Zhang X, Shendure J: Rapid, low-input,
low-bias construction of shotgun fragment libraries by high-density
in vitro transposition. Genome Biol 2010, 11(12):R119.
48. Mafra I, Ferreira I, Oliveira M: Food authentication by PCR-based methods.
Eur Food Res Technol A 2008, 227(3):649–665.
49. van Nieuwerburgh F, Soetaert S, Podshivalova K, Ay-Lin Wang E, Schaffer L,
Deforce D, Salomon DR, Head SR, Ordoukhanian P: Quantitative bias in
Illumina TruSeq and a novel post amplification barcoding strategy
for multiplexed DNA and small RNA deep sequencing. PloS One 2011,
6(10):e26969.
50. Kircher M, Sawyer S, Meyer M: Double indexing overcomes inaccuracies
in multiplex sequencing on the Illumina platform. Nucleic Acids Res 2012,
40(1):e3.
51. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C:
Metagenomic microbial community profiling using unique clade-specific
marker genes. Nat Methods 2012, 9(8):811–814.
52. Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence
classification using exact alignments. Genome Biol 2014, 15(3):R46.
53. Francis OE, Bendall M, Manimaran S, Hong CJ, Clement NL, Castro-Nallar E,
Snell Q, Schaalje GB, Clement MJ, Crandall KA, Johnson WE: Pathoscope:
Species identification and strain attribution with unassembled sequencing
data. Genome Res 2013, 23(10):1721–1729.
doi:10.1186/1471-2164-15-639
Cite this article as: Ripp et al.: All-Food-Seq (AFS): a quantifiable screen for species in biological
samples by deep DNA sequencing. BMC Genomics 2014 15:639.