High Performance Computing in Genomics Research
Andrew Dean
The field of genomics, which deals with the study of genetic material, has
made tremendous progress in recent years, largely due to advancements in
sequencing technology. The human genome, once considered a mysterious
and intricate entity, is now being decoded at an unprecedented rate,
uncovering new knowledge about the genetic basis of diseases and
disorders. However, the sheer quantity of data produced by modern
sequencing methods presents a substantial computational challenge. This is
where High Performance Computing (HPC) comes in, providing the necessary
processing power to drive advancements in the field of genomics.
Compute also helps to lower the costs associated with whole genome
sequencing by automating many tasks, making the field more accessible to a
wider range of organisations and researchers, especially those with limited
budgets. For example, the development of low-cost sequencing technologies
such as Oxford Nanopore Technologies’ MinION sequencer, the first portable
real-time device for DNA and RNA sequencing, has been made possible through
edge computing, enabling researchers to perform sequencing in remote or
resource-constrained settings and yielding new insights into the genomes of
understudied species and populations.
Tap into the Latest Advances in Genome Research with
Accelerated Computing
As genomic testing becomes more mainstream, sequencing our
3.2 billion DNA base pairs is critical to identifying mutations that
can cause disease. Advancements in high-throughput
instruments have decreased the cost of sequencing but also
increased the amount of data that requires analysis. Leveraging
GPUs to accelerate this analysis can vastly decrease runtime
and costs compared to CPU-based approaches.
Supercharging Genomics Research
G42’s Artemis supercomputer is helping researchers understand the genomes of
United Arab Emirates’ citizens and improve healthcare in the country.
Understanding Mutational Signatures of Cancer
Researchers from the Wellcome Sanger Institute and UC San Diego collaborated with
NVIDIA to accelerate the analysis of molecular signatures of somatic mutation by 30X on
NVIDIA DGX systems.
Shortening Whole Genome Analysis from Days to
Minutes
Learn about the newest features and performance improvements in the latest release of
NVIDIA Clara Parabricks™, a turnkey solution that accelerates production germline and
somatic analysis workflows.
Artemis Supercomputer on the Hunt
for Deeper Understanding of
Genomics
Group 42’s supercomputer, powered by NVIDIA DGX systems, fuels
national genome program to enhance understanding of UAE citizens’
genomes, improve healthcare and fight COVID-19.
May 14, 2020 by Marc Domenech
Studying the entire genetic code of an individual or a group of individuals can help us
gain a better understanding of diseases, enable precision medicine and power
pharmacogenomics — how genes affect a person’s response to drugs.
G42, based in Abu Dhabi, develops and deploys holistic and scalable AI and cloud
computing offerings. Through its Inception Institute of Artificial Intelligence, it carries out
fundamental research on AI, big data and machine learning.
Now the supercomputer is being put to work on the Population Genome Program. This
national effort aims to enhance scientific understanding of Abu Dhabi citizens’ genomes
and improve healthcare in the country.
Until now, understanding genetic variation in the Arab population has been
challenging due to the lack of a high-quality Emirati reference genome. The Population
Genome Program will enrich available data by producing a reference genome specific
to citizens of the United Arab Emirates.
The program aims to be the first of its kind in the world to then use this as a baseline
and incorporate the genomic data into healthcare management processes.
Anonymized DNA samples will first be collected and processed using Oxford Nanopore
PromethION sequencers. These devices, which contain embedded NVIDIA GPU
technology to enable AI at the edge, will help to accelerate the processing of genomic
data.
The processed data will be supplied, in a graphical format, to Artemis for AI-powered
analysis, supported by NVIDIA Parabricks software for population-scale analysis.
The final results will be provided to the research and medical community to help deliver
more effective patient care. This could include more advanced treatments for conditions
such as cancer, schizophrenia, autism, and cardiovascular and neuronal diseases.
“With NVIDIA’s GPU technology we’re able to provide a highly optimized AI platform for
the national Population Genome Program and accelerate data processing,” said Min S.
Park, director of Genome Programs at G42. “This collaboration supports our goals of
developing a program for personalized care across the UAE, bringing experts, data and
technology together for improving patient care.”
Combatting COVID-19
G42 is also using its supercomputing prowess in the battle against COVID-19, having
recently established a new detection laboratory in Masdar City, Abu Dhabi. This facility
can, on a daily basis, support tens of thousands of real-time reverse transcription
polymerase chain reaction (RT-PCR) tests. These tests detect the presence of the
SARS-CoV-2 virus in samples taken from patients.
In addition, G42 is involved in the production of COVID-19 diagnostic kits, the supply of
thermal sensors and, working in coordination with local and international health
authorities, assisting in the creation of effective prevention and detection protocols to
contain the virus.
“Technology will play a crucial role in curbing the spread of the coronavirus and the
superior computing capability of Artemis can help in many ways — from rapid vaccine
development, where computer simulations may replace manual experiments and
reduce the development time of a vaccine, to mapping and predicting trends in the
outbreak, as well as predicting virus mutations,” said Peng Xiao, CEO of G42.
Basel
Cancellation deadline:
23 November 2018
Academic: 0 CHF
For-profit: 0 CHF
On its second day, the course will also include a module related to cluster use
for personalized medicine: "Data & Computing Services for Personalized
Health Research"
Application
Classes can be attended or skipped in a "pick-and-mix" mode, depending on
the needs and skills of the participant; however, registration for the full
course is required by filling in the form:
https://goo.gl/forms/tscMb9TAURIlWtFI2
Location
BSSE ETH in Basel
Additional information
Coordination: Michal Okoniewski, Scientific IT Services ETH
Instructors
Michal Okoniewski, Samuel Fux, Diana Coman-Schmid
Schedule
High Performance Computing for genomic applications
13:00-13:45 Data & Computing Services for Personalized Health Research
Getting Started
This course assumes that learners have no prior experience with the tools covered in the
course. However, learners are expected to have some familiarity with biological concepts,
including the concept of genomic variation within a population. Participants should bring their
own laptops and plan to participate actively.
To get started, follow the directions in the Setup tab to get access to the required software and
data for this workshop.
Data
This course uses data from a long term evolution experiment published in 2016: Tempo and
mode of genome evolution in a 50,000-generation experiment by Tenaillon O, Barrick JE,
Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S,
Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959)
All of the data used in this workshop can be downloaded from Figshare. More information
about this data is available on the Data page.
Course Overview
Lesson | Overview
Project management for cloud genomics: Learn how to structure your data and metadata, plan for an NGS project, and work on the command line.
Using the command line: Learn to use the command line to navigate your file system, and to create, copy, move, and remove files.
Data preparation and organisation: Learn how to automate commonly used workflows, organise your file system, and perform quality control.
Data processing and analysis: Learn how to filter out poor quality data, align reads to a reference genome, and automate these tasks for efficiency and accuracy.
Teaching Platform
This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. All
the software and data used in the workshop are hosted on an Amazon Machine Image (AMI). To
access your AMI instance remotely, follow the directions in the Setup.
Run Git Bash by double-clicking the Git Bash icon on your Desktop.
Exit Git Bash by pressing Ctrl-d, that is, pressing the Ctrl and d keys simultaneously.
Bash stands for Bourne Again Shell. In addition to being a CLI, the Bash shell is a
powerful command programming language with a long and interesting history, which you
can read about in the Wikipedia entry for Bash.
Learner View
Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools
for working with data so that they can get more done in less time, and with
less pain. This workshop teaches data management and analysis for
genomics research including: best practices for organization of
bioinformatics projects and data, use of command-line utilities, use of
command-line tools to analyze sequence quality and perform variant calling,
and connecting to and using cloud computing. This workshop is designed to
be taught over two full days of instruction.
Read our FAQ to learn more about Data Carpentry’s Genomics workshop, as
an Instructor or a workshop host.
GETTING STARTED
This lesson assumes that learners have no prior experience with the tools
covered in the workshop. However, learners are expected to have some
familiarity with biological concepts, including the concept of genomic
variation within a population. Participants should bring their own laptops and
plan to participate actively.
To get started, follow the directions in the Setup tab to get access to the
required software and data for this workshop.
DATA
This workshop uses data from a long term evolution experiment published in
2016: Tempo and mode of genome evolution in a 50,000-generation
experiment by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard
JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and
Lenski RE. (doi: 10.1038/nature18959)
All of the data used in this workshop can be downloaded from Figshare. More
information about this data is available on the Data page.
Workshop Overview
Lesson | Overview
Project organization and management: Learn how to structure your metadata, organize and document your genomics data, and access the sequence read archive (SRA) database.
Introduction to the command line: Learn to navigate your file system, create, copy, move, and remove files and directories, and use wildcards.
Data wrangling and processing: Use command-line tools to perform quality control and align reads to a reference genome.
Introduction to cloud computing for genomics: Learn how to work with Amazon AWS cloud computing and how to transfer data between your local computer and the cloud.
Optional Additional Lessons
Lesson | Overview
Intro to R and RStudio for Genomics: Use R to analyze and visualize between-sample variation.
Teaching Platform
This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All the software and data used in the workshop are hosted
on an Amazon Machine Image (AMI). If you want to run your own instance of
the server used for this workshop, follow the directions in the Setup tab.
Common Schedules
Schedule A (2 days OR 4 half days)
Overview
This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All of the data and most of the software used in the
workshop are hosted on an Amazon Machine Image (AMI). Some additional
software, detailed below, must be installed on your computer.
Please follow the instructions below to prepare your computer for the
workshop:
Setup instructions are provided for Windows, Mac OS X, and Linux.
Data
More information about these data will be presented in the first lesson of the
workshop.
Software
Software | Version | Available for | Description
FastQC | 0.11.9 | Linux, MacOS, Windows | Quality control tool for high throughput sequence data
Trimmomatic | 0.39 | Linux, MacOS, Windows | A flexible read trimming tool for Illumina NGS data
BWA | 0.7.17 | Linux, MacOS | Mapping DNA sequences against a large reference genome
SAMtools | 1.9 | Linux, MacOS | Utilities for manipulating alignments in the SAM format
BCFtools | 1.9 | Linux, MacOS | Utilities for variant calling and manipulating VCFs and BCFs
IGV | - | Linux, MacOS, Windows | Visualization and interactive exploration of large genomic datasets
Conda
Installation instructions are provided for Linux and MacOS.
FastQC
To verify the installation, run:
$ fastqc -h
Trimmomatic
BWA
To verify the installation, run:
$ bwa
SAMtools
SAMTOOLS VERSIONS
SAMtools has changed its command line invocation (for the better). This
means that most tutorials on the web show an older, obsolete usage.
Using SAMtools version 1.9 is important for working with the commands we
present in these lessons.
To verify the installation, run:
$ samtools
BCFtools
To verify the installation, run:
$ bcftools
IGV
by Nathan Eddy
Sequencing our 3.2 billion DNA base pairs is becoming increasingly crucial as
genomic testing gains widespread acceptance.
That translates into more than 80 times the acceleration on some of those
industry standard tools, he notes, adding that the software is also scalable.
“It’s fully compatible with all the workflow managers genomics researchers
are using,” Clifford says. “There’s also the improved accuracy point, which is
provided by artificial intelligence-based deep learning and high accuracy
approaches included in the toolkit.”
Volume, Velocity and Variety of Data Pose Challenges in
Genomics
Clifford points out that Big Data analysis can be split broadly into three
pillars: the amount of data (volume), the speed of processing (velocity) and
the number of data types (variety).
“First off, we have this huge explosion of data, this volume problem in
genomics, and that’s why you need HPC solutions,” he says.
The second aspect of the Big Data challenge is velocity, as each sample must
move through a wet lab process, a sequencing process and then a
computational analysis process.
“Those sequencers are now running so quickly that compute is the new
bottleneck in genomics,” Clifford explains.
“We have that in our Clara Parabricks genomics analysis software with AI-
led, neural network–based solutions for high accuracy, as well as
downstream in a lot of the drug discovery work with large language models
driving new insights in the field,” he says.
“If you were to compare the run times on a CPU with a GPU for analysis of a
single sample end to end, you’re looking at somewhere on the order of 24
hours-plus on a CPU, whereas on the GPU we have that down to less than 25
minutes on our DGX systems,” Clifford says. “That’s a huge acceleration of
the analysis.”
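As a back-of-the-envelope check, the quoted runtimes imply a speedup of roughly 58x, and since both figures are stated as bounds ("24 hours-plus" and "less than 25 minutes"), the true factor is at least that:

```python
# Back-of-the-envelope check of the quoted end-to-end speedup:
# 24+ hours on CPU versus under 25 minutes on GPU for one sample.
cpu_hours = 24
gpu_minutes = 25
speedup = (cpu_hours * 60) / gpu_minutes
print(f"at least {speedup:.0f}x faster end to end")
```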
“If you need to run this in the cloud, for example, where time is money, then
that reduced time is saving you a huge amount in costs as well,” he says.
Clifford explains that NVIDIA’s full-stack approach to solutions means it’s
getting easier for healthcare organizations to tap the power of HPC.
“You’re able to program these chips, to use so many different libraries for
the data science steps and for the genomics itself with Clara Parabricks,” he
explains. “The tools are there, and all of this analysis can now be brought on
the GPU.”
He points to the next generation of chips, including the recently released
H100 Tensor Core GPU, which he describes as “very well suited to genomics
analysis.” It boasts a transformer engine that dynamically determines whether 8-bit
or 16-bit floating point calculations are appropriate.
All these components mean it will work well with large language models and
some of the latest transformer-based deep learning architectures.
“These very large models give us a new ability to interpret data and
understand biological meaning,” he says. “That’s an area of the field that is
just getting started and benefits hugely from HPC and GPU.”
Abstract
Next Generation Sequencing (NGS) workloads largely consist of pipelines of tasks with
heterogeneous compute, memory, and storage requirements. Identifying the optimal
system configuration has historically required expertise in both system architecture and
bioinformatics. This paper outlines infrastructure recommendations for one commonly
used genomics workload based on extensive benchmarking and profiling, along with
recommendations on how to tune genomics workflows for high performance computing
(HPC) infrastructure. The demonstrated methodology and learnings can be extended for
other genomics workloads and for other infrastructures such as the cloud.
Introduction
Since the advent of Next Generation Sequencing (NGS), the cost of sequencing genomic
data has drastically decreased, and the amount of genomic samples processed
continues to increase [1,2]. With this growth comes the need to more efficiently process
NGS datasets.
Prior work has focused on custom methods for deploying exomes on HPC systems [3],
as well as best practices for deploying genomics workflows on the cloud [4]. This paper
outlines how to optimize system utilization for one commonly used genomics workflow,
along with recommendations on how to tune genomics workflows for HPC infrastructure.
The Broad Institute’s Genome Analysis Toolkit (GATK) Best Practices Pipeline for
Germline Short Variant Discovery is a commonly used workflow for processing human
whole genome sequences (WGS) datasets. This pipeline consists of 24 tasks, each with
specific compute, memory, and disk requirements.
See S1 Table for a full list of the tasks and their requirements. Of these 24 tasks, six are
multithreaded, and the rest are single threaded.
For multithreaded tasks, the genome is broken into shards, and each is executed as a
parallel process. At the end of the task, the output datasets from all shards are
aggregated and passed as a single input to the next task. This ability to “scatter” a task
across multiple jobs, and then “gather” outputs for the next task is called “scatter-
gather” functionality [5]. The ability to process multiple jobs concurrently is referred to
as “parallelization.” Both scatter-gather functionality and parallelization are key
concepts for efficiently distributing genomic pipelines on a system.
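The scatter-gather pattern described above can be sketched in a few lines of Python. This is a toy stand-in, not the pipeline's actual code: align(), scatter(), and gather() are illustrative placeholders for the real per-shard aligner job and merge step.

```python
from multiprocessing import Pool

def align(shard):
    # placeholder per-shard work; a real task would run an aligner here
    return sorted(shard)

def scatter(reads, n_shards):
    # "scatter": split the input into n_shards roughly equal chunks,
    # each of which becomes an independent parallel job
    return [reads[i::n_shards] for i in range(n_shards)]

def gather(shard_outputs):
    # "gather": aggregate per-shard outputs into a single input
    # for the next task in the pipeline
    merged = []
    for out in shard_outputs:
        merged.extend(out)
    return sorted(merged)

if __name__ == "__main__":
    reads = list(range(100))[::-1]          # toy stand-in for reads
    shards = scatter(reads, 24)             # scatter into 24 shards
    with Pool(4) as pool:
        outputs = pool.map(align, shards)   # shards execute in parallel
    result = gather(outputs)                # gather for the next task
    print(result[:5])
```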
For example, in the task BWA, which aligns the read fragments output by the sequencer
against a reference genome, the genome is broken into 24 shards. On local high performance
computing (HPC) infrastructure, each of these 24 shards is packaged as a single batch
scheduler job. Once all 24 shards of BWA complete, the task MergeBamAlignment (Mba)
consolidates the 24 output files into a single input file for the next task, MarkDuplicates,
which is single threaded.
Each of these 24 BWA jobs is deployed on the cluster and executed in parallel, and each
job is allocated a recommended four CPU threads and 14GB DRAM (see S1 Table). When
BWA is running with these parameters, therefore, it consumes in total 96 threads and
336GB DRAM.
It’s important to note that while BWA can readily consume 96 threads on a system,
most of the tasks are single threaded. Figure 1 below shows CPU utilization (gray) and
memory utilization (in red) for the duration of the pipeline when processing a single 30X
coverage human whole genome sequence (WGS). Note that memory utilization is close
to its maximum for roughly only a third of the overall processing time. CPU utilization is
at 100% for even less time.
Fig 1.CPU and Memory Utilization for single WGS run.
The runtime for key tasks in the pipeline (with start and end times in parentheses) is BWA
(0.0-2.4); MarkDuplicates (2.4-4.8); SortSampleBam (4.8-5.7); BaseRecalibrator (5.7-
6.3); GatherBQSRReports (6.3-6.4); ApplyBQSR (6.4-6.9); GatherBamFiles (6.9-7.3);
HaplotypeCaller (7.3-8.7); MergeVCFs (8.7-9.3). CPU Utilization is at 100% for BWA, and
40-50% for HaplotypeCaller. Memory utilization is close to 100% only for BWA and
SortSampleBam. Because of this heterogeneity in resource utilization, achieving
maximum throughput requires efficient scheduling of multiple WGS samples in parallel.
Because of this heterogeneity, making efficient use of HPC infrastructure requires
tuning and orchestration of the workflow. First, these tasks need to be efficiently
sharded and distributed across the cluster. Second, each task needs to be allocated the
optimal number of threads and memory. Third, local disk must be used for temporary storage.
Finally, the tasks need to take advantage of underlying hardware features.
This paper outlines the impact of each of these factors on performance, and details best
known methods for configuring the Germline Variant Discovery pipeline on local HPC
infrastructure.
Methods
The publicly available NA12878 30X coverage whole genome sequence (WGS) dataset
was used for all benchmarking. For the resource profiling (e.g. Fig 1), a single sample
was run. For throughput tuning 40 WGS were submitted concurrently.
Jobs were orchestrated on the cluster with Slurm. The Broad Institute provides a Slurm
backend for Cromwell, which can be found in the Cromwell documentation [6]. GATK
Best Practices Pipelines are defined in Workflow Description Language (WDL). A given
WDL defines which GATK tools to call in the form of tasks and is accompanied by a JSON
file with dataset locations and other configuration settings.
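As an illustrative sketch of that structure (the task name, file names, and resource values below are hypothetical, not the actual Broad pipeline), a WDL task declares its command and resource requirements, and the accompanying JSON supplies concrete inputs:

```
# example.wdl (hypothetical fragment)
task BwaMem {
  input {
    File fastq
  }
  command {
    bwa mem reference.fasta ~{fastq} > aligned.sam
  }
  runtime {
    cpu: 4
    memory: "14 GB"
  }
  output {
    File sam = "aligned.sam"
  }
}

# example.inputs.json (dataset locations and configuration)
{
  "Pipeline.BwaMem.fastq": "NA12878_R1.fastq.gz"
}
```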
The full list of tasks in the Germline Variant Discovery pipeline can be found in S1 Table,
along with the compute requirements for each task. Testing was performed with GATK
v4.1.8.0. A full software bill of materials is provided in the Supporting Information S3
Table.
The specific WDL and JSON files used for testing can be found in S3 Table. The
recommended resource allocation values are included in those workflows.
Results
1. Tuning Resource Allocation Values
To allocate specific amounts of cores and memory to each task, the HPC batch
scheduler must be configured to enable consumable resources. With Slurm, this is set in
slurm.conf by specifying both “SelectType=select/cons_res” and
“SelectTypeParameters=CR_Core_Memory.” Additional detail on consumable resources
is available in the Slurm documentation [7].
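A minimal slurm.conf fragment with these two settings might look like the following (all other required slurm.conf entries are omitted):

```
# slurm.conf (fragment): treat cores and memory as consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
```

With consumable resources enabled, each job can then request specific amounts, for example with sbatch --cpus-per-task=4 --mem=14G, so the scheduler packs tasks onto nodes by both cores and memory.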
Another key component of resource tuning is Hyperthreading. With Hyperthreading
turned on (HT=On) two processes can be executed simultaneously on a single physical
core. Testing shows a 10% overall pipeline speedup with HT=On. With HT=On, a task is
allocated a set number of threads, with two threads available per physical core.
The most compute-intensive task in the Germline Variant Discovery pipeline is BWA.
Manipulating the number of threads per shard for BWA has a substantial impact on the
overall runtime of the pipeline, as shown in Figure 2A.
Fig. 2.Impact of Increasing threads-per-shard on BWA Performance.
Figure 2A shows the effect of increasing threads per shard on single shard runtime for
BWAmem. Figure 2B shows the effect of increasing the number of threads per shard
on total corehours consumed.
Figure 2A shows the impact of increasing the number of threads per shard on the shard
runtime. When two threads are allocated per shard, the runtime is 200 minutes.
Increasing to 16 threads per shard decreases the runtime by 10X down to 27 minutes.
Increasing the thread count higher than 16 threads per shard has limited positive
impact on shard runtime. Figure 2B shows the impact, however, on total corehours.
Increasing the number of threads per shard gradually increases the total corehours
consumed on the cluster. When processing a single WGS, 16 threads per shard provides
a fast task runtime while limiting corehours.
S1 Table provides a detailed list of thread count and memory recommendations for
each individual task in the pipeline, including BWA. These values were determined
empirically specifically optimizing for throughput processing. The Discussion covers
considerations when optimizing for fastest single sample runtime, as well as
considerations for cloud infrastructure.
While 16 threads per shard results in the fastest single shard runtime, and the fastest
runtime for BWA, it does not necessarily result in the best throughput, or number of
genomes that can be processed on a system per day. On a 4-server, 2-socket system
with 24-core CPUs and HT=On, there are 384 threads available at any given
time. Setting BWA to consume 16 threads per shard for 24 shards results in BWA
consuming all 384 threads for a single WGS. Adjusting this thread count to, for example,
4 threads per shard, results in a longer runtime for BWA but allows for processing four
WGS samples in parallel.
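The throughput trade-off above reduces to simple integer arithmetic (assuming, as in the text, 4 servers with two 24-core sockets each and HT=On):

```python
# Cluster capacity: 4 servers x 2 sockets x 24 cores x 2 threads/core
servers, sockets, cores, threads_per_core = 4, 2, 24, 2
total_threads = servers * sockets * cores * threads_per_core  # 384

shards = 24  # BWA shards per WGS sample

def concurrent_samples(threads_per_shard):
    # number of WGS samples whose BWA stage can run fully in parallel
    return total_threads // (shards * threads_per_shard)

print(concurrent_samples(16))  # BWA consumes all 384 threads: 1 sample
print(concurrent_samples(4))   # 4 samples can run in parallel
```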
While this section has focused on BWA tuning, similar methods were used to identify
optimal thread and memory allocations for each task in the pipeline. These
recommended values can be found in S1 Table.
Figure 3.Impact of scattercount on HaplotypeCaller Runtime and total Core
Hours.
As scattercount increases, the runtime of the longest running shard decreases (A) while
the total corehours consumed increases (B).
Figure 3B shows the relationship between scattercount and total corehours consumed
by HaplotypeCaller: as HaplotypeCaller is split into more small jobs, the total
corehours consumed gradually increases.
It’s important to note that scattercount cannot be arbitrarily set without considering
potential artifact generation. For this reason, scattercount is set specifically to 48.
Concordance analysis is always required when tuning scattercount to ensure fidelity.
While BWA and HaplotypeCaller are the two most compute-intensive tasks in the
pipeline, one of the longest running tasks is the single threaded MarkDuplicates.
MarkDuplicates takes in a BAM or SAM file and compares and identifies duplicate reads.
The uncompressed files processed by MarkDuplicates for a 30X human whole genome
sequence can total over 200GB in size. The task is highly dependent on fast local
storage for processing these datasets.
Figure 4 shows the impact of running with a local Solid State Drive (SSD) compared to
running without an SSD and just using the parallel file system. With an SSD,
MarkDuplicates runtime is 2.5 hours. Without an SSD, MarkDuplicates runtime is 37.6
hours.
Fig. 4.Local disk for MarkDuplicates Performance.
The runtime for the entire Germline Variant Calling pipeline drastically decreases with
the use of a local SSD. This is primarily due to the decrease in runtime of
MarkDuplicates (blue).
S2 Table includes an NVMe Pxxx SSD in each of the compute servers, largely for the
sake of MarkDuplicates processing.
Ultimately each shard of each task is executed on a single thread of a CPU. Ensuring
these tasks are able to take advantage of the underlying CPU features is a key factor for
performance.
As of GATK 4.0, a number of tasks in the Germline Variant Discovery pipeline have been
accelerated to take advantage of Intel AVX-512 Instructions through the Genomics
Kernel Library (GKL). GKL is developed and maintained by Intel and is distributed open
source with GATK [8].
GKL includes compression and decompression from Intel’s ISA-L and zlib libraries, as
well as AVX-512 implementations of PairHMM and Smith-Waterman [8-10]. PairHMM
and Smith-Waterman are two key kernels included in a number of genomics tasks,
including HaplotypeCaller.
Figure 5 shows the benefit of GKL compression for the three tasks with the largest
input file sizes: ApplyBQSR (133GB), GatherBamFiles (62GB) and MarkDuplicates
(222GB). GKL provides compression at levels from 1-9 (CL=1-9). CL=1 (orange) with
GKL provides a 2-4X compression ratio relative to no compression (blue). The
compression ratio continues to improve as compression levels increase, up to level 5.
Fig. 5.Compression with GKL for tasks with largest input file sizes.
The Genomics Kernel Library (GKL) performs compression at levels 1 through 9 (CL=1-
9). Compression ratios relative to CL=1 are shown in Figure 5A for three different tasks
in the pipeline. Figure 5B shows the impact of each compression level on completion
time for MarkDuplicates.
Figure 5B shows the task runtime as a function of compression level for these three
tasks.
As the name suggests, MarkDuplicates checks the input BAM file for duplicate reads,
and tags any identified duplicates [11]. In doing so the task reads and writes small (kB)
intermediary files throughout the 2+ hours of processing. Each of these intermediary
files is compressed and decompressed. Because of this, higher compression levels
result in a high runtime cost with this task (see Figure 5B).
Based on these results, compression level is set to CL=2 in GATK 4.2.0.0. This
compression level provides a good balance between a high compression ratio across
tasks (Figure 5A) and a low runtime for MarkDuplicates (Figure 5B).
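The shape of this ratio-versus-runtime trade-off can be reproduced in miniature with Python's standard zlib module. This is only an illustration of the trade-off, not a measurement of GKL (which uses optimized ISA-L and zlib implementations), and the payload below is a synthetic stand-in for an intermediary file:

```python
import time
import zlib

# Compressible toy payload standing in for an intermediary BAM chunk.
data = (b"ACGT" * 1000 + b"TTAGGC" * 500) * 200

for level in (1, 2, 5, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"CL={level}: ratio {ratio:5.1f}x, {elapsed * 1000:6.2f} ms")
```

Higher levels never produce larger output on this payload, but the time per call grows, which is exactly why a task like MarkDuplicates, which compresses and decompresses many small intermediary files, is sensitive to the compression level.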
Figure 6 shows the difference between HaplotypeCaller runtime with the AVX-512
implementations of both kernels (left) compared to with the original implementations
with no AVX instructions (right). The middle bar shows the runtime with the AVX512
implementation of SW and the Java AVX2 implementation of pairHMM.
Fig. 6.Impact of GKL AVX flags on HaplotypeCaller Performance.
HaplotypeCaller performance, as measured by task runtime in seconds, drastically
improves with the use of GKL pairHMM and SmithWaterman (SW). Runtimes for
pairHMM are shown in blue; SW in orange.
Note the y-axis is a log scale. Without the AVX implementations, HaplotypeCaller takes
125,754 seconds, or 35 hours, to complete. With the GKL AVX512 implementations the
same task completes in less than one hour.
Notably, users do not need to set any special flags to run with the GKL implementations.
As shown in the HaplotypeCaller documentation, running with default flag
(FASTEST_AVAILABLE) automatically detects if the underlying CPU includes support for
AVX-512 instructions and, if so, deploys the GKL implementation [12].
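GKL performs this detection internally; as a minimal sketch, on Linux you can confirm what the hardware advertises by reading the CPU flags from /proc/cpuinfo:

```python
# Sketch: check whether the CPU advertises AVX-512 support on Linux by
# inspecting /proc/cpuinfo. This only shows what the hardware reports;
# GATK/GKL does its own detection via FASTEST_AVAILABLE.
def has_avx512(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            text = f.read()
    except OSError:
        return False  # not Linux, or /proc unavailable
    # AVX-512 appears as a family of flags, e.g. avx512f (foundation)
    return "avx512f" in text

print("AVX-512 available:", has_avx512())
```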
Discussion
As illustrated above, optimal performance of the Germline Variant Discovery pipeline is
dependent on (1) efficiently distributing tasks across the cluster; (2) tuning resource
allocation values; (3) utilization of fast local storage; and (4) libraries that take
advantage of underlying CPU features.
Intel and Broad Institute have partnered to form the Intel-Broad Center for Genomics
Data Engineering. The Genomics Kernel Library (GKL) is a direct outcome of this joint
engineering Center. As part of this partnership, many of the configuration
recommendations outlined above (eg the Slurm backend for Cromwell and resource
allocation values) are directly incorporated into the Broad Institute workflows and
documentation.
Even for single threaded tasks, it can make sense to increase the
thread allocation to increase throughput. As shown in S1 Table, most tasks in the
pipeline are allocated two threads each despite being single threaded. This is
specifically to optimize for throughput. The second thread (1) allows for Java garbage
collection and (2) intentionally limits the number of jobs concurrently running on the system.
Limiting the number of overall jobs helps ensure each task has sufficient memory while
also leaving sufficient memory for scheduling and system level operations.
For scenarios where single sample processing time is the highest priority, increasing the
threads and memory allocated per task will reduce single sample runtime, while
decreasing the overall throughput of the cluster. A workflow optimized for single sample
runtime is provided in the same repository as the throughput WDL (see S3 Table).
Increasing threads per task is also beneficial in the cloud. When deploying the Germline
Variant Discovery pipeline through the Broad Institute’s Platform as a Service (PaaS)
Terra.bio, each shard of each task is allocated its own VM with a set number of virtual
CPUs (vCPUs) and DRAM. In this scenario, each of the 24 BWA shards is allocated 16
vCPUs, compared to the four threads per shard recommended for local deployments.
Allocating 16 vCPUs to each BWA shard does not negatively impact the runtime of
other tasks and samples on the cloud, since there are no infrastructure scale
constraints. Van der Auwera and O’Connor provide a detailed guide on best practices for
deploying Broad Institute workflows on the cloud [4].
As shown, the optimal workflow configuration is dependent on both underlying
infrastructure and key performance metrics (e.g. throughput vs single sample runtime).
Profiling workloads with methods described here can be extended to genomics
workflows beyond Germline Variant Calling. Future work will include tuning additional
workflows as well as comparing cloud and local performance considerations.
Key Points
- Because genomics workflows consist of pipelines of tasks with heterogeneous
compute requirements, achieving maximum throughput requires efficient tuning
and orchestration of these workflows.
- Tasks need to be efficiently sharded and distributed across the cluster; each task
needs to be allocated the optimal number of threads and memory; and local disk
must be used for temporary storage.
Supporting Information
S1 Table. Recommended Resource Allocations for Germline Variant Discovery
tasks.
S2 Table. Hardware Configuration Used for Testing.
S3 Table. Software Configuration Used for Testing.
Acknowledgements
The authors thank Michael J. McManus PhD for his input and guidance from study
conception through data analysis. We thank Kyle Vernest, Kylee Degatano, and Louis
Bergelson for technical support throughout benchmarking.
Objective: Sanger Institute uses the NVIDIA DGX server to power its mutational cancer
signature analysis pipeline, improving performance by 30x.
Customer: Sanger Institute
Use Case: Performance Improvement
Technology
“Research projects such as the Mutographs Grand Challenge are just that—
grand challenges that push the boundary of what’s possible,” said Pete
Clapham, leader of the Informatics Support Group at the Wellcome Sanger
Institute. “NVIDIA DGX systems provide considerable acceleration that
enables the Mutographs team to, not only meet the project’s computational
demands, but to drive it even further, efficiently delivering previously
impossible results.”
In the current project, researchers are studying DNA from the tumors of
5,000 patients with five cancer types: pancreas, kidney, colorectal, and two
kinds of esophageal cancer. Five synthetic data matrices that mimic one type
of real-world mutational profile were used to estimate compute
performance. An NVIDIA DGX-1 system runs the NMF algorithm against the
five matrices, while the corresponding replicated CPU jobs are executed in
Docker containers on OpenStack virtual machines (VMs), specifically 60 cores
of Intel Xeon Skylake processors running at 2.6 GHz with 697.3 GB of random-
access memory (RAM).
The NVIDIA DGX-1 is an integrated system for AI featuring eight NVIDIA V100
Tensor Core GPUs that connect through NVIDIA NVLink, the NVIDIA high-
performance GPU interconnect, in a hybrid cube-mesh network. Together
with dual-socket Intel Xeon CPUs and four 100 Gb NVIDIA Mellanox®
InfiniBand network interface cards, the DGX-1 delivers one petaFLOPS of AI
power, for unprecedented training performance. The DGX-1 system software,
powerful libraries, and NVLink network are tuned for scaling up deep learning
across all eight V100 Tensor Core GPUs to provide a flexible, maximum
performance platform for the development and deployment of AI applications
in both production and research settings.
[Figure: NMF execution time (days) on CPU/VM versus DGX-1 for each dataset.]

Dataset       CPU/VM (days)   DGX-1 (days)
data_set_1    20.858          0.591
data_set_2    21.053          0.75
data_set_3    17.716          0.446
data_set_4    17.707          0.609
data_set_5    6.634           0.235
data_set_6    13.219          0.487
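The per-dataset speedup is simply the ratio of the two execution times. The short sketch below, with values transcribed from the table above, recovers figures in line with the roughly 30x improvement quoted earlier (how NVIDIA derived its exact headline number is not stated, so the mean here is only a sanity check).

```python
# Execution times in days, transcribed from the table above (CPU/VM, DGX-1).
times = {
    "data_set_1": (20.858, 0.591),
    "data_set_2": (21.053, 0.75),
    "data_set_3": (17.716, 0.446),
    "data_set_4": (17.707, 0.609),
    "data_set_5": (6.634, 0.235),
    "data_set_6": (13.219, 0.487),
}

# Speedup of DGX-1 over the CPU/VM baseline for each dataset.
speedups = {name: cpu / dgx for name, (cpu, dgx) in times.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")

# The mean across datasets is close to the 30x headline figure.
print(f"mean: {sum(speedups.values()) / len(speedups):.1f}x")
```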
Faster Results and More Complex Experiments Hold the Promise to
Improve Human Health
The speedup and compute power of GPUs are enabling researchers to obtain
scientific results faster, run a greater number of experiments, and run more
complex experiments than were previously possible, paving the way for
scientific discoveries that could transform the future of cancer treatments.