
High Performance Computing in Genomics Research: An Indispensable Tool for Advancement

Andrew Dean
Supplying HPC, Storage, Cloud and AI - Solutions & Services - to the Research and Technical computing market
February 28, 2023

The field of genomics, which deals with the study of genetic material, has
made tremendous progress in recent years, largely due to advancements in
sequencing technology. The human genome, once considered a mysterious
and intricate entity, is now being decoded at an unprecedented rate,
uncovering new knowledge about the genetic basis of diseases and
disorders. However, the sheer quantity of data produced by modern
sequencing methods presents a substantial computational challenge. This is
where High Performance Computing (HPC) comes in, providing the necessary
processing power to drive advancements in the field of genomics.

One of the most critical advantages of HPC in genomics is speed. Many computers
running in parallel can analyse vast amounts of data in a matter of hours or
even minutes. This can be especially beneficial in the realm of precision medicine, where swift
diagnoses are of utmost importance. With the help of HPC, patients can
receive a diagnosis and treatment plan based on their unique genetic
makeup in a fraction of the time it would take through manual methods.

Accuracy is another critical benefit of compute in whole genome sequencing.


With the use of machine learning algorithms, computers can identify errors
and improve data quality, resulting in more trustworthy sequencing results.
For example, the application of algorithms to detect and correct errors in
genomic data has been a major factor in advancing the accuracy of cancer
genomics research. With the assistance of HPC and evolving deep learning
techniques, researchers can now identify the specific genetic mutations that
drive the development of cancer, leading to more targeted and effective
treatments.

As demand for sequencing continues to grow, HPC provides the necessary
capacity to handle the growing volume of data generated by large-scale
sequencing projects. For instance, the National Institutes of Health’s Genome
Project has been able to sequence the genomes of thousands of individuals,
due in no small part to the power of HPC.

Additionally, storage technologies developed for HPC applications are also
invaluable in managing the massive amounts of data produced by
sequencing, ensuring data security, accuracy and efficiency at huge scale and at
an affordable cost. Furthermore, public cloud and on-premises solutions
offering ‘cloud-like’ functionality ensure data can be securely stored and
shared with researchers and teams globally, fostering improved collaboration
and faster advancements in the field.

Finally, compute helps to lower the costs associated with whole genome
sequencing by automating many tasks. This can make the field more
accessible to a wider range of organisations and researchers, especially
those with limited budgets. For example, the development of low-cost
sequencing technologies such as Oxford Nanopore Technologies’ MinION
sequencer, the first portable real-time device for DNA and RNA sequencing,
has been made possible through edge computing, enabling researchers to
perform sequencing in remote or resource-constrained settings and yielding
new insights into the genomes of understudied species and populations.

In conclusion, HPC is a crucial piece of the puzzle in the advancement of the
field of genomics. Its ability to speed up data analysis, improve accuracy,
scale to meet demand, manage data efficiently, and lower costs makes it an
indispensable asset to the field. As the study of genetics continues to
advance, the role of HPC in genomics will only become more important.

Genomics
Tap into the Latest Advances in Genome Research with
Accelerated Computing
As genomic testing becomes more mainstream, sequencing our
3.2 billion DNA base pairs is critical to identifying mutations that
can cause disease. Advancements in high-throughput
instruments have decreased the cost of sequencing but also
increased the amount of data that requires analysis. Leveraging
GPUs to accelerate this analysis can vastly decrease runtime
and costs compared to CPU-based approaches.
Supercharging Genomics Research

Powering Population Genomics

Group 42’s supercomputer, powered by NVIDIA DGX systems, will help researchers understand the genomes of United Arab Emirates’ citizens and improve healthcare in the country.

Understanding Mutational Signatures of Cancer
Researchers from the Wellcome Sanger Institute and UC San Diego collaborated with
NVIDIA to accelerate the analysis of molecular signatures of somatic mutation by 30X on
NVIDIA DGX systems.


Developing Personalized Treatments


Using the power of GPUs, the Translational Genomics Research Institute analyzes the
genome sequences of children with rare neurological diseases in under a day to quickly
develop the most effective therapies.

Latest Drug Design and Development Webinars

Oxford Nanopore Sequencing for COVID-19 Monitoring

Sequencing a virus can help characterize it and help health workers understand its identity, mutations, and transmission. Explore how Oxford Nanopore sequencers, using GPU-accelerated analysis software, are being used to monitor and study the ongoing COVID-19 pandemic.

Shortening Whole Genome Analysis from Days to Minutes

Learn about the newest features and performance improvements in the latest release of NVIDIA Clara Parabricks, a turnkey solution that accelerates production germline, somatic, and RNA variant calling pipelines.


Applying HPC to Genome Analysis


NVIDIA Clara Parabricks is a complete portfolio of off-the-shelf solutions for genomics
analysis coupled with a developer toolkit for new application development. Supporting DNA
to RNA analysis and application workflows for primary, secondary, and tertiary analysis,
Clara Parabricks is built to optimize acceleration, accuracy, and scalability.

Artemis Supercomputer on the Hunt for Deeper Understanding of Genomics
Group 42’s supercomputer, powered by NVIDIA DGX systems, fuels
national genome program to enhance understanding of UAE citizens’
genomes, improve healthcare and fight COVID-19.
May 14, 2020 by Marc Domenech
Ever wondered what you’re really made of?


Your genome is a unique genetic code that determines your characteristics. It’s a
specific combination of DNA molecules that makes you you.

Studying the entire genetic code of an individual or a group of individuals can help us
gain a better understanding of diseases, enable precision medicine and power
pharmacogenomics — how genes affect a person’s response to drugs.

As part of a national project launched by Abu Dhabi’s Department of Health, Group 42
is harnessing its Artemis supercomputer to decode the human genome and improve
patient care. Powered by NVIDIA GPUs, Artemis is the 26th fastest system in the world.

G42, based in Abu Dhabi, develops and deploys holistic and scalable AI and cloud
computing offerings. Through its Inception Institute of Artificial Intelligence, it carries out
fundamental research on AI, big data and machine learning.

Building a world-class AI supercomputer normally takes six months or longer. In just
three weeks, G42 designed, built and deployed Artemis with NVIDIA, using the DGX
SuperPOD reference architecture and Mellanox AI networking fabric.

The Population Genome Program


Built with 81 NVIDIA DGX systems, Artemis can deliver a total of 7.2 petaflops of
double-precision HPL performance and run workloads 120x faster than G42’s previous
system.

Now the supercomputer is being put to work on the Population Genome Program. This
national effort aims to enhance scientific understanding of Abu Dhabi citizens’ genomes
and improve healthcare in the country.

Until now, understanding genetic variation in the Arab population has been a
challenge due to the lack of a high-quality Emirati reference genome. The Population
Genome Program will enrich available data by producing a reference genome specific
to citizens of the United Arab Emirates.

The program aims to be the first of its kind in the world to then use this as a baseline
and incorporate the genomic data into healthcare management processes.

“Embracing innovation and providing a comprehensive healthcare programme in the
Emirate of Abu Dhabi remains at the forefront of our priorities. Two of the world’s most
exciting technologies — DNA sequencing and AI — will come together in this project,”
explained H.E. Sheikh Abdulla Bin Mohamed Al Hamed, Chairman of Department of
Health-Abu Dhabi, in a press statement.

Accelerating Processing of Genomic Data


In the first phase of the program, the genomes of 10,000 individuals are set to be
tested. To ensure the highest throughput and accurate analysis, both short-read and
long-read genome sequencing platforms will be used, leveraging G42’s collaboration
with BGI and Oxford Nanopore — two global genome sequencing leaders.

Anonymized DNA samples will first be collected and processed using Oxford Nanopore
PromethION sequencers. These devices, which contain embedded NVIDIA GPU
technology to enable AI at the edge, will help to accelerate the processing of genomic
data.

The processed data will be supplied, in a graphical format, to Artemis for AI-powered
analysis, with NVIDIA Parabricks software supporting the population analysis.

The final results will be provided to the research and medical community to help deliver
more effective patient care. This could include more advanced treatments for conditions
such as cancer, schizophrenia, autism, and cardiovascular and neuronal diseases.

“With NVIDIA’s GPU technology we’re able to provide a highly optimized AI platform for
the national Population Genome Program and accelerate data processing,” said Min S.
Park, director of Genome Programs at G42. “This collaboration supports our goals of
developing a program for personalized care across the UAE, bringing experts, data and
technology together for improving patient care.”

Combatting COVID-19
G42 is also using its supercomputing prowess in the battle against COVID-19, having
recently established a new detection laboratory in Masdar City, Abu Dhabi. This facility
can, on a daily basis, support tens of thousands of real-time reverse transcription
polymerase chain reaction (RT-PCR) tests. These tests detect the presence of the
SARS-CoV-2 virus in samples taken from patients.

In addition, G42 is involved in the production of COVID-19 diagnostic kits, the supply of
thermal sensors and, working in coordination with local and international health
authorities, assisting in the creation of effective prevention and detection protocols to
contain the virus.
“Technology will play a crucial role in curbing the spread of the coronavirus and the
superior computing capability of Artemis can help in many ways — from rapid vaccine
development, where computer simulations may replace manual experiments and
reduce the development time of a vaccine, to mapping and predicting trends in the
outbreak, as well as predicting virus mutations,” said Peng Xiao, CEO of G42.

High Performance Computing for genomic applications
13 - 14 December 2018

Basel

Cancellation deadline:
23 November 2018

Michal Okoniewski, Samuel Fux, Diana Coman-Schmid

Academic: 0 CHF
For-profit: 0 CHF

No future instance of this course is planned yet


Overview
The course "High Performance Computing for genomic applications" is
organized for the D-BIOL researchers (including PhD students) by Scientific IT
Services on 13-14 December 2018. The main goal of the course is to increase
IT competences of researchers and encourage them to use Euler cluster for
bioinformatic analyses: independently or together with a bioinformatician in a
co-analysis mode.

Some modules include hands-on exercises, so participants are expected to bring
their own laptops. On Friday, Dec 14th, a time slot is planned in which
participants can discuss or work on solutions for their own data analysis
issues with the instructors.

On the second day the course will also include a module related to cluster use
for personalized medicine: "Data & Computing Services for Personalized
Health Research".

Application
Classes can be attended or skipped in a "pick-and-mix" mode, depending on the
needs and skills of the participant; however, registration for the full course
is required by filling in the form:

https://goo.gl/forms/tscMb9TAURIlWtFI2

The course is limited to 20 participants; confirmations will be sent after the
registration process on a first-come, first-served basis.

Location
BSSE ETH in Basel

Additional information
Coordination: Michal Okoniewski, Scientific IT Services ETH

Instructors
Michal Okoniewski, Samuel Fux, Diana Coman-Schmid

Schedule
High Performance Computing for genomic applications

Day 1, 13 Dec 2018, Erasmus Room

13:00-13:40 Linux command line refresh


13:40- 14:20 Basics of shell scripting

14:20- 14:30 Coffee break

14:30- 15:00 Introduction to the Euler cluster and HPC

15:00- 16:00 LSF queueing system

16:00- 16:30 Genomic formats

16:30- 17:15 Working with genomic data using AWK

Day 2, 14 Dec 2018, Erasmus Room

9:30 - 10:00 Basics of useful R

10:00- 10:30 Basic R scripts for RNA-seq statistics

10:30- 10:45 Coffee break

10:45- 12:00 Genomic software on the Euler cluster

12:00- 13:00 Lunch break

13:00- 13:45 Data & Computing Services for Personalized Health Research

13:45- 14:30 Workflow orchestration with snakemake - demo

14:30-16:30 Hackathon: participants' own problems and data

Cloud-SPAN Genomics Course


Cloud-SPAN is a project run by the Biology Department of the University of York with the aim
of developing advanced modules covering the specialised knowledge and skills needed to generate
and analyse ‘omics data using cloud-based High Performance Computing (HPC) resources.
This course is based on Data Carpentry’s Genomics Workshop, streamlined and extended to
serve as the foundation course for the Cloud SPAN advanced modules.
The course teaches data management and analysis for genomics research including: (1) best
practices for organization of bioinformatics projects and data, (2) use of command-line utilities
to connect to and use cloud computing and storage resources, (3) use of command-line tools for
data preparation, (4) use of command-line tools to analyze sequence quality and perform and
automate variant calling.
The course is designed to be taught over four half days of instruction.

Getting Started

This course assumes that learners have no prior experience with the tools covered in the
course. However, learners are expected to have some familiarity with biological concepts,
including the concept of genomic variation within a population. Participants should bring their
own laptops and plan to participate actively.
To get started, follow the directions in the Setup tab to get access to the required software and
data for this workshop.

Data

This course uses data from a long term evolution experiment published in 2016: Tempo and
mode of genome evolution in a 50,000-generation experiment by Tenaillon O, Barrick JE,
Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S,
Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959)
All of the data used in this workshop can be downloaded from Figshare. More information
about this data is available on the Data page.

Course Overview

Lesson | Overview
Project management for cloud genomics | Learn how to structure your data and metadata, plan for an NGS project, and … the command line.
Using the command line | Learn to use the command line to navigate your file system, and create, copy, move, and remove files.
Data preparation and organisation | Learn how to automate commonly used workflows, organise your file system, and perform quality control.
Data processing and analysis | Learn how to filter out poor quality data, align reads to a reference genome, and automate these tasks for efficiency and accuracy.
Teaching Platform
This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. All
the software and data used in the workshop are hosted on an Amazon Machine Image (AMI). To
access your AMI instance remotely, follow the directions in the Setup.

Licensed under CC-BY 4.0 2021–2022 by Cloud-SPAN


Licensed under CC-BY 4.0 2018–2022 by The Carpentries
Licensed under CC-BY 4.0 2016–2018 by Data Carpentry

Cloud-SPAN Genomics Course: Setup
Overview
The software and data used for analysis during the course are hosted on an Amazon Web
Services (AWS) virtual machine (VM) instance. A copy of such an instance, requiring no
previous setup on your part, will be made available to you at no cost by the Cloud-SPAN team.
To access and use the resources in your AWS instance from your personal computer, you will
use a command-line interface (CLI) program that is widely known as the shell or terminal. The
shell is available by default for Linux and Mac users (so they don’t need to install any software).
Windows users will need to install Git for Windows on their computer as described below prior
to the course. Git includes Git Bash which is a Windows version of the Unix Bash shell, the
most widely used shell and the default shell in Linux systems.
You will need to use a laptop or desktop to take this course. Due to the need both to follow the
instructor in Zoom and to perform analyses, tablets and iPads are not suitable for use during
this course. Having both an up-to-date browser and a stable internet connection is important.
Before the course you will receive via email the information that you will need to login to your
AWS instance during the course.
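Connecting is then a single ssh command from that shell (the user name and address below are placeholders; use the login details sent to you by email):

BASH
# Placeholder values: replace username and address with the details provided
# for your own instance before the course.
$ ssh username@your-instance-address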
Installing Git Bash on your Windows computer
The steps below correspond to the installation of Git for Windows version 2.33.1 from scratch.
The installation of a more recent version, or updating a previously installed version, may show
different wording in the screen messages mentioned below or may vary slightly in the number of
steps to follow. Choose as many of the options below as possible.

- Click on this link: Git for Windows download page
- Once in that page, click on Download to download the installer.
- Once the installer is downloaded,
  - double click on it
  - you will then be asked some questions and to select an option for each question
  - each question is shown below in italics, and the selection to be made is shown in bold
  - during the actual installation each question will be displayed at the top of a small window, but we are showing only the small window for the question that requires somewhat more help
  - the first question is next:
- The app you’re trying to install isn’t a Microsoft-verified app ..?
  - Click on Install anyway
- Do you want to allow this app to make changes to your device?
  - Click on Yes
- GNU General Public License
  - click on Next
- Select Destination Location
  - click on Next (don’t change the location shown)
- Select Components
  - click on Additional Icons (it will also select the “On the Desktop” option)
  - then click on Next
- Select Start Menu Folder
  - click on Next (don’t change the folder name shown)
- Choosing the default editor used by Git
  - select Use the nano editor by default and click on Next
  - NB: you may need to click on the dropdown menu and scroll up with the mouse to see this option
- Adjusting the name of the initial branch in new repositories
  - keep the selected (or select the) option Let Git decide and click on Next
- Adjusting your PATH environment
  - keep the selected, recommended option Git from the command line and also from 3rd-party software (or select it), and click on Next
  - NB: if this option is not selected, some programs that you need for the course will not work properly. If this happens, rerun the installer and select the appropriate option
- Choosing the SSH executable
  - keep the selected (or select the) option Use bundled OpenSSH and click on Next
- Choosing HTTPS transport backend
  - keep the selected (or select the) option Use the OpenSSL library and click on Next
- Configuring the line ending conversions
  - keep the selected (or select the) option Checkout Windows-style, commit Unix-style line endings and click on Next
- Configuring the terminal emulator to use with Git Bash
  - keep the selected (or select the) option Use MinTTy (the default terminal of MSYS2) and click on Next
- Choose the default behaviour of git pull
  - keep the selected (or select the) option Default (fast-forward or merge) and click on Next
- Choose a credential helper
  - keep the selected (or select the) option Git Credential Manager Core and click on Next
- Configuring extra options
  - keep the selected option (Enable File System Caching) and click on Next
- Configuring experimental options
  - click on Install without selecting any option
- Click on Finish

Run Git Bash by double clicking on the Git Bash icon in your Desktop screen.

Exit Git Bash by pressing Ctrl-d – that is pressing the keys Ctrl and d simultaneously.
Bash stands for Bourne Again Shell. In addition to being a CLI, the Bash shell is a powerful
command programming language with a long and interesting history, which you can read about in
the Wikipedia entry for Bash.

Genomics Workshop Overview

Summary and Setup

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools
for working with data so that they can get more done in less time, and with
less pain. This workshop teaches data management and analysis for
genomics research including: best practices for organization of
bioinformatics projects and data, use of command-line utilities, use of
command-line tools to analyze sequence quality and perform variant calling,
and connecting to and using cloud computing. This workshop is designed to
be taught over two full days of instruction.

Please note that workshop materials for working with Genomics
data in R are in “alpha” development. These lessons are available
for review and for informal teaching experiences, but are not yet
part of The Carpentries’ official lesson offerings.

Interested in teaching these materials? We have an onboarding video and
accompanying slides available to prepare Instructors to teach these lessons.
After watching this video, please contact [email protected] so that we
can record your status as an onboarded Instructor. Instructors who have
completed onboarding will be given priority status for teaching at centrally-
organized Data Carpentry Genomics workshops.

FREQUENTLY ASKED QUESTIONS

Read our FAQ to learn more about Data Carpentry’s Genomics workshop, as
an Instructor or a workshop host.

GETTING STARTED
This lesson assumes that learners have no prior experience with the tools
covered in the workshop. However, learners are expected to have some
familiarity with biological concepts, including the concept of genomic
variation within a population. Participants should bring their own laptops and
plan to participate actively.

To get started, follow the directions in the Setup tab to get access to the
required software and data for this workshop.

DATA

This workshop uses data from a long term evolution experiment published in
2016: Tempo and mode of genome evolution in a 50,000-generation
experiment by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard
JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and
Lenski RE. (doi: 10.1038/nature18959)

All of the data used in this workshop can be downloaded from Figshare. More
information about this data is available on the Data page.

Workshop Overview

Lesson | Overview
Project organization and management | Learn how to structure your metadata, organize and document your genomics data, and … the sequence read archive (SRA) database.
Introduction to the command line | Learn to navigate your file system, create, copy, move, and remove files and directories, and … wildcards.
Data wrangling and processing | Use command-line tools to perform quality control, align reads to a reference genome, and …
Introduction to cloud computing for genomics | Learn how to work with Amazon AWS cloud computing and how to transfer data …

Optional Additional Lessons

Lesson | Overview
Intro to R and RStudio for Genomics | Use R to analyze and visualize between-sample variation.

Teaching Platform
This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All the software and data used in the workshop are hosted
on an Amazon Machine Image (AMI). If you want to run your own instance of
the server used for this workshop, follow the directions in the Setup tab.

Common Schedules
Schedule A (2 days OR 4 half days)

- Half-day 1: Project organization and management & Introduction to the command line
- Half-day 2: Introduction to the command line (continued)
- Half-day 3 & 4: Data wrangling and processing

Schedule B (2 days OR 4 half days)

- Half-day 1: Project organization and management & Introduction to the command line
- Half-day 2: Introduction to the command line (continued)
- Half-day 3 & 4: Intro to R and RStudio for Genomics

Schedule C (3 days OR 6 half days)

- Half-day 1: Project organization and management & Introduction to the command line
- Half-day 2: Introduction to the command line (continued)
- Half-day 3 & 4: Data wrangling and processing
- Half-day 5 & 6: Intro to R and RStudio for Genomics

Overview
This workshop is designed to be run on pre-imaged Amazon Web Services
(AWS) instances. All of the data and most of the software used in the
workshop are hosted on an Amazon Machine Image (AMI). Some additional
software, detailed below, must be installed on your computer.

Please follow the instructions below to prepare your computer for the
workshop:

- Required additional software + Option A, OR
- Required additional software + Option B

Required additional software

This lesson requires a working spreadsheet program. If you don’t have a
spreadsheet program already, you can use LibreOffice. It’s a free, open
source spreadsheet program. Directions to install are included for
Windows, Mac OS X, and Linux systems below. For Windows, you will also
need to install either Git Bash, PuTTY, or the Ubuntu Subsystem.

WINDOWS

MAC OS X

LINUX

Option A (Recommended): Using the lessons with Amazon Web Services (AWS)
If you are signed up to take a Genomics Data Carpentry workshop, you
do not need to worry about setting up an AMI instance. The Carpentries staff
will create an instance for you and this will be provided to you at no cost.
This is true for both self-organized and centrally-organized workshops. Your
Instructor will provide instructions for connecting to the AMI instance at the
workshop.
If you would like to work through these lessons independently, outside of a
workshop, you will need to start your own AMI instance. Follow
these instructions on creating an Amazon instance. Use the AMI ami-
04dd77cd58b3ec654 (Data Carpentry Genomics with R 4.2) listed on the
Community AMIs page. Please note that you must set your location as N.
Virginia in order to access this community AMI. You can change your location
in the upper right corner of the main AWS menu bar. The cost of using this
AMI for a few days, with the t2.medium instance type is very low (about USD
$1.50 per user, per day). Data Carpentry has no control over AWS pricing
structure and provides this cost estimate with no guarantees. Please read
AWS documentation on pricing for up-to-date information.
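If you prefer the command line to the AWS web console, a rough sketch of launching the same AMI with the AWS CLI is shown below; the key pair and security group names are placeholders you would replace with your own, and the console workflow in the linked instructions remains the supported route.

BASH
# Hypothetical launch of the community AMI in us-east-1 (N. Virginia) with the
# t2.medium instance type mentioned above. Key pair and security group are placeholders.
$ aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-04dd77cd58b3ec654 \
    --instance-type t2.medium \
    --key-name my-keypair \
    --security-group-ids sg-0123456789abcdef0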

If you’re an Instructor or Maintainer or want to contribute to these lessons,
please get in touch with us and we will start instances for you.

Option B: Using the lessons on your local machine
While not recommended, it is possible to work through the lessons on your
local machine (i.e. without using AWS). To do this, you will need to install all
of the software used in the workshop and obtain a copy of the dataset.
Instructions for doing this are listed below.

Data

The data used in this workshop is available on FigShare. Because this
workshop works with real data, be aware that file sizes for the data are large.
Please read the FigShare page for information about the data and access to
the data files.

More information about these data will be presented in the first lesson of the
workshop.

Software
Software | Version | Manual | Available for | Description
FastQC | 0.11.9 | Link | Linux, MacOS, Windows | Quality control tool for high throughput sequence data
Trimmomatic | 0.39 | Link | Linux, MacOS, Windows | A flexible read trimming tool for Illumina NGS data
BWA | 0.7.17 | Link | Linux, MacOS | Mapping DNA sequences against a large reference genome
SAMtools | 1.9 | Link | Linux, MacOS | Utilities for manipulating alignments in SAM format
BCFtools | 1.9 | Link | Linux, MacOS | Utilities for variant calling and manipulating VCFs and BCFs
IGV | Link | Link | Linux, MacOS, Windows | Visualization and interactive exploration of genomic datasets

QuickStart Software Installation Instructions

These are the QuickStart installation instructions. They assume familiarity
with the command line and with installation in general. As there are different
operating systems and many different versions of operating systems and
environments, these may not work on your computer. If an installation
doesn’t work for you, please refer to the user guide for the tool, listed in the
table above.

We have installed software using Conda. Conda is a package manager that
simplifies the installation process. Please first install Conda through the
Miniconda installer (see below) before proceeding to the installation of
individual tools. For more information on Miniconda, please refer to the
Conda documentation.

Conda

LINUX

MACOS
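As a minimal sketch, assuming a Linux x86_64 machine and the bioconda channel (the exact installer URL and package versions may differ from your setup), installing Miniconda and the tools listed in the table above might look like this:

BASH
# Sketch only: download and run the Miniconda installer, add the channels used
# by bioconda, then install the workshop tools with pinned versions.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda install fastqc=0.11.9 trimmomatic=0.39 bwa=0.7.17 samtools=1.9 bcftools=1.9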

FastQC

MACOS

FASTQC SOURCE CODE INSTALLATION


Test your installation by running:

BASH
$ fastqc -h

Trimmomatic

MACOS

TRIMMOMATIC SOURCE CODE INSTALLATION


Test your installation by running: (assuming things are installed in ~/src)

BASH

$ java -jar ~/src/Trimmomatic-0.39/trimmomatic-0.39.jar

SIMPLIFY THE INVOCATION, OR TO TEST YOUR INSTALLATION IF YOU INSTALLED WITH MINICONDA3:
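One plausible way to simplify the invocation, assuming the jar location used in the test command above, is a shell alias; with a miniconda3 install, a trimmomatic wrapper should already be on your PATH.

BASH
# Hypothetical alias so 'trimmomatic' can be called directly; add it to
# ~/.bashrc to make it permanent.
$ alias trimmomatic='java -jar ~/src/Trimmomatic-0.39/trimmomatic-0.39.jar'
$ trimmomatic -version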

BWA

MACOS

BWA SOURCE CODE INSTALLATION


Test your installation by running:

BASH

$ bwa

SAMtools

MACOS
SAMTOOLS VERSIONS

SAMtools has changed the command line invocation (for the better). But this
means that most of the tutorials on the web indicate an older and obsolete
usage.
Using SAMtools version 1.9 is important to work with the commands we
present in these lessons.
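You can confirm which version is on your PATH from the shell:

BASH
$ samtools --version   # the first line of output should report samtools 1.9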

SAMTOOLS SOURCE CODE INSTALLATION


Test your installation by running:

BASH

$ samtools

BCFtools

MACOS

BCF TOOLS SOURCE CODE INSTALLATION


Test your installation by running:

BASH

$ bcftools

IGV

- Download the IGV installation files
- Install and run IGV using the instructions for your operating system.

Materials licensed under CC-BY 4.0 by the authors
Template licensed under CC-BY 4.0 by The Carpentries


Learn Genetics Online
Understand the building blocks of life – and the
future of medicine.


The way we diagnose and treat diseases is changing, with new technologies
enabled by a deeper understanding of the human genome and its relationship
to health and disease.

In HMX Fundamentals Genetics, you’ll get an overview of fundamental concepts
behind the evolving fields of human genetics, genomics, and precision medicine.

This online certificate course is led by Harvard Medical School faculty and features:

- detailed animations and illustrations of medical concepts
- clinical application videos including real doctor-patient interactions
- ongoing, rigorous assessments to ensure content mastery
High-Performance Computing Breaks the Genomics Bottleneck
HPC enables accurate and rapid analysis for sequencing centers,
clinical teams, genomic researchers and developers of sequencing
equipment.

by Nathan Eddy

Nathan Eddy works as an independent filmmaker and journalist based in Berlin,
specializing in architecture, business technology and healthcare IT. He is a graduate of
Northwestern University’s Medill School of Journalism.


Sequencing our 3.2 billion DNA base pairs is becoming increasingly crucial as
genomic testing gains widespread acceptance.

Advancements in genomics are improving the detection of mutations that
can lead to illnesses, with the potential to revolutionize personalized
medicine by enabling the development of more effective treatments for
genetic disorders.

High-performance computing is revolutionizing the field of genomics by
accelerating the speed of analysis and processing of large-scale gene
sequencing data sets.
However, genomics is facing a massive Big Data problem. Scientists are
struggling to process a growing volume of data as precision medicine turns
to gene sequencing for individual patients.
By leveraging the compute power of graphics processing units, geneticists
can speed up analysis and reduce the cost of processing the huge amounts
of data produced by gene sequencing.

NVIDIA is among the companies offering GPU-based HPC solutions.


NVIDIA’s Clara Parabricks is a GPU-accelerated computational genomics
toolkit that supports analytical workflows for next-generation sequencing,
including short- and long-read applications. The toolkit is designed for
analyzing genomic data after it comes off the sequencer and turning it into
interpretable data.
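As a rough sketch of what that looks like in practice (file names below are placeholders, and the flags are assumed from Parabricks conventions rather than stated in this article), an end-to-end germline run collapses the whole FASTQ-to-VCF pipeline into a single command:

BASH
# Hypothetical end-to-end germline run with Clara Parabricks: alignment,
# preprocessing and variant calling in one GPU-accelerated command.
$ pbrun germline \
    --ref GRCh38.fasta \
    --in-fq sample_R1.fastq.gz sample_R2.fastq.gz \
    --out-bam sample.bam \
    --out-variants sample.vcf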
“One of the main benefits of the software is that it is based on industry-
standard tools, so a lot of what would be run on CPU can now be run on
GPU,” explains Harry Clifford, NVIDIA’s head of genomics product. “There is a
huge acceleration factor with it being on GPU.”

That translates into more than 80 times the acceleration on some of those
industry standard tools, he notes, adding that the software is also scalable.

“It’s fully compatible with all the workflow managers genomics researchers
are using,” Clifford says. “There’s also the improved accuracy point, which is
provided by artificial intelligence-based deep learning and high accuracy
approaches included in the toolkit.”
Volume, Velocity and Variety of Data Pose Challenges in
Genomics
Clifford points out that Big Data analysis can be split broadly into three
pillars: the amount of data (volume), the speed of processing (velocity) and
the number of data types (variety).

“First off, we have this huge explosion of data, this volume problem in
genomics, and that’s why you need HPC solutions,” he says.

The second aspect of the Big Data challenge is velocity, as each sample that
is run through a sequencer must be run through a sequencing process, a wet
lab process and then through the computational analysis process.

“Those sequencers are now running so quickly that compute is the new
bottleneck in genomics,” Clifford explains.


The need to handle different sequencing experiments, from RNA to DNA
sequencing or tumor sequencing, means there’s a huge challenge of data
variety as well.

“That’s where deploying AI solutions and more adaptable solutions actually
becomes really important,” Clifford says.

He adds that AI is now vital to genomics by driving higher accuracy and
entirely novel insights, which is being done on GPUs at high speed and low
cost.

“We have that in our Clara Parabricks genomics analysis software with AI-
led, neural network–based solutions for high accuracy, as well as
downstream in a lot of the drug discovery work with large language models
driving new insights in the field,” he says.

The Benefits of a GPU-Based Approach for Genomics Analysis
The GPU-based approach allows for acceleration in the processing of various
types of data.

“If you were to compare the run times on a CPU with a GPU for analysis of a
single sample end to end, you’re looking at somewhere on the order of 24
hours-plus on a CPU, whereas on the GPU we have that down to less than 25
minutes on our DGX systems,” Clifford says. “That’s a huge acceleration of
the analysis.”

He says the second benefit of increased processing power and reduced
analysis times is lower costs.

“If you need to run this in the cloud, for example, where time is money, then
that reduced time is saving you a huge amount in costs as well,” he says.
Clifford explains that NVIDIA’s full-stack approach to solutions means it’s
getting easier for healthcare organizations to tap the power of HPC.

“You’re able to program these chips, to use so many different libraries for
the data science steps and for the genomics itself with Clara Parabricks,” he
explains. “The tools are there, and all of this analysis can now be brought on
the GPU.”

He points to the next generation of chips, including the recently released
H100 Tensor Core GPU, which he describes as “very well suited to genomics
analysis.” It boasts a transformer engine that determines whether 8-bit and
16-bit floating point calculations are appropriate.

“The H100 also features a dynamic programming core, which accelerates
routing pattern algorithms,” says Clifford. “This is incredibly useful for the
alignment step of rebuilding a genome during rapid sequencing of DNA and
RNA.”

All these components mean it will work well with large language models and
some of the latest transformer-based deep learning architectures.

“These very large models give us a new ability to interpret data and
understand biological meaning,” he says. “That’s an area of the field that is
just getting started and benefits hugely from HPC and GPU.”


Deploying genomics workflows on high performance computing (HPC) platforms: storage, memory, and compute considerations

Marissa E. Powers, Keith Mannthey, Priyanka Sebastian, Snehal Adsule, Elizabeth Kiernan, Jonathan T. Smith, Jessica Way, Beri Shifaw, David Roazen, Paolo Narvaez

doi: https://doi.org/10.1101/2022.04.05.485833
This article is a preprint and has not been certified by peer review.

Abstract
Next Generation Sequencing (NGS) workloads largely consist of pipelines of tasks with
heterogeneous compute, memory, and storage requirements. Identifying the optimal
system configuration has historically required expertise in both system architecture and
bioinformatics. This paper outlines infrastructure recommendations for one commonly
used genomics workload based on extensive benchmarking and profiling, along with
recommendations on how to tune genomics workflows for high performance computing
(HPC) infrastructure. The demonstrated methodology and learnings can be extended for
other genomics workloads and for other infrastructures such as the cloud.

Introduction
Since the advent of Next Generation Sequencing (NGS), the cost of sequencing genomic
data has drastically decreased, and the amount of genomic samples processed
continues to increase [1,2]. With this growth comes the need to more efficiently process
NGS datasets.
Prior work has focused on custom methods for deploying exomes on HPC systems [3],
as well as best practices for deploying genomics workflows on the cloud [4]. This paper
outlines how to optimize system utilization for one commonly used genomics workflow,
along with recommendations on how to tune genomics workflows for HPC infrastructure.
The Broad Institute’s Genome Analysis Toolkit (GATK) Best Practices Pipeline for
Germline Short Variant Discovery is a commonly used workflow for processing human
whole genome sequences (WGS) datasets. This pipeline consists of 24 tasks, each with
specific compute, memory, and disk requirements.

See S1 Table for a full list of the tasks and their requirements. Of these 24 tasks, six are
multithreaded, and the rest are single threaded.

For multithreaded tasks, the genome is broken into shards, and each is executed as a
parallel process. At the end of the task, the output datasets from all shards are
aggregated and passed as a single input to the next task. This ability to “scatter” a task
across multiple jobs, and then “gather” outputs for the next task is called “scatter-
gather” functionality [5]. The ability to process multiple jobs concurrently is referred to
as “parallelization.” Both scatter-gather functionality and parallelization are key
concepts for efficiently distributing genomic pipelines on a system.
For example, in the task BWA, which aligns fragments output by the sequencer into a
single aligned string, the genome is broken into 24 shards. On local high performance
computing (HPC) infrastructure, each of these 24 shards is packaged as a single batch
scheduler job. Once all 24 shards of BWA complete, the task MergeBamAlignment (Mba)
consolidates the 24 output files into a single input file for the next task, MarkDuplicates,
which is single threaded.

Each of these 24 BWA jobs is deployed on the cluster and executed in parallel, and each
job is allocated a recommended four CPU threads and 14GB DRAM (see S1 Table). When
BWA is running with these parameters, therefore, it consumes in total 96 threads and
336GB DRAM.
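For illustration (the script and file names below are hypothetical and not taken from the paper or the Cromwell backend), the same resource shape can be expressed as a Slurm array job:

BASH
# Sketch: submit the 24 BWA shards as a Slurm job array, giving each shard the
# recommended 4 CPU threads and 14 GB of memory.
$ sbatch --array=0-23 --cpus-per-task=4 --mem=14G --wrap \
    'bwa mem -t 4 ref.fasta shard_${SLURM_ARRAY_TASK_ID}.fastq > shard_${SLURM_ARRAY_TASK_ID}.sam'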

It’s important to note that while BWA can readily consume 96 threads on a system,
most of the tasks are single threaded. Figure 1 below shows CPU utilization (gray) and
memory utilization (in red) for the duration of the pipeline when processing a single 30X
coverage human whole genome sequence (WGS). Note that memory utilization is close
to its maximum for roughly only a third of the overall processing time. CPU utilization is
at 100% for even less time.

Fig 1. CPU and Memory Utilization for single WGS run.
The runtime for key tasks in the pipeline (with start and end times in parens) is BWA
(0.0-2.4); MarkDuplicates (2.4-4.8); SortSampleBam (4.8-5.7); BaseRecalibrator (5.7-
6.3); GatherBQSRReports (6.3-6.4); ApplyBQSR (6.4-6.9); GatherBamFiles (6.9-7.3);
HaplotypeCaller (7.3-8.7); MergeVCFs (8.7-9.3). CPU Utilization is at 100% for BWA, and
40-50% for HaplotypeCaller. Memory utilization is close to 100% only for BWA and
SortSampleBam. Because of this heterogeneity in resource utilization, achieving
maximum throughput requires efficient scheduling of multiple WGS samples in parallel.
Because of this heterogeneity, making efficient use of HPC infrastructure requires
tuning and orchestration of the workflow. First, these tasks need to be efficiently
sharded and distributed across the cluster. Second, each task needs to be allocated the
optimal number of threads and memory. Local disk must be used for temporary storage.
Finally, the tasks need to take advantage of underlying hardware features.

This paper outlines the impact of each of these factors on performance, and details best
known methods for configuring the Germline Variant Discovery pipeline on local HPC
infrastructure.

Materials and Methods


Benchmarking was performed on a five-server cluster, with one application server and
four compute servers. A full hardware configuration can be found in Supporting
Information S2 Table.

The publicly available NA12878 30X coverage whole genome sequence (WGS) dataset
was used for all benchmarking. For the resource profiling (e.g. Fig 1), a single sample
was run. For throughput tuning 40 WGS were submitted concurrently.
Jobs were orchestrated on the cluster with Slurm. The Broad Institute provides a Slurm
backend for Cromwell, which can be found in the Cromwell documentation [6]. GATK
Best Practices Pipelines are defined in Workflow Description Language (WDL). A given
WDL defines which GATK tools to call in the form of tasks and is accompanied by a JSON
file with dataset locations and other configuration settings.
The full list of tasks in the Germline Variant Discovery pipeline can be found in S1 Table,
along with the compute requirements for each task. Testing was performed with GATK
v4.1.8.0. A full software bill of materials is provided in the Supporting Information S3
Table.
The specific WDL and JSON files used for testing can be found in S3 Table. The
recommended resource allocation values are included in those workflows.

Results
1. Tuning Resource Allocation Values

To allocate specific amounts of cores and memory to each task, the HPC batch
scheduler must be configured to enable consumable resources. With Slurm, this is set in
slurm.conf by specifying both “SelectType=select/cons_res” and
“SelectTypeParameters=CR_Core_Memory.” Additional detail on consumable resources
is available in the Slurm documentation [7].
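For reference, those two settings appear as plain key-value lines in slurm.conf; a quick way to confirm them on a compute cluster (the file path may differ by site) is shown below.

BASH
# Check that consumable cores and memory are enabled (path may differ by site).
$ grep -E '^Select' /etc/slurm/slurm.conf
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory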
Another key component of resource tuning is Hyperthreading. With Hyperthreading
turned on (HT=On) two processes can be executed simultaneously on a single physical
core. Testing shows a 10% overall pipeline speedup with HT=On. With HT=On, a task is
allocated a set number of threads, with two threads available per physical core.

The most compute-intensive task in the Germline Variant Discovery pipeline is BWA.
Manipulating the number of threads per shard for BWA has a substantial impact on the
overall runtime of the pipeline, as shown in Figure 2A.

Fig. 2. Impact of Increasing threads-per-shard on BWA Performance.
Figure 2A shows the effect of increasing threads per shard on single shard runtime for
BWAmem. Figure 2B shows the effect of increasing the number of threads per shard
on total corehours consumed.
Figure 2A shows impact of increasing the number of threads per shard on the shard
runtime. When two threads are allocated per shard, the runtime is 200 minutes.
Increasing to 16 threads per shard decreases the runtime by 10X down to 27 minutes.
Increasing the thread count higher than 16 threads per shard has limited positive
impact on shard runtime. Figure 2B shows the impact, however, on total corehours.
Increasing the number of threads per shard gradually increases the total corehours
consumed on the cluster. When processing a single WGS, 16 threads per shard provides
a fast task runtime while limiting corehours.
S1 Table provides a detailed list of thread count and memory recommendations for
each individual task in the pipeline, including BWA. These values were determined
empirically specifically optimizing for throughput processing. The Discussion covers
considerations when optimizing for fastest single sample runtime, as well as
considerations for cloud infrastructure.
While 16 threads per shard results in the fastest single shard runtime, and the fastest
runtime for BWA, it does not necessarily result in the best throughput, or number of
genomes that can be processed on a system per day. On a 4-server 2-socket system
with 24-core CPUs and HT=On, there are 384 available threads available at any given
time. Setting BWA to consume 16 threads per shard for 24 shards results in BWA
consuming all 384 threads for a single WGS. Adjusting this thread count to, for example,
4 threads per shard, results in a longer runtime for BWA but allows for processing four
WGS samples in parallel.

While this section has focused on BWA tuning, similar methods were used to identify
optimal thread and memory allocations for each task in the pipeline. These
recommended values can be found in S1 Table.

2. Distributing Tasks Efficiently

The second most compute-intensive task in the pipeline is HaplotypeCaller, which
performs variant calling. For HaplotypeCaller, the number of shards the tasks are
distributed across is set in the WDL as the variable “scattercount.”

Figure 3 shows the relationship between the runtime of HaplotypeCaller and
scattercount. When HaplotypeCaller is sharded into just two jobs (scattercount=2), the
task takes 400 minutes to complete. As shown in Figure 3A, the task runtime
decreases as scattercount increases up to scattercount=48. Beyond scattercount=48,
there’s limited benefit in further sharding the task into smaller jobs.

Figure 3. Impact of scattercount on HaplotypeCaller Runtime and total Core Hours.
As scattercount increases, the runtime of the longest running shard decreases (A) while
the total corehours consumed increases (B).
Figure 3B shows the relationship between scattercount and total corehours consumed
by HaplotypeCaller. Corehours gradually increases with scattercount. As
HaplotypeCaller is split into more small jobs the total corehours consumed increases.
It’s important to note that scattercount cannot be arbitrarily set without considering
potential artifact generation. For this reason, scattercount is set specifically to 48.
Concordance analysis is always required when tuning scattercount to ensure fidelity.
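Concretely, each shard is the same tool restricted to one slice of the genome; a hedged sketch of a single shard invocation (interval and file names below are illustrative, not from the paper) looks like this:

BASH
# Sketch: one HaplotypeCaller shard restricted to its own interval list; with
# scattercount=48 there are 48 such jobs whose per-shard GVCFs are gathered afterwards.
$ gatk HaplotypeCaller \
    -R ref.fasta \
    -I sample.bam \
    -L shard_07_of_48.interval_list \
    -O sample.shard_07.g.vcf.gz \
    -ERC GVCF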

3. Local Disk for Temporary Storage

While BWA and HaplotypeCaller are the two most compute-intensive tasks in the
pipeline, one of the longest running tasks is the single threaded MarkDuplicates.
MarkDuplicates takes in a BAM or SAM file and compares and identifies duplicate reads.
The uncompressed files processed by MarkDuplicates for a 30X human whole genome
sequence can total over 200GB in size. The task is highly dependent on fast local
storage for processing these datasets.

Figure 4 shows the impact of running with a local Solid State Drive (SSD) compared to
running without an SSD and just using the parallel file system. With an SSD,
MarkDuplicates runtime is 2.5 hours. Without an SSD, MarkDuplicates runtime is 37.6
hours.

Fig. 4. Local disk for MarkDuplicates Performance.
The runtime for the entire Germline Variant Calling pipeline drastically decreases with
use of a local SSD. This is primarily due to the decrease in runtime of MarkDuplicates
(blue).
S2 Table includes an NVMe Pxxx SSD in each of the compute servers, largely for the
sake of MarkDuplicates processing.
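As a hedged illustration of pointing the task's scratch space at the local NVMe drive (the mount point is a placeholder, and the temporary-directory handling in the tested WDLs is not reproduced here):

BASH
# Sketch: direct MarkDuplicates intermediary files to a local SSD rather than
# the parallel file system.
$ gatk MarkDuplicates \
    -I sample.aligned.bam -O sample.markdup.bam -M sample.markdup_metrics.txt \
    --TMP_DIR /mnt/nvme/tmp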

4. Genomics Kernel Library (GKL) to Utilize Hardware Features

Ultimately each shard of each task is executed on a single thread of a CPU. Ensuring
these tasks are able to take advantage of the underlying CPU features is a key factor for
performance.

As of GATK 4.0, a number of tasks in the Germline Variant Discovery pipeline have been
accelerated to take advantage of Intel AVX-512 Instructions through the Genomics
Kernel Library (GKL). GKL is developed and maintained by Intel and is distributed open
source with GATK [8].
GKL includes compression and decompression from Intel’s ISA-L and zlib libraries, as
well as AVX-512 implementations of PairHMM and Smith-Waterman [8-10]. PairHMM
and Smith-Waterman are two key kernels included in a number of genomics tasks,
including HaplotypeCaller.
Figure 5 shows the benefit of GKL compression for the three tasks with the largest
input file sizes: ApplyBQSR (133GB), GatherBamFiles (62GB) and MarkDuplicates
(222GB). GKL provides compression at levels from 1-9 (CL=1-9). CL=1 (orange) with
GKL provides a 2-4X compression ratio relative to with no compression (blue). The
compression ratio continues to improve as compression levels increase up to level 5.
Fig. 5. Compression with GKL for tasks with largest input file sizes.
The Genomics Kernel Library (GKL) performs compression at levels 1 through 9 (CL=1-
9). Compression ratios relative to CL=1 are shown in Figure 5A for three different tasks
in the pipeline. Figure 5B shows the impact of each compression level on completion
time for MarkDuplicates.
Figure 5B shows the task runtime as a function of compression level for these three
tasks.
As the name suggests, MarkDuplicates checks the input BAM file for duplicate reads,
and tags any identified duplicates [11]. In doing so the task reads and writes small (kB)
intermediary files throughout the 2+ hours of processing. Each of these intermediary
files is compressed and decompressed. Because of this, higher compression levels
result in a high runtime cost with this task (see Figure 5B).
Based on these results, compression level is set to CL=2 in GATK 4.2.0.0. This
compression level provides a good balance between high compression ratio across tasks
(Figure 5A) and low runtime for MarkDuplicates (Figure 5B).
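As an illustration (the mechanism below is an assumption based on HTSJDK conventions, not quoted from the paper), the block-compression level can be passed to a GATK tool through the JVM options:

BASH
# Sketch: set the default HTSJDK block-compression level to 2 for a single task.
$ gatk --java-options "-Dsamjdk.compression_level=2" MarkDuplicates \
    -I sample.aligned.bam -O sample.markdup.bam -M sample.markdup_metrics.txt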
Figure 6 shows the difference between HaplotypeCaller runtime with the AVX-512
implementations of both kernels (left) compared to with the original implementations
with no AVX instructions (right). The middle bar shows the runtime with the AVX512
implementation of SW and the Java AVX2 implementation of pairHMM.

Fig. 6. Impact of GKL AVX flags on HaplotypeCaller Performance.
HaplotypeCaller performance, as measured by task runtime in seconds, drastically
improves with the use of GKL pairHMM and SmithWaterman (SW). Runtimes for
pairHMM are shown in blue; SW in orange.
Note the y-axis is a log scale. Without the AVX implementations, HaplotypeCaller takes
125,754 seconds, or 35 hours, to complete. With the GKL AVX512 implementations the
same task completes in less than one hour.

Notably, users do not need to set any special flags to run with the GKL implementations.
As shown in the HaplotypeCaller documentation, running with default flag
(FASTEST_AVAILABLE) automatically detects if the underlying CPU includes support for
AVX-512 instructions and, if so, deploys the GKL implementation [12].
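For completeness, the kernel implementations can also be pinned explicitly on the command line; the argument names below follow GATK conventions and are offered as a hedged sketch rather than quoted from the paper.

BASH
# Sketch: request the fastest available pairHMM and Smith-Waterman implementations;
# on AVX-512 capable CPUs this resolves to the GKL kernels described above.
$ gatk HaplotypeCaller \
    -R ref.fasta -I sample.bam -O sample.g.vcf.gz -ERC GVCF \
    --pair-hmm-implementation FASTEST_AVAILABLE \
    --smith-waterman FASTEST_AVAILABLE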

Discussion
As illustrated above, optimal performance of the Germline Variant Discovery pipeline is
dependent on (1) efficiently distributing tasks across the cluster; (2) tuning resource
allocation values; (3) utilization of fast local storage; and (4) libraries that take
advantage of underlying CPU features.

Intel and Broad Institute have partnered to form the Intel-Broad Center for Genomics
Data Engineering. The Genomics Kernel Library (GKL) is a direct outcome of this joint
engineering Center. As part of this partnership, many of the configuration
recommendations outlined above (e.g. the Slurm backend for Cromwell and resource
allocation values) are directly incorporated into the Broad Institute workflows and
documentation.

A specific reference architecture, including a detailed Installation Guide, is available as
the Intel Select Solution for Genomics Analytics [13]. The Solution includes a detailed
hardware and software configuration similar to that provided in the Supporting
Information (A2 and A3).
The recommended resource allocation values provided in S1 Table have been
specifically tuned for throughput, or the number of WGS samples that can be processed
on a cluster per day. For institutions processing dozens, hundreds, or even thousands of
samples per month, this throughput metric is a higher priority than single sample
processing time. In these scenarios, reducing the number of threads per task allows for
more jobs to run concurrently on the system, increasing throughput.

In other cases, such as with single threaded tasks, it makes sense to increase the
thread allocation to increase throughput. As shown in S1 Table, most tasks in the
pipeline are allocated two threads each despite being single threaded. This is
specifically to optimize for throughput. The second thread (1) allows for Java garbage
collection and (2) intentionally limits the number of jobs concurrently running on the system.
Limiting the number of overall jobs helps ensure each task has sufficient memory while
also leaving sufficient memory for scheduling and system level operations.

For scenarios where single sample processing time is the highest priority, increasing the
threads and memory allocated per task will reduce single sample runtime, while
decreasing the overall throughput of the cluster. A workflow optimized for single sample
runtime is provided in the same repository as the throughput WDL (see S3 Table).

Increasing threads per task is also beneficial in the cloud. When deploying the Germline
Variant Discovery pipeline through the Broad Institute’s Platform as a Service (PaaS)
Terra.bio, each shard of each task is allocated its own VM with a set number of virtual
CPUs (vCPUs) and DRAM. In this scenario, each of the 24 BWA shards is allocated 16
vCPUs, compared to the four threads per shard recommended for local deployments.
Allocating 16 vCPUs to each BWA shard does not negatively impact the runtime of
other tasks and samples on the cloud, since there are no infrastructure-scale
constraints. Van der Auwera and O’Connor provide a detailed guide on best practices for
deploying Broad Institute workflows on the cloud [4].

As shown, the optimal workflow configuration is dependent on both underlying
infrastructure and key performance metrics (e.g. throughput vs single sample runtime).
Profiling workloads with methods described here can be extended to genomics
workflows beyond Germline Variant Calling. Future work will include tuning additional
workflows as well as comparing cloud and local performance considerations.

Key Points
 Because genomics workflows consist of pipelines of tasks with heterogeneous
compute requirements, achieving maximum throughput requires efficient tuning
and orchestration of these workflows.

 Tasks need to be efficiently sharded and distributed across the cluster; each task
needs to be allocated the optimal number of threads and memory; and local disk
must be used for temporary storage.

 The Genomics Kernel Library (GKL) improves GATK performance by taking
advantage of AVX-512 instruction sets and accelerating compression and
decompression.

Supporting Information
S1 Table. Recommended Resource Allocations for Germline Variant Discovery
tasks.
S2 Table. Hardware Configuration Used for Testing.
S3 Table. Software Configuration Used for Testing.

Acknowledgements
The authors thank Michael J. McManus PhD for his input and guidance from study
conception through data analysis. We thank Kyle Vernest, Kylee Degatano, and Louis
Bergelson for technical support throughout benchmarking

Healthcare and Life Sciences

Unlocking the Mysteries of Mutational Signatures of Cancer with NVIDIA Accelerated Solutions

Objective: Sanger Institute uses the NVIDIA DGX server to power its mutational cancer
signature analysis pipeline, improving performance by 30x.

Customer: Sanger Institute

Use Case: Performance improvement

Technology: NVIDIA DGX-1™ Server, NVIDIA® NVLink®

The Need to Better Understand Mutational Signatures of Cancer

Cancer is caused by damage to cells’ DNA, known as somatic mutations. This
damage can be the result of behaviors such as smoking and drinking alcohol,
as well as environmental factors such as ultraviolet light and exposure to
radiation.

Damage to DNA occurs in specific patterns known as “mutational signatures,” which are
unique to the factor that caused the damage. For example, although tobacco and
ultraviolet radiation both cause cancer by producing mutations, the signature caused by
smoking tobacco is found in lung cancer while the signature from ultraviolet light
exposure is found in skin cancer.

Many cancer-associated mutational signatures have been identified, but only about half
of them have known causes. In recent years, the analysis of DNA from cancers has led
to more than ninety different mutational signatures being discovered. However, the
environmental, lifestyle, genetic, or other potential causes of many of these mutational
signatures are still unknown.

As part of the Cancer Grand Challenges Mutographs team funded by Cancer Research
UK (CRUK), the Wellcome Sanger Institute, one of the premier centers of genomic
discovery and understanding in the world, is using NVIDIA GPU-accelerated machine
learning models to help understand how naturally occurring DNA changes affect cancer.

The goal of the computational component of the project is to elucidate the causes of
major global geographical and temporal differences in cancer incidences through the
study of mutational signatures. Identifying a broader set of mutational signatures will go
a long way toward understanding the correlations between them and their causes,
ultimately leading to more precise cancer treatments.

Wellcome Sanger Institute researcher conducts DNA sequencing. Image courtesy of Wellcome Sanger Institute.

Cases of esophageal squamous cell carcinoma vary greatly around the world. Image courtesy of the Mutographs project. Data source: GLOBOCAN 2012.

Cracking the Code with GPU-Accelerated Computing

This work requires the solution of a computationally intensive machine learning problem
known as non-negative matrix factorization (NMF). Ludmil Alexandrov developed the
approach for detecting mutational signatures and the software (SigProfiler) while at the
Sanger Institute and continues to build on this work with his team at the University of
California, San Diego (UCSD). NVIDIA and the Mutographs teams at UCSD and the
Sanger Institute teamed up to use GPUs to accelerate this research.
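
For orientation, the sketch below factorizes a synthetic mutation-count matrix into candidate signatures and per-sample exposures using scikit-learn's NMF. It is a minimal, assumed setup (the matrix, component count, and parameters are placeholders) and does not reproduce the SigProfiler implementation, which uses its own, considerably more involved, NMF strategy.

# Minimal NMF sketch for mutational-signature-style data (illustrative only).
# Rows: 96 trinucleotide mutation contexts; columns: tumour samples.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.poisson(lam=5.0, size=(96, 200)).astype(float)  # synthetic count matrix

k = 5  # assumed number of signatures to extract
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # 96 x k: candidate mutational signatures
H = model.components_        # k x 200: signature activity (exposure) per sample

print("reconstruction error:", round(model.reconstruction_err_, 2))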

“Research projects such as the Mutographs Grand Challenge are just that—
grand challenges that push the boundary of what’s possible,” said Pete
Clapham, leader of the Informatics Support Group at the Wellcome Sanger
Institute. “NVIDIA DGX systems provide considerable acceleration that
enables the Mutographs team to, not only meet the project’s computational
demands, but to drive it even further, efficiently delivering previously
impossible results.”

NVIDIA GPUs accelerate the scientific application by offloading the most time-consuming
parts of the code. While the Sanger Institute saves cost and improves performance by
running the computationally intensive work on GPUs, the rest of the application still runs
on the CPU. From the researcher’s perspective, the overall application simply runs faster,
because the most demanding steps now exploit the parallel processing power of the GPU.

In the current project, researchers are studying DNA from the tumors of
5,000 patients with five cancer types: pancreas, kidney, colorectal, and two
kinds of esophageal cancer. Five synthetic data matrices that mimic one type
of real-world mutational profile were used to estimate compute performance.
An NVIDIA DGX-1 system runs the NMF algorithm against the five matrices, while the
corresponding replicated CPU jobs are executed in Docker containers on OpenStack
virtual machines (VMs), specifically 60 cores of Intel Xeon Skylake processors at
2.6 GHz with 697.3 GB of random-access memory (RAM).

The NVIDIA DGX-1 is an integrated system for AI featuring eight NVIDIA V100
Tensor Core GPUs that connect through NVIDIA NVLink, the NVIDIA high-
performance GPU interconnect, in a hybrid cube-mesh network. Together
with dual-socket Intel Xeon CPUs and four 100 Gb NVIDIA Mellanox®
InfiniBand network interface cards, the DGX-1 delivers one petaFLOPS of AI
power, for unprecedented training performance. The DGX-1 system software,
powerful libraries, and NVLink network are tuned for scaling up deep learning
across all eight V100 Tensor Core GPUs to provide a flexible, maximum
performance platform for the development and deployment of AI applications
in both production and research settings.


Execution time (days), CPU/VM vs. DGX-1:

Dataset        CPU/VM    DGX-1
data_set_1     20.858    0.591
data_set_2     21.053    0.750
data_set_3     17.716    0.446
data_set_4     17.707    0.609
data_set_5      6.634    0.235
data_set_6     13.219    0.487

Faster Results and More Complex Experiments Hold the Promise to
Improve Human Health

An average of 30X acceleration was observed when the pipeline jobs were executed on
the DGX-1 platform compared to those on CPU hardware. The DGX-1 delivered accurate
results in sixteen hours for an equivalent CPU job that usually took twenty days in a
real-life analysis.
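
The roughly 30X average follows directly from the per-dataset runtimes in the table above; the short calculation below reproduces it.

# Per-dataset speedup = CPU/VM runtime / DGX-1 runtime (both in days), from the table above.
cpu_days = [20.858, 21.053, 17.716, 17.707, 6.634, 13.219]
dgx_days = [0.591, 0.750, 0.446, 0.609, 0.235, 0.487]

speedups = [c / g for c, g in zip(cpu_days, dgx_days)]
print([round(s, 1) for s in speedups])          # individual speedups
print(round(sum(speedups) / len(speedups), 1))  # average speedup, roughly 30x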

The speedup and compute power of GPUs are enabling researchers to obtain
scientific results faster, run a greater number of experiments, and run more
complex experiments than were previously possible, paving the way for
scientific discoveries that could transform the future of cancer treatments.

