Webinar Series
Biology meets programming: Bioinformatics 101 for NGS researchers
13 May 2020
Participating experts
David Guttman, Ph.D.
University of Toronto
Toronto, Canada
Alberto Riva, Ph.D.
University of Florida
Gainesville, FL
Brought to you by the Science/AAAS Custom Publishing Office
Biology Meets Programming:
Bioinformatics 101 for NGS Researchers
David S. Guttman
Department of Cell & Systems Biology
Centre for the Analysis of Genome Evolution & Function
University of Toronto
Integrating Next-Generation Sequencing and Bioinformatics into Your Research
Lessons Learned
Integrating NGS into Your Research
Bioinformatics in the wet lab
Experimental design
Data analysis
Data presentation
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Specialized vs cross-trained
– Finding & keeping people
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Specialized vs cross-trained
Specialized “Pet” Bioinformatician
Pros
• High expertise
• Focused training
Cons
• Poor integration with group
• Isolated from field
Cross-Trained Bioinformatician
Pros
• Highly integrated with group
• Well-rounded training
Cons
• Time is finite
• Jack of all trades, master of none
Bioinformatics in the Wet Lab
The care & feeding of bioinformaticians
– Finding & keeping people
• Focus
• Training
• Credit
[Cartoon speech bubbles: “I like python more than pythons” / “I like pythons more than python” / “HELLO… is there anyone out there???” / “Sigh… Stuck in the middle again.”]
[Figure: mock journal article, Lorem Ipsum 12(3):45-67, by Wendy Wetbench, Bob Bioinformatics, G. Rand Poobah; body is placeholder text]
Experimental Design
Collecting the right data
Selecting the right NGS platform
Selecting the right NGS service provider
Experimental Design
Collecting the right data
– Sample structure
– Sample processing
– Sequence quantity & quality
– Sequence storage & security
Experimental Design
Collecting the right data
– Sample structure
• Sample size
• Biological replicates
• Controls
• Sources of variation
• Metadata
How many samples do I need?
Experimental Design
Collecting the right data
– Sample structure: power analysis
• Sample size
• Statistical power (type II error)
• Significance level (type I error)
• Effect size
How many independent tests? Bonferroni, who??
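The four quantities in a power analysis are linked, so fixing three of them determines the fourth. As a rough sketch only, the function below estimates samples per group for a two-group comparison using the normal approximation n = 2 × ((z_alpha + z_power) / d)², with a Bonferroni correction for multiple independent tests. The function name, the default values, and the choice of a two-sided two-group test are assumptions for illustration; a real design should be checked with a statistician.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80, n_tests=1):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    effect_size is the standardized difference (difference in means divided
    by the standard deviation). alpha is divided by n_tests, which is the
    Bonferroni correction for n_tests independent tests.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    corrected_alpha = alpha / n_tests
    z_alpha = NormalDist().inv_cdf(1 - corrected_alpha / 2)  # type I error term
    z_power = NormalDist().inv_cdf(power)                    # type II error term
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Sample size grows as the number of tests (and hence the correction) grows.
for n_tests in (1, 100, 20000):
    print(n_tests, samples_per_group(effect_size=1.0, n_tests=n_tests))
```

The normal approximation slightly underestimates the sample size for small groups, so treat the result as a lower bound.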
Experimental Design
Collecting the right data
– Sample structure: sources of variation
• Genotype
• Development
• Population structure
• Environment
• Time
[Figure label: Variance]
Experimental Design
Collecting the right data
– Sample structure: metadata
Metadata = data about your data
What metadata will I need and how should I record it?
Experimental Design
Collecting the right data
– Sample structure
• Power analysis
• Biological replicates
• Controls
• Sources of variation
• Metadata
Statistician
Experimental Design
Collecting the right data
– Sample processing
• Collection
• Storage
• Extraction
Do I need to ship my samples?
How quickly can I get my samples into a freezer?
Do I need to revive any organisms in the sample?
Are my samples being sequenced on a long read or short read sequencer?
Experimental Design
Collecting the right data
– Sequence quantity & quality
• Coverage
• Contamination
• Batch controls
How much sequence data do I need?
[Figure: reads aligned to a genome, labeled “Depth Coverage 5x” and “Genome Coverage -15%”]
Experimental Design
Collecting the right data
– Sequence quantity & quality: coverage
How big is the genome?
Do I need a reference genome?
Is there a reference genome??
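Genome size matters because average depth is simply total sequenced bases divided by genome size. The sketch below illustrates that arithmetic, plus the expected fraction of the genome covered at least once under the simplifying assumption that reads land uniformly at random (a Poisson approximation, which real libraries only roughly follow). Function names and example numbers are hypothetical.

```python
import math


def mean_depth(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_depth(target_depth, read_length, genome_size):
    """Number of reads needed to reach a target average depth."""
    return math.ceil(target_depth * genome_size / read_length)


def expected_breadth(depth):
    """Expected fraction of the genome covered by at least one read,
    assuming reads are placed uniformly at random."""
    return 1 - math.exp(-depth)


# Example: a 5 Mb genome sequenced with 150 bp reads.
print(reads_for_depth(target_depth=30, read_length=150, genome_size=5_000_000))
print(round(expected_breadth(5), 4))
```

Contaminating or otherwise unusable reads reduce the effective read count, so the number of reads to order is usually higher than this estimate.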
Experimental Design
Collecting the right data
– Sequence quantity & quality: contamination
How much of my data is actually useful?
How much of my data will be contaminating DNA?
Experimental Design
Collecting the right data
– Sequence quantity & quality: batch controls
Will all my samples be processed or sequenced at the same time?
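When samples cannot all be processed together, one common precaution is to spread each biological condition across the batches so that batch and condition are not confounded. The following is a minimal sketch of that idea, not a prescribed protocol; the function name, the sample identifiers, and the round-robin strategy are assumptions for illustration.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Spread each condition as evenly as possible across batches.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id -> batch index (0-based).
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    position = 0
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # random order within a condition
        for sample_id in ids:
            assignment[sample_id] = position % n_batches  # round-robin
            position += 1
    return assignment


samples = [(f"ctl{i}", "control") for i in range(4)] + \
          [(f"trt{i}", "treated") for i in range(4)]
print(assign_batches(samples, n_batches=2))
```

Recording the resulting batch index as metadata makes it possible to test for, and correct, batch effects later.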
Experimental Design
Collecting the right data
– Sequence quantity & quality
• Coverage
• Contamination
• Batch controls
Genomics Core
Experimental Design
Collecting the right data
– Sequence storage & security
• Storage
• Identifiers
It’s a lot of data!
[Figure: binary data stream, 1001001101110]
BobJones19.05.02    AGTAGAGCGAGCGAGTAC
JaneDoe20.02.28     TAGCGGAGTAGACGAGAG
JohnSmith20.03.17   GACGAGGCAGCCAGATAG
!!!
Experimental Design
Selecting the right NGS platform
– Data structure
– Throughput
– Cost
– Accuracy
Platform                 Lab or Core  Reads  Throughput  Run Cost  Error Rate
Oxford Nanopore MinION   Lab          Long   Low         Med       High
Illumina MiSeq           Both         Short  Low         High      Low
Illumina NovaSeq         Core         Short  High        Low       Low
MGI DNBSEQ-G400          Core         Short  High        Low       Low
PacBio RSII              Core         Long   Low         High      High
Experimental Design
Selecting the right NGS platform
– Data structure
Resequencing, de novo sequencing, amplicon sequencing, RNA-seq, ChIP-seq, etc.
– Throughput
– Cost
– Accuracy
Experimental Design
Selecting the right NGS platform
– Data structure
• Read length
• Read structure
Read Structure           Assembly  Cost      Throughput  Error Rate
Short read, single-end   Bad       Very Low  Very High   Low
Short read, paired-end   Fair      Low       High        Low
Long read                Great     High      Low         High
Experimental Design
Selecting the right NGS platform
– Throughput
• Yield
• Multiplexing
How fast can I get my data?
How many samples do I have?
How many samples can I multiplex on a sequencer run?
Experimental Design
Selecting the right NGS platform
– Cost
• Sample collection
• Sample prep
• Sequencing
• Data processing & storage
• Platform purchase
• Maintenance & service contracts
• Support equipment
• Support personnel
Sigh … Time to write another grant
Experimental Design
Selecting the right NGS platform
– Accuracy
• Error frequency
• Error profile
Is my genome AT-rich?
Are there a lot of homopolymer tracts in my genome?
Experimental Design
Selecting the right NGS platform
– Data structure
– Throughput
– Cost
– Accuracy
[Figure labels: accuracy, flexibility, cost, yield]
Genomics Core
Experimental Design
Selecting the right NGS service provider
– Level of service
Experimental Design, Sample Prep, Sequencing, Data / Statistical Analysis, Manuscript Prep
Integrating NGS into Your Research
Bioinformatics in the wet lab
Experimental design
Do all of this before collecting the first piece of data!
Data analysis
Data presentation
Data Analysis
Bioinformatics considerations
• Analysis tools
• Coding
• Verification & validation
• Public access
Data Analysis
Bioinformatics considerations
• Analysis tools
– Commercial packages (e.g. CLC Genomics)
– Public packages & platforms (e.g. Galaxy, QIIME)
– Published tools, pipelines, & libraries
– House-made scripts & pipelines
Data Analysis
Bioinformatics considerations
• Coding
– Coding environment (e.g. Jupyter Notebooks, R-markdown)
– Commenting
– Git version control
How can I ensure that my analysis is reproducible?
Data Analysis
Bioinformatics considerations
• Verification & validation
– Gold standard analyses (i.e. Test Oracle)
– Metamorphic testing
Verification: Does the software work (i.e. have no bugs)?
Validation: Does the software do what you want it to do?
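The two testing styles can be contrasted on a toy function. A gold-standard (oracle) test compares output to a known correct answer; a metamorphic test checks a relation that must hold between outputs even when the correct answer is unknown. The sketch below uses reverse complementation as the example; the functions and the chosen relations are illustrative assumptions, not taken from any particular tool.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def check_metamorphic_relations(seq):
    """Relations that must hold for any input, with no known answer needed."""
    # Applying the transformation twice returns the original sequence.
    assert reverse_complement(reverse_complement(seq)) == seq
    # Length and GC content are unchanged by reverse complementation.
    assert len(reverse_complement(seq)) == len(seq)
    assert gc_content(reverse_complement(seq)) == gc_content(seq)


# Oracle test: the expected answer is known in advance.
assert reverse_complement("AAGC") == "GCTT"

# Metamorphic test: works on arbitrary data, such as a real read.
check_metamorphic_relations(
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
)
```

Metamorphic relations are most useful for pipelines whose true output is unknown, which is the usual situation with real sequencing data.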
Data Analysis
Bioinformatics considerations
• Public access
– Data
– Metadata
– Code
Should we make our full analysis pipeline available?
Data Presentation – a Visual Narrative
Data Visualization
• Four pillars
• Tufte principles
• Agile Development
How can I tell a narrative through figures and tables?
Data Presentation – a Visual Narrative
Data Visualization
• Four pillars
– Accuracy
– Precision
– Clarity
– Efficiency
[Figure: accuracy vs. precision, with speech bubble “NO… Make them go away!!”]
Data Presentation – a Visual Narrative
Data Visualization
• Necessities
• Tufte principles
• Agile Development
Overview first, zoom & filter, details on demand
Pre-attentive processing
Less can be more
Keep it proportional
Data / ink ratio
Color perception
Figures vs. tables
[Figure: panels A, B, C (Gene 1, Gene 2, Gene 3), from Civelek & Lusis 2014 Nat. Rev. Genet. 15: 34]
[Figure: color scale from -1.5 to 1.5, shown in normal view and in red-green color blind view]
Integrating NGS into a Wet Lab
The Death of Silo Science
• Big data requires a collaborative mindset
– Experimental design
– Data generation
– Data / statistical analysis
• Collaboration from project inception
– Greatly increases likelihood of success
– Enhances training opportunities
[Figure labels: Bioinformatics, Statistics, Genomics]
Thanks!
Biology Meets Programming:
Bioinformatics 101 for NGS Researchers
Alberto Riva
ICBR Bioinformatics Core
University of Florida
[email protected]
Who am I & how did I get here?
• Background in Computer Science, Bioengineering, Medical
Informatics.
• Started working in Bioinformatics in 2001, while at Children’s
Hospital Boston – Harvard Medical School.
• Joined UF in 2006, and ICBR in 2014.
• As of January 2019, Scientific Director of the Bioinformatics Core.
ICBR
The Interdisciplinary Center for Biotechnology Research (ICBR) is a research
support organization that gathers multiple cores under the same roof:
• Bioinformatics
• Cytometry
• Gene Expression and Genotyping
• NextGen Sequencing
• Proteomics
• Electron Microscopy
• Monoclonal Antibodies
NGS raw data
The starting point for most NGS projects is a collection of fastq files,
containing millions of short reads with associated quality information.
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Each read occupies four lines. The second line is the read sequence; the
fourth line contains quality information encoded as characters (one for
each base in the read). The quality score Q encodes the probability p that
the base is miscalled: Q = -10 × log10(p). E.g. the best quality typically
reported by Illumina platforms is 40, corresponding to p=0.0001.
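To make the encoding concrete, the sketch below decodes a quality string into scores and error probabilities. It assumes the common ASCII offset of 33 (Sanger / Illumina 1.8+); older files may use an offset of 64, so the default should be checked against your data. The function names are hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is an assumption about the file's encoding.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Probability that a base with Phred score q is miscalled."""
    return 10 ** (-q / 10)


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality)
mean_error = sum(error_probability(q) for q in scores) / len(scores)
print(min(scores), max(scores), mean_error)
```

Averaging error probabilities, rather than averaging the scores themselves, gives a more faithful summary of read quality because the scale is logarithmic.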
NGS data
Other de-facto standard data types:
• BAM – describes result of mapping reads to genome.
• VCF – describes variants, i.e. differences between sample(s) and
reference.
• Tab-delimited – lingua franca for tabular data.
Use Excel only for presenting final results (with caution)!
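One reason for the caution about Excel is that spreadsheets can silently reinterpret values, for example turning gene symbols such as SEPT2 into dates. Keeping intermediate results as plain tab-delimited text avoids this. A minimal sketch using Python's standard csv module follows; the column names and values are made up for illustration.

```python
import csv
import io


def write_tsv(rows, fieldnames, handle):
    """Write a list of dicts as tab-delimited text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_tsv(handle):
    """Read tab-delimited text into a list of dicts; values stay strings."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Round trip through an in-memory buffer; a real script would use open().
buffer = io.StringIO()
write_tsv([{"gene": "SEPT2", "count": "10"}], ["gene", "count"], buffer)
buffer.seek(0)
print(read_tsv(buffer))
```

Because every value is read back as a string, nothing is converted unless the analysis code asks for it explicitly.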
Outline of NGS analysis project
Most NGS analysis projects include the following steps:
• Quality control, data cleanup;
• Alignment / mapping to reference genome;
• Quantification: measuring relevant variables (e.g. gene expression);
• Differential analysis: comparing biological conditions;
• Downstream analysis: interpretation in context of research question;
• Presentation: packaging results, reports, plots.
Outline of NGS analysis project
Most NGS analysis projects include the following steps, ranging from
standard, easily automated (top) to ad-hoc, hands-on (bottom):
• Quality control
• Alignment / mapping
• Quantification
• Differential analysis
• Downstream analysis
• Presentation
Choices
Should I…
Write my own tools?
Pros: Greater flexibility
Tool can be tailored to problem
Opportunities for publishing / visibility
Cons: Requires large time investment
Requires technical expertise
Requires validation / publication
Use existing tools?
Pros: Minimal development effort required
Generally accepted method
Cons: May not be exactly what you need
Dependent on other people’s work
Choices
Should I write my own tools or use existing tools?
Most common approach: combining existing tools with ad-hoc ones. A large part
of a bioinformatician’s work consists of gluing together different tools and
ensuring that they interoperate correctly.
In practice, this often means converting data between the different formats used
by the required tools, using ad-hoc scripts.
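As an illustration of such a glue script, the sketch below converts FASTQ records to FASTA. It assumes the simple four-lines-per-read layout (no wrapped sequences), and the function name is hypothetical; established tools already offer this conversion, so this only shows the kind of small script involved.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Assumes each read occupies exactly four lines:
    header, sequence, separator, quality.
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, separator, _quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            yield ">" + header[1:]  # FASTA headers start with '>'
            yield sequence          # quality line is dropped
            record = []
    if record:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")


example = ["@SEQ_ID\n", "GATTTGGGGTTC\n", "+\n", "!''*((((***+\n"]
for out_line in fastq_to_fasta(example):
    print(out_line)
```

Written as a generator, the converter handles files of any size without loading them into memory, and it can read from an open file handle just as easily as from a list.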
Choices
Work style:
Interactive
Pros: Allows for exploratory analysis
Easy generation of graphical reports (R, Jupyter)
Cons: Less reproducible
Constrained by hardware limitations
“Batch”
Pros: Minimal manual work required
Reproducibility built-in
Suitable for HPC environments
Easily scalable
Cons: Requires upfront development effort
Analysis path is “frozen” – can’t deal with unexpected findings
Choices
Work style:
Interactive “Batch”
Common approach: first part of analysis performed in batch mode using a standard
process. Results from this phase should be made available in a format appropriate for
downstream interactive analysis.
This choice is influenced by the available personnel – pet bioinformatician vs. cross-trained.
Choices
Hardware
Local
Pros: Total control over hardware / software
Lower recurring costs
Cons: Large initial investment
Obsolescence
Requires admin / security expertise
Cloud-based
Pros: Does not require dedicated admin / security personnel
Higher reliability, better hardware
Cons: Requires moving large amounts of data
Recurring costs proportional to volume of data analysis
Putting the pieces together
• Analysis pipelines: sequence of analysis tools designed to perform
analysis task end-to-end, in an unsupervised way.
• Large-scale NGS analysis is almost always performed on UNIX (Linux)
environments.
• HPC environments allow running many (hundreds to thousands of)
jobs in parallel, so large analysis tasks can be handled easily.
• Solutions range from hand-crafted scripts to very complex pipeline
managers.
• UNIX was designed by programmers for programmers! A “basic” Linux
environment already provides all you need.
• DIY approach: UNIX makes it very easy to combine small tools into
larger ones, building up your custom toolkit.
• For example, to count the number of reads in a fastq file:
#!/bin/bash
# Count lines in all gzipped fastq files given as arguments;
# each read occupies four lines.
N=$(zcat "$@" | grep -c ^)
echo $((N/4))
Old School Cool
Bash scripting: primitive, quirky “glue” language – but can be very
effective in automating repetitive tasks.
Makefile: complete rule system to describe how to generate certain
files from their sources.
Very complex analysis pipelines can be built using Bash scripts
combined with makefiles. NOT user-friendly – but works out-of-the-
box, with no dependencies, external tools, etc.
“Traditional” programming
Most modern programming languages provide support for
bioinformatics. Examples:
• perl – specialized language for text processing. Provides bioperl.
• R – specialized data analysis language and environment. Probably the
most widely used in bioinformatics. Bioconductor: huge repository of
analysis packages. Powerful visualization. Pretty much unavoidable.
• python – general purpose language, simple and (relatively) fast.
Provides biopython (similar to bioperl), numpy / scipy (similar to R
and matlab).
Pipeline managers
Pipeline managers allow you to describe what the pipeline should do –
and the manager makes it happen. Able to handle:
• Parallelization (run same task on all input files);
• Dependencies (step B requires step A to be performed first);
• Conveying data from one step to the next;
• Submitting jobs in cluster environments;
• Errors / checkpointing (if pipeline stops, restart from last good result);
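Two of these ideas, dependencies and checkpointing, can be sketched in a few lines. The toy runner below executes steps in order and skips any step whose output file already exists, so a restarted pipeline resumes from the last good result. It is an illustration only; the function name and the (name, output, function) step format are assumptions, and real pipeline managers also handle parallelism, job submission, and partially written outputs.

```python
import os


def run_pipeline(steps, log=print):
    """Run steps in order, skipping any whose output file already exists.

    steps: list of (name, output_path, function) tuples. Each function must
    create output_path. List order encodes dependencies: a step may read
    the outputs of the steps before it.
    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, output_path, func in steps:
        if os.path.exists(output_path):
            log(f"skip {name}: {output_path} already exists")
            continue
        func()
        if not os.path.exists(output_path):
            raise RuntimeError(f"step {name} did not produce {output_path}")
        executed.append(name)
    return executed
```

Deleting one output file and calling run_pipeline again re-runs only that step, which is the same behavior a makefile provides through file timestamps.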
Nextflow
Nextflow provides a language to describe analysis steps in terms of
inputs, outputs, process. Automatically scales from single CPU to multi-
core to cluster environment.
/* Simple tool to reverse sequences */
process reverse {
    input:
    path x from records

    output:
    stdout into result

    """
    cat $x | rev
    """
}
Containers
Installing, configuring and updating software tools is a significant
challenge. Tools change over time, hindering reproducibility.
Containers: self-contained virtual environments that store a collection
of software tools designed to work together.
Containers are immutable, can be easily distributed, and run anywhere.
Pipelines may make use of tools in containerized form; whole pipelines
may be distributed as containers.
Most commonly used: Docker, Singularity.
Take-home messages
• Programming is an essential skill in bioinformatics – how much you
use it depends on you!
• Know your tools: even a basic Linux environment contains a wealth of
very powerful tools ready for use.
• Carefully evaluate pros and cons of different approaches. DIY or off-
the-shelf?
• Be pragmatic! Bioinformatics == problem solving.
Q&A
Look out for more webinars in the series at:
webinar.sciencemag.org
To provide feedback on this webinar, please email
your comments to
[email protected]
For related information on this webinar topic, go to:
https://go.roche.com/libraryprep