Bioinformatics Handbook Guide

1. Introduction
    1.1 Introduction
2. Getting started
    2.1 How is bioinformatics practiced
3. Reproducibility
    3.1 Data reproducibility
5. Ontologies
    5.1 What do words mean
    5.3 Gene ontology
6. Data formats
    6.1 Biological data sources
8. Sequencing Instruments
    8.1 Sequencing instruments
9. Data Quality
    9.1 Visualizing data quality
    9.2 Quality control of data
12. BLAST
    12.1 BLAST: Basic Local Alignment Search Tool
    14.1.4 A guide to SAM Flags
19. RNA-Seq Analysis
    19.1 The RNA-Seq puzzle
23. ChIP-Seq Analysis
    24.1 ChIP-Seq introduction
    27.3.4 A minimal BASH profile
Software tools
    28.1 Tools for accessing data sources
Sequence manipulation
    30.1 Tools for FASTQ/BAM files
Installing QC Tools
    31.1 Tools for data quality control
Installing alignment tools
    32.1 Alignment software
BBMap/BBTools suite
    34.1 Installing the BBMap suite
    36.1 RNA-Seq analysis tools
    38.2 GEM
Impressum
    39.1 About the author
Introduction
The Biostar Handbook introduces readers to bioinformatics. This new scientific discipline is positioned at
the intersection of biology, computer science, and statistical data analysis. It is dedicated to the digital
processing of genomic information.
The Handbook has been developed, improved, and refined over the last five years in a research university
setting. It is currently used as course material in an accredited PhD training program. The contents of this
book have provided the analytical foundation for hundreds of students, many of whom have become full-time bioinformaticians and work at some of the most innovative companies in the world.
The Handbook is currently being developed and used as part of the BMMB 852: Applied Bioinformatics
graduate course at Penn State. Please contact the instructor if you are enrolled in this course but have not
yet received an email with your login information.
Downloadable as a PDF
Downloadable as an eBook
What is a Biostar?
It's not a what. It's a who! And it could be you.
Just about every modern life science research project depends on large-scale data analysis. Moreover, it's
the results of these data analyses that propel new scientific discoveries. Life science research projects
thrive or wither by the insights of data analyses. Consequently, bioinformaticians are the stars of the show.
But make no mistake: it is a high-pressure, high-reward job where an individual's skills, knowledge, and
determination are each needed to carry their team to success. This Handbook was carefully designed to
give you all the skills and knowledge needed to become the star of the life science show, but the
determination is up to you.
How to use this book
1. Bioinformatics foundations
   - Data formats and repositories.
   - Sequence alignments.
   - Data visualization.
   - Unix command line usage.
The table of contents on the left allows you to jump to the corresponding sections.
The results of bioinformatic analyses are relevant for most areas of study inside the life sciences. Even if a
scientist isn't performing the analysis themselves, they need to be familiar with how bioinformatics
operates so they can accurately interpret and incorporate the findings of bioinformaticians into their work.
All scientists informing their research with bioinformatic insights should understand how it works by studying its principles, methods, and limitations, the majority of which are covered for you in this Handbook.
We believe that this book is of great utility even for those who don't plan to run the analyses themselves.
For best results, Windows 10 users will need to join the Windows Insider program (a free service offered by Microsoft) that will allow them to install the newest release of "Bash on Windows" (the Windows Subsystem for Linux).
The Biostar Handbook provides training and practical instructions for students and scientists interested in
data analysis methodologies of genome-related studies. Our goal is to enable readers to perform analyses
on data obtained from high throughput DNA sequencing instruments.
All of the Handbook's content is designed to be simple, brief, and geared towards practical application.
Bioinformatics' position at the intersection of biology, computer science, and data analysis might make it more challenging than other scientific subdisciplines, but it also means that you're exploring the frontiers of scientific knowledge, and few things are more rewarding than that!
The questions and answers in the Handbook have been carefully selected to provide you with steady,
progressive, accumulating levels of knowledge. Think of each question/answer pair as a small, well-
defined unit of instruction that builds on the previous ones.
Reading this book will teach you what bioinformatics is all about.
Running the code will teach you the skills you need to perform the analyses.
How long it takes to learn depends on your background preparation, and each person's is different. Prior training in at least one of the three fields that bioinformatics builds upon (biology, computer science, and data analysis) is recommended. The time required to master all skills also depends on how you plan to use them. Solving larger and more complex data problems will require greater skills, which need more time to develop fully.
That being said, based on several years of evaluating trainees in the field, we have come to believe that an active student would be able to perform publication-quality analyses after dedicating about 100 hours of study. This is what this book is really about: helping you put those 100 hours to good use.
What is bioinformatics
What is bioinformatics?
Bioinformatics is a new, computationally-oriented Life Science domain. Its primary goal is to make sense
of the information stored within living organisms. Bioinformatics relies on and combines concepts and
approaches from biology, computer science, and data analysis. Bioinformaticians evaluate and define their
success primarily in terms of the new insights they produce about biological processes through digitally
parsing genomic information.
Bioinformatics is a data science that investigates how information is stored within and processed
by living organisms.
In the mid-2000s, the so-called next-generation, high-throughput sequencing instruments (such as the
Illumina HiSeq) made it possible to measure the full genomic content of a cell in a single experimental run.
With that, the quantity of data shot up immensely as scientists were able to capture a snapshot of
everything that is DNA-related.
These new technologies have transformed bioinformatics into an entirely new field of data science that
builds on the "classical bioinformatics" to process, investigate, and summarize massive data sets of
extraordinary complexity.
The Human Genome Project fell squarely in the assembly category. Since its completion, scientists have assembled the genomes of thousands of other species. The genomes of many millions of species, however, remain completely unknown.
Studies that attempt to identify changes relative to known genomes fall into the resequencing field of
study. DNA mutations and variants may cause phenotypic changes like emerging diseases, changing
fitness, different survival rates, etc. For example, there are several ongoing efforts to compile all variants
present in the human genome––these efforts would fall into the resequencing category. Thanks to the
work of bioinformaticians, massive computing efforts are underway to produce clinically valuable
information from the knowledge gained through resequencing.
Living micro-organisms surround us, and we coexist with them in complex collectives that can only survive
by maintaining interdependent harmony. Classifying these mostly-unknown species of micro-organisms
by their genetic material is a fast-growing subfield of bioinformatics.
Finally, and perhaps most unexpectedly, bioinformatics methods can help us better understand biological processes, like gene expression, through quantification. In these protocols, the sequencing procedures are used to determine the relative abundances of various DNA fragments that were made to correlate with other biological processes.
Over the decades, biologists have become experts at manipulating DNA and are now able to co-opt the many naturally-occurring molecular processes to copy, translate, and reproduce DNA molecules and connect these actions to biological processes. Sequencing has opened a new window into this world; new methods and sequence manipulations are being continuously discovered. The various methods are typically named ???-Seq, for example RNA-Seq, ChIP-Seq, or RAD-Seq, to reflect what phenomenon is being captured and connected to sequencing. For example, RNA-Seq reveals messenger RNA by turning it into DNA. Sequencing this construct allows for simultaneously measuring the expression levels of all genes of a cell.
All of these techniques fall into the quantification category. Each assay uses DNA sequencing to quantify
another measure, and many are examples of connecting DNA abundances to various biological
processes.
Notably, the list now contains nearly 100 technologies. Many people, us included, believe that these
applications of sequencing are of greater importance and impact than identifying the base composition of
genomes.
Some examples of these assay technologies appear on Dr. Pachter's list.
I've been doing bioinformatics for about 10 years now. I used to joke with a friend of mine that most
of our work was converting between file formats. We don't joke about that anymore.
Jokes aside, modern bioinformatics relies heavily on file and data processing. The data sets are large and
contain complex interconnected information. A bioinformatician's job is to simplify massive datasets and
search them for the information that is relevant for the given study. Essentially, bioinformatics is the art of
finding the needle in the haystack.
Is creativity required?
Bioinformatics requires a dynamic, creative approach. Protocols should be viewed as guidelines, not as
rules that guarantee success. Following protocols by the letter is usually quite counterproductive. At best,
doing so leads to sub-optimal outcomes; at worst, it can produce misinformation that spells the end of a
research project.
Living organisms operate in immense complexity. Bioinformaticians need to recognize this complexity,
respond dynamically to variations, and understand when methods and protocols are not suited to a data
set. The myriad complexities and challenges of venturing at the frontiers of scientific knowledge always
require creativity, sensitivity, and imagination. Bioinformatics is no exception.
Unfortunately, the misconception that bioinformatics is a procedural skill that anyone can quickly add to
their toolkit rather than a scientific domain in its own right can lead some people to underestimate the
value of a bioinformatician's individual contributions to the success of a project.
Biological data will continue to pile up unless those who analyze it are recognized as creative
collaborators in need of career paths.
Bioinformatics requires multiple skill sets, extensive practice, and familiarity with multiple analytical
frameworks. Proper training, a solid foundation and an in-depth understanding of concepts are required of
anyone who wishes to develop the particular creativity needed to succeed in this field.
This need for creativity and the necessity for a bioinformatician to think "outside the box" is what this
Handbook aims to teach. We don't just want to list instructions: "do this, do that". We want to help you
establish that robust and reliable foundation that will allow you to be creative when (not if) that time comes.
As the authors of Core services: Reward bioinformaticians (Nature, 2015) have observed of their own projects,
No project was identical, and we were surprised at how common one-off requests were. There
were a few routine procedures that many people wanted, such as finding genes expressed in a
disease. But 79% of techniques applied to fewer than 20% of the projects. In other words, most
researchers came to the bioinformatics core seeking customized analysis, not a standardized
package.
In summary, this question is difficult to answer because there isn't a "typical" bioinformatics project. It is
quite common for projects to deviate from the standardized workflow.
Authors and contributors
It is a simple process that works through GitHub via simple, text-based Markdown files. The only permission we ask from authors is the right to publish their content under our own terms (see below). Please note that this right cannot be revoked later by the author.
The only exception to this rule is that authors and contributors to the book retain re-publishing rights for the material of which they are the principal (primary) author and may re-distribute that content under other terms of their choosing. We define principal author, as is typically done in academia, as the person who performed the majority of the work and is primarily responsible for its content.
Acknowledgements
The Unix Bootcamp section is based on the Command-line Bootcamp by Keith Bradnam.
The word cloud was created by Guillaume Fillion from the abstracts of Bioinformatics from 2014 (a total of around 1,000 articles).
Latest updates
1. Introduction to metagenomics
2. How to analyze metagenomics data
3. Taxonomies and classification
4. Microbial sequence data
5. Classifying 16S sequences
6. Human Metagenome Demonstration Projects
7. Classifying whole-genome sequences
8. A realistic sample: Alaska Oil Reservoir
9. Tool installation
Extensive editing of various chapters. Special thanks to Madelaine Gogol and Paige M. Miller.
April 6, 2017
Download: Biostar-Handbook-April-2017.zip
1. ChIP-Seq introduction
2. ChIP-Seq alignments
3. ChIP-Seq peak calling
4. ChIP-Seq motifs
5. ChIP-Seq analysis
6. ChIP-Seq downstream 1
7. ChIP-Seq downstream 2
Download: Biostar-Handbook-January-2017.zip
In addition to the new chapter, the editors have performed extensive proofreading and editing of the prior content. Special thanks to Madelaine Gogol and Ram Srinivasan for their effort.
Download: biostar-handbook-14-12-2016.zip
December 5, 2016
The first version of the book is released.
ChIP-Seq analysis
The author of this guide is Ming Tang. The material was developed for the Biostar Handbook.
GitHub profile
Diving into Genetics and Genomics
What is ChIP-seq?
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) is a
technique to map genome-wide transcription factor binding sites and histone-modification enriched
regions.
Briefly, DNA-binding proteins and DNA (chromatin) are cross-linked by formaldehyde, and the chromatin is sheared by sonication into small fragments (typically 200 ~ 600 bp). The protein-DNA complex is immunoprecipitated with an antibody specific to the protein of interest. Then the DNA is purified and made into a library for sequencing. After aligning the sequencing reads to a reference genome, the genomic regions where many reads are enriched are where the protein of interest binds. ChIP-seq is a critical method for dissecting the regulatory mechanisms of gene expression.
1. Quality control of your fastq read files using tools such as FASTQC. Read our previous section:
Visualizing sequencing data quality.
2. Aligning fastq reads to the reference genome. The most popular read aligners are bwa and bowtie.
Read the section Short Read Aligners. bowtie has two versions: bowtie1 and bowtie2. bowtie2 is
better for read lengths greater than 50 bp. It is still very common to use a 36 bp single-end sequencing
library for ChIP-seq, so bowtie1 is preferred for ChIP-seq with short reads.
One problem with the IgG control is that if too little DNA is recovered after immunoprecipitation (IP), the sequencing library will be of low complexity, and binding sites identified using this control could be biased. An input DNA control is ideal in most cases. It represents all the chromatin that was available for IP. Read this Biostars post for discussion.
Data sets
To illustrate all the analysis steps, we are going to use a public data set from two papers:
Genome-wide association between YAP/TAZ/TEAD and AP-1 at enhancers drives oncogenic growth.
Francesca Zanconato et al. 2014. Nature Cell Biology. We will use transcription factor YAP1 ChIP-seq
data in MDA-MB-231 breast cancer cells from this paper.
Nucleosome positioning and histone modifications define relationships between regulatory elements
and nearby gene expression in breast epithelial cells. Suhn Kyong Rhie et al. 2014. BMC Genomics.
We will use H3K27ac ChIP-seq data in MDA-MB-231 cells from this paper.
The data sets can be found with accession number GSE66081 and GSE49651.
# for YAP1. It takes time; alternatively, download the fastq directly from EGA: https://www.ebi.ac.uk/ega/home
fastq-dump SRR1810900

# for H3K27ac
fastq-dump SRR949140

# for the Input control (used later in the analysis)
fastq-dump SRR949142
In general, do not change the names of files. Renaming can be error prone, especially when you have many samples. Instead, prepare a name-mapping file, and interpret the final results with the mapped names. For demonstration purposes, I will rename the files to more meaningful names.
mv SRR1810900.fastq YAP1.fastq
mv SRR949140.fastq H3K27ac.fastq
mv SRR949142.fastq Input.fastq
Now, run FASTQC on all the samples. Read the previous section on looping over files, or use GNU parallel.
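The command block itself did not survive in this copy; a minimal sketch could look like the following, written as a dry run that only prints each command so you can inspect it first (drop the echo to actually run FastQC):

```shell
# print the FastQC command for each of the renamed samples
for fq in YAP1.fastq H3K27ac.fastq Input.fastq; do
    echo fastqc "$fq"
done

# with GNU parallel installed, the loop collapses to:
# parallel fastqc ::: *.fastq
```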
Overall, the sequencing qualities are good for all three samples, so I will skip trimming low-quality bases and go ahead with the original fastq files for alignment.
For demonstration purposes, I will write out all the commands explicitly, but the GNU parallel trick can be used to save you some typing. In the end, I will introduce a shell script that chains all the steps together and shows you how to write reusable code.
cd ../results/bams
samtools view -bS YAP1.sam | samtools sort -@ 4 - -T YAP1 -o YAP1.sorted.bam
# index the sorted bam
samtools index YAP1.sorted.bam
In addition to bam files, macs can take other input file formats such as bed and sam. Type macs in your terminal and press Enter to see the full help.
Parameter settings can differ depending on your own data sets. There are no one-size-fits-all settings. You may want to load the called peaks into IGV and visually inspect them.
Note: Tao Liu is developing MACS2. It has a --broad option to call broad peaks for histone-modification ChIP-seq (e.g. H3K36me3). However, in my experience, macs14 is still widely used and performs very well.
The fancy "super-enhancer" term was first introduced by Richard Young's lab at the Whitehead Institute. Basically, super-enhancers are clusters of enhancers (most commonly defined by H3K27ac peaks) stitched together if they are within 12.5 kb of each other. The concept of the super-enhancer is NOT new. One of the most famous examples is the Locus Control Region (LCR) that controls globin gene expression, and this has been known for decades. If you are interested in the controversy, please read a review in Nature Genetics, What are super-enhancers?, and another comment in Nature Genetics, Is a super-enhancer greater than the sum of its parts?
Super enhancer discovery in HOMER emulates the original strategy used by the Young lab. First,
peaks are found just like any other ChIP-Seq data set. Then, peaks found within a given distance
are 'stitched' together into larger regions (by default this is set at 12.5 kb). The super enhancer
signal of each of these regions is then determined by the total normalized number of reads minus the
number of normalized reads in the input. These regions are then sorted by their score, normalized
to the highest score and the number of putative enhancer regions, and then super enhancers are
identified as regions past the point where the slope is greater than 1.
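The stitching step described in the quote can be sketched with plain awk on a sorted BED file. This is only a toy illustration of the distance rule (made-up coordinates; real tools such as ROSE also rank the stitched regions by input-subtracted signal):

```shell
# toy sorted peak list: chrom, start, end
printf 'chr1\t100\t200\nchr1\t5000\t5200\nchr1\t30000\t30100\n' > toy_peaks.bed

# stitch neighboring peaks on the same chromosome when the gap is <= 12500 bp
awk 'BEGIN{OFS="\t"}
     NR==1 {c=$1; s=$2; e=$3; next}
     $1==c && $2-e<=12500 {e=$3; next}
     {print c,s,e; c=$1; s=$2; e=$3}
     END{print c,s,e}' toy_peaks.bed
# -> chr1  100    5200   (first two peaks stitched; their gap of 4800 bp is under 12.5 kb)
# -> chr1  30000  30100  (too far from the previous peak; kept on its own)
```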
HOMER is a very nice tool for finding enriched peaks, doing motif analysis, and much more. It requires making a tag directory first. ROSE is the tool from Richard Young's lab. Since it works directly with bam files, we will use it for demonstration.
cd ../../software/
wget https://bitbucket.org/young_computation/rose/get/1a9bb86b5464.zip
unzip 1a9bb86b5464.zip
cd young_computation-rose-1a9bb86b5464/
The bigWig format is for display of dense, continuous data that will be displayed in the Genome
Browser as a graph. BigWig files are created initially from wiggle (wig) type files, using the program
wigToBigWig. The resulting bigWig files are in an indexed binary format. The main advantage of
the bigWig files is that only the portions of the files needed to display a particular region are
transferred to UCSC, so for large data sets bigWig is considerably faster than regular wiggle files.
Make sure you understand the other two closely related file formats: bedgraph and wig file.
macs14 can generate a bedgraph file when you specify --bdg, but the resulting bedgraph is NOT normalized to sequencing depth. Instead, we are going to use another nice tool, deeptools, for this task. It is a very versatile tool and can do many other things.
# install deeptools
conda install -c bioconda deeptools
# normalized bigWig with Reads Per Kilobase per Million mapped reads (RPKM)
bamCoverage -b YAP1.sorted.bam --normalizeUsingRPKM --binSize 30 --smoothLength 300 -p 10 --extendReads 200 -o YAP1.bw
I set --extendReads to 200 bp, which is the fragment length. Why should we extend the reads? Because in a real ChIP-seq experiment, we fragment the genome into small fragments of ~200 bp and pull down the protein-bound DNA with antibodies. However, we only sequence the first 36 bp (50 bp, or 100 bp, depending on your library) of the pulled-down DNA. To recapitulate the real experiment, we need to extend the reads to the fragment size.
You can read more details in: Why do we need to extend the reads to fragment length/200bp?
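The extension itself is simple interval arithmetic. Here is a toy awk sketch on BED-like reads (made-up coordinates) that grows each 36 bp read to 200 bp in the 3' direction of its strand; bamCoverage does this internally, so this is only for illustration:

```shell
# toy bed of 36 bp single-end reads: chrom, start, end, name, score, strand
printf 'chr1\t100\t136\tr1\t0\t+\nchr1\t500\t536\tr2\t0\t-\n' > toy_reads.bed

# plus-strand reads extend their end; minus-strand reads extend their start
awk 'BEGIN{OFS="\t"}
     $6=="+" {$3=$2+200}
     $6=="-" {$2=($3-200<0 ? 0 : $3-200)}
     {print}' toy_reads.bed
# -> chr1  100  300  r1  0  +
# -> chr1  336  536  r2  0  -
```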
The strategy is to write a shell script that chains all the steps together. See the section on writing scripts in the handbook.
First, you need to think about what the workflow should be. We will need to download the fastq files and convert them to bam files; next, we will call peaks for a pair of IP and Input files. Sometimes you may get sorted bam files, which can be used directly to call peaks. It is therefore reasonable to have one shell script for each task.
#! /bin/bash
# the reference file path, change to where you put the reference
REF="/risapps/reference/bowtie1/hg19"
For more on set -euo pipefail , read this post Use the Unofficial Bash Strict Mode (Unless You Looove
Debugging).
Execute:
# YAP1
./sra2bam.sh SRR1810900
# H3K27ac
./sra2bam.sh SRR949140
# Input
./sra2bam.sh SRR949142
Now, with the sra2bam.sh script, you are spared from typing different file names for aligning fastqs and converting sam files. Moreover, it can be used to process any number of sra files.
If you have a sra_id.txt file with one sra id on each line, you can:
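A minimal sketch of that loop, again written as a dry run that prints each command (it assumes the sra2bam.sh script from above sits in the current directory; drop the echo to execute):

```shell
# the three ids used in this chapter
printf '%s\n' SRR1810900 SRR949140 SRR949142 > sra_id.txt

# one sra2bam.sh call per line of the file
while read -r id; do
    echo ./sra2bam.sh "$id"
done < sra_id.txt
# -> ./sra2bam.sh SRR1810900
# -> ./sra2bam.sh SRR949140
# -> ./sra2bam.sh SRR949142
```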
#! /bin/bash
IP_bam=$1
Input_bam=$2
oprefix=$(basename "${IP_bam}" .sorted.bam)
The sra2bam.sh script will generate the indexed bam files; those bam files can be fed into bam2peaks.sh to call peaks. In our example data set, we only have two IPs and one Input. We can call peaks by:
Imagine we have a lot of bam files generated by sra2bam.sh and we want to call peaks for all of them. How should we stream the workflow?
Because calling peaks involves a pair of files (an IP and an Input control), we need to make a tab-delimited text file with a pair of file names on each line.
cat bam_names.txt
SRR949140.sorted.bam SRR949142.sorted.bam
SRR1810900.sorted.bam SRR949142.sorted.bam
Now, we can loop over the bam_names.txt file one by one and call peaks:
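A sketch of the loop, shown as a dry run that recreates the pair file from above and prints the bam2peaks.sh call for each line (drop the echo to execute):

```shell
# recreate the tab-delimited IP/Input pair file
printf '%s\t%s\n' \
    SRR949140.sorted.bam  SRR949142.sorted.bam \
    SRR1810900.sorted.bam SRR949142.sorted.bam > bam_names.txt

# one peak-calling run per IP/Input pair
while read -r ip input; do
    echo ./bam2peaks.sh "$ip" "$input"
done < bam_names.txt
# -> ./bam2peaks.sh SRR949140.sorted.bam SRR949142.sorted.bam
# -> ./bam2peaks.sh SRR1810900.sorted.bam SRR949142.sorted.bam
```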
#! /bin/bash

# show help
show_help(){
cat << EOF
This is a wrapper to align fastq reads to bam files for ChIP-seq experiments.
usage: ${0##*/} -f <fastq file> -r <h or m>
    -h display this help and exit
EOF
}
## parsing arguments
while getopts ":hf:r:" opt; do
    case "$opt" in
        h) show_help; exit 0;;
        f) fq=$OPTARG;;
        r) REF=$OPTARG;;
        '?') echo "Invalid option $OPTARG"; show_help >&2; exit 1;;
    esac
done

## pick the bowtie index from the -r flag (example paths; point them at your own bowtie1 indexes)
case "$REF" in
    h) ref_genome="/risapps/reference/bowtie1/hg19";;
    m) ref_genome="/risapps/reference/bowtie1/mm10";;
esac

## mapping
oprefix=$(basename "${fq}" .fastq)
bowtie -p 10 --best --chunkmbs 320 ${ref_genome} -q "${fq}" -S "${oprefix}".sam

## convert to a sorted, indexed bam, then remove the intermediate sam
samtools view -bS "${oprefix}".sam | samtools sort -@ 4 - -T "${oprefix}" -o "${oprefix}".sorted.bam
samtools index "${oprefix}".sorted.bam
rm "${oprefix}".sam
With this script, you can map not only ChIP-seq data for human but also for mouse. You can surely add more arguments to set the bowtie mapping thread number and the samtools sort thread number (the script above uses 10 and 4, respectively). You can read more on argument handling for shell scripts: small getopts tutorial.
Congratulations! You have come this far with me. If you want to take your analysis to the next level of reproducibility and flexibility, read our Makefile section.
These regions are often found at specific types of repeats such as centromeres, telomeres, and satellite repeats. It is especially important to remove these regions when computing measures of similarity, such as Pearson correlation, between genome-wide tracks, since these measures are especially affected by outliers.
wget https://www.encodeproject.org/files/ENCFF001TDO/@@download/ENCFF001TDO.bed.gz
Note that more and more sequencing projects are moving their reference genome to the latest version, GRCh38. The GRCh38 blacklist was uploaded by Anshul Kundaje on 2016-10-16. The file can be downloaded by:
wget https://www.encodeproject.org/files/ENCFF419RSJ/@@download/ENCFF419RSJ.bed.gz
Open IGV, click File, then Load From File, and choose the peak files ending with bed and the raw signal files ending with bw. You can change the scale of the bigWig tracks by right-clicking the tracks and choosing Set Data Range. Now you can browse through the genome or go to your favorite genes.
One other example from the original Nature Cell Biology paper, at the RAD18 locus:
We see that macs did a pretty good job of identifying enriched regions for histone modifications and transcription factor (TF) binding peaks. However, we also see that some potential YAP1 binding sites are missed. Fine-tuning the peak-calling parameters may improve the results.
ChIP-Seq downstream 1
GitHub profile
Diving into Genetics and Genomics
There are many sub-commands for bedtools , but the most commonly used one is bedtools intersect :
# make sure you are inside the folder containing the peak files and the black-listed region file
What's happening here? Why do only 1772 unique H3K27ac peaks overlap with YAP1 peaks?
It turns out that bedtools will output an H3K27ac peak every time there is an overlap with a YAP1 peak, and it is possible that the same H3K27ac peak (these tend to be really broad peaks) overlaps with multiple YAP1 peaks. With that in mind, I always do sort | uniq following a bedtools command.
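To see why the deduplication matters, here is a toy stand-in for bedtools output (made-up coordinates) where one broad H3K27ac peak overlapped two different YAP1 peaks and was therefore reported twice:

```shell
# the same broad peak is emitted once per overlapping YAP1 peak
printf 'chr1\t1000\t9000\nchr1\t1000\t9000\nchr2\t500\t900\n' | sort | uniq
# -> chr1  1000  9000   (the duplicate line is collapsed)
# -> chr2  500   900
```

bedtools intersect also has a -u flag that reports each A interval at most once when any overlap is found, which achieves the same effect directly.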
One of the under-appreciated tools is BEDOPS. It has many nice features as well, and the documentation is also very good. Read this Biostars post: Question: Bedtools Compare Multiple Bed Files?
I will show you how to do peak annotation using the R Bioconductor package ChIPseeker, developed by Guangchuang Yu.
> # load the packages, install them following the instructions in the links above
> library(ChIPseeker)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(rtracklayer)
> library("org.Hs.eg.db")
>
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(clusterProfiler)
## GO term enrichment
ego <- enrichGO(gene = as.data.frame(YAP1_anno)$SYMBOL,
OrgDb = org.Hs.eg.db,
keytype = "SYMBOL",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
# visualization
dotplot(ego, showCategory = 20)
Note that we used all the genes associated with a peak for the pathway analysis. This may not be optimal, especially when you have a lot of peaks. One alternative is to filter the genes/peaks by some rules first and then do the enrichment analysis.
e.g.
library(dplyr)
## only consider genes within 5000 bp of the peaks
as.data.frame(YAP1_anno) %>% dplyr::filter(abs(distanceToTSS) < 5000)
## or you can rank the peaks by the P-value and choose the top 1000 peaks (this number is arbitrary)
# score column is the -10*log10 Pvalue
as.data.frame(YAP1_anno) %>% dplyr::arrange(desc(score)) %>% head(n =1000)
The underlying mechanism is to compare the peak-associated gene set with the various annotated pathway gene sets to see whether the peak-associated genes are overrepresented in any of the known gene sets, using a hypergeometric test. Of course, many other complex algorithms have been developed for this type of analysis. Further reading: A Comparison of Gene Set Analysis Methods in Terms of Sensitivity, Prioritization and Specificity.
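To make the hypergeometric test concrete (this is the standard formula, not a claim about clusterProfiler's internal implementation): if $N$ is the total number of annotated genes, $K$ of them belong to a given pathway, $n$ genes are associated with peaks, and $k$ of those fall in the pathway, then the one-sided enrichment p-value is

$$ p = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} $$

that is, the probability of drawing at least $k$ pathway genes when sampling $n$ genes without replacement from the $N$ annotated genes.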
MEME-ChIP needs 500 bp DNA sequences around the YAP1 binding summits (the position with the highest binding signal within a peak), which are output by macs14 when you call peaks.
# get the coordinates of 500bp centered on the summit of the YAP1 peaks
awk 'BEGIN{OFS="\t"} {$2=$2-249; $3=$3+250; print}' YAP1_summits.bed > YAP1_500bp_summits.bed
# you need a fasta file of the whole genome, and a bed file containing the coordinates
# the whole genome fasta file can be obtained by:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
cat *fa.gz > UCSC_hg19_genome.fa.gz
gunzip UCSC_hg19_genome.fa.gz
I will just keep the defaults for all the options. It takes a while to finish.
YAP1 is not a DNA-binding transcription factor itself; rather, it is in a complex with TEAD, which is a DNA-binding transcription factor. The first motif is a TEAD binding motif in tandem, and the second motif is an AP-1 binding motif.
One downside of MEME-ChIP is that you can only input DNA sequences of the same length. Instead, you can use Homer's findMotifsGenome.pl with the actual peak lengths for finding motifs:
There is a review paper on this topic: A comprehensive comparison of tools for differential ChIP-seq
analysis. You can choose tools based on various rules:
ChIP-Seq downstream 2
GitHub profile
Diving into Genetics and Genomics
Usually one has a matrix and then plots it using R functions such as heatmap.2, pheatmap, or Heatmap. With those R packages it is very easy to make heatmaps, but you do need to read the package documentation carefully to understand the details of the arguments. Read A tale of two heatmap functions.
For ChIP-seq data, one wants to plot the ChIP-seq signal around genomic features (promoters, CpG islands, enhancers, and gene bodies). To do this, you first take the regions of, say, 5 kb upstream and downstream of all (around 30,000) transcription start sites (TSS), and then divide each 10 kb region into 100 bins (100 bp per bin). Count the ChIP-seq signal (read number) in each bin. Now you have a data matrix of 30,000 (promoters) x 100 (bins), and you can plot the matrix using a heatmap function mentioned above.
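The binning arithmetic can be sketched for a single 10 kb window: integer division by the bin width turns a position into a bin index (toy read-midpoint positions, made up for illustration):

```shell
# toy read midpoints, as bp offsets within one 10 kb window around a TSS
printf '%s\n' 150 155 9990 |
awk '{ count[int($1/100)]++ }   # 100 bp bins -> indexes 0..99
     END{ for (b in count) print b, count[b] }' |
sort -k1,1n
# -> 1 2    (two reads fall in bin 1, positions 100-199)
# -> 99 1
```

Repeating this for every TSS window gives one row of the 30,000 x 100 matrix described above.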
You can of course do it from scratch, but one of the taboos of bioinformatics is re-inventing the wheel. Most likely, someone has written a package for this type of analysis. Indeed, check this Biostars post for all the available tools. Choose the right tool for yourself; if you are not that familiar with R, you can use some GUI tools. EaSeq is a pretty powerful tool for Windows users.
I personally like EnrichedHeatmap by Zuguang Gu the most because it gives me the most flexibility to control the look of my heatmap. It is based on ComplexHeatmap, which I highly recommend learning how to use. Below, I will walk you through how to make a heatmap with the EnrichedHeatmap package.
# read the bigwig files into GRanges objects; this can be slow since a bigwig file is several hundred MB
# you can use the which argument to restrict the data to certain regions
H3K27ac.bw<- import("~/ChIP-seq/results/bams/H3K27ac.bw", format = "BigWig")
YAP1.bw<- import("~/ChIP-seq/results/YAP1.bw", format = "BigWig")
We want to plot the H3K27ac signal and YAP1 signal 5kb flanking the center of YAP1 peaks.
## mapping colors
library(circlize)
The YAP1_mat and H3K27ac_mat objects are just like regular matrices: you can use colMeans to get the averages and plot them in the same figure with ggplot2. We can even add 95% confidence intervals to the graph.
library(tidyverse)
combine_both<- bind_rows(YAP1_mean, H3K27ac_mean)
As you can see, when you really know the power of the R programming language, you can do much more!
If you scroll to the Regulation tracks (hg19 version), you will find many ChIP-seq data sets available for you to browse.
Now you can check various boxes to make the data show up in the browser. Quite useful! Many of the data sets are from ENCODE.
I recommend watching the tutorials from OpenHelix to take full advantage of this great resource.
Cistrome
Cistrome is a project maintained by Sheirly Liu's lab in Harvard. It has a lot more data sets including
ENCODE and other public data sets from GEO. Some features are very friendly to wet biologists.