peakable 0.99.22
The peakable package provides convenient methods for quality control and downstream analysis of CUT&RUN and CUT&Tag peaksets. The functionality breaks down into three parts:
GRanges
The BED file is the common format for storing peak intervals. The Bioconductor package rtracklayer provides tools for importing and exporting standardized BED6 and BED12 files, while plyranges offers functions specifically for MACS2 narrow peaks (read_narrowpeaks() and write_narrowpeaks()).
The standardized BED6 columns are chromosome, start, end, name, score, and strand. MACS2 narrow and broad peaks follow this standard format with additional columns: signalValue, pValue, qValue, and peak (narrow peaks only). SEACR peaks, on the other hand, use a BED3-based format with the columns chromosome, start, and end, followed by the additional columns AUC, max.signal, and max.signal.region. The peakable package provides convenient tools for both custom BED formats.
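The column layouts described above can be written out explicitly. The following is a minimal base-R sketch, not how peakable or rtracklayer parse files: the column name spellings (for example `chrom`) and the toy record are assumptions made for illustration. It also shows the usual conversion from 0-based BED starts to the 1-based starts used by Bioconductor ranges.

```r
# Column layouts as character vectors (name spellings are illustrative).
bed6_cols       <- c("chrom", "start", "end", "name", "score", "strand")
narrowpeak_cols <- c(bed6_cols, "signalValue", "pValue", "qValue", "peak")
broadpeak_cols  <- c(bed6_cols, "signalValue", "pValue", "qValue")
seacr_cols      <- c("chrom", "start", "end",
                     "AUC", "max.signal", "max.signal.region")

# A made-up, tab-separated narrowPeak record (hypothetical values).
toy_line <- "chr2\t11303\t12188\tpeak_1\t959\t.\t37.3156\t99.6611\t95.9971\t503"
toy_peak <- read.table(text = toy_line, sep = "\t",
                       col.names = narrowpeak_cols,
                       stringsAsFactors = FALSE)

# BED starts are 0-based; Bioconductor ranges are 1-based.
toy_peak$start_1based <- toy_peak$start + 1L
toy_peak
```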
In Bioconductor, imported peaks are represented as GRanges instances, a standard data structure for genomic intervals. The plyranges package provides a dplyr-like interface for arithmetic on GRanges. For more information, see the plyranges documentation.
library(rtracklayer)
library(plyranges)
library(dplyr)
library(ggplot2)
# if peakable is not installed yet:
# devtools::install_github('chaochaowong/peakable')
library(peakable)
The peakable package provides functions for importing and exporting MACS2 and SEACR custom BED files.
While the plyranges package offers several functions for importing and exporting custom BED files, the peakable package provides additional tools to clean and organize the imported peaks.
read_macs2_narrow() and read_macs2_broad() are wrappers around rtracklayer::import.bed() that import MACS2 narrow and broad peak files in the BED6+4 format. The imported intervals are returned as GRanges instances.
Import MACS2 narrowPeak files and exclude non-essential chromosomes, such as poorly annotated chromosomes (“chrUn_…”) and chrM:
# wrapper function of rtracklayer::import.bed()
narrow_file <-
  system.file('extdata',
              'chr2_Rep1_H1_CTCF_peaks.narrowPeak',
              package='peakable')
gr <- read_macs2_narrow(narrow_file,
                        drop_chrM = TRUE,
                        keep_standard_chrom = TRUE,
                        species = 'Homo_sapiens')
gr
## GRanges object with 5467 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 11304-12188 * | GSM3391651_Rep1_H1_C.. 959
## [2] chr2 18466-19108 * | GSM3391651_Rep1_H1_C.. 118
## [3] chr2 142229-142507 * | GSM3391651_Rep1_H1_C.. 47
## [4] chr2 152119-152336 * | GSM3391651_Rep1_H1_C.. 92
## [5] chr2 246399-246881 * | GSM3391651_Rep1_H1_C.. 105
## ... ... ... ... . ... ...
## [5463] chr21 46325309-46325711 * | GSM3391651_Rep1_H1_C.. 24
## [5464] chr21 46423906-46424202 * | GSM3391651_Rep1_H1_C.. 47
## [5465] chr21 46635486-46636005 * | GSM3391651_Rep1_H1_C.. 69
## [5466] chr21 46639760-46640238 * | GSM3391651_Rep1_H1_C.. 104
## [5467] chr21 46660714-46661481 * | GSM3391651_Rep1_H1_C.. 305
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 37.31560 99.66110 95.99710 503
## [2] 8.61128 14.36300 11.85760 384
## [3] 5.16677 7.01562 4.76958 19
## [4] 6.95545 11.63870 9.23085 124
## [5] 8.03720 13.05580 10.59070 275
## ... ... ... ... ...
## [5463] 3.56330 4.50035 2.43234 88
## [5464] 5.16677 7.01562 4.76958 66
## [5465] 6.31494 9.32198 6.97675 77
## [5466] 7.94158 12.89480 10.44870 202
## [5467] 16.07440 33.42540 30.53640 476
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Alternatively, plyranges::read_narrowpeaks() also imports MACS2 narrowPeak files.
The extract_summit_macs2() function extracts the summit from each peak range:
# extract summit for MACS2 narrowPeaks
summit_macs2 <- extract_summit_macs2(gr)
summit_macs2
## GRanges object with 5467 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 11807 * | GSM3391651_Rep1_H1_C.. 959
## [2] chr2 18850 * | GSM3391651_Rep1_H1_C.. 118
## [3] chr2 142248 * | GSM3391651_Rep1_H1_C.. 47
## [4] chr2 152243 * | GSM3391651_Rep1_H1_C.. 92
## [5] chr2 246674 * | GSM3391651_Rep1_H1_C.. 105
## ... ... ... ... . ... ...
## [5463] chr21 46325397 * | GSM3391651_Rep1_H1_C.. 24
## [5464] chr21 46423972 * | GSM3391651_Rep1_H1_C.. 47
## [5465] chr21 46635563 * | GSM3391651_Rep1_H1_C.. 69
## [5466] chr21 46639962 * | GSM3391651_Rep1_H1_C.. 104
## [5467] chr21 46661190 * | GSM3391651_Rep1_H1_C.. 305
## signalValue pValue qValue
## <numeric> <numeric> <numeric>
## [1] 37.31560 99.66110 95.99710
## [2] 8.61128 14.36300 11.85760
## [3] 5.16677 7.01562 4.76958
## [4] 6.95545 11.63870 9.23085
## [5] 8.03720 13.05580 10.59070
## ... ... ... ...
## [5463] 3.56330 4.50035 2.43234
## [5464] 5.16677 7.01562 4.76958
## [5465] 6.31494 9.32198 6.97675
## [5466] 7.94158 12.89480 10.44870
## [5467] 16.07440 33.42540 30.53640
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
# to export a summit GRanges to a BED file, use rtracklayer::export.bed(summit_macs2, file=...)
To export a summit GRanges to a BED file, use rtracklayer::export.bed(). To export MACS2 narrowPeaks and preserve the format, use plyranges::write_narrowpeaks().
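The summit positions printed above are consistent with adding the narrowPeak `peak` column (the offset of the summit from the peak start) to the 1-based start of each range. The sketch below reproduces the first three summits with base R. It is inferred from the printed rows and is not taken from the extract_summit_macs2() source, so treat the relationship as an assumption.

```r
# First three peaks from the printed narrowPeak GRanges (1-based starts).
peaks <- data.frame(
  seqnames = c("chr2", "chr2", "chr2"),
  start    = c(11304L, 18466L, 142229L),
  end      = c(12188L, 19108L, 142507L),
  peak     = c(503L, 384L, 19L)   # summit offset from the peak start
)

# Assumed relationship: summit position = start + summit offset.
peaks$summit <- peaks$start + peaks$peak
peaks$summit  # 11807 18850 142248, matching the summit ranges shown above
```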
read_macs2_broad() imports MACS2 broadPeaks files:
broad_file <- system.file('extdata',
                          'chr2_Rep1_H1_H3K27me3_peaks.broadPeak',
                          package='peakable')
gr <- read_macs2_broad(broad_file,
                       drop_chrM = TRUE,
                       keep_standard_chrom = TRUE,
                       species = 'Homo_sapiens')
gr
## GRanges object with 1711 ranges and 5 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 44716-45426 * | GSM3391653_Rep1_H1_H.. 39
## [2] chr2 284603-291011 * | GSM3391653_Rep1_H1_H.. 113
## [3] chr2 467135-470997 * | GSM3391653_Rep1_H1_H.. 226
## [4] chr2 504630-509340 * | GSM3391653_Rep1_H1_H.. 33
## [5] chr2 514199-517591 * | GSM3391653_Rep1_H1_H.. 28
## ... ... ... ... . ... ...
## [1707] chr21 45644352-45645933 * | GSM3391653_Rep1_H1_H.. 18
## [1708] chr21 45759617-45759948 * | GSM3391653_Rep1_H1_H.. 12
## [1709] chr21 45972489-45975189 * | GSM3391653_Rep1_H1_H.. 29
## [1710] chr21 45982156-45982536 * | GSM3391653_Rep1_H1_H.. 19
## [1711] chr21 46667096-46668833 * | GSM3391653_Rep1_H1_H.. 56
## signalValue pValue qValue
## <numeric> <numeric> <numeric>
## [1] 4.51428 5.91375 3.90930
## [2] 7.39375 13.49210 11.36260
## [3] 11.06400 24.87600 22.64710
## [4] 4.10550 5.37188 3.38826
## [5] 4.06385 4.84263 2.86215
## ... ... ... ...
## [1707] 3.04040 3.78996 1.85959
## [1708] 3.17063 3.12584 1.20869
## [1709] 4.06586 4.97490 2.99319
## [1710] 3.60220 3.86604 1.91452
## [1711] 5.11228 7.73635 5.69934
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
TODO: peakable::write_broadPeaks() still needs work.
The peakable package provides functions, such as read_seacr(), extract_summit_seacr(), and write_seacr(), for importing and exporting SEACR-specific peak files.
The metadata columns of SEACR peaks are “AUC”, “max.signal”, and “max.signal.region”, corresponding to the area under the curve of the peak, the maximum signal within the peak, and the summit region of the peak.
seacr_file <- system.file('extdata',
                          'chr2_Rep1_H1_CTCF.stringent.bed',
                          package='peakable')
seacr_gr <- read_seacr(seacr_file,
                       drop_chrM = TRUE,
                       keep_standard_chrom = TRUE,
                       species = 'Homo_sapiens')
seacr_gr
## GRanges object with 3095 ranges and 3 metadata columns:
## seqnames ranges strand | AUC max.signal
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr2 11295-12358 * | 5313.70 13.04530
## [2] chr2 263941-265688 * | 1807.99 3.66898
## [3] chr2 500734-501990 * | 1489.81 3.05748
## [4] chr2 528753-530212 * | 2486.34 6.11496
## [5] chr2 636663-637657 * | 1635.96 3.87281
## ... ... ... ... . ... ...
## [3091] chr21 45879950-45881616 * | 1957.60 2.85365
## [3092] chr21 45970260-45971678 * | 1585.81 4.68814
## [3093] chr21 45972936-45974986 * | 1703.22 1.83449
## [3094] chr21 46228266-46229804 * | 2040.97 2.64982
## [3095] chr21 46660612-46661533 * | 2540.77 5.50347
## max.signal.region
## <character>
## [1] chr2:11802-11811
## [2] chr2:264788-264851
## [3] chr2:501385-501389
## [4] chr2:529386-529390
## [5] chr2:637234-637239
## ... ...
## [3091] chr21:45880606-45880..
## [3092] chr21:45970732-45970..
## [3093] chr21:45973768-45973..
## [3094] chr21:46228746-46228..
## [3095] chr21:46661173-46661..
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
The extract_summit_seacr() function extracts the summit region from the SEACR peaks:
summit_seacr <- extract_summit_seacr(seacr_gr)
summit_seacr
## GRanges object with 3095 ranges and 4 metadata columns:
## seqnames ranges strand | name AUC max.signal
## <Rle> <IRanges> <Rle> | <character> <numeric> <numeric>
## [1] chr2 11802-11811 * | peakname_1 5313.70 13.04530
## [2] chr2 264788-264851 * | peakname_2 1807.99 3.66898
## [3] chr2 501385-501389 * | peakname_3 1489.81 3.05748
## [4] chr2 529386-529390 * | peakname_4 2486.34 6.11496
## [5] chr2 637234-637239 * | peakname_5 1635.96 3.87281
## ... ... ... ... . ... ... ...
## [3091] chr21 45880606-45880843 * | peakname_3091 1957.60 2.85365
## [3092] chr21 45970732-45970740 * | peakname_3092 1585.81 4.68814
## [3093] chr21 45973768-45973855 * | peakname_3093 1703.22 1.83449
## [3094] chr21 46228746-46228762 * | peakname_3094 2040.97 2.64982
## [3095] chr21 46661173-46661272 * | peakname_3095 2540.77 5.50347
## itemRgb
## <character>
## [1] #0000FF
## [2] #0000FF
## [3] #0000FF
## [4] #0000FF
## [5] #0000FF
## ... ...
## [3091] #0000FF
## [3092] #0000FF
## [3093] #0000FF
## [3094] #0000FF
## [3095] #0000FF
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
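The summit ranges above match the `max.signal.region` strings of the imported peaks, which have the form `chrom:start-end`. As a rough illustration of that correspondence (not the actual extract_summit_seacr() implementation), the strings can be split into coordinates with base R:

```r
# max.signal.region strings of the first three SEACR peaks shown above.
region <- c("chr2:11802-11811", "chr2:264788-264851", "chr2:501385-501389")

# Split "chrom:start-end" into its three fields with a regular expression.
pattern <- "^(.+):([0-9]+)-([0-9]+)$"
summit <- data.frame(
  seqnames = sub(pattern, "\\1", region),
  start    = as.integer(sub(pattern, "\\2", region)),
  end      = as.integer(sub(pattern, "\\3", region))
)

# Width of each summit region, counting both end points.
summit$width <- summit$end - summit$start + 1L
summit
```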
The write_seacr() function exports the SEACR peaks in SEACR-specific bed format:
write_seacr(seacr_gr, file=file.path('./seacr_peaks.bed'))
The peakable package provides two methods for evaluating the correlation of peak ranges across samples of different antibodies and the similarity between biological replicates.
Both methods involve constructing a peak-hit matrix from a collection of peaksets across the samples of interest:
Concepts:
Figure 1: Peak-hit matrix
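To make the idea of a peak-hit matrix concrete, here is a toy base-R sketch. It assumes that rows are consolidated regions (the union of all peaks, with overlapping peaks merged), columns are samples, and an entry is 1 when the sample has a peak overlapping the region. The exact construction used by peakable may differ; the coordinates, the sample names, and the helper `merge_intervals()` are hypothetical.

```r
# Toy peaksets on a single chromosome (hypothetical coordinates).
peaksets <- list(
  sampleA = data.frame(start = c(100, 500, 900), end = c(200, 600, 1000)),
  sampleB = data.frame(start = c(150, 520),      end = c(250, 580)),
  sampleC = data.frame(start = c(880, 1500),     end = c(950, 1600))
)

# Merge overlapping intervals into consolidated regions.
merge_intervals <- function(df) {
  df <- df[order(df$start), ]
  cur_start <- df$start[1]
  cur_end   <- df$end[1]
  starts <- c()
  ends   <- c()
  for (i in seq_len(nrow(df))[-1]) {
    if (df$start[i] <= cur_end) {
      cur_end <- max(cur_end, df$end[i])        # extend the current region
    } else {
      starts <- c(starts, cur_start)            # close the current region
      ends   <- c(ends, cur_end)
      cur_start <- df$start[i]
      cur_end   <- df$end[i]
    }
  }
  data.frame(start = c(starts, cur_start), end = c(ends, cur_end))
}
regions <- merge_intervals(do.call(rbind, peaksets))

# Hit matrix: 1 if the sample has any peak overlapping the region, else 0.
hits_mat <- sapply(peaksets, function(p) {
  sapply(seq_len(nrow(regions)), function(i) {
    as.integer(any(p$start <= regions$end[i] & p$end >= regions$start[i]))
  })
})
rownames(hits_mat) <- paste0(regions$start, "-", regions$end)
hits_mat
```

Each column of this matrix is a binary vector describing one sample, which is the representation that both the PCA and the cosine similarity operate on.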
Once the peak-hit matrix is constructed, we can apply PCA to observe the correlation across the peaksets of the samples of interest. First, let’s construct a data.frame of sample information and peak BED files:
# construct a data.frame of sample information
narrow_pattern <- '_peaks\\.narrowPeak$'
sample_info <- data.frame(
  bed_file = list.files(
    system.file('extdata', package = 'peakable'),
    full.names = TRUE, pattern = narrow_pattern)) %>%
  dplyr::mutate(sample_id =
    stringr::str_replace(basename(bed_file),
                         narrow_pattern, '')) %>%
  dplyr::mutate(antibody =
    stringr::str_split(sample_id, '_',
                       simplify = TRUE)[, 4]) %>%
  dplyr::relocate(sample_id, antibody, .before = 'bed_file')
# knitr is not attached above, so kable() is called with its namespace
sample_info %>%
  knitr::kable(caption = 'sample information') %>%
  kableExtra::kable_styling('striped')
| sample_id | antibody | bed_file |
|---|---|---|
| chr2_Rep1_H1_CTCF | CTCF | /Users/cwo11/Library/R/arm64/4.4/library/peakable/extdata/chr2_Rep1_H1_CTCF_peaks.narrowPeak |
| chr2_Rep1_H1_H3K4me3 | H3K4me3 | /Users/cwo11/Library/R/arm64/4.4/library/peakable/extdata/chr2_Rep1_H1_H3K4me3_peaks.narrowPeak |
| chr2_Rep2_H1_CTCF | CTCF | /Users/cwo11/Library/R/arm64/4.4/library/peakable/extdata/chr2_Rep2_H1_CTCF_peaks.narrowPeak |
| chr2_Rep2_H1_H3K4me3 | H3K4me3 | /Users/cwo11/Library/R/arm64/4.4/library/peakable/extdata/chr2_Rep2_H1_H3K4me3_peaks.narrowPeak |
Second, import the peak files as a list of GRanges:
# import bed files
grl <- lapply(sample_info$bed_file, read_macs2_narrow,
              drop_chrM = TRUE,
              keep_standard_chrom = TRUE,
              species = 'Homo_sapiens')
names(grl) <- sample_info$sample_id
Finally, consolidate the peaks to make the peak-hit matrix (consolidated_peak_hits()):
# construct hit matrix and calculate hit PCA
hits_mat <- peakable::consolidated_peak_hits(grl)
pcs <- peakable:::.getPCA(hits_mat, sample_info, n_pcs=2)
ggplot(pcs, aes(x=PC1, y=PC2, color=antibody)) +
geom_point() + theme_minimal()
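The PCA step above relies on an internal helper whose details are not shown here. As a hedged stand-in, the sketch below runs stats::prcomp() on a small, made-up binary hit matrix. Samples are treated as observations, so the matrix is transposed first. The matrix values and sample names are hypothetical, and the internal helper may center or scale differently.

```r
# Toy binary hit matrix: rows are regions, columns are samples (hypothetical).
hits_mat <- cbind(
  CTCF_rep1    = c(1, 1, 0, 0, 1, 0),
  CTCF_rep2    = c(1, 1, 0, 0, 0, 0),
  H3K4me3_rep1 = c(0, 0, 1, 1, 0, 1),
  H3K4me3_rep2 = c(0, 1, 1, 1, 0, 1)
)

# Samples become observations, so transpose before calling prcomp().
pca <- prcomp(t(hits_mat), center = TRUE, scale. = FALSE)

# Keep the first two principal components alongside the sample annotation.
pcs <- data.frame(
  sample_id = colnames(hits_mat),
  antibody  = sub("_rep[0-9]+$", "", colnames(hits_mat)),
  pca$x[, 1:2]
)
pcs
```

In this toy example the replicates of each antibody fall on the same side of PC1, which is the pattern one hopes to see in real data when replicates agree.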
The cos_similarity() function constructs a peak-hit matrix from two peaksets, with columns \(u\) and \(v\), and computes the cosine similarity between the two binary vectors \(\vec{u}\) and \(\vec{v}\), i.e., \(\frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\|_2 \|\vec{v}\|_2}\).
# CTCF replicates
ctcf_cos_sim <- cos_similarity(gr_x = grl[["chr2_Rep1_H1_CTCF"]],
                               gr_y = grl[["chr2_Rep2_H1_CTCF"]])
# H3K4me3 replicates
k4me3_cos_sim <- cos_similarity(gr_x = grl[["chr2_Rep1_H1_H3K4me3"]],
                                gr_y = grl[["chr2_Rep2_H1_H3K4me3"]])
# visualize by ggplot
data.frame(antibody = c('CTCF', 'H3K4me3'),
           cos_sim = c(ctcf_cos_sim, k4me3_cos_sim)) %>%
  ggplot(aes(x=cos_sim, y=antibody)) +
  geom_point() +
  geom_segment(aes(x=0, y=antibody, xend=cos_sim,
                   yend=antibody), color='grey50') +
  theme_light() +
  labs(title='MACS2 narrow peaksets: cosine similarity between replicates')
Figure 2: cos similarity of peak hits between replicates
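The cosine similarity formula can be checked by hand on a pair of short binary vectors. The helper `cosine_sim()` below is a hypothetical stand-alone function written for illustration; it is not the package's cos_similarity(), which works on GRanges.

```r
# Cosine similarity between two numeric vectors of equal length.
cosine_sim <- function(u, v) {
  stopifnot(length(u) == length(v))
  sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

# Two toy columns of a peak-hit matrix (1 = the sample has a peak in the region).
u <- c(1, 1, 1, 0, 1)
v <- c(1, 1, 0, 1, 1)

cosine_sim(u, v)  # 3 shared hits / (sqrt(4) * sqrt(4)) = 0.75
```

For binary vectors the numerator counts the regions hit by both samples, and each norm is the square root of the number of regions hit by one sample, so the value lies between 0 (no shared regions) and 1 (identical hit patterns).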
The plyranges package provides many tools, such as find_overlaps() and filter_by_overlaps(), for finding the overlaps between two peaksets (GRanges) while preserving the metadata columns of the peaks. For example:
x <- grl[['chr2_Rep1_H1_CTCF']]
y <- grl[['chr2_Rep2_H1_CTCF']]
x %>%
plyranges::filter_by_overlaps(y, minoverlap = 40L)
## GRanges object with 4342 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 11304-12188 * | GSM3391651_Rep1_H1_C.. 959
## [2] chr2 18466-19108 * | GSM3391651_Rep1_H1_C.. 118
## [3] chr2 152119-152336 * | GSM3391651_Rep1_H1_C.. 92
## [4] chr2 246399-246881 * | GSM3391651_Rep1_H1_C.. 105
## [5] chr2 264623-265132 * | GSM3391651_Rep1_H1_C.. 172
## ... ... ... ... . ... ...
## [4338] chr21 46098165-46098365 * | GSM3391651_Rep1_H1_C.. 37
## [4339] chr21 46225127-46225475 * | GSM3391651_Rep1_H1_C.. 105
## [4340] chr21 46228502-46229535 * | GSM3391651_Rep1_H1_C.. 105
## [4341] chr21 46635486-46636005 * | GSM3391651_Rep1_H1_C.. 69
## [4342] chr21 46660714-46661481 * | GSM3391651_Rep1_H1_C.. 305
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 37.31560 99.6611 95.99710 503
## [2] 8.61128 14.3630 11.85760 384
## [3] 6.95545 11.6387 9.23085 124
## [4] 8.03720 13.0558 10.59070 275
## [5] 10.90760 19.8543 17.22070 171
## ... ... ... ... ...
## [4338] 4.59268 5.92793 3.73807 99
## [4339] 8.03720 13.05580 10.59070 164
## [4340] 8.03720 13.05580 10.59070 253
## [4341] 6.31494 9.32198 6.97675 77
## [4342] 16.07440 33.42540 30.53640 476
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
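The `minoverlap = 40L` argument requires that two ranges share at least 40 positions before they count as overlapping. The arithmetic behind that test can be sketched in base R. `overlap_width()` is a hypothetical helper, and the second interval is made up to illustrate a near miss.

```r
# Number of positions shared by two closed, 1-based intervals (0 if disjoint).
overlap_width <- function(start_x, end_x, start_y, end_y) {
  pmax(0L, pmin(end_x, end_y) - pmax(start_x, start_y) + 1L)
}

# First CTCF peak from above against a made-up peak in the other replicate.
w <- overlap_width(11304L, 12188L, 12150L, 12400L)
w          # 39 shared positions
w >= 40L   # FALSE: this pair would not pass minoverlap = 40L
```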
find_consensus_macs2()
consensus <- find_consensus_macs2(x, y, minoverlap = 40L)
consensus
## GRanges object with 4342 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 11304-12188 * | GSM3391651_Rep1_H1_C.. 959
## [2] chr2 18466-19108 * | GSM3391651_Rep1_H1_C.. 118
## [3] chr2 152119-152336 * | GSM3391651_Rep1_H1_C.. 92
## [4] chr2 246399-246881 * | GSM3391651_Rep1_H1_C.. 105
## [5] chr2 264623-265132 * | GSM3391651_Rep1_H1_C.. 172
## ... ... ... ... . ... ...
## [4338] chr21 46098165-46098365 * | GSM3391651_Rep1_H1_C.. 37
## [4339] chr21 46225127-46225475 * | GSM3391651_Rep1_H1_C.. 105
## [4340] chr21 46228502-46229535 * | GSM3391651_Rep1_H1_C.. 105
## [4341] chr21 46635486-46636005 * | GSM3391651_Rep1_H1_C.. 69
## [4342] chr21 46660714-46661481 * | GSM3391651_Rep1_H1_C.. 305
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 37.31560 99.6611 95.99710 503
## [2] 8.61128 14.3630 11.85760 384
## [3] 6.95545 11.6387 9.23085 124
## [4] 8.03720 13.0558 10.59070 275
## [5] 10.90760 19.8543 17.22070 171
## ... ... ... ... ...
## [4338] 4.59268 5.92793 3.73807 99
## [4339] 8.03720 13.05580 10.59070 164
## [4340] 8.03720 13.05580 10.59070 253
## [4341] 6.31494 9.32198 6.97675 77
## [4342] 16.07440 33.42540 30.53640 476
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Venn diagram of overlaps
find_overlaps_venn(x, y,
                   label_x = 'chr2_Rep1_H1_CTCF',
                   label_y = 'chr2_Rep2_H1_CTCF',
                   minoverlap = 40L)
consensus_by
consensus_by() takes advantage of sample_info to robustly obtain the consensus peaks between replicates:
# consensus_by() groups the grl by 'antibody' and returns a data.frame and a list of
# two consensus ranges as GRanges instances
consensus <-
  peakable:::consensus_by(sample_info,
                          peaks_grl = grl,
                          consensus_group_by = 'antibody',
                          peak_caller = 'macs2')
consensus$df
## sample_id number_of_peaks antibody
## 1 CTCF 4304 CTCF
## 2 H3K4me3 2612 H3K4me3
head(consensus$grl[['CTCF']])
## GRanges object with 6 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 11304-12188 * | GSM3391651_Rep1_H1_C.. 959
## [2] chr2 18466-19108 * | GSM3391651_Rep1_H1_C.. 118
## [3] chr2 152119-152336 * | GSM3391651_Rep1_H1_C.. 92
## [4] chr2 246399-246881 * | GSM3391651_Rep1_H1_C.. 105
## [5] chr2 264623-265132 * | GSM3391651_Rep1_H1_C.. 172
## [6] chr2 501073-501578 * | GSM3391651_Rep1_H1_C.. 129
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 37.31560 99.6611 95.99710 503
## [2] 8.61128 14.3630 11.85760 384
## [3] 6.95545 11.6387 9.23085 124
## [4] 8.03720 13.0558 10.59070 275
## [5] 10.90760 19.8543 17.22070 171
## [6] 9.07609 15.5128 12.99140 315
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
find_consensus_seacr()
# wrapper of plyranges::find_overlaps(); query-based consensus peaks; carry over the metadata
seacr_file_x <-
  system.file('extdata',
              'chr2_Rep1_H1_CTCF.stringent.bed',
              package='peakable')
seacr_file_y <-
  system.file('extdata',
              'chr2_Rep2_H1_CTCF.stringent.bed',
              package='peakable')
x <- peakable::read_seacr(seacr_file_x)
y <- peakable::read_seacr(seacr_file_y)
# require an overlap of at least half the width of the narrowest peak
minoverlap <- min(min(width(x)), min(width(y))) / 2
consensus <- find_consensus_seacr(x, y, minoverlap = minoverlap)
consensus
## GRanges object with 2755 ranges and 3 metadata columns:
## seqnames ranges strand | AUC max.signal
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr2 11295-12358 * | 5313.70 13.04530
## [2] chr2 500734-501990 * | 1489.81 3.05748
## [3] chr2 528753-530212 * | 1699.19 6.38286
## [4] chr2 636663-637657 * | 1635.96 3.87281
## [5] chr2 713765-714944 * | 2465.91 8.31705
## ... ... ... ... . ... ...
## [2751] chr21 45970260-45971678 * | 2605.95 7.93021
## [2752] chr21 46228266-46229804 * | 2040.97 2.64982
## [2753] chr21 46660612-46661533 * | 2540.77 5.50347
## [2754] chr21_GL383578v2_alt 21494-22983 * | 2019.77 5.09580
## [2755] chr21_GL383580v2_alt 63758-65400 * | 3199.14 7.94945
## max.signal.region
## <character>
## [1] chr2:11802-11811
## [2] chr2:501385-501389
## [3] chr2:529481-529546
## [4] chr2:637234-637239
## [5] chr2:714368-714391
## ... ...
## [2751] chr21:45970621-45970..
## [2752] chr21:46228746-46228..
## [2753] chr21:46661173-46661..
## [2754] chr21_GL383578v2_alt..
## [2755] chr21_GL383580v2_alt..
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
peakable:::peakle_flow()