MethylScore is a pipeline to call differentially methylated regions between samples obtained from Whole Genome Bisulfite Sequencing (WGBS).
MethylScore starts from bam files containing alignments of bisulfite-sequenced reads to a reference genome, produced by bisulfite read alignment tools such as bismark and bwa_meth. Alternatively, bedGraph files with pre-tabulated methylation information as they are produced by MethylDackel extract or bismark methylation extractor.
If genomic alignments are supplied as an input, mapped reads from technical replicates are first merged and coordinate-sorted using samtools sort, and the mappings for each sample are de-duplicated using picard MarkDuplicates. For concurrent processing, the alignments are then split by chromosome, and, for each sample and chromosome, the numbers of methylated/unmethylated reads per position (pileup information) are retrieved using MethylDackel extract.
The obtained pileup information from all analysed samples is summarised in a so-called genome matrix that is generated per sample and chromosome in parallel. The global genome matrix serves as input for the detection of methylated regions per sample and methylated regions (MRs) are determined by a two-state Hidden Markov Model (HMM)-based method that learns different methylation level distributions for an unmethylated and a methylated state from whole genome data. Finally, to obtain significant differences in methylation on a regional scale between different samples, MethylScore clusters samples by methylation levels and statistically tests the groupwise methylation distributions for significant differences.
All software dependencies required by MethylScore are provided in a Docker container, the only requirements to run MethylScore are Nextflow, and a supported container engine (Singularity, Docker, Charliecloud or Podman).
MethylScore requires a samplesheet. It serves to create a mapping between sample identifiers and corresponding file locations of input files and should consist of two columns separated by tabs (column headers are not required):
| sampleID | path |
|---|---|
| S1 | /path/to/S1A.{bam,bedGraph} |
| S2 | /path/to/S2A.{bam,bedGraph} |
| S2 | /path/to/S2B.{bam,bedGraph} |
Samples sharing the same sampleID will be treated as technical replicates and merged prior to further processing.
To start the pipeline (using docker in this case), at least --SAMPLE_SHEET and --GENOME (in fasta format, same one as the reads were mapped against) have to be provided.
# If genomic alignments in bam format are provided
nextflow run Computomics/MethylScore --SAMPLE_SHEET=/path/to/samplesheet.tsv --GENOME=/path/to/reference_genome.fa -profile docker
# If bedGraph input is provided
nextflow run Computomics/MethylScore --BEDGRAPH --SAMPLE_SHEET=/path/to/samplesheet.tsv --GENOME=/path/to/reference_genome.fa -profile dockerThe pipeline will create a output directory structure that looks like the following:
├── 01mappings
│ ├── S1
│ │ ├── S1.cov.avg
│ │ ├── S1.cov_stats.tsv
│ │ └── S1.read_stats.tsv
│ └── S2
│ ├── S2.cov.avg
│ ├── S2.cov_stats.tsv
│ └── S2.read_stats.tsv
├── 02consensus
│ └── mbias
│ ├── S1.Chr1_OB.svg
│ ├── S1.Chr1_OT.svg
│ ├── S2.Chr1_OB.svg
│ └── S2.Chr1_OT.svg
├── 03matrix
│ ├── genome_matrix.tsv.gz
│ └── genome_matrix.tsv.gz.tbi
├── 04MRs
│ ├── hmm_parameters
│ │ ├── S1.hmm_params
│ │ └── S2.hmm_params
│ ├── S1.MRs.bed
│ ├── S2.MRs.bed
│ └── stats
│ ├── S1.MR_stats.tsv
│ └── S2.MR_stats.tsv
├── 05DMRs
│ └── all
│ ├── DMRs.CG.bed
│ ├── DMRs.CHG.bed
│ └── DMRs.CHH.bed
├── MethylScore_graph.png
├── MethylScore_report.html
└── MethylScore_trace.txt Contains alignment statistics for each sample.
Sorted and de-duplicated bam files are optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
Contains mbias plots showing methylation with respect to position along the sequencing reads that should be used to (re-)assess read trimming settings as needed.
Single-cytosine pileup information for each sample is optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
Contains the merged whole genome matrix across all samples as a bgzip compressed file, along with the corresponding tabix index.
The genome matrix for each chromosome are optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
Contains genomic coordinates of methylated regions (MRs) as they were segmented by the Hidden Markov Model, along with associated region-based metrics. The parameters obtained from training the model on each sample are stored and can be used to reduce computational burden in subsequent pipeline runs.
The coordinates are stored in bed format, with the following columns:
column 1: chromosome ID
column 2: (1-based) start position
column 3: (1-based) end position, half-open (i.e. this position is not part of the region)
column 4: Number of covered cytosines in MR
column 5: Mean read depth in MR
column 6: 5th percentile of read depth in MR
column 7: Mean methylation rate of cytosines within MR
column 8: SampleID
Example:
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
|---|---|---|---|---|---|---|---|
| Chr1 | 597 | 651 | 23 | 11 | 12 | 12 | S1 |
| Chr1 | 763 | 956 | 52 | 5 | 9 | 51 | S1 |
Contains genomic coordinates of differentially methylated regions (DMRs) that were determined as significantly different between sample clusters, after candidate region selection from MRs followed by statistical testing.
column 1: chromosome ID
column 2: (1-based) start position
column 3: (1-based) end position, half-open (i.e. this position is not part of the region)
column 4: Length in bp
column 5: Cluster-String, one symbol per sample:
* 1,2,3,... = cluster ID
* . = sample is not covered at all positions within region
* - = sample is not sufficiently covered at all positions within region (at least DMR_MIN_C positions with a minimum read depth of DMR_MIN_COV-fold are required)
column 6: Mean methylation rate of total sites, and of sites in the contexts CG, CHG, CHH (in this order) for cluster 1, comma-separated and preceded by the cluster ID and ":" (1:...)
column 7: Mean methylation rate of total sites, and of sites in the contexts CG, CHG, CHH (in this order) for cluster 2, comma-separated and preceded by the cluster ID and ":" (2:...)
column 8 to n-3: Potentially more mean methylation rates if there are more than 2 clusters
column n-2: Number of total sites, and of sites in the contexts CG, CHG, CHH (in this order), comma-separated, and preceded by "#:"
column n-1: Differentially methylated context
column n: Highly differentially methylated context
Example:
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chr1 | 42255 | 42409 | 155 | 11.21-211 | 1:17,20,0,12 | 2:75,80,0,28 | 1:S1,S3,... | 2:S2,S6,... | #:30,14,3,13 | CG,CHH | CHH |
Pipeline parameters can either be configured using a parameter file, or
individual parameters can be passed to the pipeline on the commandline.
For example, methylated regions can be visualized as IGV tracks by passing the --IGV parameter via the commandline:
nextflow run Computomics/MethylScore --SAMPLE_SHEET=samplesheet.tsv --GENOME=genome.fa --IGV -profile podmanAlternatively, the repository contains a template example_config.yaml, which can be edited and used to pass custom parameters in a more reproducible manner to the pipeline using the -params-file flag.
nextflow run Computomics/MethylScore --SAMPLE_SHEET=samplesheet.tsv --GENOME=genome.fa -params-file=/path/to/config.yaml -profile podman--SAMPLE_SHEET required
Path to sample sheet.
--GENOME required
Path to fasta file containing reference genome.
--BEDGRAPH default: false
Run MethylScore with bedGraph input.
--PAIRWISE default: false
Activate the pairwise mode, which will run all pairwise combinations between a user-specified sampleID
and each of the other samples. Use as --PAIRWISE followed by a sampleID from the first column of the samplesheet.
--DO_DEDUP default: true
Perform read de-duplication using picard MarkDuplicates
--IGV default: false
Generate IGV tracks.
--REMOVE_INTMED_FILES default: true
Do not store intermediate files.
--PROJECT_FOLDER default: './results'
Output path where results will be stored.
--STATISTICS default: true
Collect coverage statistics on alignments.
--ROI default: false
Supply regions of interest in bed format.
--METRICS default: true
Generate runtime metrics with Nextflow.
--MIN_QUAL default: 30
Minimum mapping quality of reads to consider (influenced by ploidy level!)
Only applies when running from bam input.
--IGNORE_OT default: 0,0,0,0
Disregard bases of original top strand for consensus calling (r1_start,r1_end,r2_start,r2_end)
Useful for removal of methylation bias from read ends. Uses the same syntax as MethylDackel extract.
Only applies when running from bam input.
--IGNORE_OB default: 0,0,0,0
Disregard bases of original top strand for consensus calling (r1_start,r1_end,r2_start,r2_end)
Useful for removal of methylation bias from read ends. Uses the same syntax as MethylDackel extract.
Only applies when running from bam input.
--MR_MIN_COV default: 1
Minimum per-site coverage to consider
--MR_FREQ_CHANGE default: 20
Percent MR frequency change (across all samples) leading to segment break, indicative of natural region boundaries that are widespread in the sample population.
This parameter should be adjusted to lower values if identification of regions with less frequent changes (i.e lower "epiallele-frequency") in the population is desired.
--MR_FREQ_DISTANCE default: 30
Upstream distance in bp to which MR frequency change is compared.
--DESERT_SIZE default: 100
If a region spans more than DESERT_SIZE bp without covered cytosines, break segment and rather start separate HMM path.
This prevents extending MRs over stretches of missing data.
--MR_MIN_C default: 20
Minimum number of covered cytosines within MRs, turns off permutation test (set to -1 for turning on the permutation test)
--MERGE_DIST default: 30
Merge MRs that are MERGE_DIST bp close to each other.
--TRIM_METHRATE default: 10
Trim sites off both MR ends below this methylation level.
--MR_PARAMS default: false
Provide pre-trained parameter file for calling MRs on all samples.
If set to false, the hmm will train on each sample.
--SLIDING_WINDOW_SIZE default: 0
If set to a value greater than 0, applies sliding window of width SLIDING_WINDOW_SIZE along segmented regions to breakdown candidate regions to test.
--SLIDING_WINDOW_STEP default: 0
If SLIDING_WINDOW_SIZE and SLIDING_WINDOW_STEP are set to values greater than 0, applies sliding window of size SLIDING_WINDOW_SIZE along selected regions to breakdown candidate regions to test, using a step size of SLIDING_WINDOW_STEP
--HUMAN default: false
If set to true, the segmentation will select hypomethylated regions instead of methylated regions.
Useful if analysing human data.
--MR_BATCH_SIZE default: 500
Number of MR blocks per file.
--DMRS_PER_CONTEXT default: true
Call DMRs for each sequence context.
--DMR_CONTEXTS default: CG,CHG,CHH
Sequence contexts to analyse.
--CLUSTER_MIN_METH_DIFF{_CG,_CHG,_CHH} default: 20
Minimum methylation level differences per context between any pair of sample clusters.
CLUSTER_MIN_METH_DIFF_CG,CLUSTER_MIN_METH_DIFF_CHG and CLUSTER_MIN_METH_DIFF_CHH only apply
when DMRS_PER_CONTEXT is set to true.
For each candidate region, cluster centers are searched to minimize the within-group variance. The value of k is iteratively incremented starting from k = 2, until the pairwise comparison of all cluster centers results in a methylation difference of less than CLUSTER_MIN_METH_DIFF.
This effectively discards likely irrelevant DMRs with only a few percentage points methylation difference.
--CLUSTER_MIN_METH{_CG,_CHG,_CHH} default: 20
Minimum methylation level of any sample cluster per context.
CLUSTER_MIN_METH_CG,CLUSTER_MIN_METH_CHG and CLUSTER_MIN_METH_CHH only apply
when DMRS_PER_CONTEXT is set to true.
--DMR_MIN_COV default: 3
Minimum read coverage of cytosines within (candidate) DMRs.
--DMR_MIN_C default: 5
Minimum number of cytosines exceeding DMR_MIN_COV that are required within (candidate) DMRs.