cfDNAFE

Introduction

cfDNAFE (cell-free DNA Feature Extraction) is a tool for extracting cfDNA features. It covers End Motif (EDM), Breakpoint Motif (BPM), Motif-Diversity Score (MDS), Fragment Size Ratio (FSR), Fragment Size Coverage (FSC), Fragment Size Distribution (FSD), Windowed Protection Score (WPS), Orientation-aware cfDNA Fragmentation (OCF) value, Copy Number Variations (CNV), UXM fragment-level and mutation signature.

The main functions are shown in the following figure.

[Figure: cfDNAFE Function]

Section 1: Installation Tutorial

Section 1.1: System requirement

Many whole genome sequencing (WGS) and whole genome bisulfite sequencing (WGBS) analysis toolkits are released for Unix/Linux systems and are written in different programming languages, so rewriting all of them in one language is impractical. Fortunately, conda/bioconda collects many popular Python modules and bioinformatics software, so all dependencies can be installed via conda/bioconda.

We recommend using conda/Anaconda and creating a virtual environment to manage all the dependencies.

Section 1.2: Create environment and Install Dependencies

Run the following commands. They create the environment and install all the dependencies together with the latest cfDNAFE.

# Download cfDNAFE from GitHub (the repository contains the files the software needs to run)
git clone https://github.com/Cuiwanxin1998/cfDNAFE.git
cd cfDNAFE/
# Create the virtual environment and activate it
conda env create -n cfDNAFE -f environment.yml
conda activate cfDNAFE
# Install the required R packages
Rscript ./scripts/install_Rpackages.R
# Add the cfDNAFE directory to PATH
export PATH="${PATH}:${PWD}"

Section 1.3: Use cfDNAFE

After completing the steps above, run cfDNAFE with the following command.

$ cfDNAFE
A tool for extracting cfDNA feature, version 0.1.0
Usage: cfDNAFE <command> [<args>]
run cfDNAFE <command> -h for more information
Optional commands:
bam, extracting end motif, breakpoint motif and Motif-Diversity Score
fsc, extracting fragment size coverage
fsd, extracting fragment size distribution
fsr, extracting fragment size ratio
cnv, extracting copy number variations
wps, extracting window protect score
ocf, extracting orientation-aware cfDNA fragmentation
mutation, extracting 96 single base substitution mutation (SBS) profile and a mutation signature profile
meth, extracting UXM fragment-level

Section 2: Bam File Data Processing

cfDNAFE mainly processes BAM files. The bam command is the initial step of cfDNAFE: it extracts the BED files required as input by the functions below, together with the End Motif, Breakpoint Motif and MDS features.

Human reference genomes can be obtained from UCSC; here we provide GRCh37/hg19 and GRCh38/hg38.
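To make the motif features more concrete, here is a minimal Python sketch of how end-motif frequencies and a motif diversity score could be computed. It assumes the MDS is the normalized Shannon entropy of the k-mer end-motif frequencies; the function names and the simplified input (a list of fragment sequences read from their 5' end) are hypothetical and are not taken from the cfDNAFE source.

```python
import math
from collections import Counter
from itertools import product


def end_motif_frequencies(fragments, k=4):
    """Frequency of every possible k-mer found at the 5' end of the fragments.

    `fragments` is a hypothetical list of fragment sequences (strings).
    Motifs containing characters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[:k].upper() for seq in fragments if len(seq) >= k)
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in motifs)
    if total == 0:
        raise ValueError("no usable end motifs found")
    return {m: counts[m] / total for m in motifs}


def motif_diversity_score(freqs):
    """Normalized Shannon entropy: 0 if a single motif dominates, 1 if uniform."""
    entropy = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return entropy / math.log(len(freqs))


if __name__ == "__main__":
    reads = ["CCCAGGTT", "CCCATTTA", "ACGTACGT", "TTGACCAA"]
    freqs = end_motif_frequencies(reads, k=4)
    print(round(motif_diversity_score(freqs), 3))
```

With k = 4 there are 256 possible motifs, so the score is divided by log(256) to fall between 0 and 1.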

Detailed parameters

usage: bam [-h] -p BAMPATH [-b BLACKLIST] -g GENOME_REFERENCE [-o OUTPUT] [-m MAPQUALITY] 
           [-c CHR] [-f] [-minl MINLEN] [-maxl MAXLEN] [-k K_MER] [-t THREADS]

Example Usage

cfDNAFE bam -p /path/to/bamfile -g /path/to/hg38.fa -o output_folders

Output Folder Arrangement

output_folders/
├──sample1.bed
├──sample1.bed.gz
├──sample1.bed.gz.tbi
├──EDM
│   ├──sample1.EndMotif
├──BPM
│   ├──sample1.BreakPointMotif
├──MDS
│   ├──sample1.MDS

Section 3: Fragment Size Ratio (FSR), Fragment Size Coverage (FSC) and Fragment Size Distribution (FSD)

FSR: The fragment sizes were used to construct fragmentation profiles with in-house scripts. The FSR was adapted from the DELFI method and optimized by introducing an extra fragment size group and using improved cutoffs. It was generated from the short/intermediate/long fragment ratios, but with different cutoffs: short, intermediate and long fragments were defined as 65-150 bp, 151-220 bp and 221-400 bp, respectively, according to the overall fragment length profile in our cohorts.
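The size grouping can be sketched in a few lines of Python. This is an illustration only: it assumes each ratio is taken against the total number of 65-400 bp fragments in the same window, and the function name is hypothetical rather than part of cfDNAFE.

```python
from collections import Counter

# Size groups as stated for the FSR (inclusive bounds, in bp).
SIZE_GROUPS = {
    "short": (65, 150),
    "intermediate": (151, 220),
    "long": (221, 400),
}


def fragment_size_ratios(lengths):
    """Fraction of short, intermediate and long fragments among 65-400 bp fragments."""
    counts = Counter()
    for length in lengths:
        for name, (low, high) in SIZE_GROUPS.items():
            if low <= length <= high:
                counts[name] += 1
                break  # groups do not overlap, so stop at the first match
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in SIZE_GROUPS}
    return {name: counts[name] / total for name in SIZE_GROUPS}


if __name__ == "__main__":
    # Fragments of 40 bp and 500 bp fall outside every group and are skipped.
    print(fragment_size_ratios([100, 120, 160, 300, 500, 40]))
```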

FSC: FSC was generated using the coverages of short (65-150 bp), intermediate (151-260 bp), long (261-400 bp), and total (65-400 bp) cfDNA fragments. The extended ranges allowed the inclusion of broader size regions than DELFI has reported. The genome was first divided into 100 kb bins. Next, the coverage of the four fragment size groups in each 100 kb bin was calculated and corrected by GC content. We then combined the coverages in every 50 contiguous 100 kb bins to calculate the coverage in the corresponding 5 Mb (50 × 100 kb) bin. For each fragment size group, the scaled coverage score (z-score) in every 5 Mb bin was calculated by comparing the bin value against the overall mean value.
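The bin merging and scaling steps can be sketched as follows. This is a simplified illustration under stated assumptions: GC correction is left out, the z-score uses the population standard deviation over all merged bins, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def merge_bins(coverage_100kb, n=50):
    """Sum every n contiguous 100 kb bins into one larger bin.

    An incomplete group at the end of the list is dropped (an assumption).
    """
    usable = len(coverage_100kb) - len(coverage_100kb) % n
    return [sum(coverage_100kb[i:i + n]) for i in range(0, usable, n)]


def z_scores(values):
    """Scale values by their overall mean and population standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


if __name__ == "__main__":
    # 160 toy 100 kb bins for one fragment size group; the last 10 are dropped.
    coverage = [1] * 100 + [3] * 50 + [9] * 10
    merged = merge_bins(coverage, n=50)
    print(merged)
    print(z_scores(merged))
```

In practice the same two steps would be repeated for each of the four fragment size groups.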

FSD: The FSD examined fragment length patterns at a high resolution by grouping cfDNA fragments into 5 bp length bins ranging from 65 bp to 400 bp and calculating the ratio of fragments in each bin at arm level for each chromosome. A total of 41 chromosome arms were examined.
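A minimal sketch of the length binning for one chromosome arm is shown below. It assumes half-open 5 bp bins ([65, 70), [70, 75), ..., [395, 400)), which gives 67 bins; how cfDNAFE treats the 400 bp boundary is not stated here, and the function name is hypothetical.

```python
def fragment_size_distribution(lengths, start=65, end=400, step=5):
    """Ratio of fragments falling in each `step`-bp length bin of [start, end)."""
    n_bins = (end - start) // step
    counts = [0] * n_bins
    for length in lengths:
        if start <= length < end:
            counts[(length - start) // step] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


if __name__ == "__main__":
    # Fragment lengths observed on a single (toy) chromosome arm.
    arm_lengths = [65, 69, 70, 399, 400, 50]
    ratios = fragment_size_distribution(arm_lengths)
    print(len(ratios), ratios[0], ratios[1], ratios[-1])
```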

We provide 100 kb window files for hg19 and hg38 in /cfDNAFE/data/ChormosomeBins/ and chromosome arm files for hg19 and hg38 in /cfDNAFE/data/ChormosomeArms/. hg38 is used by default.

Detailed parameters

usage: fsr [-h] -p BEDGZPATH [-b BININPUT] [-w WINDOWS] [-c CONTINUE_N]
           [-o OUTPUT] [-t THREADS]

usage: fsc [-h] -p BEDGZPATH [-b BININPUT] [-w WINDOWS] [-c CONTINUE_N]
           [-o OUTPUT] [-t THREADS]

usage: fsd [-h] -p BEDGZPATH [-a ARMSINPUT] [-o OUTPUT] [-t THREADS]

Example Usage

# run FSR
cfDNAFE fsr -p /path/to/bedgzfile -o output_folders/
# run FSC
cfDNAFE fsc -p /path/to/bedgzfile -o output_folders/
# run FSD
cfDNAFE fsd -p /path/to/bedgzfile -o output_folders/

Output Folder Arrangement

output_folders/
├──sample1.FSR
├──sample1.FSC
├──sample1.FSD

Section 4: Windowed protection score (WPS)

WPS: Both outer alignment coordinates of paired-end (PE) data were extracted for properly paired reads. Both end coordinates of single-read (SR) alignments were extracted when PE data were collapsed to SR data by adapter trimming. A fragment's coverage is defined as all positions between the two (inferred) fragment ends, inclusive of endpoints. We define the windowed protection score (WPS) of a window of size k as the number of molecules spanning the window minus those with an endpoint within the window.
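The definition above translates into a short Python sketch. It assumes fragments are given as inclusive (start, end) coordinates and that a fragment covering the whole window counts as spanning even when one of its ends sits on the window boundary; the default k = 120 and the function name are illustrative choices, not cfDNAFE settings.

```python
def windowed_protection_score(fragments, center, k=120):
    """WPS of the k-bp window centered on `center`.

    fragments: iterable of (start, end) tuples, inclusive coordinates.
    Returns (#fragments spanning the window) - (#fragments ending inside it).
    """
    win_start = center - k // 2
    win_end = win_start + k - 1
    spanning = 0
    ending_inside = 0
    for start, end in fragments:
        if start <= win_start and end >= win_end:
            spanning += 1  # fragment covers the whole window
        elif win_start <= start <= win_end or win_start <= end <= win_end:
            ending_inside += 1  # at least one endpoint falls in the window
    return spanning - ending_inside


if __name__ == "__main__":
    frags = [(100, 300), (150, 250), (195, 400), (10, 50)]
    # Window 190-209: two fragments span it, one starts inside it.
    print(windowed_protection_score(frags, center=200, k=20))
```

Evaluating this score at every position of a region, for example downstream of a TSS, yields a per-position WPS profile.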

We provide the regions 1 kb and 10 kb downstream of the TSS of each gene for hg19 and hg38 in /cfDNAFE/data/transcriptAnno/.

Gene bodies can also be derived from GENCODE annotation files, which users can download from the GENCODE database; the commonly used files are gencode.v19.annotation.gtf.gz for hg19 and gencode.v37.annotation.gtf.gz for hg38. By default, cfDNAFE uses transcriptAnno-hg38-1kb.tsv.

Detailed parameters

usage: wps [-h] -p BEDGZPATH [-tsv TSVINPUT] [-o OUTPUT] [-w {L,S}]
           [-empty {True,False}] [-t THREADS]

Example Usage

cfDNAFE wps -p /path/to/bedgzfile -o output_folders/

Output Folder Arrangement

output_folders/
├──sample1
│   ├──sample1_ENSG000000003.14.tsv.gz
│   ├──sample1_ENSG000000005.5.tsv.gz
│   ├──sample1_ENSG0000000419.12.tsv.gz
│   │.....

Section 5: Orientation-aware cfDNA fragmentation (OCF)

OCF: To explore the potential of inferring the relative contributions of various tissues to the plasma DNA pool, Sun et al. developed a novel approach to measure the differential phasing of upstream (U) and downstream (D) fragment ends in tissue-specific open chromatin regions. They called this strategy orientation-aware cfDNA fragmentation (OCF) analysis. OCF values are based on the differences in U and D end signals in the center of the relevant open chromatin regions. For tissues that contributed DNA to plasma, one would expect much cfDNA fragmentation to have occurred at the nucleosome-depleted region in the center of the corresponding tissue-specific open chromatin regions. In such a region, U and D ends exhibited the highest read densities (i.e., peaks) at ∼60 bp from the center, with the peaks for U and D ends located on the right- and left-hand sides, respectively. Conversely, this pattern would not be expected for tissue-specific open chromatin regions where the corresponding tissue did not contribute DNA to the plasma. Thus, they measured the differences of U and D end signals in 20-bp windows around the peaks in the tissue-specific open chromatin regions as the OCF value for the corresponding tissue.
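The last step of this description can be sketched as follows. The sketch assumes the U and D end signals have already been aggregated by position relative to the center of the open chromatin regions and normalized, that the windows extend 10 bp on either side of each peak at ±60 bp, and that the score adds (D - U) on the left-hand side to (U - D) on the right-hand side. The function name and input layout are hypothetical.

```python
def ocf_value(u_signal, d_signal, peak=60, half_window=10):
    """OCF-style score from U and D end signals around an OCR center.

    u_signal, d_signal: dicts mapping position relative to the center (bp)
    to the normalized count of upstream / downstream fragment ends.
    """
    left = range(-peak - half_window, -peak + half_window + 1)
    right = range(peak - half_window, peak + half_window + 1)
    # D ends are expected to peak on the left, U ends on the right.
    score = sum(d_signal.get(p, 0.0) - u_signal.get(p, 0.0) for p in left)
    score += sum(u_signal.get(p, 0.0) - d_signal.get(p, 0.0) for p in right)
    return score


if __name__ == "__main__":
    u = {60: 5.0, 55: 2.0}
    d = {-60: 4.0, -65: 1.0}
    print(ocf_value(u, d))  # positive: the expected phasing pattern is present
```

A tissue that does not contribute DNA to plasma would show similar U and D signals in both windows, giving a score near zero.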

We provide tissue-specific open chromatin regions for seven tissues in /cfDNAFE/data/OpenChromatinRegion/.

Detailed parameters

usage: ocf [-h] -p BEDGZPATH [-ocr OCRINPUT] [-o OUTPUT] [-t THREADS]

Example Usage

cfDNAFE ocf -p /path/to/bedgzfile -o output_folders/

Output Folder Arrangement

output_folders/
├──sample1
│   ├──Breast.sync.end
│   ├──Intestine.sync.end
│   ├──Liver.sync.end
│   ├──Lung.sync.end
│   ├──Ovary.sync.end
│   ├──Placenta.sync.end
│   ├──Tcell.sync.end
│   ├──all.ocf.csv

Section 6: Copy Number variations (CNV)

CNV: The copy number variation (CNV) profile was calculated using ichorCNA as reported by Wan et al. First, the genome of each sample was divided into 1 Mb bins. For each bin, the depth after bin-level GC correction was used by a Hidden Markov Model (HMM) to compare against the software baseline. Then, we calculated the log2 ratio for the CNV score.

This part has two main steps: first, read count coverage information is generated using readCounter from the HMMcopy suite; second, the copy number profile is calculated with ichorCNA as described above.
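As a rough illustration of the log2 ratio only, the sketch below compares per-bin depths of a sample with a baseline after scaling each by its median. It is not ichorCNA's HMM and performs no GC correction; the function name and the median scaling are assumptions made for the example.

```python
import math
from statistics import median


def log2_ratios(sample_depth, baseline_depth):
    """Per-bin log2 ratio of median-scaled sample depth to baseline depth.

    Bins with non-positive depth in either input yield None.
    """
    s_med = median(sample_depth)
    b_med = median(baseline_depth)
    ratios = []
    for s, b in zip(sample_depth, baseline_depth):
        if s <= 0 or b <= 0 or s_med <= 0 or b_med <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2((s / s_med) / (b / b_med)))
    return ratios


if __name__ == "__main__":
    # Three toy 1 Mb bins: a gain, a neutral bin and a loss.
    print(log2_ratios([200, 100, 50], [100, 100, 100]))
```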

Detailed parameters

usage: cnv [-h] -p BAMPATH [-c CHROMOSOME] [-o OUTPUTDIR] [-w WINDOW_SIZE]
           [-q QUALITY] [-ploidy PLOIDY] [-normal NORMAL] [-maxCN MAXCN]
           [-gcWig GCWIG] [-mapWig MAPWIG] [-centromere CENTROMERE]
           [-includeHOMD INCLUDEHOMD] [-chrs CHRS] [-chrTrain CHRTRAIN]
           [-normalPanel NORMALPANEL] [-estimateNormal ESTIMATENORMAL]
           [-estimatePloidy ESTIMATEPLOIDY]
           [-estimateScPrevalence ESTIMATESCPREVALENCE] [-scStates SCSTATES]
           [-txnE TXNE] [-txnStrength TXNSTRENGTH] [-seqinfo {hg38,hg19}]
           [-t THREADS]

Example Usage

cfDNAFE cnv -p /path/to/bamfile -o output_folders/

Output Folder Arrangement

output_folders/ 
├──sample1.CNV

Section 7: Mutation Signature

Mutation signature: Each mutational process is thought to leave its own characteristic mark on the genome. For example, AID/APOBEC activity can specifically cause C > T and C > G substitutions at TpCpA and TpCpT sites (where the central C is the mutated nucleotide). Thus, patterns of somatic mutations can serve as a readout of the mutational processes that have been active and as proxies for the molecular perturbations in a tumour. These mutational signatures are characterized by a specific contribution of 96 base substitution types with a certain sequence context.
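The 96 substitution types arise from 6 pyrimidine-centered substitution classes combined with 4 possible 5' bases and 4 possible 3' bases. The following Python sketch enumerates these channels and assigns a single substitution to one of them; the function names and the `T[C>T]A` label format are illustrative and are not taken from cfDNAFE.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs96_channels():
    """All 96 channels: 6 substitution classes x 16 flanking-base contexts."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product("ACGT", repeat=2)
    ]


def classify_sbs(context, alt):
    """Map a substitution to its channel.

    context: reference trinucleotide with the mutated base in the middle.
    alt: the alternate base. Purine references (A or G) are converted to the
    opposite strand so that every channel is reported from the pyrimidine.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or alt not in "ACGT" or alt == context[1]:
        raise ValueError("expected a trinucleotide and a different alt base")
    if context[1] in "AG":
        context = context.translate(COMPLEMENT)[::-1]  # reverse complement
        alt = alt.translate(COMPLEMENT)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


if __name__ == "__main__":
    print(len(sbs96_channels()))
    print(classify_sbs("TCA", "T"))  # C > T at a TpCpA site
    print(classify_sbs("TGA", "A"))  # same event seen from the other strand
```

Counting how often each channel occurs in a sample gives its 96-channel SBS profile.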

Detailed parameters

usage: mutation [-h] -p BAMPATH -g GENOME_REFERENCE [-o OUTPUTDIR]
                [-r {hg19,hg38}] [-mq MAPQ] [-bq BASEQ] [-id ID] [-t THREADS]

Example Usage

cfDNAFE mutation -p /path/to/bamfile -g /path/to/hg38.fa -o output_folders/

Output Folder Arrangement

output_folders/
├──sample.signatures
├──sample.96.mutation.profle

Section 8: UXM fragment-level

UXM fragment-level: Each fragment was annotated as U (mostly unmethylated), M (mostly methylated) or X (mixed) depending on the number of methylated and unmethylated CpGs. We then calculated, for each genomic region (marker) and across all cell types, the proportion of U/X/M fragments with at least k CpGs.
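The annotation rule can be sketched in Python as below. The thresholds (at least 3 CpGs, at most 25% methylated for U, at least 75% methylated for M) are assumptions chosen for illustration and are not confirmed cfDNAFE defaults; the function names and the (methylated, unmethylated) input pairs are hypothetical as well.

```python
def classify_fragment(n_meth, n_unmeth, min_cpg=3,
                      meth_threshold=0.75, unmeth_threshold=0.25):
    """Label a fragment U, X or M from its CpG counts; None if too few CpGs."""
    total = n_meth + n_unmeth
    if total < min_cpg:
        return None
    fraction = n_meth / total
    if fraction >= meth_threshold:
        return "M"
    if fraction <= unmeth_threshold:
        return "U"
    return "X"


def uxm_proportions(fragments, **thresholds):
    """Proportion of U/X/M fragments among fragments with enough CpGs.

    fragments: iterable of (n_methylated, n_unmethylated) pairs for the
    fragments overlapping one marker region.
    """
    counts = {"U": 0, "X": 0, "M": 0}
    for n_meth, n_unmeth in fragments:
        label = classify_fragment(n_meth, n_unmeth, **thresholds)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    # The last fragment has a single CpG and is skipped.
    print(uxm_proportions([(4, 0), (0, 5), (2, 2), (1, 0)]))
```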

cfDNAFE provides a function, runMeth, to calculate the UXM fragment-level. We collected the top 25 differentially unmethylated regions for each cell type, which together comprise a human cell-type-specific methylation atlas, in /cfDNAFE/data/MethMark/. cfDNAFE also allows users to supply custom disease-specific or tissue-specific markers. A custom marker file must contain the chromosome, the start position and the end position of each region, and its file name must end with .bed.

Detailed parameters

usage: meth [-h] -p BAMPATH [-m MARKINPUT] [-o OUTPUTDIR] [-mq MAPQUALITY]
            [-mCpG MINCPG] [-mT METHYTHRESHOLD] [-umT UNMETHYTHRESHOLD]
            [-t THREADS]

Example Usage

cfDNAFE meth -p /path/to/bamfile -o output_folders/

Output Folder Arrangement

output_folders/
├── sample1.UXM.tsv
├── sample2.UXM.tsv

Authors

xq Peng - Central South University
wx Cui - Central South University

Version update

cfDNAFE v0.1.0 (2023.04.01): the first version.
