A tool to analyze haplotype-specific chromosome-scale somatic copy number aberrations and aneuploidy using long reads (Oxford Nanopore, PacBio). Wakhan takes long-read alignments and phased heterozygous variants as input. It first extends the phased blocks and corrects phase-switch errors using the hapcorrect module, taking advantage of the copy number differences between the haplotypes. Wakhan then estimates the purity and ploidy of the sample and generates interactive haplotype-specific copy number and coverage plots.
git clone https://github.com/KolmogorovLab/Wakhan.git
cd Wakhan/
conda env create -f environment.yml -n wakhan
conda activate wakhan
conda create -n wakhan_env wakhan
conda activate wakhan_env
Wakhan can be run as a standalone phase-correction and copy number profiling tool using the tumor-only and tumor/normal pair commands below [1]. If phased SVs/breakpoints are to be used, the Long-Read Somatic Variant Calling pipeline mode [2] is recommended.
Please refer to the prerequisites section to generate the required phased VCF and breakpoints VCF.
python wakhan.py --threads <24> --reference <ref.fa> --target-bam <tumor.bam> --normal-phased-vcf <normal_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
python wakhan.py --threads <24> --reference <ref.fa> --target-bam <tumor.bam> --tumor-phased-vcf <tumor_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
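For scripted runs, the two invocations above can be assembled programmatically. The following is a minimal sketch and not part of Wakhan: `build_wakhan_command` is a hypothetical helper that only arranges the flags shown above and picks `--normal-phased-vcf` or `--tumor-phased-vcf` depending on the mode. All file names in the example call are placeholders.

```python
import shlex
from typing import List, Optional


def build_wakhan_command(
    reference: str,
    target_bam: str,
    phased_vcf: str,
    genome_name: str,
    out_dir: str,
    tumor_only: bool,
    breakpoints: Optional[str] = None,
    threads: int = 24,
) -> List[str]:
    """Hypothetical helper: arrange the documented flags into an argv list."""
    # Tumor-only runs take the tumor phased VCF; tumor/normal runs take the normal one.
    vcf_flag = "--tumor-phased-vcf" if tumor_only else "--normal-phased-vcf"
    cmd = [
        "python", "wakhan.py",
        "--threads", str(threads),
        "--reference", reference,
        "--target-bam", target_bam,
        vcf_flag, phased_vcf,
        "--genome-name", genome_name,
        "--out-dir-plots", out_dir,
    ]
    # The Severus SV VCF is optional here, although the README recommends it.
    if breakpoints is not None:
        cmd += ["--breakpoints", breakpoints]
    return cmd


# Placeholder inputs; print the command instead of running it.
example = build_wakhan_command(
    reference="ref.fa",
    target_bam="tumor.bam",
    phased_vcf="tumor_phased.vcf.gz",
    genome_name="sample_abc",
    out_dir="genome_abc_output",
    tumor_only=True,
    breakpoints="severus_somatic.vcf",
)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the Wakhan checkout directory once the paths point at real files.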
Severus also produces phased breakpoints/structural variations after rephasing the tumor (tumor-only mode) or normal (tumor/normal pair mode) phased VCF. These can be used in Wakhan by setting the --use-sv-haplotypes parameter.
This option lets Wakhan segment copy number boundaries only in the appropriate haplotype.
To use phased SVs/breakpoints, Wakhan works in two steps. In the first step, it uses the hapcorrect() module for phase correction and generates a rephased VCF, which is used to haplotag the BAMs with Whatshap;
Severus then uses the haplotagged BAMs to generate phased SVs. In the second step, Wakhan takes the resulting Severus phased SVs VCF and runs the cna() module with the --use-sv-haplotypes parameter.
For this purpose, we have developed a Nextflow-based Long-Read Somatic Variant Calling pipeline.
This pipeline (Wakhan - hapcorrect -> Whatshap - haplotagging -> Severus - sv/breakpoints -> Wakhan - cna) can generate Severus phased SVs/breakpoints, which can then be used in Wakhan by setting --use-sv-haplotypes.
Alternatively, the following commands can be used to mimic this workflow:
#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --normal-phased-vcf ${VCF} --genome-name ${SAMPLE_T}
VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'
#Index for rephased VCF
tabix ${VCF}
#Haplotag Normal with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_N} -o ${SAMPLE_N}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Index for Normal haplotagged BAM
samtools index ${SAMPLE_N}.haplotagged.bam
#Index for Tumor haplotagged BAM
samtools index ${SAMPLE_T}.haplotagged.bam
#Severus tumor-normal mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --control-bam ${SAMPLE_N}.haplotagged.bam --out-dir severus_out \
-t 16 --phasing-vcf ${VCF} --vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed
#Wakhan cna() tumor-normal mode
python wakhan.py cna --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --normal-phased-vcf ${VCF} --genome-name ${SAMPLE_T} \
--breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes
#Wakhan hapcorrect()
python wakhan.py hapcorrect --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --tumor-phased-vcf ${VCF} --genome-name ${SAMPLE_T} --out-dir-plots ${SAMPLE_T}
VCF='<genome_abc_output>/phasing_output/rephased.vcf.gz'
#Index for rephased VCF
tabix ${VCF}
#Haplotag Tumor with rephased VCF
whatshap haplotag --reference ${REF_FASTA} ${VCF} ${BAM_T} -o ${SAMPLE_T}.haplotagged.bam --ignore-read-groups --tag-supplementary --skip-missing-contigs --output-threads=4
#Index for Tumor haplotagged BAM
samtools index ${SAMPLE_T}.haplotagged.bam
#Severus tumor-only mode
severus --target-bam ${SAMPLE_T}.haplotagged.bam --out-dir severus_out -t 16 --phasing-vcf ${VCF} \
--vntr-bed ./vntrs/human_GRCh38_no_alt_analysis_set.trf.bed --PON ./pon/PoN_1000G_hg38.tsv.gz
#Wakhan cna() tumor-only mode
python wakhan.py cna --threads 16 --reference ${REF_FASTA} --target-bam ${BAM_T} --tumor-phased-vcf ${VCF} --genome-name ${SAMPLE_T} \
--breakpoints severus_out/severus_somatic.vcf --use-sv-haplotypes
- --reference: Reference file path
- --target-bam: Path to target BAM files (must be indexed)
- --out-dir-plots: Path to output coverage plots
- --genome-name: Genome cell line/sample name to be displayed on plots
- --normal-phased-vcf: Normal phased VCF file (tumor/normal pair mode), used to generate the het SNP frequency pileup for the tumor BAM (in tumor-only mode, use the phased --tumor-phased-vcf instead)
- --tumor-phased-vcf: Phased VCF, required in tumor-only mode
- --breakpoints: (Highly recommended) Somatic SV calls in VCF format
- --threads: Number of threads to use
- --contigs: List of contigs (chromosomes, default: chr1-22,chrX) to be included in the plots [e.g., chr1-22,chrX,chrY]. Note: use 1-22,X [e.g., 1-22,X,Y] if the REF, BAM, and VCF entries do not use the chr prefix. The same applies to the --centromere and --cancer-genes parameters: when names have no chr prefix, use the *_nochr.bed files available in src/annotations, or customized ones.
- --use-sv-haplotypes: Use phased Severus SVs/breakpoints [default: disabled]
- --cpd-internal-segments: Run the change point detection algorithm on internal segments after the breakpoint/cpd algorithm, for more precise segmentation
- --cut-threshold: Maximum cut threshold for coverage (read depth) plots [default: 100]
- --centromere: Path to centromere annotations BED file [default: annotations/grch38.cen_coord.curated.bed]
- --cancer-genes: Path to cancer genes in TSV format to display in CNA plots [default: annotations/CancerGenes.tsv]
- --pdf-enable: Enable PDF output for plots
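To make the --contigs notation concrete, here is a small sketch of how a specification such as chr1-22,chrX could be expanded into explicit contig names. How Wakhan parses this value internally is not shown here; `expand_contigs` is a hypothetical helper that reflects one plausible reading of the notation, with and without the chr prefix.

```python
import re
from typing import List


def expand_contigs(spec: str) -> List[str]:
    """Expand a spec such as 'chr1-22,chrX' into explicit contig names."""
    contigs: List[str] = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # A numeric range, optionally carrying the 'chr' prefix on its first number.
        match = re.fullmatch(r"(chr)?(\d+)-(\d+)", item)
        if match:
            prefix = match.group(1) or ""
            start, end = int(match.group(2)), int(match.group(3))
            if start > end:
                raise ValueError(f"descending contig range: {item}")
            contigs.extend(f"{prefix}{i}" for i in range(start, end + 1))
        else:
            # Single names such as chrX, X or chrY are kept unchanged.
            contigs.append(item)
    return contigs


print(expand_contigs("chr1-3,chrX"))  # ['chr1', 'chr2', 'chr3', 'chrX']
print(expand_contigs("1-3,X,Y"))      # ['1', '2', '3', 'X', 'Y']
```

Comparing the expanded names against the sequence names in the reference, BAM, and VCF headers is a quick way to catch a chr/no-chr mismatch before a long run.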
Wakhan can also be used when phasing in the input tumor is poor, or when the analysis is performed without considering phasing:
- --without-phasing: Enable this if CNA analysis is performed without phasing, in conjunction with --phaseblock-flipping-disable and --histogram-coverage, along with all other required parameters as shown in the example command
A sample command line for running unphased mode (mouse WGS data) could be:
python wakhan.py --threads <> --reference <mouse_ref> --target-bam <tumor_bam> --cut-threshold 75 --normal-phased-vcf <phased_normal.vcf.gz> --out-dir-plots <mouse_output> --genome-name mouse --copynumbers-subclonal-enable --loh-enable --breakpoints <severus_somatic.vcf> --contigs <chr1-19,chrX> --without-phasing --phaseblock-flipping-disable --histogram-coverage --centromere <annotations/mouse_chr.bed> --cpd-internal-segments --hets-ratio 0.4 --hets-smooth-window 10
Here is a sample copy number/breakpoints output plot without phasing for a mouse subline dataset.
Wakhan accepts a Severus structural variants VCF as breakpoint input via the --breakpoints parameter to detect copy number changes; this option is highly recommended.
If --breakpoints is not used, --change-point-detection-for-cna should be used instead, which falls back to the change point detection algorithm from ruptures.
Users can set both --ploidy-range [default: 1.5-5.5 -> [min-max]] and --purity-range [default: 0.5-1.0 -> [min-max]] to inform the copy number model about normal contamination in the tumor, so that copy number states are estimated correctly.
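To illustrate why normal contamination matters, the sketch below uses a textbook tumor/normal mixture model. This is an assumption made for illustration and not necessarily the model Wakhan implements: each haplotype's expected coverage is taken as per_copy * (purity * copies + (1 - purity) * 1), where normal cells contribute one copy per haplotype. The function names and the numbers are hypothetical.

```python
from typing import Tuple


def parse_range(text: str) -> Tuple[float, float]:
    """Parse a 'min-max' string such as '1.5-5.5' into (min, max)."""
    low, high = (float(part) for part in text.split("-"))
    if low > high:
        raise ValueError(f"min exceeds max in range: {text}")
    return low, high


def expected_haplotype_coverage(copies: int, purity: float, per_copy: float) -> float:
    """Expected coverage of one haplotype under the assumed mixture model."""
    # Tumor cells carry `copies` of the haplotype; normal cells carry exactly one.
    return per_copy * (purity * copies + (1.0 - purity) * 1.0)


def implied_copies(coverage: float, purity: float, per_copy: float) -> float:
    """Invert the mixture model: tumor copies implied by an observed coverage."""
    return (coverage / per_copy - (1.0 - purity)) / purity


ploidy_min, ploidy_max = parse_range("1.5-5.5")  # --ploidy-range default
purity_min, purity_max = parse_range("0.5-1.0")  # --purity-range default

cov = expected_haplotype_coverage(copies=3, purity=0.5, per_copy=15.0)
print(cov)                                             # 30.0
print(implied_copies(cov, purity=0.5, per_copy=15.0))  # 3.0
print(implied_copies(cov, purity=1.0, per_copy=15.0))  # 2.0
```

In this toy case, treating a 50% pure sample as fully pure turns a three-copy haplotype into an apparent two-copy one, which is the kind of error a well-chosen purity range helps avoid.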
By default, Wakhan uses COSMIC cancer census genes (100 genes freely available) to display the corresponding copy number states in the <genome_name>_<ploidy>_<purity>_<confidence>_genes_genome.html file.
The complete COSMIC cancer census gene set for academic/research purposes (Cosmic_CancerGeneCensus_v101_GRCh38.tsv) can be downloaded from COSMIC.
Run scripts/cosmic.py to extract the required fields, then pass the resulting cosmic_genes.tsv to --user-input-genes.
Alternatively, --user-input-genes can point to a custom BED file of genes or a subset of genes [examples in src\annotations\user_input_genes_example_.bed]; these genes will be used in the plots instead of the default COSMIC cancer genes.
GRCh38 reference genes are used by default; set --reference-name to an alternative (i.e., chm13) to switch to T2T-CHM13 coordinates.
Wakhan produces read coverage data coverage.csv (bin-size based read coverage), phase-set read coverage data coverage_ps.csv, and phase-corrected coverage phase_corrected_coverage.csv (as well as the tumor BAM pileup pileup_SNPs.csv in tumor/normal mode), and stores them in the coverage_data directory inside the --out-dir-plots location.
If this data has already been generated by a previous Wakhan run, Wakhan can be rerun with --quick-start and --quick-start-coverage-path <path to coverage_data directory -> e.g., /home/rezkuh/data/1437/coverage_data> in addition to the required parameters in the example runs above.
This saves significant runtime by not invoking the coverage and pileup methods.
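Before adding the quick-start flags to a rerun, it can help to confirm that the earlier coverage_data directory actually holds the files listed above. The following is a minimal sketch: `quick_start_args` is a hypothetical helper, and the assumption that all of these files must be present for a quick start is mine rather than something Wakhan documents.

```python
from pathlib import Path
from typing import List

# File names taken from the description above.
COVERAGE_FILES = ["coverage.csv", "coverage_ps.csv", "phase_corrected_coverage.csv"]
# Additionally written in tumor/normal mode.
TUMOR_NORMAL_FILES = ["pileup_SNPs.csv"]


def quick_start_args(coverage_dir: str, tumor_normal: bool) -> List[str]:
    """Return the quick-start flags if the directory looks complete, else []."""
    required = COVERAGE_FILES + (TUMOR_NORMAL_FILES if tumor_normal else [])
    directory = Path(coverage_dir)
    missing = [name for name in required if not (directory / name).is_file()]
    if missing:
        # Assumed policy: fall back to a full run when anything is missing.
        return []
    return ["--quick-start", "--quick-start-coverage-path", str(directory)]


# Placeholder path; prints [] unless such a directory with all files exists.
print(quick_start_args("genome_abc_output/coverage_data", tumor_normal=True))
```

The returned flags can be appended to an otherwise unchanged Wakhan command line; an empty list simply means the run proceeds without quick start.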
Phase-switch correction and copy number estimation output, with coverage profiles, for a few arbitrary cell lines is included in the examples directory.
Tumor purity and ploidy values are calculated based on the best confidence scores, and the solution(s) are ranked accordingly and output as solution_<N> symlink directories.
Each sub-folder in the output directory represents the best <ploidy><purity><confidence> values.
- <genome-name>_genome_copynumber_details.html: Genome-wide copy number plots with coverage information on the same axis
- <genome-name>_copynumber_breakpoints.html: Genome-wide copy number plots with coverage information on the opposite axis, plus breakpoint and gene annotations
- <genome-name>_copynumber_breakpoints_subclonal.html: Genome-wide subclonal/fractional copy number plots with coverage information on the opposite axis, plus breakpoint and gene annotations (--copynumbers-subclonal-enable)
- bed_output: Contains copy number segments in BED format
- vcf_output: Contains copy number segments in VCF format
- variation_plots: Chromosome-scale copy number plots with segmentation, coverage, and LOH
The following coverage and SNPs/LOH plots and BED directories in the output folder are independent of the CNA analysis:
- snps_loh_plots: SNP and SNP ratio plots with LOH representation, chromosome-scale and genome-wide (in tumor-only mode)
- <genome-name>_genome_loh.html: Genome-wide LOH plot (in tumor-only mode)
- coverage_plots: Haplotype-specific coverage plots for chromosomes, with an option for unphased coverage
- coverage_data: Haplotype-specific phase-corrected coverage data, including the SNPs pileup
- phasing_output: Phase-switch error correction plots and the phase-corrected VCF file (rephased.vcf.gz)
Wakhan requires a tumor BAM and a normal phased VCF (in tumor-normal mode) or a tumor phased VCF (in tumor-only mode). The following Clair3 command with longphase as the phasing tool is recommended for generating the required phased VCF.
# Tumor-normal mode
BAM=<path to normal BAM>
# Tumor-only mode
BAM=<path to tumor BAM>
Running Clair3 on BAM with --enable_phasing --longphase_for_phasing for germline variants:
#For ONT data
clair3 --bam_fn=${BAM} --ref_fn=${REF_FASTA} --threads=${THREADS} --platform=ont --model_path=</clair3_models/r1041_e82_400bps_sup_v420/> --output=${OUTPUT_DIR} --enable_phasing --longphase_for_phasing
#For PacBio data
clair3 --bam_fn=${BAM} --ref_fn=${REF_FASTA} --threads=${THREADS} --platform=hifi --model_path=</clair3_models/hifi/> --output=${OUTPUT_DIR} --enable_phasing --longphase_for_phasing
We also recommend using structural variants (breakpoints) in Wakhan; please refer to Severus. To use phased SVs/breakpoints, please refer to the section on generating Severus phased SVs.