rhysf/HaplotypeTools


Documentation

All documentation for HaplotypeTools can be found at https://github.com/rhysf/HaplotypeTools

Introduction

HaplotypeTools is a collection of scripts for phasing aligned whole-genome sequencing (WGS) data and analyzing haplotype structure. It reconstructs haplotypes from reads that overlap two or more bi-allelic heterozygous positions, allowing direct inference of phase relationships between variants.
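As a rough illustration of read-based phasing (not the HaplotypeTools implementation), the sketch below decides the phase of two heterozygous sites from the reads that span both. The function name and input layout are hypothetical; the defaults mirror the -m (minimum read depth) and -c (cut-off percent) options described under "Individual script details".

```python
from collections import Counter


def phase_pair(read_alleles, min_depth=4, cutoff_pct=90.0):
    """Decide the phase of two heterozygous sites from reads spanning both.

    read_alleles: one (allele_at_site1, allele_at_site2) tuple per read,
    where each allele is 0 (reference) or 1 (alternate).
    Returns "cis" (0-0 / 1-1 on the same haplotype), "trans" (0-1 / 1-0),
    or None when the reads do not support a confident call.
    """
    if len(read_alleles) < min_depth:
        return None  # too few reads overlap both sites
    counts = Counter("cis" if a == b else "trans" for a, b in read_alleles)
    best, n_best = counts.most_common(1)[0]
    if 100.0 * n_best / len(read_alleles) < cutoff_pct:
        return None  # reads disagree too often to assign a phase
    return best
```

For example, nine reads supporting "trans" and one supporting "cis" give 90% agreement, which just meets the default cut-off.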

Once phased, the resulting VCF files can be:

  • compared to consensus genomes (e.g. from known relatives or parental strains, as generated by VCF_and_FASTA_to_consensus_FASTA.pl),
  • or compared between phased VCFs (from different isolates or samples within a multi-sample VCF) to identify crossovers and potential recombination events.

The main script, haplotypetools, which does the phasing, can take considerable time on large genomes.
It should be run in parallel when possible using the -g, -a, and -q options.

Phased-in-Any / Phased-in-All (PIA)

Genomic regions that are phased in ≥1 VCF (or across multiple samples within a VCF) are referred to as phased-in-any/all (PIA) regions.

  • -p 1 (phased in any, default): outputs all regions phased in at least one sample, maximizing genomic coverage.
  • -p 2 (phased in all): outputs only regions phased in every sample, useful for comparative analyses requiring shared loci.
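One simplified way to picture the two modes is as set operations on phased blocks: "any" behaves like a union and "all" like an intersection. The sketch below is only an interpretation under that assumption; the function name, the half-open (start, end) coordinates, and the single-contig input are hypothetical and do not describe how VCF_phased_to_PIA.pl works internally.

```python
def pia_regions(samples, mode=1):
    """Return regions covered by phased blocks on a single contig.

    samples: one list per sample of half-open (start, end) phased blocks;
             blocks within a sample are assumed not to overlap each other.
    mode:    1 = covered by at least one sample, 2 = covered by every sample.
    """
    need = 1 if mode == 1 else len(samples)
    events = []
    for blocks in samples:
        for start, end in blocks:
            events.append((start, 1))   # a block opens
            events.append((end, -1))    # a block closes
    events.sort()  # at equal positions, closes (-1) sort before opens (+1)

    regions, depth, open_at = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= need and open_at is None:
            open_at = pos
        elif depth < need and open_at is not None:
            if pos > open_at:
                regions.append((open_at, pos))
            open_at = None
    return regions
```

With blocks (100, 500) in one sample and (300, 800) in another, mode 1 yields (100, 800) and mode 2 yields (300, 500).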

PIA files act as intermediates for:

  • extracting haplotypes in FASTA format, and
  • downstream analysis with HaplotypePlacer.

For comparisons against consensus genomes, -p 1 is usually recommended; for identifying high-confidence shared sites across samples, -p 2 may be preferable.


HaplotypePlacer

Haplotype_placer.pl compares phased haplotypes against reference or consensus genomes and iteratively constructs trees (via FastTree) to infer the nearest relative for each haplotype.
This approach is useful for:

  • identifying likely parental origins of recombinants or hybrids,
  • visualizing recombination breakpoints, and
  • summarizing haplotype clustering patterns.

The script produces both tabular summaries and “window” output files that can be plotted (e.g., using the supplied R scripts).
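To make the idea of a "nearest relative" concrete, here is a minimal sketch that ranks consensus sequences by simple percent identity to a haplotype. This is a deliberately simpler stand-in: Haplotype_placer.pl builds trees with FastTree, whereas the code below does no tree inference at all. The function names and the rule of skipping N and gap characters are assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length sequences.

    Positions where either sequence has N or a gap (-) are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "N-" or y in "N-":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def nearest_relative(haplotype, consensus_by_name):
    """Name of the consensus sequence most similar to the haplotype."""
    return max(
        consensus_by_name,
        key=lambda name: percent_identity(haplotype, consensus_by_name[name]),
    )
```

Identity ranking ignores the evolutionary model that a tree provides, so it is best treated as a quick sanity check rather than a replacement for the tree-based placement.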


Phased VCF Comparison

VCF_phased_compare_to_VCF_phased.pl compares two phased VCFs directly to identify genomic regions containing crossovers between them.
It distinguishes between:

  • Likely gene conversion events (closely spaced crossovers, <1 kb), and
  • Candidate meiotic crossovers (isolated events with sufficient local phase support).

This enables fine-scale analysis of recombination structure across isolates or lineages.

Support

For issues, questions, comments, or feature requests, please check or post to the Issues tab on GitHub: https://github.com/rhysf/HaplotypeTools/issues

Prerequisites:

  • R (plyr, RColorBrewer, data.table, ggplot2, tools)
  • Perl 5 (tested on version 34)
  • Bio::SeqIO
  • Bio::DB::HTS
  • Samtools v0.1.10 or higher (samtools.sourceforge.net)
  • FastTree (http://www.microbesonline.org/fasttree/)

Updates:

  • 20 October 2025
    • Fixed directionality issue where some crossovers were only identified in one direction.
    • Major code cleanup for readability and consistency (simplified make_printable_lines logic).
    • Added refined crossover classification: Crossover_likely_gene_conversion = detected crossovers <1 kb apart. Crossover_meiosis_candidate = isolated crossovers ≥1 kb apart with ≥3 phased sites on each side.
    • Added refined summary output with counts and percentages of new crossover categories.
    • Introduced new R script for plotting crossover distance distributions, including log-scaled histograms and automatic PDF output.
    • General improvements to structure, logging, and output consistency.
  • 20 October 2021
    • Example data included
  • 21 June 2021
    • Stable release

Example pipeline for phasing individual sample

git clone git@github.com:rhysf/HaplotypeTools.git
cd HaplotypeTools/

./haplotypetools \
  -v example/Hybrid-SA.vcf-chr1.vcf \
	-b example/Hybrid-SA.vcf-chr1.bam \
	-u Hybrid-SA-EC3 \
	-f example/Hybrid-SA.vcf-chr1.fasta

perl util/VCF_phased_to_PIA.pl \
  -v example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf \
  -f example/Hybrid-SA.vcf-chr1.fasta

perl util/VCF_phased_and_PIA_to_FASTA.pl \
  -v example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf \
  -l example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab \
  -r example/Hybrid-SA.vcf-chr1.fasta

perl util/FASTA_compare_sequences.pl \
  -f example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_1.fasta \
  -a example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_2.fasta \
  -o a \
  > example/haplotypes_length_and_pc_similarity.tab

perl util/Haplotype_placer.pl \
	-p example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab \
	-a example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_1.fasta \
	-b example/Hybrid-SA.vcf-chr1.vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab_2.fasta \
	-n example/Name_Type_Location_consensus_genomes.tab \
	-d y \
	-l example/Hybrid-SA.vcf-chr1.fasta

perl util/Windows_haplotypes_to_R_figure.pl \
	-w HaplotypeTools_windows -s 7000

Example pipeline for phasing multi sample VCF

haplotypetools -v <multi sample vcf> \
	-b <sorted BAMs (separated by comma)> \
	-u <VCF sample names in order of input BAM files (separated by comma)> \
	-f <reference.fasta>

perl util/VCF_phased_to_PIA.pl \
	-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
	-f <reference.fasta> \
	-s <VCF sample names to restrict analysis to (optional; only if PIA is for a subset of samples)>

perl util/VCF_phased_and_PIA_to_FASTA.pl \
	-v <vcf-Phased-m-4-c-90-r-10000.vcf> \
	-l <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-all.tab> \
	-r <reference.fasta> \
	-u <sample name in VCF to pull haplotypes from>

perl util/FASTA_compare_sequences.pl \
	-f <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-<opt_u>.tab_1.fasta> \
	-a <vcf-Phased-m-4-c-90-r-10000.vcf-PIA-p-1-c-10-s-<opt_u>.tab_2.fasta> \
	-o a > summary
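Because the -b and -u lists must be given in the same order, it can help to build both from a single mapping. The helper below is a hypothetical convenience wrapper, not part of HaplotypeTools; it only assembles the argument list from the required options shown above and assumes the haplotypetools script is in the current directory.

```python
def haplotypetools_command(vcf, reference, bam_by_sample):
    """Build the argument list for phasing a multi-sample VCF.

    bam_by_sample: dict mapping VCF sample name -> sorted BAM path.
    Dict insertion order is used for both -u and -b, so the two
    comma-separated lists always stay aligned.
    """
    if not bam_by_sample:
        raise ValueError("at least one sample/BAM pair is required")
    samples = list(bam_by_sample)
    bams = [bam_by_sample[name] for name in samples]
    return [
        "./haplotypetools",
        "-v", vcf,
        "-b", ",".join(bams),
        "-u", ",".join(samples),
        "-f", reference,
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the phasing step.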

Example pipeline for comparing haplotypes to other genomes

To make consensus genomes from your VCF and FASTA reference genome, you should run the following command:

perl util/VCF_and_FASTA_to_consensus_FASTA.pl \
	-v <sample1.vcf> \
	-r <reference.fasta>

Next, make a tab-delimited file (Name_Type_Location.tab) in the following format, e.g.:

Sample1	FASTA	/dir/with/your/consensus/FASTA/sample1.vcf-WGS-s-n-i-n-n-N-consensus.fasta
Sample2	FASTA	/dir/with/your/consensus/FASTA/sample2.vcf-WGS-s-n-i-n-n-N-consensus.fasta
Sample3	FASTA	/dir/with/your/consensus/FASTA/sample3.vcf-WGS-s-n-i-n-n-N-consensus.fasta
...

Optionally, make a tab-delimited file for clades/lineages, e.g.:

Sample1	clade1
Sample2	clade1
Sample3	clade2
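Both files are plain tab-delimited text, so they can be generated rather than typed by hand. The sketch below is one possible way to do that; the function names are hypothetical, and deriving the sample name from the part of the file name before ".vcf" is an assumption that you should adapt to your own naming.

```python
import os


def name_type_location_rows(fasta_paths):
    """One 'Name<TAB>FASTA<TAB>Location' line per consensus FASTA path."""
    rows = []
    for path in fasta_paths:
        # Assumption: the sample name is the file name up to ".vcf".
        name = os.path.basename(path).split(".vcf")[0]
        rows.append(f"{name}\tFASTA\t{path}")
    return rows


def clade_rows(clade_by_name):
    """One 'Name<TAB>Clade' line per sample, in dict order."""
    return [f"{name}\t{clade}" for name, clade in clade_by_name.items()]
```

Writing the result is then a single step, for example open("Name_Type_Location.tab", "w").write("\n".join(rows) + "\n").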

Then run the following commands:

perl util/VCF_phased_to_PIA.pl \
	-v <phased_sample1.vcf>,<phased_sample2.vcf>,<phased_sample3.vcf> \
	-f <reference.fasta>

perl util/VCF_phased_and_PIA_to_FASTA.pl \
	-v <phased_sample1.vcf> \
	-l <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
	-r <reference.fasta> 

perl util/Haplotype_placer.pl \
	-p <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
	-a <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_1.fasta> \
	-b <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_2.fasta> \
	-n <Name_Type_Location.tab>

For window plots, run:

perl util/Haplotype_placer.pl \
	-p <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab> \
	-a <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_1.fasta> \
	-b <phased_sample1.vcf-plus_other_VCFs-PIA-p-1-c-10-s-all.tab_2.fasta> \
	-n <Name_Type_Location.tab> \
	-d y \
	-l <reference.fasta>
perl util/Windows_haplotypes_to_R_figure.pl -w HaplotypeTools_windows

Example pipeline to identify crossovers

perl util/VCF_phased_compare_to_VCF_phased.pl \
	-a <phased_sample1.vcf> \
	-b <phased_sample2.vcf> \
	-o sample1_vs_sample2

or

perl util/VCF_phased_compare_to_VCF_phased.pl \
	-a <phased.vcf> \
	-c <sample ID 1 from VCF> \
	-d <sample ID 2 from VCF> \
	-o sample1_vs_sample2

The output 'Positions-Refined.tab' applies cutoffs (currently hard-coded in VCF_phased_compare_to_VCF_phased.pl) to distinguish:

  • Crossover_likely_gene_conversion (crossovers < 1 kb from each other, or < 3 supporting phased hets on either side of the crossover),
  • Crossover_meiosis_candidate (crossovers >= 1 kb apart with >= 3 phased hets per side), and
  • Cross-over_unresolved (anything that has either a distance < 1 kb or < 3 phased sites on either side).
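The cutoffs can be sketched as a small classifier. This is an illustration, not the code in VCF_phased_compare_to_VCF_phased.pl: the function name and input layout are hypothetical, and the precedence is an assumption based on the definitions in the Updates section (distance is checked first, then phased-site support, and everything else is left unresolved).

```python
GENE_CONVERSION = "Crossover_likely_gene_conversion"
MEIOSIS = "Crossover_meiosis_candidate"
UNRESOLVED = "Cross-over_unresolved"


def classify_crossovers(crossovers, min_distance=1000, min_sites=3):
    """Label crossovers on one contig.

    crossovers: (position, phased_sites_left, phased_sites_right) tuples.
    Returns (position, label) tuples sorted by position.
    """
    ordered = sorted(crossovers)
    labels = []
    for i, (pos, left, right) in enumerate(ordered):
        # Distance to the closest neighbouring crossover, if there is one.
        gaps = []
        if i > 0:
            gaps.append(pos - ordered[i - 1][0])
        if i + 1 < len(ordered):
            gaps.append(ordered[i + 1][0] - pos)
        nearest = min(gaps) if gaps else float("inf")

        if nearest < min_distance:
            label = GENE_CONVERSION
        elif left >= min_sites and right >= min_sites:
            label = MEIOSIS
        else:
            label = UNRESOLVED
        labels.append((pos, label))
    return labels
```

Two crossovers 400 bp apart are both labelled as likely gene conversion, while an isolated crossover with only one phased site on one side stays unresolved.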

A possible next step is to plot the distances between all crossovers (or subset the files to only those of interest, e.g. Crossover_meiosis_candidate), either for a single Phase-Positions.tab file or for a file that lists the locations of many Phase-Positions.tab files, one per line:

Rscript ./HaplotypeTools/util/plot_distance_of_crossovers_many_files.R file_of_all_phased_positions.txt
Rscript ./HaplotypeTools/util/plot_distance_of_crossovers.R vcf-HaplotypeTools_VCF1_vs_VCF2-Phase-Positions.tab
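The single-column list file for the many-files script can be assembled automatically. The helper below is a hypothetical example; it assumes the comparison outputs sit somewhere under one directory and end in "-Phase-Positions.tab", as in the file name above.

```python
from pathlib import Path


def write_positions_list(search_dir, out_file="file_of_all_phased_positions.txt"):
    """Write the path of every *-Phase-Positions.tab file, one per line."""
    paths = sorted(str(p) for p in Path(search_dir).rglob("*-Phase-Positions.tab"))
    Path(out_file).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting text file can then be given to plot_distance_of_crossovers_many_files.R as its only argument.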

Individual script details:

  • HaplotypeTools parameters are shown below, followed by their default settings given in [].
  • Files are highlighted in <>.
HaplotypeTools.pl 
Parameters: -v <VCF> 
            -b <BAM (sorted) separated by comma if phasing multiple samples in a multi VCF> 
            -f <reference fasta>
            -u VCF sample names in order of input BAM files (separated by comma if phasing multiple samples)
Optional:   -c Cut-off percent reads supporting phase group [90]
            -m Minimum read depth overlapping two heterozygous SNPs [4]
            -r Max phase length [10000]
            -s Steps (1=process VCF, 2=process BAM, 3=assign read info to VCF, 4=validate and assign phase groups, 5=concatenate) [12345]
Parallel:   -g Run commands on the grid (y/n) [n]
            -a Platform (UGER, LSF, GridEngine) [UGER]
            -q Queue name [short]
Outputs:    -o Output folder for tmp files [opt_v-HaplotypeTools-phased-r-opt_r]
            -p Phased VCF [opt_v-Phased-m-opt_m-c-opt_c-r-opt_r.vcf]
            -y Phased summary [opt_v-Phased-m-opt_m-c-opt_c-r-opt_r-summary.tab]
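Since each step names its output after its input plus the option values, the file names used along a pipeline can be predicted in advance. The helpers below are hypothetical and only restate the default patterns listed in this section and used in the example pipelines, for a single input VCF; they do not call HaplotypeTools.

```python
def phased_vcf_name(vcf, min_depth=4, cutoff=90, max_len=10000):
    """Default phased VCF name: opt_v-Phased-m-opt_m-c-opt_c-r-opt_r.vcf"""
    return f"{vcf}-Phased-m-{min_depth}-c-{cutoff}-r-{max_len}.vcf"


def pia_name(phased_vcf, mode=1, min_len=10, samples="all"):
    """Default PIA table name for a single phased VCF."""
    return f"{phased_vcf}-PIA-p-{mode}-c-{min_len}-s-{samples}.tab"


def haplotype_fasta_names(pia):
    """The two haplotype FASTA names written next to a PIA table."""
    return f"{pia}_1.fasta", f"{pia}_2.fasta"
```

For example, phased_vcf_name("example/Hybrid-SA.vcf-chr1.vcf") gives the phased VCF path expected when the defaults are left unchanged.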

FASTA_compare_sequences.pl
Parameters: -f <FASTA>
Optional:   -o Output (p=pairwise within opt_f, s=summary of pairwise within opt_f, a=additional FASTA: 1:1 order) [p]
            -g Exclude if gaps percent is greater than this [100]
            -a <Additional FASTA in same order> []

Haplotype_list_count.pl <haplotypes list> <full sequence file>

Haplotype_placer.pl 
Parameters: -p <phased in any/all; PIA>
            -a <Haplotype file 1 FASTA>
            -b <Haplotype file 2 FASTA>
            -n <Name tab Type tab Location for consensus genomes>
Optional:   -f Folder for output FASTA and Trees [HaplotypeTools_output]
            -c Clades/Lineages/Metadata to cluster isolates by. Format (Name tab Clades/Lineages/Metadata) []
            -t Location of FastTree [/seq/annotation/bio_tools/FastTree/current/FastTreeMP]
            -m Minimum length of haplotypes [100]
            -v Verbose (y/n) [n]
Windows:    -d Calculate windows (y/n) [n]
            -l <Reference FASTA> []
            -z Window length [10000]
Output:     -f Output folder for genomic FASTA files and Trees [HaplotypeTools_output]
            -s Output summary [HaplotypeTools_summary]
            -w Output windows [HaplotypeTools_windows]

Haplotype_FASTA_files_to_compare_to_IRMS_het_sites.pl
Parameters: -a <PIA_1 FASTA> 
            -b <PIA_2 FASTA> 
            -c <IRMS HET.fasta> 
            -d <IRMS HET.details>
Optional:   -p print to stdout (s) or (f) [s]
            -o File to print to if opt_p=f <opt_a-and-2.accuracy>
            -z Verbose for testing (y/n) [n]


VCF_and_FASTA_to_consensus_FASTA.pl 
Parameters: -v <VCF> 
            -r <reference FASTA>
Optional:   -i Include homozygous indels (y/n) [n]
            -h Include bi-allelic heterozygous positions if ref base not included (y/n) [y]
            -a For heterozygous positions, use ambiguity code (a) or first position for haploid consensus (f) [f]
            -n Character for ambiguous [N]
            -s Restrict to only this supercontig [n]
            -t Restrict to only this isolate [n]
Notes: Prints to opt_v-isolate-name-consensus.fasta

VCF_phase.pl
Parameters: -v <VCF from step 1 of HaplotypeTools>
            -a <alignment file from step 2 of HaplotypeTools>

Optional:   -s Sample number
Output:     -o Output [opt_v-phased-opt_s]

VCF_phase_haploid.pl <VCF file 1> > outfile.vcf

VCF_phased_and_PIA_to_FASTA.pl 
Parameters: -v <VCF file> 
            -l <PIA file (contig tab start tab stop)> 
            -r <Reference FASTA>
Optional:   -u Sample name in VCF to pull haplotypes from [WGS]
            -p Printing option (o=outfile, s=split to opt_l_1 and _2) [s]
            -e Exclude printing if haplotypes are identical (y/n) [n]
            -m min length [10]
            -i Include indels (y/n) [n]
            -z Verbose for testing (y/n) [n]

VCF_phased_append_name_to_phase_group.pl <VCF file 1> > outfile.vcf

VCF_phased_calculate_haplotype_lengths.pl <Phased VCF file (with optional additional VCFs separated by comma)>
Optional:  -s Sample name (with optional additional sample names separated by comma)
           -e Output results for every sample (y/n) [n]

VCF_phased_compare_to_VCF_phased.pl
Parameters: -a <VCF file 1>
            -b <VCF file 2>
	    or

            -a <VCF>
            -c Sample ID 1
            -d Sample ID 2
Optional:   -o Outfile extension [HaplotypeTools_VCF1_vs_VCF2]

VCF_phased_to_PIA.pl
Parameters: -v <VCFs (separated by comma)> 
            -f <reference FASTA>
Optional:   -c Length cut-off for minimum haplotype to be considered [10]
            -p Phased in any (1), Phased in all (2) [1]
            -s Sample names to restrict analysis to (separated by comma)
            -t Phase Tag (HaplotypeTools uses PID, Whatshap uses PS or HP etc) [PID]
Output:     -u Summary [opt-v-PIA-p-Opt_p-c-Opt_c-s-Opt_s.summary]
            -o Output [opt-v-PIA-p-Opt_p-c-Opt_c-s-Opt_s.tab]

VCF_phased_validate_and_assign_phase_groups.pl
Parameters: -v <VCF-phased-Sample_Number (from VCF_phase.pl)>
Optional:   -c Cut-off percent reads supporting phase group [90]
            -m Minimum read depth overlapping two heterozygous SNPs [4]
            -z Verbose for error checking (y/n) [n]
Output:     -o Output VCF [opt_v-and-assigned.tab]
            -l Tallies for percent of reads agreeing with phases [opt_v-Tally-agree-disagree-opt_m-MD-opt_c-pc_cutoff.tab]

Windows_haplotypes_to_R_figure.pl
Parameters: -w <Windows haplotypes dataframe>
Optional:   -s Scaling factor for width (nucleotides per mm) [20000]
            -h Scaling factor for height (genome per mm) [70]
            -r Resolution (The nominal resolution in ppi) [100]
            -c Minimum contig length to include label [1000000]
            -g Optional size of axis text [2]
            -l Legend size [2]
            -x X-min [0]

About

A toolkit for identifying recombination and recombinant genotypes
