GenoTools

Getting Started

GenoTools is a suite of automated genotype data processing steps written in Python. The core pipeline was built for quality control and ancestry estimation of data in the Global Parkinson's Genetics Program (GP2).

Setup just requires:

git clone https://github.com/dvitale199/GenoTools
cd GenoTools
pip install .

The core pipeline can be called as:

python3 run_qc_pipeline.py --geno <genotype file to be QC'd (plink format)> --ref <genotype file of reference panel (plink format)> --ref_labels <labels for reference panel ancestry> --out <path and prefix for output>

options

--geno: Path to genotype to be processed in Plink (.bed/.bim/.fam) format. Include everything up to .bed/.bim/.fam. MUST HAVE PHENOTYPES OR SEVERAL STEPS WILL FAIL

--ref: Path to reference panel genotype in Plink (.bed/.bim/.fam) format. Include everything up to .bed/.bim/.fam. For GP2, we use a combination of 1kGenomes + Ashkenazi Jewish Reference panel

--ref_labels: Path to a tab-separated (Plink-style) file containing ancestry labels for the reference panel with the following columns: FID IID label with NO HEADER.

--out: Path and prefix for the QC'd and ancestry-predicted genotype output in Plink (.bed/.bim/.fam) format. Include everything up to .bed/.bim/.fam. For now, this only outputs a log file; functionality to move the output files will be added soon.
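Because `--ref_labels` has a strict layout (tab-separated `FID IID label`, no header), it can help to check the file before launching the pipeline. The following is a minimal sketch using only the standard library; `check_ref_labels` is a hypothetical helper for illustration and is not part of GenoTools.

```python
import csv


def check_ref_labels(path):
    """Return a list of problems found in a ref_labels file (empty if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_no, row in enumerate(reader, start=1):
            if len(row) != 3:
                # Each line must hold exactly FID, IID and label.
                problems.append(f"line {line_no}: expected 3 columns, found {len(row)}")
            elif line_no == 1 and row[0].lstrip("#").upper() == "FID":
                # A first field of FID/#FID suggests a header row was left in.
                problems.append("line 1: looks like a header row; the file must have none")
            elif any(not field.strip() for field in row):
                problems.append(f"line {line_no}: empty field")
    return problems
```

An empty list means the file matches the layout described above; anything else lists the offending lines.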

Core Pipeline Overview

The core pipeline is broken down into 3 main pieces:

Sample-level Quality Control
Ancestry Estimation
Variant-level Quality Control

The quality control steps were developed in large part by Cornelis Blauwendraat, Mike Nalls, Hirotaka Iwaki, Sara Bandres-Ciga, Mary Makarious, Ruth Chia, Frank Grenn, Hampton Leonard, Monica Diez-Fairen, and Jeff Kim of the Laboratory of Neurogenetics and the Center for Alzheimer's and Related Dementias at the National Institute on Aging, NIH, and were adapted into an automated Python package by Dan Vitale.

Sample-level Quality Control

callrate_prune(geno_path, out_path, mind=0.02)
sex_prune(geno_path, out_path, check_sex=[0.25,0.75])

Function Reference

QC.qc

Python Function	Parameters	Returns
`callrate_prune(geno_path, out_path, mind=0.02)` Call Rate Pruning	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam). `mind` float: Missingness threshold. By default, excludes samples with more than 2% missing genotypes. This is much more stringent than the Plink default missingness threshold of 10%.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'outlier_count'` `'output'` dict: paths to output files with keys: `'outliers_path'` `'plink_out'` `'phenos_path'`
`sex_prune(geno_path, out_path, check_sex=[0.25,0.75])` Sex Pruning. Done in 2 steps: 1. Plink `--check-sex` on whole genotype 2. Plink `--check-sex` on X chromosome (`--chr 23 --from-bp 2699520 --to-bp 154931043`)	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam). `check_sex` list: two values giving the F thresholds. A male call is made if F is more than 0.75; a female call is made if F is less than 0.25, which is less stringent than the Plink defaults of 0.8 and 0.2, respectively.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'outlier_count'` `'output'` dict: paths to output files with keys: `'sex_fails'` `'plink_out'`
`het_prune(geno_path, out_path)` Heterozygosity Pruning	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam).	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'outlier_count'` `'output'` dict: paths to output files with keys: `'het_outliers'` `'plink_out'`
`related_prune(geno_path, out_path, related_grm_cutoff=0.125, duplicated_grm_cutoff=0.95)` Relatedness Pruning. Done using GCTA --grm	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam). `related_grm_cutoff` float: GRM cutoff for related samples `duplicated_grm_cutoff` float: GRM cutoff for duplicated samples	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'related_count'`, `'duplicated_count'` `'output'` dict: paths to output files with keys: `'relateds'` `'plink_out'`
`variant_prune(geno_path, out_path)` Variant Pruning. Missingness by: 1. case/control, 2. haplotype. Filters controls for Hardy-Weinberg Equilibrium (removes variants with HWE p-value < 1e-4)	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam).	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'geno_removed_count'`, `'mis_removed_count'`, `'haplotype_removed_count'`, `'hwe_removed_count'`, `'total_removed_count'` `'output'` dict: paths to output files with keys: `'plink_out'`
`avg_miss_rates(geno_path, out_path)` Calculate average missingness rates (sample-level and variant-level).	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam).	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'avg_lmiss'`, `'avg_imiss'`
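Since most QC.qc functions take `(geno_path, out_path, ...)` and return a dict with an `'output'` entry, they can be chained by passing one step's `'plink_out'` to the next. The sketch below shows one way to do that. `run_steps` is a hypothetical helper, and it assumes that `'plink_out'` holds a Plink prefix that is valid as the next step's `geno_path`; neither is confirmed by the table above.

```python
def run_steps(steps, geno_path, out_prefix):
    """Run QC steps in order, feeding each step's 'plink_out' into the next.

    `steps` is a list of (function, keyword-arguments) pairs.
    Returns the final genotype path and the list of per-step result dicts.
    """
    results = []
    current = geno_path
    for func, kwargs in steps:
        # Name each step's output after the function that produced it.
        out_path = f"{out_prefix}_{func.__name__}"
        result = func(current, out_path, **kwargs)
        results.append(result)
        # Steps without a Plink output (e.g. metrics only) leave the path unchanged.
        current = result.get("output", {}).get("plink_out", current)
    return current, results


# Example usage (the import path is an assumption about the package layout):
# from QC.qc import callrate_prune, sex_prune
# final_path, results = run_steps(
#     [(callrate_prune, {"mind": 0.02}), (sex_prune, {"check_sex": [0.25, 0.75]})],
#     "data/raw", "data/qc",
# )
```

Keeping the per-step dicts makes it easy to collect the `'metrics'` entries into a single QC report afterwards.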

QC.utils

Python Function	Parameters	Returns
`shell_do(command, log=False, return_log=False)` Run shell commands from Python	`command` str: Command to be run in shell. `log` bool: Default=False. If True, print stdout. `return_log` bool: Default=False. If True, return stdout.	`stdout` datatype dependent on input command
`merge_genos(geno_path1, geno_path2, out_name)` Merge 2 Plink Genotypes NEEDS TO BE FIXED TO RETURN OUTPUT FILE PATHS AND IMPORTANT METRICS. WILL CHANGE `out_name` to `out_path`	`geno_path1` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `geno_path2` str: Path to the Plink genotypes (everything before .bed/.bim/.fam).	`None` Currently does not return anything but outputs Plink files to out_name
`ld_prune(geno_path, out_name, window_size=1000, step_size=50, rsq_thresh=0.05)` Prune for Linkage Disequilibrium. Produces a subset of markers that are in approximate linkage equilibrium with each other, based on the r^2 threshold provided. NEEDS TO BE FIXED TO RETURN OUTPUT FILE PATHS AND IMPORTANT METRICS. WILL CHANGE `out_name` to `out_path`	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_name` str: Path to the output Plink genotypes (everything before .bed/.bim/.fam). `window_size` int: Default=1000. window size in kilo-bases `step_size` int: Default=50. step size in kilo-bases `rsq_thresh` float: Default=0.05. r^2 threshold for inclusion	`None` Currently does not return anything but outputs Plink files to out_name
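To illustrate the `log` / `return_log` behaviour described for `shell_do`, here is a minimal sketch of a comparable helper built on `subprocess.run`. It is not the GenoTools implementation: `run_command` is a hypothetical name, and running without a shell and raising on a non-zero exit code are choices made for this example.

```python
import shlex
import subprocess


def run_command(command, log=False, return_log=False):
    """Run a command without a shell; optionally print and/or return its stdout."""
    completed = subprocess.run(
        shlex.split(command),  # split the string into an argument list
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode bytes to str
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    if log:
        print(completed.stdout)
    if return_log:
        return completed.stdout
    return None
```

Splitting the command with `shlex.split` avoids `shell=True`, so pipes and redirects in `command` are not interpreted here.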

GWAS.gwas

Python Function	Parameters	Returns
`plink_pca(geno_path, out_path, n_pcs=10)` Principal Component Analysis	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink pca files (everything before .eigenval/.eigenvec). `n_pcs` int: Number of PCs calculated. 10 is the Plink default.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'n_pcs'` `'output'` dict: paths to output files with keys: `'plink_out'`
`assoc(geno_path, covar_path, out_path, model)` Association Analysis	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `covar_path` str: Path to tab-separated covariate file with `#FID IID` in the first two columns (including file extension). `out_path` str: Path to the output Plink association files (everything before .PHENO1.glm.logistic/linear). `model` str: Type of association to be run. `logistic` or `linear`.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'hits'` `'output'` dict: paths to output files with keys: `'pheno_counts'` `'hits'` `'hits_info'` `'plink_out'`
`prs(geno_path, out_path, assoc, clump_p1=1e-3, clump_r2=0.50, clump_kb=250)` PRS Analysis with LD Clumping	`geno_path` str: Path to the Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to the output Plink clump and PRS files (everything before .). `assoc` str: Path to the association file (/path/to/file/name.PHENO1.glm.logistic/linear). `clump_p1` float: Clumping p-value threshold. `clump_r2` float: Clumping r^2 threshold. `clump_kb` int: Clumping kb threshold.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'clump_pval'`, `'clump_r2'`, `'clump_kb'`, `'num_clumps'` `'output'` dict: paths to output files with keys: `'SNP_weights'` `'clump_SNPs'` `'SNP_pvals'` `'ranges'` `'assoc'` `'plink_out'`
`calculate_inflation(pval_array, normalize=False, ncases=False, ncontrols=False)` Lambda/Genomic Inflation Calculation	`pval_array` numpy array: P-values from GWAS summary statistics. `normalize` bool: Normalize to 1000 cases and 1000 controls. Recommended if there is a large discrepancy between cases and controls. `ncases` int: Number of cases. Required if normalize=True. `ncontrols` int: Number of controls. Required if normalize=True.	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'inflation'`
`munge(geno_path, out_path, assoc, ref_panel)` Munge Summary Statistics	`geno_path` str: Path to Plink genotypes (everything before .bed/.bim/.fam). `out_path` str: Path to output Plink frequency report files (everything before .afreq). `assoc` str: Path to the association file output by PRS (/path/to/file/name.assoc). `ref_panel` str: Path to the Plink-format reference panel used to identify non-rsID SNPs (everything before .bed/.bim/.fam).	dict `'step'` str: Name of step in pipeline `'metrics'` dict: metrics from step with keys: `'num_snps'` `'data'` dict: dataframes from step with keys: `'ma_format_df'` `'coordinates'` `'output'` dict: paths to output files with keys: `'plink_out'`
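For readers unfamiliar with the genomic inflation factor, the sketch below shows the standard calculation: lambda is the median of the 1-degree-of-freedom chi-square statistics implied by the p-values, divided by the expected median (about 0.4549), and the 1000-case/1000-control normalization rescales `lambda - 1` by the sample sizes. This is an illustration using only the standard library, not the code behind `calculate_inflation`; the name `genomic_inflation` and its signature are assumptions.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(pvals, ncases=None, ncontrols=None):
    """Return lambda, or lambda_1000 when both sample counts are given."""
    norm = NormalDist()
    # A two-sided p-value corresponds to a 1-df chi-square statistic of z**2.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in pvals if 0 < p <= 1]
    lam = median(chi2) / CHI2_1DF_MEDIAN
    if ncases and ncontrols:
        # Rescale the excess inflation to 1000 cases and 1000 controls.
        lam = 1 + (lam - 1) * (1 / ncases + 1 / ncontrols) / (1 / 1000 + 1 / 1000)
    return lam
```

A value close to 1 suggests little inflation of the test statistics; the normalized form is mainly useful when comparing studies of very different size.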

GWAS.utils

Python Function	Parameters	Returns
`zscore_pval_conversion(zscores=None, pvals=None, stats=None)` Convert Between Z-score and P-values	`zscores` numpy array: Z-scores to be converted to P-values. `pvals` numpy array: P-values to be converted to Z-scores. `stats` numpy array: Summary stats that P-values are based off of (required if converting from P-values to Z-scores).	numpy array Array of either Z-scores or P-values depending on direction of conversion
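The two directions of this conversion can be sketched with the standard normal distribution. A two-sided p-value discards the sign of the Z-score, so going back requires the direction of effect; the sketch assumes that this is the role of `stats` (for example, a beta whose sign is reused). The helper names `z_to_p` and `p_to_z` are hypothetical and work on single values rather than numpy arrays.

```python
from statistics import NormalDist

_NORM = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_to_p(z):
    """Two-sided p-value for a Z-score."""
    return 2 * _NORM.cdf(-abs(z))


def p_to_z(p, stat):
    """Z-score for a two-sided p-value, signed like the effect estimate `stat`."""
    magnitude = -_NORM.inv_cdf(p / 2)  # p / 2 lies in the lower tail, so negate
    return magnitude if stat >= 0 else -magnitude
```

Working from the lower tail (`cdf(-abs(z))` and `inv_cdf(p / 2)`) keeps precision for very small p-values, which are common in GWAS summary statistics.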

Genotype Calling via Illumina Gencall CLI

Genotypes can be called from .idats in parallel as follows:

iaap-cli gencall {bpm} {cluster_file} {ped_dir} -f {idat_dir} -p -t 8

iaap-cli can be found in directory: executables/iaap-cli-linux-x64-1.1.0-sha.80d7e5b3d9c1fdfc2e99b472a90652fd3848bbc7/iaap-cli/

bpm is the Illumina manifest file (.bpm)

cluster_file is the cluster file included with the genotypes (.egt)

ped_dir is the directory which the .ped file will be output to

-f {idat_dir} is the directory that contains .idat files. In our case, each directory contains all of the idats for a single chip.

-p means "output .ped"

-t is number of threads

More information about how the iaap-cli works is in GenoTools/executables/iaap-cli-linux-x64-1.1.0-sha.80d7e5b3d9c1fdfc2e99b472a90652fd3848bbc7 directory.

The way this is used for the GP2 pipeline can be found in the GenoTools/GP2_data_processing/GP2_shulman_processing.ipynb notebook, in which I launch a swarm job in NIH's Biowulf slurm system like so:

with open(f'{swarm_scripts_dir}/idat_to_ped.swarm', 'w') as f:
    # One gencall command is written per unique SentrixBarcode_A value.
    for code in manifest.SentrixBarcode_A.unique():
        idat_to_ped_cmd = f'\
{iaap} gencall \
{bpm} \
{cluster_file} \
{ped_dir}/ \
-f {idat_dir} \
-p \
-t 8'
        f.write(f'{idat_to_ped_cmd}\n')
# No f.close() is needed: the with block closes the file on exit.

In this example, I write one command per chip barcode (each covering a directory of .idat files) and launch a swarm job.
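The same idea can be written as a small standalone function. The sketch below assumes a hypothetical directory layout in which each chip's .idat files live in `<idat_root>/<barcode>`; the function name, its parameters, and that layout are illustrative and are not taken from the notebook. Only the `gencall` arguments and the `-f`, `-p`, and `-t` flags come from the command shown above.

```python
from pathlib import Path


def write_gencall_swarm(swarm_path, iaap, bpm, cluster_file, ped_dir,
                        idat_root, barcodes, threads=8):
    """Write one `iaap-cli gencall` command per chip barcode; return the commands."""
    commands = [
        f"{iaap} gencall {bpm} {cluster_file} {ped_dir}/ "
        # Assumed layout: one sub-directory of .idat files per barcode.
        f"-f {Path(idat_root) / str(code)} -p -t {threads}"
        for code in barcodes
    ]
    # One command per line, as expected in a swarm file.
    Path(swarm_path).write_text("".join(f"{cmd}\n" for cmd in commands))
    return commands
```

Returning the commands as well as writing them makes it easy to inspect the swarm file contents before submitting the job.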
