95_Assignment.md

Self training

PCA using the 1000 Genomes Project Dataset

In this self-learning module, you will get hands-on experience with the 1000 Genomes Project data and apply the skills you have learned to this mini-project.

!!! note "Aim"

    1. Download the 1000 Genomes VCF files.
    2. Perform PCA using the 1000 Genomes samples.
    3. Plot the PCs of these individuals.
    4. Interpret the results.

Here is a brief overview of this mini project.

![image](https://user-images.githubusercontent.com/40289485/161699756-13c9f474-f9a2-4514-886f-61022a374bf8.png)

The goal of this assignment is simple: to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.

!!! tip
    Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.

1. Download the publicly available 1000 Genomes VCF files

Download the files we need from the 1000 Genomes Project FTP site:

- Autosome VCF files
- Ancestry information file
- Reference genome sequence
- Strict mask

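!!! tip
    - Autosome VCF: chr1 - chr22 files, in https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
    - Ancestry information file: https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
    - Reference genome sequence: in https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
    - Strict mask: https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed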

!!! note
    If the download takes too long, or if you are working on a local laptop, you can download only the chr1 files.

!!! example "Sample shell script for downloading the files"
    ```
    #!/bin/bash

    # Autosome VCF files and their tabix indexes.
    # Note: If it takes too long, you can download just chr1.
    for chr in $(seq 1 22)
    do
        wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
        wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
    done

    # Reference genome sequence and its index.
    wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
    wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

    # Step 2 reads the uncompressed human_g1k_v37.fasta, so decompress the reference.
    gunzip human_g1k_v37.fasta.gz

    # Ancestry information file and strict mask.
    wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
    wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed
    ```

2. Re-align, normalize and remove duplicates

We use bcftools to process the raw VCF files.

!!! tip "Install bcftools"
    http://www.htslib.org/download/

The variants in the raw files are not normalized, and the files contain many duplicate records, so we need to clean the VCF files first.

- Normalize: https://samtools.github.io/bcftools/bcftools.html#norm
- Remove duplicates: https://samtools.github.io/bcftools/bcftools.html#common_options

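!!! example "Re-align with the reference genome, normalize variants and remove duplicates"
    ```
    #!/bin/bash
    for chr in $(seq 1 22)
    do
        # 1st norm:  split multiallelic sites into biallelic records (-m-any),
        #            warn when REF does not match the reference (--check-ref w),
        #            and left-align indels against the reference (-f).
        # annotate:  set the ID of variants without one to CHROM:POS:REF:ALT.
        # 2nd norm:  remove duplicate records (--rm-dup both) and write
        #            compressed BCF (-Ob).
        bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
            ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
            bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \
            bcftools norm -Ob --rm-dup both \
            > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf

        # Index the cleaned BCF file.
        bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf
    done
    ```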

3. Convert VCF files to plink binary format

!!! example
    ```
    #!/bin/bash
    for chr in $(seq 1 22)
    do
        plink \
            --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \
            --keep-allele-order \
            --vcf-idspace-to _ \
            --const-fid \
            --allow-extra-chr 0 \
            --split-x b37 no-fail \
            --make-bed \
            --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes
    done
    ```

4. Using SNPs only in strict masks

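The strict mask files are in the `supporting/accessible_genome_masks/` directory of the 1000 Genomes release (see the download script in step 1).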

!!! quote "Strict mask"
    Regions overlapping this mask are considered "callable", meaning that variant calls there are credible. This mask was developed for the main 1KG paper and is well explained in https://www.biostars.org/p/219634/

!!! tip
    Use the plink `--make-set` option with the BED file to extract SNPs in the strict mask.
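The following is a minimal sketch of this step, not the tutorial's reference solution. BED intervals are 0-based and half-open, while PLINK compares range bounds against 1-based variant positions, so the sketch shifts each start by one (assuming both bounds are inclusive in PLINK). The function names `bed_to_plink_range` and `extract_strict_mask` are hypothetical, the first three BED columns are assumed to be chromosome, start and end, and the sketch uses `--extract range`, which reads the same range format as `--make-set`, instead of building a set first.

```bash
#!/bin/bash

# Hypothetical helper: convert a BED mask into a 4-column PLINK range file
# (chromosome, start, end, set ID).
# Assumption: BED columns 1-3 are chromosome, start, end.
bed_to_plink_range() {
    local bed="$1"
    awk 'BEGIN { OFS = "\t" }
         /^(track|browser|#)/ { next }        # skip optional BED header lines
         {
             chr = $1
             sub(/^chr/, "", chr)             # "chr1" -> "1", as in the b37 VCFs
             print chr, $2 + 1, $3, "strict"  # 0-based half-open -> 1-based inclusive
         }' "$bed"
}

# Hypothetical helper: keep only the variants that fall inside the ranges.
extract_strict_mask() {
    local prefix="$1" range_file="$2"
    plink \
        --bfile "$prefix" \
        --extract range "$range_file" \
        --make-bed \
        --out "${prefix}.strict"
}

# Example usage (not run here):
#   bed_to_plink_range 20141020.strict_mask.whole_genome.bed > strict_mask.range
#   extract_strict_mask ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes strict_mask.range
```

It is worth checking the first few lines of the mask file before running this, since the column layout is an assumption here.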

5. QC and prune to ~100K variants

!!! tip
    Use PLINK.

QC: only SNPs (exclude indels), MAF>0.1

Pruning: `plink --indep-pairwise`
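One way to chain the QC and pruning steps is sketched below. The function name `qc_and_prune` and the output suffixes are hypothetical, and the window size, step and r² threshold (50, 5, 0.2) are illustrative assumptions, not values given in this tutorial; adjust them until roughly 100K variants remain.

```bash
#!/bin/bash

# Hypothetical helper: QC and LD-prune one PLINK binary fileset.
qc_and_prune() {
    local prefix="$1"

    # QC: keep SNPs only (drops indels) and keep variants with MAF of at least 0.1.
    plink --bfile "$prefix" --snps-only --maf 0.1 \
        --make-bed --out "${prefix}.qc" || return 1

    # LD pruning (assumed parameters: 50-variant window, step 5, r^2 threshold 0.2).
    # Writes ${prefix}.qc.prune.in and ${prefix}.qc.prune.out.
    plink --bfile "${prefix}.qc" --indep-pairwise 50 5 0.2 \
        --out "${prefix}.qc" || return 1

    # Keep only the variants retained by pruning.
    plink --bfile "${prefix}.qc" --extract "${prefix}.qc.prune.in" \
        --make-bed --out "${prefix}.pruned"
}

# Example usage (not run here):
#   qc_and_prune ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes
```

Counting the lines of the resulting `.bim` file (for example with `wc -l`) shows how many variants are left after pruning.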

6. Perform PCA

!!! tip
    `plink --pca`
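If all 22 chromosomes were processed, the per-chromosome filesets need to be merged into one fileset before running PCA. A minimal sketch, assuming pruned filesets already exist; the function name `run_pca`, the file names, and the choice of 10 PCs are hypothetical.

```bash
#!/bin/bash

# Hypothetical helper: merge per-chromosome filesets, then run PCA.
# Usage: run_pca <output prefix> <fileset prefix> [<fileset prefix> ...]
run_pca() {
    local out="$1"
    shift

    # One fileset prefix per line; PLINK treats each as a binary fileset.
    printf '%s\n' "$@" > "${out}.merge_list.txt"

    # Merge the filesets listed above into a single fileset.
    plink --merge-list "${out}.merge_list.txt" \
        --make-bed --out "${out}.merged" || return 1

    # Compute the top 10 PCs; writes ${out}.eigenvec and ${out}.eigenval.
    plink --bfile "${out}.merged" --pca 10 --out "$out"
}

# Example usage (not run here):
#   run_pca 1kg_pca chr{1..22}.pruned
```

With chr1 only, the merge is unnecessary and `plink --bfile <prefix> --pca` can be run on the pruned chr1 fileset directly.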

7. Visualization and interpretation.

Draw a PC1 vs. PC2 plot, coloring each individual by ancestry (from the ALL.panel file), and interpret the result.

!!! tip
    You can use R, Python, or any other tool you like (even Excel can do the job).
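Before plotting, the PCs have to be matched to the ancestry labels. The sketch below builds a tab-separated table that any plotting tool can read. It assumes the `.eigenvec` file has no header and is laid out as FID, IID, PC1, PC2, ..., and that the first three columns of the panel file are sample, pop and super_pop; the function name `annotate_pcs` and the file names are hypothetical.

```bash
#!/bin/bash

# Hypothetical helper: attach population labels to PC1 and PC2.
# Usage: annotate_pcs <panel file> <eigenvec file>
annotate_pcs() {
    local panel="$1" eigenvec="$2"
    awk 'BEGIN { OFS = "\t"; print "IID", "PC1", "PC2", "pop", "super_pop" }
         # First file (panel): remember the labels per sample, skipping the header.
         NR == FNR {
             if (FNR > 1) { pop[$1] = $2; super[$1] = $3 }
             next
         }
         # Second file (eigenvec): column 2 is the sample ID (IID).
         ($2 in pop) { print $2, $3, $4, pop[$2], super[$2] }' "$panel" "$eigenvec"
}

# Example usage (not run here):
#   annotate_pcs integrated_call_samples_v3.20130502.ALL.panel 1kg_pca.eigenvec > pca_with_ancestry.tsv
```

Samples that are missing from the panel file are dropped silently, so comparing the line counts of the input and output is a quick sanity check.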

(If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)

Checklist

- What does variant normalization mean, and what are the two principles of variant normalization?
- For chromosome 1, what are the proportions of common variants (MAF > 5%), low-frequency variants (1% <= MAF < 5%), and rare variants (MAF < 1%)? If possible, please draw a figure showing the distribution of MAF. (`plink --freq`)
- What pattern did you observe in the 1KG PCA plot?
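For the MAF question, the proportions can be tallied from the `.frq` file written by `plink --freq`. A minimal sketch, assuming the usual column layout (CHR, SNP, A1, A2, MAF, NCHROBS) with one header line; the function name `maf_proportions` is hypothetical, and variants with MAF of exactly 5% are counted as common here, which is an assumption about the boundary.

```bash
#!/bin/bash

# Hypothetical helper: proportion of common / low-frequency / rare variants.
# Usage: maf_proportions <plink .frq file>
maf_proportions() {
    local frq="$1"
    awk 'NR > 1 && $5 != "NA" {            # skip the header and undefined MAFs
             n++
             if ($5 >= 0.05)      common++   # assumption: MAF of exactly 5% is common
             else if ($5 >= 0.01) lowfreq++
             else                 rare++
         }
         END {
             if (n == 0) { print "no variants with a defined MAF"; exit 1 }
             printf "common\t%.4f\n", common / n
             printf "low_frequency\t%.4f\n", lowfreq / n
             printf "rare\t%.4f\n", rare / n
         }' "$frq"
}

# Example usage (not run here):
#   plink --bfile <chr1 fileset prefix> --freq --out chr1
#   maf_proportions chr1.frq
```

The MAF column of the same file can be loaded into R or Python to draw the histogram.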

References

- https://www.biostars.org/p/335605/
- 1000 Genomes Project Consortium. "A global reference for human genetic variation." Nature 526.7571 (2015): 68.
