In this self-learning module, we would like you to put your hands on the 1000 Genome Project data and apply the skills you have learned to this mini-project.
!!! note "Aim"
Aim:
1. Download 1000 Genome VCF files.
2. Perform PCA using 1000 Genome samples.
3. Plot the PCs of these individuals.
4. Interpret the results.
Here is a brief overview of this mini project.

The ultimate goal of this assignment is simple, which is to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.
!!! tip Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.
Download the files we need from 1000 Genomes Project FTP site:
- Autosome VCF files
- Ancestry information file
- Reference genome sequence
- Strict mask
!!! tip - Autosome VCF: chr1 - chr22 files: in this directory. - Ancestry information file: download here - Reference genome sequence: in this directory. - Strict mask: download here
!!! note If it takes too long or if you are using your local laptop, you can just download the files for chr1.
!!! example "Sample shell script for downloading the files" ``` #!/bin/bash for chr in $(seq 1 22) #Note: If it takes too long, you can download just chr1. do wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi done
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai
wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed
```
We need to use bcftools to process the raw vcf files.
!!! tip "Install bcftools" http://www.htslib.org/download/
Since the variants are not normalized and also have many duplications, we need to clean the vcf files.
- Normalize: https://samtools.github.io/bcftools/bcftools.html#norm
- Remove duplication: https://samtools.github.io/bcftools/bcftools.html#common_options
!!! example "Re-align with the reference genome, normalize variants and remove duplications"
#!/bin/bash for chr in $(seq 1 22) do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \ bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \ bcftools norm -Ob --rm-dup both \ > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf done
!!! example
#!/bin/bash for chr in $(seq 1 22) do plink \ --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \ --keep-allele-order \ --vcf-idspace-to _ \ --const-fid \ --allow-extra-chr 0 \ --split-x b37 no-fail \ --make-bed \ --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes done
Strict masks are in this directory.
!!! quote "Strict mask" The overlapped region with this mask is “callable” (or credible variant calls). This mask was developed in the 1KG main paper and it is well explained in https://www.biostars.org/p/219634/
!!! tip
Use plink --make-set option with the BED files to extract SNPs in the strict mask.
!!! tip Use PLINK.
QC: only SNPs (exclude indels), MAF>0.1
Pruning: `plink --indep-pairwise`
!!! tip
plink --pca
Draw PC1 - PC2 plot and color each individual by ancestry information (from ALL.panel file). Interpret the result.
!!! tip You can use R, python, or any other tools you like (even Excel can do the job.)
(If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)
- What does variant normalization mean and What are the two principles for variant normalization?
- For chromosome 1, what is the proportion of common variants(MAF >5%) / low-frequency-variants(1<=MAF <5%) / and rare variants(MAF <1%) ? If possible, please draw a figure showing the distribution of MAF. (plink --freq)
- What pattern did you observe from the 1KG PCA plot?
- https://www.biostars.org/p/335605/
- 1000 Genomes Project Consortium. "A global reference for human genetic variation." Nature 526.7571 (2015): 68.