My documentation for running the full-length homology-based r gene prediction pipeline, complete with scripts. Fills in gaps of the official pipeline.

aliciasillers/HRP-Pipeline

Homology-based R Gene Prediction Pipeline

Intro

This repository contains my personal documentation of running the full-length Homology-based R-gene Prediction (HRP) pipeline from https://github.com/AndolfoG/HRP. It includes the commands from the official documentation with some minor modifications, plus commands for steps that the documentation does not explain, such as filtering the protein fasta file down to the genes identified in the earlier analyses. The commands are contained in a series of scripts, in which the output of one script is the input for the next. The scripts use my file names, which will need to be changed to match your own files. They also contain my personal SLURM headers, which will need to be changed or removed for your own use. The commands can also be run directly on the command line without the scripts. Instructions are below.

Pipeline

  1. Obtain necessary software: env.sh

Required software includes InterProScan, the MEME suite, Bedtools, genBlastG, and the AGAT suite. The commands below also call samtools (for faidx) and gff2bed, which is distributed with BEDOPS. I created a conda environment to install the programs I did not already have access to, which were only genBlastG and the AGAT suite. If you need additional programs, you can add them to the hrp.yml file in this repository.

conda env create -n hrpenv -f hrp.yml
  1. Prepare data

You will need a genome fasta file, a protein fasta file, and ideally an annotation file. I am using data from the following link: https://datadryad.org/stash/dataset/doi:10.25338/B8TP7G

  1. Protein domain search: interpro.sh
interproscan.sh -f TSV -appl Pfam -i ../Genome/farr1.protein.fa -b farr1.interpro
  1. Filter InterProScan results for the NB-ARC domain and write them in BED format: format.sh
grep NB-ARC farr1.interpro.tsv | cut -f1,7,8 > NB.bed
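The filter above depends on the column order of InterProScan's TSV output: column 1 is the protein accession, and columns 7 and 8 are the start and stop of the match. The sketch below runs the same filter on a two-line mock TSV (the gene IDs and values are invented for illustration). It also shows an optional variant, which is my assumption and not part of the official pipeline: InterProScan reports 1-based starts while BED starts are 0-based, so subtracting 1 from the start keeps the first residue of the domain when bedtools getfasta extracts it.

```shell
# Mock InterProScan TSV (tab-separated). IDs and values are invented for illustration.
tmp=$(mktemp -d)
printf 'geneA\tmd5a\t900\tPfam\tPF00931\tNB-ARC domain\t150\t420\t1.2E-50\tT\t01-01-2024\tIPR002182\tNB-ARC\n' > "$tmp/mock.tsv"
printf 'geneB\tmd5b\t500\tPfam\tPF00069\tProtein kinase domain\t10\t280\t3.4E-40\tT\t01-01-2024\tIPR000719\tProtein kinase domain\n' >> "$tmp/mock.tsv"

# Same filter as format.sh: keep NB-ARC hits, then protein ID, start, stop.
grep NB-ARC "$tmp/mock.tsv" | cut -f1,7,8 > "$tmp/NB.bed"

# Optional variant (assumption): convert the 1-based start to a 0-based BED start.
awk -F'\t' 'BEGIN { OFS = "\t" } /NB-ARC/ { print $1, $7 - 1, $8 }' "$tmp/mock.tsv" > "$tmp/NB.zero_based.bed"

cat "$tmp/NB.bed"
cat "$tmp/NB.zero_based.bed"
```

Only the geneA row survives the filter. Which variant is appropriate depends on whether the first residue of each domain matters for the downstream motif search.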
  1. Extract sequences of domains: extract.sh
bedtools getfasta -fi ../Genome/farr1.protein.fa -bed NB.bed -fo NB_Pfam_Domain_Sequences.fasta
  1. Get motifs from domain sequences: motifpt1.sh
meme NB_Pfam_Domain_Sequences.fasta -o meme_out -protein -mod zoops -nmotifs 19 -minw 4 -maxw 7 -objfun classic -markov_order 0
  1. Use motifs to identify additional domains: motifpt2.sh
mast meme_out/meme.txt ../../Genome/farr1.protein.fa -o mast_out
  1. Get gene IDs from mast output (output from motifpt2.sh): getids.sh
awk '/SEQUENCE NAME/ && /DESCRIPTION/{print;flag=1;next} /^$/{flag=0} flag { print$1 }' mast_out/mast.txt | awk 'NR>=3' > geneids.txt
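To see what the awk command above does, it helps to run it on a small excerpt. The mock file below is a simplified imitation of the layout of mast.txt (an assumption for illustration, not real MAST output). The first awk starts printing at the header line that contains both SEQUENCE NAME and DESCRIPTION and stops at the next blank line; the second awk drops the header and the dashed rule, leaving only the sequence names.

```shell
# Simplified mock of mast.txt; the layout and IDs are assumptions for illustration.
tmp=$(mktemp -d)
cat > "$tmp/mast.txt" <<'EOF'
SECTION I: HIGH-SCORING SEQUENCES

SEQUENCE NAME                      DESCRIPTION                   E-VALUE  LENGTH
-------------                      -----------                   -------- ------
geneA                                                            1.1e-30     900
geneC                                                            4.0e-12     650

SECTION II: MOTIF DIAGRAMS

SEQUENCE NAME                      E-VALUE  MOTIF DIAGRAM
-------------                      -------- -------------
geneA                              1.1e-30  [1]-[2]
EOF

# Same command as getids.sh, pointed at the mock file.
awk '/SEQUENCE NAME/ && /DESCRIPTION/{print;flag=1;next} /^$/{flag=0} flag { print $1 }' "$tmp/mast.txt" \
  | awk 'NR>=3' > "$tmp/geneids.txt"

cat "$tmp/geneids.txt"
```

The second table is skipped because its header has no DESCRIPTION column, so each gene ID is listed once.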
  1. Get sequences of gene list from mast: faidx.sh
samtools faidx -r geneids.txt ../Genome/farr1.protein.fa -o farr1.protein.subset.fa
  1. Protein domain search for LRR domain in gene list: interpro2.sh
interproscan.sh -f TSV -appl SUPERFAMILY -i farr1.protein.subset.fa -b farr1.interpro2
  1. Check whether NLRs are complete or partial: HRPscript.sh
./IPS2fpGs.sh farr1.interpro2.tsv > nlrlengths.tsv
  1. Get gene IDs for complete/full-length NLR genes: getids2.sh
awk '$2 == "full-length" {print $1}' nlrlengths.tsv > geneids2.txt
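The filter in this step keeps the first column of every row whose second column is exactly full-length. A minimal sketch on a mock table follows; the two-column layout and the partial label are assumptions made for illustration, so check them against your own nlrlengths.tsv.

```shell
# Mock classification table; the layout and labels are assumptions for illustration.
tmp=$(mktemp -d)
printf 'geneA\tfull-length\n' > "$tmp/nlrlengths.tsv"
printf 'geneC\tpartial\n' >> "$tmp/nlrlengths.tsv"
printf 'geneD\tfull-length\n' >> "$tmp/nlrlengths.tsv"

# Keep gene IDs whose second column equals the string full-length.
awk '$2 == "full-length" { print $1 }' "$tmp/nlrlengths.tsv" > "$tmp/geneids2.txt"

cat "$tmp/geneids2.txt"
```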
  1. Get sequences of full-length NLR genes: faidx2.sh
samtools faidx -r geneids2.txt ../Genome/farr1.transcript.fa -o full-length_NB-LRRs.fa
  1. Identify paralogous gene models using genblastG: genblast.sh
genblastG -q full-length_NB-LRRs.fa -t ../Genome/farr1.fa -gff -cdna -pro -o genblastG-output
  1. Filter gene models based on length: filter.sh
agat_sp_filter_gene_by_length.pl --gff genblastG-output_1.1c_2.3_s1_0_16_1.gff --size 20000 --test "<" -o genblastG-output_FbL.gff
  1. Identify overlapping gene models: overlaps.sh
grep transcript genblastG-output_FbL.gff | gff2bed | sortBed | clusterBed -s | awk -F'=|;|\t' '{ print $11,$16,$1,$2,$3 }'  > genblastG-output_FbL_clusters
  1. Estimate protein sequence lengths: seqlengths.sh
awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0);val=""} END{if(val!=""){print val}}' genblastG-output_1.1c_2.3_s1_0_16_1.pro | tr ' ' \\t | sort > genblastG-output_FbL_length
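The awk command above reports one length per sequence line, so it gives the full protein length only when each sequence sits on a single line. Whether genBlastG wraps its .pro output is not confirmed here, so the following is a hedged alternative that sums the lengths of all lines under each header. The mock FASTA and its IDs are invented for illustration.

```shell
# Mock protein FASTA with one wrapped sequence; IDs are invented for illustration.
tmp=$(mktemp -d)
cat > "$tmp/mock.pro" <<'EOF'
>geneA-R1 mock description
MKTLLV
AAG
>geneB-R1
MSSP
EOF

# Sum residues across all lines of each record, then print ID and total length.
awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (id != "") print id "\t" len }' "$tmp/mock.pro" | sort > "$tmp/lengths.tsv"

cat "$tmp/lengths.tsv"
```

On single-line sequences this produces the same ID and length pairs as the original command, so it can be swapped in without changing the next step.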
  1. Select non-redundant gene models: select.sh
join -1 1 -2 1 -o 1.1,1.2,2.2,1.3,1.4,1.5 <( sort -bk1 genblastG-output_FbL_clusters) <(sort -bk1 genblastG-output_FbL_length) | sort -bk2,2 -bk3,3 -nr | sort -uk2,2 | awk '{OFS="\t"} {print $4,$5,$6}' | sort | uniq > R-gene_coordinates.bed
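As I read it, the one-liner above keeps, for each cluster of overlapping gene models, the model with the longest protein, and prints its coordinates. The sketch below expresses the same idea explicitly in awk on mock inputs, which can be handy for checking the result. The column layouts (ID, cluster, chromosome, start, end for the clusters file; ID and length for the lengths file) are assumptions based on the commands in the earlier steps, and the IDs are invented.

```shell
# Mock inputs; layouts follow the earlier steps, values are invented for illustration.
tmp=$(mktemp -d)
printf 'm1\t300\nm2\t450\nm3\t200\n' > "$tmp/lengths"
printf 'm1 1 chr1 100 1000\nm2 1 chr1 150 1500\nm3 2 chr2 500 1100\n' > "$tmp/clusters"

# First file: remember each protein length. Second file: keep the longest model per cluster.
awk 'NR == FNR { len[$1] = $2; next }
     ($1 in len) && (!($2 in best) || len[$1] + 0 > bestlen[$2] + 0) {
         best[$2] = $3 "\t" $4 "\t" $5
         bestlen[$2] = len[$1]
     }
     END { for (c in best) print best[c] }' "$tmp/lengths" "$tmp/clusters" \
  | sort -u > "$tmp/R-gene_coordinates.bed"

cat "$tmp/R-gene_coordinates.bed"
```

In cluster 1 the longer model m2 wins over m1, and m3 is the only member of cluster 2. When two models in a cluster have the same length, this sketch keeps the first one it encounters.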
  1. Get list of candidates from coordinates of non-redundant gene models: getcandidates.sh
bedtools intersect -a ../Genome/farr1.gene_models_updated.gff -b R-gene_coordinates.bed -wa | grep 'mRNA' | awk -F'\t|=|;' '{print$10}' | uniq > Rgene_candidates.txt
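The grep and awk stages above pull the ID attribute out of the GFF lines that bedtools returns. Because grep 'mRNA' matches anywhere on the line, child features whose Parent attribute contains mRNA can slip through, depending on the annotation's ID scheme. The sketch below shows both the original extraction and a stricter variant that tests the feature-type column directly. The GFF lines and IDs are mock data, and the stricter variant is my suggestion, not part of the official pipeline.

```shell
# Mock GFF lines standing in for bedtools intersect output; IDs are invented.
tmp=$(mktemp -d)
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1\n' > "$tmp/hits.gff"
printf 'chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=mRNA1;Parent=gene1\n' >> "$tmp/hits.gff"
printf 'chr1\tsrc\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mRNA1\n' >> "$tmp/hits.gff"

# Original extraction: any line containing mRNA, then the value after the first "=".
grep 'mRNA' "$tmp/hits.gff" | awk -F'\t|=|;' '{print $10}' | uniq > "$tmp/loose.txt"

# Stricter variant: require column 3 to be mRNA, then split the attribute column.
awk -F'\t' '$3 == "mRNA" { split($9, a, /[=;]/); print a[2] }' "$tmp/hits.gff" | uniq > "$tmp/strict.txt"

cat "$tmp/loose.txt"
cat "$tmp/strict.txt"
```

With these mock IDs the original extraction also returns exon1, while the stricter variant returns only mRNA1. If your annotation's IDs do not contain the string mRNA, the two give the same list.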
  1. Get sequences of candidates: faidx3.sh
samtools faidx -r Rgene_candidates.txt ../Genome/farr1.transcript.fa -o NB-LRR_gene_candidates.fa
  1. Use InterProScan to annotate candidates: annotate.sh
interproscan.sh -f TSV,GFF3 -i NB-LRR_gene_candidates.fa -b NLR_final_candidates
