Homology-based R Gene Prediction Pipeline

Intro

This repository contains my personal documentation of running the full-length Homology-based R gene Prediction (HRP) pipeline from https://github.com/AndolfoG/HRP. It includes the commands from the official documentation with some minor modifications, plus commands for steps that were not explained there, such as filtering the protein fasta file down to the genes identified in the earlier analyses. The commands are all contained in a series of scripts, in which the output of one script is the input for the next. The commands use my file names, which will need to be changed to match your files. The scripts also contain my personal SLURM headers, which will need to be changed or removed. The commands can also be run directly on the command line without the scripts. Instructions are below.

Pipeline

Obtain necessary software: env.sh

Required software includes InterProScan, MEME suite, Bedtools, genBlastG, and AGAT suite. I created a conda environment for installation of the programs to which I did not already have access, which included only genBlastG and the AGAT suite. If you need additional programs, you can add them to the hrp.yml file in this repository.

conda env create -n hrpenv -f hrp.yml
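
Before starting, it may help to confirm that every program is actually on the PATH once the environment is activated. The following is a minimal sketch, not part of the official pipeline: the function name `missing_tools` is hypothetical, and the command names are simply those used in the steps of this README.

```shell
# Print each of the given commands that cannot be found on PATH.
# missing_tools is a hypothetical helper name, not part of HRP.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Command names as they appear in the steps of this README.
missing_tools interproscan.sh meme mast bedtools samtools genblastG \
  agat_sp_filter_gene_by_length.pl gff2bed
```

An empty result means everything was found; any name that is printed still needs to be installed or added to hrp.yml.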

Prepare data

You will need a genome fasta file, a protein fasta file, and ideally an annotation file. I am using data from the following link: https://datadryad.org/stash/dataset/doi:10.25338/B8TP7G

Protein domain search: interpro.sh

interproscan.sh -f TSV -appl Pfam -i ../Genome/farr1.protein.fa -b farr1.interpro

Filter InterProScan results for NB-ARC domain and put in BED format: format.sh

grep NB-ARC farr1.interpro.tsv | cut -f1,7,8 > NB.bed
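
To illustrate what this step produces, here is a small sketch on made-up data. The two sample rows are hypothetical and only mimic the first columns of the InterProScan TSV layout (accession, MD5, length, analysis, signature accession, description, start, stop). Unlike the grep above, the awk filter matches only the description column.

```shell
# Hypothetical sample rows in the InterProScan TSV column order.
sample_tsv=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  geneA md5a 900 Pfam PF00931 'NB-ARC domain' 150 420 \
  geneB md5b 500 Pfam PF00069 'Protein kinase domain' 10 260)

# Keep rows whose description (column 6) mentions NB-ARC and
# print accession, start, stop as three tab-separated columns.
nb_bed=$(printf '%s\n' "$sample_tsv" |
  awk -F'\t' 'BEGIN { OFS = "\t" } $6 ~ /NB-ARC/ { print $1, $7, $8 }')
printf '%s\n' "$nb_bed"
```

Note that BED starts are 0-based while InterProScan coordinates are 1-based, so printing `$7 - 1` instead of `$7` would be the strict conversion if the first residue of each domain matters.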

Extract sequences of domains: extract.sh

bedtools getfasta -fi ../Genome/farr1.protein.fa -bed NB.bed -fo NB_Pfam_Domain_Sequences.fasta

Get motifs from domain sequences: motifpt1.sh

meme NB_Pfam_Domain_Sequences.fasta -o meme_out -protein -mod zoops -nmotifs 19 -minw 4 -maxw 7 -objfun classic -markov_order 0

Use motifs to identify additional domains: motifpt2.sh

mast meme_out/meme.txt ../../Genome/farr1.protein.fa -o mast_out

Get gene IDs from mast output (output from motifpt2.sh): getids.sh

awk '/SEQUENCE NAME/ && /DESCRIPTION/{print;flag=1;next} /^$/{flag=0} flag { print$1 }' mast_out/mast.txt | awk 'NR>=3' > geneids.txt
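
The awk command above starts printing at the table header that contains both SEQUENCE NAME and DESCRIPTION, stops at the next blank line, and the second awk drops the header and the dashed separator. The sketch below runs the same command on a hypothetical excerpt that imitates the layout of mast.txt; the gene names and values are made up.

```shell
# Hypothetical excerpt imitating the high-scoring-sequences table of mast.txt.
sample_mast=$(cat <<'EOF'
SECTION I: HIGH-SCORING SEQUENCES

SEQUENCE NAME                      DESCRIPTION                   E-VALUE  LENGTH
-------------                      -----------                   -------- ------
geneA                                                             1.2e-50    900
geneB                                                             3.4e-20    750

SECTION II: MOTIF DIAGRAMS
EOF
)

# Same awk logic as getids.sh, reading the sample instead of mast_out/mast.txt.
gene_ids=$(printf '%s\n' "$sample_mast" |
  awk '/SEQUENCE NAME/ && /DESCRIPTION/{print;flag=1;next} /^$/{flag=0} flag { print$1 }' |
  awk 'NR>=3')
printf '%s\n' "$gene_ids"
```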

Get sequences of gene list from mast: faidx.sh

samtools faidx -r geneids.txt ../Genome/farr1.protein.fa -o farr1.protein.subset.fa

Protein domain search for LRR domain in gene list: interpro2.sh

interproscan.sh -f TSV -appl SUPERFAMILY -i farr1.protein.subset.fa -b farr1.interpro2

Check whether NLRs are complete or partial: HRPscript.sh

./IPS2fpGs.sh farr1.interpro2.tsv > nlrlengths.tsv

Get gene IDs for complete/full-length NLR genes: getids2.sh

awk '$2 == "full-length" {print $1}' nlrlengths.tsv > geneids2.txt

Get sequences of full-length NLR genes: faidx2.sh

samtools faidx -r geneids2.txt ../Genome/farr1.transcript.fa -o full-length_NB-LRRs.fa

Identify paralogous gene models using genBlastG: genblast.sh

genblastG -q full-length_NB-LRRs.fa -t ../Genome/farr1.fa -gff -cdna -pro -o genblastG-output

Filter gene models based on length: filter.sh

agat_sp_filter_gene_by_length.pl --gff genblastG-output_1.1c_2.3_s1_0_16_1.gff --size 20000 --test "<" -o genblastG-output_FbL.gff

Identify overlapping gene models: overlaps.sh

grep transcript genblastG-output_FbL.gff | gff2bed | sortBed | clusterBed -s | awk -F'=|;|\t' '{ print $11,$16,$1,$2,$3 }'  > genblastG-output_FbL_clusters

Estimate protein sequence lengths: seqlengths.sh

awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0);val=""} END{if(val!=""){print val}}' genblastG-output_1.1c_2.3_s1_0_16_1.pro | tr ' ' \\t | sort > genblastG-output_FbL_length
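
The awk command above prints one length per sequence line, so it assumes each protein sits on a single line after its header. If a .pro file happened to be line-wrapped, a variant along the following lines could sum the lengths per record instead. This is a sketch on hypothetical sample data, not the command from the official documentation.

```shell
# Hypothetical wrapped protein FASTA; real input would be the genBlastG .pro file.
sample_pro=$(printf '>model1 extra text\nMKTLL\nAAG\n>model2\nMK\n')

# Sum residues over all sequence lines of a record, then print "ID<TAB>length".
seq_lengths=$(printf '%s\n' "$sample_pro" | awk '
  /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (id != "") print id "\t" len }' | sort)
printf '%s\n' "$seq_lengths"
```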

Select non-redundant gene models: select.sh

join -1 1 -2 1 -o 1.1,1.2,2.2,1.3,1.4,1.5 <( sort -bk1 genblastG-output_FbL_clusters) <(sort -bk1 genblastG-output_FbL_length) | sort -bk2,2 -bk3,3 -nr | sort -uk2,2 | awk '{OFS="\t"} {print $4,$5,$6}' | sort | uniq > R-gene_coordinates.bed
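
The join/sort chain above is dense. Its apparent intent is to keep one gene model per cluster, preferring the model with the longest protein, and to output that model's coordinates. The sketch below expresses that reading in a single awk pass. It is an assumption about the intent rather than a drop-in replacement, and the input rows (model ID, cluster ID, protein length, sequence name, start, end) are hypothetical.

```shell
# Hypothetical joined rows: model ID, cluster ID, protein length, sequence, start, end.
joined=$(printf '%s %s %s %s %s %s\n' \
  m1 1 850 chr1 100 5000 \
  m2 1 910 chr1 120 5100 \
  m3 2 400 chr2 10 3000)

# For each cluster (column 2) remember the coordinates of the longest model (column 3).
best_models=$(printf '%s\n' "$joined" | awk '
  !($2 in best) || $3 > best[$2] { best[$2] = $3; row[$2] = $4 "\t" $5 "\t" $6 }
  END { for (c in row) print row[c] }' | sort -u)
printf '%s\n' "$best_models"
```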

Get list of candidates from coordinates of non-redundant gene models: getcandidates.sh

bedtools intersect -a ../Genome/farr1.gene_models_updated.gff -b R-gene_coordinates.bed -wa | grep 'mRNA' | awk -F'\t|=|;' '{print$10}' | uniq > Rgene_candidates.txt
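
The awk part of this command splits each GFF line on tabs, `=` and `;`, so field 10 is the value of the first attribute in column 9. The sketch below shows this on one hypothetical mRNA line, assuming the attribute column begins with `ID=`. It tests column 3 for `mRNA` directly, whereas the grep above matches the string anywhere in the line.

```shell
# Hypothetical GFF3 lines; the attribute column is assumed to start with ID=.
sample_gff=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 src gene 100 5000 . + . 'ID=geneA' \
  chr1 src mRNA 100 5000 . + . 'ID=geneA.t1;Parent=geneA')

# Field 9 is the literal "ID" and field 10 is its value once = and ; are separators.
candidate_ids=$(printf '%s\n' "$sample_gff" |
  awk -F'\t|=|;' '$3 == "mRNA" { print $10 }' | uniq)
printf '%s\n' "$candidate_ids"
```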

Get sequences of candidates: faidx3.sh

samtools faidx -r Rgene_candidates.txt ../Genome/farr1.transcript.fa -o NB-LRR_gene_candidates.fa

Use InterProScan to annotate candidates: annotate.sh

interproscan.sh -f TSV,GFF3 -i NB-LRR_gene_candidates.fa -b NLR_final_candidates
