This is a shell/awk re-written PLAR script for lncRNA discovery.
Requirements: awk, Python, Biopython, bedtools, bbmap, gtf_to_fasta(tophat module), CPC2.0beta, HMMER(with Pfam-A dataset)
This script uses original output GTF file of stringtie. There should be unique transcript IDs on field 2, FPKMs on field 4 for de novo or on field 7 for reference transcript of column if stringtie was ran with GENCODE ref GTF guide.
The filtering principle and threshold are based on PLAR with several steps simplification and improvement.
- Separate reference lncRNA from input by matching the unique ID of known lncRNAs of the reference.
GENCODE release29 (GRCh37) was used by default, can be changed on line 73-79.
!NOTE: Use the SAME version of ref GTF annotation for transcript assembly.
- Filter the known lncRNA by FPKM (default>1).
- Get all de novo transcripts by removing the transcripts which have been assigned with reference ID.
- Sort the rest of transcripts into single- or multi- exon groups.
- Filter 1 - FPKM and length: transcript length has to be >200bp and FPKM has to be >1 for single-exon or >0.1 for multi-exon to be kept.
Thresholds are on line 311.
- Filter 2 - repeat/gap regions: Any transcripts which have at least 1 exon overlaps with genomic repeat sequences or gap regions for >=50% fraction will be removed.
line 313.
- Filter 3 - CDS: Any transcripts which overlap with protein coding sequences on the same strand will be removed.
line 315.
- Filter 4 - distance to genes: For the single-exon transcripts which have distance < 2000bp to protein coding gene and <500 for multi-exon transcripts will be removed.
line 317.
- Filter 5 - CPC2 and HMMER: Non-coding potential and protein coding potential of the transcripts will be calculated by CPC2.0beta and hmmscan with Pfam-A dataset. Transcripts passed both will be kept and could be consider as potential de novo lncRNAs.
line 321.
Then the script combines known and de novo lncRNAs together to a final.gtf.
help message can be shown by lncRNA.sh -h
Usage: lncRNA.sh <options> -g|-n <transcriptd.gtf>
### INPUT: GTF file (stringtie output) ###
### python3/biopython/CPC2/bbmap/HMMER/bedtools/gtf_to_fasta required ###
Options:
-g indicate GENCODE reference was used for transcript assembly
-n indicate NONCODE reference was used for transcript assembly
-s single-exon FPKM cutoff (defualt:1)
-m multi-exon FPKM cutoff (defualt:0.1)
-r referenced transcript FPKM cutoff (defualt:0.1)
-h Print this help message1. Reads mapping and transcripts assembly. trans_assemble.sh can be used. i.e.
wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/trans_assemble.sh
chmod 755 trans_assemble.sh
./trans_assemble.sh test_R1.fastq.gz test_R2.fastq.gz rfThe output test.gtf can be used for lncRNA discovery.
2. lncRNA filtering:
git clone https://github.com/zhaoshuoxp/lncRNA
cd lncRNA
chmod 755 lncRNA.sh
./lncRNA.sh -g test.gtfAll results will be store in current (./) directory. Log will be printed during running.
- test_final.gtf: combined final transcripts in GTF format.
- test_known_lncRNA_f1.gtf: FPKM>1 filtered reference lncRNA transcripts.
- test_multi_exon_f5.gtf: all filter passed multi-exon de novo lncRNA transcripts.
- test_single_exon_f5.gtf: all filter passed single-exon de novo lncRNA transcripts.
- non_add: for NONCODE reference transcripts, the pipeline will put them into de novo filters and output the filtered transcripts in this subdirectory.
Further transcript deduplication could be performed if you merged multiple GTFs before or after running this pipeline:
wget https://raw.githubusercontent.com/zhaoshuoxp/Pipelines-Wrappers/master/GTF_rmdup.sh
chmod 755 GTF_rmdup.sh
./GTF_rmdup.sh final.gtf final_uniq.gtfNOTE:UCSC Genome Browser utility gtfToGenePred and genePredToBed are required.
Author @zhaoshuoxp
Nov 8 2019