Thanks to visit codestin.com
Credit goes to github.com

Skip to content

kellyquek/TRUP

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TRUP is a pipeline designed for analyzing RNA-seq data from Tumor samples..

Learn More

Cancer cells express many rearranged transcripts posing increased complexity to transcriptome analysis. As an unified pipeline, TRUP is designed to sensitively and accurately dissect the complexity of the cancer transcriptome by analyzing RNA-seq data obtained from tumour tissues. The current functionalities of TRUP include: 1) identification of fusion transcripts; 3) RNA-seq quality assesment; 2) Gene-read counting. The fusion detection module in TRUP combines split-read/read-pair mapping with regional de-novo assembly to achieve a balance between sensitivity and precision.

Dependencies

Annotation files

Annotation files are available upon request. Currently hg19 and mm10 are available.

All the annotation files and directories should be put under an central annotation directory whose name will be used for argument "--anno" when running TRUP. The structure of the annotation directory and sub-directories are illustrated as the following example:

Example:

     TRUP_ANNOTATION  /   hg19   /    hg19.SelfChain_UCSC.txt
     TRUP_ANNOTATION  /   hg19   /    hg19.bowtie2_index   /  *
           |               |                  |
     central_directory / version  / annotation files and directories

If users choose to prepare annotation directories and files by themselves, it can be achieved in the following way (using hg19 as an example):

TRUP_ANNOTATION/hg19/

  • hg19.genome_UCSC.2bit (for running BLAT: runlevel 4) How to get: Use faToTwoBit to generate a 2.bit file from a genome fasta file (http://genome.ucsc.edu/goldenPath/help/blatSpec.html#faToTwoBitUsage)

  • hg19.transcriptome_length.txt (for generating html report: runlevel 2) How to get: Sum up the length of all exons for each chromosome. Each line is formatted as "chromosome sun_of_exons_length" (tab delimited).

  • hg19.genes_Ensembl.bed12 (for gene read counting: runlevel 2)

  • hg19.gencode/gencode.v14.annotation.gene.bed12 (not available for organism other than hg19) How to get: These 12 column bed files can be generated by my script "exon_union_gtf2bed.pl" under the TRUP directory src/ by using normal .gtf files downloadable from Ensembl ftp site for a specific organism.

  • hg19.transcripts_Ensembl.gtf (for tophat2 mapping and cufflinks: runlevel 2 and 5) How to get: This file can be downloaded from Ensembl ftp. use this to generate hg19.genes_Ensembl.bed12 mentioned above.

  • hg19.gencode/gencode.v14.annotation.gtf (not available for organism other than hg19) How to get: downloadable from UCSC table browser, use this to generate hg19.gencode/gencode.v14.annotation.gene.bed12 mentioned above.

  • hg19.gencode/gencode.v14.genemap (for runlevel 2. not available for organism other than hg19) How to get: use the script "gencode_annotation.map.pl" under TRUP/src and the gencode gtf file to generate this gene map file.

  • hg19.genes_RefSeq.bed12 (for runlevel 2. may not be used in the new version of TRUP)

  • hg19.SelfChain_UCSC.txt (for fusion detection: runlevel 4)

  • hg19.repeats_UCSC.gff (for fusion detection: runlevel 4) How to get: Downloadable from UCSC table browser

  • hg19.transcripts_Ensembl.gff How to get: turn hg19.transcripts_Ensembl.gtf to generate the corresponding gff file by using "gtf2gff3" (http://search.cpan.org/~lds/GBrowse-2.54/bin/gtf2gff3.pl)

  • hg19.biomart.txt How to get: use R package biomaRt or web converter at (http://www.ensembl.org/biomart/martview/), choose appropriate databse, feed in filter names (all ensembl gene id), and select attributes for "Associated Gene Name", "Gene Biotype", "Description", "EntrezGene ID", "WikiGene Name", "WikiGene Description".

  • hg19.gmap_index/hg19/hg19.* How to get: download from (http://research-pub.gene.com/gmap/) or follow instructions provided by gmap/gsnap

  • hg19.bowtie2_index/hg19/hg19.*

  • hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans.* How to get: download from (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) or follow instructions provided by bowtie2. For annotation files under hg19_trans/, run tophat2 once with --transcriptome-index hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans and --GTF hg19.transcripts_Ensembl.gtf to get tophat index for transcriptome (tophat2 can be stopped as long as the indexes are generated, thus user can use any fastq file as input for running tophat2, just to get the indexes). These transcriptome indexes can be used for later tophat2 calls.

Installation

No installation is required if pre-compiled binaries in bin/ are compatible with your system.

If pre-compiled binaries are incompatible with your system, you will need to rebuild them from source.

Make sure you have installed Bamtools (https://github.com/pezmaster31/bamtools). Then write down the bamtools_directory where ./lib/ and ./include/ sub-directories are located. Currently bamtools 2.0 has been tested.

run make in following way to install

$ make BAMTOOLS_ROOT=/bamtools_directory/

The binaries will be built at bin/. Any existing files will be overwritten.

Usage

TRUP can be run with UNIX command-line interface.

Assuming that in the directory "RP" there are two gzipped fastq-files are called as SAMPLE_R?1(_XXX)?.fq.gz and SAMPLE_R?2(_XXX)?.fq.gz (where "_R?1" and "_R?2" means mate 1 and mate 2 reads, XXX means a run ID, R and _XXX are optional that can be missing), the path to the directory of the pipeline is called "PD" and the path to tha annotation files are called "AD", one can run the pipeline in the following way:

Run-level 1 (Quality check, mapping of spiked-in read pairs and give a rough estimate of the insert size etc.):

$ perl RTrace.pl --runlevel 1 --sampleName SAMPLE --type p --readpool RP --root PD --threads TH --anno AD 2>>run.log 

where "TH" is the number of computing threads. If one wants to stop after the quality check, add --QC in the command call. --type indicates sequencing type, either s(single-end) or p(paired-end).

Run-level 2 (mapping with gsnap [default] or STAR, generating reports):

$ perl RTrace.pl --runlevel 2 --sampleName SAMPLE --type p --readpool RP --$root PD --anno AD --threads TH --WIG --patient ID --tissue type --threads TH --gf pdf 2>>run.log

where --WIG is set to generate bigWiggle file for visualization purpose. --patient and --tissue can be set if user need to run edgeR and cuffdiff after processing all the samples. If a X11 is not always available, set --gf to be pdf to generate the report figures in pdf format instead of png. However, if X11 is available, do not set the --gf option.

Run-level 3 (collect regions containning potential breakpoints from the mapping of gsnap, perform regional assembly in the candidate regions):

$ perl RTrace.pl --runlevel 3 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH --RA 1 2>>run.log

where --RA is set to 1 to indicate an independent regional assembly around each breakpoint.

Run-level 4 (Dectecting fusion events using the assembled transcripts):

$ perl RTrace.pl --runlevel 4 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH 2>>run.log

If user could not use BLAT, --BT should be set to use GMAP in this step.

Run-level 5 (run cufflinks for gene/isoform quantification):

$ perl RTrace.pl --runlevel 5 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH --gtf-guide --known-trans refseq 2>>run.log

where --known-trans can be set to 'ensembl' or 'refseq' to indicate the annotation to be used for cufflinks.

Run-level 6 (run cuffdiff for diffrential gene/isoform expression analysis):

$ perl RTrace.pl --runlevel 6 --root PD --anno AD --threads TH 2>>run.log

Run-level 7 (run edgeR for diffrential gene expression analysis):

$ perl RTrace.pl --runlevel 7 --root PD --anno AD --threads TH --priordf 5 --spaired 1 2>>run.log

where --priordf sets the prior.df parameter in edgeR. --spaired should be set to 1 if the samples are paired (paired normal-diese samples from the sample patient).

Note: do not set --lanename for runlevel 6 and 7, as these two runlevels are dealing with multiple samples.

The Run-levels metioned above can also be combined in the same command line, as following,

$ perl RTrace.pl --runlevel 1-4 --sampleName SAMPLE --readpool RP --root PD --anno AD --threads TH --WIG --patient ID --tissue type --gf pdf --RA 1 2>>run.log

One can also call combinations of different Run-levels, such as --$runlevel 1,2 or --runlevel 2,3,4. But in these combined ways, one should specify all necessary options in the command line.

Runlevel dependencies (->): 4->3->2->1, 6->5->2->1, 7->2->1

Options

check the options by running the program without options or with --help.

Contact

Sun Ruping

Current Affiliation: Califano Lab, Department of Systems Biology, Columbia University, New York, NY, USA

Previous Affiliation: Dept. Vingron (Computational Molecular Biology) Max Planck Institute for Molecular Genetics. Ihnestr. 63-73, D-14195 Berlin, Germany

Email: [email protected]

Project Website: https://github.com/ruping/TRUP

About

Tumor-specimen suited RNA-seq Unified Pipeline

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Perl 87.1%
  • C++ 7.1%
  • R 4.4%
  • C 1.2%
  • JavaScript 0.2%