Codestin Search App

TRUP is a pipeline designed for analyzing RNA-seq data from Tumor samples..

Learn More

Cancer cells express many rearranged transcripts posing increased complexity to transcriptome analysis. As an unified pipeline, TRUP is designed to sensitively and accurately dissect the complexity of the cancer transcriptome by analyzing RNA-seq data obtained from tumour tissues. The current functionalities of TRUP include: 1) identification of fusion transcripts; 3) RNA-seq quality assesment; 2) Gene-read counting. The fusion detection module in TRUP combines split-read/read-pair mapping with regional de-novo assembly to achieve a balance between sensitivity and precision.

Dependencies

Samtools.
Bowtie2.
GSNAP (version > 2012-07-20) or STAR (version 2.4.0)
Velvet (https://github.com/dzerbino/velvet).
Oases (https://github.com/dzerbino/oases).
R libraries: feilds, KernSmooth, lattice, RColorBrewer and R2HTML.
Bioconductor packages: DESeq2, edgeR.

Annotation files

Annotation files are available upon request. Currently hg19 and mm10 are available.

All the annotation files and directories should be put under an central annotation directory whose name will be used for argument "--anno" when running TRUP. The structure of the annotation directory and sub-directories are illustrated as the following example:

Example:

     TRUP_ANNOTATION  /   hg19   /    hg19.SelfChain_UCSC.txt
     TRUP_ANNOTATION  /   hg19   /    hg19.bowtie2_index   /  *
           |               |                  |
     central_directory / version  / annotation files and directories

If users choose to prepare annotation directories and files by themselves, it can be achieved in the following way (using hg19 as an example):

TRUP_ANNOTATION/hg19/

hg19.genome_UCSC.2bit (for running BLAT: runlevel 4) How to get: Use faToTwoBit to generate a 2.bit file from a genome fasta file (http://genome.ucsc.edu/goldenPath/help/blatSpec.html#faToTwoBitUsage)
hg19.transcriptome_length.txt (for generating html report: runlevel 2) How to get: Sum up the length of all exons for each chromosome. Each line is formatted as "chromosome sun_of_exons_length" (tab delimited).
hg19.genes_Ensembl.bed12 (for gene read counting: runlevel 2)
hg19.gencode/gencode.v14.annotation.gene.bed12 (not available for organism other than hg19) How to get: These 12 column bed files can be generated by my script "exon_union_gtf2bed.pl" under the TRUP directory src/ by using normal .gtf files downloadable from Ensembl ftp site for a specific organism.
hg19.transcripts_Ensembl.gtf (for tophat2 mapping and cufflinks: runlevel 2 and 5) How to get: This file can be downloaded from Ensembl ftp. use this to generate hg19.genes_Ensembl.bed12 mentioned above.
hg19.gencode/gencode.v14.annotation.gtf (not available for organism other than hg19) How to get: downloadable from UCSC table browser, use this to generate hg19.gencode/gencode.v14.annotation.gene.bed12 mentioned above.
hg19.gencode/gencode.v14.genemap (for runlevel 2. not available for organism other than hg19) How to get: use the script "gencode_annotation.map.pl" under TRUP/src and the gencode gtf file to generate this gene map file.
hg19.genes_RefSeq.bed12 (for runlevel 2. may not be used in the new version of TRUP)
hg19.SelfChain_UCSC.txt (for fusion detection: runlevel 4)
hg19.repeats_UCSC.gff (for fusion detection: runlevel 4) How to get: Downloadable from UCSC table browser
hg19.transcripts_Ensembl.gff How to get: turn hg19.transcripts_Ensembl.gtf to generate the corresponding gff file by using "gtf2gff3" (http://search.cpan.org/~lds/GBrowse-2.54/bin/gtf2gff3.pl)
hg19.biomart.txt How to get: use R package biomaRt or web converter at (http://www.ensembl.org/biomart/martview/), choose appropriate databse, feed in filter names (all ensembl gene id), and select attributes for "Associated Gene Name", "Gene Biotype", "Description", "EntrezGene ID", "WikiGene Name", "WikiGene Description".
hg19.gmap_index/hg19/hg19.* How to get: download from (http://research-pub.gene.com/gmap/) or follow instructions provided by gmap/gsnap
hg19.bowtie2_index/hg19/hg19.*
hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans.* How to get: download from (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) or follow instructions provided by bowtie2. For annotation files under hg19_trans/, run tophat2 once with --transcriptome-index hg19.bowtie2_index/hg19_trans/hg19_known_ensemble_trans and --GTF hg19.transcripts_Ensembl.gtf to get tophat index for transcriptome (tophat2 can be stopped as long as the indexes are generated, thus user can use any fastq file as input for running tophat2, just to get the indexes). These transcriptome indexes can be used for later tophat2 calls.

Installation

No installation is required if pre-compiled binaries in bin/ are compatible with your system.

If pre-compiled binaries are incompatible with your system, you will need to rebuild them from source.

Make sure you have installed Bamtools (https://github.com/pezmaster31/bamtools). Then write down the bamtools_directory where ./lib/ and ./include/ sub-directories are located. Currently bamtools 2.0 has been tested.

run make in following way to install

$ make BAMTOOLS_ROOT=/bamtools_directory/

The binaries will be built at bin/. Any existing files will be overwritten.

Usage

TRUP can be run with UNIX command-line interface.

Assuming that in the directory "RP" there are two gzipped fastq-files are called as SAMPLE_R?1(_XXX)?.fq.gz and SAMPLE_R?2(_XXX)?.fq.gz (where "_R?1" and "_R?2" means mate 1 and mate 2 reads, XXX means a run ID, R and _XXX are optional that can be missing), the path to the directory of the pipeline is called "PD" and the path to tha annotation files are called "AD", one can run the pipeline in the following way:

Run-level 1 (Quality check, mapping of spiked-in read pairs and give a rough estimate of the insert size etc.):

$ perl RTrace.pl --runlevel 1 --sampleName SAMPLE --type p --readpool RP --root PD --threads TH --anno AD 2>>run.log

where "TH" is the number of computing threads. If one wants to stop after the quality check, add --QC in the command call. --type indicates sequencing type, either s(single-end) or p(paired-end).

Run-level 2 (mapping with gsnap [default] or STAR, generating reports):

$ perl RTrace.pl --runlevel 2 --sampleName SAMPLE --type p --readpool RP --$root PD --anno AD --threads TH --WIG --patient ID --tissue type --threads TH --gf pdf 2>>run.log

where --WIG is set to generate bigWiggle file for visualization purpose. --patient and --tissue can be set if user need to run edgeR and cuffdiff after processing all the samples. If a X11 is not always available, set --gf to be pdf to generate the report figures in pdf format instead of png. However, if X11 is available, do not set the --gf option.

Run-level 3 (collect regions containning potential breakpoints from the mapping of gsnap, perform regional assembly in the candidate regions):

$ perl RTrace.pl --runlevel 3 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH --RA 1 2>>run.log

where --RA is set to 1 to indicate an independent regional assembly around each breakpoint.

Run-level 4 (Dectecting fusion events using the assembled transcripts):

$ perl RTrace.pl --runlevel 4 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH 2>>run.log

If user could not use BLAT, --BT should be set to use GMAP in this step.

Run-level 5 (run cufflinks for gene/isoform quantification):

$ perl RTrace.pl --runlevel 5 --sampleName SAMPLE --type p --readpool RP --root PD --anno AD --threads TH --gtf-guide --known-trans refseq 2>>run.log

where --known-trans can be set to 'ensembl' or 'refseq' to indicate the annotation to be used for cufflinks.

Run-level 6 (run cuffdiff for diffrential gene/isoform expression analysis):

$ perl RTrace.pl --runlevel 6 --root PD --anno AD --threads TH 2>>run.log

Run-level 7 (run edgeR for diffrential gene expression analysis):

$ perl RTrace.pl --runlevel 7 --root PD --anno AD --threads TH --priordf 5 --spaired 1 2>>run.log

where --priordf sets the prior.df parameter in edgeR. --spaired should be set to 1 if the samples are paired (paired normal-diese samples from the sample patient).

Note: do not set --lanename for runlevel 6 and 7, as these two runlevels are dealing with multiple samples.

The Run-levels metioned above can also be combined in the same command line, as following,

$ perl RTrace.pl --runlevel 1-4 --sampleName SAMPLE --readpool RP --root PD --anno AD --threads TH --WIG --patient ID --tissue type --gf pdf --RA 1 2>>run.log

One can also call combinations of different Run-levels, such as --$runlevel 1,2 or --runlevel 2,3,4. But in these combined ways, one should specify all necessary options in the command line.

Runlevel dependencies (->): 4->3->2->1, 6->5->2->1, 7->2->1

Options

check the options by running the program without options or with --help.

Contact

Sun Ruping

Current Affiliation: Califano Lab, Department of Systems Biology, Columbia University, New York, NY, USA

Previous Affiliation: Dept. Vingron (Computational Molecular Biology) Max Planck Institute for Molecular Genetics. Ihnestr. 63-73, D-14195 Berlin, Germany

Email: [email protected]

Project Website: https://github.com/ruping/TRUP

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
annovar		annovar
bin		bin
src		src
tools_binaries_linux_x86_64		tools_binaries_linux_x86_64
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Learn More

Dependencies

Annotation files

Installation

Usage

Options

Contact

About

Uh oh!

Releases

Packages

Languages

kellyquek/TRUP

Folders and files

Latest commit

History

Repository files navigation

Learn More

Dependencies

Annotation files

Installation

Usage

Options

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages