Quick start

DupRecover is a maximum likelihood estimator for sampling-induced read duplication in deep sequencing experiments.

Install

$ wget https://bitbucket.org/wanding/duprecover/get/default.zip
$ unzip default.zip

Change to the unzipped directory, then make the script executable:

$ chmod u+x duprecover

Dependency

Python 2.6+

Run

Precompute Stirling numbers

$ ./duprecover stirling -o stirling_compute.dat
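
For orientation, the sketch below builds a small table of Stirling numbers of the second kind with the standard recurrence. It assumes the second kind is meant (the kind that arises when counting how reads fall onto distinct positions); this README does not say which kind duprecover precomputes. `stirling2_table` is a hypothetical helper, not part of the tool, and the sketch says nothing about the layout of stirling_compute.dat.

```python
def stirling2_table(n_max):
    """Return table[n][k] = S(n, k), the Stirling numbers of the second kind,
    for 0 <= k <= n <= n_max. Hypothetical helper for illustration only."""
    table = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    table[0][0] = 1  # S(0, 0) = 1 by convention
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            # Standard recurrence: S(n, k) = k * S(n-1, k) + S(n-1, k-1)
            table[n][k] = k * table[n - 1][k] + table[n - 1][k - 1]
    return table


table = stirling2_table(5)
print(table[4][2])  # 7
print(table[5][3])  # 25
```

These numbers grow very quickly with n, which is one plausible reason to compute them once and reuse the result.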

Estimate sampling-induced read duplication

$ ./duprecover estimate -m 50 -l 100 -i stirling_compute.dat

Or use a list as input

$ ./duprecover estimate -r read_count.list -i stirling_compute.dat

Each line in read_count.list has the format:

<unique read count>   <number of possible reads>
<unique read count>   <number of possible reads>
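
A list in this format can be produced with a few lines of Python. This is a minimal sketch: the tab delimiter is an assumption (the format above only shows whitespace between the two columns), and `format_read_count_rows` and `write_read_count_list` are hypothetical helper names, not part of duprecover.

```python
def format_read_count_rows(rows, sep="\t"):
    """Turn (unique_read_count, possible_reads) pairs into text lines.

    The separator is an assumption; adjust it if duprecover expects
    something else.
    """
    lines = []
    for unique, possible in rows:
        # The unique read count must be strictly smaller than the number
        # of possible reads (see Remarks).
        if not 0 <= unique < possible:
            raise ValueError(
                "unique read count %d is not smaller than %d possible reads"
                % (unique, possible)
            )
        lines.append("%d%s%d" % (unique, sep, possible))
    return lines


def write_read_count_list(path, rows):
    """Write the formatted rows to a file, one pair per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(format_read_count_rows(rows)) + "\n")


# Illustrative values only, not real data.
print(format_read_count_rows([(50, 100), (120, 300)]))
```

Calling `write_read_count_list("read_count.list", rows)` would then create the file passed to `-r`.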

The output has the format:

<unique read count>   <number of possible reads>   <read count correction>
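
To post-process the output, for example to sum corrections as described under Remarks, something like the following may help. It is a sketch under assumptions: columns are taken to be whitespace-separated, the third column is taken to be numeric, and lines that do not have exactly three fields (such as warnings) are skipped. `parse_estimate_output` and `total_correction` are hypothetical names, and the sample values are made up for illustration, not real duprecover output.

```python
def parse_estimate_output(text):
    """Parse lines of '<unique> <possible> <correction>' into tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # assumed to be a blank line or a warning message
        unique, possible = int(fields[0]), int(fields[1])
        correction = float(fields[2])
        records.append((unique, possible, correction))
    return records


def total_correction(records):
    """Sum the correction column over all parsed records."""
    return sum(correction for _, _, correction in records)


# Made-up placeholder values, one line per insert size.
sample = "50 100 70\n120 300 150.5\n"
records = parse_estimate_output(sample)
print(total_correction(records))  # 220.5
```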

Help

See ./duprecover {stirling, estimate} -h for details.

Remarks

For single-end reads with a single mutation site, each correction corresponds to a read length. If the read length is fixed, there is only one correction for that mutation site.
For paired-end reads with a single mutation site, each correction corresponds to a specific insert size. The reads have to be categorized by insert size, and the correction for the site coverage is the sum of the corrections over all insert sizes.
For single-end reads with multiple mutation sites, each correction corresponds to a combination of 1) read length and 2) read cover. The correction for the entire region (Maximum Phasable Window) is the sum of the corrections over all combinations.
For paired-end reads with multiple mutation sites, each correction corresponds to a combination of 1) insert size and 2) insert cover. The correction for the entire region (Maximum Phasable Window) is the sum of the corrections over all combinations.
For RNA-Seq data, the number of unique reads is tallied over the whole transcript. Each correction corresponds to a specific read length (for single-end reads) or insert size (for paired-end reads). The number of all possible unique reads is the total length of the union of the regions that span all exons and their flanking regions, where each flanking region has the same length as the read.
Make sure the unique read count is strictly smaller than the number of all possible unique reads. Otherwise, a warning is issued and the correction is chosen arbitrarily.
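
To see why the strict inequality matters, here is a sketch of a maximum likelihood estimate under the textbook occupancy model. This is an assumption made for illustration, not code taken from duprecover: n reads are drawn uniformly with replacement from L possible reads, and k distinct reads are observed. The likelihood of n is then proportional to S(n, k) / L^n, where S is a Stirling number of the second kind. `ml_read_count` is a hypothetical name.

```python
def next_stirling_row(row):
    """Given row[j] = S(n, j), return the row for n + 1 using
    S(n+1, j) = j * S(n, j) + S(n, j-1)."""
    nxt = [0] * len(row)
    for j in range(1, len(row)):
        nxt[j] = j * row[j] + row[j - 1]
    return nxt


def ml_read_count(unique_reads, possible_reads):
    """Return the n that maximizes S(n, k) / L**n (assumed occupancy model)."""
    k, L = unique_reads, possible_reads
    if not 0 < k < L:
        # For k == L the likelihood keeps growing with n, so there is
        # no finite maximum under this model.
        raise ValueError("need 0 < unique reads < possible reads")
    row = [1] + [0] * k  # S(0, j) for j = 0..k
    n = 0
    while n < k:  # advance to n == k, where S(k, k) == 1
        row = next_stirling_row(row)
        n += 1
    while True:
        nxt = next_stirling_row(row)
        # The likelihood rises from n to n + 1 only if S(n+1, k) > L * S(n, k).
        if nxt[k] > L * row[k]:
            row = nxt
            n += 1
        else:
            return n


print(ml_read_count(3, 4))  # 5
```

The loop ends because the ratio S(n+1, k) / S(n, k) falls toward k as n grows, and k is smaller than L. When k equals L the ratio always stays above L, which is consistent with the remark that the correction is arbitrary in that case.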

Publication

Zhou†, Chen, Zhao, Eterovic, Meric-Bernstam, Mills, Chen†. “Bias from removing read duplication in ultra-deep sequencing experiments.” Bioinformatics (2014)

About

This work is a collaboration between Wanding Zhou and Prof. Ken Chen at The University of Texas MD Anderson Cancer Center.

This work was supported in part by the National Cancer Institute (NCI) grant R01CA172652-01 to KC; the MD Anderson Odyssey recruitment fellowship to WZ; the MD Anderson Cancer Center Sheikh Khalifa Ben Zayed Al Nahyan Institute of Personalized Cancer Therapy; and the National Cancer Institute Cancer Center Support Grant [P30CA016672].

Disclaimer

The code is provided "as is", with no guarantee or warranty of any kind.
