DupRecover is a Maximum Likelihood estimator for sampling-induced read duplication in deep sequencing experiments.
$ wget https://bitbucket.org/wanding/duprecover/get/default.zip
$ unzip default.zip
Change into the unzipped directory.
$ chmod u+x duprecover
Requires Python 2.6+.
Precompute Stirling numbers
$ ./duprecover stirling -o stirling_compute.dat
Estimate sampling-induced read duplication
$ ./duprecover estimate -m 50 -l 100 -i stirling_compute.dat
Or use a list as input
$ ./duprecover estimate -r read_count.list -i stirling_compute.dat
Each line in read_count.list has the format:
<unique read count> <number of possible reads>
<unique read count> <number of possible reads>
The output has the format:
<unique read count> <number of possible reads> <read count correction>
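A minimal sketch of preparing read_count.list from Python. It assumes the two columns are separated by a single space, which the format line above suggests but does not state, and it rejects pairs whose unique read count is not strictly smaller than the number of possible reads. The function name `format_read_counts` and the sample values are hypothetical and not part of DupRecover.

```python
def format_read_counts(pairs):
    """Return the text of a read_count.list file for (unique, possible) pairs.

    Assumption: columns are separated by one space, one pair per line.
    """
    lines = []
    for unique, possible in pairs:
        # The unique read count must be strictly smaller than the
        # number of possible reads for the estimate to be meaningful.
        if not 0 <= unique < possible:
            raise ValueError(
                "unique read count %d must be smaller than %d"
                % (unique, possible))
        lines.append("%d %d" % (unique, possible))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical placeholder counts, for illustration only.
    with open("read_count.list", "w") as handle:
        handle.write(format_read_counts([(50, 100), (30, 80)]))
```

The resulting file can then be passed to `./duprecover estimate -r read_count.list -i stirling_compute.dat` as shown above.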
See ./duprecover {stirling,estimate} -h for details.
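This README does not describe the model behind `estimate`. As an illustration of why Stirling numbers might be precomputed, the sketch below works under the classical occupancy model, which is an assumption here: n reads are drawn uniformly from l possible reads and m distinct reads are observed, so the likelihood of n is proportional to S(n, m) / l^n, where S is the Stirling number of the second kind. Reading `-m` as the unique read count and `-l` as the number of possible reads is also an assumption. The function `occupancy_mle` is hypothetical and is not DupRecover's code; its result may differ from the tool's output.

```python
def occupancy_mle(m, l):
    """Most likely number of draws n, given m distinct reads out of l possible.

    Assumes the classical occupancy model; likelihood(n) ~ S(n, m) / l**n.
    Returns the smallest n >= m at which the likelihood stops increasing.
    """
    if not 0 < m < l:
        # With m == l the likelihood keeps increasing and no maximum exists.
        raise ValueError("need 0 < m < l")
    # row[k] holds S(n, k) for k = 0..m; start at n = 0.
    row = [1] + [0] * m
    n = 0
    while True:
        # Recurrence: S(n + 1, k) = k * S(n, k) + S(n, k - 1).
        nxt = [0] * (m + 1)
        for k in range(1, m + 1):
            nxt[k] = k * row[k] + row[k - 1]
        # likelihood(n + 1) <= likelihood(n)  <=>  S(n+1, m) <= l * S(n, m)
        if n >= m and nxt[m] <= l * row[m]:
            return n
        row, n = nxt, n + 1


if __name__ == "__main__":
    print(occupancy_mle(3, 4))  # 5
```

Exact integer arithmetic keeps the comparison free of rounding error; for large counts a precomputed table, such as the one the `stirling` step writes, would avoid repeating the recurrence.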
- For single-end reads with a single mutation site, each correction corresponds to a read length (if the read length is fixed, there is only one correction for that mutation site).
- For paired-end reads with a single mutation site, each correction corresponds to a specific insert size. The reads have to be categorized by insert size, and the correction for the site coverage is obtained by summing the corrections from all insert sizes.
- For single-end reads with multiple mutation sites, each correction corresponds to a combination of 1) read length and 2) read cover. The correction for the entire region (Maximum Phasable Window) is obtained by summing the corrections from all combinations.
- For paired-end reads with multiple mutation sites, each correction corresponds to a combination of 1) insert size and 2) insert cover. The correction for the entire region (Maximum Phasable Window) is obtained by summing the corrections from all combinations.
- For RNA-Seq data, the number of unique reads is tallied over the whole transcript. Each correction corresponds to a specific read length (for single-end reads) or insert size (for paired-end reads). The number of all possible unique reads is the total length of the union of the regions that span all exons and their flanking regions, where each flanking region has the same length as the read.
- Please make sure the unique read count is strictly smaller than the number of all possible unique reads. Otherwise, a warning is issued and the correction is chosen arbitrarily.
Zhou†, Chen, Zhao, Eterovic, Meric-Bernstam, Mills, Chen†. “Bias from removing read duplication in ultra-deep sequencing experiments.” Bioinformatics (2014)
This work is a collaboration between Wanding Zhou and Prof. Ken Chen at The University of Texas MD Anderson Cancer Center.
This work was supported in part by the National Cancer Institute (NCI) grant R01CA172652-01 to KC; the MD Anderson Odyssey recruitment fellowship to WZ; the MD Anderson Cancer Center Sheikh Khalifa Ben Zayed Al Nahyan Institute of Personalized Cancer Therapy; and the National Cancer Institute Cancer Center Support Grant [P30CA016672].
The code is provided "as is", with no guarantee or warranty of any kind.