A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2016)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9649)

Abstract

Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each dataset independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared towards multi-sample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq datasets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, [1] developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq datasets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization based estimation structure hinders its applicability to large numbers of loci and samples. We address this limitation by developing a MAP-based Asymptotic Derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparisons with MBASIC indicate that this speed comes at a relatively small loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq datasets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.

References

  1. Zuo, C., Hewitt, K.J., Bresnick, E.H., Keleş, S.: A hierarchical framework for state-space matrix inference and clustering. Ann. Appl. Stat. (Revised)

  2. The ENCODE project consortium: an integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)

  3. Roadmap epigenomics consortium: integrative analysis of 111 reference human epigenomes. Nature 518(7539), 317–330 (2015)

  4. Bardet, A.F., He, Q., Zeitlinger, J., Stark, A.: A computational pipeline for comparative ChIP-seq analyses. Nat. Protoc. 7(1), 45–61 (2012)

  5. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.A.C.: Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data. BMC Bioinform. 14(1), 169 (2013)

  6. Zeng, X., Sanalkumar, R., Bresnick, E.H., Li, H., Chang, Q., Keleş, S.: jMOSAiCS: joint analysis of multiple ChIP-seq datasets. Genome Biol. 14, R38 (2013). Highly accessed. An R package for joint analysis of multiple ChIP-seq datasets. Available in Bioconductor http://bioconductor.org/packages/2.12/bioc/html/jmosaics.html

  7. Kuan, P.F., Chung, D., Pan, G., Thomson, J., Stewart, R., Keleş, S.: A statistical framework for the analysis of ChIP-Seq data. J. Am. Stat. Assoc. 106, 891–903 (2011). Software available on Galaxy http://toolshed.g2.bx.psu.edu/ and also on Bioconductor http://bioconductor.org/packages/2.8/bioc/html/mosaics.html

  8. Bao, Y., Vinciotti, V., Wit, E., ’t Hoen, P.: Joint modeling of ChIP-seq data via a Markov random field model. Biostatistics 15(2), 296–310 (2014)

  9. Chen, K.B., Hardison, R., Zhang, Y.: dCaP: detecting differential binding events in multiple conditions and proteins. BMC Genomics 15(9), 1–14 (2014)

  10. Ernst, J., Kellis, M.: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28(8), 817–825 (2010)

  11. Hoffman, M.M., Buske, O.J., Wang, J., Weng, Z., Bilmes, J.A., Noble, W.S.: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods 9, 473–476 (2012)

  12. Song, J., Chen, K.C.: Spectacle: fast chromatin state annotation using spectral learning. Genome Biol. 16(1), 33 (2015)

  13. Sohn, K.A., Ho, J.W.K., Djordjevic, D., Jeong, H.H., Park, P.J., Kim, J.H.: hiHMM: Bayesian non-parametric joint inference of chromatin state maps. Bioinformatics, btv117 (2015)

  14. Liang, K., Keleş, S.: Detecting differential binding of transcription factors with ChIP-seq. Bioinformatics 28(1), 121–122 (2012). Available in Bioconductor (http://www.bioconductor.org/packages/2.12/bioc/html/DBChIP.html)

  15. Mahony, S., Edwards, M.D., Mazzoni, E.O., Sherwood, R.I., Kakumanu, A., Morrison, C.A., Wichterle, H., Gifford, D.K.: An integrated model of multiple-condition ChIP-Seq data reveals predeterminants of Cdx2 binding. PLoS Comput. Biol. 10(3), e1003501 (2014)

  16. Song, Q., Smith, A.D.: Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics 27, 870–1 (2011)

  17. Ferguson, J.P., Cho, J.H., Zhao, H.: A new approach for the joint analysis of multiple ChIP-seq libraries with application to histone modification. Stat. Appl. Genet. Mol. Biol. 11(3), Article 1 (2012)

  18. Taslim, C., Huang, T., Lin, S.: DIME: R-package for identifying differential ChIP-seq based on an ensemble of mixture models. Bioinformatics 27(11), 1569–70 (2011)

  19. Ji, H., Li, X., Wang, Q.F., Ning, Y.: Differential principal component analysis of ChIP-seq. Proc. Nat. Acad. Sci. U.S.A. 110(17), 6789–6794 (2013)

  20. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met. 39, 1–38 (1977)

  21. Zuo, C., Keleş, S.: A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 30(6), 853–860 (2014)

  22. Broderick, T., Kulis, B., Jordan, M.: MAD-Bayes: MAP-based asymptotic derivations from Bayes. In: Proceedings of the 30th International Conference on Machine Learning (2013)

  23. Blackwell, D., MacQueen, J.B.: Ferguson distributions via Polya urn schemes. Ann. Stat. 1(2), 353–355 (1973)

  24. Aldous, D.J.: Exchangeability and related topics. In: Hennequin, P.L. (ed.) École d’Été de Probabilités de Saint-Flour XIII, vol. 1117, pp. 1–198. Springer, Heidelberg (1983)

  25. Hewitt, K.J., Kim, D.H., Devadas, P., Prathibha, R., Zuo, C., Sanalkumar, R., Johnson, K.D., Kang, Y.A., Kim, J.S., Dewey, C.N., Keleş, S., Bresnick, E.: Hematopoietic signaling mechanism revealed from a stem/progenitor cell cistrome. Mol. Cell 59(1), 62–74 (2015)

  26. Johnson, K.D., Hsu, A., Ryu, M.J., Boyer, M.E., Keleş, S., Zhang, J., Lee, Y., Holland, S.M., Bresnick, E.H.: Cis-element mutation in a GATA-2-dependent immunodeficiency syndrome governs hematopoiesis and vascular integrity. J. Clin. Inv. 122(10), 3692–3704 (2012)

  27. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  28. Wei, Y., Li, X., Wang, Q.F., Ji, H.: iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets. BMC Genomics 13, 681 (2012)

  29. Gerstein, M.B., Kundaje, A., Hariharan, M., Landt, S.G., Yan, K.K., Cheng, C., Mu, X.J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A.P., Cayting, P., Charos, A., Chen, D.Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O’Geen, H., Ouyang, Z., Partridge, E.C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T.E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K.Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P.J., Myers, R.M., Weissman, S.M., Snyder, M.: Architecture of the human regulatory network derived from ENCODE data. Nature 489(7414), 91–100 (2012)

  30. Wei, Y., Tenzen, T., Ji, H.: Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics 16(1), 31–46 (2015)

  31. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  32. Tan, P.N., Steinbach, M., Kumar, V.: Cluster analysis: basic concepts and algorithms. In: Introduction to Data Mining, chap. 8 (2005)

  33. Landt, S.G., Marinov, G.K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B.E., Bickel, P., Brown, J.B., Cayting, P., et al.: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9), 1813–1831 (2012)

  34. Banerjee, A.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

Corresponding author

Correspondence to Sündüz Keleş.

Appendix

Fig. 4.

A graphical interpretation of the conjugacy between \(\lambda _r\) and J. We use the K-means initialization to compute surrogate values of L(J) for a large collection of \(J \ge 1\). The \(\lambda _r\) value that yields J clusters in the global solution must satisfy \(\sup _{J'>J}\frac{L(J)-L(J')}{J'-J}\le \lambda _r\le \inf _{J'<J}\frac{L(J')-L(J)}{J-J'}\). When \(\lambda _r\) satisfies this condition, a line with slope \(-\lambda _r\) passing through \((J, L(J))\) on the graph should be tangent to the trace of all L(J) values. Because the surrogate L(J) values can make the curve connecting them non-convex, no \(\lambda _r\) may satisfy the condition for some J; a convex approximation to the trace of L(J) ensures that a \(\lambda _r\) exists for each J. A simpler approach is to order the L(J) values from largest to smallest and require \(L(J) - L(J+1) \le \lambda _r \le L(J-1)-L(J)\). Algorithm 2 essentially applies this idea to select the \(\lambda _r\) values: each J corresponds to \(\lambda _r = [L(J-1)-L(J+1)]/2\), the midpoint of this interval, which satisfies the conjugacy inequality. The algorithm essentially tries to identify the range of \(\lambda _r\) that leads to up to \(\sqrt{I}\) clusters.
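The simpler rule in this caption can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 2: the function name `lambda_candidates`, its arguments, and the reading of I as the number of loci are assumptions made for illustration. Given surrogate L(J) values, it orders them from largest to smallest, takes \([L(J-1)-L(J+1)]/2\) as the candidate \(\lambda _r\) for each interior J up to \(\sqrt{I}\), and reports whether that candidate lies in the interval \([L(J)-L(J+1),\, L(J-1)-L(J)]\).

```python
from math import isqrt


def lambda_candidates(loss, n_loci):
    """Map each cluster count J to (candidate lambda_r, interval check).

    loss[j - 1] is a surrogate value of L(j) for j = 1..len(loss).
    n_loci plays the role of I (assumed here to be the number of loci).
    Hypothetical helper, not the authors' implementation.
    """
    # Order the surrogate losses from largest to smallest, as the caption suggests.
    ordered = sorted(loss, reverse=True)
    # J needs both neighbours L(J-1) and L(J+1), and is capped at sqrt(I).
    j_max = min(isqrt(n_loci), len(ordered) - 1)
    result = {}
    for j in range(2, j_max + 1):
        prev_l, cur_l, next_l = ordered[j - 2], ordered[j - 1], ordered[j]
        lam = (prev_l - next_l) / 2          # [L(J-1) - L(J+1)] / 2
        lower = cur_l - next_l               # L(J) - L(J+1)
        upper = prev_l - cur_l               # L(J-1) - L(J)
        result[j] = (lam, lower <= lam <= upper)
    return result


if __name__ == "__main__":
    # Made-up surrogate losses for J = 1..6 and I = 25 loci.
    for j, (lam, ok) in lambda_candidates([100, 60, 40, 30, 25, 23], 25).items():
        print(j, lam, ok)
```

When the ordered losses are convex in J, every candidate passes the interval check; a failed check flags a J for which the surrogate trace is locally non-convex.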

Fig. 5.

Comparison of clustering accuracy, measured by the adjusted Rand index, after excluding the singleton loci.
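As a rough illustration of this kind of comparison, the sketch below computes the adjusted Rand index on the loci that remain after singletons are dropped. The label `-1` for singleton loci, the choice to identify singletons from the reference labels, and both function names are assumptions for illustration; the source does not specify how singletons are encoded.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement
    # Pair counts within contingency cells and within each labeling.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total_pairs
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (same_both - expected) / (maximum - expected)


def ari_excluding_singletons(reference, predicted, singleton=-1):
    """ARI restricted to loci whose reference label is not `singleton`.

    The singleton marker is a hypothetical encoding chosen for this sketch.
    """
    kept = [(r, p) for r, p in zip(reference, predicted) if r != singleton]
    return adjusted_rand_index([r for r, _ in kept], [p for _, p in kept])


if __name__ == "__main__":
    reference = [0, 0, 1, 1, -1, -1]   # last two loci are singletons
    predicted = [5, 5, 7, 7, 5, 7]
    print(adjusted_rand_index(reference, predicted))
    print(ari_excluding_singletons(reference, predicted))
```

In this toy input the clustered loci are recovered perfectly, so excluding the singletons raises the index to 1, whereas including them penalizes the fit for loci that belong to no cluster.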

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zuo, C., Chen, K., Keleş, S. (2016). A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. In: Singh, M. (eds) Research in Computational Molecular Biology. RECOMB 2016. Lecture Notes in Computer Science, vol 9649. Springer, Cham. https://doi.org/10.1007/978-3-319-31957-5_2

  • DOI: https://doi.org/10.1007/978-3-319-31957-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31956-8

  • Online ISBN: 978-3-319-31957-5

  • eBook Packages: Computer Science (R0)
