Add CDF implementation. #77

johanneskoester · 2016-07-29T14:19:34Z

This PR implements generic support for cumulative distribution functions and probability mass functions.

johanneskoester · 2016-07-29T14:23:03Z

@dikaiosune @cramertj don't know if this is of interest to you, but I would appreciate your opinion.

anp · 2016-08-01T16:52:32Z

Cool! I'm certainly not well-versed enough to comment on the implementation :). I do have a question: is there a point where the statistical functionality is robust enough to package as a separate crate? I imagine it might be more discoverable to non-bioinformatics users -- I am of course assuming that these statistical methods are used outside of bioinformatics. Is that correct?

cramertj · 2016-08-01T17:04:19Z

src/stats/cdf.rs

+use std::slice;
+
+use num::traits::{cast, NumCast};
+use itertools::Itertools;


Do you use this anywhere? Running cargo test lists it as an unused import.

This one is used, yes, but the import in the test module was superfluous. I removed it. Thanks!

cramertj · 2016-08-01T17:07:42Z

src/stats/cdf.rs

+    ///
+    /// * `pmf` - the PMF as a vector of value/probability pairs
+    pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
+        entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());


It doesn't make sense to me to unwrap partial_cmp on unknown PartialOrd types. If you expect there to always be a valid ordering, you should use Ord, and if not, you should handle the case in which there's no ordering.

You have a point. I unwrap in order to allow floats for T; in case of NaN, this would panic. One could think about returning a Result instead. What's your feeling: panic or Result?

You could also impose an ordering on NaN, always treating them as either larger or smaller than other numbers.

True, but with a CDF, the ordering is critical, and it is impossible to guess what the user has in mind. A user who wants to support NaN in a CDF can always wrap f64 and implement Ord.

If a total ordering is necessary, it seems like the only available solution is to require Ord, and then provide some mechanism for producing an OrdFloat or similar that is guaranteed to be neither infinity nor NaN (are those the only two cases a float won't produce a valid ordering?). If the solution is to panic when no ordering is available, that invariant should be enforced in the Ord implementation.

In the latest commits, I now require Ord. The ordered-float crate provides newtype wrappers around floats that can be used for this, e.g. floats that cannot be NaN, or floats where NaNs are ordered as suggested by IEEE.
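To illustrate the outcome of this thread, here is a minimal sketch of the "wrap f64 and implement Ord" idea using only the standard library. The names `NotNanF64` and `sort_entries` are hypothetical and are not part of Rust-Bio or the ordered-float crate; the probability is kept as a plain `f64` instead of the crate's `LogProb`.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper (not part of Rust-Bio): an f64 known not to be NaN.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct NotNanF64(f64);

impl NotNanF64 {
    /// Returns `None` for NaN, so the invariant holds for every instance.
    pub fn new(value: f64) -> Option<Self> {
        if value.is_nan() {
            None
        } else {
            Some(NotNanF64(value))
        }
    }

    /// Returns the wrapped float.
    pub fn get(self) -> f64 {
        self.0
    }
}

// Sound because NaN, the only f64 that is not equal to itself, is excluded.
impl Eq for NotNanF64 {}

impl PartialOrd for NotNanF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for NotNanF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        // Cannot fail: `partial_cmp` on f64 only returns `None` if a NaN is involved.
        self.0
            .partial_cmp(&other.0)
            .expect("NaN is rejected by NotNanF64::new")
    }
}

/// Sorts value/probability pairs by value. With `T: Ord` there is no
/// comparison that could fail, so no `unwrap` is needed in the closure.
pub fn sort_entries<T: Ord>(entries: &mut [(T, f64)]) {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
}
```

With this design the NaN check happens once, when a value is wrapped, and the caller decides there how to handle the `None`; the sort itself can no longer panic.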

cramertj · 2016-08-01T17:08:24Z

src/stats/cdf.rs

+    pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
+        entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());
+        let mut inner: Vec<(T, LogProb)> = Vec::new();
+        for mut e in entries.into_iter() {


The call to into_iter here is unnecessary.

Absolutely.

johanneskoester · 2016-08-01T17:21:45Z

@dikaiosune yes, I think it would be reasonable to move the stats module into an external crate. The same is however also true for the pattern matching module and most of the data structures. One model would be to have them as separate repos within our organization. Then, each repo spawns into a separate crate. I just wonder about the naming...

On the other hand, very small modules are also less visible (e.g. in terms of download numbers). I would guess that the stats module profits from being inside of Rust-Bio.
Since SeqAn, for example, also has a generic graph implementation, I currently tend to leave it as it is, but advertise that Rust-Bio also has generic algorithms and data structures. For example, we can make it very obvious in the description of the crate.

cramertj · 2016-08-01T17:27:02Z

src/stats/cdf.rs

+        let mut inner: Vec<(T, LogProb)> = Vec::new();
+        for mut e in entries.into_iter() {
+            let p = logprobs::add(inner.last().map_or(f64::NEG_INFINITY, |e| e.1), e.1);
+            if !inner.is_empty() && inner.last().unwrap().0 == e.0 {


This really should be an if let Some(ref mut last) = inner.last_mut() or similar. However, I couldn't get it to compile, because the lifetime of the reference taken out by last_mut makes inner unmodifiable in the else branch. I think it's fixed here, but it'll be a bit before this makes it to stable.
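The `last_mut` suggestion above can be sketched in a form that keeps the mutable borrow confined to a single `match`, so the later `push` does not conflict with it. This is only an illustration under stated assumptions: `cumulate` and `ln_add_exp` are hypothetical names, `ln_add_exp` stands in for the crate's `logprobs::add`, probabilities are plain `f64` natural logs instead of `LogProb`, and entries with equal values are assumed to be merged into one cumulative entry (the diff hunk is cut off before the branch body).

```rust
/// Numerically stable ln(exp(a) + exp(b)); a stand-in for `logprobs::add`.
fn ln_add_exp(a: f64, b: f64) -> f64 {
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    if hi == f64::NEG_INFINITY {
        return f64::NEG_INFINITY; // both inputs are ln(0)
    }
    hi + (lo - hi).exp().ln_1p()
}

/// Turns (value, ln probability) pairs into (value, ln cumulative probability)
/// pairs, sorted by value.
/// Assumption: entries with equal values are merged into one cumulative entry.
pub fn cumulate<T: Ord>(mut entries: Vec<(T, f64)>) -> Vec<(T, f64)> {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    let mut inner: Vec<(T, f64)> = Vec::new();
    for (value, ln_p) in entries {
        // Cumulative probability so far, or ln(0) for the first entry.
        let previous = inner.last().map_or(f64::NEG_INFINITY, |e| e.1);
        let cumulative = ln_add_exp(previous, ln_p);
        // The mutable borrow from `last_mut` ends with this `match`,
        // so the `push` below never overlaps with it.
        let merged = match inner.last_mut() {
            Some(last) if last.0 == value => {
                last.1 = cumulative;
                true
            }
            _ => false,
        };
        if !merged {
            inner.push((value, cumulative));
        }
    }
    inner
}
```

The boolean flag is a deliberately conservative choice: it avoids both the `is_empty()`/`unwrap()` pair and any reliance on how long the borrow checker keeps the `last_mut` borrow alive.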

cramertj · 2016-08-01T18:14:06Z

@johanneskoester I'd be in support of splitting up rust-bio into several smaller crates. I feel like some of the generally useful data structures and algorithms are less visible to the general public because people aren't looking for them in a bio library.

johanneskoester · 2016-08-18T12:48:24Z

@dikaiosune and @cramertj thanks a lot for the amazing review. CDF is much better now.
Regarding the split-up: a downside is that people won't find this stuff easily when they come across rust-bio. Also, my feeling is that we would end up with only the IO module and alphabets in Rust-Bio. Maybe Rust-Bio should depend on all those modules and reexport them (hacky...)? Alternatively, it could include them via git submodules (even more hacky...). None of the solutions seems perfect to me from Rust-Bio's perspective.

johanneskoester added 2 commits July 29, 2016 16:06

Add CDF implementation.

af8f4ff

Documentation.

db23351

Merge branch 'master' into cdf

20a9010

cramertj reviewed Aug 1, 2016

Merge branch 'master' into cdf

d6dceda

cramertj reviewed Aug 1, 2016

Remove superfluous import of itertools in the tests.

74d2d60

cramertj reviewed Aug 1, 2016

johanneskoester added 4 commits August 18, 2016 12:59

Merge branch 'master' into cdf. Improve CDF API.

c132e9e

Refactoring.

ef0e510

Allow beta to fail.

f0d94f7

Remove superfluous into_iter calls.

2e10cc7

johanneskoester added 3 commits August 18, 2016 13:52

docs

ccde50b

Generalize credible interval implementation.

b302870

Improve implementation of expected value and variance.

1d888d6

johanneskoester merged commit 6e0d95a into master Aug 18, 2016

johanneskoester deleted the cdf branch October 4, 2021 09:02

4 participants