Add CDF implementation. #77
@dikaiosune @cramertj I don't know if this is of interest to you, but I would appreciate your opinion.
Cool! I'm certainly not well-versed enough to comment on the implementation :). I do have a question: is there a point where the statistical functionality is robust enough to package as a separate crate? I imagine it might be more discoverable to non-bioinformatics users -- I am of course assuming that these statistical methods are used outside of bioinformatics. Is that correct?
src/stats/cdf.rs
Outdated
use std::slice;

use num::traits::{cast, NumCast};
use itertools::Itertools;
Do you use this anywhere? Running cargo test lists it as an unused import.
This one yes, but the import in the test module was superfluous. I removed it. Thanks!
src/stats/cdf.rs
Outdated
///
/// * `pmf` - the PMF as a vector of value/probability pairs
pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
    entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());
It doesn't make sense to me to unwrap partial_cmp on unknown PartialOrd types. If you expect there to always be a valid ordering, you should use Ord, and if not, you should handle the case in which there's no ordering.
You are right, to some extent. I unwrap it in order to allow floats for T. In the case of NaN, this would panic. One could think about returning a Result instead. What's your feeling? Panic or Result?
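To make the "Result" option concrete, here is a minimal sketch, not the code from this PR. It assumes plain `f64` values and probabilities instead of the generic `T` and `LogProb`; `sort_pmf` and `NanValue` are hypothetical names used only for illustration.

```rust
/// Hypothetical error type for this sketch: a value in the PMF was NaN.
#[derive(Debug, PartialEq)]
pub struct NanValue;

/// Sort value/probability pairs by value.
/// Returns an error instead of panicking when a value is NaN.
pub fn sort_pmf(mut entries: Vec<(f64, f64)>) -> Result<Vec<(f64, f64)>, NanValue> {
    // Reject NaN up front, because NaN has no defined ordering.
    if entries.iter().any(|&(value, _)| value.is_nan()) {
        return Err(NanValue);
    }
    // No NaN is left, so `partial_cmp` always returns `Some` here.
    entries.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    Ok(entries)
}
```

The caller then decides whether a NaN is a bug (unwrap) or bad input (propagate the error).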
You could also impose an ordering on NaN, always treating them as either larger or smaller than other numbers.
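As a rough sketch of what imposing an ordering on NaN could look like, the comparator below treats NaN as larger than every other number. `cmp_nan_last` is a hypothetical helper, not something from this PR or from rust-bio.

```rust
use std::cmp::Ordering;

/// Compare two floats, treating NaN as larger than any other value
/// and equal to itself, so that the result is always defined.
pub fn cmp_nan_last(a: &f64, b: &f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so `partial_cmp` cannot fail.
        (false, false) => a.partial_cmp(b).unwrap(),
    }
}

/// Sort a vector of floats with NaN values placed at the end.
pub fn sort_nan_last(mut values: Vec<f64>) -> Vec<f64> {
    values.sort_by(cmp_nan_last);
    values
}
```

Whether "NaN last" is the right choice for a CDF is exactly the open question in this thread; the sketch only shows that the ordering can be made total.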
True, but with a CDF, the ordering is critical, and it is impossible to guess what the user has in mind. A user who wants to support NaN in their CDF can always wrap f64 and implement Ord.
If a total ordering is necessary, it seems like the only available solution is to require Ord, and then provide some mechanism for producing an OrdFloat or similar that is guaranteed to be neither infinity nor NaN (are those the only two cases a float won't produce a valid ordering?). If the solution is to panic when no ordering is available, that invariant should be enforced in the Ord implementation.
In the latest commits, I now require Ord. The ordered-float crate provides newtype wrappers around floats that can be used to have, e.g., floats without NaNs, or floats where NaNs are treated as suggested by IEEE.
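For illustration, a float wrapper that satisfies an `Ord` bound might look roughly like the following. This is a hand-rolled sketch using only the standard library; the `NotNan` type here is hypothetical and is not the API of the ordered-float crate, and `sort_values` merely stands in for any function with a `T: Ord` bound.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper that can only be constructed from a non-NaN float.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NotNan(f64);

impl NotNan {
    /// Returns `None` for NaN, so the invariant is enforced at construction.
    pub fn new(value: f64) -> Option<Self> {
        if value.is_nan() { None } else { Some(NotNan(value)) }
    }
}

// Without NaN, equality on floats is reflexive, so `Eq` is sound.
impl Eq for NotNan {}

impl PartialOrd for NotNan {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for NotNan {
    fn cmp(&self, other: &Self) -> Ordering {
        // Cannot fail: neither side is NaN by construction.
        self.0.partial_cmp(&other.0).expect("NotNan never holds NaN")
    }
}

/// Stand-in for an API that requires a total ordering on its values.
pub fn sort_values<T: Ord>(mut values: Vec<T>) -> Vec<T> {
    values.sort();
    values
}
```

With this design the NaN check happens once, when the value is wrapped, rather than on every comparison.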
src/stats/cdf.rs
Outdated
pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
    entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());
    let mut inner: Vec<(T, LogProb)> = Vec::new();
    for mut e in entries.into_iter() {
The call to into_iter here is unnecessary.
Absolutely.
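For reference, a `for` loop already calls `IntoIterator::into_iter` on the expression it is given, so iterating over the `Vec` directly consumes it in the same way. A tiny sketch, with a hypothetical `total` function and plain `f64` probabilities assumed for simplicity:

```rust
/// Sum the probabilities of all entries, consuming the vector.
pub fn total(entries: Vec<(u32, f64)>) -> f64 {
    let mut sum = 0.0;
    // Equivalent to `for (_, p) in entries.into_iter()`.
    for (_, p) in entries {
        sum += p;
    }
    sum
}
```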
@dikaiosune yes, I think it would be reasonable to move the stats module into an external crate. However, the same is true for the pattern matching module and most of the data structures. One model would be to have them as separate repos within our organization, with each repo published as a separate crate. I just wonder about the naming... On the other hand, very small modules are also less visible (e.g. in terms of download numbers). For example, I would guess the stats module benefits from being inside of Rust-Bio.
src/stats/cdf.rs
Outdated
let mut inner: Vec<(T, LogProb)> = Vec::new();
for mut e in entries.into_iter() {
    let p = logprobs::add(inner.last().map_or(f64::NEG_INFINITY, |e| e.1), e.1);
    if !inner.is_empty() && inner.last().unwrap().0 == e.0 {
This really should be an if let Some(ref mut last) = inner.last_mut() or similar. However, I couldn't get it to compile, because the lifetime of the reference taken out by last_mut makes inner unmodifiable in the else branch. I think it's fixed here, but it'll be a bit before this makes it to stable.
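A sketch of the `last_mut` shape, under simplifying assumptions: plain probabilities instead of `LogProb`, `u32` values instead of the generic `T`, and input that is already sorted by value. `accumulate` is a hypothetical name. On current compilers with non-lexical lifetimes, a form like this should compile, because the mutable borrow ends before `push` is reached.

```rust
/// Turn a sorted PMF into cumulative probabilities,
/// merging consecutive entries that share the same value.
pub fn accumulate(entries: Vec<(u32, f64)>) -> Vec<(u32, f64)> {
    let mut inner: Vec<(u32, f64)> = Vec::new();
    for (value, p) in entries {
        // Cumulative probability up to and including this entry.
        let cumulative = inner.last().map_or(0.0, |last| last.1) + p;
        if let Some(last) = inner.last_mut() {
            if last.0 == value {
                // Same value as the previous entry: update it in place.
                last.1 = cumulative;
                continue;
            }
        }
        // The borrow from `last_mut` has ended, so pushing is allowed.
        inner.push((value, cumulative));
    }
    inner
}
```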
@johanneskoester I'd be in support of splitting up rust-bio into several smaller crates. I feel like some of the generally useful data structures and algorithms are less visible to the general public because people aren't looking for them in a bio library.
@dikaiosune and @cramertj thanks a lot for the amazing review. CDF is much better now.
This PR implements generic support for cumulative distribution functions and probability mass functions.
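Since the reviewed code accumulates probabilities in log space (via `logprobs::add`, starting from `f64::NEG_INFINITY` as log of zero), the following is a minimal sketch of what such an addition typically does. `ln_add` is a hypothetical stand-in written for this illustration; it is not claimed to match the rust-bio implementation.

```rust
/// Given a = ln(x) and b = ln(y), return ln(x + y)
/// without leaving log space (the log-sum-exp trick).
pub fn ln_add(a: f64, b: f64) -> f64 {
    // Factor out the larger term for numerical stability.
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    if hi == f64::NEG_INFINITY {
        // Both terms are ln(0), so the sum is ln(0) as well.
        return f64::NEG_INFINITY;
    }
    // ln(x + y) = hi + ln(1 + exp(lo - hi))
    hi + (lo - hi).exp().ln_1p()
}
```

Starting the running sum at `f64::NEG_INFINITY` makes the first addition return the first probability unchanged, which matches the `map_or(f64::NEG_INFINITY, ...)` default in the diff.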