Conversation

@johanneskoester
Contributor

This PR implements generic support for cumulative distribution functions and probability mass functions.

@johanneskoester
Contributor Author

@dikaiosune @cramertj I don't know if this is of interest to you, but I would appreciate your opinion.

@anp
Contributor

anp commented Aug 1, 2016

Cool! I'm certainly not well-versed enough to comment on the implementation :). I do have a question: is there a point where the statistical functionality is robust enough to package as a separate crate? I imagine it might be more discoverable to non-bioinformatics users -- I am of course assuming that these statistical methods are used outside of bioinformatics. Is that correct?

src/stats/cdf.rs Outdated
use std::slice;

use num::traits::{cast, NumCast};
use itertools::Itertools;
Contributor

Do you use this anywhere? Running cargo test lists it as an unused import.

Contributor Author

Yes, this one is used, but the import in the test module was superfluous. I removed it. Thanks!

src/stats/cdf.rs Outdated
///
/// * `pmf` - the PMF as a vector of value/probability pairs
pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());
Contributor

It doesn't make sense to me to unwrap partial_cmp on unknown PartialOrd types. If you expect there to always be a valid ordering, you should use Ord, and if not, you should handle the case in which there's no ordering.

Contributor Author

You have a point. I unwrap it in order to allow floats for T. With a NaN, this would panic. One alternative would be to return a Result instead. What do you think: panic or Result?
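
To make the "panic or Result" question concrete, here is a minimal sketch of the Result variant. It is not code from this PR: `NanError` and `sort_entries` are hypothetical names, and plain `f64` stands in for both `T` and `LogProb`.

```rust
/// Hypothetical error type: an entry's value was NaN.
#[derive(Debug, PartialEq)]
pub struct NanError;

/// Sort value/probability pairs by value, reporting NaN values as an error
/// instead of panicking. (Assumption: `f64` stands in for `T` and `LogProb`.)
pub fn sort_entries(mut entries: Vec<(f64, f64)>) -> Result<Vec<(f64, f64)>, NanError> {
    // Validate first, so the comparator below sees a totally ordered input.
    if entries.iter().any(|&(value, _)| value.is_nan()) {
        return Err(NanError);
    }
    entries.sort_by(|a, b| {
        a.0.partial_cmp(&b.0)
            .expect("NaN values were rejected above")
    });
    Ok(entries)
}
```

The caller then decides whether a NaN is a bug (unwrap) or bad input (propagate the error), which is the trade-off being discussed.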

Contributor

You could also impose an ordering on NaN, always treating them as either larger or smaller than other numbers.
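
A minimal sketch of that suggestion, assuming NaN should sort after every other value. The helper names `cmp_nan_last` and `sort_nan_last` are invented for illustration and do not appear in the PR.

```rust
use std::cmp::Ordering;

/// Compare two floats, treating NaN as greater than every other value
/// and equal to itself. (Hypothetical helper, not part of the PR.)
pub fn cmp_nan_last(a: &f64, b: &f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither operand is NaN, so partial_cmp always returns Some.
        (false, false) => a.partial_cmp(b).unwrap(),
    }
}

/// Sort a slice of floats with all NaN values moved to the end.
pub fn sort_nan_last(values: &mut [f64]) {
    values.sort_by(cmp_nan_last);
}
```

Whether NaN belongs at the end, at the start, or nowhere at all is exactly the policy decision the thread is debating.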

Contributor Author

True, but with a CDF, the ordering is critical, and it is impossible to guess what the user has in mind. A user who wants to support NaN in their CDF can always wrap f64 and implement Ord.

Contributor

If a total ordering is necessary, it seems like the only available solution is to require Ord, and then provide some mechanism for producing an OrdFloat or similar that is guaranteed to be neither infinity nor NaN (are those the only two cases a float won't produce a valid ordering?). If the solution is to panic when no ordering is available, that invariant should be enforced in the Ord implementation.
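
On the parenthetical question: for `f64`, `partial_cmp` returns `None` only when an operand is NaN. Infinities compare like any other value. A small sketch to check this, where `is_comparable` is a hypothetical helper:

```rust
/// Returns true if `x` can be ordered against itself and against every
/// value in `others`. (Hypothetical helper for illustration.)
pub fn is_comparable(x: f64, others: &[f64]) -> bool {
    x.partial_cmp(&x).is_some() && others.iter().all(|o| x.partial_cmp(o).is_some())
}
```

So a wrapper type only needs to exclude NaN to make the ordering total.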

Contributor Author

In the latest commits, I now require Ord. The ordered-float crate provides newtype wrappers around floats, e.g. floats that exclude NaN or floats where NaN is ordered as suggested by IEEE.
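
To illustrate the newtype idea without pulling in an external crate, here is a std-only sketch. `NonNanF64` is a hypothetical type written for this example; it is not the ordered-float API and is not part of the PR.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper around `f64` that can never hold NaN,
/// which makes a total order (`Ord`) sound.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NonNanF64(f64);

impl NonNanF64 {
    /// Returns `None` if `x` is NaN, so the invariant is checked once, at construction.
    pub fn new(x: f64) -> Option<Self> {
        if x.is_nan() { None } else { Some(NonNanF64(x)) }
    }

    /// Access the wrapped value.
    pub fn get(self) -> f64 {
        self.0
    }
}

// Sound because NaN (the only f64 with x != x) is excluded.
impl Eq for NonNanF64 {}

impl Ord for NonNanF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0
            .partial_cmp(&other.0)
            .expect("NonNanF64 never holds NaN")
    }
}

impl PartialOrd for NonNanF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

With such a type as `T`, a constructor bounded by `T: Ord` can sort with plain `cmp` and needs no `unwrap` at the sort site.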

src/stats/cdf.rs Outdated
pub fn from_pmf(mut entries: Vec<(T, LogProb)>) -> Self {
entries.sort_by(|&(ref a, _), &(ref b, _)| a.partial_cmp(b).unwrap());
let mut inner: Vec<(T, LogProb)> = Vec::new();
for mut e in entries.into_iter() {
Contributor

The call to into_iter here is unnecessary.
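
For context: a `for` loop already calls `IntoIterator::into_iter` on the expression it iterates over, so the two loops in this sketch behave identically. The function names are invented for illustration, and `u32`/`f64` stand in for the PR's `T` and `LogProb`.

```rust
/// Consumes the vector with an explicit `into_iter()` call.
pub fn total_explicit(entries: Vec<(u32, f64)>) -> f64 {
    let mut total = 0.0;
    for e in entries.into_iter() {
        total += e.1;
    }
    total
}

/// Consumes the vector directly; the `for` loop calls `into_iter()` itself.
pub fn total_implicit(entries: Vec<(u32, f64)>) -> f64 {
    let mut total = 0.0;
    for e in entries {
        total += e.1;
    }
    total
}
```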

Contributor Author

Absolutely.

@johanneskoester
Contributor Author

@dikaiosune yes, I think it would be reasonable to move the stats module into an external crate. The same is however also true for the pattern matching module and most of the data structures. One model would be to have them as separate repos within our organization. Then, each repo spawns into a separate crate. I just wonder about the naming...

On the other hand, very small modules are also less visible (e.g. in terms of download numbers). I would guess, for example, that the stats module profits from being inside Rust-Bio.
Since Seqan, for example, also has a generic graph implementation, I currently tend to leave it as it is, but advertise that Rust-Bio also has generic algorithms and data structures. For example, we can make this very obvious in the description of the crate.

src/stats/cdf.rs Outdated
let mut inner: Vec<(T, LogProb)> = Vec::new();
for mut e in entries.into_iter() {
let p = logprobs::add(inner.last().map_or(f64::NEG_INFINITY, |e| e.1), e.1);
if !inner.is_empty() && inner.last().unwrap().0 == e.0 {
Contributor

This really should be an if let Some(ref mut last) = inner.last_mut() or similar. However, I couldn't get it to compile, because the lifetime of the reference taken out by last_mut makes inner unmodifiable in the else branch. I think it's fixed here, but it'll be a bit before this makes it to stable.
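
On a current stable compiler, where non-lexical lifetimes are available, the loop can be written with a match guard on `last_mut()`. The following is a sketch under assumptions, not the merged code: `cumulate` and `ln_add_exp` are hypothetical names, `ln_add_exp` stands in for `logprobs::add`, and `f64` stands in for `LogProb`.

```rust
/// Numerically stable ln(exp(a) + exp(b)); a stand-in for `logprobs::add`.
pub fn ln_add_exp(a: f64, b: f64) -> f64 {
    if a == f64::NEG_INFINITY {
        return b;
    }
    if b == f64::NEG_INFINITY {
        return a;
    }
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    hi + (lo - hi).exp().ln_1p()
}

/// Turn value/log-probability pairs into cumulative log-probabilities,
/// merging entries that share the same value.
pub fn cumulate<T: Ord>(mut entries: Vec<(T, f64)>) -> Vec<(T, f64)> {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    let mut inner: Vec<(T, f64)> = Vec::new();
    for (value, logp) in entries {
        // Cumulative probability so far, or ln(0) for the first entry.
        let prev = inner.last().map_or(f64::NEG_INFINITY, |e| e.1);
        let p = ln_add_exp(prev, logp);
        match inner.last_mut() {
            // Same value as the previous entry: update it in place.
            Some(last) if last.0 == value => last.1 = p,
            // New value (or empty vector): the mutable borrow has ended here.
            _ => inner.push((value, p)),
        }
    }
    inner
}
```

The fallback arm can push onto `inner` because the reference returned by `last_mut()` is no longer used on that path.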

@cramertj
Contributor

cramertj commented Aug 1, 2016

@johanneskoester I'd be in support of splitting up rust-bio into several smaller crates. I feel like some of the generally useful data structures and algorithms are less visible to the general public because people aren't looking for them in a bio library.

@johanneskoester
Contributor Author

@dikaiosune and @cramertj thanks a lot for the amazing review. CDF is much better now.
Regarding the split-up: a downside is that people won't find this stuff easily when they come across rust-bio. Also, my feeling is that we would end up with only the IO module and alphabets in Rust-Bio. Maybe Rust-Bio should depend on all those modules and re-export them (hacky...)? Alternatively, it could include them via git submodules (even more hacky...). Neither solution seems perfect to me from a Rust-Bio perspective.

@johanneskoester johanneskoester merged commit 6e0d95a into master Aug 18, 2016
@johanneskoester johanneskoester deleted the cdf branch October 4, 2021 09:02
4 participants