Add statistical methods #33

Closed
wants to merge 6 commits into from

Contributor

@kgryte kgryte commented Dec 10, 2020

This PR

  • adds specifications for statistical methods.
  • is derived from comparing API signatures across dataframe libraries.

Notes

  • Statistical methods are widely implemented across dataframe libraries and are used by downstream libraries.

  • Series is not included, based on previous consortium discussions in which the dataframe/series distinction was not considered necessary. See Avoiding the "pandas trap" #4 and Separate object for a dataframe colum? (is Series needed?) #6.

  • vaex and ibis have considerably different APIs from pandas, Dask, Modin, cuDF, and Koalas. They influenced API inclusion only through whether they provide a particular method name (or equivalent), not through their keyword arguments.

  • Comments for each proposed method:

    • cummax: the only universally implemented keyword argument is skipna. cuDF and Koalas support skipna but not axis, which pandas, Dask, and Modin do support.
    • cummin: same comments as for cummax.
    • cumsum: same comments as for cummax.
    • cumprod: same comments as for cummax.
    • max: only universally implemented keyword argument is axis. pandas, Modin, and Koalas support numeric_only, but others do not.
    • mean: same comments as for max.
    • min: same comments as for max.
    • nlargest: no keyword arguments are universally implemented. cuDF does not support multiple column labels. This PR does specify that multiple labels be permitted. It is not clear whether cuDF technically cannot support multiple column labels or whether this is simply not yet implemented.
    • nsmallest: same comments as for nlargest.
    • prod: same comments as for max. Koalas can only support positive numbers due to its implementation algorithm. pandas, Dask, Modin, and cuDF support a min_count keyword argument, but Koalas does not.
    • std: same comments as for max. Koalas does not support a correction factor. As in the array API specification, ddof is renamed to correction, since the ddof name is a historical "bug" carried over from NumPy.
    • sum: same comments as for max. pandas, Dask, Modin, and cuDF support a min_count keyword argument, but Koalas does not.
    • var: same comments as for std.
  • methods excluded from this initial proposal: mode, median, nunique, and quantile due to either lack of universal availability, divergent behavior, increased complexity, or lack of downstream usage. These can be considered in a future PR.
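
As a rough illustration of the skipna and min_count keywords discussed in the notes above, here is a minimal sketch using pandas only. The column names and data are made up for the example, and the behavior shown is that of pandas, not necessarily what the specification would require.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 2.0, np.nan]})

# skipna=True (the pandas default): a missing value stays missing in the
# output, but the running maximum continues past it.
running_max = df.cummax(skipna=True)

# skipna=False: once a missing value is seen, the rest of that column is NaN.
running_max_strict = df.cummax(skipna=False)

# min_count: require at least this many non-missing values per column,
# otherwise the reduction returns NaN.
sum_lenient = df.sum(min_count=2)
sum_strict = df.sum(min_count=3)

print(running_max)
print(running_max_strict)
print(sum_lenient)
print(sum_strict)
```

Each column here has exactly two non-missing values, so min_count=2 yields ordinary sums while min_count=3 yields NaN for both columns.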

@rgommers
Member

A note on cum* - those are very weird names for native English speakers (search engineering humor; Matlab's fault originally I believe), and SciPy recently renamed to cumulative_ due to that. Now may be the right time to make that change here. They're not used all that much I believe, so the extra characters are not super important, and the longer name is clearer anyway.

@kgryte
Contributor Author

kgryte commented Dec 10, 2020

@rgommers Another alternative that I have seen is cu* (e.g., cumax, cumin, cusum, cuprod).

@ueshin

ueshin commented Dec 11, 2020

For some comments on Koalas:

Koalas can support the following, so we can include them in the proposal:

  • min_count for prod and sum
  • axis for cumulative functions

As for correction for std and var, we are still discussing.

Thanks.

Updates:
We have already implemented all of the above, including the correction factor, for which we use the name ddof for now to follow pandas.

@kkraus14
Collaborator

kkraus14 commented Jan 7, 2021

Re cuDF:

  • We can support the numeric_only kwarg
  • We can support multiple column labels for nlargest, etc.
  • We can possibly support axis but it will be a big engineering effort
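
To make "multiple column labels for nlargest" concrete, here is a small sketch of the existing pandas behavior, in which later labels break ties in earlier ones. The data is made up, and this shows pandas only; it is not a statement about how cuDF would implement it.

```python
import pandas as pd

# Hypothetical data: three rows tie on column "a".
df = pd.DataFrame({"a": [3, 3, 3, 1], "b": [1, 5, 2, 9]})

# Rank by "a" first; rows that tie on "a" are then ranked by "b".
top = df.nlargest(2, ["a", "b"])
print(top)
```

With a single label, the choice among the three tied rows would depend only on the keep policy; passing the second label makes the tie-break explicit.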

@rgommers rgommers marked this pull request as draft April 26, 2023 13:57
@rgommers
Member

As discussed, closing this PR given that we already have the most common reductions and the rest doesn't really fit with the direction we've taken over the last 6 months.

@rgommers rgommers closed this May 18, 2023

4 participants