partial_fit for RobustScaler #30408

Open
zaccharieramzi opened this issue Dec 4, 2024 · 3 comments
Labels
Needs Decision - Include Feature Requires decision regarding including feature New Feature

Comments

@zaccharieramzi
Contributor

Describe the workflow you want to enable

I would like to be able to use partial_fit with the RobustScaler preprocessing for streaming cases or when my data doesn't fit in memory.
As I understand from this paper (https://sites.cs.ucsb.edu/~suri/psdir/ency.pdf), it would probably only be possible to compute the robustly estimated variance and mean up to some precision epsilon, which could become an attribute of the class.

Not sure whether this should be considered a new algorithm or not, let me know what you think.

Describe your proposed solution

I haven't looked much into it but I think there were 2 approaches:

Describe alternatives you've considered, if relevant

An alternative proposed in this SO comment is to load the data column by column if it reduces the memory load.

However, that would be super impractical in my setting where I just cannot load all data into memory at once.

Additional context

No response

@ogrisel
Member

ogrisel commented Dec 4, 2024

> it would probably only be possible to compute the robustly estimated variance and mean up to some precision epsilon

Nitpicking and to avoid any confusion: RobustScaler does not estimate the mean and variance but rather the median and IQR. Those can be very different values depending on the distribution.
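To make the distinction concrete, here is a minimal sketch using only the Python standard library. The sample values are made up for illustration, and the quartiles use the standard library's "inclusive" method, which is not necessarily the exact quantile definition RobustScaler uses internally.

```python
import statistics

# Hypothetical sample: mostly small values plus one extreme outlier.
values = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 1000.0]

# Non-robust statistics: both are dragged far away by the single outlier.
mean = statistics.mean(values)
std = statistics.pstdev(values)

# Robust statistics: the median and the interquartile range (IQR).
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mean:.2f}  std={std:.2f}")
print(f"median={median:.2f}  IQR={iqr:.2f}")
```

On this sample the mean and standard deviation are dominated by the outlier, while the median and IQR stay on the scale of the bulk of the data.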

Personally, I find the reservoir sampling approach interesting. Note that a subsample param similar to what we do in KBinsDiscretizer would make sense, irrespective of whether or not we implement partial_fit. It could later be reused to control the size of the reservoir sampler used for partial_fit.
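A rough sketch of what the reservoir sampling idea could look like, in pure Python. Everything here is hypothetical: `ReservoirQuantiles`, its `partial_fit` method, and the `center_and_scale` helper are illustrative names, not part of scikit-learn, and the `subsample` argument only mimics the role discussed above. It handles a single feature and uses the classic "Algorithm R" replacement rule.

```python
import random
import statistics


class ReservoirQuantiles:
    """Hypothetical single-feature helper; not a scikit-learn estimator."""

    def __init__(self, subsample=1000, seed=0):
        self.subsample = subsample  # maximum reservoir size
        self.reservoir = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def partial_fit(self, batch):
        for x in batch:
            self.n_seen += 1
            if len(self.reservoir) < self.subsample:
                # Fill phase: keep everything until the reservoir is full.
                self.reservoir.append(x)
            else:
                # Algorithm R: the new item replaces a random slot with
                # probability subsample / n_seen.
                j = self._rng.randrange(self.n_seen)
                if j < self.subsample:
                    self.reservoir[j] = x
        return self

    def center_and_scale(self):
        # Median and IQR estimated from the reservoir only.
        q1, median, q3 = statistics.quantiles(
            self.reservoir, n=4, method="inclusive"
        )
        return median, q3 - q1


# Usage: stream 100 batches of 1000 Gaussian values (mean 10, std 2).
data_rng = random.Random(42)
est = ReservoirQuantiles(subsample=1000, seed=0)
for _ in range(100):
    est.partial_fit([data_rng.gauss(10.0, 2.0) for _ in range(1000)])

center, scale = est.center_and_scale()
print(est.n_seen, len(est.reservoir), center, scale)
```

Memory use is bounded by `subsample` regardless of how many batches are passed, at the cost of the median and IQR being approximate.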

In addition to the methods referenced in the paper linked above, there is also the t-digest out-of-core / online estimator: https://github.com/tdunning/t-digest (but I do not know what its pros and cons are).

ogrisel added the "Needs Decision" label and removed the "Needs Triage" label on Dec 4, 2024.
@ogrisel
Member

ogrisel commented Dec 4, 2024

> Not sure whether this should be considered a new algorithm or not, let me know what you think.

We could introduce an extra algorithm or quantile_method constructor parameter for the RobustScaler class to select the strategy.

Note that in #29907, we are introducing the quantile_method="averaged_inverted_cdf" parameter as the future default method to be able to implement the expected weight/repetition semantics when fitting KBinsDiscretizer with strategy="quantile" and sample_weight.

We can probably reuse that parameter in RobustScaler (and also add support for sample_weight). This parameter could then be expanded to implement quantile_method="t-digest" or similar later, and some of those can be made to support partial_fit (with or without subsample).
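For readers unfamiliar with the "averaged_inverted_cdf" definition, here is a hand-written pure-Python sketch of it next to plain linear interpolation. Both functions are illustrative helpers written for this example, not the scikit-learn or NumPy implementations (NumPy's `np.quantile` exposes a method of the same name). The example assumes that "repetition semantics" means that repeating every sample should not change the quantile, which is one reading of the comment above.

```python
import math


def averaged_inverted_cdf(values, p):
    """Quantile by the averaged inverted CDF rule (illustrative helper)."""
    xs = sorted(values)
    n = len(xs)
    g = n * p
    j = math.floor(g)
    if j <= 0:
        return xs[0]
    if j >= n:
        return xs[-1]
    if g == j:
        # n * p lands exactly on a step of the empirical CDF:
        # average the two neighbouring order statistics.
        return (xs[j - 1] + xs[j]) / 2
    return xs[j]


def linear_quantile(values, p):
    """Quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


data = [1, 2, 3, 4, 100]
repeated = data * 2  # every sample appears twice, like a weight of 2

for name, fn in [("averaged_inverted_cdf", averaged_inverted_cdf),
                 ("linear", linear_quantile)]:
    print(name, fn(data, 0.9), fn(repeated, 0.9))
```

In this example the averaged inverted CDF gives the same 0.9 quantile for `data` and `repeated`, while linear interpolation does not.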

ogrisel added the "Needs Decision - Include Feature" label and removed the "Needs Decision" label on Dec 4, 2024.
@ogrisel
Member

ogrisel commented Dec 4, 2024

Thinking about it, the reservoir sampling approach might not be robust if the data is not i.i.d. or exchangeable, and might as a result lead to very different values that depend on the ordering of the data passed to successive partial_fit calls. Other methods are probably more robust to this problem.

2 participants