partial_fit for RobustScaler #30408
Comments
Nitpicking and to avoid any confusion: Personally, I find the reservoir sampling approach interesting. Not that a In addition to the methods referenced in the paper linked above, there is also the t-digest out-of-core / online estimator: https://github.com/tdunning/t-digest (but I have no idea what the pros and cons are).
We could introduce an extra Note that in #29907, we are introducing the We can probably reuse that parameter in
Thinking about it, the reservoir sampling approach might not be robust if the data is not i.i.d. or exchangeable, and might as a result lead to very different values that depend on the ordering of the data passed to successive
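To make the reservoir sampling idea discussed above concrete, here is a minimal sketch. The class name `ReservoirRobustScaler` and its parameters are hypothetical and not part of scikit-learn; it keeps a uniform random sample of the rows seen so far (Algorithm R) and recomputes the median and interquartile range from that sample after every batch.

```python
import numpy as np


class ReservoirRobustScaler:
    """Hypothetical sketch, NOT a scikit-learn estimator.

    Keeps a fixed-size uniform sample of all rows seen so far and
    estimates the per-feature median and quantile range from it.
    """

    def __init__(self, reservoir_size=1000, quantile_range=(25.0, 75.0),
                 random_state=None):
        self.reservoir_size = reservoir_size
        self.quantile_range = quantile_range
        self._rng = np.random.default_rng(random_state)
        self._reservoir = None  # allocated on the first batch
        self._n_filled = 0      # number of valid rows in the reservoir
        self.n_samples_seen_ = 0

    def partial_fit(self, X):
        X = np.asarray(X, dtype=float)
        if self._reservoir is None:
            self._reservoir = np.empty((self.reservoir_size, X.shape[1]))
        for row in X:
            self.n_samples_seen_ += 1
            if self._n_filled < self.reservoir_size:
                # Reservoir not full yet: always keep the row.
                self._reservoir[self._n_filled] = row
                self._n_filled += 1
            else:
                # Algorithm R: the i-th row replaces a random slot
                # with probability reservoir_size / i.
                j = self._rng.integers(0, self.n_samples_seen_)
                if j < self.reservoir_size:
                    self._reservoir[j] = row
        sample = self._reservoir[: self._n_filled]
        self.center_ = np.median(sample, axis=0)
        q_lo, q_hi = np.percentile(sample, self.quantile_range, axis=0)
        self.scale_ = q_hi - q_lo
        self.scale_[self.scale_ == 0.0] = 1.0  # avoid division by zero
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.center_) / self.scale_
```

The sample is uniform over everything seen so far, so `center_` and `scale_` always describe a mixture of old and new data. If the distribution drifts between batches, the estimates after each call depend on how much of each regime has been seen, and the fixed-size sample adds its own variance on top of that.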
Describe the workflow you want to enable
I would like to be able to use partial_fit with the RobustScaler preprocessing for streaming cases or when my data doesn't fit in memory.

As I understand from this paper https://sites.cs.ucsb.edu/~suri/psdir/ency.pdf, it would probably only be possible to compute the robustly estimated variance and mean up to some precision epsilon, which could probably become an attribute of the class.
Not sure whether this should be considered a new algorithm or not, let me know what you think.
Describe your proposed solution
I haven't looked much into it but I think there were 2 approaches:
Describe alternatives you've considered, if relevant
An alternative proposed in this SO comment is to load the data column by column if it reduces the memory load.
However, that would be super impractical in my setting where I just cannot load all data into memory at once.
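For reference, the column-by-column alternative could be sketched as follows. This assumes the data lives in a single `.npy` file that NumPy can memory-map; the function name `robust_stats_by_column` is hypothetical. Only one column is held in memory at a time, but each column still has to fit in memory, so it does not help in a true streaming setting.

```python
import numpy as np


def robust_stats_by_column(path, quantile_range=(25.0, 75.0)):
    """Hypothetical helper: per-feature median and quantile range,
    computed one column at a time from a memory-mapped .npy file."""
    # mmap_mode="r" opens the file without reading it into memory.
    X = np.load(path, mmap_mode="r")
    n_features = X.shape[1]
    center = np.empty(n_features)
    scale = np.empty(n_features)
    for j in range(n_features):
        # np.array forces a copy, so only this column is loaded.
        col = np.array(X[:, j], dtype=float)
        center[j] = np.median(col)
        q_lo, q_hi = np.percentile(col, quantile_range)
        scale[j] = q_hi - q_lo
    scale[scale == 0.0] = 1.0  # avoid division by zero for constant columns
    return center, scale
```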
Additional context
No response