MAINT: randomly sample chunks for partial_fit calls #276
Conversation
I've had to reformat.

dask_ml/_partial.py (Outdated)

```python
nblocks = len(x.chunks[0])
order = list(range(nblocks))
random.shuffle(order)
```
For reproducibility, Incremental should take a random_state and then use that for shuffling the data.
I also think the shuffling should be optionally enabled / disabled (not sure what the default should be) by a hyper parameter.
Finally, I'm not sure what the parameter name should be. Typically scikit-learn uses shuffle=True/False. We can either re-use that, or call it shuffle_blocks, and reserve shuffle for when we actually shuffle data between blocks.
I think shuffling should be enabled by default. That's sklearn's default too, and it's what SGD theory points to.
I've used shuffle_blocks; that describes exactly what it does. We can reserve shuffle for full shuffling, and shuffle_each_block for shuffling within each block.
I have actually tested this now, and believe it's ready for merge.
Force-pushed from e955a83 to e3627f0.
This all seems fine to me. The sklearn-dev failures are interesting. Do we know if it is due to this PR or something upstream?
I don't see how it could be this PR. All the failures are with RobustScaler, which doesn't call fit or Incremental.
OK. Fixing it separately here: #283
There was a PR on RobustScaler that was merged earlier this morning: scikit-learn/scikit-learn#11308.
That would help explain things :)
mrocklin left a comment
Small comment. Otherwise this looks fine to me.
My guess is that @TomAugspurger is out of contact for a few days. I'll wait until this afternoon before merging in case he wants to jump in with something.
tests/test_incremental.py (Outdated)

```python
from dask.array.utils import assert_eq
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
import numpy.linalg as LA
```
The LA alias isn't very common. I'd prefer the fully qualified name if possible:

```python
import numpy as np

np.linalg.svd(...)
```
dask_ml/wrappers.py (Outdated)

```
a single NumPy array, which may exhaust the memory of your worker.
You probably want to always specify `scoring`.
random_state : int or numpy.random.RandomState
```
Append `, optional`.
dask_ml/wrappers.py (Outdated)

```
random_state : int or numpy.random.RandomState
    Random object that determines how to shuffle blocks.
shuffle_blocks : bool
```
Append `, default True`.
dask_ml/wrappers.py (Outdated)

```
    Random object that determines how to shuffle blocks.
shuffle_blocks : bool
    Whether to randomly shuffle the blocks or now
```
Typo: now -> not.
Could you elaborate on what this means? Specifically, distinguish shuffling the block order from shuffling within blocks (and from shuffling between blocks).
Thanks for the review @mrocklin and @TomAugspurger. I've addressed those comments.
Thanks!
Closes #274.
A screenshot of different `partial_fit` calls: I only show the before for some comparison, though it's not exact (and was in the debug process). It is "almost" because it uses a fixed seed every time, not a random seed. It does not run through the blocks in sequential order as mentioned in the previous docs.