MAINT: randomly sample chunks for partial_fit calls #276
Conversation
I've had to reformat.

dask_ml/_partial.py (Outdated)

```python
nblocks = len(x.chunks[0])
order = list(range(nblocks))
random.shuffle(order)
```
For reproducibility, Incremental should take a random_state and then use that for shuffling the data.
I also think the shuffling should be optionally enabled / disabled (not sure what the default should be) by a hyper parameter.
Finally, I'm not sure what the parameter name should be. Typically scikit-learn uses shuffle=True/False. We can either re-use that, or call it shuffle_blocks, and reserve shuffle for when we actually shuffle data between blocks.
I think shuffling should be enabled by default. That's sklearn's default too, and it's what SGD theory points to.
I've used shuffle_blocks; that describes exactly what it does. We can reserve shuffle for full shuffling, and shuffle_each_block for shuffling within each block.
I have actually tested this now, and believe it's ready for merge.
Force-pushed from e955a83 to e3627f0.
This all seems fine to me. The sklearn-dev failures are interesting. Do we know if it is due to this PR or something upstream?
I don't see how it could be this PR. All the failures are with RobustScaler, which doesn't call fit or Incremental.
OK. Fixing it separately here: #283
There was a PR on RobustScaler that was merged earlier this morning: scikit-learn/scikit-learn#11308.
That would help explain things :)
mrocklin left a comment
Small comment. Otherwise this looks fine to me.
My guess is that @TomAugspurger is out of contact for a few days. I'll wait until this afternoon before merging in case he wants to jump in with something.
tests/test_incremental.py (Outdated)

```python
from dask.array.utils import assert_eq
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
import numpy.linalg as LA
```
The LA alias isn't very common. I'd prefer the fully qualified name if possible:

```python
import numpy as np

np.linalg.svd(...)
```
dask_ml/wrappers.py (Outdated)

```
a single NumPy array, which may exhaust the memory of your worker.
You probably want to always specify `scoring`.
random_state : int or numpy.random.RandomState
```
Append `, optional`.
dask_ml/wrappers.py (Outdated)

```
random_state : int or numpy.random.RandomState
    Random object that determines how to shuffle blocks.
shuffle_blocks : bool
```
Append `, default True`.
dask_ml/wrappers.py (Outdated)

```
    Random object that determines how to shuffle blocks.
shuffle_blocks : bool
    Whether to randomly shuffle the blocks or now
```
Typo: now -> not.
Could you elaborate on what this means? Specifically, distinguish shuffling the block order from shuffling within blocks (and from shuffling between blocks).
Thanks for the review @mrocklin and @TomAugspurger. I've addressed those comments.
Thanks!
Closes #274.
A screenshot of different `partial_fit` calls: I only show the before for some comparison, though it's not exact (and was in the debug process). It is "almost" because it uses a fixed seed every time, not a random seed. It does not run through the blocks in sequential order as mentioned in the previous docs.