
Conversation

@stsievert
Member

@stsievert stsievert commented Jul 3, 2018

Closes #274.

A screenshot of different partial_fit calls:

[Screenshots: "Before" and "(almost) After" partial_fit block orderings, Jul 3 2018]

I show the before only for comparison, though it's not exact (it was captured during debugging). It is "almost" after because it uses a fixed seed every time, not a random seed; it does not run through the blocks in sequential order as the previous docs mentioned.

@stsievert
Member Author

I've had to reformat test_incremental_basic to train longer and assert that the results are close enough, not exactly the same.
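The looser assertion could look something like this (a sketch of the kind of check, with made-up coefficient values, not the actual test code):

```python
import numpy as np

# With shuffled blocks, coefficients from two training runs drift slightly,
# so the test compares with a tolerance instead of exact equality.
coef_a = np.array([1.00, 2.00, 3.00])
coef_b = np.array([1.01, 1.99, 3.02])

rel_error = np.linalg.norm(coef_a - coef_b) / np.linalg.norm(coef_a)
assert rel_error < 0.1  # "close enough", not identical
```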


nblocks = len(x.chunks[0])  # number of blocks along the first axis
order = list(range(nblocks))
random.shuffle(order)  # visit blocks in random order
Member

For reproducibility, Incremental should take a random_state and then use that for shuffling the data.

I also think the shuffling should be optionally enabled / disabled (not sure what the default should be) by a hyper parameter.

Finally, I'm not sure what the parameter name should be. Typically scikit-learn uses shuffle=True/False. We can either re-use that, or call it shuffle_blocks, and reserve shuffle for when we actually shuffle data between blocks.
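A minimal sketch of what that could look like (the function name and defaults here are illustrative, not from this PR; the seed handling mirrors scikit-learn's `check_random_state` convention):

```python
import numpy as np

def ordered_blocks(nblocks, random_state=None, shuffle_blocks=True):
    """Return the order in which to visit blocks for one pass."""
    # Accept an int seed, a RandomState instance, or None, mirroring
    # sklearn.utils.check_random_state.
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    order = list(range(nblocks))
    if shuffle_blocks:
        random_state.shuffle(order)
    return order
```

With an int `random_state`, repeated calls return the same order, which is what makes shuffled training reproducible; with `None`, each call draws a fresh order.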

Member Author

I think shuffling should be enabled by default. That's the sklearn default too, and it's what SGD theory points to.

I've used shuffle_blocks; that describes exactly what it does. We can use shuffle for full shuffling, and shuffle_each_block to shuffle within each block.
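To illustrate how the three names would differ in behavior (only shuffle_blocks is in this PR; shuffle and shuffle_each_block are hypothetical here):

```python
import numpy as np

rng = np.random.RandomState(0)
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# shuffle_blocks: visit whole blocks in a random order (what this PR does).
order = rng.permutation(len(blocks))
by_block = np.concatenate([blocks[i] for i in order])

# shuffle_each_block (hypothetical): permute rows within each block,
# keeping the block order fixed.
within = [b[rng.permutation(len(b))] for b in blocks]

# shuffle (hypothetical): a full shuffle across block boundaries.
flat = np.concatenate(blocks)
full = flat[rng.permutation(len(flat))]
```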

@stsievert stsievert changed the title WIP: MAINT: randomly sample chunks for partial_fit calls MAINT: randomly sample chunks for partial_fit calls Jul 4, 2018
@stsievert
Member Author

I have actually tested this now, and believe it's ready for merge.

@stsievert stsievert force-pushed the fit-block-random-ordering branch from e955a83 to e3627f0 Compare July 4, 2018 17:24
@mrocklin
Member

mrocklin commented Jul 5, 2018

This all seems fine to me. The sklearn-dev failures are interesting. Do we know if it is due to this PR or something upstream?

@stsievert
Member Author

stsievert commented Jul 5, 2018

I don't see how it could be this PR. All the failures are with RobustScaler, which doesn't call fit or Incremental.

@mrocklin
Member

mrocklin commented Jul 5, 2018 via email

@stsievert
Member Author

There was a PR touching RobustScaler that was merged earlier this morning: scikit-learn/scikit-learn#11308.

@mrocklin
Member

mrocklin commented Jul 5, 2018 via email

Member

@mrocklin left a comment

Small comment. Otherwise this looks fine to me.

My guess is that @TomAugspurger is out of contact for a few days. I'll wait until this afternoon before merging in case he wants to jump in with something.

from dask.array.utils import assert_eq
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier
import numpy.linalg as LA
Member

The LA alias isn't very common. I'd prefer the fully qualified name if possible.

import numpy as np


np.linalg.svd(...)

a single NumPy array, which may exhaust the memory of your worker.
You probably want to always specify `scoring`.
random_state : int or numpy.random.RandomState
Member

Append ", optional".

random_state : int or numpy.random.RandomState
Random object that determines how to shuffle blocks.
shuffle_blocks : bool
Member
Append ", default True".

Random object that determines how to shuffle blocks.
shuffle_blocks : bool
Whether to randomly shuffle the blocks or now
Member

typo: now -> not.

Could you elaborate on what this means? Specifically, distinguish shuffling the block order from shuffling within each block (and from shuffling data between blocks).

@stsievert
Member Author

Thanks for the review @mrocklin and @TomAugspurger. I've addressed those comments.

@TomAugspurger TomAugspurger mentioned this pull request Jul 6, 2018
@TomAugspurger TomAugspurger merged commit 7444a98 into dask:master Jul 10, 2018
@TomAugspurger
Member

Thanks!


Development

Successfully merging this pull request may close these issues.

Incremental.partial_fit does not randomly shuffle blocks on repeated calls
