Hugo/initial work #1
Conversation
I left some initial comments, mostly small stuff.
I also have a higher-level question: this seems super overfit to working with image data. Is that just the first thing you've used it for, or the long-term goal? If it's the long-term goal, then I think the project name should become dask-pytorch-image or dask-pytorch-cv or something, to be clear that you can't use this for, say, NLP work. If it's not the goal, then the README should explain that right now it only works with image data, but that other data types are coming in the future.
```python
workers = client.scheduler_info()["workers"]
worker_keys = sorted(workers.keys())
host = workers[worker_keys[0]]["host"]
return worker_keys, host
```
Should the first worker be popped off of `worker_keys` if `len(worker_keys) > 1`?
I'm thinking about the case where some training or data loading work gets scheduled onto the first worker and blows out that worker's memory, killing it. If that happens, the "master" process would also be killed and you'd be in trouble, right?
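Here's roughly what I had in mind (just a sketch of the idea; the function name and shape are made up, not the actual code in this PR):

```python
def _get_worker_info_reserving_master(client):
    """Hypothetical variant of the helper above: keep the first worker's host
    as the DDP master address, but don't schedule training work onto it."""
    workers = client.scheduler_info()["workers"]
    worker_keys = sorted(workers.keys())
    host = workers[worker_keys[0]]["host"]
    if len(worker_keys) > 1:
        # Reserve the first worker so a memory blowup during training
        # can't take the "master" down with it.
        worker_keys = worker_keys[1:]
    return worker_keys, host
```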
I'm not sure that's possible. I think the "master" has to participate in the DDP process group in order to sync the gradients around.
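For context, the bootstrap I'm picturing is the standard DDP setup (a sketch assuming the default env:// init; the function name, port, and backend choice are illustrative, not what's in this PR):

```python
import datetime
import os

import torch.distributed as dist

def init_ddp(master_host: str, rank: int, world_size: int) -> None:
    # Every process, including rank 0 on the "master" host, joins the
    # process group and takes part in the gradient all-reduce, which is
    # why the master can't just sit out of training.
    os.environ["MASTER_ADDR"] = master_host
    os.environ["MASTER_PORT"] = "23456"  # any free port; hard-coded for brevity
    dist.init_process_group(
        backend="nccl",  # or "gloo" for CPU-only workers
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=30),
    )
```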
Added a few suggestions/comments, but by and large this looks nice and I'm very excited!
dask_pytorch/dispatch.py (outdated)

```python
)
index_to_fut[idx] = fut

return [index_to_fut[x] for x in range(len(worker_keys))]
```
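For context, the dispatch pattern this snippet comes from is roughly the following (a sketch only; `run_on_each_worker` and its arguments are made up for illustration, not the actual API in dispatch.py):

```python
from typing import Callable, List

from dask.distributed import Client, Future

def run_on_each_worker(client: Client, fn: Callable, worker_keys: List[str]) -> List[Future]:
    # Submit `fn` once per Dask worker, pin each task to its worker with
    # `workers=`, and return the futures in the same order as `worker_keys`.
    index_to_fut = {}
    for idx, worker in enumerate(worker_keys):
        fut = client.submit(
            fn,
            workers=[worker],           # run this task on this specific worker
            allow_other_workers=False,  # don't let the scheduler move it elsewhere
            pure=False,                 # treat each call as a distinct task
        )
        index_to_fut[idx] = fut
    return [index_to_fut[x] for x in range(len(worker_keys))]
```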
This is a good point. The library is specific to PyTorch and should be usable no matter what type of PyTorch thing you're trying to do. The only thing specific to image processing is the
Maybe clarify this point in the README?
Added a paragraph about this in the README.
I mostly looked at the README. It looks pretty good. Most of my comments are style nits or clarifications. Here are the questions I have:
- Would it be possible to create a dashboard plugin for `DaskResultsHandler` to show training progress on the Dask dashboard? If so, I'd be interested in seeing training stats alongside testing stats.
- How resilient is dask-pytorch to a variable Dask cluster size? For example, let's say I'm using spot instances on EC2 and all but one of my workers die. Does that mean the optimization is run with a larger learning rate?
I haven't looked a ton at the implementation.
I was going to plug this into TensorBoard; what do you think? (I haven't looked at how folks use TensorBoard, but it seems like the dominant choice for things like this.)
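Roughly what I had in mind on the TensorBoard side (just a sketch; the function name, log directory, and per-rank layout are placeholders, not anything in this PR):

```python
from torch.utils.tensorboard import SummaryWriter

def log_training_loss(losses, rank, log_dir="/tmp/tb_logs"):
    # Each worker writes to its own subdirectory so the TensorBoard UI
    # can overlay the loss curves from all ranks.
    writer = SummaryWriter(log_dir=f"{log_dir}/worker_{rank}")
    for step, loss in enumerate(losses):
        writer.add_scalar("train/loss", loss, global_step=step)
    writer.close()
```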
Not at all. I need to look at what torchelastic does; from what I've read, I think they reload from a checkpoint and restart training when a worker dies. Right now, when a worker dies, the current training step just hangs forever, so that's something I need to implement. I'm going to try to make the tasks fail when this happens, but I'm not sure if that's possible (maybe I can pass a timeout into the PyTorch comms). I was going to use the result handler to store checkpoints, but I haven't implemented reloading from a checkpoint or any of the rest of that yet.
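The checkpoint save/reload piece would probably look something like this (a sketch only; the paths and function names are placeholders, and none of this is implemented yet):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="/tmp/checkpoint.pt"):
    # Persist everything needed to resume: model weights, optimizer state,
    # and the epoch we just finished.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="/tmp/checkpoint.pt"):
    # Restore the saved state and return the epoch to resume after.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```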
No problem. I'm confident that the implementation is functional, but it could use a lot of work on things like resilience and recovery.
I think that's a solid choice for logging training progress. Neptune also might be a good choice; they even have a comparison with TensorBoard.
I left a few more small suggestions. Leaving a "Comment" review so @skirmer's review is the one that ultimately determines when this gets merged.
There are other things we can do (like adding CI to run the tests and push to PyPI on releases) that don't have to be part of this PR.
Great work!
@hhuuggoo I think we're at a good stopping point; let me know if anything else needs edits before we merge.