Hugo/initial work #1
Conversation
I left some initial comments, mostly small stuff.
I also have a higher-level question: this seems super overfit to working with image data. Is that just the first thing you've used it for, or the long-term goal? If it's the long-term goal, then I think the project name should become dask-pytorch-image or dask-pytorch-cv or something, to be clear that you can't use this for, say, NLP work. If it's not the goal, then the README should explain that right now it only works with image data, but that other data types are coming in the future.
```python
workers = client.scheduler_info()["workers"]
worker_keys = sorted(workers.keys())
host = workers[worker_keys[0]]["host"]
return worker_keys, host
```
Should the first worker be popped off of `worker_keys` if `len(worker_keys) > 1`?
I'm thinking about the case where some training or data loading work gets scheduled onto the first worker and blows out that worker's memory, killing it. If that happens, the "master" process would also be killed and you'd be in trouble, right?
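Here's roughly what I had in mind (just a sketch of the idea; the function name and shape are made up, not the actual code in this PR):

```python
def _get_worker_info_reserving_master(client):
    """Hypothetical variant of the helper above: keep the first worker's host
    as the DDP master address, but don't schedule training work onto it."""
    workers = client.scheduler_info()["workers"]
    worker_keys = sorted(workers.keys())
    host = workers[worker_keys[0]]["host"]
    if len(worker_keys) > 1:
        # Reserve the first worker so a memory blowup during training
        # can't take the "master" down with it.
        worker_keys = worker_keys[1:]
    return worker_keys, host
```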
I'm not sure that's possible. I think the "master" has to participate in the DDP process group in order to sync the gradients around.
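For context, the bootstrap I'm picturing is the standard DDP setup (a sketch assuming the default env:// init; the function name, port, and backend choice are illustrative, not what's in this PR):

```python
import datetime
import os

import torch.distributed as dist

def init_ddp(master_host: str, rank: int, world_size: int) -> None:
    # Every process, including rank 0 on the "master" host, joins the
    # process group and takes part in the gradient all-reduce, which is
    # why the master can't just sit out of training.
    os.environ["MASTER_ADDR"] = master_host
    os.environ["MASTER_PORT"] = "23456"  # any free port; hard-coded for brevity
    dist.init_process_group(
        backend="nccl",  # or "gloo" for CPU-only workers
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=30),
    )
```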
Added a few suggestions/comments, but by and large this looks nice and I'm very excited!
dask_pytorch/dispatch.py (outdated)

```python
)
index_to_fut[idx] = fut

return [index_to_fut[x] for x in range(len(worker_keys))]
```
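For context, the dispatch pattern this snippet comes from is roughly the following (a sketch only; `run_on_each_worker` and its arguments are made up for illustration, not the actual API in dispatch.py):

```python
from typing import Callable, List

from dask.distributed import Client, Future

def run_on_each_worker(client: Client, fn: Callable, worker_keys: List[str]) -> List[Future]:
    # Submit `fn` once per Dask worker, pin each task to its worker with
    # `workers=`, and return the futures in the same order as `worker_keys`.
    index_to_fut = {}
    for idx, worker in enumerate(worker_keys):
        fut = client.submit(
            fn,
            workers=[worker],           # run this task on this specific worker
            allow_other_workers=False,  # don't let the scheduler move it elsewhere
            pure=False,                 # treat each call as a distinct task
        )
        index_to_fut[idx] = fut
    return [index_to_fut[x] for x in range(len(worker_keys))]
```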
This is a good point. The library is specific to PyTorch and should be usable no matter what type of PyTorch thing you're trying to do. The only thing specific to image processing is the
Maybe clarify this point in the README?
Added a paragraph about this in the README.
I mostly looked at the README. It looks pretty good. Most of my comments are style nits or clarifications. Here are the questions I have:
- Would it be possible to create a dashboard plugin for `DaskResultsHandler` to show training progress on the Dask dashboard? If so, I'd be interested in seeing training stats alongside testing stats.
- How resilient is dask-pytorch to a variable Dask cluster size? For example, let's say I'm using spot instances on EC2 and all but one of my workers die. Does that mean the optimization is run with a larger learning rate?
I haven't looked a ton at the implementation.
I was going to plug this into TensorBoard; what do you think? (I haven't looked at how folks use TensorBoard, but it seems like the dominant choice for things like this.)
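Roughly what I had in mind on the TensorBoard side (just a sketch; the function name, log directory, and per-rank layout are placeholders, not anything in this PR):

```python
from torch.utils.tensorboard import SummaryWriter

def log_training_loss(losses, rank, log_dir="/tmp/tb_logs"):
    # Each worker writes to its own subdirectory so the TensorBoard UI
    # can overlay the loss curves from all ranks.
    writer = SummaryWriter(log_dir=f"{log_dir}/worker_{rank}")
    for step, loss in enumerate(losses):
        writer.add_scalar("train/loss", loss, global_step=step)
    writer.close()
```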
Not at all. I need to look at what torchelastic does; from what I've read, I think they reload from a checkpoint and restart training when a worker dies. Right now, when a worker dies, the current training step just hangs forever, so that's something I need to implement. I'm going to try to make the tasks fail when this happens, but I'm not sure if that's possible (maybe I can pass a timeout into the PyTorch comms). I was going to use the result handler to store checkpoints, but I haven't implemented reloading from a checkpoint or any of the rest of that yet.
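The checkpoint save/reload piece would probably look something like this (a sketch only; the paths and function names are placeholders, and none of this is implemented yet):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="/tmp/checkpoint.pt"):
    # Persist everything needed to resume: model weights, optimizer state,
    # and the epoch we just finished.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="/tmp/checkpoint.pt"):
    # Restore the saved state and return the epoch to resume after.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"]
```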
No problem. I'm confident that the implementation is functional, but it could use a lot of work on things like resilience and recovery.
I think that's a solid choice for logging training progress. Neptune also might be a good choice; they even have a comparison with TensorBoard.
I left a few more small suggestions. Leaving a "Comment" review so @skirmer's review is the one that ultimately determines when this gets merged.
There are other things we can do (like adding CI to run the tests and push to PyPI on releases) that don't have to be part of this PR.
Great work!
@hhuuggoo I think we're at a good stopping point; let me know if anything else needs edits before we merge.