
Hugo/initial work #1


Merged
merged 17 commits into main on Nov 16, 2020

Conversation

hhuuggoo
Contributor

No description provided.

Contributor

@jameslamb jameslamb left a comment


I left some initial comments, mostly small stuff.

I also have a higher-level question

This seems super overfit to working with image data. Is that just the first thing you've used it for, or the long-term goal?

If it's the long-term goal then I think the project name should become dask-pytorch-image or dask-pytorch-cv or something, to be clear that you can't use this for, say, NLP work. If it's not the goal, then the README should explain that right now it only works with image data but that other data types are coming in the future.

workers = client.scheduler_info()["workers"]
worker_keys = sorted(workers.keys())
host = workers[worker_keys[0]]["host"]
return worker_keys, host
Contributor


Should the first worker be popped off of worker_keys if len(worker_keys) > 1?

I'm thinking about the case where some training or data loading work gets scheduled onto the first worker and blows out that worker's memory, killing it. If that happens, the "master" process would also be killed and you'd be in trouble, right?

Contributor Author


I'm not sure if that's possible. I think the "master" has to do stuff with DDP in order to sync the gradients around.

Contributor

@skirmer skirmer left a comment


Added a few suggestions/comments, but by and large it looks nice and I am very excited!


@hhuuggoo
Contributor Author

> I left some initial comments, mostly small stuff.
>
> I also have a higher-level question
>
> This seems super overfit to working with image data. Is that just the first thing you've used it for, or the long-term goal?
>
> If it's the long-term goal then I think the project name should become dask-pytorch-image or dask-pytorch-cv or something, to be clear that you can't use this for, say, NLP work. If it's not the goal, then the README should explain that right now it only works with image data but that other data types are coming in the future.

This is a good point. The library is specific to PyTorch and should be usable no matter what type of PyTorch work you're trying to do. The only thing specific to image processing is the ImageFolder dataset class. I want to start accumulating common datasets here, but I think it's pretty common to implement your own. Implementing a PyTorch dataset (assuming map-style random access) just requires implementing __getitem__(self, idx: int) and __len__(self).
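For illustration only (nothing below is part of this PR; the class and the data it holds are made up), a minimal map-style dataset could look like:

```python
# Hypothetical minimal map-style dataset: all PyTorch needs is __len__ and
# __getitem__ with integer random access; DataLoader handles the rest.
from torch.utils.data import Dataset


class PairsDataset(Dataset):
    def __init__(self, pairs):
        # `pairs` is any indexable sequence of (sample, label) tuples
        self.pairs = pairs

    def __len__(self):
        # total number of samples
        return len(self.pairs)

    def __getitem__(self, idx: int):
        # return the sample/label pair at position `idx`
        return self.pairs[idx]
```

A torch.utils.data.DataLoader can wrap a dataset like this directly, whether the samples are images, text, or anything else.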

@hhuuggoo
Contributor Author


Maybe clarifying this point in the README?

@skirmer skirmer requested a review from jameslamb November 12, 2020 16:06
@skirmer
Contributor

skirmer commented Nov 12, 2020

> Maybe clarifying this point in the README?

Added a paragraph about this in the README.


@stsievert stsievert left a comment


I mostly looked at the README. It looks pretty good. Most of my comments are style nits or clarifications. Here are the questions I have:

  1. Would it be possible to create a dashboard plugin for DaskResultsHandler to show training progress on the Dask dashboard? If so, I'd be interested in seeing training stats alongside testing stats.
  2. How resilient is dask-pytorch to a variable Dask cluster size? For example, let's say I'm using spot instances on EC2 and all but one of my workers die. Does that mean the optimization is run with a larger learning rate?

I haven't looked a ton at the implementation.

@hhuuggoo
Contributor Author

> I mostly looked at the README. It looks pretty good. Most of my comments are style nits or clarifications. Here are the questions I have:
>
>   1. Would it be possible to create a dashboard plugin for DaskResultsHandler to show training progress on the Dask dashboard? If so, I'd be interested in seeing training stats alongside testing stats.

I was going to plug this into tensorboard - what do you think? (I haven't looked at how folks use tensorboard, but it seems like the dominant choice for things like this)
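As a rough sketch (not implemented in this PR; the metric names, log directory, and placeholder values are made up), the wiring could be as simple as PyTorch's built-in SummaryWriter:

```python
# Sketch only: stream per-step losses to TensorBoard with SummaryWriter.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tb-logs")  # hypothetical log directory

# `metrics` stands in for whatever sequence of (train_loss, test_loss) values
# the results handler collects; the numbers below are placeholders.
metrics = [(0.9, 1.0), (0.7, 0.8), (0.5, 0.6)]

for step, (train_loss, test_loss) in enumerate(metrics):
    writer.add_scalar("loss/train", train_loss, step)
    writer.add_scalar("loss/test", test_loss, step)
writer.close()
```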

>   2. How resilient is dask-pytorch to a variable Dask cluster size? For example, let's say I'm using spot instances on EC2 and all but one of my workers die. Does that mean the optimization is run with a larger learning rate?

Not at all. I need to look at what torchelastic does - from what I've read, I think they reload from checkpoint and restart training when a worker dies. Right now when a worker dies, the current training step just hangs forever. So that's something I need to implement. I'm going to try to cause the tasks to fail when this happens, but I'm not sure if that is possible (maybe I can pass a timeout into the PyTorch comms). I was going to use the result handler stuff to store checkpoints, but haven't implemented reload from checkpoint or any of that other stuff.
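The checkpoint piece would probably follow the standard PyTorch save/load pattern; a sketch only (none of this is wired into the library yet, and the model, optimizer, and path below are placeholders):

```python
# Standard PyTorch checkpoint pattern (sketch only, not part of this PR).
# The tiny model/optimizer are stand-ins for the real training objects.
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epoch = 0
checkpoint_path = "/tmp/checkpoint.pt"  # hypothetical location

# Save state on some schedule during training.
torch.save(
    {
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    },
    checkpoint_path,
)

# On restart, reload and resume from the next epoch.
state = torch.load(checkpoint_path)
model.load_state_dict(state["model_state"])
optimizer.load_state_dict(state["optimizer_state"])
start_epoch = state["epoch"] + 1
```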

> I haven't looked a ton at the implementation.

No problem. I'm confident that the implementation is functional, but it could use a lot of work on things like resilience and recovery.

@stsievert

> I was going to plug this into tensorboard - what do you think?

I think that's a solid choice for logging training progress. Neptune also might be a good choice; they even have a comparison with TensorBoard.

Contributor

@jameslamb jameslamb left a comment


I left a few more small suggestions. Leaving a "Comment" review so @skirmer's review is the one that ultimately determines when this gets merged.

There are other things we can do (like adding CI to run the tests and push to PyPI on releases) that don't have to be part of this PR.

Great work!

@skirmer skirmer self-requested a review November 16, 2020 20:32
Contributor

@skirmer skirmer left a comment


@hhuuggoo I think we're at a good stopping point; let me know if anything else needs edits before we merge.

@jameslamb jameslamb self-requested a review November 16, 2020 20:52
@hhuuggoo hhuuggoo merged commit d59de9f into main Nov 16, 2020