Adding a Dask best practices section to the user guide #5190

HGWright · 2023-03-10T11:21:23Z

🚀 Pull Request

Description

I have brought the internal Dask best practice advice and examples into the Iris documentation, replacing specific internal information with more generic language that is better suited to public documentation.

This should be linked to #4959, but does not fully close the issue.

Consult Iris pull request check list

codecov · 2023-03-10T11:33:26Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (48e3a86) 89.37% compared to head (aac31cd) 89.37%.

❗ Current head aac31cd differs from pull request most recent head 95540f3. Consider uploading reports for the commit 95540f3 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #5190   +/-   ##
=======================================
  Coverage   89.37%   89.37%           
=======================================
  Files          89       89           
  Lines       22419    22419           
  Branches     5380     5380           
=======================================
  Hits        20036    20036           
  Misses       1637     1637           
  Partials      746      746

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

lbdreyer

Looking good, thanks @HGWright !

A few general comments:

- There are a few terms that need tidying up. I think we were quite lazy in the draft Dask best practices docs, but we need to be more careful here if this is to be included in the Iris docs. In particular, the following need to be updated:
  - numpy -> NumPy
  - netcdf -> netCDF (I don't believe this needs to be capitalised)
  - dask -> Dask (it should always be capitalised)
- We have used "CPU's" in quite a few places, but the apostrophe is incorrect, so that should just be "CPUs".
- You have used the term "multiprocessing system", but I think a term like "computing cluster" would be more appropriate.
- There are a few examples of MO-specific sections that need generalising a bit more. I have added specific comments where this is required.

docs/src/whatsnew/latest.rst

docs/src/further_topics/dask_best_practices/index.rst

docs/src/further_topics/dask_best_practices/dask_bags_and_greed.rst

docs/src/further_topics/dask_best_practices/index.rst

lbdreyer

This is looking very close to being ready to merge!
There are just two outstanding issues that I can see:

- Malformed tables are causing the docs tests to fail. I suspect this could be solved just by adding back the extra spaces that were lost when you changed "CPU's" to "CPUs". See my suggestions below.
- "This branch is out-of-date with the base branch": I've never seen this error before. It might be alluding to a merge conflict? But maybe something else!

docs/src/further_topics/dask_best_practices/dask_bags_and_greed.rst

pp-mo · 2023-03-30T14:30:54Z

> "This branch is out-of-date with the base branch"

This is not an error.
I think it is just something that GitHub has started offering as an option -- basically, an automatic merge-back from the target branch, or rebase onto it.
I don't really understand the benefit of these, if there are no conflicts, but possibly it enables you to see more exactly what the result will be when merged back (e.g. docs builds).

pp-mo · 2023-03-30T14:59:01Z

How open are we to further modifying this now?
I'm just re-reading some of the content and clearly some things could be improved.
Obviously this content is quite old now, and besides what may have changed, I think in some places our understanding may also have improved since it was written.

pp-mo · 2023-03-30T15:03:29Z

> How open are we to further modifying this

Some random things I have noticed as I re-read it (but we can still address them afterwards/elsewhere):

- In the "PP and Fieldsfiles" section (under Chunking), we should probably say that the same applies to GRIB too.
- In the "Netcdf Files" section (under Chunking), we might usefully explain that:
  - one can somewhat adjust how Iris chunks netCDF data by setting the target chunk size, e.g. `dask.config.set({'array.chunk-size': '250MiB'})`, but
  - sometimes the default choice will not suit your usage (e.g. the access to vertical slices in the "Parallelising a Loop of Multiple Calls to a Third Party Library" section),
  - there is no direct control over input chunking (at present), and especially
  - rechunking cannot fix how data is fetched from files.
- In the "Dask bags and greedy parallelism" example, we should probably mention that:
  - (A) Bags use a process-based scheduler by default,
  - (B) Iris lazy computation does not function with a process-based scheduler, but that doesn't matter here because Iris loading only constructs Dask arrays and never computes them (I think?),
  - (C) use of distributed is often/usually advised as better than 'processes' (as already noted in this section).
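
As a rough illustration of two of the points above, the sketch below uses Dask alone. Iris is not involved, and the array shape, the chunk-size limits and the helper name `count_chunks` are assumptions chosen for the example. It shows that the `array.chunk-size` setting changes how many chunks an automatically chunked array gets, and that a Bag computation can be given an explicit scheduler rather than relying on its process-based default.

```python
import dask
import dask.array as da
import dask.bag as db


def count_chunks(limit):
    """Create a lazy array under a temporary chunk-size limit; return its chunk count.

    `count_chunks` is a hypothetical helper for this sketch, not part of Dask or Iris.
    """
    with dask.config.set({"array.chunk-size": limit}):
        # chunks="auto" asks Dask to pick chunks that respect the configured limit.
        # The array is float64, so it holds 4000 * 4000 * 8 bytes in total.
        arr = da.ones((4000, 4000), chunks="auto")
    return arr.npartitions


few_chunks = count_chunks("250MiB")   # generous limit: large chunks
many_chunks = count_chunks("1MiB")    # tight limit: many small chunks
print(few_chunks, many_chunks)

# Bags default to a process-based scheduler; here a threaded one is requested
# explicitly at compute time instead.
bag = db.from_sequence(range(8), npartitions=4)
squares = bag.map(lambda x: x * x).compute(scheduler="threads")
print(squares)
```

Passing `scheduler="processes"` to `compute` would select the process-based scheduler explicitly instead. Note that the chunk-size setting only affects how chunks are chosen when the array is created; it does not change the chunking of an array that already exists.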

lbdreyer · 2023-03-30T15:35:00Z

> How open are we to further modifying this now? I'm just re-reading some of the content and clearly some things could be improved. Obviously this content is quite old now, and besides what may have changed, I think in some places our understanding may also have improved since it was written.

I'd be in support of improving things before this gets added to a release.

Are we intending this to go in Iris 3.5?

pp-mo · 2023-03-30T15:52:29Z

> Are we intending this to go in Iris 3.5?

I had thought so, but I see it's not actually on the board.
@ESadek-MO can you explain?

HGWright · 2023-06-02T11:23:19Z

@pp-mo & @lbdreyer: given that I have lost some momentum with this, and that this is already quite a big PR, my preference would be to bank it as it is and then open a new issue for the improvements. Then at least the information is out there.

I think this should be good to go if that's what we are doing.

lbdreyer

Looks good to me, just one final small change and then this should be ready to merge. Could you also create an issue to capture the extra work that @pp-mo suggests?

docs/src/whatsnew/latest.rst

Co-authored-by: lbdreyer <[email protected]>

lbdreyer · 2023-06-12T10:16:04Z

Great work @HGWright! Please remember to create a new issue to address these points.

HGWright · 2023-06-12T10:18:30Z

Thanks @lbdreyer, great to finally get this across the line. For the new issue, please see #5344.

HGWright requested a review from lbdreyer March 10, 2023 11:26

HGWright removed the request for review from lbdreyer March 10, 2023 12:28

lbdreyer self-assigned this Mar 16, 2023

lbdreyer requested changes Mar 20, 2023

HGWright force-pushed the dask-bp branch from 84155af to 4a12f9b Compare March 24, 2023 15:16

lbdreyer requested changes Mar 28, 2023

docs/src/further_topics/dask_best_practices/dask_bags_and_greed.rst (resolved)

docs/src/further_topics/dask_best_practices/dask_bags_and_greed.rst (resolved)

pp-mo mentioned this pull request Mar 30, 2023

Lazy netcdf saves #5191

Merged

7 tasks

HGWright added 5 commits June 2, 2023 10:55

Adding a Dask best practices section to the user guide

9f7421a

Updated example 2, adjusted internal MO references

d710d5c

making requested changes from review

1c57220

fixing merge conflict and rest of requested changes

222fc97

finishing requested changes?

97cdaaa

HGWright force-pushed the dask-bp branch from 4a12f9b to 97cdaaa Compare June 2, 2023 10:06

fixing dask docs link for linkcheck

aac31cd

lbdreyer requested changes Jun 12, 2023

docs/src/whatsnew/latest.rst (outdated, resolved)

Update docs/src/whatsnew/latest.rst

95540f3

Co-authored-by: lbdreyer <[email protected]>

HGWright mentioned this pull request Jun 12, 2023

Add to and improve the information in the Dask best practices section #5344

Closed

lbdreyer approved these changes Jun 12, 2023

lbdreyer merged commit 18d24a9 into SciTools:main Jun 12, 2023

github-actions bot mentioned this pull request Jun 12, 2023

Performance Shift(s): 18d24a97 #5345

Closed

trexfeathers mentioned this pull request Feb 19, 2024

Add user advice on Dask "best practices" #4959

Closed
