ENH: Parallel gallery generation #877

Merged (44 commits) on Jul 10, 2024

@jschueller jschueller commented Oct 28, 2021

Closes #25

@jschueller jschueller mentioned this pull request Oct 28, 2021
@jschueller jschueller force-pushed the paral branch 2 times, most recently from 5d2372d to 9baa7d8 Compare October 28, 2021 08:14
@larsoner
Contributor

Thinking about this a bit more, I'd expect this not to work for (at least) the matplotlib, mayavi, and pyvista scrapers because these are all global-state based. And then there will be tricky interactions with reset_modules, which also by default does global state stuff at least for matplotlib. So I'm not sure this will ever work at least for the majority of our users :(

@larsoner
Contributor

I'll copy that over to the issue so more people can tell me how I'm wrong :)

@jschueller
Contributor Author

Maybe the Sphinx folks (@samdoran or @tk0miya) could give hints, or sklearn (@Titan-C @GaelVaroquaux) would be interested in speeding up their builds too?

@tk0miya

tk0miya commented Jan 11, 2022

It seems this handler is invoked during Sphinx's bootstrap process (on the builder-inited event), so there is no special support from the Sphinx framework.

@jschueller
Contributor Author

indeed, we're looking at parallelizing the gallery jobs at the extension level

@larsoner
Contributor

larsoner commented Jun 4, 2024

@jschueller there has been some renewed interest in this so I'm taking a stab at actually making it work. I think it's close but I need to fix a test and maybe a couple of examples, we'll see!

@larsoner larsoner marked this pull request as ready for review June 5, 2024 00:44
@jschueller
Contributor Author

> @jschueller there has been some renewed interest in this so I'm taking a stab at actually making it work. I think it's close but I need to fix a test and maybe a couple of examples, we'll see!

this is great!

@larsoner larsoner changed the title Try at parallel gallery generation ENH: Parallel gallery generation Jun 5, 2024
Contributor Author

@jschueller jschueller left a comment

It seems to work! Thanks a lot
Do you plan a new release with this?

@larsoner
Contributor

larsoner commented Jun 5, 2024

I still need to figure out how to test this. And also decide if joblib is the right thing to use for parallelization. I suspect if I just used multiprocessing with spawn then we maybe could keep memory measurements working, which would be nice. I have to think about it...

But yes I think once this is in and maybe @jschueller you and @lagru run some preliminary tests we could cut a release. I'll probably write in the docs that it's new/experimental, though, since I do expect there to be issues just because parallelization seems to always create them!

  • Actually test
  • Try multiprocessing maybe?
  • Add docs saying it's experimental

@jschueller
Contributor Author

jschueller commented Jun 5, 2024

I already tested it on a real project (openturns) and it just works

@larsoner
Contributor

Okay pushed a doc update -- @lucyleeow feel free to review and merge if happy when you get a chance!

@larsoner
Contributor

Okay I realized that for MNE-Python we treat all warnings in doc builds/examples as errors, so I needed to prevent raising the UserWarning that joblib emits as an error. Then I could test a few cases:

MNE-Python

No examples (noplot)

This build takes ~8 min.

All examples, parallel=1 (serial)

Four slowest examples took:

    - ../examples/datasets/opm_data.py:                                       57.88 sec   0.0 MB
    - ../examples/time_frequency/source_power_spectrum_opm.py:                55.67 sec   0.0 MB
    - ../examples/datasets/spm_faces_dataset.py:                              49.65 sec   0.0 MB
    - ../tutorials/io/60_ctf_bst_auditory.py:                                 45.00 sec   0.0 MB

the full doc build takes about an hour:

real	59m47.935s
user	54m54.940s

So ~52 minutes to build the examples.

All examples, parallel=4

My machine has 4 physical cores so I tried this next and saw a bit of CPU oversubscription by eye and looking at the four slowest examples:

    - ../examples/datasets/opm_data.py:                                       68.57 sec   0.0 MB
    - ../examples/inverse/multi_dipole_model.py:                              63.33 sec   0.0 MB
    - ../examples/datasets/spm_faces_dataset.py:                              57.83 sec   0.0 MB
    - ../tutorials/preprocessing/40_artifact_correction_ica.py:               50.90 sec   0.0 MB

but the overall build time was cut in half (hooray!):

real	27m9.492s
user	59m48.404s

And the example part of that is ~19min.

So our speedup for the example-running portion is ~52min down to ~19min! A factor of ~2.7 which is pretty good I think!
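
The arithmetic behind that estimate can be reproduced from the two `real` timings above. This is only a sketch: it assumes the ~8 min no-example build is a fixed baseline that can be subtracted from both runs, and the variable names are made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a `time`-style "XmY.Zs" reading to seconds."""
    return 60 * minutes + seconds


baseline = to_seconds(8, 0)        # approximate no-example build (~8 min)
serial = to_seconds(59, 47.935)    # `real` time with parallel=1
parallel = to_seconds(27, 9.492)   # `real` time with parallel=4

# Time spent running examples, assuming the baseline is constant.
serial_examples = serial - baseline
parallel_examples = parallel - baseline

speedup = serial_examples / parallel_examples
efficiency = speedup / 4           # fraction of ideal scaling on 4 jobs

print(f"examples: {serial_examples / 60:.0f} min -> {parallel_examples / 60:.0f} min")
print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
```

The efficiency figure is below 100%, which fits the oversubscription noted above: individual examples ran slower under parallel=4 even though total wall time dropped.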

There is one fancy example that has a custom scraper in MNE that I don't think I could actually get to work properly in parallel based on what it does (creates HTML files, sets them to be copied later, etc.). I think we'll want something like a # sphinx_gallery_non_parallel comment or something that forces it to be run in non-parallel mode, but I can do that in a follow-up PR I think. I don't think it will be very complicated but don't want to add more to this PR!

@lucyleeow
Contributor

lucyleeow commented Jun 27, 2024

Sorry, I've missed the ping, I'll take a quick look today.

Contributor

@lucyleeow lucyleeow left a comment

Thanks! I pushed a small commit to note version support drop in CHANGES.rst

I know you've noted more fixes to do, happy to look again later.

"sphinx": ("https://www.sphinx-doc.org/en/master", None),
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
}
intersphinx_mapping = get_intersphinx_mapping(
Contributor

🎉

f"{gallery_conf['show_memory']=} disabled due to "
f"{gallery_conf['parallel']=}."
)
gallery_conf["show_memory"] = False
Contributor

Question - how is the update to gallery_conf["show_memory"] passed outside the function ?

Contributor

As a dict, gallery_conf is mutable, so the change does not need to be passed back; changing it here changes it everywhere.

Contributor

Welp I returned gallery_conf unnecessarily in my configuration clean up PR then. Will fix this later

elif "passing" in out_vars:
    assert "stale" not in out_vars
    gallery_conf["passing_examples"].append(src_file)
elif "stale" in out_vars:  # non-executable files have none of these three
Contributor

what are "these three"?

Would it be clearer to say that 'stale' examples are 'not re-executed'? or something similar?

Contributor

Changing it to a comment at the top of the three conditionals:

        # n.b. non-executable files have none of these three variables defined,
        # so the last conditional must be "elif" not just "else"
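
To see why the final branch needs `elif`, here is a small self-contained sketch. Only the `"passing"` and `"stale"` keys and `gallery_conf["passing_examples"]` appear in the diff; the first key (`"failing"`), the other list names, and the function name are assumptions made for illustration.

```python
def record_result(out_vars, src_file, gallery_conf):
    # n.b. non-executable files have none of these three variables defined,
    # so the last conditional must be "elif" not just "else"
    if "failing" in out_vars:  # assumed name for the first of the three keys
        gallery_conf["failing_examples"].append(src_file)
    elif "passing" in out_vars:
        assert "stale" not in out_vars
        gallery_conf["passing_examples"].append(src_file)
    elif "stale" in out_vars:
        gallery_conf["stale_examples"].append(src_file)


gallery_conf = {"failing_examples": [], "passing_examples": [], "stale_examples": []}
record_result({}, "README.txt", gallery_conf)              # non-executable file
record_result({"passing": True}, "plot_a.py", gallery_conf)
record_result({"stale": "plot_b.py"}, "plot_b.py", gallery_conf)
print(gallery_conf)
```

With a bare `else`, the non-executable `README.txt` would have been recorded as stale.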

Comment on lines -744 to -756
if sys.platform in ("win32", "darwin"):
    sleep, timeout = (1, 2)
else:
    sleep, timeout = (0.5, 1)
proc = subprocess.Popen(
    [sys.executable, "-c", f"import time, sys; time.sleep({sleep}); sys.exit(0)"],
    close_fds=True,
)
memories = memory_usage(proc, interval=1e-3, timeout=timeout)
proc.communicate(timeout=timeout)
# On OSX sometimes the last entry can be None
memories = [mem for mem in memories if mem is not None] + [0.0]
memory_base = max(memories)
Contributor

Just for my info - why are we deleting? 😬

Contributor

It never really worked robustly. On macOS it almost always errored when trying to open and measure the subprocess.

Contributor

@larsoner larsoner left a comment

Okay pushing a commit, let me know what you think @lucyleeow !

@larsoner
Contributor

larsoner commented Jul 9, 2024

@lucyleeow okay to merge this now?

@lucyleeow
Contributor

Yes I'll merge. Sorry I totally missed that you said fixes can be in follow up PR!

> Okay I realized that for MNE-Python we treat all warnings in doc builds/examples as errors, so I needed to prevent raising the UserWarning that joblib emits as an error.

Just for info, what UserWarning ?

@lucyleeow lucyleeow merged commit 1b4aecd into sphinx-gallery:master Jul 10, 2024
18 checks passed
@lucyleeow
Contributor

Should we release or should we wait for maybe #1313 ?

@jschueller jschueller deleted the paral branch July 10, 2024 05:11
@jschueller
Contributor Author

Don't wait! (kidding :)
Thanks a lot for pursuing this @larsoner @lucyleeow

@larsoner
Contributor

It would be good to get #1313 and #1344 then release

@larsoner
Contributor

Oh and forgot to say:

> Just for info, what UserWarning ?

Inside joblib it could emit a warning like "A worker stopped while some jobs were given to the executor", can't remember if it was because some examples executed too quickly or my pre_dispatch was set too high 🤷 But treating that as an error in the doc build (in MNE we treat all uncaught warnings as errors by default) was really problematic because it caused joblib to hang.

@jschueller
Contributor Author

Any plans for a new release? I see #1313 and #1344 are done.

@larsoner
Contributor

@lucyleeow you up for making a release soon? If not then I can do it

Comment on lines +1284 to +1285
"stale"
True if the example was stale.
Contributor

This doc looks incorrect? On line 1315, the 'stale' key is set to a string of something.

@QuLogic
Contributor

QuLogic commented Jul 26, 2024

PS, thanks for getting this done; it's going to speed up our CI a bunch now that we can enable parallel builds.

@QuLogic
Contributor

QuLogic commented Jul 27, 2024

> Oh and forgot to say:
>
> > Just for info, what UserWarning ?
>
> Inside joblib it could emit a warning like "A worker stopped while some jobs were given to the executor", can't remember if it was because some examples executed too quickly or my pre_dispatch was set too high 🤷 But treating that as an error in the doc build (in MNE we treat all uncaught warnings as errors by default) was really problematic because it caused joblib to hang.

It's probably joblib/joblib#883; it restarts the worker if it uses "too much" memory, which is a bit arbitrary. I also had to disable the warning-as-error to get Matplotlib to build; it might be a good idea to document that somewhere.

@larsoner
Contributor

> I also had to disable the warning-as-error to get Matplotlib to build; it might be a good idea to document that somewhere.

I didn't realize other projects also had their own strict filterwarnings("error") in place! In that case yes it would be good to document it somewhere (making clear that it's different from the -W sphinx option).
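
As a starting point for such documentation, here is a hedged sketch of what a project's `conf.py` might contain. It uses only the standard-library `warnings` module; the function name is hypothetical, the message prefix is the one quoted in this thread, and where the filter needs to be installed may differ per project.

```python
import warnings

# Enable parallel example execution (value taken from the timings above).
sphinx_gallery_conf = {
    "parallel": 4,
}

# Message prefix quoted in this thread; filters match it as a regex
# against the start of the warning text.
JOBLIB_MSG = "A worker stopped while some jobs were given to the executor"


def install_strict_warning_filters():
    """Hypothetical helper: strict warnings, except joblib's worker warning."""
    # Project-level strictness: every uncaught warning becomes an error.
    warnings.filterwarnings("error")
    # Filters added later take precedence, so this exemption must be
    # installed after the blanket "error" filter.
    warnings.filterwarnings("ignore", message=JOBLIB_MSG, category=UserWarning)


install_strict_warning_filters()
```

This is Python-level filtering inside the build process, which is separate from passing `-W` to `sphinx-build`.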

Successfully merging this pull request may close these issues.

Multiprocesor support?
6 participants