Conversation

lorentzenchr
Member

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

This PR makes the finding of thresholds/quantiles in `_BinMapper.fit` parallel with `concurrent.futures.ThreadPoolExecutor`, see https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example. This is indeed one of the recommended ways to run NumPy in parallel from pure Python, see https://numpy.org/doc/stable/reference/random/multithreading.html.
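The pattern is essentially the following, shown here as a minimal standalone sketch (not the actual scikit-learn code; `column_thresholds`, `fit_thresholds`, and the percentile grid are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def column_thresholds(col, n_bins=256):
    # Quantile-based bin thresholds for one feature column
    # (a simplified stand-in for what _BinMapper.fit computes).
    percentiles = np.linspace(0, 100, num=n_bins + 1)[1:-1]
    return np.percentile(col, percentiles, method="midpoint")


def fit_thresholds(X, n_threads=4):
    # One task per column; NumPy's compiled sorting/partitioning code
    # releases the GIL, so the threads can make progress in parallel.
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(column_thresholds, X.T))


rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
thresholds = fit_thresholds(X)
```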

Any other comments?

Is there a reason we never use a `ThreadPoolExecutor`?

The gain in execution speed is low as only a fraction of the fit time is spent finding the thresholds.


github-actions bot commented Jan 4, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5b171d0. Link to the linter CI: here

approximately the same number of samples.
"""
# Author: Nicolas Hug
import concurrent.futures
Member

@jjerphan jjerphan Jan 5, 2024


Is there a reason we never use a `ThreadPoolExecutor`?

I think loky is preferred and used as a back-end via joblib instead.

Member Author


But this here uses ThreadPoolExecutor, not ProcessPoolExecutor.

Member


I see.

I think @ogrisel is much more knowledgeable than me to make an educated decision.

Member


This pattern is interesting; I wonder whether we could abstract it and reuse it in some other places.

Member

@ogrisel ogrisel Jan 8, 2024


I confirm that loky is only useful for process-based parallelism. For Python-level thread-based parallelism, ThreadPoolExecutor is perfectly fine.

@ogrisel
Member

ogrisel commented Jan 8, 2024

Is there a reason we never use a `ThreadPoolExecutor`?

We traditionally used `joblib.Parallel(prefer="threads")` (overridable selection of the threading backend) or `joblib.Parallel(backend="threading")` (non-overridable selection of the threading backend) for Python-level thread-based parallelism in scikit-learn.

Internally, joblib with the "threading" backend uses `multiprocessing.pool.ThreadPool`, which is the ancestor of `concurrent.futures.ThreadPoolExecutor`; the two are very similar. I find the API of `concurrent.futures.ThreadPoolExecutor` cleaner than `multiprocessing.pool.ThreadPool`, but joblib users should not see any difference.

In cases where we want to hard-code the use of threads, I think that `ThreadPoolExecutor` and `joblib.Parallel(backend="threading")` are both valid options. `ThreadPoolExecutor` might have slightly lower overhead (no joblib backend negotiation abstractions) and offers a richer API than joblib that makes it possible to consume the results on the fly (with `as_completed`), while this is only possible with the dev version of joblib (via the new `return_as="unordered_generator"` parameter).
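For illustration, consuming results on the fly with `as_completed` looks like this (a toy example; the `work` function and its timings are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(i):
    # Later submissions sleep less, so they tend to finish first.
    time.sleep(0.05 * (5 - i))
    return i * i


with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(work, i) for i in range(5)]
    # as_completed yields each future as soon as it finishes, so
    # results can be consumed on the fly, out of submission order.
    results = [f.result() for f in as_completed(futures)]

print(sorted(results))
```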

The gain in execution speed is low as only a fraction of the fit time is spent finding the thresholds.

Could you please post the results of a quick ad-hoc timeit for that step alone to quantify the speed-up out of curiosity?

Note that `_BinMapper.fit` finds the thresholds on a bounded random subsample of the training set, while `_BinMapper.transform` transforms the full training set but already uses OpenMP threads in its Cython code.

So when `n_samples` is large (in the millions or more), I expect `_BinMapper.fit` to be negligible compared to `_BinMapper.transform`, and the latter should already be near-optimal in terms of parallelism.

@ogrisel
Member

ogrisel commented Jan 8, 2024

Note: prior to making the code more complex for parallelization, maybe we should investigate optimizing the single-threaded variant: at the moment we sort each (subsampled) column's data twice:

Once in np.unique and once in np.percentile. Maybe there is a way to sort only once instead.
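One possible shape of that optimization, sketched on a single column (illustrative only; `distinct_and_thresholds` is a made-up helper, and the direct indexing corresponds roughly to np.percentile's "lower" method rather than the midpoint interpolation `_BinMapper` uses):

```python
import numpy as np


def distinct_and_thresholds(col, n_bins=256):
    # Sort once and reuse the sorted array for both steps.
    srt = np.sort(col)
    # Distinct values via the adjacent-difference trick on sorted data
    # (this is what np.unique does after its own, second sort).
    keep = np.concatenate(([True], srt[1:] != srt[:-1]))
    distinct = srt[keep]
    # Quantiles straight from the sorted array instead of np.percentile,
    # which would sort again (roughly the "lower" interpolation method).
    q = np.linspace(0, 100, num=n_bins + 1)[1:-1]
    idx = (q / 100 * (len(srt) - 1)).astype(np.intp)
    return distinct, srt[idx]


col = np.random.default_rng(0).integers(0, 50, size=1_000).astype(float)
distinct, thresholds = distinct_and_thresholds(col)
```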

@lorentzenchr
Member Author

lorentzenchr commented Jan 11, 2024

Edit: Update (long) after merge of #28102, commit 9c5e16d

| n_threads | fit time | fit_transform time |
|-----------|----------|--------------------|
| 1         | 378 ms   | 602 ms             |
| 4         | 103 ms   | 188 ms             |
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper


n_samples, n_features = 200_000, 20
n_bins = 256

X, y = make_classification(n_samples=n_samples, n_features=n_features)
categorical_remapped = np.zeros(n_features, dtype=bool)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=1,
)
%timeit bin_mapper.fit(X)  # 378 ms ± 2.57 ms (old 465 ms ± 3.08 ms)
%timeit bin_mapper.fit_transform(X)  # 602 ms ± 8.5 ms (old 682 ms ± 4.33 ms)

bin_mapper = _BinMapper(
    n_bins=n_bins,
    is_categorical=categorical_remapped,
    known_categories=None,
    random_state=1,
    n_threads=4,
)
%timeit bin_mapper.fit(X)  # 103 ms ± 1.06 ms (old 137 ms ± 3.04 ms)
%timeit bin_mapper.fit_transform(X)  #  188 ms ± 9.76 ms (old 227 ms ± 3.85 ms)

@lorentzenchr lorentzenchr added Quick Review For PRs that are quick to review Performance labels May 23, 2024
@lorentzenchr lorentzenchr added this to the 1.6 milestone May 23, 2024
Contributor

@OmarManzoor OmarManzoor left a comment


LGTM. Thanks @lorentzenchr

@OmarManzoor OmarManzoor added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jun 28, 2024
@OmarManzoor OmarManzoor merged commit 2107404 into scikit-learn:main Jun 28, 2024
@lorentzenchr lorentzenchr deleted the hgbt_parallel_quantile_binning branch June 28, 2024 15:21
@lesteve
Member

lesteve commented Jul 1, 2024

So this broke Pyodide, very likely because you cannot start a thread in Pyodide, see build log

Stack-trace
_____________________ test_bin_mapper_n_features_transform _____________________

    def test_bin_mapper_n_features_transform():
>       mapper = _BinMapper(n_bins=42, random_state=42).fit(DATA)
        w          = <concurrent.futures.thread._WorkItem object at 0x111cf120>
/lib/python312.zip/concurrent/futures/thread.py:202: in _adjust_thread_count
    t.start()
        num_threads = 0
        self       = <concurrent.futures.thread.ThreadPoolExecutor object at 0x113cbb50>
        t          = <Thread(ThreadPoolExecutor-0_0, initial)>
        thread_name = 'ThreadPoolExecutor-0_0'
        weakref_cb = <function ThreadPoolExecutor._adjust_thread_count.<locals>.weakref_cb at 0x11454080>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Thread(ThreadPoolExecutor-0_0, initial)>

    def start(self):
        """Start the thread's activity.
    
        It must be called at most once per thread object. It arranges for the
        object's run() method to be invoked in a separate thread of control.
    
        This method will raise a RuntimeError if called more than once on the
        same thread object.
    
        """
        if not self._initialized:
            raise RuntimeError("thread.__init__() not called")
    
        if self._started.is_set():
            raise RuntimeError("threads can only be started once")
    
        with _active_limbo_lock:
            _limbo[self] = self
        try:
>           _start_new_thread(self._bootstrap, ())
E           RuntimeError: can't start new thread

self       = <Thread(ThreadPoolExecutor-0_0, initial)>

/lib/python312.zip/threading.py:992: RuntimeError

The work-around is to check whether we are inside Pyodide and not use multi-threading in this case? I guess Pyodide is the only case where you cannot create threads.

I don't have a strong opinion on using joblib for multi-threading, but using `joblib.Parallel` would have switched to `n_jobs=1` inside Pyodide. Side-comment: I think `return_as="unordered_generator"` is in joblib 1.4 and not only in the joblib dev version, see changelog. Having said that, I am not sure we want to require joblib>=1.4, released April 2024, to use this feature.
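A sketch of such a Pyodide check (a hypothetical helper for illustration, not what was merged; the platform strings are CPython's for Emscripten/WASI builds):

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def map_maybe_threaded(func, items, n_threads=4):
    # Under Pyodide sys.platform is "emscripten" (and "wasi" under WASI
    # builds of CPython); neither can spawn threads, so fall back to a
    # plain sequential loop there.
    if n_threads == 1 or sys.platform in ("emscripten", "wasi"):
        return [func(item) for item in items]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(func, items))


squares = map_maybe_threaded(lambda x: x * x, range(5))
```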

@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024
Labels
module:ensemble Performance Quick Review For PRs that are quick to review Waiting for Second Reviewer First reviewer is done, need a second one!
6 participants