ENH add n_jobs to mutual_info_regression and mutual_info_classif #28085

netomenoci · 2024-01-09T11:39:04Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This implements the addition of the n_jobs parameter to sklearn.feature_selection.mutual_info_regression and sklearn.feature_selection.mutual_info_classif.

github-actions · 2024-01-09T11:40:56Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5a54c2f.

sklearn/feature_selection/_mutual_info.py

glemaitre · 2024-01-11T20:42:14Z

We will need an entry in doc/whats_new/v1.5.rst to acknowledge adding the parameter.
We also need a test that tries different values of n_jobs. For instance, we can test that n_jobs=None provides the same results as n_jobs=2.

sklearn/feature_selection/_mutual_info.py

Accept suggestions Co-authored-by: Guillaume Lemaitre <[email protected]>

doc/whats_new/v1.5.rst

netomenoci · 2024-01-14T19:38:27Z

@glemaitre thanks for the suggestions. I have implemented them.

glemaitre

It almost looks good. Here are a couple of suggestions.

doc/whats_new/v1.5.rst

sklearn/feature_selection/_mutual_info.py

sklearn/feature_selection/tests/test_mutual_info.py

Co-authored-by: Guillaume Lemaitre <[email protected]>

netomenoci · 2024-01-15T11:31:59Z

It almost looks good. Here are a couple of suggestions.

@glemaitre I have applied the suggestions, thanks. One of the CircleCI tests is failing, but I don't know why.

glemaitre · 2024-01-15T17:00:30Z

sklearn/feature_selection/tests/test_mutual_info.py

@@ -252,3 +253,18 @@ def test_mutual_info_regression_X_int_dtype(global_random_seed):
    expected = mutual_info_regression(X_float, y, random_state=global_random_seed)
    result = mutual_info_regression(X, y, random_state=global_random_seed)
    assert_allclose(result, expected)
+
+
+@pytest.mark.parametrize(


@netomenoci I pushed a piece of code that shows how to do the parallelization, if you are interested.

awesome, thanks :)

glemaitre

LGTM. We will need a second review.

glemaitre · 2024-01-15T17:04:29Z

One of the CircleCI tests is failing, but I don't know why.

It was because of a git conflict.

lesteve · 2024-01-18T09:45:22Z

Here is a quick benchmark, loosely adapted from #27795 (comment) (which unless I am mistaken was not really a benchmark of parallelizing mutual_info_regression).

I get a maximum speedup of roughly 2.5x on my laptop, which has 8 logical cores (4 physical cores + hyper-threading).

from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import mutual_info_regression


n_features = 100
n_samples = int(1e4)
X, y = make_sparse_uncorrelated(random_state=0, n_features=n_features, n_samples=n_samples)

print('n_jobs=1')
%timeit mutual_info_regression(X, y, n_jobs=1)

print('n_jobs=4')
%timeit mutual_info_regression(X, y, n_jobs=4)

print('n_jobs=8')
%timeit mutual_info_regression(X, y, n_jobs=8)

I get:

n_jobs=1
3.56 s ± 127 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=4
1.41 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=8
1.34 s ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
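
As a quick worked check of the speedup, the ratios can be computed from the mean timings reported above. This sketch uses only those three means and ignores the reported standard deviations; the variable names are illustrative.

```python
# Mean wall-clock times (seconds) from the benchmark output, keyed by n_jobs.
timings_s = {1: 3.56, 4: 1.41, 8: 1.34}

baseline = timings_s[1]
# Speedup relative to the single-job run.
speedups = {n_jobs: baseline / t for n_jobs, t in timings_s.items()}

for n_jobs, speedup in speedups.items():
    print(f"n_jobs={n_jobs}: {speedup:.2f}x")
```

Going from 4 to 8 jobs adds little, which is consistent with the machine having 4 physical cores.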

lesteve · 2024-01-18T09:55:21Z

sklearn/feature_selection/_mutual_info.py

@@ -201,11 +202,13 @@ def _iterate_columns(X, columns=None):
 def _estimate_mi(
    X,
    y,
+    *,


I used keyword arguments in the _estimate_mi function calls to make the code more readable, and also made the arguments keyword-only in the _estimate_mi definition.

I guess that's fine since _estimate_mi is private. @glemaitre do you agree?

Yes I do agree.
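
To illustrate the keyword-only change discussed above, here is a small sketch with a hypothetical stand-in function, estimate. It is not the real _estimate_mi signature; it only shows what the bare `*` in a definition does.

```python
def estimate(X, y, *, n_neighbors=3, copy=True, n_jobs=None):
    """Hypothetical stand-in: everything after `*` must be passed by keyword."""
    return {"n_neighbors": n_neighbors, "copy": copy, "n_jobs": n_jobs}


# Keyword call: each argument is named at the call site.
settings = estimate([[0.0]], [0.0], n_neighbors=5, n_jobs=2)

# Positional call: rejected, so argument order cannot be mixed up silently.
try:
    estimate([[0.0]], [0.0], 5, True, 2)
except TypeError as exc:
    error_message = str(exc)
```

Because the function is private, tightening the signature this way does not break any public API.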

lesteve · 2024-01-18T11:48:56Z

Let's merge this one, thanks @netomenoci!

netomenoci · 2024-01-18T11:49:12Z

Using the sklearn wrapper around Parallel slows it down because there is some extra overhead on top of joblib (e.g. copies).

If you think it's useful, we can change the import of Parallel (a one-line change) to come from joblib instead of sklearn.utils, and it should be faster and closer to the previously mentioned benchmark.

glemaitre · 2024-01-22T10:02:39Z

If you think it's useful, we can change the import of Parallel (a one-line change) to come from joblib instead of sklearn.utils, and it should be faster and closer to the previously mentioned benchmark.

We cannot do that because we would run into issues with the global config. This is the reason for having the wrapper.

…kit-learn#28085) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Loïc Estève <[email protected]>

add n_jobs to mutual_info_regression and mutual_info_classif

02d995b

github-actions bot added the module:feature_selection label Jan 9, 2024

Merge branch 'main' into mutual_info_parallel

d9a1886

netomenoci mentioned this pull request Jan 9, 2024

Enable parallel sklearn.feature_selection.mutual_info_regression #27795

Closed

netomenoci changed the title ~~add n_jobs to mutual_info_regression and mutual_info_classif~~ FEAT add n_jobs to mutual_info_regression and mutual_info_classif Jan 9, 2024

netomenoci added 2 commits January 9, 2024 13:18

add n_jobs to validate_params

e27f6db

Merge branch 'mutual_info_parallel' of github.com:netomenoci/scikit-l…

d215fea

…earn into mutual_info_parallel

glemaitre reviewed Jan 11, 2024
