Enable parallel sklearn.feature_selection.mutual_info_regression #27795

netomenoci · 2023-11-17T02:28:16Z

Describe the workflow you want to enable

I can raise the PR if someone is willing to review and potentially merge.

from sklearn.feature_selection import mutual_info_regression
# X, y: feature matrix and target; n_jobs is the proposed new parameter
mutual_info = mutual_info_regression(X, y, n_jobs=-1)

Describe your proposed solution

In scikit-learn/sklearn/feature_selection/_mutual_info.py (line 304 in 0ab3699), change

    mi = [
        _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
        for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
    ]

to

    from joblib import Parallel, delayed

    # Wrap the per-column computation so that joblib can dispatch it.
    def process_column(x, discrete_feature):
        return _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)

    # One task per feature column; n_jobs sets the number of workers.
    mi = Parallel(n_jobs=n_jobs)(
        delayed(process_column)(x, discrete_feature)
        for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
    )
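The same per-column pattern can be tried from user code without touching scikit-learn internals. The following is only a sketch: column_mi and parallel_mi are hypothetical helper names, and the data comes from make_regression purely for illustration. Because mutual_info_regression adds a small amount of random noise to continuous features, scoring columns one at a time is not guaranteed to reproduce the scores of a single call on the full matrix.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression


def column_mi(column, y, random_state=0):
    """Estimate the mutual information between one feature column and y."""
    # mutual_info_regression expects a 2D feature array, so reshape the column.
    return mutual_info_regression(
        column.reshape(-1, 1), y, random_state=random_state
    )[0]


def parallel_mi(X, y, n_jobs=2):
    """Hypothetical helper: score each column in its own joblib task."""
    scores = Parallel(n_jobs=n_jobs)(
        delayed(column_mi)(X[:, j], y) for j in range(X.shape[1])
    )
    return np.asarray(scores)


# Small synthetic dataset, used only to exercise the sketch.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
mi = parallel_mi(X, y)
```

Each task is independent, which is why the list comprehension in the original code maps onto joblib so directly.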

Describe alternatives you've considered, if relevant

None

Additional context

This would let the user choose between multi-core and single-core execution.


glemaitre · 2023-11-17T10:17:43Z

Could you provide a benchmark showing that the parallelization would be beneficial?

FlorinAndrei · 2023-12-18T20:03:53Z

@glemaitre I found this issue while looking for ways to accelerate mutual_info_regression(). It's very slow currently, and only one CPU core is used - all the others are sitting idle while my code loops very slowly from 1 to n_features.

Here's my benchmark. Perhaps not the benchmark you're looking for (the parallelism here is external to the function itself), but a benchmark showing how much faster this function can be when it's running in a simple parallel context. The dataset size used here is similar to an actual dataset I was looking at, which prompted my search.

from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import SelectKBest, r_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression
from joblib import Parallel, delayed


def objective(X, y, score_func, num_features):
    skb = SelectKBest(score_func=score_func, k=num_features)
    X_small = skb.fit_transform(X, y)
    # save X_small somewhere
    # ...
    lr = LinearRegression(n_jobs=1)
    lr.fit(X_small, y)
    return num_features, lr.score(X_small, y)


n_features = 100
n_samples = int(1e4)
X, y = make_sparse_uncorrelated(random_state=0, n_features=n_features, n_samples=n_samples)

Run mutual_info_regression() in parallel:

%%time
_ = Parallel(n_jobs=-1)(
    delayed(objective)(X, y, mutual_info_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_mi_multi = dict(_)

CPU times: user 74.2 ms, sys: 111 ms, total: 186 ms
Wall time: 38 s

All CPU cores are used all the time.

Run mutual_info_regression() single-process:

%%time
_ = Parallel(n_jobs=1)(
    delayed(objective)(X, y, mutual_info_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_mi_single = dict(_)

CPU times: user 5min 53s, sys: 1min 39s, total: 7min 33s
Wall time: 5min 31s

It's mostly just one CPU core that is used at any given time.

Comparison with r_regression() single-process (which may not be fair, but it's provided for context):

%%time
_ = Parallel(n_jobs=1)(
    delayed(objective)(X, y, r_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_r2_single = dict(_)

CPU times: user 6.9 s, sys: 21.6 s, total: 28.5 s
Wall time: 1.79 s

My CPU:

$ cat /proc/cpuinfo | grep "model name" | head -n 1
model name      : AMD Ryzen 7 5800X3D 8-Core Processor

glemaitre · 2024-01-09T09:44:23Z

Indeed, it shows a net benefit. We could then expose an n_jobs parameter and use joblib.

@FlorinAndrei @netomenoci Would you like to make a pull request?

netomenoci · 2024-01-09T11:41:32Z

Hey @FlorinAndrei, @glemaitre, here's the PR: #28085

I have used

from ..utils.parallel import Parallel, delayed

instead of
from joblib import Parallel, delayed

since this seems to be the convention in sklearn. (The other functions seem to use it, as it comes with some sklearn-specific checks.)

netomenoci · 2024-01-18T13:49:15Z

Merged in #28085.

FlorinAndrei · 2024-01-18T21:38:03Z

@glemaitre I guess it didn't make the cut into 1.4.0. Will it be included in 1.4.1?

glemaitre · 2024-01-22T10:04:52Z

It will be in 1.5, because we don't include API changes in bug-fix releases. So it should be available in scikit-learn in May.
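For code that has to run on both older and newer releases, one option is to pass n_jobs only when the installed version accepts it. This is a minimal sketch, assuming the parameter is named n_jobs as in the PR title; the signature check and the synthetic data are illustrative choices, not something prescribed in this thread.

```python
import inspect

from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

# Small synthetic dataset, used only to exercise the call.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Pass n_jobs only if the installed scikit-learn exposes it.
kwargs = {}
if "n_jobs" in inspect.signature(mutual_info_regression).parameters:
    kwargs["n_jobs"] = 2

mi = mutual_info_regression(X, y, random_state=0, **kwargs)
```

On a release without the parameter, the call simply falls back to the single-core behavior discussed above.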

netomenoci added the Needs Triage and New Feature labels Nov 17, 2023

netomenoci changed the title ~~Make sklearn.feature_selection.mutual_info_regression multiprocessing~~ Enable parallel sklearn.feature_selection.mutual_info_regression Nov 17, 2023

glemaitre added the Needs Decision and Needs Benchmarks labels and removed the New Feature and Needs Triage labels Nov 17, 2023

netomenoci mentioned this issue Jan 9, 2024

ENH add n_jobs to mutual_info_regression and mutual_info_classif #28085

Merged

netomenoci closed this as completed Jan 18, 2024
