Enable parallel sklearn.feature_selection.mutual_info_regression #27795

Closed
netomenoci opened this issue Nov 17, 2023 · 7 comments
Labels
Needs Benchmarks (a tag for issues and PRs that require benchmarks), Needs Decision (requires decision)

Comments

@netomenoci
Contributor

Describe the workflow you want to enable

I can raise the PR if someone is willing to review and potentially merge.

from sklearn.feature_selection import mutual_info_regression
mutual_info = mutual_info_regression(X, y, n_jobs=-1)

Describe your proposed solution

In:


Change

    mi = [
        _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)
        for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
    ]

to

from joblib import Parallel, delayed

def process_column(x, discrete_feature):
    return _compute_mi(x, y, discrete_feature, discrete_target, n_neighbors)

mi = Parallel(n_jobs=n_jobs)(
    delayed(process_column)(x, discrete_feature)
    for x, discrete_feature in zip(_iterate_columns(X), discrete_mask)
)
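The same idea can be tried from user code without touching scikit-learn internals. The following is only a minimal sketch, assuming it is acceptable to score each feature by calling the public `mutual_info_regression` on a one-column matrix; `column_mi` and `parallel_mi` are hypothetical helper names, not part of scikit-learn, and the scores may differ very slightly from a single call on the full matrix because of the small random noise the estimator adds.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_selection import mutual_info_regression


def column_mi(column, y, n_neighbors=3, random_state=0):
    # Hypothetical helper: score one feature by passing it as a 1-column matrix.
    scores = mutual_info_regression(
        column.reshape(-1, 1), y, n_neighbors=n_neighbors, random_state=random_state
    )
    return scores[0]


def parallel_mi(X, y, n_jobs=2, **kwargs):
    # Hypothetical helper: one joblib task per column, results kept in column order.
    scores = Parallel(n_jobs=n_jobs)(
        delayed(column_mi)(X[:, j], y, **kwargs) for j in range(X.shape[1])
    )
    return np.asarray(scores)


# Small synthetic check: only the first feature carries information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.1 * rng.normal(size=300)
mi = parallel_mi(X, y, n_jobs=2)
print(mi)
```

One task per column keeps the sketch simple; for many cheap columns, batching several columns per task would probably reduce scheduling overhead.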

Describe alternatives you've considered, if relevant

None

Additional context

Enable the user to choose multicore or single core.

netomenoci added the Needs Triage and New Feature labels Nov 17, 2023
netomenoci changed the title from "Make sklearn.feature_selection.mutual_info_regression multiprocessing" to "Enable parallel sklearn.feature_selection.mutual_info_regression" Nov 17, 2023
@glemaitre
Member

Could you provide a benchmark to show that the parallelization would be beneficial?

glemaitre added the Needs Decision and Needs Benchmarks labels and removed the New Feature and Needs Triage labels Nov 17, 2023
@FlorinAndrei

FlorinAndrei commented Dec 18, 2023

@glemaitre I found this issue while looking for ways to accelerate mutual_info_regression(). It's very slow currently, and only one CPU core is used - all the others are sitting idle while my code loops very slowly from 1 to n_features.

Here's my benchmark. Perhaps not the benchmark you're looking for (the parallelism here is external to the function itself), but a benchmark showing how much faster this function can be when it's running in a simple parallel context. The dataset size used here is similar to an actual dataset I was looking at, which prompted my search.

from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import SelectKBest, r_regression, mutual_info_regression
from sklearn.linear_model import LinearRegression
from joblib import Parallel, delayed


def objective(X, y, score_func, num_features):
    skb = SelectKBest(score_func=score_func, k=num_features)
    X_small = skb.fit_transform(X, y)
    # save X_small somewhere
    # ...
    lr = LinearRegression(n_jobs=1)
    lr.fit(X_small, y)
    return num_features, lr.score(X_small, y)


n_features = 100
n_samples = int(1e4)
X, y = make_sparse_uncorrelated(random_state=0, n_features=n_features, n_samples=n_samples)

Run mutual_info_regression() in parallel:

%%time
_ = Parallel(n_jobs=-1)(
    delayed(objective)(X, y, mutual_info_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_mi_multi = dict(_)
CPU times: user 74.2 ms, sys: 111 ms, total: 186 ms
Wall time: 38 s

All CPU cores are used all the time.


Run mutual_info_regression() single-process:

%%time
_ = Parallel(n_jobs=1)(
    delayed(objective)(X, y, mutual_info_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_mi_single = dict(_)
CPU times: user 5min 53s, sys: 1min 39s, total: 7min 33s
Wall time: 5min 31s

It's mostly just one CPU core that is used at any given time.


Comparison with r_regression() single-process (which may not be fair, but it's provided for context):

%%time
_ = Parallel(n_jobs=1)(
    delayed(objective)(X, y, r_regression, num_features) for num_features in range(1, n_features + 1)
)
best_res_r2_single = dict(_)
CPU times: user 6.9 s, sys: 21.6 s, total: 28.5 s
Wall time: 1.79 s
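Separately from parallelism, note that `objective` above refits `SelectKBest` for every `num_features`, so the mutual information scores for all features are recomputed on each iteration. A possible alternative, sketched here on a smaller dataset and assuming that ranking by a single set of scores is acceptable (ties aside, this is what `SelectKBest` selects), is to compute the scores once and slice the top-k columns:

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression

# Smaller than the benchmark above so the sketch runs quickly.
X, y = make_sparse_uncorrelated(n_samples=500, n_features=10, random_state=0)

# Compute the scores once, then rank features from most to least informative.
scores = mutual_info_regression(X, y, random_state=0)
order = np.argsort(scores)[::-1]

results = {}
for k in range(1, X.shape[1] + 1):
    X_small = X[:, order[:k]]  # top-k features by mutual information
    lr = LinearRegression().fit(X_small, y)
    results[k] = lr.score(X_small, y)  # training R^2, as in objective()
```

This does not replace the benchmark, which deliberately times repeated calls; it only shows how the repeated scoring could be avoided in a real workflow.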

My CPU:

$ cat /proc/cpuinfo | grep "model name" | head -n 1
model name      : AMD Ryzen 7 5800X3D 8-Core Processor
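As a rough reading of the wall times reported above (38 s in parallel versus 5 min 31 s single-process), the arithmetic can be written out as follows. The core count comes from the CPU model name; treating it as the right denominator is an assumption, since the processor also exposes SMT threads.

```python
# Wall times taken from the two %%time outputs above.
single_wall_s = 5 * 60 + 31   # 5min 31s
parallel_wall_s = 38          # 38 s

speedup = single_wall_s / parallel_wall_s

# Assumption: 8 physical cores, as stated in the CPU model name.
physical_cores = 8
efficiency = speedup / physical_cores

print(f"speedup: {speedup:.1f}x, per-core efficiency: {efficiency:.2f}")
```

An efficiency slightly above 1 relative to physical cores would be consistent with the extra SMT threads contributing, though a single timed run is not enough to draw firm conclusions.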

@glemaitre
Member

Indeed, it shows a net benefit. We could then expose an n_jobs parameter and use joblib.

@FlorinAndrei @netomenoci Would you like to make a pull request?

@netomenoci
Contributor Author

netomenoci commented Jan 9, 2024

Hey @FlorinAndrei, @glemaitre, here's the PR: #28085

I have used

from ..utils.parallel import Parallel, delayed

instead of
from joblib import Parallel, delayed

since it seems that this is the default in sklearn? (The other functions seem to be using it, as it comes with some sklearn checks)

@netomenoci
Contributor Author

merged #28085

@FlorinAndrei

@glemaitre I guess it didn't make the cut into 1.4.0. Will it be included in 1.4.1?

@glemaitre
Member

It will be in 1.5, because we don't include any API changes in bug-fix releases. So it should be available in scikit-learn in May.
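For code that has to run on both older and newer releases, one option is to pass `n_jobs` only when the installed version accepts it. This is a minimal sketch on synthetic data, assuming the parameter is exposed under the name `n_jobs` as discussed in this thread:

```python
import inspect

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: y depends nonlinearly on the second feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

kwargs = {"random_state": 0}
# Only request parallelism if this scikit-learn version exposes n_jobs.
if "n_jobs" in inspect.signature(mutual_info_regression).parameters:
    kwargs["n_jobs"] = 2

mi = mutual_info_regression(X, y, **kwargs)
print(mi)
```

On versions without the parameter, the call simply falls back to the single-process behavior described earlier in the thread.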
