Enable parallel sklearn.feature_selection.mutual_info_regression #27795
Comments
Could you provide a benchmark to show that the parallelization would be beneficial?
@glemaitre I found this issue while looking for ways to accelerate mutual_info_regression.

Here's my benchmark. It's perhaps not the benchmark you're looking for (the parallelism here is external to the function itself), but it shows how much faster this function can be when it's running in a simple parallel context. The dataset size used here is similar to an actual dataset I was looking at, which prompted my search.

    from sklearn.datasets import make_sparse_uncorrelated
    from sklearn.feature_selection import SelectKBest, r_regression, mutual_info_regression
    from sklearn.linear_model import LinearRegression
    from joblib import Parallel, delayed

    def objective(X, y, score_func, num_features):
        # Keep the num_features best-scoring columns, then fit a linear model on them.
        skb = SelectKBest(score_func=score_func, k=num_features)
        X_small = skb.fit_transform(X, y)
        # save X_small somewhere
        # ...
        lr = LinearRegression(n_jobs=1)
        lr.fit(X_small, y)
        return num_features, lr.score(X_small, y)

    n_features = 100
    n_samples = int(1e4)
    X, y = make_sparse_uncorrelated(random_state=0, n_features=n_features, n_samples=n_samples)

Run with n_jobs=-1:

    %%time
    _ = Parallel(n_jobs=-1)(
        delayed(objective)(X, y, mutual_info_regression, num_features)
        for num_features in range(1, n_features + 1)
    )
    best_res_mi_multi = dict(_)

All CPU cores are used all the time.

Run with n_jobs=1:

    %%time
    _ = Parallel(n_jobs=1)(
        delayed(objective)(X, y, mutual_info_regression, num_features)
        for num_features in range(1, n_features + 1)
    )
    best_res_mi_single = dict(_)

It's mostly just one CPU core that is used at any given time.

Comparison with r_regression:

    %%time
    _ = Parallel(n_jobs=1)(
        delayed(objective)(X, y, r_regression, num_features)
        for num_features in range(1, n_features + 1)
    )
    best_res_r2_single = dict(_)

My CPU:
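The benchmark above parallelizes across values of k, outside the function. Since the mutual information estimate is computed feature by feature, the same idea can be applied across columns instead. The following is only a rough sketch of that pattern, not the implementation that ended up in scikit-learn: `mi_per_column` is a hypothetical helper name, the dataset is deliberately small, and because the estimator adds a little random noise to continuous features, the scores may differ slightly from a single call on the whole matrix.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.feature_selection import mutual_info_regression


def mi_per_column(X, y, n_jobs=-1, random_state=0):
    """Hypothetical helper: one joblib task per column of X."""
    scores = Parallel(n_jobs=n_jobs)(
        # X[:, [j]] keeps the column 2-D, as mutual_info_regression expects.
        delayed(mutual_info_regression)(X[:, [j]], y, random_state=random_state)
        for j in range(X.shape[1])
    )
    # Each task returns an array of shape (1,); stitch them back together.
    return np.concatenate(scores)


# Small illustrative dataset, much smaller than the one in the benchmark.
X, y = make_sparse_uncorrelated(n_samples=500, n_features=10, random_state=0)
mi_parallel = mi_per_column(X, y, n_jobs=2)
mi_serial = mi_per_column(X, y, n_jobs=1)
```

With a fixed integer `random_state`, each per-column task is deterministic, so the serial and parallel runs should agree; only the wall-clock time is expected to change.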
Indeed, it shows a net benefit. We could then expose an n_jobs parameter. @FlorinAndrei @netomenoci Would you like to make a pull request?
Hey @FlorinAndrei, @glemaitre, here's the PR: #28085. I have used [...] instead of [...] since it seems that this is the default in sklearn? (The other functions seem to be using it, as it comes with some sklearn checks.)
Merged #28085.
@glemaitre I guess it didn't make the cut into 1.4.0. Will it be included in 1.4.1?
It will be in 1.5, because we don't include any API changes in bug-fix releases. So it should be in scikit-learn in May.
Describe the workflow you want to enable
I can raise the PR if someone is willing to review and potentially merge.
Describe your proposed solution
In:
scikit-learn/sklearn/feature_selection/_mutual_info.py
Line 304 in 0ab3699
Change
to
Describe alternatives you've considered, if relevant
None
Additional context
Enable the user to choose multicore or single core.