ENH Add Friedman's H-squared #28375
Conversation
❌ Linting issues: this PR is introducing linting issues. Note that you can avoid linting issues by enabling pre-commit hooks. You can see the details of the linting issues under the corresponding CI job.
@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook, see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute. @amueller @glemaitre @adrinjalali ping as this might interest you.
A first quick pass. Maybe _partial_dependence_brute can help with the tests.
Co-authored-by: Christian Lorentzen <[email protected]>
The naming will pop up during further review anyway. One possibility would be ...
Basically, the ball is more in my court: I need to get through the literature before providing a meaningful review. I'll do my best to start after the release of 1.5.2 and push it for the 1.6 release.
OK, this time I promise: I will really focus on reviewing this PR. I'll first look at the core implementation. I already have some comments regarding naming, but I don't think that is important in a first pass. Again, sorry @mayer79 for the delay. I'll push a first commit to resolve the conflict.
@mayer79 I have a couple of high-level questions (with high variance regarding the topic):
I'll probably make a PR on your fork regarding some code styling that would be too annoying to request via a code review.
n = X.shape[0]
n_grid = grid.shape[0]

X_stacked = _safe_indexing(X, np.tile(np.arange(n), n_grid), axis=0)
So I assume that the speed-up observed between this function _calculate_pd_brute_fast and _partial_dependence_brute is only related to stacking all samples in a single matrix and calling .predict_proba a single time.
So basically, we have a speed/memory trade-off. Here, we might blow up the memory with a large dataset if we decide not to subsample.
To ease maintenance, I'm really leaning towards using the _partial_dependence_brute implementation.
However, the pattern here shows that we can get a good speed-up by concatenating data, but we probably need to think about a chunking strategy to not blow up memory.
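For illustration, a chunked variant of this stacking trick could look like the sketch below (illustrative names such as _pd_brute_chunked and chunk_size, not code from this PR): the grid is processed in chunks so that at most chunk_size * n_samples rows are predicted at once.

import numpy as np

def _pd_brute_chunked(predict, X, feature_idx, grid, chunk_size=10):
    # Illustrative chunked brute-force partial dependence (not the PR's code).
    # For each chunk of grid values, X is tiled once, the feature column is
    # overwritten with the grid values, and a single predict call covers the
    # whole chunk. Peak memory is about chunk_size * n_samples rows.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    averaged = []
    for start in range(0, len(grid), chunk_size):
        chunk = np.asarray(grid[start:start + chunk_size])
        X_stacked = np.tile(X, (len(chunk), 1))
        X_stacked[:, feature_idx] = np.repeat(chunk, n)
        preds = predict(X_stacked).reshape(len(chunk), n)
        averaged.append(preds.mean(axis=1))
    return np.concatenate(averaged)

Called as _pd_brute_chunked(model.predict, X, 0, grid), this returns the same averaged predictions as stacking everything at once, while chunk_size bounds the memory overhead.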
""" | ||
|
||
# Select grid columns and remove duplicates (will compensate below) | ||
grid = _safe_indexing(X, feature_indices, axis=1) |
This is where I'm thinking that we could reuse _grid_from_X instead. I don't know if taking quantiles will actually have a statistical impact?
Using quantiles is a strategy. However, the hard part of the calculation is the 2D partial dependence. If you work with grid size 50, the resulting grid (using only existing combinations) will be almost as large as the selected n = 500 rows. There is, additionally, the complication of distinguishing discrete from continuous features. This does not mean we should not go for the quantile strategy.
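For illustration, a quantile-based grid for a feature pair might be built as in the sketch below (not code from this PR; _quantile_grid and grid_resolution are made-up names). Features with few unique values are kept as-is, continuous ones are reduced to quantiles, and the pair grid is the Cartesian product:

import numpy as np

def _quantile_grid(x, grid_resolution=50):
    # Illustrative 1D grid: unique values if few, otherwise quantiles.
    x = np.asarray(x, dtype=float)
    uniques = np.unique(x)
    if uniques.shape[0] <= grid_resolution:
        return uniques  # treat as discrete
    return np.unique(np.quantile(x, np.linspace(0, 1, grid_resolution)))

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.integers(0, 3, size=500).astype(float)
g1, g2 = _quantile_grid(x1), _quantile_grid(x2)
# 2D grid for the pair: Cartesian product of the two 1D grids.
grid_2d = np.array(np.meshgrid(g1, g2)).T.reshape(-1, 2)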
sklearn/inspection/_h_statistic.py
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights used in calculating partial dependencies.

n_max : int, default=500
I think that we call this subsample in other places like KBinsDiscretizer. I would probably keep the same naming.
numerator_pairwise : ndarray of shape (n_pairs, output_dim)
    Numerator of the pairwise H-squared statistic.
    Useful to see which feature pair has the strongest absolute interaction.
    Take the square root to get values on the scale of the predictions.

denominator_pairwise : ndarray of shape (n_pairs, output_dim)
    Denominator of the pairwise H-squared statistic. Used for appropriate
    normalization of H.
What is the reason to store these individually? Would storing H**2 be enough for the majority of use cases?
In practice, both the relative H² and the numerator (an absolute measure) are useful.
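To illustrate how the two parts would typically be combined (the function name h_statistic and the result attributes here are assumptions based on the docstring fragments above, not a confirmed API):

import numpy as np

# Hypothetical result object exposing the documented fields, e.g. from
# result = h_statistic(model, X, features=[0, 1, 2]).
def summarize(result):
    # Relative interaction strength: Friedman's pairwise H^2.
    h2_pairwise = result.numerator_pairwise / result.denominator_pairwise
    # Absolute interaction strength on the scale of the predictions.
    abs_strength = np.sqrt(result.numerator_pairwise)
    return h2_pairwise, abs_strength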
Thanks a lot for the excellent high-level view!
Speed matters. This example shows a speed-up of a factor of 10, but it might be exaggerated:

import numpy as np

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection._partial_dependence import _partial_dependence_brute

# _calculate_pd_brute_fast is the helper function added in this PR.

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
model = RandomForestRegressor(n_jobs=8).fit(X, y)
grid = np.linspace(0, 1, 100)

# 0.1 seconds
pd = _calculate_pd_brute_fast(model.predict, X, 0, grid=grid)
pd[0:4].flatten()  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])

# 1.3 seconds
_partial_dependence_brute(
    model, grid.reshape(-1, 1), [0], X, response_method="predict"
)[0].flatten()[0:4]  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])
We can think about this. The reason for the current approach is that aggregation of the results is easy; we simply calculate and combine the partial dependence vectors. In (ugly) pseudo code:
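Roughly, following the standard definition of pairwise H² from Friedman & Popescu (2008), the aggregation for a feature pair (j, k) could be sketched as follows (illustrative code, not the PR's implementation):

import numpy as np

def h2_pairwise(pd_j, pd_k, pd_jk):
    # pd_j, pd_k: univariate partial dependence evaluated at each row's
    # observed x_j and x_k; pd_jk: bivariate partial dependence evaluated at
    # the observed (x_j, x_k) pairs. All are centered before combining.
    pd_j = pd_j - pd_j.mean()
    pd_k = pd_k - pd_k.mean()
    pd_jk = pd_jk - pd_jk.mean()
    numerator = np.sum((pd_jk - pd_j - pd_k) ** 2)
    denominator = np.sum(pd_jk ** 2)
    return numerator / denominator  # relative H^2; the numerator is the absolute part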
Handling of categorical features is probably easier with the exact approach. Conceptually, the two approaches are quite close: if you replace the original data values by their corresponding grid values, you get an approximation of the data. Then you apply the exact H-statistic algorithm. This gives the same result as the approach via grids.
That would be very neat! Of course, if you want to calculate the statistic only for a single pair, you can simply call the function with these two features.
Great suggestions! We can change the name of the function, as well as the output API (which is also not ideal).
That would be fantastic, thanks a lot.
`output_dim` equals the number of values predicted per observation.
For single-output regression and binary classification, `output_dim` is 1.

feature_pairs : list of length n_feature_pairs
I think that it would be better to store the pair of original keys, meaning that if a user passes strings, we should store those instead of always storing indices.
I opened #30111 to discuss a parameter to deal with the memory/speed trade-off. I think it could be a good addition.
Quick pass on some parameter naming
@antoinebaker would you be so kind as to give a review here?
Hi @mayer79 and @glemaitre, what is the current status of this PR? If I understood correctly, this PR implements its own partial dependence computation through _calculate_pd_brute_fast. Should this PR wait for #30111 to be finalized before moving on, and refactor to use the new implementation?
The issue with #30111 is that it doesn't really seem to be doing what it intends to do, so we shouldn't wait for that.
In terms of scope, it would be nice to merge this feature. In terms of code, I recall that I wanted to avoid repeating some common code with partial dependence for maintainability. When it comes to #30111, the idea was to get a way to limit the memory consumption at the cost of computation. But reading back the experiments, they were not conclusive. I assume that we might go forward with a first version and then we could always try to improve it later.
Co-authored-by: Quentin Barthélemy <[email protected]>
Reference Issues/PRs
Implements #22383
What does this implement/fix? Explain your changes.
@lorentzenchr
This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when adding more than 6-8 features. The basic strategy is to select e.g. the top 5 predictors via permutation importance and then crunch the corresponding pairwise (absolute and relative) interaction strength statistics.
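As a rough usage sketch of that workflow (the function name h_statistic and its signature are placeholders for illustration, not the final API of this PR):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# 1) Select the top 5 predictors via permutation importance.
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top5 = np.argsort(imp.importances_mean)[::-1][:5]

# 2) Hypothetical call computing pairwise interaction strengths for them only.
# result = h_statistic(model, X, features=top5)
# H2 = result.numerator_pairwise / result.denominator_pairwise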
(My) reference implementation: https://github.com/mayer79/hstats
Any other comments?