ENH Add Friedman's H-squared #28375
Conversation
❌ Linting issues: this PR is introducing linting issues. Note that you can avoid linting issues by enabling pre-commit hooks. You can see the details of the linting issues under the corresponding CI job.
@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook, see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute. @amueller @glemaitre @adrinjalali ping as this might interest you.
A first quick pass. Maybe _partial_dependence_brute can help with the tests.
Co-authored-by: Christian Lorentzen <[email protected]>
The naming will pop up during further review anyway. One possibility would be ...
Basically, the ball is more in my court: I need to get through the literature before providing a meaningful review. I'll do my best to start after the release of 1.5.2 and push it for the 1.6 release.
OK, this time I promise: I will really focus on reviewing this PR. I'll first look at the core implementation. I already have some comments regarding naming, but I don't think that is important in a first pass. Again, sorry @mayer79 for the delay. I'll push a first commit to resolve the conflict.
@mayer79 I have a couple of high-level questions (with high variance regarding the topic):
I'll probably make a PR on your fork regarding some code styling that would be too annoying to request via a code review.
n = X.shape[0]
n_grid = grid.shape[0]

X_stacked = _safe_indexing(X, np.tile(np.arange(n), n_grid), axis=0)
So I assume that the speed-up observed between this function _calculate_pd_brute_fast and _partial_dependence_brute is only related to stacking all samples in a single matrix and calling .predict_proba a single time.
So basically, we have a speed/memory trade-off. Here, we might blow up the memory with a large dataset if we decide not to subsample.
To ease maintenance, I'm really leaning towards using the _partial_dependence_brute implementation.
However, the pattern here shows that we can get a good speed-up by concatenating data, but we probably need to think about a chunking strategy to not blow up memory.
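For illustration, a chunked variant of this stacking trick could look like the sketch below (illustrative names such as _pd_brute_chunked and chunk_size, not code from this PR): the grid is processed in chunks so that at most chunk_size * n_samples rows are predicted at once.

import numpy as np

def _pd_brute_chunked(predict, X, feature_idx, grid, chunk_size=10):
    # Illustrative chunked brute-force partial dependence (not the PR's code).
    # For each chunk of grid values, X is tiled once, the feature column is
    # overwritten with the grid values, and a single predict call covers the
    # whole chunk. Peak memory is about chunk_size * n_samples rows.
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    averaged = []
    for start in range(0, len(grid), chunk_size):
        chunk = np.asarray(grid[start:start + chunk_size])
        X_stacked = np.tile(X, (len(chunk), 1))
        X_stacked[:, feature_idx] = np.repeat(chunk, n)
        preds = predict(X_stacked).reshape(len(chunk), n)
        averaged.append(preds.mean(axis=1))
    return np.concatenate(averaged)

Called as _pd_brute_chunked(model.predict, X, 0, grid), this returns the same averaged predictions as stacking everything at once, while chunk_size bounds the memory overhead.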
""" | ||
|
||
# Select grid columns and remove duplicates (will compensate below) | ||
grid = _safe_indexing(X, feature_indices, axis=1) |
This is where I'm thinking that we could reuse _grid_from_X instead. I don't know if taking quantiles will actually have a statistical impact?
Using quantiles is a strategy. However, the hard part of the calculation is the 2D partial dependence. If you work with grid size 50, the resulting grid (using only existing combinations) will be almost as large as the selected n = 500 rows. There is, additionally, the complication of distinguishing discrete from continuous features. This does not mean we should not go for the quantile strategy.
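For illustration, a quantile-based grid for a feature pair might be built as in the sketch below (not code from this PR; _quantile_grid and grid_resolution are made-up names). Features with few unique values are kept as-is, continuous ones are reduced to quantiles, and the pair grid is the Cartesian product:

import numpy as np

def _quantile_grid(x, grid_resolution=50):
    # Illustrative 1D grid: unique values if few, otherwise quantiles.
    x = np.asarray(x, dtype=float)
    uniques = np.unique(x)
    if uniques.shape[0] <= grid_resolution:
        return uniques  # treat as discrete
    return np.unique(np.quantile(x, np.linspace(0, 1, grid_resolution)))

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=500), rng.integers(0, 3, size=500).astype(float)
g1, g2 = _quantile_grid(x1), _quantile_grid(x2)
# 2D grid for the pair: Cartesian product of the two 1D grids.
grid_2d = np.array(np.meshgrid(g1, g2)).T.reshape(-1, 2)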
sklearn/inspection/_h_statistic.py
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights used in calculating partial dependencies.

n_max : int, default=500
I think that we call this subsample in other places like KBinsDiscretizer. I would probably keep the same naming.
numerator_pairwise : ndarray of shape (n_pairs, output_dim)
    Numerator of the pairwise H-squared statistic.
    Useful to see which feature pair has the strongest absolute interaction.
    Take the square root to get values on the scale of the predictions.

denominator_pairwise : ndarray of shape (n_pairs, output_dim)
    Denominator of the pairwise H-squared statistic. Used for appropriate
    normalization of H.
What is the reason to store these individually? Would storing H**2 be enough for the majority of use cases?
In practice, both the relative H² and the numerator (an absolute measure) are useful.
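To illustrate how the two parts would typically be combined (the function name h_statistic and the result attributes here are assumptions based on the docstring fragments above, not a confirmed API):

import numpy as np

# Hypothetical result object exposing the documented fields, e.g. from
# result = h_statistic(model, X, features=[0, 1, 2]).
def summarize(result):
    # Relative interaction strength: Friedman's pairwise H^2.
    h2_pairwise = result.numerator_pairwise / result.denominator_pairwise
    # Absolute interaction strength on the scale of the predictions.
    abs_strength = np.sqrt(result.numerator_pairwise)
    return h2_pairwise, abs_strength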
Thanks a lot for the excellent high-level view!
Speed matters. This example shows a speed-up of a factor of 10, but it might be exaggerated:

import numpy as np

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection._partial_dependence import _partial_dependence_brute

# _calculate_pd_brute_fast is the helper function added in this PR.

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
model = RandomForestRegressor(n_jobs=8).fit(X, y)
grid = np.linspace(0, 1, 100)

# 0.1 seconds
pd = _calculate_pd_brute_fast(model.predict, X, 0, grid=grid)
pd[0:4].flatten()  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])

# 1.3 seconds
_partial_dependence_brute(
    model, grid.reshape(-1, 1), [0], X, response_method="predict"
)[0].flatten()[0:4]  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])
We can think about this. The reason for the current approach is that aggregation of the results is easy; we simply calculate and combine the partial dependence vectors. In (ugly) pseudo code:
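Roughly, following the standard definition of pairwise H² from Friedman & Popescu (2008), the aggregation for a feature pair (j, k) could be sketched as follows (illustrative code, not the PR's implementation):

import numpy as np

def h2_pairwise(pd_j, pd_k, pd_jk):
    # pd_j, pd_k: univariate partial dependence evaluated at each row's
    # observed x_j and x_k; pd_jk: bivariate partial dependence evaluated at
    # the observed (x_j, x_k) pairs. All are centered before combining.
    pd_j = pd_j - pd_j.mean()
    pd_k = pd_k - pd_k.mean()
    pd_jk = pd_jk - pd_jk.mean()
    numerator = np.sum((pd_jk - pd_j - pd_k) ** 2)
    denominator = np.sum(pd_jk ** 2)
    return numerator / denominator  # relative H^2; the numerator is the absolute part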
Handling of categorical features is probably easier with the exact approach. Conceptually, the two approaches are quite close: if you replace the original data values by their corresponding grid values, you get an approximation of the data. Then you apply the exact H-statistic algorithm. This gives the same result as the approach via grids.
That would be very neat! Of course, if you want to calculate the statistic only for a single pair, you can simply call the function with these two features.
Great suggestions! We can change the name of the function, as well as the output API (which is also not ideal).
That would be fantastic, thanks a lot.
`output_dim` equals the number of values predicted per observation.
For single-output regression and binary classification, `output_dim` is 1.

feature_pairs : list of length n_feature_pairs
I think that it would be better to store the pair of original keys, meaning that if a user passes strings, we should store those instead of always storing indices.
I opened #30111 to discuss a parameter to deal with the memory/speed trade-off. I think it could be a good addition.
Quick pass on some parameter naming
@antoinebaker would you be so kind as to give a review here?
Hi @mayer79 and @glemaitre, what is the current status of this PR? If I understood correctly, this PR implements its own partial dependence computation through _calculate_pd_brute_fast. Should this PR wait for #30111 to be finalized before moving on, and refactor to use the new implementation?
The issue with #30111 is that it doesn't really seem to be doing what it intends to do, so we shouldn't wait for that.
In terms of scope, it would be nice to merge this feature. In terms of code, I recall that I wanted to avoid repeating some common code with partial dependence for maintainability. When it comes to #30111, the idea was to get a way to limit the memory consumption at the cost of computation. But reading back the experiments, they were not conclusive. I assume that we might go forward with a first version and then we could always try to improve it later.
Co-authored-by: Quentin Barthélemy <[email protected]>
Reference Issues/PRs
Implements #22383
What does this implement/fix? Explain your changes.
@lorentzenchr
This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when adding more than 6-8 features. The basic strategy is to select e.g. the top 5 predictors via permutation importance and then crunch the corresponding pairwise (absolute and relative) interaction strength statistics.
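As a rough usage sketch of that workflow (the function name h_statistic and its signature are placeholders for illustration, not the final API of this PR):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# 1) Select the top 5 predictors via permutation importance.
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top5 = np.argsort(imp.importances_mean)[::-1][:5]

# 2) Hypothetical call computing pairwise interaction strengths for them only.
# result = h_statistic(model, X, features=top5)
# H2 = result.numerator_pairwise / result.denominator_pairwise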
(My) reference implementation: https://github.com/mayer79/hstats
Any other comments?