
ENH Add Friedman's H-squared #28375


Open · wants to merge 80 commits into main

Conversation

mayer79
Contributor

@mayer79 mayer79 commented Feb 6, 2024

Reference Issues/PRs

Implements #22383

What does this implement/fix? Explain your changes.

@lorentzenchr

This PR implements a clean version of Friedman's H^2 statistic of pairwise interaction strength. It uses a couple of tricks to speed up the calculations. Still, one needs to be cautious when adding more than 6-8 features. The basic strategy is to select, e.g., the top 5 predictors via permutation importance and then compute the corresponding pairwise (absolute and relative) interaction strength statistics.

(My) reference implementation: https://github.com/mayer79/hstats
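
To make the basic strategy above concrete, here is a minimal sketch. The function name `h_statistic` and its signature and output are placeholders (the final API of this PR is still under discussion); everything else uses existing scikit-learn calls.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

# Step 1: keep only the most important features (H-statistics get expensive
# beyond roughly 6-8 features).
importances = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top_features = np.argsort(importances.importances_mean)[::-1][:5]

# Step 2 (hypothetical API, names not final in this PR):
# result = h_statistic(model, X=X, features=top_features, random_state=0)
# result.h_squared_pairwise  # relative pairwise interaction strengths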

Any other comments?

  • The implementation also works for multi-output or multi-class classification.
  • Plots might follow in a later PR.
  • Univariate H-statistics also exist, but I have not added them (yet). They measure the proportion of prediction variability explained only by interactions involving feature j (see the formula sketch below). We need to keep this in mind when thinking about the output API.
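
For reference, my reading of the univariate definition in Friedman & Popescu (2008) is roughly the following, with $F$ the centered prediction function and $PD$ denoting centered partial dependence functions:

$$H^2_j = \frac{\sum_{i=1}^{n} \big[ F(x_i) - PD_j(x_{ij}) - PD_{\setminus j}(x_{i,\setminus j}) \big]^2}{\sum_{i=1}^{n} F(x_i)^2},$$

i.e. the share of prediction variability that cannot be reproduced without interactions involving feature $j$.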


github-actions bot commented Feb 6, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


--- build_tools/circle/list_versions.py
+++ build_tools/circle/list_versions.py
@@ -71,9 +71,7 @@
     "Web-based documentation is available for versions listed below:\n",
 ]
 
-ROOT_URL = (
-    "https://api.github.com/repos/scikit-learn/scikit-learn.github.io/contents/"  # noqa
-)
+ROOT_URL = "https://api.github.com/repos/scikit-learn/scikit-learn.github.io/contents/"  # noqa
 RAW_FMT = "https://raw.githubusercontent.com/scikit-learn/scikit-learn.github.io/master/%s/index.html"  # noqa
 VERSION_RE = re.compile(r"scikit-learn ([\w\.\-]+) documentation</title>")
 NAMED_DIRS = ["dev", "stable"]

--- examples/applications/plot_species_distribution_modeling.py
+++ examples/applications/plot_species_distribution_modeling.py
@@ -109,7 +109,7 @@
 
 
 def plot_species_distribution(
-    species=("bradypus_variegatus_0", "microryzomys_minutus_0")
+    species=("bradypus_variegatus_0", "microryzomys_minutus_0"),
 ):
     """
     Plot the species distribution.

--- examples/ensemble/plot_bias_variance.py
+++ examples/ensemble/plot_bias_variance.py
@@ -177,8 +177,8 @@
 
     plt.subplot(2, n_estimators, n_estimators + n + 1)
     plt.plot(X_test, y_error, "r", label="$error(x)$")
-    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),
-    plt.plot(X_test, y_var, "g", label="$variance(x)$"),
+    (plt.plot(X_test, y_bias, "b", label="$bias^2(x)$"),)
+    (plt.plot(X_test, y_var, "g", label="$variance(x)$"),)
     plt.plot(X_test, y_noise, "c", label="$noise(x)$")
 
     plt.xlim([-5, 5])

--- examples/linear_model/plot_tweedie_regression_insurance_claims.py
+++ examples/linear_model/plot_tweedie_regression_insurance_claims.py
@@ -604,8 +604,9 @@
             "predicted, frequency*severity model": np.sum(
                 exposure * glm_freq.predict(X) * glm_sev.predict(X)
             ),
-            "predicted, tweedie, power=%.2f"
-            % glm_pure_premium.power: np.sum(exposure * glm_pure_premium.predict(X)),
+            "predicted, tweedie, power=%.2f" % glm_pure_premium.power: np.sum(
+                exposure * glm_pure_premium.predict(X)
+            ),
         }
     )
 

--- examples/manifold/plot_lle_digits.py
+++ examples/manifold/plot_lle_digits.py
@@ -10,7 +10,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 # %%
 # Load digits dataset
 # -------------------

--- examples/manifold/plot_manifold_sphere.py
+++ examples/manifold/plot_manifold_sphere.py
@@ -50,7 +50,7 @@
 t = random_state.rand(n_samples) * np.pi
 
 # Sever the poles from the sphere.
-indices = (t < (np.pi - (np.pi / 8))) & (t > ((np.pi / 8)))
+indices = (t < (np.pi - (np.pi / 8))) & (t > (np.pi / 8))
 colors = p[indices]
 x, y, z = (
     np.sin(t[indices]) * np.cos(p[indices]),

--- sklearn/_loss/tests/test_loss.py
+++ sklearn/_loss/tests/test_loss.py
@@ -215,7 +215,8 @@
 
 
 @pytest.mark.parametrize(
-    "loss, y_pred_success, y_pred_fail", Y_COMMON_PARAMS + Y_PRED_PARAMS  # type: ignore
+    "loss, y_pred_success, y_pred_fail",
+    Y_COMMON_PARAMS + Y_PRED_PARAMS,  # type: ignore
 )
 def test_loss_boundary_y_pred(loss, y_pred_success, y_pred_fail):
     """Test boundaries of y_pred for loss functions."""
@@ -493,12 +494,14 @@
         sample_weight=sample_weight,
         loss_out=out_l1,
     )
-    loss.closs.loss(
-        y_true=y_true,
-        raw_prediction=raw_prediction,
-        sample_weight=sample_weight,
-        loss_out=out_l2,
-    ),
+    (
+        loss.closs.loss(
+            y_true=y_true,
+            raw_prediction=raw_prediction,
+            sample_weight=sample_weight,
+            loss_out=out_l2,
+        ),
+    )
     assert_allclose(out_l1, out_l2)
     loss.gradient(
         y_true=y_true,

--- sklearn/cluster/_feature_agglomeration.py
+++ sklearn/cluster/_feature_agglomeration.py
@@ -6,7 +6,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numpy as np
 from scipy.sparse import issparse
 

--- sklearn/cross_decomposition/tests/test_pls.py
+++ sklearn/cross_decomposition/tests/test_pls.py
@@ -404,12 +404,12 @@
 
     X_orig = X.copy()
     with pytest.raises(AssertionError):
-        pls.transform(X, Y, copy=False),
+        (pls.transform(X, Y, copy=False),)
         assert_array_almost_equal(X, X_orig)
 
     X_orig = X.copy()
     with pytest.raises(AssertionError):
-        pls.predict(X, copy=False),
+        (pls.predict(X, copy=False),)
         assert_array_almost_equal(X, X_orig)
 
     # Make sure copy=True gives same transform and predictions as predict=False

--- sklearn/ensemble/_bagging.py
+++ sklearn/ensemble/_bagging.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import itertools
 import numbers
 from abc import ABCMeta, abstractmethod

--- sklearn/ensemble/_forest.py
+++ sklearn/ensemble/_forest.py
@@ -35,7 +35,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import threading
 from abc import ABCMeta, abstractmethod
 from numbers import Integral, Real

--- sklearn/ensemble/tests/test_forest.py
+++ sklearn/ensemble/tests/test_forest.py
@@ -168,11 +168,12 @@
     reg = ForestRegressor(n_estimators=5, criterion=criterion, random_state=1)
     reg.fit(X_reg, y_reg)
     score = reg.score(X_reg, y_reg)
-    assert (
-        score > 0.93
-    ), "Failed with max_features=None, criterion %s and score = %f" % (
-        criterion,
-        score,
+    assert score > 0.93, (
+        "Failed with max_features=None, criterion %s and score = %f"
+        % (
+            criterion,
+            score,
+        )
     )
 
     reg = ForestRegressor(

--- sklearn/experimental/enable_hist_gradient_boosting.py
+++ sklearn/experimental/enable_hist_gradient_boosting.py
@@ -13,7 +13,6 @@
 # Don't remove this file, we don't want to break users code just because the
 # feature isn't experimental anymore.
 
-
 import warnings
 
 warnings.warn(

--- sklearn/feature_selection/_univariate_selection.py
+++ sklearn/feature_selection/_univariate_selection.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from numbers import Integral, Real
 

--- sklearn/gaussian_process/tests/test_gpc.py
+++ sklearn/gaussian_process/tests/test_gpc.py
@@ -147,8 +147,9 @@
     # Define a dummy optimizer that simply tests 10 random hyperparameters
     def optimizer(obj_func, initial_theta, bounds):
         rng = np.random.RandomState(global_random_seed)
-        theta_opt, func_min = initial_theta, obj_func(
-            initial_theta, eval_gradient=False
+        theta_opt, func_min = (
+            initial_theta,
+            obj_func(initial_theta, eval_gradient=False),
         )
         for _ in range(10):
             theta = np.atleast_1d(

--- sklearn/gaussian_process/tests/test_gpr.py
+++ sklearn/gaussian_process/tests/test_gpr.py
@@ -394,8 +394,9 @@
     # Define a dummy optimizer that simply tests 50 random hyperparameters
     def optimizer(obj_func, initial_theta, bounds):
         rng = np.random.RandomState(0)
-        theta_opt, func_min = initial_theta, obj_func(
-            initial_theta, eval_gradient=False
+        theta_opt, func_min = (
+            initial_theta,
+            obj_func(initial_theta, eval_gradient=False),
         )
         for _ in range(50):
             theta = np.atleast_1d(

--- sklearn/linear_model/_linear_loss.py
+++ sklearn/linear_model/_linear_loss.py
@@ -509,9 +509,9 @@
             if l2_reg_strength > 0:
                 # The L2 penalty enters the Hessian on the diagonal only. To add those
                 # terms, we use a flattened view on the array.
-                hess.reshape(-1)[
-                    : (n_features * n_dof) : (n_dof + 1)
-                ] += l2_reg_strength
+                hess.reshape(-1)[: (n_features * n_dof) : (n_dof + 1)] += (
+                    l2_reg_strength
+                )
 
             if self.fit_intercept:
                 # With intercept included as added column to X, the hessian becomes

--- sklearn/linear_model/_ridge.py
+++ sklearn/linear_model/_ridge.py
@@ -5,7 +5,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numbers
 import warnings
 from abc import ABCMeta, abstractmethod

--- sklearn/linear_model/_theil_sen.py
+++ sklearn/linear_model/_theil_sen.py
@@ -5,7 +5,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from itertools import combinations
 from numbers import Integral, Real

--- sklearn/manifold/_spectral_embedding.py
+++ sklearn/manifold/_spectral_embedding.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from numbers import Integral, Real
 

--- sklearn/metrics/_classification.py
+++ sklearn/metrics/_classification.py
@@ -10,7 +10,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from numbers import Integral, Real
 

--- sklearn/metrics/_ranking.py
+++ sklearn/metrics/_ranking.py
@@ -10,7 +10,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from functools import partial
 from numbers import Integral, Real

--- sklearn/metrics/cluster/_supervised.py
+++ sklearn/metrics/cluster/_supervised.py
@@ -7,7 +7,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import warnings
 from math import log
 from numbers import Real

--- sklearn/metrics/cluster/_unsupervised.py
+++ sklearn/metrics/cluster/_unsupervised.py
@@ -3,7 +3,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import functools
 from numbers import Integral
 

--- sklearn/metrics/tests/test_common.py
+++ sklearn/metrics/tests/test_common.py
@@ -975,7 +975,8 @@
 @pytest.mark.parametrize("metric", CLASSIFICATION_METRICS.values())
 @pytest.mark.parametrize(
     "y_true, y_score",
-    invalids_nan_inf +
+    invalids_nan_inf
+    +
     # Add an additional case for classification only
     # non-regression test for:
     # https://github.com/scikit-learn/scikit-learn/issues/6809
@@ -2005,7 +2006,6 @@
 
 
 def check_array_api_metric_pairwise(metric, array_namespace, device, dtype_name):
-
     X_np = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dtype=dtype_name)
     Y_np = np.array([[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]], dtype=dtype_name)
 

--- sklearn/model_selection/_validation.py
+++ sklearn/model_selection/_validation.py
@@ -6,7 +6,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 import numbers
 import time
 import warnings

--- sklearn/multioutput.py
+++ sklearn/multioutput.py
@@ -8,7 +8,6 @@
 # Authors: The scikit-learn developers
 # SPDX-License-Identifier: BSD-3-Clause
 
-
 from abc import ABCMeta, abstractmethod
 from numbers import Integral
 

--- sklearn/neighbors/tests/test_neighbors.py
+++ sklearn/neighbors/tests/test_neighbors.py
@@ -650,10 +650,12 @@
             assert_allclose(np.concatenate(list(ind)), np.concatenate(list(ind1)))
 
         for i in range(len(results) - 1):
-            assert_allclose(
-                np.concatenate(list(results[i][0])),
-                np.concatenate(list(results[i + 1][0])),
-            ),
+            (
+                assert_allclose(
+                    np.concatenate(list(results[i][0])),
+                    np.concatenate(list(results[i + 1][0])),
+                ),
+            )
             assert_allclose(
                 np.concatenate(list(results[i][1])),
                 np.concatenate(list(results[i + 1][1])),

--- sklearn/tests/test_common.py
+++ sklearn/tests/test_common.py
@@ -318,7 +318,6 @@
     "transformer", GET_FEATURES_OUT_ESTIMATORS, ids=_get_check_estimator_ids
 )
 def test_transformers_get_feature_names_out(transformer):
-
     with ignore_warnings(category=(FutureWarning)):
         check_transformer_get_feature_names_out(
             transformer.__class__.__name__, transformer

--- sklearn/tests/test_metaestimators.py
+++ sklearn/tests/test_metaestimators.py
@@ -155,11 +155,12 @@
             if method in delegator_data.skip_methods:
                 continue
             assert hasattr(delegate, method)
-            assert hasattr(
-                delegator, method
-            ), "%s does not have method %r when its delegate does" % (
-                delegator_data.name,
-                method,
+            assert hasattr(delegator, method), (
+                "%s does not have method %r when its delegate does"
+                % (
+                    delegator_data.name,
+                    method,
+                )
             )
             # delegation before fit raises a NotFittedError
             if method == "score":
@@ -189,11 +190,12 @@
             delegate = SubEstimator(hidden_method=method)
             delegator = delegator_data.construct(delegate)
             assert not hasattr(delegate, method)
-            assert not hasattr(
-                delegator, method
-            ), "%s has method %r when its delegate does not" % (
-                delegator_data.name,
-                method,
+            assert not hasattr(delegator, method), (
+                "%s has method %r when its delegate does not"
+                % (
+                    delegator_data.name,
+                    method,
+                )
             )
 
 

--- sklearn/utils/_metadata_requests.py
+++ sklearn/utils/_metadata_requests.py
@@ -1098,8 +1098,9 @@
             method_mapping = MethodMapping()
             for method in METHODS:
                 method_mapping.add(caller=method, callee=method)
-            yield "$self_request", RouterMappingPair(
-                mapping=method_mapping, router=self._self_request
+            yield (
+                "$self_request",
+                RouterMappingPair(mapping=method_mapping, router=self._self_request),
             )
         for name, route_mapping in self._route_mappings.items():
             yield (name, route_mapping)

--- sklearn/utils/tests/test_multiclass.py
+++ sklearn/utils/tests/test_multiclass.py
@@ -416,12 +416,13 @@
 def test_type_of_target():
     for group, group_examples in EXAMPLES.items():
         for example in group_examples:
-            assert (
-                type_of_target(example) == group
-            ), "type_of_target(%r) should be %r, got %r" % (
-                example,
-                group,
-                type_of_target(example),
+            assert type_of_target(example) == group, (
+                "type_of_target(%r) should be %r, got %r"
+                % (
+                    example,
+                    group,
+                    type_of_target(example),
+                )
             )
 
     for example in NON_ARRAY_LIKE_EXAMPLES:

--- sklearn/utils/tests/test_seq_dataset.py
+++ sklearn/utils/tests/test_seq_dataset.py
@@ -154,30 +154,34 @@
 
 def test_buffer_dtype_mismatch_error():
     with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-        ArrayDataset64(X32, y32, sample_weight32, seed=42),
+        (ArrayDataset64(X32, y32, sample_weight32, seed=42),)
 
     with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-        ArrayDataset32(X64, y64, sample_weight64, seed=42),
+        (ArrayDataset32(X64, y64, sample_weight64, seed=42),)
 
     for csr_container in CSR_CONTAINERS:
         X_csr32 = csr_container(X32)
         X_csr64 = csr_container(X64)
         with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-            CSRDataset64(
-                X_csr32.data,
-                X_csr32.indptr,
-                X_csr32.indices,
-                y32,
-                sample_weight32,
-                seed=42,
-            ),
+            (
+                CSRDataset64(
+                    X_csr32.data,
+                    X_csr32.indptr,
+                    X_csr32.indices,
+                    y32,
+                    sample_weight32,
+                    seed=42,
+                ),
+            )
 
         with pytest.raises(ValueError, match="Buffer dtype mismatch"):
-            CSRDataset32(
-                X_csr64.data,
-                X_csr64.indptr,
-                X_csr64.indices,
-                y64,
-                sample_weight64,
-                seed=42,
-            ),
+            (
+                CSRDataset32(
+                    X_csr64.data,
+                    X_csr64.indptr,
+                    X_csr64.indices,
+                    y64,
+                    sample_weight64,
+                    seed=42,
+                ),
+            )

33 files would be reformatted, 888 files already formatted

Generated for commit: 93e0b9a. Link to the linter CI: here

@lorentzenchr lorentzenchr marked this pull request as draft February 8, 2024 10:23
@lorentzenchr lorentzenchr changed the title from ENH Add Friedman's H-squared (WIP - DO NOT MERGE) to ENH Add Friedman's H-squared Feb 8, 2024
@lorentzenchr
Member

@mayer79 Thanks for working on this important inspection tool. To get rid of the linter issues, you might use a pre-commit hook, see https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute.

@amueller @glemaitre @adrinjalali ping as this might interest you.

@lorentzenchr lorentzenchr left a comment (Member)

A first quick pass. Maybe _partial_dependence_brute can help with the tests.

@lorentzenchr lorentzenchr linked an issue Feb 8, 2024 that may be closed by this pull request
@lorentzenchr
Member

lorentzenchr commented Feb 9, 2024

I keep struggling with the fact that "Friedman's H-statistic" is actually an H-squared.

The naming will pop up during further review anyway. One possibility would be h2_statistics.

@glemaitre
Member

@glemaitre: I will try to revive this every couple of months. How can we proceed?

Basically, the ball is more on my side: I need to get through the literature before I can provide a meaningful review. I'll do my best to start after the release of 1.5.2 and push it for the 1.6 release.

@glemaitre
Member

OK, this time it's promised: I will really focus on reviewing this PR.

I'll first look at the core implementation. I already have some comments regarding naming, but I don't think that this is important in the first pass. Again, sorry @mayer79 for the delay.

I'll push a first commit to resolve the conflict.

@glemaitre
Member

@mayer79 I have a couple of high-level questions (with high variance regarding the topic):

  • do you recall the gain of reimplementing the brute-force partial dependence instead of reusing the function _partial_dependence_brute;
  • the API of partial_dependence generates a grid from the data. Here, we sample instead. What are your thoughts on adopting the same API for consistency? Alternatively, we could think of extending the grid_resolution API such that it only takes a subsample instead of creating new data points.
  • I'm wondering if we should extend features such that it takes an array of tuples to only compute the two-way H-statistics for a pair of features? Basically, when passing the feature indices, you can compute for each possible pair, and we could also store the overall interaction strength in the bunch.
  • I said I would speak about it later, but let's discuss it now as well :) About the naming, I'm wondering if we could come up with a name that relates to the functionality of the feature instead of the pure statistical definition. I would be more comfortable with something like feature_interactions or friedman_feature_interaction. Then, we need to make sure that the documentation mentions the H-statistic, as it does now.

I'll probably make a PR on your own fork regarding some code styling that would be too annoying to request via a code review.

n = X.shape[0]
n_grid = grid.shape[0]

X_stacked = _safe_indexing(X, np.tile(np.arange(n), n_grid), axis=0)
Member

So I assume that the speed-up observed between this function _calculate_pd_brute_fast and _partial_dependence_brute is only related to stacking all samples in a single matrix and calling .predict_proba a single time.

So basically, we have a speed/memory trade-off. Here, we might blow up the memory with a large dataset if we decide not to subsample.
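
As an illustration of the stacking idea (a minimal sketch, not the PR's actual _calculate_pd_brute_fast):

import numpy as np

def pd_brute_stacked(predict, X, feature_idx, grid):
    """Brute-force partial dependence with a single predict call on stacked data."""
    n, n_grid = X.shape[0], len(grid)
    # Repeat the background data once per grid value ...
    X_stacked = np.tile(X, (n_grid, 1))
    # ... and overwrite the feature of interest with the corresponding grid value.
    X_stacked[:, feature_idx] = np.repeat(grid, n)
    # One predict call on n * n_grid rows instead of n_grid separate calls,
    # then average over the background rows for each grid value.
    return predict(X_stacked).reshape(n_grid, n).mean(axis=1)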

Member

To ease maintenance, I'm really leaning towards using the _partial_dependence_brute implementation.

However, the pattern here shows that we can get a good speed-up by concatenating data, but we probably need to think about a chunking strategy so as not to blow up memory.
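
One possible shape of such a chunking strategy (just a sketch with an arbitrary max_rows parameter, not something this PR implements):

import numpy as np

def predict_in_chunks(predict, X_stacked, max_rows=10_000):
    """Call predict on row slices so the peak memory of intermediates stays bounded."""
    out = [
        predict(X_stacked[start:start + max_rows])
        for start in range(0, X_stacked.shape[0], max_rows)
    ]
    return np.concatenate(out)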

"""

# Select grid columns and remove duplicates (will compensate below)
grid = _safe_indexing(X, feature_indices, axis=1)
Member

This is where I'm thinking that we could reuse _grid_from_X instead. I don't know if taking quantiles will actually have a statistical impact?

Contributor Author

Using quantiles is a strategy. However, the hard part of the calculations is the 2D partial dependence. If you work with a grid size of 50, the resulting grid (using only existing combinations) will be almost as large as the selected n = 500 rows. There is, additionally, the complication of distinguishing discrete from continuous features. This does not mean we should not go for the quantile strategy.
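
For illustration, a quantile-based grid for a single numeric feature could look like the sketch below (a rough stand-in for what _grid_from_X does with grid_resolution, not the PR's code):

import numpy as np

def quantile_grid(x, grid_resolution=50):
    """Unique empirical quantiles of a 1d numeric feature, used as grid values."""
    return np.unique(np.quantile(x, np.linspace(0, 1, grid_resolution)))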

sample_weight : array-like of shape (n_samples,), default=None
Sample weights used in calculating partial dependencies.

n_max : int, default=500
Member

I think that we call this subsample in other places like KBinsDiscretizer. I would probably keep the same naming.

Comment on lines +198 to +205
numerator_pairwise : ndarray of shape (n_pairs, output_dim)
Numerator of pairwise H-squared statistic.
Useful to see which feature pair has strongest absolute interaction.
Take square-root to get values on the scale of the predictions.

denominator_pairwise : ndarray of shape (n_pairs, output_dim)
Denominator of pairwise H-squared statistic. Used for appropriate
normalization of H.
Member

What is the reason to store those individually? Would storing the H**2 be enough for the majority of use cases?

Contributor Author

What is the reason to store those individually? Would storing the H**2 be enough for the majority of use cases?

In practice, both the relative H² and the numerator (absolute measure) are useful.
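
For reference, with mean-centered partial dependence functions, the pairwise statistic of Friedman & Popescu (2008) is roughly

$$H^2_{jk} = \frac{\sum_{i=1}^{n} \big[ PD_{jk}(x_{ij}, x_{ik}) - PD_j(x_{ij}) - PD_k(x_{ik}) \big]^2}{\sum_{i=1}^{n} PD_{jk}(x_{ij}, x_{ik})^2},$$

so the numerator carries the absolute interaction strength and the denominator only normalizes it.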

@mayer79
Contributor Author

mayer79 commented Oct 18, 2024

Thanks a lot for the excellent high-level view!

@mayer79 I have a couple of high-level questions (with high variance regarding the topic):

  • do you recall the gain of reimplementing the brute-force partial dependence instead of reusing the function _partial_dependence_brute;

Speed matters. _partial_dependence_brute() is relatively slow since it calls predict() once per grid value. The fast implementation makes a single call on the stacked data and thus trades memory for speed. Ideally, we could replace the slow version by the fast one altogether (with some API changes). But this clashes with users that pass large training data as background data (the partial dependence function exposed to the user unfortunately does not do any subsampling of the background data).

This example shows a speed-up by a factor of 10, but it might be exaggerated.

import numpy as np

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection._partial_dependence import _partial_dependence_brute

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
model = RandomForestRegressor(n_jobs=8).fit(X, y)
grid = np.linspace(0, 1, 100)

# 0.1 seconds
pd = _calculate_pd_brute_fast(model.predict, X, 0, grid=grid)
pd[0:4].flatten()  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])

# 1.3 seconds
_partial_dependence_brute(
    model, grid.reshape(-1, 1), [0], X, response_method="predict"
)[0].flatten()[0:4]  # array([-10.58987152, -10.56758906, -10.45100341, -10.4694109 ])
  • the API of partial_dependence generates a grid from the data. Here, we sample instead. What are your thoughts on adopting the same API for consistency? Alternatively, we could think of extending the grid_resolution API such that it only takes a subsample instead of creating new data points.

We can think about this. The reason for the current approach is that aggregation of the results is easy. We simply calculate $\operatorname{mean}_i\,(pd_{ijk} - pd_{ij} - pd_{ik})^2$ over all sampled rows $i$. We could, alternatively, work with a bivariate grid and do the aggregation over the grid, using the number of observations falling into each bivariate bin as weights. This would be an approximation of the exact statistic, but it might be faster, and we would not need to do the slightly unsafe deduplication of observed values.

In (ugly) pseudo code:

pd_j, univariate_grids = partial_dep(model, [feat_1, feat_2, ...], data)
for (feat_j, feat_k), (grid_j, grid_k) in pairs of features and their univariate grids:
    bivariate_grid = cartesian_product(grid_j, grid_k)
    counts_jk = number of observations falling into each bivariate grid bin
    relevant_grid = bivariate_grid[counts_jk > 0]
    pd_jk = partial_dep(model, [(feat_j, feat_k)], relevant_grid, data)
    h_stat_jk = weighted_mean((pd_jk - pd_j[grid value of feat_j] - pd_k[grid value of feat_k])**2, weights=counts_jk)

Handling of categorical features is probably easier with the exact approach. Conceptually, the two approaches are quite close: if you replace the original data values by their corresponding grid values, you get an approximation of the data; then you apply the exact H-statistic algorithm. This gives the same result as the approach via grids.
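
For concreteness, a minimal numpy sketch of the exact per-row aggregation described above, assuming the three centered partial dependence vectors are already evaluated on the same sampled rows (hypothetical variable names):

import numpy as np

def pairwise_h2(pd_jk, pd_j, pd_k):
    """Relative H^2 and its numerator from centered partial dependence values."""
    numerator = np.mean((pd_jk - pd_j - pd_k) ** 2)  # absolute interaction strength
    denominator = np.mean(pd_jk ** 2)                # normalization
    return numerator / denominator, numerator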

  • I'm wondering if we should extend features such that it takes an array of tuples to only compute the two-way H-statistics for a pair of features? Basically, when passing the feature indices, you can compute for each possible pair, and we could also store the overall interaction strength in the bunch.

That would be very neat! Of course, if you want to calculate the statistic only for a single pair, you can simply call the function with these two features.

  • I said I would speak about it later, but let's discuss it now as well :) About the naming, I'm wondering if we could come up with a name that relates to the functionality of the feature instead of the pure statistical definition. I would be more comfortable with something like feature_interactions or friedman_feature_interaction. Then, we need to make sure that the documentation mentions the H-statistic, as it does now.

Great suggestions! We can change the name of the function, as well as the output API (which is also not ideal).

I'll probably make a PR on your own fork regarding some code styling that would be too annoying to request via a code review.

That would be fantastic, thanks a lot.

`output_dim` equals the number of values predicted per observation.
For single-output regression and binary classification, `output_dim` is 1.

feature_pairs : list of length n_feature_pairs
Member

I think that it would be better to store the pair of original keys, meaning that if a user passes strings, we should store those instead of always storing indices.

@glemaitre
Member

I opened #30111 to discuss a parameter to deal with the memory/speed trade-off. I think it could be a good addition.

Quick pass on some parameter naming
@glemaitre glemaitre modified the milestones: 1.6, 1.7 Oct 29, 2024
@lorentzenchr lorentzenchr added the Waiting for Second Reviewer label Feb 25, 2025
@adrinjalali
Member

@antoinebaker would you be so kind as to give a review here?

@antoinebaker
Contributor

antoinebaker commented Jun 4, 2025

Hi @mayer79 and @glemaitre, what is the current status of this PR?

If I understood correctly, this PR implements its own partial dependence computation through _calculate_pd_brute_fast, which is faster but could use a lot of memory. Following #28375 (comment), it seems that the way to go is to first improve the partial_dependence API, allowing one to choose among different strategies (subsampling, trimming duplicates, a compressed grid using quantiles, controlling the memory/speed trade-off).

Should this PR wait for #30111 to be finalized before moving on, and then refactor to use the new _partial_dependence_brute?

@adrinjalali
Member

adrinjalali commented Jun 4, 2025

The issue with #30111 is that it doesn't really seem to be doing what it intends to do, so we shouldn't wait for that.

@glemaitre
Member

Hi @mayer79 and @glemaitre, what is the current status of this PR?

In terms of scope, it would be nice to merge this feature. In terms of code, I recall that I wanted to avoid repeating some common code with partial dependence, for maintainability.

When it comes to #30111, the idea was to get a way to limit the impact of memory consumption at the cost of computation. But reading back the experiments, they were not conclusive. I assume that we might go forward with a first version, and then we can always try to improve it later.

Co-authored-by: Quentin Barthélemy <[email protected]>
Labels
module:inspection, Waiting for Second Reviewer (First reviewer is done, need a second one!)
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Add friedman's H statistic
8 participants