[Meta] consistently break ties randomly in scikit-learn estimators (with random_state) in an unbiased way #23728

@ogrisel

Description

Some estimators have arbitrary ways to break ties:

  • k-nearest neighbors with data points lying on a uniform 1d grid (see #23667, "BallTree.query returns inconsistent indices between scikit-learn versions 0.24.1 and 1.1.1", for the BallTree for instance);
  • tied splits on different features in histogram gradient boosted trees with redundant features;
  • decision tree splits on the same feature with equivalent thresholds: for X = [[0], [1], [2], [3]] and y = [0, 1, 1, 0], X > 0.5 and X > 2.5 are tied splits but only X > 0.5 is considered (see the sketch after this list);
  • possibly other models (please feel free to edit this list or suggest missing estimators in the comments).
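
A minimal sketch of the decision tree case above, assuming the splitter keeps the first of two equally good thresholds; the exact threshold reported may differ across scikit-learn versions, but it does not change with random_state:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 1, 0])

# Splitting at 0.5 and at 2.5 yields the same impurity reduction,
# yet the fitted threshold stays the same for every seed.
for seed in range(3):
    tree = DecisionTreeClassifier(max_depth=1, random_state=seed).fit(X, y)
    print(export_text(tree, feature_names=["x0"]))
```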

If the tie-breaking logic is deterministic, it can introduce a non-controllable bias in a data science pipeline. For instance, when analyzing the feature importance of a histogram gradient boosting model (via permutations or SHAP values), the first feature of a group of redundant features would always be deterministically picked up by the model, which could lead a naive data scientist to believe that the other features of the group are not as predictive.
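
A minimal sketch of that redundant-feature scenario (the dataset and the exact importance split are illustrative assumptions, not a guaranteed outcome in every version):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x = rng.normal(size=500)
X = np.c_[x, x]            # two perfectly redundant features
y = x + rng.normal(scale=0.1, size=500)

model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# With deterministic tie breaking, one of the two redundant columns tends
# to absorb all the splits and hence all the permutation importance.
print(result.importances_mean)
```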

Note that this is not the case for the traditional DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor and extra trees estimators, because they all shuffle features (controllably via random_state) by default, even when max_features == 1.0. This makes it easy to run the same study several times with different seeds and check whether the results are an artifact of arbitrary tie breaking.
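
A minimal sketch of that seed-variation check (the estimator choice and the exact importance values are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
x = rng.normal(size=500)
X = np.c_[x, x]            # two perfectly redundant features
y = x + rng.normal(scale=0.1, size=500)

# Because the forest shuffles features before evaluating splits, re-running
# with different seeds redistributes the importance between the redundant
# columns instead of always favoring the first one.
for seed in range(3):
    forest = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)
    print(seed, forest.feature_importances_)
```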
