Some estimators have arbitrary ways to break ties:
- k-nearest neighbors with data points lying on a uniform 1d grid (see #23667, "BallTree.query returns inconsistent indices between scikit-learn versions 0.24.1 and 1.1.1", for the BallTree for instance);
- tied splits on different features in histogram gradient boosted trees with redundant features;
- decision tree splits on the same feature with equivalent thresholds: with `X = [[0], [1], [2], [3]]` and `y = [0, 1, 1, 0]`, `X > 0.5` and `X > 2.5` are tied splits but only `X > 0.5` is considered (see the sketch after this list);
- possibly other models (please feel free to edit this list or suggest missing estimators in the comments).
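
A minimal sketch reproducing the decision tree example above: both candidate thresholds reduce impurity by the same amount, yet the splitter deterministically keeps only one of them. The exact threshold printed may depend on the scikit-learn version.

```python
# Sketch of the tied-threshold example: X > 0.5 and X > 2.5 are equally
# good splits, but the splitter deterministically keeps only one.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# Inspect the threshold chosen for the root split (node 0).
print(tree.tree_.threshold[0])  # typically 0.5, never 2.5
```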
If the tie-breaking logic is deterministic, then it might introduce a non-controllable bias in a data science pipeline. For instance, when analyzing the feature importance of a histogram gradient boosting model (via permutations or SHAP values), the first feature of a group of redundant features would always deterministically be picked up by the model, which could lead a naive data scientist to believe that the other features of the group are not as predictive (see the sketch below).
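
A hypothetical illustration of this bias, assuming two exact copies of the same informative feature; which copy gets credited may depend on the data and the scikit-learn version:

```python
# Two identical copies of one informative feature are fed to a histogram
# gradient boosting model; permutation importance then typically credits
# only one of them, even though they carry the same information.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x = rng.uniform(size=(1000, 1))
X = np.hstack([x, x])          # feature 1 is an exact copy of feature 0
y = 3 * x.ravel() + rng.normal(scale=0.1, size=1000)

model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# One of the two redundant columns typically gets nearly all the importance.
print(result.importances_mean)
```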
Note that this is not the case for our traditional DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor and extra trees, because they all do feature shuffling (controllable by random_state) by default, even when max_features == 1.0. This makes it easy to conduct the same study many times with different seeds to see whether the results are an artifact of arbitrary tie breaking or not (see the sketch below).
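
A sketch of that seed-based sanity check, again assuming two identical copies of the same feature: refitting a random forest with different seeds spreads the credit between the redundant columns if the tie is indeed arbitrary.

```python
# Random forests shuffle features at each split (controlled by random_state)
# even with max_features=1.0, so refitting with different seeds reveals
# whether a feature's dominance is an artifact of arbitrary tie breaking.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
x = rng.uniform(size=(500, 1))
X = np.hstack([x, x])          # two identical copies of the same feature
y = 3 * x.ravel() + rng.normal(scale=0.1, size=500)

for seed in range(3):
    forest = RandomForestRegressor(random_state=seed).fit(X, y)
    # The importance split between the two columns varies with the seed,
    # showing that neither copy is intrinsically more predictive.
    print(seed, forest.feature_importances_)
```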