Some estimators have arbitrary ways to break ties:
- k-nearest neighbors with data points lying on a uniform 1d grid (see #23667, "BallTree.query returns inconsistent indices between scikit-learn versions 0.24.1 and 1.1.1", for the BallTree for instance);
- tied splits on different features in histogram gradient boosted trees with redundant features;
- decision tree splits on the same feature with equivalent thresholds: with `X = [[0], [1], [2], [3]]` and `y = [0, 1, 1, 0]`, `X > 0.5` and `X > 2.5` are tied splits but only `X > 0.5` is considered (see the sketch after this list);
- possibly other models (please feel free to edit this list or suggest missing estimators in the comments).
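
A minimal sketch reproducing the decision tree example above: both candidate thresholds reduce impurity by the same amount, yet the splitter deterministically keeps only one of them. The exact threshold printed may depend on the scikit-learn version.

```python
# Sketch of the tied-threshold example: X > 0.5 and X > 2.5 are equally
# good splits, but the splitter deterministically keeps only one.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]
y = [0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# Inspect the threshold chosen for the root split (node 0).
print(tree.tree_.threshold[0])  # typically 0.5, never 2.5
```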
If the tie-breaking logic is deterministic, then it might introduce a non-controllable bias in a data science pipeline. For instance, when analyzing the feature importance of a histogram gradient boosting model (via permutations or SHAP values), the first feature of a group of redundant features would always deterministically be picked up by the model, which could lead a naive data scientist to believe that the other features of the group are not as predictive (see the sketch below).
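
A hypothetical illustration of this bias, assuming two exact copies of the same informative feature; which copy gets credited may depend on the data and the scikit-learn version:

```python
# Two identical copies of one informative feature are fed to a histogram
# gradient boosting model; permutation importance then typically credits
# only one of them, even though they carry the same information.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
x = rng.uniform(size=(1000, 1))
X = np.hstack([x, x])          # feature 1 is an exact copy of feature 0
y = 3 * x.ravel() + rng.normal(scale=0.1, size=1000)

model = HistGradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# One of the two redundant columns typically gets nearly all the importance.
print(result.importances_mean)
```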
Note that this is not the case for our traditional DecisionTreeClassifier/Regressor, RandomForestClassifier/Regressor and extra trees, because they all do feature shuffling (controllable by random_state) by default, even when max_features == 1.0. This makes it easy to conduct the same study many times with different seeds to see whether the results are an artifact of arbitrary tie breaking or not (see the sketch below).
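
A sketch of that seed-based sanity check, again assuming two identical copies of the same feature: refitting a random forest with different seeds spreads the credit between the redundant columns if the tie is indeed arbitrary.

```python
# Random forests shuffle features at each split (controlled by random_state)
# even with max_features=1.0, so refitting with different seeds reveals
# whether a feature's dominance is an artifact of arbitrary tie breaking.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
x = rng.uniform(size=(500, 1))
X = np.hstack([x, x])          # two identical copies of the same feature
y = 3 * x.ravel() + rng.normal(scale=0.1, size=500)

for seed in range(3):
    forest = RandomForestRegressor(random_state=seed).fit(X, y)
    # The importance split between the two columns varies with the seed,
    # showing that neither copy is intrinsically more predictive.
    print(seed, forest.feature_importances_)
```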