FEA Add missing-value support for ExtraTreeClassifier and ExtraTreeRegressor #27966


Merged: 151 commits, Jul 9, 2024
Commits (151)
a8607b2
Added necessary Cython changes
adam2392 Dec 15, 2023
6d69617
Adding random splitter
adam2392 Dec 16, 2023
3474db3
WIP unit tests
adam2392 Dec 17, 2023
7904dee
Fully functioning extratrees with missing-values
adam2392 Dec 18, 2023
a6ac0e1
Add changelog
adam2392 Dec 18, 2023
65d1a51
Fix unit-test
adam2392 Dec 18, 2023
e062b72
Merge branch 'main' into extratreenan
adam2392 Dec 19, 2023
da1b1a1
Try again
adam2392 Dec 19, 2023
352dc4d
Merge branch 'main' into extratreenan
adam2392 Dec 19, 2023
f285a29
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Dec 19, 2023
5654a44
Merge branch 'main' into extratreenan
adam2392 Jan 2, 2024
a9a06e6
Merge branch 'main' into extratreenan
adam2392 Jan 18, 2024
92b21f7
Merge branch 'main' into extratreenan
adam2392 Jan 19, 2024
6b9e387
Merge branch 'main' into extratreenan
adam2392 Jan 19, 2024
1c2b807
Adding update
adam2392 Jan 19, 2024
79f24f4
Merge branch 'scikit-learn:main' into extratreenan
adam2392 Jan 25, 2024
2429774
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jan 25, 2024
69f2563
Fix splitting
adam2392 Jan 25, 2024
e2c6017
Move changelog
adam2392 Jan 25, 2024
078a255
Try to fix unit test
adam2392 Jan 25, 2024
b376f51
New file
adam2392 Jan 25, 2024
61723a5
Try again
adam2392 Jan 25, 2024
85c34bf
Remove extra file
adam2392 Jan 25, 2024
64ad1cd
Fix extratrees
adam2392 Jan 25, 2024
1d0e7f6
Try again
adam2392 Jan 26, 2024
a4a70ae
Merge branch 'main' into extratreenan
adam2392 Jan 26, 2024
f3f14e5
Try again
adam2392 Jan 26, 2024
3f82051
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jan 26, 2024
bf6cf27
Try again
adam2392 Jan 26, 2024
e01b6c5
Try again
adam2392 Jan 26, 2024
d336470
Tray again
adam2392 Jan 26, 2024
7726518
Not reproducible on local
adam2392 Jan 26, 2024
30af450
Try again
adam2392 Jan 26, 2024
dbdb0f8
Fix Ci
adam2392 Jan 26, 2024
f85ed1a
Try again
adam2392 Jan 29, 2024
4124ac4
Merge branch 'main' into extratreenan
adam2392 Jan 29, 2024
ff7f5d8
Try to fix test
adam2392 Jan 29, 2024
5f6a728
Try again
adam2392 Jan 29, 2024
b152b84
Try again
adam2392 Jan 29, 2024
e9ee8b4
again
adam2392 Jan 29, 2024
83324b7
Merge main
adam2392 Jan 30, 2024
01cb1ad
Try again on ci
adam2392 Jan 30, 2024
8a28e68
Fix bug and add unit test
adam2392 Jan 31, 2024
4cf7bef
Add changelog entry
adam2392 Jan 31, 2024
02fd866
Fix lint
adam2392 Jan 31, 2024
2ecdffe
Changelog and fix build
adam2392 Jan 31, 2024
0fc8f58
Fix unit test
adam2392 Jan 31, 2024
f2a7364
Merge branch 'main' into regtree
adam2392 Jan 31, 2024
60baa80
Merge branch 'regtree' into extratreenan
adam2392 Jan 31, 2024
0dd8cae
Merge branch 'main' into extratreenan
adam2392 Jan 31, 2024
260ad04
Almost working
adam2392 Jan 31, 2024
ffb2c68
Cleanup
adam2392 Jan 31, 2024
2b3de39
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jan 31, 2024
a4b2f43
Apply suggestions from code review
adam2392 Feb 1, 2024
442968b
Merge
adam2392 Feb 1, 2024
6f070c2
Merge branch 'regtree' of https://github.com/adam2392/scikit-learn in…
adam2392 Feb 1, 2024
4782b8a
Change unit test according to Guillame
adam2392 Feb 1, 2024
cfb3ad7
Merge branch 'main' into regtree
adam2392 Feb 1, 2024
284a450
Merge branch 'regtree' of https://github.com/adam2392/scikit-learn in…
adam2392 Feb 1, 2024
13ddf83
Fix unit test docstiring
adam2392 Feb 1, 2024
e6a28f6
Merging
adam2392 Feb 1, 2024
7384b22
Add fixes to unit-test
adam2392 Feb 1, 2024
40ad130
Fix lint
adam2392 Feb 1, 2024
b5c8ddc
Merge branch 'main' into extratreenan
adam2392 Feb 1, 2024
4253cc7
Fix lint
adam2392 Feb 1, 2024
c72e462
Try again
adam2392 Feb 1, 2024
e1fe9be
TST improve regression test
glemaitre Feb 2, 2024
c3e01b8
Merge branch 'regtree' into extratreenan
adam2392 Feb 2, 2024
a4adf70
Try new dataset
adam2392 Feb 2, 2024
80047a6
Merge branch 'main' into extratreenan
adam2392 Feb 2, 2024
4ad56d7
Merge
adam2392 Feb 2, 2024
b7a50fc
Merge branch 'main' into extratreenan
adam2392 Feb 5, 2024
13ca9ee
Add new expected score
adam2392 Feb 6, 2024
e5bff94
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Feb 6, 2024
e8ba177
Merge branch 'main' into extratreenan
adam2392 Feb 13, 2024
19521f1
Clean up
adam2392 Feb 13, 2024
8418b6b
Merge branch 'main' into extratreenan
adam2392 Feb 13, 2024
e72bb62
Cleanup merge
adam2392 Feb 13, 2024
f0b03ab
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Feb 13, 2024
5398a39
Correct without limiting depth
adam2392 Feb 14, 2024
7b07f1a
Try with noise
adam2392 Feb 15, 2024
17fbce8
Merge branch 'main' into extratreenan
adam2392 Feb 15, 2024
04ceef0
Fix unit test for global seed
adam2392 Feb 15, 2024
065f60b
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Feb 15, 2024
ef1dda5
Merge branch 'main' into extratreenan
adam2392 Feb 16, 2024
60b9e43
Merge branch 'main' into extratreenan
adam2392 Feb 19, 2024
d85ca3d
Merge branch 'main' into extratreenan
adam2392 Feb 19, 2024
97acf36
Merge branch 'main' into extratreenan
adam2392 Feb 21, 2024
116de12
Merge branch 'main' into extratreenan
adam2392 Feb 25, 2024
f00aa61
Merge branch 'main' into extratreenan
adam2392 Feb 26, 2024
f2de8e4
Merge branch 'main' into extratreenan
adam2392 Mar 2, 2024
c849418
Merge branch 'main' into extratreenan
adam2392 Mar 14, 2024
02882ae
Merge branch 'main' into extratreenan
adam2392 Mar 15, 2024
401c8d2
Merge branch 'main' into extratreenan
adam2392 Mar 19, 2024
e504b22
Merge branch 'main' into extratreenan
adam2392 Mar 26, 2024
3bcadd1
Merge branch 'main' into extratreenan
adam2392 Apr 15, 2024
0b572a5
Merge branch 'main' into extratreenan
adam2392 May 23, 2024
8096b50
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 May 23, 2024
13572cd
Fix isolation forest that relies on extratree
adam2392 May 23, 2024
42c1a7f
Merge branch 'main' into extratreenan
adam2392 Jun 10, 2024
6345933
Merge branch 'main' into extratreenan
adam2392 Jun 12, 2024
c67895a
Merge branch 'main' into extratreenan
adam2392 Jun 14, 2024
56f04b5
DOC update changelog
glemaitre Jun 20, 2024
0cceaf2
Merge remote-tracking branch 'origin/main' into pr/adam2392/27966
glemaitre Jun 20, 2024
73e4fd8
Merge branch 'main' into extratreenan
adam2392 Jun 20, 2024
d74ba65
Address guillame comments
adam2392 Jun 20, 2024
3526bcb
Do not force all finnite
adam2392 Jun 20, 2024
000363a
Merge branch 'main' into extratreenan
adam2392 Jun 26, 2024
97f24a4
Apply suggestions from code review
adam2392 Jul 1, 2024
994508b
Merging
adam2392 Jul 1, 2024
b53e881
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 1, 2024
e2dcca3
Merge branch 'main' into extratreenan
adam2392 Jul 1, 2024
ff39dba
Add extra unit test
adam2392 Jul 1, 2024
618cf53
Merge branch 'main' into extratreenan
adam2392 Jul 1, 2024
e8aadd2
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 1, 2024
818c1e5
Fix codecoverage
adam2392 Jul 1, 2024
38154ae
Merge branch 'main' into extratreenan
adam2392 Jul 2, 2024
8226eee
Apply suggestions from code review
adam2392 Jul 2, 2024
9d984a8
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 2, 2024
a2f9322
Merge branch 'main' into extratreenan
adam2392 Jul 2, 2024
b817cda
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 2, 2024
198769f
Fix lint
adam2392 Jul 2, 2024
c508046
Update _splitter.pyx
adam2392 Jul 3, 2024
5a5e0c3
Merge branch 'main' into extratreenan
adam2392 Jul 3, 2024
36e7c10
Fix unit test
adam2392 Jul 3, 2024
542019f
Merge branch 'main' into extratreenan
adam2392 Jul 3, 2024
ac57082
Apply suggestions from code review
adam2392 Jul 4, 2024
0753aa2
Address omar's comments
adam2392 Jul 4, 2024
4266679
Merge branch 'main' into extratreenan
adam2392 Jul 4, 2024
d80b60f
Remove if/else branch
adam2392 Jul 5, 2024
ac6b25a
Add extra section documenting missing-value treatment in extratrees
adam2392 Jul 5, 2024
eec51df
Revert the change which sets max_depth
OmarManzoor Jul 5, 2024
b7afac9
Revert the change
OmarManzoor Jul 5, 2024
90e8cec
Make extratrees documented
adam2392 Jul 6, 2024
9e40707
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 6, 2024
058671a
Merge branch 'main' into extratreenan
adam2392 Jul 6, 2024
708b614
Merge branch 'main' into extratreenan
adam2392 Jul 6, 2024
cafbde1
Merge branch 'extratreenan' of https://github.com/adam2392/scikit-lea…
adam2392 Jul 6, 2024
acd5a19
Fix cdef
adam2392 Jul 6, 2024
6b4906a
Fix circle
adam2392 Jul 7, 2024
699c97a
Try again
adam2392 Jul 7, 2024
9bc2db4
Increase tolerance
adam2392 Jul 7, 2024
02fcafe
Fixed
adam2392 Jul 7, 2024
1dc6e3d
Fix linters
adam2392 Jul 7, 2024
96972c0
Revert if/else branch
adam2392 Jul 8, 2024
fea6cbf
Address Omar comments
adam2392 Jul 8, 2024
83a5b67
Merge branch 'main' into extratreenan
adam2392 Jul 8, 2024
306245b
Update doc/modules/tree.rst
OmarManzoor Jul 9, 2024
969d69a
Merge branch 'main' into extratreenan
OmarManzoor Jul 9, 2024
2adf4da
Merge branch 'main' into extratreenan
OmarManzoor Jul 9, 2024
20e9d9f
Merge branch 'main' into extratreenan
OmarManzoor Jul 9, 2024
30 changes: 28 additions & 2 deletions doc/modules/tree.rst
@@ -579,11 +579,21 @@ Note that it fits much slower than the MSE criterion.
Missing Values Support
======================

:class:`DecisionTreeClassifier` and :class:`DecisionTreeRegressor`
have built-in support for missing values when `splitter='best'` and criterion is
:class:`DecisionTreeClassifier` and :class:`DecisionTreeRegressor`
have built-in support for missing values using `splitter='best'`, where
the splits are determined in a greedy fashion.
:class:`ExtraTreeClassifier` and :class:`ExtraTreeRegressor` have built-in
support for missing values when `splitter='random'`, where the splits
are determined randomly. For more details on how the splitter differs on
non-missing values, see the :ref:`Forest section <forest>`.

The criteria supported when there are missing values are
`'gini'`, `'entropy'`, or `'log_loss'` for classification, and
`'squared_error'`, `'friedman_mse'`, or `'poisson'` for regression.

First, we describe how :class:`DecisionTreeClassifier` and :class:`DecisionTreeRegressor`
handle missing values in the data.

For each potential threshold on the non-missing data, the splitter will evaluate
the split with all the missing values going to the left node or the right node.
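A compact sketch of this evaluation, not part of the diff: a hypothetical helper,
simplified to a single feature, where `impurity` stands in for whichever criterion
is configured (gini, entropy, squared error, ...):

import numpy as np

def evaluate_candidate_splits(x, y, impurity):
    """For each threshold on non-missing values, try missing-left and missing-right."""
    missing = np.isnan(x)
    candidates = np.unique(x[~missing])
    best = (np.inf, None, None)  # (weighted score, threshold, missing_go_to_left)
    for t in candidates[:-1]:
        for missing_go_to_left in (True, False):
            # NaN <= t is always False, so missing samples are routed explicitly.
            left = (x <= t) | (missing & missing_go_to_left)
            right = ~left
            score = (left.sum() * impurity(y[left])
                     + right.sum() * impurity(y[right])) / len(y)
            if score < best[0]:
                best = (score, t, missing_go_to_left)
    return best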

@@ -634,6 +644,22 @@ Decisions are made as follows:
>>> tree.predict(X_test)
array([1])

:class:`ExtraTreeClassifier` and :class:`ExtraTreeRegressor` handle missing values
in a slightly different way. When splitting a node, a random threshold is chosen
to split the non-missing values on. The non-missing values are then sent to the
left or right child based on the randomly selected threshold, while the missing
values are also randomly sent to the left or right child. This is repeated for
every feature considered at each split, and the best split among these is chosen.

During prediction, the treatment of missing values is the same as that of the
decision trees (a short usage sketch follows this file's diff):

- By default when predicting, the samples with missing values are classified
with the class used in the split found during training.

- If no missing values are seen during training for a given feature, then during
prediction missing values are mapped to the child with the most samples.

.. _minimal_cost_complexity_pruning:

Minimal Cost-Complexity Pruning
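To make the new behaviour concrete, here is a minimal usage sketch. It is not part
of the diff, assumes a scikit-learn build that includes this PR, and uses
illustrative data only:

import numpy as np
from sklearn.tree import ExtraTreeClassifier

X = np.array([[0.0], [1.0], [np.nan], [3.0], [np.nan]])
y = [0, 0, 1, 1, 1]

# ExtraTree* default to splitter='random': a threshold is drawn uniformly
# between the min and max of the non-missing values, and all missing values
# at a node are sent to one randomly chosen child.
clf = ExtraTreeClassifier(random_state=0).fit(X, y)

# At prediction time a NaN follows the branch chosen for it during training,
# or the child with the most samples if the feature had no missing values.
print(clf.predict([[np.nan]]))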
9 changes: 9 additions & 0 deletions doc/whats_new/v1.6.rst
@@ -204,6 +204,15 @@ Changelog
deprecated the `base_estimator` parameter in favor of `estimator`.
:pr:`28494` by :user:`Adam Li <adam2392>`.

:mod:`sklearn.tree`
...................

- |Feature| :class:`tree.ExtraTreeClassifier` and :class:`tree.ExtraTreeRegressor` now
  support missing values in the data matrix ``X``. Missing values are handled by
  randomly sending all of the samples with missing values to the left or right child
  node as the tree is traversed.
  :pr:`27966` by :user:`Adam Li <adam2392>`.

Thanks to everyone who has contributed to the maintenance and improvement of
the project since version 1.5, including:

15 changes: 12 additions & 3 deletions sklearn/ensemble/_iforest.py
@@ -315,7 +315,9 @@ def fit(self, X, y=None, sample_weight=None):
self : object
Fitted estimator.
"""
X = self._validate_data(X, accept_sparse=["csc"], dtype=tree_dtype)
X = self._validate_data(
X, accept_sparse=["csc"], dtype=tree_dtype, force_all_finite=False
)
if issparse(X):
# Pre-sort indices to avoid that each individual tree of the
# ensemble sorts the indices.
@@ -515,7 +517,13 @@ def score_samples(self, X):
model.score(X)
"""
# Check data
X = self._validate_data(X, accept_sparse="csr", dtype=tree_dtype, reset=False)
X = self._validate_data(
X,
accept_sparse="csr",
dtype=tree_dtype,
reset=False,
force_all_finite=False,
)

return self._score_samples(X)

@@ -627,7 +635,8 @@ def _more_tags(self):
"check_sample_weights_invariance": (
"zero sample_weight is not equivalent to removing samples"
),
}
},
"allow_nan": True,
}


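Because :class:`IsolationForest` builds on `ExtraTreeRegressor`, relaxing the input
validation here lets the forest accept NaN end to end. A hedged sketch of the
resulting usage, again assuming a build that includes this PR:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
X[::10, 0] = np.nan  # inject some missing values

# fit and score_samples now validate with force_all_finite=False, so the NaNs
# reach the underlying ExtraTreeRegressor splitters, which route them to a
# randomly chosen child at each split.
iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)  # rows containing NaN are scored as well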
4 changes: 2 additions & 2 deletions sklearn/tree/_classes.py
@@ -1074,7 +1074,7 @@ def predict_log_proba(self, X):
def _more_tags(self):
# XXX: nan is only supported for dense arrays, but we set this for the common test to
# pass, specifically: check_estimators_nan_inf
allow_nan = self.splitter == "best" and self.criterion in {
allow_nan = self.splitter in ("best", "random") and self.criterion in {
"gini",
"log_loss",
"entropy",
@@ -1405,7 +1405,7 @@ def _compute_partial_dependence_recursion(self, grid, target_features):
def _more_tags(self):
# XXX: nan is only supported for dense arrays, but we set this for the common test to
# pass, specifically: check_estimators_nan_inf
allow_nan = self.splitter == "best" and self.criterion in {
allow_nan = self.splitter in ("best", "random") and self.criterion in {
"squared_error",
"friedman_mse",
"poisson",
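A short sketch of what the widened `allow_nan` tag means in practice. This is
illustrative, not from the diff; in particular, the expectation that an unsupported
criterion still rejects NaN is an assumption based on the tag logic above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = [1.0, 2.0, 2.0, 4.0]

# Newly allowed: splitter='random' with a supported criterion.
DecisionTreeRegressor(splitter="random", criterion="squared_error").fit(X, y)

# Presumably still rejected: a criterion outside the supported set.
try:
    DecisionTreeRegressor(splitter="random", criterion="absolute_error").fit(X, y)
except ValueError:
    pass  # input validation should refuse NaN when the tag is False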
155 changes: 133 additions & 22 deletions sklearn/tree/_splitter.pyx
@@ -19,6 +19,9 @@ from scipy.sparse import issparse

cdef float64_t INFINITY = np.inf

# Allow for 32 bit float comparisons
cdef float32_t INFINITY_32t = np.inf

# Mitigate precision differences between 32 bit and 64 bit
cdef float32_t FEATURE_THRESHOLD = 1e-7

@@ -479,6 +482,10 @@ cdef inline int node_split_best(
current_split.threshold = feature_values[p_prev]

current_split.n_missing = n_missing

# if there are no missing values in the training data, during
# test time, we send missing values to the branch that contains
# the most samples during training time.
if n_missing == 0:
current_split.missing_go_to_left = n_left > n_right
else:
@@ -680,7 +687,13 @@ cdef inline int node_split_random(
# Draw random splits and pick the best
cdef intp_t start = splitter.start
cdef intp_t end = splitter.end
cdef intp_t end_non_missing
cdef intp_t n_missing = 0
cdef bint has_missing = 0
cdef intp_t n_left, n_right
cdef bint missing_go_to_left

cdef intp_t[::1] samples = splitter.samples
cdef intp_t[::1] features = splitter.features
cdef intp_t[::1] constant_features = splitter.constant_features
cdef intp_t n_features = splitter.n_features
@@ -758,12 +771,22 @@

current_split.feature = features[f_j]

# Find min, max
# Find min, max as we will randomly select a threshold between them
partitioner.find_min_max(
current_split.feature, &min_feature_value, &max_feature_value
)
n_missing = partitioner.n_missing
end_non_missing = end - n_missing

if max_feature_value <= min_feature_value + FEATURE_THRESHOLD:
if (
# All values for this feature are missing, or
end_non_missing == start or
# This feature is considered constant (max - min <= FEATURE_THRESHOLD)
max_feature_value <= min_feature_value + FEATURE_THRESHOLD
):
# We consider this feature constant in this case.
# Since finding a split with a constant feature is not valuable,
# we do not consider this feature for splitting.
features[f_j], features[n_total_constants] = features[n_total_constants], current_split.feature

n_found_constants += 1
@@ -772,6 +795,8 @@

f_i -= 1
features[f_i], features[f_j] = features[f_j], features[f_i]
has_missing = n_missing != 0
criterion.init_missing(n_missing)

# Draw a random threshold
current_split.threshold = rand_uniform(
@@ -780,15 +805,38 @@
random_state,
)

if has_missing:
# If there are missing values, then we randomly make all missing
# values go to the right or left.
#
# Note: compared to the BestSplitter, we do not evaluate the
# edge case where all the missing values go to the right node
# and the non-missing values go to the left node. This is because
# this would indicate a threshold outside of the observed range
# of the feature. However, it is not clear how much probability weight should
# be given to this edge case.
missing_go_to_left = rand_int(0, 2, random_state)
else:
missing_go_to_left = 0
criterion.missing_go_to_left = missing_go_to_left

if current_split.threshold == max_feature_value:
current_split.threshold = min_feature_value

# Partition
current_split.pos = partitioner.partition_samples(current_split.threshold)
current_split.pos = partitioner.partition_samples(
current_split.threshold
)

if missing_go_to_left:
n_left = current_split.pos - start + n_missing
n_right = end_non_missing - current_split.pos
else:
n_left = current_split.pos - start
n_right = end_non_missing - current_split.pos + n_missing

# Reject if min_samples_leaf is not guaranteed
if (((current_split.pos - start) < min_samples_leaf) or
((end - current_split.pos) < min_samples_leaf)):
if n_left < min_samples_leaf or n_right < min_samples_leaf:
continue

# Evaluate split
@@ -817,26 +865,44 @@
current_proxy_improvement = criterion.proxy_impurity_improvement()

if current_proxy_improvement > best_proxy_improvement:
current_split.n_missing = n_missing

# if there are no missing values in the training data, during
# test time, we send missing values to the branch that contains
# the most samples during training time.
if has_missing:
current_split.missing_go_to_left = missing_go_to_left
else:
current_split.missing_go_to_left = n_left > n_right

best_proxy_improvement = current_proxy_improvement
best_split = current_split # copy

# Reorganize into samples[start:best.pos] + samples[best.pos:end]
if best_split.pos < end:
if current_split.feature != best_split.feature:
# TODO: Pass in best.n_missing when random splitter supports missing values.
partitioner.partition_samples_final(
best_split.pos, best_split.threshold, best_split.feature, 0
best_split.pos,
best_split.threshold,
best_split.feature,
best_split.n_missing
)
criterion.init_missing(best_split.n_missing)
criterion.missing_go_to_left = best_split.missing_go_to_left

criterion.reset()
criterion.update(best_split.pos)
criterion.children_impurity(
&best_split.impurity_left, &best_split.impurity_right
)
best_split.improvement = criterion.impurity_improvement(
impurity, best_split.impurity_left, best_split.impurity_right
impurity,
best_split.impurity_left,
best_split.impurity_right
)

shift_missing_values_to_left_if_required(&best_split, samples, end)

# Respect invariant for constant features: the original order of
# element in features[:n_known_constants] must be preserved for sibling
# and child nodes
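For readers who do not speak Cython, a simplified pure-Python sketch of the
per-feature logic above. Names are hypothetical, and the real implementation works
in place on `samples` and `feature_values`:

import numpy as np

FEATURE_THRESHOLD = 1e-7

def draw_random_split(x, rng, min_samples_leaf=1):
    """Sketch of node_split_random for a single feature column x."""
    missing = np.isnan(x)
    n_missing = int(missing.sum())
    non_missing = x[~missing]
    if non_missing.size == 0:
        return None  # all values missing: the feature is treated as constant
    lo, hi = non_missing.min(), non_missing.max()
    if hi <= lo + FEATURE_THRESHOLD:
        return None  # constant feature: not a valid split candidate
    threshold = rng.uniform(lo, hi)
    if threshold == hi:
        threshold = lo
    # All missing values go to one randomly chosen side; unlike the best
    # splitter, the two possible sides are not both evaluated.
    missing_go_to_left = bool(rng.randint(0, 2)) if n_missing else False
    go_left = int((non_missing <= threshold).sum())
    n_left = go_left + (n_missing if missing_go_to_left else 0)
    n_right = (non_missing.size - go_left) + (0 if missing_go_to_left else n_missing)
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return None  # reject: min_samples_leaf is not guaranteed
    # When the node has no missing values, the missing_go_to_left stored for
    # prediction time falls back to n_left > n_right, as in the hunk above.
    return threshold, missing_go_to_left, n_left, n_right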
@@ -941,29 +1007,68 @@ cdef class DensePartitioner:
float32_t* min_feature_value_out,
float32_t* max_feature_value_out,
) noexcept nogil:
"""Find the minimum and maximum value for current_feature."""
"""Find the minimum and maximum value for current_feature.

Missing values are stored at the end of feature_values.
The number of missing values observed in feature_values is stored
in self.n_missing.
"""
cdef:
intp_t p
intp_t p, current_end
float32_t current_feature_value
const float32_t[:, :] X = self.X
intp_t[::1] samples = self.samples
float32_t min_feature_value = X[samples[self.start], current_feature]
float32_t max_feature_value = min_feature_value
float32_t min_feature_value = INFINITY_32t
float32_t max_feature_value = -INFINITY_32t
float32_t[::1] feature_values = self.feature_values
intp_t n_missing = 0
const unsigned char[::1] missing_values_in_feature_mask = self.missing_values_in_feature_mask

feature_values[self.start] = min_feature_value
# We are copying the values into an array and
# finding min/max of the array in a manner which utilizes the cache more
# effectively. We also need to count the number of missing values present.
if missing_values_in_feature_mask is not None and missing_values_in_feature_mask[current_feature]:
p, current_end = self.start, self.end - 1
# Missing values are placed at the end and do not participate in the
# min/max calculation.
while p <= current_end:
# Finds the right-most value that is not missing so that
# it can be swapped with missing values towards its left.
if isnan(X[samples[current_end], current_feature]):
n_missing += 1
current_end -= 1
continue

for p in range(self.start + 1, self.end):
current_feature_value = X[samples[p], current_feature]
feature_values[p] = current_feature_value
# X[samples[current_end], current_feature] is a non-missing value
if isnan(X[samples[p], current_feature]):
samples[p], samples[current_end] = samples[current_end], samples[p]
n_missing += 1
current_end -= 1

if current_feature_value < min_feature_value:
min_feature_value = current_feature_value
elif current_feature_value > max_feature_value:
max_feature_value = current_feature_value
current_feature_value = X[samples[p], current_feature]
feature_values[p] = current_feature_value
if current_feature_value < min_feature_value:
min_feature_value = current_feature_value
elif current_feature_value > max_feature_value:
max_feature_value = current_feature_value
p += 1
else:
min_feature_value = X[samples[self.start], current_feature]
max_feature_value = min_feature_value

feature_values[self.start] = min_feature_value
for p in range(self.start + 1, self.end):
current_feature_value = X[samples[p], current_feature]
feature_values[p] = current_feature_value

if current_feature_value < min_feature_value:
min_feature_value = current_feature_value
elif current_feature_value > max_feature_value:
max_feature_value = current_feature_value

min_feature_value_out[0] = min_feature_value
max_feature_value_out[0] = max_feature_value
self.n_missing = n_missing
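The same scan as a small NumPy sketch of find_min_max with the missing-value mask
set. This is a hypothetical helper; the Cython version swaps entries of samples in
place so the NaNs stay grouped at the tail:

import numpy as np

def find_min_max_with_missing(x):
    """Return (min, max, n_missing) over the non-missing values of x."""
    x = np.array(x, dtype=np.float32)
    p, end = 0, len(x) - 1
    n_missing = 0
    while p <= end:
        # Find the right-most non-missing value so a NaN at p can be
        # swapped towards the tail.
        if np.isnan(x[end]):
            n_missing += 1
            end -= 1
            continue
        if np.isnan(x[p]):
            x[p], x[end] = x[end], x[p]
            n_missing += 1
            end -= 1
        p += 1
    if n_missing == len(x):
        # Mirrors the +/-INFINITY_32t initialisation: no non-missing values.
        return np.float32(np.inf), np.float32(-np.inf), n_missing
    head = x[: len(x) - n_missing]  # non-missing prefix
    return head.min(), head.max(), n_missing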

cdef inline void next_p(self, intp_t* p_prev, intp_t* p) noexcept nogil:
"""Compute the next p_prev and p for iteratiing over feature values.
@@ -986,7 +1091,10 @@ cdef class DensePartitioner:
# (feature_values[p] >= end) or (feature_values[p] > feature_values[p - 1])
p[0] += 1

cdef inline intp_t partition_samples(self, float64_t current_threshold) noexcept nogil:
cdef inline intp_t partition_samples(
self,
float64_t current_threshold
) noexcept nogil:
"""Partition samples for feature_values at the current_threshold."""
cdef:
intp_t p = self.start
@@ -1233,7 +1341,10 @@ cdef class SparsePartitioner:
p_prev[0] = p[0]
p[0] = p_next

cdef inline intp_t partition_samples(self, float64_t current_threshold) noexcept nogil:
cdef inline intp_t partition_samples(
self,
float64_t current_threshold
) noexcept nogil:
"""Partition samples for feature_values at the current_threshold."""
return self._partition(current_threshold, self.start_positive)
