Merged
4 changes: 4 additions & 0 deletions doc/whats_new/v1.4.rst
@@ -368,6 +368,10 @@ Changelog
   is a predefined metric listed in :func:`metrics.get_scorer_names` and
   early stopping is enabled. :pr:`26163` by `Thomas Fan`_.
 
+- |Fix| Fixes :class:`ensemble.IsolationForest` when the input is a sparse matrix and
+  `contamination` is set to a float value.
+  :pr:`27645` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 - |API| In :class:`ensemble.AdaBoostClassifier`, the `algorithm` argument `SAMME.R` was
   deprecated and will be removed in 1.6. :pr:`26830` by :user:`Stefanie Senger
   <StefanieSenger>`.
7 changes: 5 additions & 2 deletions sklearn/ensemble/_iforest.py
@@ -340,7 +340,10 @@ def fit(self, X, y=None, sample_weight=None):

         # Else, define offset_ wrt contamination parameter
         # To avoid performing input validation a second time we call
-        # _score_samples rather than score_samples
+        # _score_samples rather than score_samples.
+        # _score_samples expects a CSR matrix, so we convert if necessary.
+        if issparse(X):
+            X = X.tocsr()
         self.offset_ = np.percentile(self._score_samples(X), 100.0 * self.contamination)

         return self
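
To illustrate the thresholding step in this hunk: `offset_` is set to the `100 * contamination` percentile of the training scores, so roughly that fraction of the training samples ends up with a negative decision value. The following is a minimal NumPy sketch of that idea only; the scores are made-up random numbers (an assumption for illustration), not output of a real `IsolationForest`.

```python
import numpy as np

# Hypothetical anomaly scores (lower means more abnormal); not real model output.
rng = np.random.default_rng(0)
scores = rng.normal(loc=-0.5, scale=0.1, size=50)

contamination = 0.1
# Same formula as in `fit`: the offset is the contamination-th percentile.
offset = np.percentile(scores, 100.0 * contamination)

# The decision value is the score minus the offset, so negatives are flagged.
decision = scores - offset
outlier_fraction = (decision < 0).sum() / scores.shape[0]
print(outlier_fraction)
```

With distinct scores, the fraction of negative decision values matches `contamination` up to rounding by the sample count, which is what the new test in this PR asserts on real model output.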
@@ -425,7 +428,7 @@ def score_samples(self, X):
             The lower, the more abnormal.
         """
         # Check data
-        X = self._validate_data(X, accept_sparse="csr", dtype=np.float32, reset=False)
+        X = self._validate_data(X, accept_sparse="csr", dtype=tree_dtype, reset=False)

         return self._score_samples(X)

20 changes: 20 additions & 0 deletions sklearn/ensemble/tests/test_iforest.py
@@ -341,3 +341,23 @@ def test_iforest_preserve_feature_names():
     with warnings.catch_warnings():
         warnings.simplefilter("error", UserWarning)
         model.fit(X)
+
+
+@pytest.mark.parametrize("sparse_container", CSC_CONTAINERS + CSR_CONTAINERS)
+def test_iforest_sparse_input_float_contamination(sparse_container):
+    """Check that `IsolationForest` accepts sparse matrix input and float value for
+    contamination.
+
+    Non-regression test for:
+    https://github.com/scikit-learn/scikit-learn/issues/27626
+    """
+    X, _ = make_classification(n_samples=50, n_features=4, random_state=0)
+    X = sparse_container(X)
+    X.sort_indices()

Review thread on `X.sort_indices()`:

Member: Out of curiosity, why do we need to sort the indices?

Member Author: I have no idea. I reused the minimal reproducer as-is, and it was probably failing without sorted indices.

Member: Sorting indices allows getting into a canonical representation (e.g. with no duplicates' ambiguity).

+    contamination = 0.1
+    iforest = IsolationForest(
+        n_estimators=5, contamination=contamination, random_state=0
+    ).fit(X)
+
+    X_decision = iforest.decision_function(X)
+    assert (X_decision < 0).sum() / X.shape[0] == pytest.approx(contamination)
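
On the review question about `X.sort_indices()`: the short SciPy sketch below shows what sorting indices does to a CSR matrix. The matrix is a hand-built example with deliberately unsorted column indices (an assumption for illustration); it is not taken from the PR or the linked issue.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Row 0 stores its columns in the order [2, 0]; row 1 stores column [1].
data = np.array([3.0, 1.0, 2.0])
indices = np.array([2, 0, 1])
indptr = np.array([0, 2, 3])
X = csr_matrix((data, indices, indptr), shape=(2, 3))

dense_before = X.toarray()
X.sort_indices()  # in place: reorders indices and data within each row

# The stored layout changes, but the matrix it represents does not.
print(X.indices)  # column indices are now ascending within each row
print(np.array_equal(X.toarray(), dense_before))
```

Sorting only reorders the stored entries within each row. Merging duplicate entries is a separate operation (`sum_duplicates`).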