
MRG FIX: order of values of self.quantiles_ in QuantileTransformer #15751


Merged: 12 commits, Dec 21, 2019
doc/whats_new/v0.22.rst (4 additions, 0 deletions)
@@ -817,6 +817,10 @@ Changelog
:class:`preprocessing.KernelCenterer`
:pr:`14336` by :user:`Gregory Dexter <gdex1>`.

- |Fix| :class:`preprocessing.QuantileTransformer` now guarantees that the
  `quantiles_` attribute is completely sorted in non-decreasing order.
  :pr:`15751` by :user:`Tirth Patel <tirthasheshpatel>`.

:mod:`sklearn.model_selection`
..............................

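As a hedged illustration of what the changelog entry above guarantees (the
data here is made up purely for the example): after fitting, every column of
`quantiles_` should be non-decreasing.

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.RandomState(0)
    X = rng.lognormal(size=(500, 3))  # arbitrary illustrative data

    qt = QuantileTransformer(n_quantiles=100).fit(X)
    # quantiles_ has shape (n_quantiles, n_features); each column holds the
    # per-feature quantile landmarks and is now guaranteed non-decreasing.
    assert np.all(np.diff(qt.quantiles_, axis=0) >= 0)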
sklearn/preprocessing/_data.py (10 additions, 0 deletions)
@@ -2262,6 +2262,11 @@ def _dense_fit(self, X, random_state):
col = col.take(subsample_idx, mode='clip')
self.quantiles_.append(np.nanpercentile(col, references))
self.quantiles_ = np.transpose(self.quantiles_)
# Due to floating-point precision error in `np.nanpercentile`,
# make sure that quantiles are monotonically increasing.
# Upstream issue in numpy:
# https://github.com/numpy/numpy/issues/14685
self.quantiles_ = np.maximum.accumulate(self.quantiles_)

def _sparse_fit(self, X, random_state):
"""Compute percentiles for sparse matrices.
@@ -2305,6 +2310,11 @@ def _sparse_fit(self, X, random_state):
self.quantiles_.append(
np.nanpercentile(column_data, references))
self.quantiles_ = np.transpose(self.quantiles_)
# Due to floating-point precision error in `np.nanpercentile`,
# make sure that the quantiles are monotonically increasing.
# Upstream issue in numpy:
# https://github.com/numpy/numpy/issues/14685
self.quantiles_ = np.maximum.accumulate(self.quantiles_)

def fit(self, X, y=None):
"""Compute the quantiles used for transforming.
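To see why the running maximum is needed, here is a minimal sketch adapted
from the upstream report (numpy/numpy#14685) and the regression test below.
Whether the unrepaired check actually fails depends on the platform and
NumPy build.

    import numpy as np

    # Same data as the regression test; duplicating past 100 samples matters
    # because quantiles_ is capped at the number of samples.
    x = 0.1 * np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5,
                        1, 1, 9, 9, 9, 8, 8, 7] * 10)
    references = np.linspace(0, 100, 100)
    q = np.nanpercentile(x, references)

    # Rounding inside nanpercentile can produce a tiny decrease between
    # consecutive quantiles on some platforms:
    print(np.all(np.diff(q) >= 0))  # may print False

    # The fix: a running maximum makes the sequence non-decreasing while
    # leaving already-sorted values untouched.
    q_fixed = np.maximum.accumulate(q)
    assert np.all(np.diff(q_fixed) >= 0)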
sklearn/preprocessing/tests/test_data.py (21 additions, 0 deletions)
@@ -25,6 +25,7 @@
from sklearn.utils._testing import assert_allclose
from sklearn.utils._testing import assert_allclose_dense_sparse
from sklearn.utils._testing import skip_if_32bit
from sklearn.utils._testing import _convert_container

from sklearn.utils.sparsefuncs import mean_variance_axis
from sklearn.preprocessing._data import _handle_zeros_in_scale
@@ -1532,6 +1533,26 @@ def test_quantile_transform_nan():
assert not np.isnan(transformer.quantiles_[:, 1:]).any()


@pytest.mark.parametrize("array_type", ['array', 'sparse'])
def test_quantile_transformer_sorted_quantiles(array_type):
# Non-regression test for:
# https://github.com/scikit-learn/scikit-learn/issues/15733
# Taken from upstream bug report:
# https://github.com/numpy/numpy/issues/14685
X = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 1, 1, 9, 9, 9, 8, 8, 7] * 10)
[Review thread on the line above]

Member: Shoot, I tried to reproduce the failure with different sizes and ways
of generating the data, unsuccessfully.

Member: Nice that you got one.

ogrisel (Dec 21, 2019): The trick was to make the dataset larger than 100
samples, otherwise quantiles_ was limited by X.shape[0]. So I just duplicated
the samples 10 times, and the monotonicity issue was fortunately still
present :)

ogrisel (Dec 21, 2019): BTW, I am not sure whether
quantile_transformer.quantiles_.shape[0] being smaller than
quantile_transformer.n_quantiles (when the training set size is too small) is
a bug or not. But that's unrelated to the topic of this PR.

Member: We raise a warning in this case, so I think this is fine (at least we
expected it); see the sketch after this diff.

X = 0.1 * X.reshape(-1, 1)
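# Note: _convert_container is a test helper that turns X into either a dense
# ndarray or a sparse matrix, so both the dense and sparse fit paths run.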
X = _convert_container(X, array_type)

n_quantiles = 100
qt = QuantileTransformer(n_quantiles=n_quantiles).fit(X)

# Check that the estimated quantile thresholds are monotonically
# increasing:
quantiles = qt.quantiles_[:, 0]
assert len(quantiles) == 100
assert all(np.diff(quantiles) >= 0)


def test_robust_scaler_invalid_range():
for range_ in [
(-1, 90),
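Regarding the side discussion in the review thread about quantiles_.shape[0]
being capped by the training set size: a hedged sketch of the warning
behavior the reviewers mention (the exact message wording may differ across
versions).

    import warnings
    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    X = np.arange(10, dtype=float).reshape(-1, 1)  # only 10 samples

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        qt = QuantileTransformer(n_quantiles=1000).fit(X)

    # n_quantiles is clipped to n_samples, and a warning flags it.
    print(caught[0].message)    # should mention n_quantiles being reduced
    print(qt.quantiles_.shape)  # (10, 1), not (1000, 1)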