SimpleImputer fails in "most_frequent" if incomparable types only if ties #31717

Open
@AlexandreAbraham

Describe the bug

Observed behavior

When using the "most_frequent" strategy of SimpleImputer and several values are tied for the highest count, the code takes the minimum of the tied values. This crashes if those values are not mutually comparable, such as str and NoneType.
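The tie-breaking step can be isolated without scikit-learn. The sketch below is a simplified stand-in that only mirrors the min(...) expression visible in the stack trace later in this report; the function name tie_break_with_min is hypothetical and the rest of the real _most_frequent helper is not reproduced.

```python
from collections import Counter


def tie_break_with_min(values):
    """Return the most frequent value, breaking ties with min().

    Hypothetical helper: it copies only the tie-breaking expression
    from the traceback, not the full scikit-learn implementation.
    """
    counter = Counter(values)
    most_frequent_count = max(counter.values())
    # min() must compare the tied values with '<', which fails for str vs None.
    return min(
        value for value, count in counter.items() if count == most_frequent_count
    )


# No tie: None appears twice, so min() sees a single candidate and never compares.
print(tie_break_with_min(["a", None, None]))

# Tie: 'a' and None both appear once, so min() has to compare them.
try:
    tie_break_with_min(["a", None])
except TypeError as exc:
    print("TypeError:", exc)
```

With a single candidate, min() returns it without any comparison, which is why the failure only shows up when there is a tie.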

Steps/Code to Reproduce

import numpy as np
from sklearn.impute import SimpleImputer


X1 = np.asarray(['a', None])[:, None]
X2 = np.asarray(['a', None, None])[:, None]

imputer = SimpleImputer(add_indicator=True, strategy="most_frequent")

try:
    imputer.fit_transform(X1)
    print('X1 processed successfully')
except Exception as e:
    print('Error while processing X1:', e)


try:
    imputer.fit_transform(X2)
    print('X2 processed successfully')
except Exception as e:
    print('Error while processing X2:', e)

Expected Results

I would expect the imputer to behave consistently, regardless of whether a tie is present. Namely, either:

  • Run whether or not values are comparable
  • Crash if values are not comparable, whether there is a tie or not.

Note that the code claims to process data like scipy.stats.mode, but mode deprecated non-numeric input in SciPy 1.9.0 and removed support for it in SciPy 1.11.0. It therefore raises on this example and redirects the user toward np.unique:

Traceback (most recent call last):
  File "/Users/aabraham/NeuralkFoundry/tutorials/repro.py", line 11, in <module>
    print(scipy.stats.mode(X1))
          ~~~~~~~~~~~~~~~~^^^^
  File "/Users/aabraham/.local/share/mamba/envs/skle/lib/python3.13/site-packages/scipy/stats/_axis_nan_policy.py", line 611, in axis_nan_policy_wrapper
    res = hypotest_fun_out(*samples, axis=axis, **kwds)
  File "/Users/aabraham/.local/share/mamba/envs/skle/lib/python3.13/site-packages/scipy/stats/_stats_py.py", line 567, in mode
    raise TypeError(message)
TypeError: Argument `a` is not recognized as numeric. Support for input that cannot be coerced to a numeric array was deprecated in SciPy 1.9.0 and removed in SciPy 1.11.0. Please consider `np.unique`.

Let me know which behavior you expect and I can contribute a PR. A quick way to solve it would be to use hash(value) when the values are not comparable.
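As a rough illustration of the first option (run whether or not values are comparable), here is a minimal sketch of a tie-breaker with a fallback ordering. It is not a proposed patch: the function name most_frequent_with_fallback and the fallback key are assumptions made for this example. The key orders by type name and then repr rather than by hash(value), because string hashes are randomized per process unless PYTHONHASHSEED is set, so a hash-based tie-break could pick a different value from one run to the next.

```python
from collections import Counter


def most_frequent_with_fallback(values):
    """Most frequent value, with a deterministic tie-break for mixed types.

    Hypothetical sketch: falls back to ordering by (type name, repr)
    only when the tied values cannot be compared directly.
    """
    counter = Counter(values)
    top_count = max(counter.values())
    tied = [value for value, count in counter.items() if count == top_count]
    try:
        # Same behavior as today when the tied values are comparable.
        return min(tied)
    except TypeError:
        # Assumed fallback: stable across runs, unlike hash() on strings.
        return min(tied, key=lambda value: (type(value).__name__, repr(value)))


print(most_frequent_with_fallback(["a", None]))
print(most_frequent_with_fallback(["b", "a"]))
```

For comparable values the result is unchanged from the current min-based rule; the fallback only decides which of the incomparable tied values wins, and any such rule is arbitrary by nature.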

Actual Results

Error while processing X1: '<' not supported between instances of 'NoneType' and 'str'
X2 processed successfully

If the error is not caught, here is the stack trace:

Traceback (most recent call last):
  File "/Users/aabraham/NeuralkFoundry/tutorials/repro.py", line 10, in <module>
    imputer.fit_transform(X1)
    ~~~~~~~~~~~~~~~~~~~~~^^^^
  File "/Users/aabraham/scikit-learn/sklearn/utils/_set_output.py", line 316, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/Users/aabraham/scikit-learn/sklearn/base.py", line 894, in fit_transform
    return self.fit(X, **fit_params).transform(X)
           ~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/aabraham/scikit-learn/sklearn/base.py", line 1365, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/aabraham/scikit-learn/sklearn/impute/_base.py", line 453, in fit
    self.statistics_ = self._dense_fit(
                       ~~~~~~~~~~~~~~~^
        X, self.strategy, self.missing_values, fill_value
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/aabraham/scikit-learn/sklearn/impute/_base.py", line 565, in _dense_fit
    most_frequent[i] = _most_frequent(row, np.nan, 0)
                       ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/Users/aabraham/scikit-learn/sklearn/impute/_base.py", line 53, in _most_frequent
    most_frequent_value = min(
        value
        for value, count in counter.items()
        if count == most_frequent_count
    )
TypeError: '<' not supported between instances of 'NoneType' and 'str'

Versions

System:
    python: 3.13.5 | packaged by conda-forge | (main, Jun 16 2025, 08:24:05) [Clang 18.1.8 ]
executable: /Users/aabraham/.local/share/mamba/envs/skle/bin/python
   machine: macOS-15.4.1-arm64-arm-64bit-Mach-O

Python dependencies:
      sklearn: 1.8.dev0
          pip: 25.1.1
   setuptools: 80.9.0
        numpy: 2.3.1
        scipy: 1.16.0
       Cython: 3.1.2
       pandas: 2.3.0
   matplotlib: 3.10.3
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/aabraham/.local/share/mamba/envs/skle/lib/libopenblas.0.dylib
        version: 0.3.30
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/aabraham/.local/share/mamba/envs/skle/lib/libomp.dylib
        version: None
