
TypeError when fitting GridSearchCV or RandomizedSearchCV with OrdinalEncoder and OneHotEncoder in parameters grid #29157


Closed
BriceChivu opened this issue Jun 2, 2024 · 13 comments · Fixed by #29179

Comments

@BriceChivu

BriceChivu commented Jun 2, 2024

Describe the bug

Having both OrdinalEncoder and OneHotEncoder inside the parameter grid used by GridSearchCV or RandomizedSearchCV results in the following error: TypeError: float() argument must be a string or a real number, not 'OneHotEncoder'.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

set_config(transform_output="pandas")

# Setting seed for reproducibility
np.random.seed(42)

# Create a DataFrame with 1000 rows and 5 columns
num_rows = 1000
data = {
    "numeric_1": np.random.randn(num_rows),  # Normally distributed random numbers
    "numeric_3": np.random.randint(
        1, 100, size=num_rows
    ),  # Random integers between 1 and 100
    "object_1": np.random.choice(
        ["A", "B", "C", "D"], size=num_rows
    ),  # Random choice among 'A', 'B', 'C', 'D'
    "object_2": np.random.choice(
        ["X", "Y", "Z"], size=num_rows
    ),  # Random choice among 'X', 'Y', 'Z'
    "target": np.random.rand(num_rows)
    * 100,  # Uniformly distributed random numbers [0, 100)
}

df = pd.DataFrame(data)

X = df.drop("target", axis=1)
y = df["target"]

enc = ColumnTransformer(
    [("enc", OneHotEncoder(sparse_output=False), ["object_1", "object_2"])],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

pipe = Pipeline(
    [
        ("enc", enc),
        ("regressor", HistGradientBoostingRegressor()),
    ]
)

grid_params = {
    "enc__enc": [
        OneHotEncoder(sparse_output=False),
        OrdinalEncoder(),
    ]
}

grid_search = GridSearchCV(pipe, grid_params, cv=5)
grid_search.fit(X, y)
# RandomizedSearchCV produces the same error
# rand_search = RandomizedSearchCV(pipe, grid_params, cv=5)
# rand_search.fit(X, y)

Expected Results

I would have expected grid_search.fit(X, y) to run without errors and return the fitted GridSearchCV.
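
In other words, a successful fit would make the search results available for inspection, for example (a minimal sketch based on the reproducer above; the winning encoder is only illustrative):

grid_search.fit(X, y)
# Both encoder candidates should appear in the results; which one wins
# depends on the cross-validation scores.
print(grid_search.best_params_)
print(grid_search.cv_results_["param_enc__enc"])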

Actual Results

{
	"name": "TypeError",
	"message": "float() argument must be a string or a real number, not 'OneHotEncoder'",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[108], line 1
----> 1 grid_search.fit(X, y)

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/model_selection/_search.py:968, in BaseSearchCV.fit(self, X, y, **params)
    962     results = self._format_results(
    963         all_candidate_params, n_splits, all_out, all_more_results
    964     )
    966     return results
--> 968 self._run_search(evaluate_candidates)
    970 # multimetric is determined here because in the case of a callable
    971 # self.scoring the return type is only known after calling
    972 first_test_score = all_out[0][\"test_scores\"]

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/model_selection/_search.py:1543, in GridSearchCV._run_search(self, evaluate_candidates)
   1541 def _run_search(self, evaluate_candidates):
   1542     \"\"\"Search all candidates in param_grid\"\"\"
-> 1543     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/model_selection/_search.py:962, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    959         all_more_results[key].extend(value)
    961 nonlocal results
--> 962 results = self._format_results(
    963     all_candidate_params, n_splits, all_out, all_more_results
    964 )
    966 return results

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/model_selection/_search.py:1098, in BaseSearchCV._format_results(self, candidate_params, n_splits, out, more_results)
   1094     arr_dtype = object
   1095 if len(param_list) == n_candidates and arr_dtype != object:
   1096     # Exclude `object` else the numpy constructor might infer a list of
   1097     # tuples to be a 2d array.
-> 1098     results[key] = MaskedArray(param_list, mask=False, dtype=arr_dtype)
   1099 else:
   1100     # Use one MaskedArray and mask all the places where the param is not
   1101     # applicable for that candidate (which may not contain all the params).
   1102     ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)

File ~/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/numpy/ma/core.py:2820, in MaskedArray.__new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2811 \"\"\"
   2812 Create a new masked array from scratch.
   2813 
   (...)
   2817 
   2818 \"\"\"
   2819 # Process data.
-> 2820 _data = np.array(data, dtype=dtype, copy=copy,
   2821                  order=order, subok=True, ndmin=ndmin)
   2822 _baseclass = getattr(data, '_baseclass', type(_data))
   2823 # Check that we're not erasing the mask.

TypeError: float() argument must be a string or a real number, not 'OneHotEncoder'"
}

Versions

System:
    python: 3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/brice/Documents/Coding/ML_exercises/.venv/bin/python
   machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/brice/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/brice/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.27
threading_layer: pthreads
   architecture: neoversen1

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/brice/Documents/Coding/ML_exercises/.venv/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
BriceChivu added the Bug and Needs Triage labels on Jun 2, 2024
@adrinjalali
Member

Could you please provide a minimal reproducer?

  • remove the extra bits from the code which do not contribute to the error
  • use a dataset from sklearn.datasets
  • the code should run without requiring extra datasets by simply copy pasting the code.

adrinjalali added the Needs Reproducible Code label and removed the Needs Triage label on Jun 3, 2024
@BriceChivu
Author

Could you please provide a minimal reproducer?

  • remove the extra bits from the code which do not contribute to the error
  • use a dataset from sklearn.datasets
  • the code should run without requiring extra datasets by simply copy pasting the code.

Thanks for your comment. I modified the issue's description accordingly.

@adrinjalali
Member

This seems to be another issue related to the dtypes of the results in grid search. @lesteve @MarcoGorelli WDYT?

adrinjalali added the module:model_selection label and removed the Needs Reproducible Code label on Jun 4, 2024
@lesteve
Member

lesteve commented Jun 4, 2024

I can confirm this still happens on main. I have modified the snippet so it does not use force_int_remainder_cols (a new ColumnTransformer parameter in 1.5), and the snippet runs on 1.4, so this does seem like a regression.

It is possible that this comes from the dtype tweak in the grid-search .cv_results_ from #28352. I did the previous bug fix, so I am happy to let @MarcoGorelli take this one 😉.

@MarcoGorelli
Contributor

thanks for the ping - this seems to be the issue:

(Pdb) p param_list
[OneHotEncoder(sparse_output=False), OrdinalEncoder()]
(Pdb) p np.result_type(*param_list)
dtype('float64')
(Pdb) p np.array(param_list).dtype
dtype('O')

I find it a bit surprising that np.result_type gives 'float64' here.
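
The same comparison can be reproduced outside of pdb (a small standalone sketch, reusing the estimators from the reproducer):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

param_list = [OneHotEncoder(sparse_output=False), OrdinalEncoder()]
# np.result_type reports float64 for the two estimators, while actually
# constructing an array from them yields an object array.
print(np.result_type(*param_list))  # float64
print(np.array(param_list).dtype)   # object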

lesteve added this to the 1.5.1 milestone on Jun 4, 2024
@MarcoGorelli
Contributor

wait wut

In [6]: OrdinalEncoder().dtype
Out[6]: numpy.float64

@lesteve
Member

lesteve commented Jun 4, 2024

Oh dear, OrdinalEncoder has a dtype parameter and hence a .dtype attribute. np.result_type probably relies on the .dtype attribute? Edit: same thing for OneHotEncoder.
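
A quick check of the defaults (a minimal sketch, assuming default constructor arguments):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Both encoders accept a `dtype` constructor argument (default numpy.float64),
# which is stored as a `.dtype` attribute on the instance.
print(OrdinalEncoder().dtype)  # <class 'numpy.float64'>
print(OneHotEncoder().dtype)   # <class 'numpy.float64'>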

@adrinjalali
Member

In a sense, it does make sense that result_type is float64, since result_type implies the result of an operation on those values. But we just want to create an array here, so maybe we should get the dtype of a created array instead?

@MarcoGorelli
Contributor

I think that creates other issues (#28352 (comment)) which @thomasjpfan wanted to avoid.

It might be simplest to just check whether any object in param_list is an instance of BaseEstimator, and if so, set arr_dtype to object (see the sketch below)?

I have a call coming up, but I can submit a PR later.
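
A rough sketch of that check (a hypothetical helper, not necessarily the fix that was merged in #29179):

import numpy as np
from sklearn.base import BaseEstimator

def infer_arr_dtype(param_list):
    # Hypothetical: fall back to `object` as soon as any candidate value is a
    # scikit-learn estimator; otherwise keep the promoted numeric dtype.
    if any(isinstance(p, BaseEstimator) for p in param_list):
        return np.dtype(object)
    return np.result_type(*param_list)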

@adrinjalali
Member

Not everything is a BaseEstimator. A third-party estimator might not inherit from BaseEstimator, and that would break this.

We could check whether anything is not a scalar or a simple object, maybe? Not sure.

@MarcoGorelli
Contributor

Ah, thanks.

A third-party estimator should still implement fit and predict/transform though? Maybe just check for those attributes?
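
For illustration, such a duck-typing check might look like this (hypothetical, and as the next comments note, not without problems):

def looks_like_estimator(obj):
    # Hypothetical duck-typing check: treat anything implementing `fit` plus
    # `predict` or `transform` as estimator-like, forcing an object dtype.
    return hasattr(obj, "fit") and (
        hasattr(obj, "predict") or hasattr(obj, "transform")
    )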


As an aside, I expect that the dtype property might create other problems going forward. It looks like it's not documented (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#onehotencoder), so that may make the case for renaming it?

@adrinjalali
Member

dtype is documented, as a constructor argument, which becomes an attribute with the same name, so we can't easily rename it.

Checking for fit and predict (or any other Protocol) would also not be okay. I think we might end up in odd situations where some attribute or constructor argument is an arbitrary object.

@lesteve
Member

lesteve commented Jun 4, 2024

It's good to have a fix in scikit-learn, but I think the numpy behaviour is unexpected, so I opened numpy/numpy#26612.
