GridSearchCV fails when parameters are arrays with different sizes #29277

Closed
Gabriel-Kissin opened this issue Jun 17, 2024 · 7 comments · Fixed by #29314

Comments

@Gabriel-Kissin

Gabriel-Kissin commented Jun 17, 2024

Describe the bug

SplineTransformer accepts arrays for the knots argument to specify the positions of the knots.

Using GridSearchCV to find the best positions fails if the candidate knots arrays have different sizes (i.e. if each candidate has a different n_knots). This appears to be because the code attempts to coerce the parameters into one array, and therefore fails due to the inhomogeneous shape.
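The suspected failure mode can be reproduced with NumPy alone. The following is a minimal sketch, not the sklearn code path itself; the variable names are illustrative, and the explicit dtype=float is an assumption made so that the coercion is attempted in the same way on older and newer NumPy releases.

```python
import numpy as np

# Three candidate knot arrays with 10, 15 and 20 rows, each of shape (n_knots, 1).
knots_candidates = [
    np.linspace(0, 2 * np.pi, n_knots).reshape(-1, 1) for n_knots in range(10, 21, 5)
]

# Asking NumPy for one numeric array fails because the candidates
# differ in length along their first axis.
try:
    np.array(knots_candidates, dtype=float)
    coercion_failed = False
except ValueError as exc:
    coercion_failed = True
    print(f"ValueError: {exc}")
```

Each candidate is a valid array on its own; only the attempt to stack them into a single rectangular array raises.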

Note (sklearn versions): this error occurs in the recent sklearn 1.5.0. The earlier 1.4.2 did not suffer from this issue.

Note 2: the issue would be avoided by searching over the n_knots parameter instead of the knots parameter. However, it is often important to specify the knot positions directly, for example with periodic data as in the provided example, because the periodicity is defined by the first and last knots. In any case, there are presumably other places in sklearn where arrays of different shapes can be passed as parameters and where the same issue will occur.

Steps/Code to Reproduce

import numpy as np

import sklearn.pipeline
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.linear_model

import matplotlib.pyplot as plt

x = np.linspace(-np.pi*2,np.pi*5,1000)
y_true = np.sin(x)
y_train = y_true[(0<x) & (x<np.pi*2)]

x_train = x[(0<x) & (x<np.pi*2)]
y_train_noise = y_train + np.random.normal(size=y_train.shape, scale=0.5)

x = x.reshape((-1,1))
x_train = x_train.reshape((-1,1))

spline_reg_pipe = sklearn.pipeline.make_pipeline(
            sklearn.preprocessing.SplineTransformer(extrapolation="periodic"), 
            sklearn.linear_model.LinearRegression(fit_intercept=False)
            )

spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
    estimator=spline_reg_pipe,
    param_grid={
        # 'splinetransformer__degree' : [3,4,5],
        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1)) 
                                       for n_knots in range(10,21,5)],
    },
    verbose=1
)

spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)

plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
plt.plot(x, y_true, label='truth')
plt.plot(x, spline_reg_pipe_cv.predict(x), label='predictions')
plt.legend()
plt.show()

Expected Results

This is sample output from earlier versions of sklearn:
[image: plot of the noisy training data, the true curve, and the predictions]

Actual Results

Error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.

Full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[46], line 26
     11 spline_reg_pipe = sklearn.pipeline.make_pipeline(
     12             sklearn.preprocessing.SplineTransformer(extrapolation="periodic"), 
     13             sklearn.linear_model.LinearRegression(fit_intercept=False)
     14             )
     16 spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
     17     estimator=spline_reg_pipe,
     18     param_grid={
   (...)
     23     verbose=1
     24 )
---> 26 spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)
     28 plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
     29 plt.plot(x, y_true, label='truth')

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:968, in BaseSearchCV.fit(self, X, y, **params)
    962     results = self._format_results(
    963         all_candidate_params, n_splits, all_out, all_more_results
    964     )
    966     return results
--> 968 self._run_search(evaluate_candidates)
    970 # multimetric is determined here because in the case of a callable
    971 # self.scoring the return type is only known after calling
    972 first_test_score = all_out[0]["test_scores"]

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1543, in GridSearchCV._run_search(self, evaluate_candidates)
   1541 def _run_search(self, evaluate_candidates):
   1542     """Search all candidates in param_grid"""
-> 1543     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:962, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    959         all_more_results[key].extend(value)
    961 nonlocal results
--> 962 results = self._format_results(
    963     all_candidate_params, n_splits, all_out, all_more_results
    964 )
    966 return results

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1098, in BaseSearchCV._format_results(self, candidate_params, n_splits, out, more_results)
   1094     arr_dtype = object
   1095 if len(param_list) == n_candidates and arr_dtype != object:
   1096     # Exclude `object` else the numpy constructor might infer a list of
   1097     # tuples to be a 2d array.
-> 1098     results[key] = MaskedArray(param_list, mask=False, dtype=arr_dtype)
   1099 else:
   1100     # Use one MaskedArray and mask all the places where the param is not
   1101     # applicable for that candidate (which may not contain all the params).
   1102     ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)

File ~/Library/Python/3.12/lib/python/site-packages/numpy/ma/core.py:2820, in MaskedArray.__new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2811 """
   2812 Create a new masked array from scratch.
   2813 
   (...)
   2817 
   2818 """
   2819 # Process data.
-> 2820 _data = np.array(data, dtype=dtype, copy=copy,
   2821                  order=order, subok=True, ndmin=ndmin)
   2822 _baseclass = getattr(data, '_baseclass', type(_data))
   2823 # Check that we're not erasing the mask.

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.

Versions

System:
    python: 3.12.3 (v3.12.3:f6650f9ad7, Apr  9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /usr/local/bin/python3
   machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 11
         prefix: libopenblas
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 11
         prefix: libopenblas
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: neoversen1

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/xgboost/.dylibs/libomp.dylib
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /opt/homebrew/Cellar/libomp/18.1.7/lib/libomp.dylib
        version: None

Gabriel-Kissin added the Bug and Needs Triage labels on Jun 17, 2024

@Gabriel-Kissin
Author

Gabriel-Kissin commented Jun 17, 2024

Workaround: the source code reveals a temporary fix: convert the arrays into nested lists. This signals to sklearn that these parameters should be treated as objects and not forced into one array. In the code example provided, changing the 'splinetransformer__knots' entry of param_grid from

        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1))
                                       for n_knots in range(10,21,5)],

to

        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1)).tolist() 
                                       for n_knots in range(10,21,5)],

solves the issue.

Perhaps this treatment should be extended to cover not only parameters that are clearly objects (lists/tuples), but also any case where putting them into an array fails for whatever reason, since that failure can be interpreted as an indication that they should be treated as objects.
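The proposal above could look roughly like the sketch below. This is only an illustration of the idea, not sklearn's implementation and not necessarily the fix that was later merged; params_to_column is a hypothetical helper name, and the fallback of filling a 1-D object array element by element is an assumption about one way to do it.

```python
import numpy as np
from numpy.ma import MaskedArray


def params_to_column(param_list):
    """Hypothetical helper: pack one parameter's candidate values into a MaskedArray."""
    try:
        # Works when the candidates can be combined into one regular array.
        return MaskedArray(param_list, mask=False)
    except ValueError:
        # Inhomogeneous shapes: keep each candidate as an opaque object instead.
        data = np.empty(len(param_list), dtype=object)
        for i, value in enumerate(param_list):
            data[i] = value
        return MaskedArray(data, mask=False)


knots_grid = [np.linspace(0, 2 * np.pi, n).reshape(-1, 1) for n in (10, 15, 20)]
column = params_to_column(knots_grid)
print(column.dtype, column.shape)
```

Note that candidates which happen to share a shape would still be stacked into one multi-dimensional array by the first branch, so a real fix would likely need more care than this sketch shows.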

@jeremiedbb
Member

Thanks for the report @Gabriel-Kissin. It looks like another side effect of #28352. Ping @MarcoGorelli for confirmation, and maybe to propose a quick fix?

jeremiedbb added the Regression label and removed the Needs Triage label on Jun 20, 2024
jeremiedbb added this to the 1.5.1 milestone on Jun 20, 2024

@MarcoGorelli
Contributor

oh no, not another one

i'll take a look, thanks for the ping

@MarcoGorelli
Contributor

I just tried this out on main, commit 65b2571, and it doesn't error for me

Looks like it's already fixed, and all that's needed is another release

@jeremiedbb
Member

Hmm, that's weird, because I'm able to reproduce the error on main, same commit :/

@MarcoGorelli
Contributor

that's probably cause I don't practice building sklearn often enough, just went through the setup instructions again and can reproduce 😳 taking a look now 👀

@MarcoGorelli
Contributor

fix incoming
