GridSearchCV fails when parameters are arrays with different sizes #29277

Gabriel-Kissin · 2024-06-17T09:43:15Z

Describe the bug

SplineTransformer accepts arrays for the knots argument to specify the positions of the knots.

Using GridSearchCV to search over knot positions fails if the candidate knots arrays have different sizes (i.e. if they have different n_knots). This appears to be because the code attempts to coerce all the candidate values of a parameter into one array, which fails because the resulting shape is inhomogeneous.

Note: this error only occurs in recent versions of sklearn (1.5.0). Earlier versions (1.4.2) did not suffer from this issue.

Note 2: the issue would be avoided if the n_knots parameter were searched over (instead of the knots parameter). However, it is often important to specify the knot positions directly, for example with periodic data (as in the provided example), where the periodicity is defined by the first and last knots. In any case there are presumably other places in sklearn where arrays of different shapes can be provided as parameters and where the same issue will occur.

Steps/Code to Reproduce

import numpy as np

import sklearn.pipeline
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.linear_model

import matplotlib.pyplot as plt

x = np.linspace(-np.pi*2,np.pi*5,1000)
y_true = np.sin(x)
y_train = y_true[(0<x) & (x<np.pi*2)]

x_train = x[(0<x) & (x<np.pi*2)]
y_train_noise = y_train + np.random.normal(size=y_train.shape, scale=0.5)

x = x.reshape((-1,1))
x_train = x_train.reshape((-1,1))

spline_reg_pipe = sklearn.pipeline.make_pipeline(
            sklearn.preprocessing.SplineTransformer(extrapolation="periodic"), 
            sklearn.linear_model.LinearRegression(fit_intercept=False)
            )

spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
    estimator=spline_reg_pipe,
    param_grid={
        # 'splinetransformer__degree' : [3,4,5],
        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1)) 
                                       for n_knots in range(10,21,5)],
    },
    verbose=1
)

spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)

plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
plt.plot(x, y_true, label='truth')
plt.plot(x, spline_reg_pipe_cv.predict(x), label='predictions')
plt.legend()
plt.show()

Expected Results

This is sample output from earlier versions of sklearn:

Actual Results

Error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.

Full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[46], line 26
     11 spline_reg_pipe = sklearn.pipeline.make_pipeline(
     12             sklearn.preprocessing.SplineTransformer(extrapolation="periodic"), 
     13             sklearn.linear_model.LinearRegression(fit_intercept=False)
     14             )
     16 spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
     17     estimator=spline_reg_pipe,
     18     param_grid={
   (...)
     23     verbose=1
     24 )
---> 26 spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)
     28 plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
     29 plt.plot(x, y_true, label='truth')

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:968, in BaseSearchCV.fit(self, X, y, **params)
    962     results = self._format_results(
    963         all_candidate_params, n_splits, all_out, all_more_results
    964     )
    966     return results
--> 968 self._run_search(evaluate_candidates)
    970 # multimetric is determined here because in the case of a callable
    971 # self.scoring the return type is only known after calling
    972 first_test_score = all_out[0]["test_scores"]

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1543, in GridSearchCV._run_search(self, evaluate_candidates)
   1541 def _run_search(self, evaluate_candidates):
   1542     """Search all candidates in param_grid"""
-> 1543     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:962, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    959         all_more_results[key].extend(value)
    961 nonlocal results
--> 962 results = self._format_results(
    963     all_candidate_params, n_splits, all_out, all_more_results
    964 )
    966 return results

File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1098, in BaseSearchCV._format_results(self, candidate_params, n_splits, out, more_results)
   1094     arr_dtype = object
   1095 if len(param_list) == n_candidates and arr_dtype != object:
   1096     # Exclude `object` else the numpy constructor might infer a list of
   1097     # tuples to be a 2d array.
-> 1098     results[key] = MaskedArray(param_list, mask=False, dtype=arr_dtype)
   1099 else:
   1100     # Use one MaskedArray and mask all the places where the param is not
   1101     # applicable for that candidate (which may not contain all the params).
   1102     ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)

File ~/Library/Python/3.12/lib/python/site-packages/numpy/ma/core.py:2820, in MaskedArray.__new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
   2811 """
   2812 Create a new masked array from scratch.
   2813 
   (...)
   2817 
   2818 """
   2819 # Process data.
-> 2820 _data = np.array(data, dtype=dtype, copy=copy,
   2821                  order=order, subok=True, ndmin=ndmin)
   2822 _baseclass = getattr(data, '_baseclass', type(_data))
   2823 # Check that we're not erasing the mask.

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.

Versions

System:
    python: 3.12.3 (v3.12.3:f6650f9ad7, Apr  9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /usr/local/bin/python3
   machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: 3.0.10
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 11
         prefix: libopenblas
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 11
         prefix: libopenblas
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: neoversen1

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/xgboost/.dylibs/libomp.dylib
        version: None

       user_api: openmp
   internal_api: openmp
    num_threads: 11
         prefix: libomp
       filepath: /opt/homebrew/Cellar/libomp/18.1.7/lib/libomp.dylib
        version: None

Gabriel-Kissin · 2024-06-17T09:50:05Z

Workaround: looking at the source code reveals a temporary fix: turn the arrays into nested lists. Doing this signals to sklearn that these parameters should be treated as objects and not forced into one array. In the code example provided, changing lines 29-30 from

        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1))
                                       for n_knots in range(10,21,5)],

to

        'splinetransformer__knots'  : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1)).tolist() 
                                       for n_knots in range(10,21,5)],

solves the issue.
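As a quick sanity check on the workaround, the sketch below (NumPy only, names illustrative) shows that `.tolist()` changes only the container type: the nested lists hold the same knot values and convert back to an identical array. How SplineTransformer converts list-valued knots internally is not shown here and is not assumed.

```python
import numpy as np

knots = np.linspace(0, 2 * np.pi, 10).reshape(-1, 1)

# Nested Python lists: one inner single-element list per knot.
knots_as_list = knots.tolist()

# Converting back yields an array with the same shape and values.
restored = np.asarray(knots_as_list)
same_values = np.array_equal(knots, restored)
```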

Perhaps this treatment should be extended to cover not only parameters that are clearly objects (lists/tuples), but also any case where putting them into an array fails for whatever reason, as that failure can be interpreted as an indication that they should be treated as objects.
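The fallback suggested here could look roughly like the sketch below. `to_param_column` is a hypothetical helper written for illustration; it is not scikit-learn's code and not necessarily how the eventual fix works.

```python
import numpy as np


def to_param_column(values):
    """Hypothetical helper: pack one parameter's candidate values into a 1-D array."""
    try:
        column = np.array(values)
        if column.ndim == 1:
            # Plain scalars (ints, floats, strings): keep the inferred dtype.
            return column
    except ValueError:
        # Ragged input, e.g. arrays of different sizes (raises on NumPy >= 1.24).
        pass
    # Fallback: one object slot per candidate, filled element by element so
    # NumPy never tries to merge the candidates into a single regular array.
    column = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        column[i] = value
    return column


ragged = [np.linspace(0, 2 * np.pi, n).reshape(-1, 1) for n in (10, 15, 20)]
knots_column = to_param_column(ragged)      # object array, one knots array per slot
degree_column = to_param_column([3, 4, 5])  # ordinary integer array
```

Filling the object array slot by slot also covers same-sized candidate arrays, which `np.array` would otherwise stack into a single higher-dimensional array.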

jeremiedbb · 2024-06-20T10:44:10Z

Thanks for the report @Gabriel-Kissin. It looks like another side effect of #28352. Ping @MarcoGorelli for confirmation, and maybe to propose a quick fix?

MarcoGorelli · 2024-06-20T11:01:22Z

oh no, not another one

i'll take a look, thanks for the ping

MarcoGorelli · 2024-06-20T12:17:36Z

I just tried this out on main, commit 65b2571 , and it doesn't error for me

Looks like it's already fixed, and all that's needed is another release

jeremiedbb · 2024-06-20T12:21:27Z

Hum that's weird cause I'm able to reproduce the error on main, same commit :/

MarcoGorelli · 2024-06-20T12:31:24Z

that's probably cause I don't practice building sklearn often enough, just went through the setup instructions again and can reproduce 😳 taking a look now 👀

MarcoGorelli · 2024-06-20T13:02:15Z

fix incoming

Gabriel-Kissin added the Bug and Needs Triage labels Jun 17, 2024

jeremiedbb added the Regression label and removed the Needs Triage label Jun 20, 2024

jeremiedbb added this to the 1.5.1 milestone Jun 20, 2024

MarcoGorelli mentioned this issue Jun 20, 2024

Fix a regression in GridSearchCV for parameter grids that have arrays of different sizes as parameter values #29314

Merged

jeremiedbb closed this as completed in #29314 Jul 1, 2024

lesteve mentioned this issue Jul 23, 2024

Follow-up after mean_poisson_deviance array API PR #29549

Closed
