Description
Describe the bug
SplineTransformer accepts arrays for the knots argument to specify the positions of the knots. Using GridSearchCV to find the best positions fails if the candidate knots arrays have different sizes (i.e. if they have different n_knots). This appears to be because the code attempts to coerce the parameters into one array, and therefore fails due to the inhomogeneous shape.
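The coercion failure can be reproduced with NumPy alone. The following is a minimal sketch, assuming the problem arises when the candidate values are packed into a single typed array (as the MaskedArray call in the traceback suggests). The variable names are illustrative and are not taken from sklearn.

```python
import numpy as np
from numpy.ma import MaskedArray

# Three candidate knot arrays with shapes (10, 1), (15, 1) and (20, 1),
# mirroring the param_grid used in the reproduction code.
knot_candidates = [
    np.linspace(0, 2 * np.pi, n_knots).reshape(-1, 1)
    for n_knots in range(10, 21, 5)
]

# All candidates are float64, so a common non-object dtype exists ...
common_dtype = np.result_type(*knot_candidates)

# ... but the arrays cannot be stacked into one rectangular array.
try:
    MaskedArray(knot_candidates, mask=False, dtype=common_dtype)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Storing each candidate as an opaque object avoids the coercion.
as_objects = np.empty(len(knot_candidates), dtype=object)
for i, knots in enumerate(knot_candidates):
    as_objects[i] = knots
print(as_objects.shape)
```

The second half only illustrates that an object-dtype container can hold arrays of different shapes; it is not a claim about how sklearn should be patched.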
Note: this error only occurs in recent versions of sklearn (1.5.0). Earlier versions (1.4.2) did not suffer from this issue.
Note 2: the issue would be avoided if the n_knots parameter were searched over (instead of the knots parameter). However, it is often important to specify the knot positions directly, for example with periodic data, as in the provided example, because the periodicity is defined by the first and last knots. In any case, there are presumably other places in sklearn where arrays of different shapes can be provided as parameters and where the same issue will occur.
Steps/Code to Reproduce
import numpy as np
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.linear_model
import matplotlib.pyplot as plt
x = np.linspace(-np.pi*2,np.pi*5,1000)
y_true = np.sin(x)
y_train = y_true[(0<x) & (x<np.pi*2)]
x_train = x[(0<x) & (x<np.pi*2)]
y_train_noise = y_train + np.random.normal(size=y_train.shape, scale=0.5)
x = x.reshape((-1,1))
x_train = x_train.reshape((-1,1))
spline_reg_pipe = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.SplineTransformer(extrapolation="periodic"),
    sklearn.linear_model.LinearRegression(fit_intercept=False)
)
spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
    estimator=spline_reg_pipe,
    param_grid={
        # 'splinetransformer__degree' : [3,4,5],
        'splinetransformer__knots' : [np.linspace(0,np.pi*2,n_knots).reshape((-1,1))
                                      for n_knots in range(10,21,5)],
    },
    verbose=1
)
spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)
plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
plt.plot(x, y_true, label='truth')
plt.plot(x, spline_reg_pipe_cv.predict(x), label='predictions')
plt.legend()
plt.show()
Expected Results
This is sample output from earlier versions of sklearn: [image: plot of the noisy data, the truth, and the predictions]
Actual Results
Error:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.
Full traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[46], line 26
11 spline_reg_pipe = sklearn.pipeline.make_pipeline(
12 sklearn.preprocessing.SplineTransformer(extrapolation="periodic"),
13 sklearn.linear_model.LinearRegression(fit_intercept=False)
14 )
16 spline_reg_pipe_cv = sklearn.model_selection.GridSearchCV(
17 estimator=spline_reg_pipe,
18 param_grid={
(...)
23 verbose=1
24 )
---> 26 spline_reg_pipe_cv.fit(X=x_train, y=y_train_noise)
28 plt.scatter(x_train, y_train_noise, s=1, label='noisy data')
29 plt.plot(x, y_true, label='truth')
File ~/Library/Python/3.12/lib/python/site-packages/sklearn/base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:968, in BaseSearchCV.fit(self, X, y, **params)
962 results = self._format_results(
963 all_candidate_params, n_splits, all_out, all_more_results
964 )
966 return results
--> 968 self._run_search(evaluate_candidates)
970 # multimetric is determined here because in the case of a callable
971 # self.scoring the return type is only known after calling
972 first_test_score = all_out[0]["test_scores"]
File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1543, in GridSearchCV._run_search(self, evaluate_candidates)
1541 def _run_search(self, evaluate_candidates):
1542 """Search all candidates in param_grid"""
-> 1543 evaluate_candidates(ParameterGrid(self.param_grid))
File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:962, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
959 all_more_results[key].extend(value)
961 nonlocal results
--> 962 results = self._format_results(
963 all_candidate_params, n_splits, all_out, all_more_results
964 )
966 return results
File ~/Library/Python/3.12/lib/python/site-packages/sklearn/model_selection/_search.py:1098, in BaseSearchCV._format_results(self, candidate_params, n_splits, out, more_results)
1094 arr_dtype = object
1095 if len(param_list) == n_candidates and arr_dtype != object:
1096 # Exclude `object` else the numpy constructor might infer a list of
1097 # tuples to be a 2d array.
-> 1098 results[key] = MaskedArray(param_list, mask=False, dtype=arr_dtype)
1099 else:
1100 # Use one MaskedArray and mask all the places where the param is not
1101 # applicable for that candidate (which may not contain all the params).
1102 ma = MaskedArray(np.empty(n_candidates), mask=True, dtype=arr_dtype)
File ~/Library/Python/3.12/lib/python/site-packages/numpy/ma/core.py:2820, in MaskedArray.__new__(cls, data, mask, dtype, copy, subok, ndmin, fill_value, keep_mask, hard_mask, shrink, order)
2811 """
2812 Create a new masked array from scratch.
2813
(...)
2817
2818 """
2819 # Process data.
-> 2820 _data = np.array(data, dtype=dtype, copy=copy,
2821 order=order, subok=True, ndmin=ndmin)
2822 _baseclass = getattr(data, '_baseclass', type(_data))
2823 # Check that we're not erasing the mask.
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9,) + inhomogeneous part.
Versions
System:
python: 3.12.3 (v3.12.3:f6650f9ad7, Apr 9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /usr/local/bin/python3
machine: macOS-14.5-arm64-arm-64bit
Python dependencies:
sklearn: 1.5.0
pip: 24.0
setuptools: 70.0.0
numpy: 1.26.4
scipy: 1.13.0
Cython: 3.0.10
pandas: 2.2.2
matplotlib: 3.8.4
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 11
prefix: libopenblas
filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: blas
internal_api: openblas
num_threads: 11
prefix: libopenblas
filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/scipy/.dylibs/libopenblas.0.dylib
version: 0.3.26.dev
threading_layer: pthreads
architecture: neoversen1
user_api: openmp
internal_api: openmp
num_threads: 11
prefix: libomp
filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/sklearn/.dylibs/libomp.dylib
version: None
user_api: openmp
internal_api: openmp
num_threads: 11
prefix: libomp
filepath: /Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/xgboost/.dylibs/libomp.dylib
version: None
user_api: openmp
internal_api: openmp
num_threads: 11
prefix: libomp
filepath: /opt/homebrew/Cellar/libomp/18.1.7/lib/libomp.dylib
version: None