Conversation

@LeoGrin
Collaborator

@LeoGrin LeoGrin commented Jan 14, 2025

TabPFN fails when used as part of a sklearn pipeline with sklearn versions older than 1.4.2.
This PR adds tests that fail for these older versions and bumps the minimum sklearn version to 1.4.2 in the pyproject.toml.

@LennartPurucker
Collaborator

Why does it fail in a pipeline for sklearn<1.4.2?

This is a pretty huge version bump (limiting to April 2024+), so I would rather fix the problem than bump the version this much at this point.

Otherwise the tests look good to me.

@LennartPurucker
Collaborator

Looking more into it, could it be something related to numpy 2? Does any other requirement force us to numpy 2? 🤔

@LennartPurucker
Collaborator

LennartPurucker commented Jan 14, 2025

#116 seems very related now

@LeoGrin
Collaborator Author

LeoGrin commented Jan 14, 2025

Thanks Lennart! Actually the minimum working version is 1.4.1.post1 (February 2024) not 1.4.2, so a bit better but still pretty recent. Doesn't seem related to numpy version as it works both for numpy 1.26 and numpy 2.
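For readers unsure how a post-release such as 1.4.1.post1 sorts relative to 1.4.1 and 1.4.2, here is a small sketch using the `packaging` library (an assumption: it is not mentioned in this thread, and the candidate list is purely illustrative). It only demonstrates version ordering and does not test TabPFN against any of these releases.

```python
from packaging.version import Version

# Illustrative candidates only; nothing here is read from the project's metadata.
candidates = ["1.2.0", "1.4.0", "1.4.1", "1.4.1.post1", "1.4.2"]

# The minimum working version reported in the comment above.
minimum_working = Version("1.4.1.post1")

# A post-release sorts after its base release and before the next patch release.
supported = [v for v in candidates if Version(v) >= minimum_working]
print(supported)
```

Under PEP 440 ordering, a bound of `>=1.4.1.post1` excludes plain 1.4.1 but admits 1.4.2.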

@noahho
Collaborator

noahho commented Jan 14, 2025

@LennartPurucker #116 says numpy should be below 2, not that numpy 2 is required

@noahho
Collaborator

noahho commented Jan 14, 2025

It's great to have this test in here @LeoGrin! I agree that, if possible, we should fix whatever is causing the issue right now; maybe it's just one small change

@LeoGrin
Collaborator Author

LeoGrin commented Jan 14, 2025

Here's the test error for sklearn=1.4.0, I'm looking into this right now.

    def test_classifier_in_pipeline(X_y: tuple[np.ndarray, np.ndarray]) -> None:
        """Test that TabPFNClassifier works correctly within a sklearn pipeline."""
        X, y = X_y
    
        # Create a simple preprocessing pipeline
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', TabPFNClassifier(
                n_estimators=2  # Fewer estimators for faster testing
            ))
        ])
    
>       pipeline.fit(X, y)

tests/test_classifier_interface.py:147: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1351: in wrapper
    return fit_method(estimator, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/pipeline.py:475: in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
src/tabpfn/classifier.py:455: in fit
    X = ord_encoder.fit_transform(X)  # type: ignore
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1351: in wrapper
    return fit_method(estimator, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:914: in fit_transform
    result = self._call_func_on_transformers(
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:823: in _call_func_on_transformers
    return Parallel(n_jobs=self.n_jobs)(jobs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/parallel.py:67: in __call__
    return super().__call__(iterable_with_config)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/joblib/parallel.py:1918: in __call__
    return output if self.return_generator else list(output)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/joblib/parallel.py:1847: in _get_sequential_output
    res = func(*args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/parallel.py:129: in __call__
    return self.function(*args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/pipeline.py:1303: in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1061: in fit_transform
    return self.fit(X, **fit_params).transform(X)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = FunctionTransformer(accept_sparse=True, check_inverse=False,
                    feature_names_out='one-to-one')
X =             0         1         2         3
0   -0.900681  1.019004 -1.340227 -1.315444
1   -1.143017 -0.131979 -1.340....053935
148  0.432165  0.788808  0.933271  1.448832
149  0.068662 -0.131979  0.762758  0.790671

[150 rows x 4 columns]

    def transform(self, X):
        """Transform X using the forward function.
    
        Parameters
        ----------
        X : {array-like, sparse-matrix} of shape (n_samples, n_features) \
                if `validate=True` else any object that `func` can handle
            Input array.
    
        Returns
        -------
        X_out : array-like, shape (n_samples, n_features)
            Transformed input.
        """
        X = self._check_input(X, reset=False)
        out = self._transform(X, func=self.func, kw_args=self.kw_args)
    
        if hasattr(out, "columns") and self.feature_names_out is not None:
            # check the consistency between the column names of the output and the
            # one generated by `get_feature_names_out`
            if list(out.columns) != list(self.get_feature_names_out()):
>               raise ValueError(
                    "The output generated by `func` have different column names than "
                    "the one generated by the method `get_feature_names_out`. "
                    f"Got output with columns names: {list(out.columns)} and "
                    "`get_feature_names_out` returned: "
                    f"{list(self.get_feature_names_out())}. "
                    "This can be fixed in different manners depending on your use case:"
                    "\n(i) If `func` returns a container with column names, make sure "
                    "they are consistent with the output of `get_feature_names_out`.\n"
                    "(ii) If `func` is a NumPy `ufunc`, then forcing `validate=True` "
                    "could be considered to internally convert the input container to "
                    "a NumPy array before calling the `ufunc`.\n"
                    "(iii) The column names can be overriden by setting "
                    "`set_output(transform='pandas')` such that the column names are "
                    "set to the names provided by `get_feature_names_out`."
                )
E               ValueError: The output generated by `func` have different column names than the one generated by the method `get_feature_names_out`. Got output with columns names: [0, 1, 2, 3] and `get_feature_names_out` returned: ['x0', 'x1', 'x2', 'x3']. This can be fixed in different manners depending on your use case:
E               (i) If `func` returns a container with column names, make sure they are consistent with the output of `get_feature_names_out`.
E               (ii) If `func` is a NumPy `ufunc`, then forcing `validate=True` could be considered to internally convert the input container to a NumPy array before calling the `ufunc`.
E               (iii) The column names can be overriden by setting `set_output(transform='pandas')` such that the column names are set to the names provided by `get_feature_names_out`.

@LennartPurucker
Collaborator

I think it might be this scikit-learn/scikit-learn#28262

@LeoGrin
Collaborator Author

LeoGrin commented Jan 14, 2025

Found a fix (replacing passthrough for the remainder of the ColumnTransformer with a FunctionTransformer, which is the identity by default). I don't understand why it fixes the issue as well as I would like, but it seems to work 🤷‍♂️
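A minimal sketch of the kind of change described above, assuming a `ColumnTransformer` that ordinal-encodes one categorical column. The data and column layout are hypothetical and are not TabPFN's actual preprocessing. One plausible reading of the traceback is that the failing transformer had `feature_names_out='one-to-one'`, and the column-name check in `transform` only runs when `feature_names_out` is not `None`. A default `FunctionTransformer()` leaves it as `None`, so the check is skipped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

# Hypothetical data with integer column labels: one categorical column (0)
# and two numeric columns (1 and 2).
X = pd.DataFrame({0: ["a", "b", "a"], 1: [1.0, 2.0, 3.0], 2: [0.5, 0.1, 0.9]})

column_transformer = ColumnTransformer(
    transformers=[("ordinal", OrdinalEncoder(), [0])],
    # An explicit identity transformer instead of remainder="passthrough".
    # With default arguments, feature_names_out is None.
    remainder=FunctionTransformer(),
)

X_encoded = column_transformer.fit_transform(X)
print(X_encoded.shape)  # (3, 3)
```

The encoded column comes first in the output, followed by the untouched remainder columns.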

I lowered the minimum version of sklearn to 1.2.0 (December 2022). Earlier versions fail (at least) because SimpleImputer's keep_empty_features was introduced in 1.2.0. I would argue we can use 1.2.0 as the minimum version and open a new issue to support older versions, but tell me if you think 1.2.0 is still too recent.

@LeoGrin LeoGrin requested review from LennartPurucker and removed request for LennartPurucker January 14, 2025 13:54
@LennartPurucker
Collaborator

Perfect, very nice, thank you @LeoGrin!

Collaborator

@LennartPurucker LennartPurucker left a comment

LGTM!

@LennartPurucker
Collaborator

Feel free to merge at your convenience.

@LeoGrin LeoGrin merged commit 6f84506 into main Jan 14, 2025
@LeoGrin LeoGrin changed the title Increase minimum scikit-learn versiont to fix error when using TabPFN as part of a pipeline Fix error when using TabPFN as part of a pipeline Jan 14, 2025
@LennartPurucker LennartPurucker deleted the fix_pipeline_error branch January 14, 2025 16:10
oscarkey pushed a commit that referenced this pull request Nov 12, 2025
…l. (#135)

* Record copied public PR 492

* Forward cache_trainset_representation to load_model. (#492)

(cherry picked from commit 54fd7e6)

---------

Co-authored-by: mirror-bot <[email protected]>
Co-authored-by: Phil <[email protected]>
liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025
Increase minimum scikit-learn versiont to fix error when using TabPFN as part of a pipeline