Fix error when using TabPFN as part of a pipeline #135

LeoGrin · 2025-01-14T08:46:59Z

TabPFN fails when used as part of a sklearn pipeline with sklearn versions older than 1.4.2.
This PR adds tests that fail on these older versions and bumps the minimum sklearn version to 1.4.2 in pyproject.toml.

LennartPurucker · 2025-01-14T10:53:44Z

Why does it fail in a pipeline for sklearn<1.4.2?

This is a pretty huge version bump (limiting to April 2024+), so I would rather fix the problem than bump the version this much at this point.

Otherwise the tests look good to me.

LennartPurucker · 2025-01-14T10:55:27Z

Looking more into it, could it be something related to numpy 2? Does any other requirement force us to numpy 2? 🤔

LennartPurucker · 2025-01-14T10:55:56Z

#116 seems very related now

LeoGrin · 2025-01-14T12:16:45Z

Thanks Lennart! Actually the minimum working version is 1.4.1.post1 (February 2024) not 1.4.2, so a bit better but still pretty recent. Doesn't seem related to numpy version as it works both for numpy 1.26 and numpy 2.

noahho · 2025-01-14T12:18:33Z

@LennartPurucker #116 says numpy should be below 2, not that numpy 2 is required

noahho · 2025-01-14T12:19:52Z

It's great to have this test in here, @LeoGrin! I agree that, if possible, we should fix whatever causes the issue right now; maybe it's just one small change.

LeoGrin · 2025-01-14T12:19:57Z

Here's the test error for sklearn=1.4.0, I'm looking into this right now.

    def test_classifier_in_pipeline(X_y: tuple[np.ndarray, np.ndarray]) -> None:
        """Test that TabPFNClassifier works correctly within a sklearn pipeline."""
        X, y = X_y
    
        # Create a simple preprocessing pipeline
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', TabPFNClassifier(
                n_estimators=2  # Fewer estimators for faster testing
            ))
        ])
    
>       pipeline.fit(X, y)

tests/test_classifier_interface.py:147: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1351: in wrapper
    return fit_method(estimator, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/pipeline.py:475: in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
src/tabpfn/classifier.py:455: in fit
    X = ord_encoder.fit_transform(X)  # type: ignore
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1351: in wrapper
    return fit_method(estimator, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:914: in fit_transform
    result = self._call_func_on_transformers(
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py:823: in _call_func_on_transformers
    return Parallel(n_jobs=self.n_jobs)(jobs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/parallel.py:67: in __call__
    return super().__call__(iterable_with_config)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/joblib/parallel.py:1918: in __call__
    return output if self.return_generator else list(output)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/joblib/parallel.py:1847: in _get_sequential_output
    res = func(*args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/parallel.py:129: in __call__
    return self.function(*args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/pipeline.py:1303: in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/base.py:1061: in fit_transform
    return self.fit(X, **fit_params).transform(X)
../../../mambaforge/envs/tabpfn_package_env/lib/python3.10/site-packages/sklearn/utils/_set_output.py:273: in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = FunctionTransformer(accept_sparse=True, check_inverse=False,
                    feature_names_out='one-to-one')
X =             0         1         2         3
0   -0.900681  1.019004 -1.340227 -1.315444
1   -1.143017 -0.131979 -1.340....053935
148  0.432165  0.788808  0.933271  1.448832
149  0.068662 -0.131979  0.762758  0.790671

[150 rows x 4 columns]

    def transform(self, X):
        """Transform X using the forward function.
    
        Parameters
        ----------
        X : {array-like, sparse-matrix} of shape (n_samples, n_features) \
                if `validate=True` else any object that `func` can handle
            Input array.
    
        Returns
        -------
        X_out : array-like, shape (n_samples, n_features)
            Transformed input.
        """
        X = self._check_input(X, reset=False)
        out = self._transform(X, func=self.func, kw_args=self.kw_args)
    
        if hasattr(out, "columns") and self.feature_names_out is not None:
            # check the consistency between the column names of the output and the
            # one generated by `get_feature_names_out`
            if list(out.columns) != list(self.get_feature_names_out()):
>               raise ValueError(
                    "The output generated by `func` have different column names than "
                    "the one generated by the method `get_feature_names_out`. "
                    f"Got output with columns names: {list(out.columns)} and "
                    "`get_feature_names_out` returned: "
                    f"{list(self.get_feature_names_out())}. "
                    "This can be fixed in different manners depending on your use case:"
                    "\n(i) If `func` returns a container with column names, make sure "
                    "they are consistent with the output of `get_feature_names_out`.\n"
                    "(ii) If `func` is a NumPy `ufunc`, then forcing `validate=True` "
                    "could be considered to internally convert the input container to "
                    "a NumPy array before calling the `ufunc`.\n"
                    "(iii) The column names can be overriden by setting "
                    "`set_output(transform='pandas')` such that the column names are "
                    "set to the names provided by `get_feature_names_out`."
                )
E               ValueError: The output generated by `func` have different column names than the one generated by the method `get_feature_names_out`. Got output with columns names: [0, 1, 2, 3] and `get_feature_names_out` returned: ['x0', 'x1', 'x2', 'x3']. This can be fixed in different manners depending on your use case:
E               (i) If `func` returns a container with column names, make sure they are consistent with the output of `get_feature_names_out`.
E               (ii) If `func` is a NumPy `ufunc`, then forcing `validate=True` could be considered to internally convert the input container to a NumPy array before calling the `ufunc`.
E               (iii) The column names can be overriden by setting `set_output(transform='pandas')` such that the column names are set to the names provided by `get_feature_names_out`.

LennartPurucker · 2025-01-14T12:34:20Z

I think it might be this scikit-learn/scikit-learn#28262

LeoGrin · 2025-01-14T13:53:28Z

Found a fix: replacing `passthrough` for the remainder of the ColumnTransformer with a FunctionTransformer, which by default is the identity. I don't understand why it fixes the issue as well as I would like, but it seems to work 🤷‍♂️
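A minimal sketch of the kind of change described above, not the actual TabPFN code: the column indices, the encoder settings, and the `make_encoder` helper are assumptions made up for illustration. It only shows that a default `FunctionTransformer()` can stand in for `remainder="passthrough"` and leaves the remaining columns unchanged.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder


def make_encoder(categorical_idx):
    """Hypothetical helper: ordinal-encode some columns, keep the rest as-is."""
    return ColumnTransformer(
        transformers=[
            (
                "encoder",
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                categorical_idx,
            ),
        ],
        # Instead of remainder="passthrough": FunctionTransformer() with no
        # `func` applies the identity to the remaining columns.
        remainder=FunctionTransformer(),
    )


# Toy data: column 0 is treated as categorical, columns 1 and 2 are numeric.
X = np.array([[0.0, 1.5, 2.0], [1.0, 0.5, 3.0], [0.0, 2.5, 4.0]])
X_out = make_encoder([0]).fit_transform(X)

# Encoded columns come first, followed by the untouched remainder columns.
print(X_out.shape)
```

The output keeps the same number of columns, and the remainder columns are passed through unchanged.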

I pushed the minimum version of sklearn back to 1.2.0 (December 2022). Earlier versions fail (at least) because SimpleImputer's `keep_empty_features` was introduced in 1.2.0. I would argue we can use 1.2.0 as the minimum version and open a new issue to support older versions, but tell me if you think 1.2.0 is still too recent.

LennartPurucker · 2025-01-14T14:00:37Z

Perfect, very nice, thank you @LeoGrin!

LennartPurucker

LGTM!

LennartPurucker · 2025-01-14T14:01:23Z

Feel free to merge at your convenience.

…l. (#135) * Record copied public PR 492 * Forward cache_trainset_representation to load_model. (#492) (cherry picked from commit 54fd7e6) --------- Co-authored-by: mirror-bot <[email protected]> Co-authored-by: Phil <[email protected]>

Increase minimum scikit-learn versiont to fix error when using TabPFN as part of a pipeline

add tests and increase scikit-learn version

397f352

LeoGrin requested a review from LennartPurucker January 14, 2025 10:49

minium sklearn version to 1.4.1

fb9d13a

fix pipeline error by changing passtrough to identity transformer, push back minimum sklearn to 1.2.0

cde6185

LeoGrin requested review from LennartPurucker and removed request for LennartPurucker January 14, 2025 13:54

LennartPurucker approved these changes Jan 14, 2025

LeoGrin merged commit 6f84506 into main Jan 14, 2025

LeoGrin changed the title ~~Increase minimum scikit-learn versiont to fix error when using TabPFN as part of a pipeline~~ Fix error when using TabPFN as part of a pipeline Jan 14, 2025

LeoGrin mentioned this pull request Jan 14, 2025

Support scikit-learn version < 1.2.0 #136

Closed

LennartPurucker deleted the fix_pipeline_error branch January 14, 2025 16:10

LeoGrin mentioned this pull request Jan 24, 2025

Issue with AutoTabPFNRegressor #155

Closed

liu-qingyuan pushed a commit to liu-qingyuan/TabPFN that referenced this pull request Nov 24, 2025

Merge pull request PriorLabs#135 from PriorLabs/fix_pipeline_error

edbce93

Increase minimum scikit-learn versiont to fix error when using TabPFN as part of a pipeline
