CI Use conda-forge for min-dependencies build and add polars and pandas #29502

Merged
merged 19 commits into from
Jul 30, 2024

Conversation

lesteve
Member

@lesteve lesteve commented Jul 16, 2024

Reference Issues/PRs

As noticed in #29490 (comment), we currently don't have any CI build with numpy 1.19 or numpy 1.20. The issue was caught in doc-min-dependencies because it actually uses our real minimum supported numpy version.

What does this implement/fix? Explain your changes.

On top of using conda-forge so that we can install our real minimum dependencies, this adds polars and pandas to our min-dependencies build. This would make it easier to notice issues like #29490.


github-actions bot commented Jul 16, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e6e9357. Link to the linter CI: here

OmarManzoor
OmarManzoor previously approved these changes Jul 16, 2024
Contributor

@OmarManzoor OmarManzoor left a comment

LGTM. Thanks @lesteve. The CI seems to be failing though. Won't doing this require fixing other errors just because of an older version?

@OmarManzoor OmarManzoor dismissed their stale review July 16, 2024 14:57

CI seems to be failing

@lesteve
Member Author

lesteve commented Jul 16, 2024

The CI seems to be failing though.

Yep, I guess this is the old saying "if it's not tested then it's probably broken" striking again 😉

There are 3 failures, see the build log:

FAILED compose/tests/test_column_transformer.py::test_column_transformer_column_renaming[polars] - ValueError: expected 3 values when selecting columns by boolean mask, got 0
FAILED compose/tests/test_column_transformer.py::test_column_transformer_error_with_duplicated_columns[polars] - AssertionError: Regex pattern did not match.
FAILED preprocessing/tests/test_function_transformer.py::test_function_transformer_overwrite_column_names[pandas-polars] - ValueError: Length mismatch: Expected axis has 3 elements, new values have ...

It is probably worth taking a closer look to estimate the work needed. The failures seem to be mostly related to polars, and I would say numpy<=1.21 + polars is rather unlikely to happen in reality.

A reasonable compromise may be to bump our minimum NumPy version now since, you know, we are not really testing the oldest versions very thoroughly.

Another reasonable compromise (but maybe more controversial?) would be to do nothing, pretend we did not notice this, and rely only on doc-min-dependencies to catch the worst bugs.

@lesteve
Member Author

lesteve commented Jul 17, 2024

So after a bit of investigation I settled on:

  • adding pandas and polars to the min-dependencies CI build.
  • fixing an issue with pandas < 1.4. For some reason, when you create a pandas DataFrame from a polars DataFrame in pandas < 1.4 it adds an additional column ... Edit: what actually happens is that the data is transposed.
  • moving the minimum supported polars version to 0.20.30, which fixed the other issues. polars 0.20.30 was released on May 26, 2024; the original minimum supported version was 0.20.23, released roughly one month earlier (April 28, 2024). My feeling is that polars is moving so fast that scikit-learn users who care about polars support will be on recent versions anyway. cc @lorentzenchr who may have an informed opinion about this.
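
The two version-dependent decisions in the list above can be sketched as small gates. This is only an illustration: `parse_release`, `needs_transpose_workaround` and `polars_is_supported` are hypothetical names, not scikit-learn helpers, and real code would more likely use a proper version parser such as `packaging.version`.

```python
def parse_release(version):
    """Turn '1.3.5' into (1, 3, 5), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_transpose_workaround(pandas_version):
    # Assumption taken from the comment above: pandas < 1.4 builds a
    # transposed frame when handed a polars DataFrame.
    return parse_release(pandas_version) < (1, 4)


MIN_POLARS = "0.20.30"  # new minimum proposed in this PR


def polars_is_supported(polars_version):
    # Tuple comparison orders releases numerically, unlike string comparison.
    return parse_release(polars_version) >= parse_release(MIN_POLARS)
```

With these gates, pandas 1.3.5 would take the workaround path and polars 0.20.23 would be rejected as too old.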

@lesteve
Member Author

lesteve commented Jul 17, 2024

The codecov red status is because there is no longer a CI build without pandas and with coverage enabled. Adding pandas to the min-dependencies build removes the last build without pandas and with coverage enabled. I think this is still OKish.

The CI failure on the previous commit (where I made a mistake in the "no pandas installed" code) shows that we still have CI builds without pandas, mainly the Windows pymin_conda_forge_mkl and Ubuntu Atlas builds, see build log.


def _create_pandas_dataframe_from_non_pandas_container(X, *, index, copy):
    X_output = pd.DataFrame(X, index=index, copy=copy)
    if "polars" not in str(X.__class__):
Member

Is it good enough to test for "polars" in the class string, or should we use the more specific check below?

Suggested change
if "polars" not in str(X.__class__):
if str(X.__class__).startswith("polars."):

Member Author

@lesteve lesteve Jul 17, 2024

Fair question: originally I wanted to avoid importing polars to check isinstance(X, pl.DataFrame), although I don't really remember why, to be fully honest 🤔. There is also sklearn.utils.validation._is_polars_df, although this will likely cause a circular import: sklearn.utils.validation imports stuff from sklearn.utils.fixes, so it's not OK to import sklearn.utils.validation in sklearn.utils.fixes.

Let me try to think a bit more about this.
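
For reference, one way to detect a polars DataFrame without importing polars is to look the package up in `sys.modules`: if polars was never imported, `X` cannot be one of its objects. The sketch below is illustrative only, and both function names are hypothetical. Note that for an ordinary class `str(X.__class__)` starts with `<class '`, so a `startswith("polars.")` test on that string would not match; the class's `__module__` attribute is the string that begins with the package name.

```python
import sys


def looks_like_polars_df(X):
    """Return True if X is a polars DataFrame, without importing polars."""
    pl = sys.modules.get("polars")
    if pl is None:
        # polars was never imported, so X cannot be one of its objects.
        return False
    return isinstance(X, pl.DataFrame)


def module_starts_with_polars(X):
    """String-based alternative: inspect the module that defines X's class."""
    module = type(X).__module__
    return module == "polars" or module.startswith("polars.")
```

The first variant only adds a dictionary lookup when polars is not in use. The second needs no access to the polars package at all, but it matches any polars object, not just DataFrames.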

Member Author

I pushed some better code in fdf23ed

@betatim
Member

betatim commented Jul 17, 2024

I think bumping the polars version is OK. It is not an official dependency of scikit-learn, so we don't have to be as conservative.

With this PR we remove the only CI setup where we use the defaults channel of conda. I personally haven't used the defaults channel in a very, very long time, but I suspect that if we asked more "normal" data science users we would find some/many who do. Should we keep at least one defaults channel CI setup, maybe not with the minimum versions we require but with the lowest ones currently available?

@lesteve
Member Author

lesteve commented Jul 17, 2024

With this PR we remove the only CI setup where we use the defaults channel of conda

What makes you think this? There are plenty (OK, actually 3 after the changes in this PR) of CI builds still using the "defaults" channel. Edit: thinking about it, 2 are pip-based so they don't really use the conda packages from the defaults channel, but there is still the no-OpenMP one, which is on macOS (do we strongly want to keep a build with conda packages from defaults for Linux?).

The command I use:

❯ rg defaults build_tools/**/*environment.yml
build_tools/azure/pylatest_conda_mkl_no_openmp_environment.yml
5:  - defaults

build_tools/azure/pylatest_pip_openblas_pandas_environment.yml
5:  - defaults

build_tools/azure/pylatest_pip_scipy_dev_environment.yml
5:  - defaults

(IMO there should be one CI build using defaults and all the other using conda-forge, but that's a different discussion 😉)
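
For readers without ripgrep, the same search can be approximated in Python. This is a minimal sketch, assuming the environment files are plain YAML in which channels appear as `- defaults` list items. `find_defaults_channel` is a hypothetical helper, not part of the repository's tooling, and it is stricter than the `rg` call above, which matches any line containing "defaults".

```python
from pathlib import Path


def find_defaults_channel(root):
    """Map each *environment.yml under root to the line numbers listing 'defaults'."""
    hits = {}
    for path in sorted(Path(root).rglob("*environment.yml")):
        lines = path.read_text(encoding="utf-8").splitlines()
        matches = [
            number
            for number, line in enumerate(lines, start=1)
            if line.strip() == "- defaults"
        ]
        if matches:
            hits[str(path)] = matches
    return hits
```

Calling `find_defaults_channel("build_tools")` from the repository root would return a dictionary keyed by file path, which makes it easy to count how many builds still rely on the defaults channel.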

Member

@ogrisel ogrisel left a comment

LGTM. I am fine with switching this build to conda-forge as long as we still have one CI build using the defaults channel.

@lesteve
Member Author

lesteve commented Jul 30, 2024

Merging my own PR with two approvals

@lesteve lesteve merged commit cb35bd4 into scikit-learn:main Jul 30, 2024
30 of 32 checks passed
@lesteve lesteve deleted the conda-forge-min-deps-ci branch July 30, 2024 10:09
MarcBresson pushed a commit to MarcBresson/scikit-learn that referenced this pull request Sep 2, 2024
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 9, 2024
glemaitre pushed a commit that referenced this pull request Sep 11, 2024