CI Use conda-forge for min-dependencies build and add polars and pandas #29502
LGTM. Thanks @lesteve. The CI seems to be failing though. Won't doing this require fixing other errors just because of an older version?
Yep, I guess this is the old saying "if it's not tested then it's probably broken" striking again 😉 There are 3 failures; see the build log.
It is probably worth having a closer look to estimate the work needed. It seems to be mostly related to polars, and I would say numpy<=1.21 + polars is rather unlikely to happen in reality. A reasonable compromise may be to bump our minimum NumPy version now, since we are not really testing the oldest versions very thoroughly. Another reasonable compromise (but maybe more controversial?) would be to do nothing, pretend we did not notice this, and rely only on doc-min-dependencies to catch the worst bugs.
So after a bit of investigation I settled on:
The codecov red status is because there is no CI build without pandas and with coverage enabled. Adding pandas to the min dependencies actually removes the last build without pandas and with coverage enabled. I think this is still OKish. The CI failure on the previous commit (where I made a mistake in the "no pandas installed" code) shows that we still have CI builds without pandas, mainly the Windows pymin_conda_forge_mkl and Ubuntu Atlas builds; see the build log.
sklearn/utils/fixes.py
Outdated
def _create_pandas_dataframe_from_non_pandas_container(X, *, index, copy):
    X_output = pd.DataFrame(X, index=index, copy=copy)
    if "polars" not in str(X.__class__):
Is it good enough to test "polars" in class, or should we check the more specific
- if "polars" not in str(X.__class__):
+ if str(X.__class__).startswith("polars."):
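A small sketch of why the two checks in this thread behave differently: `str(X.__class__)` renders as `<class '...'>`, so a `startswith("polars.")` test on that string never matches, while a substring test also matches unrelated modules whose name merely contains "polars". The classes below are hypothetical stand-ins so the snippet runs without polars installed, and `defined_in_polars` is an illustrative helper, not scikit-learn code.

```python
# Hypothetical stand-ins: classes whose __module__ mimics polars and a
# look-alike package, so no real polars install is needed.
PolarsLike = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})
LookAlike = type("Frame", (), {"__module__": "mypolars_helpers"})


def defined_in_polars(obj):
    """Illustrative helper: True if obj's class lives in the polars package."""
    module = type(obj).__module__
    return module == "polars" or module.startswith("polars.")


X = PolarsLike()
other = LookAlike()

# The class renders as "<class 'polars.dataframe.frame.DataFrame'>".
as_text = str(X.__class__)

substring_check = "polars" in as_text              # matches
prefix_check = as_text.startswith("polars.")       # never matches: text starts with "<class '"
false_positive = "polars" in str(other.__class__)  # matches, although this is not polars
```

Checking `type(X).__module__` instead of the rendered class string avoids both problems, at the cost of relying on where polars defines its classes.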
Fair question: originally I wanted to avoid importing polars to check isinstance(X, pl.DataFrame). I don't really remember why, to be fully honest 🤔. There is also sklearn.utils.validation._is_polars_df, although this will likely cause a circular import: sklearn.utils.validation imports stuff from sklearn.utils.fixes, so it's not OK to import sklearn.utils.validation in sklearn.utils.fixes.
Let me try to think a bit more about this.
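For illustration, one way to get an `isinstance` check without importing polars (and without importing anything from `sklearn.utils.validation`) is to look the module up in `sys.modules`. This is a minimal sketch under that assumption; `is_polars_dataframe` is a hypothetical name and not necessarily what was pushed to this PR.

```python
import sys


def is_polars_dataframe(X):
    """Hypothetical helper: True if X is a polars DataFrame.

    polars is never imported here. If it is absent from sys.modules,
    nobody in this process has imported it, so X cannot be a polars object.
    """
    pl = sys.modules.get("polars")
    if pl is None:
        return False
    return isinstance(X, pl.DataFrame)
```

Because the helper only depends on the standard library, it can live in a low-level module without creating an import cycle.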
I pushed some better code in fdf23ed
I think bumping the polars version is OK. It is not an official dependency of scikit-learn, so we don't have to be as conservative. With this PR we remove the only CI setup where we use the defaults channel of conda. I personally haven't used the defaults channel in a very long time, but I suspect that if we asked more "normal" data science users we would find some, or many, who do. Should we keep at least one defaults channel CI setup, perhaps not with the minimum versions we require but with the lowest ones currently available?
Co-authored-by: Tim Head <[email protected]>
What makes you think this? There are plenty (OK, actually 3 after the changes in this PR) of CI builds still using the "defaults" channel. Edit: thinking about it, 2 are pip-based, so they don't really use the conda packages from the defaults channel. The command I use:
(IMO there should be one CI build using defaults and all the others using conda-forge, but that's a different discussion 😉)
LGTM. I am fine with switching this build to conda-forge as long as we still have one CI build using the defaults channel.
…nto conda-forge-min-deps-ci
Merging my own PR with two approvals
…as (scikit-learn#29502) Co-authored-by: Tim Head <[email protected]>
Reference Issues/PRs
As noticed in #29490 (comment), we currently don't have any CI build with numpy 1.19 or numpy 1.20. The issue was caught in doc-min-dependencies because it actually uses our real minimum supported numpy version.
What does this implement/fix? Explain your changes.
On top of using conda-forge to be able to use our min dependencies, this adds polars and pandas to our min-dependencies build. This would make it easier to notice issues like #29490.