RFC should the scikit-learn metrics return a Python scalar or a NumPy scalar? #27339

Closed
glemaitre opened this issue Sep 11, 2023 · 12 comments · Fixed by #30575

Comments

@glemaitre
Member

While working on the representation changes imposed by NEP51, I found out that we recently made accuracy_score return a Python scalar while, up to now, the other metrics have been returning NumPy scalars.

This change was made due to the array API work:

def _weighted_sum(sample_score, sample_weight, normalize=False, xp=None):
# XXX: this function accepts Array API input but returns a Python scalar
# float. The call to float() is convenient because it removes the need to
# move back results from device to host memory (e.g. calling `.cpu()` on a
# torch tensor). However, this might interact in unexpected ways (break?)
# with lazy Array API implementations. See:
# https://github.com/data-apis/array-api/issues/642

I think we have reached a point where we should make the output of our metrics consistent while also anticipating future requirements: as the comment indicates, calling float() introduces a synchronization point, which might not be the best strategy for lazy computation.
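
To illustrate the current inconsistency, a minimal sketch (the exact return types depend on the installed scikit-learn and NumPy versions):

from sklearn.metrics import accuracy_score, mean_squared_error

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# accuracy_score now returns a built-in Python float...
print(type(accuracy_score(y_true, y_pred)))      # <class 'float'>
# ...while most other metrics still return a NumPy scalar
print(type(mean_squared_error(y_true, y_pred)))  # <class 'numpy.float64'>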

This RFC is a placeholder to discuss which strategy we should implement.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Sep 11, 2023
@glemaitre glemaitre added RFC and removed Needs Triage Issue requires triage labels Sep 11, 2023
@ogrisel
Member

ogrisel commented Sep 11, 2023

As hinted by @glemaitre, the accepted NEP51 proposes to change the representation (__repr__) of NumPy scalars:

https://numpy.org/neps/nep-0051-scalar-representation.html

It is being implemented for numpy 2.0, scheduled for next year, and is causing many of our doctests to fail.
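
Concretely, the repr change looks like this (the output depends on the installed NumPy version):

import numpy as np

x = np.float64(0.5)
print(repr(x))
# NumPy 1.x prints:           0.5
# NumPy 2.0 (NEP51) prints:   np.float64(0.5)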

To address NEP51, we could go one of several ways.

[solution-a] Do nothing to our scoring metrics (continue returning NumPy scalars)

  • accept that this is a more informative __repr__ and still decide to use NumPy scalars in the output of our metric functions in scikit-learn. Users would get more verbose output when using scikit-learn in interactive mode (e.g. in a Jupyter notebook session), but at least it would be more explicit about, for instance, the precision of the floating point computation,
  • we could either update our doctests so that they pass on numpy 2.0.0.dev0+ and stop running the doctests on numpy < 2.0 CIs, or alternatively keep the doctests unchanged for now, stop running them on the [scipy-dev] build, and re-explore this decision once numpy 2.0 is out (or at least an official beta).
  • to add array API support, we might need to change our tests to call float() manually on the result of accuracy_score and the like when needed (e.g. to trigger a lazy computation, move the result to CPU, and be able to compare it to another Python scalar value); see the sketch below.
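
For example, a hypothetical test adjustment under solution (a) could look like this (a sketch only, not existing test code):

from sklearn.metrics import accuracy_score

score = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
# explicit conversion: triggers any pending lazy computation and moves
# the scalar from device to host memory before the comparison
assert float(score) == 0.75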

[solution-b] Change our scoring metrics to always return a Python scalar

So we could decide to make our scalar metric functions always call float() (and maybe int() in some cases) internally.

This means that the information about the precision level of the computation would be lost.

This also means that eager/blocking execution semantics are forced when calling a metric function with lazy Array API compatible inputs. It would also always move the resulting scalar value back to the CPU without the user having to perform a library-specific operation.

But this would make our doctest suite pass unchanged on both numpy 1.x and numpy 2.x+.

It is also less verbose for scikit-learn users calling accuracy_score and similar functions in their notebooks.
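
A minimal sketch of what solution (b) implies inside a metric function (hypothetical simplified code, not the actual scikit-learn implementation):

import numpy as np

def toy_accuracy(y_true, y_pred):
    # the final float() call discards the NumPy dtype, forces eager
    # evaluation on lazy backends, and moves the value to host memory
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(type(toy_accuracy([0, 1, 1, 0], [0, 1, 0, 0])))  # <class 'float'>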

[solution-c] Add a new flag return_as_python_scalar=True to all metrics

This way we would do the conversion to a Python scalar by default but give control to the user, should they need to access the dtype of the result or keep the computation as lazy as possible when using the Array API.
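
The hypothetical API would look like this (return_as_python_scalar is not an existing scikit-learn parameter, it is the flag proposed above):

from sklearn.metrics import accuracy_score

y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]

# default: convert to a Python float, as in solution (b)
score = accuracy_score(y_true, y_pred)

# opt out (hypothetical flag): keep the library-native scalar,
# preserving its dtype and any laziness of the backend
raw_score = accuracy_score(y_true, y_pred, return_as_python_scalar=False)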

More context about the possibility of lazy computation via Array API in scikit-learn

Note that the fit method of estimators that use an iterative solver with a tol-based stopping criterion will by construction need to trigger eager computation internally anyway, so we cannot really hope to express 100% lazy machine learning pipelines. More precisely:

est = IterativeEstimator(tol=1e-4).fit(X_lazy_array_train, y_lazy_array_train)
test_score = est.score(X_lazy_array_test, y_lazy_array_test)

even if test_score is kept as a lazy scalar array until float(test_score) is called, a significant computation would already have been triggered (implicit eager evaluation) when calling fit, due to the tol-based checks at the end of each iteration of the internal solver.

Also note:

  • I opened an issue to discuss lazy Array API in scikit-learn: Array API support with lazy evaluation (#26724)
  • Right now, dask does not really implement the Array API, but it could quite quickly via array-api-compat: Dask support (data-apis/array-api-compat#17)
  • JAX's own Array API support might be enough for some scikit-learn estimators (e.g. PCA) but I haven't tried it, and we have not included JAX in our array API compliance test framework yet.
  • PyTorch uses eager computation semantics by default, unless the code is wrapped in a function decorated with torch.compile. The compiler attempts to discover blocks of code that can be executed natively (with lazy evaluation semantics) and automatically issues synchronizing blocking calls to the Python interpreter whenever it's needed (see the sketch below).
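
For reference, a minimal torch.compile sketch (requires PyTorch 2.x; illustration only):

import torch

@torch.compile
def normalized(x):
    # within the compiled region, operations can be fused and executed
    # with deferred semantics; synchronization happens when the result
    # is consumed from Python
    return x / x.sum()

print(normalized(torch.arange(4.0)))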

@ogrisel ogrisel changed the title RFC does the scikit-learn metric should return a Python scalar or a NumPy scalar RFC should the scikit-learn metrics return a Python scalar or a NumPy scalar Sep 11, 2023
@ogrisel ogrisel changed the title RFC should the scikit-learn metrics return a Python scalar or a NumPy scalar RFC should the scikit-learn metrics return a Python scalar or a NumPy scalar? Sep 11, 2023
@betatim
Member

betatim commented Sep 12, 2023

@ogrisel it seems some of your comment is misformatted or missing words? Could you take a look?

@ogrisel
Member

ogrisel commented Sep 13, 2023

@betatim I edited my comment.

@betatim
Member

betatim commented Sep 18, 2023

Thanks a lot!

I think my least favourite option is (c). It feels like we are delegating the complexity to our users, who probably have even less of a clue about "the right thing to do" than we do.

I like (b) and think that it isn't a big deal that something like Estimator.score can't be lazy. Trying to make it lazy seems to add a lot of complexity (you need a LazyScalar and to explain it to users), and I can't think of a use-case where making the computation lazy would make it more efficient. Compare this to, for example, reading data from a file, where you can make the reading more efficient by knowing that certain columns are never used.

I also wonder if we need to specify that the type of the return value of Estimator.score is a Python scalar, rather than relying on duck typing (so that it could be either a Python scalar or a NumPy scalar).

@adrinjalali
Member

I also like (b) for the same reasons as @betatim mentions.

@betatim
Member

betatim commented Oct 13, 2023

There seems to be some amount of consensus and no new comments for a while. Should we try to wrap this up?

@glemaitre what do you think of the options Olivier listed? Do you also like option (b)?

@glemaitre
Member Author

Thanks @betatim for keeping track of this issue. Option (b) looks like the right trade-off right now.
So we can therefore settle on converting the output to a Python scalar.

@ckosten

ckosten commented Nov 7, 2023

/take

@adrinjalali
Member

@ckosten this is not a good issue to begin with. You can look for good first issues and help wanted tags to find one.

@ckosten

ckosten commented Dec 15, 2023

It was assigned as a beginner problem in a scikit-learn workshop recently...

@ckosten this is not a good issue to begin with. You can look for good first issues and help wanted tags to find one.

@ogrisel
Member

ogrisel commented Jun 7, 2024

Shall we close this issue and open a matching meta-issue to track the remaining work to do?

It might be redundant with the array API meta-issue at #26024, which is already well under way, since most scalar-returning metric functions will likely need such a treatment to have their tests pass on PyTorch with CUDA.

Maybe we can leave it open for now and close it once all the metric functions referenced in #26024 have been addressed.

@adrinjalali
Member

most scalar-returning metric functions will likely need such a treatment to have their tests pass on PyTorch with CUDA.

To be clear, you mean for them to return a scalar? Then yeah I'm happy to have this closed.

@github-project-automation github-project-automation bot moved this from Discussion to Done in Array API Feb 3, 2025