ARD Regressor accuracy degrades when upgrading Scipy 1.2.1 -> 1.3.0 #14055
Thanks for the report. After a quick check
Thanks for the suggestion, I'll see if I can pin that down.
That's not the first time this change has caused issues in our code: #13903
Yep, a quick-and-dirty patch confirms this is due to the aforementioned pinvh cond changes. I'll try to clean that up into something more readable and maintainable, making it clear what's a 'choose_pinvh_cutoff' subroutine and that the current option is just to match the default Scipy behaviour pre-1.3.0.
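For reference, a minimal sketch of what matching the pre-1.3.0 default could look like (the 1e3/1e6 factors match the old Scipy pinvh defaults for single/double precision as I recall them; the function name comes from the comment above, and the signature is an assumption):

```python
import numpy as np

def choose_pinvh_cutoff(eigenvalues):
    # Reproduce the pre-1.3.0 scipy.linalg.pinvh default: eigenvalues whose
    # magnitude falls below factor * eps * max(|eigenvalues|) are treated as
    # zero, where factor is 1e3 for single and 1e6 for double precision.
    t = eigenvalues.dtype.char.lower()
    factor = {'f': 1e3, 'd': 1e6}
    return factor[t] * np.finfo(t).eps * np.max(np.abs(eigenvalues))
```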
@NicolasHug The fix in #14067 copies over the old `pinvh`; maybe we should do the same for #13903 and copy over `pinv2`?
The regression test breaks when updating to Python 3.8: #15637
Hi,
bit of a tricky one, I'm hoping someone will have some time and/or suggestions for further investigation!
There seems to be a frequently occurring degradation in performance (i.e. accuracy, although run time increases too!) from the ARD regressor when upgrading Scipy 1.2.1 -> 1.3.0.
Description
On a very simple dataset (see code snippets below) where a near-perfect fit should be achievable, typical error seems to degrade from order 1E-5 to 1E-2. Notably, the number of iterations to convergence also seems to increase, from a few (~5) to around 50-200.
Here's the headline plot, plotting absolute coefficient error when fit across 1000 datasets generated with different random seeds:

Note how with Scipy==1.2.1, errors are largely constrained to <0.01, while with Scipy==1.3.0 they range up to 0.05 (and in a few rare cases the algorithm produces garbage results, see later).
I guess this could be (probably is?) a Scipy rather than Sklearn issue, but the only way to confirm or isolate that is probably to start here.
It's also possible that this worsening of behaviour is a quirk of my particular toy example, but the difference seems large and unexpected enough to warrant further investigation, I'd hope!
Steps/Code to Reproduce
Single Seed:
OK, so here's a short snippet using just a single seed, if you're curious to try this yourself. I generate three vectors of normally distributed values, 250 samples each. The target is then a perfect copy of one of those vectors (index=1). We measure the accuracy of the fit by checking how close that coefficient is to 1.0 (the other coefficients always shrink to 0., as you'd hope):
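A minimal sketch of that setup (the seed value and variable names are my assumptions, chosen to match the description above):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(42)   # assumed seed
X = rng.normal(size=(250, 3))     # three normally distributed feature vectors
y = X[:, 1].copy()                # target is a perfect copy of feature index 1

model = ARDRegression()
model.fit(X, y)

# With a near-perfect fit, coef_[1] should be ~1.0 and the others ~0.0.
print("coefficients:", model.coef_)
print("abs error on informative coefficient:", abs(model.coef_[1] - 1.0))
```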
Results
Scipy 1.2.1:
Scipy 1.3.0:
Datasets from 1000 different seeds
It could be that there's some oddity in the random data from a single seed, so I set up some short scripts to first generate a static collection of 1000 datasets like the one above, then collate the results from both versions of Scipy. The snippets are as follows:
Make data:
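A sketch of the generation step under the same assumptions as the single-seed snippet (the `.npz` filename convention is mine):

```python
import numpy as np

n_datasets, n_samples, n_features = 1000, 250, 3

for seed in range(n_datasets):
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, n_features))
    y = X[:, 1].copy()  # target duplicates feature index 1, as above
    np.savez(f"dataset_{seed}.npz", X=X, y=y)
```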
Test sklearn:
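A sketch of the fitting loop, run once per Scipy version (the filenames assume the conventions from the data script above):

```python
import numpy as np
import scipy
from sklearn.linear_model import ARDRegression

errors = []
for seed in range(1000):
    data = np.load(f"dataset_{seed}.npz")
    model = ARDRegression()
    model.fit(data["X"], data["y"])
    # Absolute error on the single informative coefficient.
    errors.append(abs(model.coef_[1] - 1.0))

np.save(f"errors_scipy_{scipy.__version__}.npy", np.array(errors))
```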
Plot results:
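And a sketch of the comparison plot (the `.npy` filenames assume the fitting script above):

```python
import numpy as np
import matplotlib.pyplot as plt

for version in ("1.2.1", "1.3.0"):
    errors = np.load(f"errors_scipy_{version}.npy")
    plt.hist(errors, bins=50, alpha=0.5, label=f"Scipy {version}")

plt.xlabel("absolute coefficient error")
plt.ylabel("count")
plt.legend()
plt.show()
```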
A little investigation of the summary statistics of those results in a notebook gives the following points of comparison: