Thanks to visit codestin.com
Credit goes to github.com

Skip to content

ARD Regressor accuracy degrades when upgrading Scipy 1.2.1 -> 1.3.0 #14055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
timstaley opened this issue Jun 10, 2019 · 6 comments · Fixed by #14067
Closed

ARD Regressor accuracy degrades when upgrading Scipy 1.2.1 -> 1.3.0 #14055

timstaley opened this issue Jun 10, 2019 · 6 comments · Fixed by #14067
Labels
Bug Moderate Anything that requires some knowledge of conventions and best practices Sprint

Comments

@timstaley
Copy link
Contributor

timstaley commented Jun 10, 2019

Hi,
bit of a tricky one, I'm hoping someone will have some time and/or suggestions for further investigation!

There seems to be an often-occurring worsening of performance (i.e. accuracy, although run-time increases too!) from the ARD regressor when upgrading from Scipy 1.2.1 -> 1.3.0.

Description

On a very simple dataset (see code snippets below) where a near-perfect fit should be achievable, typical error seems to degrade from order 1E-5 to 1E-2. Notably, convergence iterations seem to increase also from ~a few (~5) to around 50-200 iterations.

Here's the headline plot, plotting absolute co-efficient error when fit across 1000 datasets generated with different random seeds:
coeff_abs_error_histograms

Note how with Scipy==1.2.1, errors are largely constrained to <0.01, while with Scipy==1.3.0 they range up to 0.05 (and in a few rare cases the algorithm produces garbage results, see later).

I guess this could be (probably is?) a Scipy rather than Sklearn issue, but probably the only way to confirm / isolate that would be to start here.

It's also possible that this worsening of behaviour is a weirdness of my particular toy example, but the difference in behaviour seems large and unexpected enough to warrant further investigation, I'd hope!

Steps/Code to Reproduce

Single Seed:

OK, so here's a short snippet on just a single seed if you're curious to try this yourself. I'm generating three vectors of normally distributed values, 250 samples. Then the target is just a perfect copy of one of those vectors (index=1). We measure the accuracy of the fit by simply checking how close that coefficient is to 1.0 (the other coefficients always shrink to 0., as you'd hope):

import scipy
import sklearn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ARDRegression

sklearn.show_versions()

def test_ard_regressor(dataset: np.ndarray) -> float:
    X = dataset
    y = X[:,1]
    regressor = ARDRegression(verbose=True)
    regressor.fit(X, y)
    abs_coef_error = np.abs(1 - regressor.coef_[1])
    print(abs_coef_error)
    return abs_coef_error

size=250
X = np.random.RandomState(seed=45).normal(size=(size,3))

test_ard_regressor(X)

Results

Scipy 1.2.1:

python single_seed.py 

System:
    python: 3.6.7 (default, Oct 22 2018, 11:32:17)  [GCC 8.2.0]
executable: /home/staley/.virtualenvs/sklearn-bug-scipy-1.2.1/bin/python
   machine: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
    macros: HAVE_CBLAS=None
  lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: openblas, openblas

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.2
     numpy: 1.16.4
     scipy: 1.2.1
    Cython: None
    pandas: None
Converged after 4 iterations
9.647701516568574e-07

Scipy 1.3.0

python single_seed.py 

System:
    python: 3.6.7 (default, Oct 22 2018, 11:32:17)  [GCC 8.2.0]
executable: /home/staley/.virtualenvs/sklearn-bug-scipy-1.3/bin/python
   machine: Linux-4.15.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

BLAS:
    macros: HAVE_CBLAS=None
  lib_dirs: /usr/lib/x86_64-linux-gnu
cblas_libs: openblas, openblas

Python deps:
       pip: 19.1.1
setuptools: 41.0.1
   sklearn: 0.21.2
     numpy: 1.16.4
     scipy: 1.3.0
    Cython: None
    pandas: None
Converged after 18 iterations
0.16538104739325354

Datasets from 1000 different seeds

It could be that there's some oddity of the random data from a single seed, so I set up some short scripts to first generate a static collection of 1000 of the datasets as seen above, then collate the results from both versions of scipy. The snippets are as follows:

Make data:

import numpy as np
size=250
random_datasets = {seed: np.random.RandomState(seed).normal(size=(size,3)) 
                   for seed in range(1000)}
np.savez('random_datasets.npz', data=list(random_datasets.values()), seeds=list(random_datasets.keys()))

Test sklearn:

import scipy
import sklearn
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import ARDRegression

random_datasets = np.load('random_datasets.npz')
random_datasets=dict(zip(random_datasets['seeds'], random_datasets['data']))

def test_ard_regressor(dataset: np.ndarray) -> float:
    X = dataset
    y = X[:,1]
    regressor = ARDRegression(verbose=True)
    regressor.fit(X, y)
    abs_coef_error = np.abs(1 - regressor.coef_[1])
    print(abs_coef_error)
    return abs_coef_error

results = []
for seed, data in random_datasets.items():
    print("Seed:",seed)
    results.append(test_ard_regressor(data))

np.save(f'scipy_{scipy.__version__}_results', np.array(results))

Plot results:

import numpy as np
import matplotlib.pyplot as plt

results_1_2_1 = np.load("./scipy_1.2.1_results.npy")
results_1_3_0 = np.load("./scipy_1.3.0_results.npy")

counts, bin_edges = np.histogram(results_1_2_1)

ax = plt.gca()
ax.hist(results_1_2_1, label="scipy==1.2.1", alpha=0.5, bins=bin_edges)
ax.hist(results_1_3_0, label="scipy==1.3.0", alpha=0.5, bins=bin_edges)
# ax.set_xlim(0, 1.0)
ax.legend()
plt.show()

A little investigating summary statistics of those datasets in notebook gives the following points of comparison:

> np.median(results_1_2_1)
1.1909624002770514e-05
> np.median(results_1_3_0)
0.008368892887510193

>np.percentile(results_1_2_1, 99)
0.03166983391537859
>np.percentile(results_1_3_0, 99)
0.16551247976283737


> results_1_2_1.max()
0.08478086928684647
>results_1_3_0.max()
46606.5545533851 

@timstaley timstaley changed the title ARD Regressor accuracy regression when upgrading Scipy 1.2.1 -> 1.3.0 ARD Regressor accuracy degrades when upgrading Scipy 1.2.1 -> 1.3.0 Jun 10, 2019
@rth
Copy link
Member

rth commented Jun 10, 2019

Thanks for the report. After a quick check ARDRegression uses pinvh from scipy. The cutoff factor for small singular values was recently changed in scipy/scipy#10067 it might be worth setting the previous value in scikit-learn code and see if that allows you to reproduce previous results.

@timstaley
Copy link
Contributor Author

Thanks for the suggestion, I'll see if I can pin that down.

@NicolasHug
Copy link
Member

that's not the first time this change is causing issues in our code #13903

@timstaley
Copy link
Contributor Author

timstaley commented Jun 10, 2019

Yep, a quick-and-dirty patch confirms this is due to the aforementioned pinvh cond changes
(scipy/scipy#10067)

I'll try and clean that up into something more readable and maintainable, making it clear what's a 'choose_pinvh_cutoff' subroutine and that the current option is to just match default scipy behaviour pre 1.3.0.
That should revert the accuracy regression, while leaving room if anyone wants to have a think about a more rigorous approach to computing a sensible pinvh cutoff value (not something I'm up to speed on personally).

@thomasjpfan
Copy link
Member

@NicolasHug The fix in #14067, copies over the old 'pinvh`, maybe we should do the same for #13903 and copy over pinv2?

@thomasjpfan thomasjpfan added Bug Sprint Moderate Anything that requires some knowledge of conventions and best practices labels Jul 13, 2019
@amueller
Copy link
Member

amueller commented Nov 15, 2019

the regression test breaks when updating to python3.8: #15637

System:
python: 3.8.0 (default, Nov 6 2019, 21:49:08) [GCC 7.3.0]
executable: /home/andy/miniconda3/envs/py38/bin/python
machine: Linux-4.15.0-65-generic-x86_64-with-glibc2.10

Python dependencies:
pip: 19.3.1
setuptools: 41.6.0.post20191030
sklearn: 0.22.dev0
numpy: 1.17.4
scipy: 1.3.2
Cython: 0.29.14
pandas: None
matplotlib: None
joblib: 0.14.0

Built with OpenMP: True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Moderate Anything that requires some knowledge of conventions and best practices Sprint
Projects
None yet
5 participants