Conversation

@ngoix
Contributor

@ngoix ngoix commented Sep 11, 2017

fix #9730

@ngoix ngoix changed the title fix kdd_kddcup99 [MRG] fix kdd_kddcup99 Sep 11, 2017


def test_shuffle():
    dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True)
Member

fix a random_state

Contributor Author

Ok thanks!


def test_shuffle():
    dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True)
    assert(any(dataset.target[-100:] == b'normal.'))
Member

this fails on master?

Contributor Author

Yes it does.
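
A toy sketch of why an assertion on the last 100 targets can detect a missing shuffle. It assumes (this is not confirmed in the thread) that the unshuffled data is grouped by class with the normal samples first. `toy_targets`, `tail_contains_normal`, and the `b'attack.'` label are hypothetical stand-ins, not scikit-learn code.

```python
import random

NORMAL = b'normal.'
ATTACK = b'attack.'  # hypothetical placeholder label for non-normal samples


def toy_targets(shuffle, random_state=0):
    """Build toy targets; unshuffled, all normal samples come first (assumption)."""
    targets = [NORMAL] * 900 + [ATTACK] * 100
    if shuffle:
        # Seeded so the outcome is the same on every run.
        random.Random(random_state).shuffle(targets)
    return targets


def tail_contains_normal(targets, n=100):
    """Mirror the spirit of the test: is any of the last n targets normal?"""
    return any(t == NORMAL for t in targets[-n:])


# Ordered data: the tail holds only attack samples, so the check fails.
print(tail_contains_normal(toy_targets(shuffle=False)))
# Shuffled data: normal samples are spread out, so the check passes.
print(tail_contains_normal(toy_targets(shuffle=True)))
```

Under that ordering assumption, the check only passes when the data has really been shuffled, which would be consistent with the test failing on master.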

@agramfort
Member

@ngoix you'll need an entry in the bug fixes section of what's new

@amueller amueller changed the title [MRG] fix kdd_kddcup99 [MRG + 1] fix kdd_kddcup99 Sep 11, 2017
@amueller
Member

lgtm

Member

@jnothman jnothman left a comment

Shouldn't we also remove the shuffle param from _fetch_brute_kddcup99?

@jnothman
Member

random_state=0 should pass the test, by the way
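
As a rough illustration of why pinning the seed matters for this kind of test: with a fixed `random_state`, the shuffled order is identical on every run, so the assertion cannot pass on one run and fail on the next. The helper `shuffled_copy` is hypothetical and uses the standard library's `random` module rather than scikit-learn's own shuffling.

```python
import random


def shuffled_copy(items, random_state=None):
    """Return a shuffled copy of items; random_state seeds the shuffle.

    Hypothetical helper for illustration only. With random_state=None the
    generator is seeded from the system, so the order varies between runs.
    """
    rng = random.Random(random_state)
    out = list(items)
    rng.shuffle(out)
    return out


targets = [b'normal.'] * 50 + [b'attack.'] * 50  # toy labels, not real data

first = shuffled_copy(targets, random_state=0)
second = shuffled_copy(targets, random_state=0)

# Same seed, same order: the test sees the same data every time.
print(first == second)
```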

@ngoix
Contributor Author

ngoix commented Sep 12, 2017

I also added a SkipTest in case the kdd data has not been downloaded, as the dataset is not small.

@ngoix
Contributor Author

ngoix commented Sep 12, 2017

@jnothman I removed the shuffle param from _fetch_brute_kddcup99

Member

@jnothman jnothman left a comment

I suppose that's reasonable although ideally we'd then have a CI that fetches all the datasets and runs the datasets tests...

LGTM

@jnothman
Member

CI failing?

    try:
        dataset = fetch_kddcup99(random_state=0, subset='SA', shuffle=True,
                                 percent10=True, download_if_missing=False)
    except IOError:
        raise SkipTest("kddcup99 dataset can not be loaded.")
Member

so this is not tested in CIs?
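
A minimal sketch of how the skip pattern behaves, which may help explain the question above: when the data is not already cached and downloading is disabled, the loader raises `IOError` and the test is reported as skipped instead of being run. `load_cached_dataset`, `check_dataset`, and `outcome` are hypothetical helpers; only `unittest.SkipTest` is a real standard-library name.

```python
import unittest


def load_cached_dataset(cache, name, download_if_missing=True):
    """Hypothetical loader: raises IOError if data is absent and downloads are off."""
    if name not in cache:
        if not download_if_missing:
            raise IOError("%s not found in local cache" % name)
        cache[name] = [b'normal.']  # stand-in for an actual download
    return cache[name]


def check_dataset(cache):
    """Test body: skip when the data is unavailable, otherwise assert on it."""
    try:
        targets = load_cached_dataset(cache, 'kddcup99',
                                      download_if_missing=False)
    except IOError:
        raise unittest.SkipTest("kddcup99 dataset can not be loaded.")
    assert any(t == b'normal.' for t in targets)
    return "ran"


def outcome(cache):
    """Report whether the check ran or was skipped (what a test runner would see)."""
    try:
        return check_dataset(cache)
    except unittest.SkipTest:
        return "skipped"


print(outcome({}))                          # empty cache, as on a fresh CI worker
print(outcome({'kddcup99': [b'normal.']}))  # data already present locally
```

On a worker with an empty cache the assertion is never reached, so the test only exercises the fix on machines where the dataset has already been fetched.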

@ngoix
Contributor Author

ngoix commented Sep 13, 2017

I removed the SkipTest, and now the coverage tests are passing.

@lesteve
Member

lesteve commented Sep 13, 2017

> I removed the SkipTest, and now the coverage tests are passing.

The CIs were failing because of coverage issues. My understanding is that we do not want to download datasets on Travis during the tests.

Edit: just to be clear I was suggesting reverting your last commit.

@jnothman
Member

jnothman commented Sep 13, 2017 via email

@lesteve
Member

lesteve commented Sep 13, 2017

> I'd agree that we don't usually want to download them in Travis, and poor
> coverage here is acceptable, but it would also be nice if we had a Travis
> instance that did test this sort of thing.

I am not sure how this would work on Travis. When I was working on testing datasets on figshare, it was taking quite a while to download all the datasets from scratch (maybe 30-40 minutes off the top of my head from a good university network) and maybe we do not have the time to do it within a Travis build.

A possible work-around is to run the datasets tests on CircleCI where some of the datasets are downloaded/cached already.

We chatted about something a bit related with @ogrisel. For some tests, it would be nice to run them once in a while but not on each PR. The idea was to set up a separate repo in the scikit-learn organization and use daily cron jobs in Travis. Amongst the things we thought of:

  • tests on OSX (apparently it takes too much time to spin up an OSX VM on Travis)
  • tests on numpy-dev and scipy-dev (failures, if any, are much more likely to be linked to a change in numpy than to the PR changes)
  • maybe datasets download, if we have enough time to do it in a Travis build.

Note that neither the CircleCI nor the Travis cron job plays nicely with the coverage ...

@ngoix
Contributor Author

ngoix commented Sep 13, 2017

commit reverted!

@lesteve
Member

lesteve commented Sep 13, 2017

Merging, thanks a lot @ngoix.

@lesteve lesteve merged commit 7e3ad6d into scikit-learn:master Sep 13, 2017
massich pushed a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Development

Successfully merging this pull request may close these issues.

fetch_kddcup99 does not shuffle data

5 participants