[MRG] Enforce n_folds >= 2 for k-fold cross-validation #2054
Conversation
LGTM

Thanks @jnothman. Any second reviewer? @GaelVaroquaux @larsmans @mblondel? If nobody has an objection I will merge tonight.
```diff
@@ -366,6 +372,13 @@ def __init__(self, y, n_folds=3, indices=True, k=None):
         _validate_kfold(n_folds, n)
         _, y_sorted = unique(y, return_inverse=True)
         min_labels = np.min(np.bincount(y_sorted))
         n_folds = int(n_folds)
         if n_folds < 2:
```
The error message can be factored out to a private method or function for maintainability. Maybe the check too.
Otherwise, 👍 for merge.
Ha! We didn't even notice there is already a helper function `_validate_kfold` that tests for `k <= 0`! And then `KFold` goes on to do more validation that equally applies to `StratifiedKFold`.

This is the problem of looking at diffs. No longer LGTM for merge :s
(And I would vote for making an ABC to cover both cases, but `KFold` extracts contiguous slices while stratified uses `i::n_folds`. The latter would work for both, but I assume we need to ensure backwards compatibility.)
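The two slicing strategies being compared can be sketched side by side (illustrative only, not the actual scikit-learn implementations):

```python
import numpy as np

def contiguous_folds(n, n_folds):
    # KFold-style: consecutive slices of floor(n / n_folds) samples each;
    # the entire remainder lands in the last fold.
    size = n // n_folds
    folds = [np.arange(i * size, (i + 1) * size) for i in range(n_folds - 1)]
    folds.append(np.arange((n_folds - 1) * size, n))
    return folds

def strided_folds(n, n_folds):
    # StratifiedKFold-style i::n_folds striding: fold sizes differ by
    # at most one, so the remainder is spread evenly.
    return [np.arange(i, n, n_folds) for i in range(n_folds)]
```

With `n=29, n_folds=10` the contiguous variant yields nine folds of 2 and one fold of 11, while the strided variant yields nine folds of 3 and one fold of 2.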
I'm sure this PR isn't the right place for it, but is the following not bad behaviour of `KFold`?

```python
>>> for n in range(20, 30):
...     print('%d samples in 10 folds' % n, [len(test) for train, test in KFold(n, 10)])
...
20 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
21 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 3]
22 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 4]
23 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 5]
24 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 6]
25 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 7]
26 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 8]
27 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 9]
28 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 10]
29 samples in 10 folds [2, 2, 2, 2, 2, 2, 2, 2, 2, 11]
```

The folds here are very unbalanced! Of course one should ideally have
Ooh! I had something for this...

```python
def _fair_array_counts(n_samples, n_classes, random_state=None):
    """Tries to fairly partition n_samples between n_classes.

    If this cannot be done fairly, +1 is added `remainder` times
    to the counts for random arrays until a total of `n_samples` is
    reached.

    >>> _fair_array_counts(5, 3, random_state=43)
    array([2, 1, 2])
    """
    if n_classes > n_samples:
        raise ValueError("The number of classes is greater"
                         " than the number of samples requested")
    sample_size = n_samples // n_classes
    sample_size_rem = n_samples % n_classes
    counts = np.repeat(sample_size, n_classes)
    if sample_size_rem > 0:
        counts[:sample_size_rem] += 1
    # Shuffle so the class imbalance varies between runs
    random_state = check_random_state(random_state)
    random_state.shuffle(counts)
    return counts
```

```python
>>> for n in range(20, 31):
...     print ml._fair_array_counts(n, 10, random_state=3)
[2 2 2 2 2 2 2 2 2 2]
[2 2 2 2 2 2 2 3 2 2]
[2 2 3 2 2 2 2 3 2 2]
[2 2 3 3 2 2 2 3 2 2]
[2 2 3 3 2 2 2 3 3 2]
[2 3 3 3 2 2 2 3 3 2]
[3 3 3 3 2 2 2 3 3 2]
[3 3 3 3 2 3 2 3 3 2]
[3 3 3 3 2 3 3 3 3 2]
[3 3 3 3 2 3 3 3 3 3]
[3 3 3 3 3 3 3 3 3 3]
```
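The balancing guarantee of the helper above can be checked with a reduced deterministic version (`fair_counts` below is a hypothetical sketch of its core, with the shuffle step omitted): counts always sum to `n_samples` and differ by at most one.

```python
import numpy as np

def fair_counts(n_samples, n_classes):
    # Base size for every class, then +1 for the first `remainder` classes.
    counts = np.repeat(n_samples // n_classes, n_classes)
    counts[: n_samples % n_classes] += 1
    return counts

# Every total between 20 and 30 splits into 10 near-equal parts.
for n in range(20, 31):
    c = fair_counts(n, 10)
    assert c.sum() == n and c.max() - c.min() <= 1
```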
What is the benefit of distributing the remainder randomly?

It supports randomness because I wrote it for a resample function PR.
I addressed all the comments about the initial scope of this PR (I think). The unfair distribution of the folds should probably be addressed in another PR, I guess.

Test failure...

Yes, I don't think renaming

I fixed the broken doctest. Unfortunately it's not possible to provide backward compat for positional arguments that are called as kwargs. If you really want I can revert the `n` to `n_samples` renaming.

If not, change it for the rest of the module!

Ok I reverted the `n_samples` renaming. I think this PR is in a fine state now.
```python
                "k-folds cross validation requires at least one"
                " train / test split by setting n_folds=2 or more,"
                " got n_folds=%d."
                % n_folds)
```
This should probably use `.format()` instead.
Yes, it seems there are a lot of `%` formats in the codebase. And last time @ogrisel tried otherwise, I had to change `{}` to `{0}`, `{1}` to make Jenkins (i.e. Py2.6) happy...
Yeah, many (I'm probably guilty of it myself as well in past PRs).
Annoying about the python 2.6 issue. Maybe this is an issue to bring up when we move to python 2.7 as the minimum version?
Ok I switched to the new-style format and rebased everything. We want to keep support for 2.6 till 2015 (end of the Ubuntu 10.04 LTS support) if it's not too cumbersome.

Are we switching to
I would not say it's mandatory. I tend to prefer it a bit over the old syntax as I find it a bit more explicit for Python newcomers, but I don't think the Python community is likely to deprecate the `%` notation any time soon.
Apparently they don't dare actually deprecate it. But fair enough, I'll start learning

It's also problematic to use

Shall we merge this PR?

+1
ENH Enforce n_folds >= 2 for k-fold cross-validation
Thanks!
Users might be confused if they set `n_folds=1` in `KFold` or `StratifiedKFold`, as they would get empty training sets that can often result in weird model fit error messages. This can happen if they naively use `cv=1` in `GridSearchCV`, for instance. See: #2048.