[MRG+1] Sparse One vs. Rest #3276
Conversation
Ping myself @arjoly |
This is in WIP mode; proper `csc_matrix` construction is still necessary.
I think this is getting to the point where it would benefit from a review, so I will mark it for merge. Please note that the changes specific to this pull request are located in `sklearn/multiclass.py` and `sklearn/tests/test_multiclass.py`.
For others' help, this comparison may be useful: https://github.com/hamsal/scikit-learn/compare/scikit-learn:sparse_labelbinarizer...sprs-ovr (or you could open a new PR that bases off scikit-learn's
```diff
@@ -361,3 +390,7 @@ def test_ecoc_gridsearch():
     cv.fit(iris.data, iris.target)
     best_C = cv.best_estimator_.estimators_[0].C
     assert_true(best_C in Cs)
```
please insert another blank line
This comment has been addressed
```python
pred = _predict_binary(e, X)
np.maximum(maxima, pred, out=maxima)
argmaxima[maxima == pred] = i
return np.array(argmaxima.T, dtype=label_binarizer.classes_.dtype)
```
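For illustration, here is a self-contained sketch of this argmax-over-binary-scores pattern. The `scores` array and `classes` names are hypothetical stand-ins for the per-estimator `_predict_binary` outputs and `label_binarizer.classes_`:

```python
import numpy as np

# Hypothetical decision scores: 4 samples x 3 binary estimators
scores = np.array([[0.2, 0.9, 0.1],
                   [0.8, 0.3, 0.4],
                   [0.1, 0.2, 0.7],
                   [0.5, 0.5, 0.6]])
classes = np.array(["ant", "bee", "cat"])  # plays label_binarizer.classes_

maxima = np.full(scores.shape[0], -np.inf)
argmaxima = np.zeros(scores.shape[0], dtype=int)
for i in range(scores.shape[1]):      # one pass per binary estimator
    pred = scores[:, i]
    np.maximum(maxima, pred, out=maxima)
    argmaxima[maxima == pred] = i     # remember which estimator won

labels = classes[argmaxima]           # map column indices back to labels
```

This shows why sampling from `classes_` matters: `argmaxima` holds column indices, and the final indexing step is what turns them into user-facing labels of the right dtype.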
Am I right to expect `argmaxima` to be an integer index and not a label from `classes_`?
I have revised this line to sample its labels from the `label_binarizer.classes_` list, and I wrote a test that uses string labels for the multiclass case.
Can you also add minimal support for `decision_function`, `predict_proba` and `predict_log_proba`? While this might lead to a big dense matrix, it is still interesting for a user to see the probability/decision score/log-probability associated with a few samples.
I am marking this as WIP while I work on the above revisions.
@arjoly Is it not the case that sparse support for these functions would only take up more space for the matrix representation? If a user wants to see the output from these functions after training with sparse target data, they would not need to make any changes to how they use them. Is there any benefit to converting the output to a sparse matrix?
```python
    return estimators, lb


def get_col_(Y, i):
```
This does not follow naming conventions... perhaps you meant `_get_col`.
Apparently, I wasn't clear. You might want to fit with sparse output. Here, it means adding some appropriate tests.
```python
    for i in range(Y.shape[1]))
if sp.issparse(Y):
    Y = Y.tocsc()
    columns = [_get_col(Y, i) for i in range(Y.shape[1])]
```
This will densify the entire matrix in memory, which is the opposite of what's wanted. Use `(_get_col(Y, i) for i in range(Y.shape[1]))` (no surrounding `[]`).
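A small sketch of the memory difference under discussion (the `_get_col` helper here is a hypothetical stand-in mirroring the one in the diff): the list comprehension materialises every dense column up front, while the generator densifies one column at a time as it is consumed.

```python
import numpy as np
import scipy.sparse as sp

def _get_col(Y, i):
    # densify a single column of a sparse matrix
    return np.ravel(Y.getcol(i).toarray())

Y = sp.csc_matrix(np.eye(4))

# List comprehension: all dense columns live in memory at once
dense_all = [_get_col(Y, i) for i in range(Y.shape[1])]

# Generator expression: columns are produced lazily, one per iteration
lazy_cols = (_get_col(Y, i) for i in range(Y.shape[1]))
first = next(lazy_cols)  # only this column has been densified so far
```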
Thank you! I have updated this to use the generator expression.
This is an option of the `Parallel` function of joblib.
@arjoly I'm familiar with
Oops, have I misunderstood something? In my current understanding, this parameter unrolls the generator.
I believe I have addressed all of the above comments. The results of Vlad's benchmark are a little confusing, but I think along with my benchmark it validates that
```python
Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
estimators = Parallel(n_jobs=n_jobs)(delayed(_fit_binary)
```
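The column-extraction idiom in this hunk can be sketched in isolation. Transposing a CSC matrix gives CSR, so iterating over `Y.T` walks the former columns row-by-row; each iterate is the 0/1 target vector for one one-vs-rest binary problem (the data here is a made-up toy target):

```python
import numpy as np
import scipy.sparse as sp

y = np.array([0, 1, 2, 1])
Y = sp.csc_matrix(np.eye(3)[y])   # one-hot indicator matrix, sketch

# Each iterate of Y.T is one former column: the binary target
# vector handed to a single one-vs-rest estimator
columns = (col.toarray().ravel() for col in Y.T)
targets = list(columns)
```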
Please add a comment (or a sentence in the docs) stating that `n_jobs > 1` can be slower than `n_jobs == 1` when the individual binary classifiers are very fast to fit (as is the case in @arjoly's benchmark). You can add a comment in the source referencing this joblib issue.
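A toy sketch of the overhead point, assuming joblib is installed (`fit_one` is a hypothetical stand-in for a very fast binary fit): with tasks this cheap, `n_jobs=1` runs sequentially and skips worker dispatch entirely, while `n_jobs > 1` would pay start-up and dispatch costs that dwarf the work itself.

```python
from joblib import Parallel, delayed

def fit_one(i):
    # stand-in for fitting one very fast binary classifier
    return i * 2

# For trivially fast fits, n_jobs=1 has no dispatch overhead;
# n_jobs > 1 only pays off when each individual fit is expensive
results = Parallel(n_jobs=1)(delayed(fit_one)(i) for i in range(5))
```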
Raise a ValueError in the label binarizer when dealing with multioutput target data; test that a ValueError is raised in OvR for multioutput target data; test a binary classification task with OvR.
```
    Multi-class targets. An indicator matrix turns on multilabel
    classification.
y : {array-like, sparse matrix}, shape = [n_samples] or
    [n_samples, n_classes] Multi-class targets. An indicator matrix
```
Could `y : {array-like, sparse matrix}, shape = [n_samples] or [n_samples, n_classes]` be one line?
This extends over the line limit
Does it render well in the doc?
I am working on building the doc; I am getting `ImportError: No module named sklearn.externals.six` after running `make html`. I am building scikit-learn with `python setup.py build_ext --inplace` and then adding the directory to my `PYTHONPATH`. I am still trying to figure out the problem.
What happens with `make doc`?
This rule is in the scikit-learn folder.
Ok, I have gotten it to start building. Apparently I had not run `python setup.py install` in the `scikit-learn` folder before.
It looks bad in the documentation: `[n_samples, n_classes]` ends up in the body. Maybe it is better to move the entire statement `shape = [n_samples] or [n_samples, n_classes]` into the body?
Re: moving `shape = [n_samples] or [n_samples, n_classes]` into the body: usually it is put in the header.
Although it is not entirely precise, another solution could be to shorten `shape = [n_samples] or [n_samples, n_classes]` to `shape = [n_samples, n_classes]`.
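For reference, a numpydoc-style sketch of what the shortened entry would look like (the wording here is an assumption, not the merged text):

```python
def fit(X, y):
    """Sketch of the shortened parameter description.

    Parameters
    ----------
    y : {array-like, sparse matrix}, shape = [n_samples, n_classes]
        Multi-class targets. An indicator matrix turns on multilabel
        classification.
    """
    return X, y
```

Keeping the shape in the header line and the prose in the indented body is what numpydoc renders cleanly.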
```python
Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
# In cases where indivdual estimators are very fast to train setting
```
Typo: individual
OK, I am 👍 for merge. Bravo to @hamsal and all the reviewers!!! This is a good job. If we don't get a -1, the next core dev that reviews this PR should merge it.
Merged by rebase!! Congratulations :-)
Thank you @hamsal!
Yes, that's an important contribution!
Thank you for the reviews!!!
@hamsal, I am writing a what's new entry for sparse multilabel support. I realise that no attribution has been made in the code, or in the commit log, to @rsivapr. Did you build upon his code? If so, it should at least be acknowledged in the changelog, and in the
Hello all, thanks for all your work on this! What was the final change? The documentation still seems to suggest a dense matrix as a legitimate way to read in y_train, but the memory error persists. https://scikit-learn.org/1.5/modules/multiclass.html I didn't quite follow everything above; should y_train be read into a CSR matrix instead?
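In released scikit-learn, passing the multilabel indicator target as a sparse CSR matrix does work end-to-end with `OneVsRestClassifier`. A minimal sketch, assuming scikit-learn is installed (the data here is random toy data):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.RandomState(0).rand(6, 3)
# Multilabel indicator target stored as CSR to keep memory low
Y = sp.csr_matrix([[1, 0], [0, 1], [1, 1],
                   [1, 0], [0, 1], [1, 1]])

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X[:2])   # predictions for the first two samples
```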
This is a PR for the other parts of PR #2458 concerning sparse output support for the OvR classifier. This branch was made off the sprs-lbl-bin branch from PR #3203 and will be rebased to remove all the sparse label binarizer additions from the diff once that pull request is merged.
Benchmark experiments using PR #2828 for data generation:

- `fit_ovr`: `sparse_output=True` vs. `sparse_output=sp.issparse(y)`
- `fit_ovr` with input as CSR matrix: cast to CSC before column extraction vs. leave as CSR
- `n_jobs != 1`
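The CSC-vs-CSR column-extraction comparison from the benchmark list can be sketched as follows (timings are illustrative only and omitted from any claims; assumes only NumPy and SciPy):

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
Y = sp.csr_matrix((rng.rand(2000, 50) < 0.02).astype(np.float64))

def extract_columns(M):
    # densify each column in turn, as the OvR fit does per binary problem
    return [M.getcol(i).toarray().ravel() for i in range(M.shape[1])]

t0 = time.time(); cols_csc = extract_columns(Y.tocsc()); t_csc = time.time() - t0
t0 = time.time(); cols_csr = extract_columns(Y); t_csr = time.time() - t0
# Column slicing is cheap on CSC storage and comparatively slow on CSR,
# which is the motivation for the tocsc() cast before extraction
```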