
[MRG+1] Sparse One vs. Rest #3276


Closed

wants to merge 54 commits into from

Conversation

hamsal
Contributor

@hamsal hamsal commented Jun 13, 2014

This is a PR for the remaining parts of PR #2458 concerning sparse output support for the OvR classifier. This branch was made off the sprs-lbl-bin branch from PR #3203 and will be rebased to remove all the sparse label binarizer additions from the diff once that pull request is merged.

Benchmark Experiments using PR #2828 for data generation:

  • fit_ovr: sparse_output=True vs. sparse_output=sp.issparse(y)
  • fit_ovr with input as CSR matrix: cast to CSC before column extraction vs. leave as CSR
  • Memory profiling for sparse vs. dense target data
  • Test the changes to label.py
  • Test probability and decision functions after fitting on sparse target data
  • Memory benchmark with n_jobs != 1
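For readers skimming the thread, a minimal sketch of the usage this PR targets: fitting OneVsRestClassifier directly on a sparse multilabel indicator matrix. The base estimator and data below are illustrative, not taken from the PR or its benchmarks.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.randn(20, 5)

# Multilabel indicator target, built so every column contains both classes.
Y_dense = np.zeros((20, 3), dtype=int)
Y_dense[::2, 0] = 1
Y_dense[1::3, 1] = 1
Y_dense[:5, 2] = 1
Y = sp.csr_matrix(Y_dense)  # the sparse target this PR adds support for

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)
print(pred.shape)  # one column per label: (20, 3)
```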

@hamsal hamsal changed the title Sparse One vs. Rest [WIP] Sparse One vs. Rest Jun 13, 2014
@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 0979dd0 on hamsal:sprs-ovr into 662fe00 on scikit-learn:master.

@arjoly
Member

arjoly commented Jun 14, 2014

Ping myself @arjoly

@hamsal
Contributor Author

hamsal commented Jun 19, 2014

This is in WIP mode; proper csc_matrix construction is necessary in predict_ovr.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 897a3ba on hamsal:sprs-ovr into 662fe00 on scikit-learn:master.

@coveralls

Coverage Status

Coverage increased (+0.02%) when pulling 10f23b1 on hamsal:sprs-ovr into 662fe00 on scikit-learn:master.

@hamsal
Contributor Author

hamsal commented Jun 19, 2014

I think this is getting to the point where it would benefit from a review, so I will mark it for merge. Please note that the changes specific to this pull request are located in sklearn/multiclass.py and sklearn/tests/test_multiclass.py.

@hamsal hamsal changed the title [WIP] Sparse One vs. Rest [MRG] Sparse One vs. Rest Jun 19, 2014
@jnothman
Member

For others' help, this comparison may be useful https://github.com/hamsal/scikit-learn/compare/scikit-learn:sparse_labelbinarizer...sprs-ovr (or you could open a new PR that bases off scikit-learn's sparse_labelbinarizer branch that I just created)

@@ -361,3 +390,7 @@ def test_ecoc_gridsearch():
cv.fit(iris.data, iris.target)
best_C = cv.best_estimator_.estimators_[0].C
assert_true(best_C in Cs)

Member

please insert another blank line

Contributor Author

This comment has been addressed

@arjoly
Member

arjoly commented Jun 24, 2014

@hamsal Once you have taken into account the comment of @jnothman, can you rebase on top of master?

pred = _predict_binary(e, X)
np.maximum(maxima, pred, out=maxima)
argmaxima[maxima == pred] = i
return np.array(argmaxima.T, dtype=label_binarizer.classes_.dtype)
Member

Am I right to expect argmaxima to be an integer and not a label from classes?

Contributor Author

I have revised this line to sample its labels from the label_binarizer.classes_ list and I wrote a test that uses string labels for the multiclass case.
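The fix discussed in this thread, mapping each sample's argmax index back through the binarizer's classes_ so that string labels survive, can be sketched as follows (the scores and label names are made up for illustration):

```python
import numpy as np

# Hypothetical per-class decision scores for 4 samples and 3 classes.
scores = np.array([[0.1, 0.9, 0.0],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.2, 0.2]])
classes = np.array(["ham", "spam", "eggs"])  # stands in for label_binarizer.classes_

# A plain argmax yields integer column indices...
idx = scores.argmax(axis=1)
# ...which must be mapped through classes_ to recover the actual labels.
labels = classes[idx]
print(labels.tolist())  # ['spam', 'ham', 'eggs', 'ham']
```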

@arjoly
Member

arjoly commented Jun 25, 2014

Can you also add minimal support for decision_function, predict_proba and predict_log_proba?

While this might lead to a big dense matrix, it's still interesting for a user to see the probability/decision_score/log_proba associated with a few samples.

@hamsal hamsal changed the title [MRG] Sparse One vs. Rest [WIP] Sparse One vs. Rest Jun 26, 2014
@hamsal
Contributor Author

hamsal commented Jun 26, 2014

I am marking this a WIP while I work on the above revisions

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 824933a on hamsal:sprs-ovr into 7a70e0f on scikit-learn:master.

@hamsal
Contributor Author

hamsal commented Jun 28, 2014

@arjoly Is it not the case that sparse support for these functions would only take up more space for the matrix representation? If a user wants to see the output from these functions after training with sparse target data, they would not need to make any changes to how they use these functions. Is there any benefit to converting the output to a sparse matrix?

return estimators, lb


def get_col_(Y, i):
Member

This does not follow naming conventions... perhaps you meant _get_col

@arjoly
Member

arjoly commented Jun 29, 2014

@arjoly Is it not the case that sparse support for these functions would only take up more space for the matrix representation?

True, in most cases this will lead to a dense representation.

If a user wants to see the output from these functions after training with sparse target data, they would not need to make any changes to how they use these functions. Is there any benefit to converting the output to a sparse matrix?

Apparently, I wasn't clear. You might want to fit with sparse output y and get the probability. So something like this should be possible

# We have X_train, y_sparse, X_test
estimator = OneVsRestClassifier(...)
estimator.fit(X_train, y_sparse)
y_proba = estimator.predict_proba(X_test)

Here, it means to add some appropriate tests.
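A runnable version of the sketch above might look like this (estimator choice and placeholder data are illustrative, not from the PR):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X_train, X_test = rng.randn(30, 4), rng.randn(10, 4)

# Sparse multilabel indicator target; each column contains both classes.
y_dense = np.zeros((30, 2), dtype=int)
y_dense[::2, 0] = 1
y_dense[::3, 1] = 1
y_sparse = sp.csr_matrix(y_dense)

estimator = OneVsRestClassifier(LogisticRegression())
estimator.fit(X_train, y_sparse)
# predict_proba returns a dense array even when fit on sparse targets.
y_proba = estimator.predict_proba(X_test)
print(y_proba.shape)  # (10, 2)
```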

@coveralls

Coverage Status

Coverage decreased (-0.05%) when pulling 1af7ccf on hamsal:sprs-ovr into 7a70e0f on scikit-learn:master.

for i in range(Y.shape[1]))
if sp.issparse(Y):
Y = Y.tocsc()
columns = [_get_col(Y, i) for i in range(Y.shape[1])]
Member

This will densify the entire matrix in memory, which is the opposite of what's wanted.

Use (_get_col(Y, i) for i in range(Y.shape[1])) (no surrounding [])

Contributor Author

Thank you! I have updated this to use the generator expression.
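The memory point in this thread can be illustrated standalone: the list comprehension materialises every dense column at once, while the generator expression yields one dense column at a time. The `_get_col` helper below is a hypothetical stand-in mirroring the one in the diff.

```python
import numpy as np
import scipy.sparse as sp

Y = sp.csc_matrix(np.eye(4, dtype=int))

def _get_col(Y, i):
    # Hypothetical helper: extract one column of a CSC matrix as a dense 1-d array.
    return Y[:, i].toarray().ravel()

# List comprehension: every column is densified up front.
as_list = [_get_col(Y, i) for i in range(Y.shape[1])]

# Generator expression: each column is densified only when consumed.
as_gen = (_get_col(Y, i) for i in range(Y.shape[1]))
first = next(as_gen)  # only this one column has been made dense so far
print(first.tolist())  # [1, 0, 0, 0]
```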

@arjoly
Member

arjoly commented Jul 15, 2014

Hmm, sorry, should I have set pre_dispatch to something in particular as
well? I can just avoid the blame by saying that the script wasn't
self-contained enough :P

This is an option of the Parallel function of joblib.
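For context, pre_dispatch controls how many tasks Parallel pulls off the input iterable ahead of the workers, which is what bounds memory when the iterable is a generator of dense columns. A minimal illustration (the function and values are made up; '2*n_jobs' is joblib's default):

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# pre_dispatch limits how many tasks are consumed from the (possibly
# memory-hungry) generator ahead of the workers.
tasks = (delayed(square)(i) for i in range(8))
results = Parallel(n_jobs=2, pre_dispatch="2*n_jobs")(tasks)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```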

@vene
Member

vene commented Jul 15, 2014

@arjoly I'm familiar with pre_dispatch, however I didn't follow the discussion in this PR. Is it expected to use more memory with the default value?

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling a3b909a on hamsal:sprs-ovr into f7e9527 on scikit-learn:master.

@arjoly
Member

arjoly commented Jul 15, 2014

Oops, have I misunderstood something? In my current understanding, this parameter unrolls the generator columns = (col.toarray().ravel() for col in Y.T) with its default value.

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 911bff2 on hamsal:sprs-ovr into f7e9527 on scikit-learn:master.

@hamsal
Contributor Author

hamsal commented Jul 15, 2014

I believe I have addressed all of the above comments. The results of Vlad's benchmark are a little confusing, but together with my benchmark I think they validate that n_jobs=-1 does not hurt memory consumption.

Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
estimators = Parallel(n_jobs=n_jobs)(delayed(_fit_binary)
Member

Please add a comment (or a sentence in the docs) stating that n_jobs > 1 can be slower than n_jobs == 1 when the individual binary classifiers are very fast to fit (as is the case in @arjoly's benchmark).

You can add a comment in the source referencing this joblib issue.

hamsal added 2 commits July 16, 2014 10:09
Raise a ValueError in the label binarizer when dealing with multioutput target data;
test that a ValueError is raised in ovr for multioutput target data.

Test a binary classification task with ovr
Multi-class targets. An indicator matrix turns on multilabel
classification.
y : {array-like, sparse matrix}, shape = [n_samples] or
[n_samples, n_classes] Multi-class targets. An indicator matrix
Member

y : {array-like, sparse matrix}, shape = [n_samples] or [n_samples, n_classes]

Could it be one line?

Contributor Author

This extends past the line-length limit.

Member

Does it render well in the doc?

Contributor Author

I am working on building the docs, but I am getting ImportError: No module named sklearn.externals.six after running make html. I build scikit-learn with python setup.py build_ext --inplace and then add the directory to my PYTHONPATH. I am still trying to figure out the problem.

Member

What happens with make doc?

Member

This rule is in the scikit-learn folder.

Contributor Author

OK, I have gotten it to start building. Apparently I had not run python setup.py install in the scikit-learn folder before.

Contributor Author

It looks bad in the documentation: [n_samples, n_classes] ends up in the body. Maybe it is better to move the entire statement shape = [n_samples] or [n_samples, n_classes] into the body?

Member

shape = [n_samples] or [n_samples, n_classes] into the body?

Usually it is put in the header.

Contributor Author

Although it is not entirely precise, another solution could be to shorten shape = [n_samples] or [n_samples, n_classes] to shape = [n_samples, n_classes].

Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
# In cases where indivdual estimators are very fast to train setting
Member

Typo: individual

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling 855fdd1 on hamsal:sprs-ovr into f7e9527 on scikit-learn:master.

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Sparse One vs. Rest [MRG+1] Sparse One vs. Rest Jul 17, 2014
@GaelVaroquaux
Member

OK, I am 👍 for merge. Bravo to @hamsal and all the reviewers!!! This is a good job.

If we don't get a -1, next core dev that reviews this PR should merge it.

@arjoly
Member

arjoly commented Jul 18, 2014

Merged by rebase!!

Congratulations :-)

@arjoly arjoly closed this Jul 18, 2014
@vene
Member

vene commented Jul 18, 2014

Thank you @hamsal!

@GaelVaroquaux
Member

Thank you @hamsal!

Yes, that's an important contribution!

@hamsal
Contributor Author

hamsal commented Jul 18, 2014

Thank you for the reviews!!!

@jnothman
Member

@hamsal, I am writing a what's new entry for sparse multilabel support. I realise that no attribution has been made in the code, or in the commit log to @rsivapr. Did you build upon his code? If so, it should at least be acknowledged in the changelog, and in the .py files if you used his code directly. Could you please clarify the situation?

@ua-chjb

ua-chjb commented Nov 23, 2024

Hello all, thanks for all your work on this! What was the final change? The documentation still seems to suggest a dense matrix as a legitimate way to read in y_train, but the memory error persists.

https://scikit-learn.org/1.5/modules/multiclass.html

I didn't quite follow everything above, should y_train be read into a CSR matrix instead?
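For later readers of this question: one way to exercise the sparse path discussed in this PR is to convert the multilabel indicator target to a scipy CSR matrix before fitting. This is a sketch under assumed placeholder data, not an official recommendation from the maintainers.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Dense multilabel indicator target (placeholder data): 100 samples, 5 labels.
y_train = np.zeros((100, 5), dtype=int)
y_train[np.arange(100), np.arange(100) % 5] = 1

# Converting to CSR keeps the target sparse in memory during fit.
y_train_sparse = sp.csr_matrix(y_train)
X_train = np.random.RandomState(0).randn(100, 8)

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train_sparse)
print(len(clf.estimators_))  # one binary estimator per label: 5
```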
