[MRG+1] Sparse One vs. Rest #3276
Conversation
Ping myself @arjoly |
This is in WIP mode; proper `csc_matrix` construction is still necessary.
I think this is getting to the point where it would benefit from a review, so I will mark it for merge. Please note that the changes specific to this pull request are located in `sklearn/multiclass.py` and `sklearn/tests/test_multiclass.py`.
For others' help, this comparison may be useful: https://github.com/hamsal/scikit-learn/compare/scikit-learn:sparse_labelbinarizer...sprs-ovr (or you could open a new PR that bases off scikit-learn's
```diff
@@ -361,3 +390,7 @@ def test_ecoc_gridsearch():
     cv.fit(iris.data, iris.target)
     best_C = cv.best_estimator_.estimators_[0].C
     assert_true(best_C in Cs)
```
please insert another blank line
This comment has been addressed
```python
pred = _predict_binary(e, X)
np.maximum(maxima, pred, out=maxima)
argmaxima[maxima == pred] = i
return np.array(argmaxima.T, dtype=label_binarizer.classes_.dtype)
```
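For illustration, here is a self-contained sketch of this argmax-over-binary-scores pattern. The `scores` array and `classes` names are hypothetical stand-ins for the per-estimator `_predict_binary` outputs and `label_binarizer.classes_`:

```python
import numpy as np

# Hypothetical decision scores: 4 samples x 3 binary estimators
scores = np.array([[0.2, 0.9, 0.1],
                   [0.8, 0.3, 0.4],
                   [0.1, 0.2, 0.7],
                   [0.5, 0.5, 0.6]])
classes = np.array(["ant", "bee", "cat"])  # plays label_binarizer.classes_

maxima = np.full(scores.shape[0], -np.inf)
argmaxima = np.zeros(scores.shape[0], dtype=int)
for i in range(scores.shape[1]):      # one pass per binary estimator
    pred = scores[:, i]
    np.maximum(maxima, pred, out=maxima)
    argmaxima[maxima == pred] = i     # remember which estimator won

labels = classes[argmaxima]           # map column indices back to labels
```

This shows why sampling from `classes_` matters: `argmaxima` holds column indices, and the final indexing step is what turns them into user-facing labels of the right dtype.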
Am I right to expect `argmaxima` to be an integer index and not a label from `classes_`?
I have revised this line to sample its labels from the `label_binarizer.classes_` list, and I wrote a test that uses string labels for the multiclass case.
Can you also add minimal support for `decision_function`, `predict_proba` and `predict_log_proba`? While this might lead to a big dense matrix, it is still interesting for a user to see the probability/decision score/log-probability associated with a few samples.
I am marking this as WIP while I work on the above revisions.
@arjoly Is it not the case that sparse support for these functions would only take up more space for the matrix representation? If a user wants to see the output from these functions after training with sparse target data, they would not need to make any changes to how they use them. Is there any benefit to converting the output to a sparse matrix?
```python
    return estimators, lb


def get_col_(Y, i):
```
This does not follow naming conventions... perhaps you meant `_get_col`.
Apparently, I wasn't clear. You might want to fit with sparse output. Here, it means adding some appropriate tests.
```python
    for i in range(Y.shape[1]))
if sp.issparse(Y):
    Y = Y.tocsc()
    columns = [_get_col(Y, i) for i in range(Y.shape[1])]
```
This will densify the entire matrix in memory, which is the opposite of what's wanted. Use `(_get_col(Y, i) for i in range(Y.shape[1]))` (no surrounding `[]`).
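A small sketch of the memory difference under discussion (the `_get_col` helper here is a hypothetical stand-in mirroring the one in the diff): the list comprehension materialises every dense column up front, while the generator densifies one column at a time as it is consumed.

```python
import numpy as np
import scipy.sparse as sp

def _get_col(Y, i):
    # densify a single column of a sparse matrix
    return np.ravel(Y.getcol(i).toarray())

Y = sp.csc_matrix(np.eye(4))

# List comprehension: all dense columns live in memory at once
dense_all = [_get_col(Y, i) for i in range(Y.shape[1])]

# Generator expression: columns are produced lazily, one per iteration
lazy_cols = (_get_col(Y, i) for i in range(Y.shape[1]))
first = next(lazy_cols)  # only this column has been densified so far
```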
Thank you! I have updated this to use the generator expression.
This is an option of the `Parallel` function of joblib.
@arjoly I'm familiar with
Oops, have I misunderstood something? In my current understanding, this parameter unrolls the generator.
I believe I have addressed all of the above comments. The results of Vlad's benchmark are a little confusing, but I think along with my benchmark it validates that
```python
Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
estimators = Parallel(n_jobs=n_jobs)(delayed(_fit_binary)
```
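The column-extraction idiom in this hunk can be sketched in isolation. Transposing a CSC matrix gives CSR, so iterating over `Y.T` walks the former columns row-by-row; each iterate is the 0/1 target vector for one one-vs-rest binary problem (the data here is a made-up toy target):

```python
import numpy as np
import scipy.sparse as sp

y = np.array([0, 1, 2, 1])
Y = sp.csc_matrix(np.eye(3)[y])   # one-hot indicator matrix, sketch

# Each iterate of Y.T is one former column: the binary target
# vector handed to a single one-vs-rest estimator
columns = (col.toarray().ravel() for col in Y.T)
targets = list(columns)
```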
Please add a comment (or a sentence in the docs) stating that `n_jobs > 1` can be slower than `n_jobs == 1` when the individual binary classifiers are very fast to fit (as is the case in @arjoly's benchmark). You can add a comment in the source referencing this joblib issue.
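A toy sketch of the overhead point, assuming joblib is installed (`fit_one` is a hypothetical stand-in for a very fast binary fit): with tasks this cheap, `n_jobs=1` runs sequentially and skips worker dispatch entirely, while `n_jobs > 1` would pay start-up and dispatch costs that dwarf the work itself.

```python
from joblib import Parallel, delayed

def fit_one(i):
    # stand-in for fitting one very fast binary classifier
    return i * 2

# For trivially fast fits, n_jobs=1 has no dispatch overhead;
# n_jobs > 1 only pays off when each individual fit is expensive
results = Parallel(n_jobs=1)(delayed(fit_one)(i) for i in range(5))
```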
Raise a ValueError in the label binarizer when dealing with multioutput target data; test that a ValueError is raised in OvR for multioutput target data; test a binary classification task with OvR.
```
    Multi-class targets. An indicator matrix turns on multilabel
    classification.
y : {array-like, sparse matrix}, shape = [n_samples] or
    [n_samples, n_classes] Multi-class targets. An indicator matrix
```
Could `y : {array-like, sparse matrix}, shape = [n_samples] or [n_samples, n_classes]` be one line?
This extends over the line limit
Does it render well in the doc?
I am working on building the doc; I am getting `ImportError: No module named sklearn.externals.six` after running `make html`. I am building scikit-learn with `python setup.py build_ext --inplace` and then adding the directory to my `PYTHONPATH`. I am still trying to figure out the problem.
What happens with `make doc`?
This rule is in the scikit-learn folder.
Ok, I have gotten it to start building. Apparently I had not run `python setup.py install` in the `scikit-learn` folder before.
It looks bad in the documentation: `[n_samples, n_classes]` ends up in the body. Maybe it is better to move the entire statement `shape = [n_samples] or [n_samples, n_classes]` into the body?
Re: moving `shape = [n_samples] or [n_samples, n_classes]` into the body: usually it is put in the header.
Although it is not entirely precise, another solution could be to shorten `shape = [n_samples] or [n_samples, n_classes]` to `shape = [n_samples, n_classes]`.
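For reference, a numpydoc-style sketch of what the shortened entry would look like (the wording here is an assumption, not the merged text):

```python
def fit(X, y):
    """Sketch of the shortened parameter description.

    Parameters
    ----------
    y : {array-like, sparse matrix}, shape = [n_samples, n_classes]
        Multi-class targets. An indicator matrix turns on multilabel
        classification.
    """
    return X, y
```

Keeping the shape in the header line and the prose in the indented body is what numpydoc renders cleanly.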
```python
Y = lb.fit_transform(y)
Y = Y.tocsc()
columns = (col.toarray().ravel() for col in Y.T)
# In cases where indivdual estimators are very fast to train setting
```
Typo: individual
OK, I am 👍 for merge. Bravo to @hamsal and all the reviewers!!! This is a good job. If we don't get a -1, the next core dev that reviews this PR should merge it.
Merged by rebase!! Congratulations :-)
Thank you @hamsal!
Yes, that's an important contribution!
Thank you for the reviews!!!
@hamsal, I am writing a what's new entry for sparse multilabel support. I realise that no attribution has been made in the code, or in the commit log, to @rsivapr. Did you build upon his code? If so, it should at least be acknowledged in the changelog, and in the
Hello all, thanks for all your work on this! What was the final change? The documentation still seems to suggest a dense matrix as a legitimate way to read in y_train, but the memory error persists. https://scikit-learn.org/1.5/modules/multiclass.html I didn't quite follow everything above; should y_train be read into a CSR matrix instead?
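In released scikit-learn, passing the multilabel indicator target as a sparse CSR matrix does work end-to-end with `OneVsRestClassifier`. A minimal sketch, assuming scikit-learn is installed (the data here is random toy data):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.RandomState(0).rand(6, 3)
# Multilabel indicator target stored as CSR to keep memory low
Y = sp.csr_matrix([[1, 0], [0, 1], [1, 1],
                   [1, 0], [0, 1], [1, 1]])

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X[:2])   # predictions for the first two samples
```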
This is a PR for the other parts of PR #2458 concerning sparse output support for the OvR classifier. This branch was made off the sprs-lbl-bin branch from PR #3203 and will be rebased to remove all the sparse label binarizer additions from the diff once that pull request is merged.
Benchmark experiments using PR #2828 for data generation:

- `fit_ovr`: `sparse_output=True` vs. `sparse_output=sp.issparse(y)`
- `fit_ovr` with input as CSR matrix: cast to CSC before column extraction vs. leave as CSR
- `n_jobs != 1`
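The CSC-vs-CSR column-extraction comparison from the benchmark list can be sketched as follows (timings are illustrative only and omitted from any claims; assumes only NumPy and SciPy):

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
Y = sp.csr_matrix((rng.rand(2000, 50) < 0.02).astype(np.float64))

def extract_columns(M):
    # densify each column in turn, as the OvR fit does per binary problem
    return [M.getcol(i).toarray().ravel() for i in range(M.shape[1])]

t0 = time.time(); cols_csc = extract_columns(Y.tocsc()); t_csc = time.time() - t0
t0 = time.time(); cols_csr = extract_columns(Y); t_csr = time.time() - t0
# Column slicing is cheap on CSC storage and comparatively slow on CSR,
# which is the motivation for the tocsc() cast before extraction
```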