Cache class mapping in MultiLabelBinarizer() #12116
Conversation
@TomDLT, you mentioned previously in #11719 (comment). Could you please elaborate a bit?

also @qinhanmin2014 mentioned that there is no need to set attributes in
sklearn/preprocessing/label.py (outdated)

```python
if self.classes is None:
    classes = sorted(set(itertools.chain.from_iterable(y)))
if self._classes_cache is None:
    self._classes_cache = sorted(set(itertools.chain.from_iterable(y)))
```
I don't think we need this
removed
sklearn/preprocessing/label.py (outdated)

```python
class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
if self._indexed_classes_cache is None:
    self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))
class_to_index = self._indexed_classes_cache
```
No need for this local variable. Combine these two lines
done
sklearn/preprocessing/label.py (outdated)

```python
class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
if self._indexed_classes_cache is None:
    self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))
```
The cache needs to be invalidated if fit is called again. Why not just do this caching in fit?
Ah, so basically we should use the cached value only when self.classes_ is the same, i.e. if we've already computed that dict(zip(...)) mapping before?
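To sketch the pattern under discussion (a toy stand-in, not the actual sklearn class; the name TinyBinarizer and the exact method layout are illustrative): cache the class-to-index mapping lazily and invalidate it whenever fit is called again, so a refit can never reuse a stale mapping.

```python
import itertools


class TinyBinarizer:
    """Toy sketch of the caching idea: the class-to-index dict is
    built lazily on first use and reset at every fit."""

    def __init__(self, classes=None):
        self.classes = classes
        self._cached_dict = None

    def fit(self, y):
        # Refitting must invalidate the cache, otherwise transform
        # would keep using a mapping built for the old classes.
        self._cached_dict = None
        if self.classes is None:
            self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        else:
            self.classes_ = list(self.classes)
        return self

    def _build_cache(self):
        # Build the mapping once; later calls reuse it.
        if self._cached_dict is None:
            self._cached_dict = dict(zip(self.classes_,
                                         range(len(self.classes_))))
        return self._cached_dict

    def transform(self, y):
        class_to_index = self._build_cache()
        return [[class_to_index[c] for c in labels] for labels in y]
```

A quick usage check: fitting on `[(1, 2), (3,)]` maps class 2 to index 1, and refitting on new data resets the cache before the next transform.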
sklearn/preprocessing/label.py (outdated)

```python
self._prev_classes = self.classes_
self._cached_dict = None

if (self._prev_classes.__repr__() != self.classes_.__repr__()) \
```
This __repr__ comparison would miss subtle changes in the classes, wouldn't it?
I would suggest to just reset self._cached_dict at each fit and fit_transform.
You mean like this? 36dc1a4
Can you add a test in sklearn/preprocessing/tests/test_label.py, to check that the reset is correct?
You would need to fit an estimator, call transform, refit it with different classes, call transform a second time, and ensure that the second result differs from the first.
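A sketch of such a test, using the public MultiLabelBinarizer API (the actual test in test_label.py would follow that file's naming and assertion conventions):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fit with one class ordering and transform once.
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
mlb.fit([(1, 2, 3)])
first = mlb.transform([(3,)])   # column order is 1, 3, 2

# Refit the same estimator with a different ordering; any cached
# class-to-index mapping must be invalidated, so the column for
# class 3 moves.
mlb.set_params(classes=[1, 2, 3])
mlb.fit([(1, 2, 3)])
second = mlb.transform([(3,)])  # column order is 1, 2, 3

assert first.tolist() == [[0, 1, 0]]
assert second.tolist() == [[0, 0, 1]]
```

If the cache were not reset at fit, the second transform would wrongly reproduce the first result.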
Could you also measure the performance gain introduced by this change? Just compare the computational time of transform on a few examples, before and after the change.
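To get a rough sense of the gain, a micro-benchmark sketch (the class count and iteration numbers here are made up for illustration): compare rebuilding the dict(zip(...)) mapping on every call, as the old transform did, against reusing a prebuilt one.

```python
import timeit

classes = list(range(1000))  # pretend classes_ holds 1000 labels

# Old behaviour: the mapping is rebuilt on every transform call.
rebuild = timeit.timeit(
    lambda: dict(zip(classes, range(len(classes)))), number=10_000)

# Cached behaviour: the mapping is built once and only looked up.
mapping = dict(zip(classes, range(len(classes))))
lookup = timeit.timeit(lambda: mapping[999], number=10_000)

print(f"rebuild every call: {rebuild:.3f}s, cached lookup: {lookup:.3f}s")
```

Building a 1000-entry dict per call is orders of magnitude slower than a single dict lookup, which is the saving the cache buys on repeated transform calls.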
sklearn/preprocessing/label.py (outdated)

```python
return yt

def _cached_dict_for(self):
    if (self._cached_dict is None):
```
You don't need the parentheses.
removed
sklearn/preprocessing/label.py (outdated)

```python
def _cached_dict_for(self):
    if (self._cached_dict is None):
        self._cached_dict = dict(zip(self.classes_, \
```
The \ is not necessary, since you have an open parenthesis. You also need to align the arguments to comply with PEP 8:

```python
self._cached_dict = dict(zip(self.classes_,
                             range(len(self.classes_))))
```
removed
sklearn/preprocessing/label.py (outdated)

```python
return yt

def _cached_dict_for(self):
```
The method should be named as a verb, e.g. _build_cache.
renamed
Please add the requested test that fitting multiple times with different classes gets the right results.

Something like this? f387b8a
jnothman left a comment:
It's getting there :)
```python
# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
assert_array_equal(mlb.fit_transform(inp), indicator_mat)
#second call change class
```
PEP 8: space after #.
fixed
```python
[1, 0, 0],
[1, 0, 1]])

inp2 = [(1, 2), (1,), (2, 3)]
```
This has the same class mapping, so it's not useful for the test.
```python
[0, 1, 1]])

# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
```
It would be good to test with and without specifying classes. You can reset classes just by setting mlb.classes =
Not really sure any more 😅 any closer?

Please add an

Thanks @kiote!
Fixes #11680
What does this implement/fix? Explain your changes.
I've implemented a simple caching strategy to save previously calculated values in private attributes.
Any other comments?
There was a suggestion in my previous PR to add benchmarks as well, but I'd prefer to do that as a separate issue (I'm not really familiar with the benchmarking setup used here).