Cache class mapping in MultiLabelBinarizer() #12116
Conversation
@TomDLT, you mentioned previously in #11719 (comment). Could you please elaborate a bit?

also @qinhanmin2014 mentioned that there is no need to set attributes in
sklearn/preprocessing/label.py (outdated)

```python
if self.classes is None:
    classes = sorted(set(itertools.chain.from_iterable(y)))
if self._classes_cache is None:
    self._classes_cache = sorted(set(itertools.chain.from_iterable(y)))
```
I don't think we need this
removed
sklearn/preprocessing/label.py (outdated)

```python
class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
if self._indexed_classes_cache is None:
    self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))
class_to_index = self._indexed_classes_cache
```
No need for this local variable. Combine these two lines
done
sklearn/preprocessing/label.py (outdated)

```python
class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
if self._indexed_classes_cache is None:
    self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))
```
The cache needs to be invalidated if fit is called again. Why not just do this caching in fit?
Ah, so basically we should use the cached value only when self.classes_ is the same, i.e. if we've already computed that dict(zip(...)) mapping before?
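To sketch the pattern under discussion (a toy stand-in, not the actual sklearn class; the name TinyBinarizer and the exact method layout are illustrative): cache the class-to-index mapping lazily and invalidate it whenever fit is called again, so a refit can never reuse a stale mapping.

```python
import itertools


class TinyBinarizer:
    """Toy sketch of the caching idea: the class-to-index dict is
    built lazily on first use and reset at every fit."""

    def __init__(self, classes=None):
        self.classes = classes
        self._cached_dict = None

    def fit(self, y):
        # Refitting must invalidate the cache, otherwise transform
        # would keep using a mapping built for the old classes.
        self._cached_dict = None
        if self.classes is None:
            self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        else:
            self.classes_ = list(self.classes)
        return self

    def _build_cache(self):
        # Build the mapping once; later calls reuse it.
        if self._cached_dict is None:
            self._cached_dict = dict(zip(self.classes_,
                                         range(len(self.classes_))))
        return self._cached_dict

    def transform(self, y):
        class_to_index = self._build_cache()
        return [[class_to_index[c] for c in labels] for labels in y]
```

A quick usage check: fitting on `[(1, 2), (3,)]` maps class 2 to index 1, and refitting on new data resets the cache before the next transform.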
sklearn/preprocessing/label.py (outdated)

```python
self._prev_classes = self.classes_
self._cached_dict = None

if (self._prev_classes.__repr__() != self.classes_.__repr__()) \
```
This __repr__ comparison would miss subtle changes in the classes, wouldn't it?
I would suggest to just reset self._cached_dict at each fit and fit_transform.
You mean like this? 36dc1a4
Can you add a test in sklearn/preprocessing/tests/test_label.py, to check that the reset is correct?
You would need to fit an estimator, call transform, refit it with different classes, call transform a second time, and ensure that the second result differs from the first.
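A sketch of such a test, using the public MultiLabelBinarizer API (the actual test in test_label.py would follow that file's naming and assertion conventions):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fit with one class ordering and transform once.
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
mlb.fit([(1, 2, 3)])
first = mlb.transform([(3,)])   # column order is 1, 3, 2

# Refit the same estimator with a different ordering; any cached
# class-to-index mapping must be invalidated, so the column for
# class 3 moves.
mlb.set_params(classes=[1, 2, 3])
mlb.fit([(1, 2, 3)])
second = mlb.transform([(3,)])  # column order is 1, 2, 3

assert first.tolist() == [[0, 1, 0]]
assert second.tolist() == [[0, 0, 1]]
```

If the cache were not reset at fit, the second transform would wrongly reproduce the first result.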
Could you also measure the performance gain introduced by this change? Just compare the computational time of transform on a few examples, before and after the change.
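To get a rough sense of the gain, a micro-benchmark sketch (the class count and iteration numbers here are made up for illustration): compare rebuilding the dict(zip(...)) mapping on every call, as the old transform did, against reusing a prebuilt one.

```python
import timeit

classes = list(range(1000))  # pretend classes_ holds 1000 labels

# Old behaviour: the mapping is rebuilt on every transform call.
rebuild = timeit.timeit(
    lambda: dict(zip(classes, range(len(classes)))), number=10_000)

# Cached behaviour: the mapping is built once and only looked up.
mapping = dict(zip(classes, range(len(classes))))
lookup = timeit.timeit(lambda: mapping[999], number=10_000)

print(f"rebuild every call: {rebuild:.3f}s, cached lookup: {lookup:.3f}s")
```

Building a 1000-entry dict per call is orders of magnitude slower than a single dict lookup, which is the saving the cache buys on repeated transform calls.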
sklearn/preprocessing/label.py (outdated)

```python
return yt

def _cached_dict_for(self):
    if (self._cached_dict is None):
```
You don't need the parentheses.
removed
sklearn/preprocessing/label.py (outdated)

```python
def _cached_dict_for(self):
    if (self._cached_dict is None):
        self._cached_dict = dict(zip(self.classes_, \
```
The \ is not necessary, since you have an open parenthesis. You also need to align the arguments to comply with PEP 8:

```python
self._cached_dict = dict(zip(self.classes_,
                             range(len(self.classes_))))
```
removed
sklearn/preprocessing/label.py (outdated)

```python
return yt

def _cached_dict_for(self):
```
The method should be named as a verb, e.g. _build_cache.
renamed
Please add the requested test that fitting multiple times with different classes gets the right results.

Something like this? f387b8a
jnothman left a comment:
It's getting there :)
```python
# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
assert_array_equal(mlb.fit_transform(inp), indicator_mat)
#second call change class
```
PEP 8: space after #.
fixed
```python
[1, 0, 0],
[1, 0, 1]])

inp2 = [(1, 2), (1,), (2, 3)]
```
This has the same class mapping, so it's not useful for the test.
```python
[0, 1, 1]])

# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
```
It would be good to test with and without specifying classes. You can reset classes just by setting mlb.classes =
Not really sure any more 😅 any closer?

Please add an

Thanks @kiote!
Fixes #11680
What does this implement/fix? Explain your changes.
I've implemented a simple caching strategy to save previously calculated values in private attributes.
Any other comments?
There was a suggestion in my previous PR to add benchmarks as well, but I'd prefer to do that as a separate issue (I'm not really familiar with the benchmarking setup used here).