Conversation

@kiote (Contributor) commented Sep 20, 2018

Fixes #11680

What does this implement/fix? Explain your changes.

I've implemented a simple caching strategy to save previously calculated values in private attributes.
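
For readers following along, here is a minimal sketch of that idea, with hypothetical names and a deliberately stripped-down transform (this is not the diff under review): the class-to-index mapping that transform needs is computed at most once per fit and kept in a private attribute instead of being rebuilt on every call.

    import itertools

    class CachedMappingSketch:
        """Hypothetical, simplified illustration of the caching strategy."""

        def fit(self, y):
            self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
            self._cached_dict = None  # private cache, filled lazily
            return self

        def transform(self, y):
            # Build the class -> column index mapping only once per fit.
            if self._cached_dict is None:
                self._cached_dict = dict(zip(self.classes_,
                                             range(len(self.classes_))))
            class_to_index = self._cached_dict
            # Simplified output: sorted column indices per sample, not the
            # indicator matrix the real MultiLabelBinarizer produces.
            return [sorted(class_to_index[label] for label in labels)
                    for labels in y]

The only point here is the private-attribute reuse; everything else in the real estimator stays as it is.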

Any other comments?

There was a suggestion in my previous PR to add benchmarks as well, but I'd prefer to do that as a separate issue (I'm not really familiar with the benchmarking setup used here).

@kiote changed the title from "Add caching" to "Cache class mapping in MultiLabelBinarizer()" on Sep 20, 2018
@kiote (Contributor, Author) commented Sep 20, 2018

@TomDLT you mentioned previously in #11719 (comment) that self._indexed_classes needs to be reset when self.classes_ changes, but honestly I'm not really sure in what situations that happens, so it's hard to follow without knowing the details.

Could you please elaborate a bit?

@kiote (Contributor, Author) commented Sep 20, 2018

Also, @qinhanmin2014 mentioned in #11719 (comment) that there is no need to set attributes in __init__, but then how should the change be detected at all, if there is no reference to the previous value?

    if self.classes is None:
        classes = sorted(set(itertools.chain.from_iterable(y)))
    if self._classes_cache is None:
        self._classes_cache = sorted(set(itertools.chain.from_iterable(y)))

Member:

I don't think we need this

Contributor Author:

removed

    class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
    if self._indexed_classes_cache is None:
        self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))
    class_to_index = self._indexed_classes_cache

Member:

No need for this local variable. Combine these two lines

Contributor Author:

done


    class_to_index = dict(zip(self.classes_, range(len(self.classes_))))
    if self._indexed_classes_cache is None:
        self._indexed_classes_cache = dict(zip(self.classes_, range(len(self.classes_))))

Member:

The cache needs to be invalidated if fit is called again. Why not just do this caching in fit?
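
One way to read this suggestion (a hedged sketch with assumed attribute names, meant to sit on the estimator class, not code from this PR): build the mapping eagerly inside fit, so every refit necessarily rebuilds it and transform can never see a stale cache.

    import itertools

    def fit(self, y):
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        # Rebuilt on every fit, so it always matches self.classes_.
        self._class_to_index = dict(zip(self.classes_,
                                        range(len(self.classes_))))
        return self

    def transform(self, y):
        class_to_index = self._class_to_index  # reused, never rebuilt per call
        ...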

Contributor Author:

Ah, so basically we should take the cached value only if self.classes_ is the same, i.e. if we've already calculated that dict(zip(...)) thing before?

self._prev_classes = self.classes_
self._cached_dict = None

if (self._prev_classes.__repr__() != self.classes_.__repr__()) \

Member:

This __repr__ comparison would miss subtle changes in the classes, wouldn't it?
I would suggest to just reset self._cached_dict at each fit and fit_transform.
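
Spelled out, the suggested reset might look like this (a hedged sketch with assumed method bodies, meant to sit on the estimator class, not the PR's exact diff); the lazy rebuild itself stays in the helper discussed below.

    import itertools

    def fit(self, y):
        self._cached_dict = None  # drop any mapping left over from a previous fit
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        return self

    def fit_transform(self, y):
        self._cached_dict = None  # same reset here
        return self.fit(y).transform(y)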

Contributor Author:

you mean like that? 36dc1a4

@TomDLT (Member) left a comment:

Can you add a test in sklearn/preprocessing/tests/test_label.py to check that the reset is correct?
You would need to fit an estimator, call transform, refit it with different classes, call transform a second time, and ensure that the result of the second transform is different from the first one.

Could you also time the performance gain introduced by this change? Just compare the computational time of transform on a few examples, before and after this change.
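
A hedged sketch of what such a test and timing check might look like (the fixture values and the timeit harness below are illustrative assumptions, not part of this PR):

    import timeit

    import numpy as np
    from numpy.testing import assert_array_equal
    from sklearn.preprocessing import MultiLabelBinarizer

    def test_refit_with_different_classes_changes_result():
        mlb = MultiLabelBinarizer(classes=[1, 3, 2])
        first = mlb.fit_transform([(1, 2), (3,)])
        # Refit with a different class ordering; a correctly reset cache
        # means the columns of the output move accordingly.
        mlb.classes = [1, 2, 3]
        second = mlb.fit_transform([(1, 2), (3,)])
        assert not np.array_equal(first, second)
        assert_array_equal(mlb.classes_, [1, 2, 3])

    # Rough timing idea: call transform repeatedly on one fitted estimator and
    # compare the total time before and after the caching change.
    mlb = MultiLabelBinarizer().fit([(i, i + 1) for i in range(1000)])
    print(timeit.timeit(lambda: mlb.transform([(1, 2), (3, 4)]), number=1000))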

        return yt

    def _cached_dict_for(self):
        if (self._cached_dict is None):

Member:

You don't need the parentheses.

Contributor Author:

removed


    def _cached_dict_for(self):
        if (self._cached_dict is None):
            self._cached_dict = dict(zip(self.classes_, \

Member:

The \ is not necessary, since you have an open parenthesis.
You also need to align the arguments to comply with PEP8:

            self._cached_dict = dict(zip(self.classes_,
                                         range(len(self.classes_))))

Contributor Author:

removed


        return yt

    def _cached_dict_for(self):

Member:

the method should be named as a verb, e.g. _build_cache

Contributor Author:

renamed

@jnothman (Member):

Please add the requested test that fitting multiple times with different classes gets the right results

@kiote (Contributor, Author) commented Sep 28, 2018

something like this? f387b8a

@jnothman (Member) left a comment:

It's getting there :)

# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])
assert_array_equal(mlb.fit_transform(inp), indicator_mat)
#second call change class

Member:

PEP8: space after #

Contributor Author:

fixed

[1, 0, 0],
[1, 0, 1]])

inp2 = [(1, 2), (1,), (2, 3)]

Member:

This has the same class mapping, so it's not useful for the test.

[0, 1, 1]])

# first call
mlb = MultiLabelBinarizer(classes=[1, 3, 2])

Member:

It would be good to test with and without specifying classes. You can reset classes just by setting mlb.classes =
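
A small illustration of that suggestion (assumed input data, not the test in this PR): fit once with an explicit classes argument, then clear the parameter and refit so the classes are inferred from the data.

    from sklearn.preprocessing import MultiLabelBinarizer

    inp = [(1, 2), (3,)]

    # With classes specified: column order follows the given list [1, 3, 2].
    mlb = MultiLabelBinarizer(classes=[1, 3, 2])
    with_classes = mlb.fit_transform(inp)

    # Without classes: reset the parameter and refit, so the classes are
    # inferred (and sorted) from the data instead.
    mlb.classes = None
    without_classes = mlb.fit_transform(inp)

    print(mlb.classes_)      # inferred and sorted from inp: 1, 2, 3
    print(with_classes)      # columns ordered as [1, 3, 2]
    print(without_classes)   # columns ordered as [1, 2, 3]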

@kiote (Contributor, Author) commented Oct 2, 2018

Not really sure anymore 😅 any closer?

@jnothman (Member) commented Oct 3, 2018

Please add an |Efficiency| entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@jnothman merged commit 4e2e1fa into scikit-learn:master on Oct 8, 2018
@TomDLT (Member) commented Oct 8, 2018

Thanks @kiote !
