
Conversation

@deniederhut
Contributor

Lambda functions are not serializable with the stdlib pickle module.
This commit replaces the lambdas found in three text preprocessing
functions with private functions that chain a sequence of
preprocessing steps and can be partially applied where appropriate.

Reference Issues/PRs

Closes #12833

What does this implement/fix? Explain your changes.

Instead of composing functions with lambdas, the chains of
preprocessing steps are built inside single functions that can be
specialized with functools.partial.
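
In short, the closures are swapped for functools.partial over a module-level helper, which stdlib pickle can serialize by name. A minimal sketch of the pattern, using illustrative helper names rather than the exact scikit-learn internals:

```python
import pickle
from functools import partial


def _lowercase(doc):
    return doc.lower()


def _word_ngrams(doc):
    return doc.split()


def _analyze(doc, ngrams=None, preprocessor=None, decoder=None):
    # Apply whichever preprocessing steps were configured, in order.
    if decoder is not None:
        doc = decoder(doc)
    if preprocessor is not None:
        doc = preprocessor(doc)
    if ngrams is not None:
        doc = ngrams(doc)
    return doc


# Before: a lambda closing over the preprocessing steps; pickle.dumps() fails
# because lambdas are serialized by name and anonymous functions have none.
# analyzer = lambda doc: _word_ngrams(_lowercase(doc))

# After: a partial of a module-level function, serialized as a reference to
# the function plus its stored keyword arguments.
analyzer = partial(_analyze, ngrams=_word_ngrams, preprocessor=_lowercase)

restored = pickle.loads(pickle.dumps(analyzer))
assert restored("Some Text") == ["some", "text"]
```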

Member

@rth rth left a comment


Thanks for looking into it!

Could you please run benchmarks/bench_text_vectorizers.py before and after this PR and report results?

-            return lambda doc: self._char_wb_ngrams(
-                preprocess(self.decode(doc)))
+            return partial(_analyze, ngrams=self._char_wb_ngrams,
+                           preprocessor=preprocess, decoder=self.decode)

Side note: this is really a use-case for toolz.functoolz.compose a shame that we can't use it.
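
For reference, that composition would read roughly as follows with toolz (illustrative only: toolz is not a scikit-learn dependency, and the functions below are stand-ins for the real decode/preprocess/n-gram steps):

```python
from toolz.functoolz import compose  # third-party; not usable in scikit-learn itself


def decode(doc):
    # stand-in for the vectorizer's decode step
    return doc.decode("utf-8") if isinstance(doc, bytes) else doc


def preprocess(doc):
    # stand-in for lowercasing / accent stripping
    return doc.lower()


def char_wb_ngrams(doc):
    # stand-in for char_wb n-gram extraction (here: plain character 4-grams)
    return [doc[i:i + 4] for i in range(len(doc) - 3)]


# compose applies right-to-left: char_wb_ngrams(preprocess(decode(doc)))
analyzer = compose(char_wb_ngrams, preprocess, decode)
analyzer(b"Some Text")  # 4-grams of "some text"
```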

@deniederhut
Contributor Author

Sure thing!

Before

================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (1000 documents).
This benchmark runs in ~1 min ...

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.063 (+-0.031)   1.115 (+-0.010)  2.147 (+-0.023)
char_wb  (4, 4)       1.585 (+-0.022)   1.011 (+-0.006)  1.648 (+-0.012)
word     (1, 1)       0.299 (+-0.006)   0.224 (+-0.004)  0.304 (+-0.001)
         (1, 2)       1.172 (+-0.021)   0.444 (+-0.016)  1.228 (+-0.027)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                284.1             195.3           283.7
char_wb  (4, 4)                232.6             198.2           235.1
word     (1, 1)                155.1             159.5           160.3
         (1, 2)                264.9             167.8           265.8

After

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.110 (+-0.024)   1.091 (+-0.026)  2.155 (+-0.016)
char_wb  (4, 4)       1.627 (+-0.037)   1.023 (+-0.039)  1.643 (+-0.003)
word     (1, 1)       0.354 (+-0.075)   0.221 (+-0.001)  0.317 (+-0.009)
         (1, 2)       1.206 (+-0.021)   0.427 (+-0.009)  1.214 (+-0.005)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                304.7             210.8           303.6
char_wb  (4, 4)                255.1             213.5           253.1
word     (1, 1)                176.2             184.1           181.9
         (1, 2)                289.4             190.9           288.9

Member

@jnothman jnothman left a comment


Ideally we'd have a non-regression test that checks that all build_* methods result in objects that can be pickled and restored.
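
A rough sketch of what such a test could look like (illustrative only, not necessarily the test that ended up in the PR):

```python
import pickle

from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)


def test_vectorizer_build_methods_are_picklable():
    # Non-regression check for gh-12833: the callables returned by the
    # build_* methods should survive a pickle round-trip and keep behaving
    # the same afterwards.
    text = "pickle me if you can"
    for Vectorizer in (CountVectorizer, HashingVectorizer, TfidfVectorizer):
        vec = Vectorizer()
        for build in (vec.build_preprocessor, vec.build_tokenizer,
                      vec.build_analyzer):
            func = build()
            restored = pickle.loads(pickle.dumps(func))
            assert restored(text) == func(text)
```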

@rth rth changed the title BUG: 12833 remove lambdas from text preprocessing FIX remove lambdas from text preprocessing Jul 25, 2019
Member

@rth rth left a comment


Thanks @deniederhut !

I'm not overly enthusiastic about the addition of the _preprocess and _analyze functions, but I don't see another way of fixing pickling.

Unless we ask people to use cloudpickle?
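
For context, the cloudpickle route would mean no code change in scikit-learn, instead asking users to serialize with something like the following (a sketch; cloudpickle is a third-party package):

```python
import cloudpickle  # third-party; would become the user's responsibility

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(4, 4))
analyzer = vec.build_analyzer()

# cloudpickle can serialize lambdas and closures that stdlib pickle rejects,
# so the lambda-based analyzers would round-trip without touching the library.
restored = cloudpickle.loads(cloudpickle.dumps(analyzer))
assert restored("pickling lambdas") == analyzer("pickling lambdas")
```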

@thomasjpfan
Member

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
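
For illustration, an entry along these lines (the wording and the PR-number placeholder are hypothetical, not the text that was actually committed):

```rst
- |Fix| The callables produced by the ``build_preprocessor``, ``build_tokenizer``
  and ``build_analyzer`` methods of the text vectorizers can now be pickled.
  :pr:`<this PR>` by :user:`deniederhut`.
```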

@deniederhut
Contributor Author

Hm... Circle is showing

agent key RSA SHA256:oEdX/44m2Y1klY5Uq287qOWuWfJrhusv4e5ImKQ5Uhk returned incorrect signature type

Does this need to be rebased?

@rth
Member

rth commented Jul 30, 2019

Please resolve conflicts (hopefully merging master in would also fix CI).

Lambda functions are non-serializable under the stdlib pickle
module. This commit replaces the lambdas found in three text
preprocessing functions with hidden functions for chaining
a sequence of preprocessing steps that can be partialed where
appropriate.

Closes scikit-learn#12833
The function has been modified, so testing for identity is
no longer appropriate.
@deniederhut deniederhut force-pushed the bug/12833-remove-lambdas branch from b6dc6fe to 07d7cf4 on July 31, 2019 02:02
@deniederhut
Contributor Author

Yup! That did the trick for the CI

@thomasjpfan thomasjpfan merged commit 53f76d1 into scikit-learn:master Aug 1, 2019
@thomasjpfan
Member

Thank you @deniederhut!


Development

Successfully merging this pull request may close these issues: Pickling Tokenizers fails due to use of lambdas (#12833).