
Conversation

@deniederhut
Contributor

Lambda functions are not serializable with the stdlib pickle module.
This commit replaces the lambdas found in three text preprocessing
functions with private functions that chain a sequence of
preprocessing steps and can be partially applied where appropriate.

Reference Issues/PRs

Closes #12833

What does this implement/fix? Explain your changes.

Instead of composing functions with lambdas, the chains of
preprocessing steps are built inside single functions that can be
specialized with functools.partial.
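
In short, the closures are swapped for functools.partial over a module-level helper, which stdlib pickle can serialize by name. A minimal sketch of the pattern, using illustrative helper names rather than the exact scikit-learn internals:

```python
import pickle
from functools import partial


def _lowercase(doc):
    return doc.lower()


def _word_ngrams(doc):
    return doc.split()


def _analyze(doc, ngrams=None, preprocessor=None, decoder=None):
    # Apply whichever preprocessing steps were configured, in order.
    if decoder is not None:
        doc = decoder(doc)
    if preprocessor is not None:
        doc = preprocessor(doc)
    if ngrams is not None:
        doc = ngrams(doc)
    return doc


# Before: a lambda closing over the preprocessing steps; pickle.dumps() fails
# because lambdas are serialized by name and anonymous functions have none.
# analyzer = lambda doc: _word_ngrams(_lowercase(doc))

# After: a partial of a module-level function, serialized as a reference to
# the function plus its stored keyword arguments.
analyzer = partial(_analyze, ngrams=_word_ngrams, preprocessor=_lowercase)

restored = pickle.loads(pickle.dumps(analyzer))
assert restored("Some Text") == ["some", "text"]
```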

Member

@rth rth left a comment


Thanks for looking into it!

Could you please run benchmarks/bench_text_vectorizers.py before and after this PR and report results?

-            return lambda doc: self._char_wb_ngrams(
-                preprocess(self.decode(doc)))
+            return partial(_analyze, ngrams=self._char_wb_ngrams,
+                           preprocessor=preprocess, decoder=self.decode)

Side note: this is really a use-case for toolz.functoolz.compose a shame that we can't use it.
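
For reference, that composition would read roughly as follows with toolz (illustrative only: toolz is not a scikit-learn dependency, and the functions below are stand-ins for the real decode/preprocess/n-gram steps):

```python
from toolz.functoolz import compose  # third-party; not usable in scikit-learn itself


def decode(doc):
    # stand-in for the vectorizer's decode step
    return doc.decode("utf-8") if isinstance(doc, bytes) else doc


def preprocess(doc):
    # stand-in for lowercasing / accent stripping
    return doc.lower()


def char_wb_ngrams(doc):
    # stand-in for char_wb n-gram extraction (here: plain character 4-grams)
    return [doc[i:i + 4] for i in range(len(doc) - 3)]


# compose applies right-to-left: char_wb_ngrams(preprocess(decode(doc)))
analyzer = compose(char_wb_ngrams, preprocess, decode)
analyzer(b"Some Text")  # 4-grams of "some text"
```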

@deniederhut
Contributor Author

Sure thing!

Before

================================================================================
#    Text vectorizers benchmark
================================================================================

Using a subset of the 20 newsgroups dataset (1000 documents).
This benchmark runs in ~1 min ...

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.063 (+-0.031)   1.115 (+-0.010)  2.147 (+-0.023)
char_wb  (4, 4)       1.585 (+-0.022)   1.011 (+-0.006)  1.648 (+-0.012)
word     (1, 1)       0.299 (+-0.006)   0.224 (+-0.004)  0.304 (+-0.001)
         (1, 2)       1.172 (+-0.021)   0.444 (+-0.016)  1.228 (+-0.027)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                284.1             195.3           283.7
char_wb  (4, 4)                232.6             198.2           235.1
word     (1, 1)                155.1             159.5           160.3
         (1, 2)                264.9             167.8           265.8

After

========== Run time performance (sec) ===========

Computing the mean and the standard deviation of the run time over 3 runs...

vectorizer            CountVectorizer HashingVectorizer  TfidfVectorizer
analyzer ngram_range                                                    
char     (4, 4)       2.110 (+-0.024)   1.091 (+-0.026)  2.155 (+-0.016)
char_wb  (4, 4)       1.627 (+-0.037)   1.023 (+-0.039)  1.643 (+-0.003)
word     (1, 1)       0.354 (+-0.075)   0.221 (+-0.001)  0.317 (+-0.009)
         (1, 2)       1.206 (+-0.021)   0.427 (+-0.009)  1.214 (+-0.005)

=============== Memory usage (MB) ===============

vectorizer           CountVectorizer HashingVectorizer TfidfVectorizer
analyzer ngram_range                                                  
char     (4, 4)                304.7             210.8           303.6
char_wb  (4, 4)                255.1             213.5           253.1
word     (1, 1)                176.2             184.1           181.9
         (1, 2)                289.4             190.9           288.9

Member

@jnothman jnothman left a comment


Ideally we'd have a non-regression test that checks that all build_* methods result in objects that can be pickled and restored.
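
A rough sketch of what such a test could look like (illustrative only, not necessarily the test that ended up in the PR):

```python
import pickle

from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)


def test_vectorizer_build_methods_are_picklable():
    # Non-regression check for gh-12833: the callables returned by the
    # build_* methods should survive a pickle round-trip and keep behaving
    # the same afterwards.
    text = "pickle me if you can"
    for Vectorizer in (CountVectorizer, HashingVectorizer, TfidfVectorizer):
        vec = Vectorizer()
        for build in (vec.build_preprocessor, vec.build_tokenizer,
                      vec.build_analyzer):
            func = build()
            restored = pickle.loads(pickle.dumps(func))
            assert restored(text) == func(text)
```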

@rth rth changed the title BUG: 12833 remove lambdas from text preprocessing FIX remove lambdas from text preprocessing Jul 25, 2019
Member

@rth rth left a comment


Thanks @deniederhut !

I'm not overly enthusiastic about the addition of the _preprocess and _analyze functions, but I don't see another way of fixing pickling.

Unless we ask people to use cloudpickle?
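
For context, the cloudpickle route would mean no code change in scikit-learn, instead asking users to serialize with something like the following (a sketch; cloudpickle is a third-party package):

```python
import cloudpickle  # third-party; would become the user's responsibility

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(4, 4))
analyzer = vec.build_analyzer()

# cloudpickle can serialize lambdas and closures that stdlib pickle rejects,
# so the lambda-based analyzers would round-trip without touching the library.
restored = cloudpickle.loads(cloudpickle.dumps(analyzer))
assert restored("pickling lambdas") == analyzer("pickling lambdas")
```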

@thomasjpfan
Member

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
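
For illustration, an entry along these lines (the wording and the PR-number placeholder are hypothetical, not the text that was actually committed):

```rst
- |Fix| The callables produced by the ``build_preprocessor``, ``build_tokenizer``
  and ``build_analyzer`` methods of the text vectorizers can now be pickled.
  :pr:`<this PR>` by :user:`deniederhut`.
```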

@deniederhut
Contributor Author

Hm... Circle is showing

agent key RSA SHA256:oEdX/44m2Y1klY5Uq287qOWuWfJrhusv4e5ImKQ5Uhk returned incorrect signature type

Does this need to be rebased?

@rth
Member

rth commented Jul 30, 2019

Please resolve conflicts (hopefully merging master in would also fix CI).

Lambda functions are non-serializable under the stdlib pickle
module. This commit replaces the lambdas found in three text
preprocessing functions with hidden functions for chaining
a sequence of preprocessing steps that can be partialed where
appropriate.

Closes scikit-learn#12833
The function has been modified, so testing for identity is
no longer appropriate.
@deniederhut deniederhut force-pushed the bug/12833-remove-lambdas branch from b6dc6fe to 07d7cf4 on July 31, 2019 02:02
@deniederhut
Contributor Author

Yup! That did the trick for the CI

@thomasjpfan thomasjpfan merged commit 53f76d1 into scikit-learn:master Aug 1, 2019
@thomasjpfan
Member

Thank you @deniederhut!


Development

Successfully merging this pull request may close these issues: Pickling Tokenizers fails due to use of lambdas (#12833).