Hey,
A Python dictionary is quite wasteful for string data, so `CountVectorizer`, `TfidfVectorizer` and `DictVectorizer` could all take much less memory with a different data structure for `vocabulary_`.
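To illustrate the idea, here is a minimal sketch (my illustration, not part of the prototype) using the `marisa-trie` package: a static trie can hold the same set of terms as a dict-based vocabulary far more compactly, but it assigns its own integer ids, so the term-to-column mapping has to be rebuilt rather than copied over.

```python
# Minimal sketch of the idea: store vocabulary terms in a static
# MARISA trie instead of a dict (illustrative, not the prototype's code).
import marisa_trie

dict_vocab = {u"apple": 0, u"banana": 1, u"cherry": 2}

# Build an immutable, very compact trie over the same keys.
trie_vocab = marisa_trie.Trie(dict_vocab.keys())

# Lookups return trie-assigned ids, which generally differ from the
# original dict values, so column indices must be remapped once.
print(trie_vocab[u"banana"])        # some integer id in [0, len(trie_vocab))
print(trie_vocab.restore_key(0))    # reverse lookup: id -> term
print(len(trie_vocab))              # 3
```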
Here is a quick prototype that persists the `vocabulary_` of `CountVectorizer` and `TfidfVectorizer` in a MARISA trie via a Python wrapper: https://gist.github.com/kmike/7814472. I used one script to measure the memory usage and speed of `fit` and `dump`, and another script to measure the same for `load` and `transform`.
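For readers who don't want to open the gist, one way such a wrapper could work is sketched below (the class name matches the prototype, but the body and the `_freeze_vocabulary` helper are my illustration, not the gist's exact code): fit a normal `CountVectorizer`, then replace the dict-based `vocabulary_` with a `marisa_trie.Trie` and permute the matrix columns to match the trie's ids.

```python
# Hedged sketch of a trie-backed vectorizer (illustrative, not the gist's code).
import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer


class MarisaCountVectorizer(CountVectorizer):
    def fit_transform(self, raw_documents, y=None):
        # CountVectorizer.fit delegates to fit_transform, so overriding
        # this one method covers both fit() and fit_transform().
        X = super(MarisaCountVectorizer, self).fit_transform(raw_documents, y)
        return self._freeze_vocabulary(X)

    def _freeze_vocabulary(self, X):
        dict_vocab = self.vocabulary_
        trie = marisa_trie.Trie(dict_vocab.keys())
        # The trie assigns its own ids; build a column permutation so that
        # new column trie[term] holds the counts from old column dict_vocab[term].
        perm = [0] * len(dict_vocab)
        for term, old_id in dict_vocab.items():
            perm[trie[term]] = old_id
        self.vocabulary_ = trie
        return X[:, perm]
```

Since `marisa_trie.Trie` supports `trie[key]`, `key in trie`, `len(trie)` and raises `KeyError` for unknown keys, it can largely stand in for the read-only dict lookups that `transform` needs.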
The results are quite cool: for the '20 newsgroups' data (fit on the training subset, transform on the test subset), the memory consumption of the different vectorizers after loading is the following:

- `CountVectorizer()`: 94 MB
- `CountVectorizer(ngram_range=(1,2))`: 666 MB
- `MarisaCountVectorizer()`: 1.2 MB
- `MarisaCountVectorizer(ngram_range=(1,2))`: 13.3 MB
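The exact measurement scripts are in the gists; a rough stand-in for the "memory after loading" numbers could look like the following sketch (the use of `psutil`, the RSS-delta method and the `vectorizer.pkl` file name are my assumptions, not the author's scripts):

```python
# Hypothetical reproduction sketch: compare process RSS before and after
# unpickling a previously fitted and dumped vectorizer.
import os
import pickle

import psutil


def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


before = rss_mb()
with open("vectorizer.pkl", "rb") as f:  # hypothetical dump of a fitted vectorizer
    vec = pickle.load(f)
after = rss_mb()
print("approx. memory taken by the loaded vectorizer: %.1f MB" % (after - before))
```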
So using a good succinct trie implementation gives us a 50x-80x reduction in memory usage; it also makes serializing and deserializing almost instant. And such vectorizers don't have HashingVectorizer's limitations (the vocabulary is preserved, so features can still be mapped back to terms, and there are no hash collisions).
The downside is that the `fit` method is 2x-3x slower and requires slightly more memory (though I think it could be changed to require less memory than `CountVectorizer.fit`), and `transform` is about 2x slower. Full output: https://gist.github.com/kmike/7815156
What do you think about adding something similar to scikit-learn?