Use succinct tries for vocabulary storage #2639

Closed as not planned
@kmike

Description

Hey,

A Python dictionary is quite wasteful for string data, so CountVectorizer, TfidfVectorizer and DictVectorizer could all take much less memory with another data structure for vocabulary_.
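
To make the overhead concrete, here is a rough back-of-the-envelope check (CPython-specific; exact sizes vary by version and platform):

    import sys

    # Each vocabulary entry pays full Python object overhead, while a
    # succinct trie stores shared prefixes in a few bits per node.
    term, index = u'vectorizer', 12345
    print(sys.getsizeof(term))   # ~59 bytes for a 10-char ASCII str (CPython 3)
    print(sys.getsizeof(index))  # ~28 bytes for a small int
    # ...plus per-slot dict overhead on top of both objects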

Here is a quick prototype that persists vocabulary_ of CountVectorizer and TfidfVectorizer in MARISA-Trie via Python wrapper: https://gist.github.com/kmike/7814472.
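
The wrapper boils down to roughly the following sketch (a simplified illustration of the approach, not the exact gist code; it assumes the marisa_trie package):

    import marisa_trie
    from sklearn.feature_extraction.text import CountVectorizer

    class MarisaCountVectorizer(CountVectorizer):
        # marisa_trie.Trie maps each key to a stable integer id in
        # [0, len(trie)), so it can stand in for the vocabulary_ dict
        # of term -> column index.

        def fit(self, raw_documents, y=None):
            super(MarisaCountVectorizer, self).fit(raw_documents)
            self._freeze_vocabulary()
            return self

        def fit_transform(self, raw_documents, y=None):
            super(MarisaCountVectorizer, self).fit_transform(raw_documents)
            self._freeze_vocabulary()
            # The trie assigns its own ids (a permutation of the original
            # columns), so re-run transform to get matching column indices.
            return self.transform(raw_documents)

        def _freeze_vocabulary(self):
            # The trie supports the __getitem__ / __contains__ / __len__
            # lookups that CountVectorizer.transform needs from vocabulary_.
            self.vocabulary_ = marisa_trie.Trie(self.vocabulary_.keys())

The same trick applies to TfidfVectorizer, provided the idf statistics are computed after the vocabulary is frozen, so that their order matches the trie's ids.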

I used one script to measure the memory use and speed of fit and dump, and another to measure the same for load and transform.

The results are quite cool: for the '20 newsgroups' data (fit on the training subset, transform on the test subset), memory consumption of the different vectorizers after loading is as follows:

  • CountVectorizer(): 94MB;
  • CountVectorizer(ngram_range=(1,2)): 666MB;
  • MarisaCountVectorizer(): 1.2MB;
  • MarisaCountVectorizer(ngram_range=(1,2)): 13.3MB;

So using a good succinct trie implementation gives us a 50x-80x reduction in memory usage; it also makes serialization and deserialization almost instant. And such vectorizers don't have HashingVectorizer's limitations (for example, feature names can still be recovered).
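
The near-instant (de)serialization is possible because a MARISA trie is a flat, memory-mappable blob. For example (again assuming the marisa_trie package):

    import marisa_trie

    trie = marisa_trie.Trie([u'counting', u'words', u'with', u'tries'])
    trie.save('vocab.marisa')   # writes one contiguous blob

    trie2 = marisa_trie.Trie()
    trie2.mmap('vocab.marisa')  # maps the file back; no per-key rebuilding
    assert trie2[u'words'] == trie[u'words']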

The downside is that the fit method is 2x-3x slower and requires slightly more memory (though I think it could be changed to require less memory than CountVectorizer.fit), and transform is about 2x slower. Full output: https://gist.github.com/kmike/7815156

What do you think about adding something similar to scikit-learn?
