Hey,
A Python dictionary is quite wasteful for string data, so `CountVectorizer`, `TfidfVectorizer` and `DictVectorizer` could all take much less memory with a different data structure for `vocabulary_`.
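To illustrate the idea, here is a minimal sketch (my illustration, not part of the prototype) using the `marisa-trie` package: a static trie can hold the same set of terms as a dict-based vocabulary far more compactly, but it assigns its own integer ids, so the term-to-column mapping has to be rebuilt rather than copied over.

```python
# Minimal sketch of the idea: store vocabulary terms in a static
# MARISA trie instead of a dict (illustrative, not the prototype's code).
import marisa_trie

dict_vocab = {u"apple": 0, u"banana": 1, u"cherry": 2}

# Build an immutable, very compact trie over the same keys.
trie_vocab = marisa_trie.Trie(dict_vocab.keys())

# Lookups return trie-assigned ids, which generally differ from the
# original dict values, so column indices must be remapped once.
print(trie_vocab[u"banana"])        # some integer id in [0, len(trie_vocab))
print(trie_vocab.restore_key(0))    # reverse lookup: id -> term
print(len(trie_vocab))              # 3
```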
Here is a quick prototype that persists the `vocabulary_` of `CountVectorizer` and `TfidfVectorizer` in a MARISA trie via a Python wrapper: https://gist.github.com/kmike/7814472. I used one script to measure the memory usage and speed of `fit` and `dump`, and another script to measure the same for `load` and `transform`.
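For readers who don't want to open the gist, one way such a wrapper could work is sketched below (the class name matches the prototype, but the body and the `_freeze_vocabulary` helper are my illustration, not the gist's exact code): fit a normal `CountVectorizer`, then replace the dict-based `vocabulary_` with a `marisa_trie.Trie` and permute the matrix columns to match the trie's ids.

```python
# Hedged sketch of a trie-backed vectorizer (illustrative, not the gist's code).
import marisa_trie
from sklearn.feature_extraction.text import CountVectorizer


class MarisaCountVectorizer(CountVectorizer):
    def fit_transform(self, raw_documents, y=None):
        # CountVectorizer.fit delegates to fit_transform, so overriding
        # this one method covers both fit() and fit_transform().
        X = super(MarisaCountVectorizer, self).fit_transform(raw_documents, y)
        return self._freeze_vocabulary(X)

    def _freeze_vocabulary(self, X):
        dict_vocab = self.vocabulary_
        trie = marisa_trie.Trie(dict_vocab.keys())
        # The trie assigns its own ids; build a column permutation so that
        # new column trie[term] holds the counts from old column dict_vocab[term].
        perm = [0] * len(dict_vocab)
        for term, old_id in dict_vocab.items():
            perm[trie[term]] = old_id
        self.vocabulary_ = trie
        return X[:, perm]
```

Since `marisa_trie.Trie` supports `trie[key]`, `key in trie`, `len(trie)` and raises `KeyError` for unknown keys, it can largely stand in for the read-only dict lookups that `transform` needs.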
The results are quite cool: for the '20 newsgroups' data (fit on the training subset, transform on the test subset), the memory consumption of the different vectorizers after loading is the following:

- `CountVectorizer()`: 94 MB
- `CountVectorizer(ngram_range=(1,2))`: 666 MB
- `MarisaCountVectorizer()`: 1.2 MB
- `MarisaCountVectorizer(ngram_range=(1,2))`: 13.3 MB
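The exact measurement scripts are in the gists; a rough stand-in for the "memory after loading" numbers could look like the following sketch (the use of `psutil`, the RSS-delta method and the `vectorizer.pkl` file name are my assumptions, not the author's scripts):

```python
# Hypothetical reproduction sketch: compare process RSS before and after
# unpickling a previously fitted and dumped vectorizer.
import os
import pickle

import psutil


def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


before = rss_mb()
with open("vectorizer.pkl", "rb") as f:  # hypothetical dump of a fitted vectorizer
    vec = pickle.load(f)
after = rss_mb()
print("approx. memory taken by the loaded vectorizer: %.1f MB" % (after - before))
```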
So using a good succinct trie implementation gives us a 50x-80x reduction in memory usage; it also makes serializing and deserializing almost instant. And such vectorizers don't have HashingVectorizer's limitations (the vocabulary is preserved, so features can still be mapped back to terms, and there are no hash collisions).
The downside is that the `fit` method is 2x-3x slower and requires slightly more memory (though I think it could be changed to require less memory than `CountVectorizer.fit`), and `transform` is about 2x slower. Full output: https://gist.github.com/kmike/7815156
What do you think about adding something similar to scikit-learn?