In #7272 (which was a mix of different PRs aiming to optimize text vectorizers), among other things, the intermediary storage of `X.indices` (where `X` is the document-term matrix) was moved from `array.array('i')` to `List[int]`.
There were benchmarks suggesting that, overall, that PR reduced the peak memory used to vectorize documents. However, I now don't understand how that is possible, since:

- one element of `array.array('i')` should be around 8x smaller than a Python int;
- storage of the indices of the output sparse arrays should account for a large part of the overall memory footprint.
I can't see any obvious mistakes in my benchmarks back then, but something doesn't add up here.
Re-measuring the peak memory usage of `CountVectorizer.fit_transform`, with and without the above change, would be a good place to start.
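As a starting point, a minimal stand-in for the index-accumulation step can be measured with `tracemalloc` (this is a simplified sketch, not scikit-learn's actual `fit_transform` code; the function names and the fake vocabulary size are assumptions):

```python
import array
import tracemalloc

def accumulate_array(n):
    """Accumulate n term indices in an array.array('i') (pre-#7272 style)."""
    indices = array.array('i')
    for i in range(n):
        indices.append(i % 50_000)  # pretend vocabulary indices
    return indices

def accumulate_list(n):
    """Accumulate n term indices in a plain list (post-#7272 style)."""
    indices = []
    for i in range(n):
        indices.append(i % 50_000)
    return indices

def peak_mib(func, n):
    """Peak traced memory (MiB) while running func(n)."""
    tracemalloc.start()
    func(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

n = 500_000
print(f"array.array peak: {peak_mib(accumulate_array, n):.1f} MiB")
print(f"list peak:        {peak_mib(accumulate_list, n):.1f} MiB")
```

If the list variant shows a substantially higher peak here, that would make the original benchmark result even more puzzling, since the real vectorizer additionally holds the documents and vocabulary in memory.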
Related discussion in #13045 (comment)