In #7272 (which was a mix of different PRs aiming to optimize text vectorizers), among other things, the intermediary storage of `X.indices` (where `X` is the document-term matrix) was moved from `array.array('i')` to `List[int]`.
There were benchmarks suggesting that, overall, that PR reduced the peak memory used to vectorize documents. However, I now don't understand how that is possible, since:

- one element of `array.array('i')` should be around 8x smaller than a Python int;
- storage of the indices of the output sparse arrays should account for a large part of the overall memory footprint.
I can't see any obvious mistakes in my benchmarks back then, but something doesn't add up here.
Re-measuring the peak memory usage of `CountVectorizer.fit_transform`, with and without the above change, would be a good place to start.
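As a starting point, a minimal stand-in for the index-accumulation step can be measured with `tracemalloc` (this is a simplified sketch, not scikit-learn's actual `fit_transform` code; the function names and the fake vocabulary size are assumptions):

```python
import array
import tracemalloc

def accumulate_array(n):
    """Accumulate n term indices in an array.array('i') (pre-#7272 style)."""
    indices = array.array('i')
    for i in range(n):
        indices.append(i % 50_000)  # pretend vocabulary indices
    return indices

def accumulate_list(n):
    """Accumulate n term indices in a plain list (post-#7272 style)."""
    indices = []
    for i in range(n):
        indices.append(i % 50_000)
    return indices

def peak_mib(func, n):
    """Peak traced memory (MiB) while running func(n)."""
    tracemalloc.start()
    func(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

n = 500_000
print(f"array.array peak: {peak_mib(accumulate_array, n):.1f} MiB")
print(f"list peak:        {peak_mib(accumulate_list, n):.1f} MiB")
```

If the list variant shows a substantially higher peak here, that would make the original benchmark result even more puzzling, since the real vectorizer additionally holds the documents and vocabulary in memory.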
Related discussion in #13045 (comment)