
Re-evaluate memory usage in CountVectorizer #13062

Open
@rth

Description

In #7272 (a mix of different PRs aiming to optimize text vectorizers), among other things, the intermediary storage of X.indices (where X is the document-term matrix) was moved from array.array('i') to List[int].

There were benchmarks suggesting that, overall, less peak memory was used to vectorize documents after that PR. However, I now don't understand how that is possible, since:

  • one element of array.array('i') should be around 8x smaller than that of a Python int
  • storage of indices of the output sparse arrays should be in large part responsible for the overall memory footprint
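The per-element size claim can be checked with a rough sketch like the one below. It assumes CPython, and it deliberately uses values above the small-int cache so that every list element is a distinct int object; the variable names are illustrative only and are not taken from the scikit-learn code.

```python
import array
import sys

n = 100_000
# Values above CPython's small-int cache (-5..256), so that each list
# element is its own int object rather than a shared cached one.
values = range(1000, 1000 + n)

as_array = array.array("i", values)
as_list = list(values)

# array.array stores raw C ints contiguously.
array_bytes = as_array.itemsize * len(as_array)
# A list stores one pointer per element, plus the int objects themselves.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

print(f"array.array('i'): {array_bytes / n:.1f} bytes per element")
print(f"list of int:      {list_bytes / n:.1f} bytes per element")
print(f"ratio:            {list_bytes / array_bytes:.1f}x")
```

Note that this counts every element as a distinct int object. A list that references the same int objects many times only pays for one pointer per entry, so the ratio printed here is closer to an upper bound than to what any particular workload will show.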

I can't see any obvious mistakes in my benchmarks back then, but something doesn't make sense here.

Re-measuring peak memory usage of CountVectorizer.fit_transform with and without the above change would be a good place to start.
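One possible way to do such a re-measurement is with the standard-library tracemalloc module. The sketch below is only an assumption about how the comparison could be set up: peak_memory, build_indices_list and build_indices_array are hypothetical helpers, and the two builders are simplified stand-ins for the index accumulation, not the actual CountVectorizer code.

```python
import array
import tracemalloc


def peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_indices_list(n):
    # Stand-in for accumulating indices in a plain Python list.
    indices = []
    for i in range(n):
        indices.append(1000 + i)
    return indices


def build_indices_array(n):
    # Stand-in for accumulating indices in array.array('i').
    indices = array.array("i")
    for i in range(n):
        indices.append(1000 + i)
    return indices


_, peak_list = peak_memory(build_indices_list, 200_000)
_, peak_array = peak_memory(build_indices_array, 200_000)
print(f"list peak:  {peak_list / 1e6:.2f} MB")
print(f"array peak: {peak_array / 1e6:.2f} MB")
```

For the real comparison, the same helper could be called as peak_memory(CountVectorizer().fit_transform, docs) on each version of the code, with docs being whatever corpus is used for the benchmark. tracemalloc only reports allocations made through Python's memory allocators, so it is worth cross-checking the numbers with a process-level tool.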

Related discussion in #13045 (comment)
