Memory mapping causes disk space usage bloat in big models with many search parameters #19608

Open
@asansal-quantico

Description

Hello,

We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large hyperparameter space, and with 5-10 cross-validation folds the total number of models fitted sometimes reaches 1,000.

When running with joblib's default parallelization, the tool generates a large amount of memory-mapped data on disk, consuming disk space on the scale of hundreds of gigabytes.

Solution:
Once we disable memory mapping, it runs successfully without using the disk. The fix is to call joblib's Parallel(..., max_nbytes=None) within GridSearchCV; see the code excerpt below. joblib's default for max_nbytes is '1M' (1 megabyte); passing None disables memory mapping entirely.

The suggested improvement is to add a keyword argument to GridSearchCV that is passed through to Parallel's max_nbytes.

parallel = Parallel(n_jobs=self.n_jobs,
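The workaround described above can be sketched with joblib directly. This is a minimal, self-contained example (not scikit-learn's internals): with the default max_nbytes='1M', any array argument larger than 1 MB sent to a worker is dumped to disk as a memory-mapped file; passing max_nbytes=None keeps everything in memory.

```python
import numpy as np
from joblib import Parallel, delayed

# Stand-in for a large training array (the real case in this issue
# is ~20,000 rows x 100 features, replicated across ~1,000 fits).
data = np.random.rand(1000, 100)

# max_nbytes=None disables joblib's automatic memory mapping, so no
# temporary memmap files are written to disk for large arguments.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.sum)(data[i::4]) for i in range(4)
)
print(len(results))  # prints 4
```

In GridSearchCV there is currently no way to reach this parameter, which is why the proposal is to expose it as a kwarg forwarded to the Parallel call shown above.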
