Validation step fails when using shared memory with multiprocessing.managers.BaseManager #28899

Closed
@ElenaKhaustova

Description

Describe the bug

Original issue: kedro-org/kedro#3674

Relates to #28781

We use multiprocessing managers to share memory between processes for pipeline parallelisation. Since this validation step was added, we get ValueError: cannot set WRITEABLE flag to True of this array whenever objects retrieved from shared memory are passed to scikit-learn methods that run the validation step, for example fit.
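The NumPy error itself can be seen in isolation. The sketch below assumes the array is backed by an immutable buffer (here a bytes object); whether that is exactly what happens to a pd.Series after the manager round-trip is an assumption, not something confirmed in this report.

```python
import numpy as np

# An array built on top of an immutable bytes object does not own
# writable memory, so NumPy marks it read-only.
raw = np.arange(4, dtype=np.float64).tobytes()
readonly = np.frombuffer(raw, dtype=np.float64)
print(readonly.flags.writeable)

# Trying to flip the flag back is what fails in the traceback.
try:
    readonly.flags.writeable = True
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A copy owns its own memory and is writable again.
writable_copy = readonly.copy()
print(writable_copy.flags.writeable)
```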

The only workaround that has worked for us so far is to deep-copy the objects before passing them to those methods, which is not a desirable solution.
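A narrower variant of the copy workaround might copy only when the array is actually read-only. This is a minimal sketch for plain NumPy arrays; the helper name ensure_writable is hypothetical and not part of scikit-learn, NumPy, or Kedro.

```python
import numpy as np


def ensure_writable(array: np.ndarray) -> np.ndarray:
    """Return `array` unchanged if it is writable, otherwise a writable copy.

    Hypothetical helper for illustration only.
    """
    if array.flags.writeable:
        return array
    # np.array(..., copy=True) allocates fresh memory owned by the new array.
    return np.array(array, copy=True)


# Simulate a read-only array, such as one rebuilt from an immutable buffer.
source = np.frombuffer(
    np.arange(3, dtype=np.float64).tobytes(), dtype=np.float64
)
fixed = ensure_writable(source)
fixed[0] = 42.0  # succeeds; `source` is left untouched
```

This still pays for a copy in the failing case, so it only reduces the cost for data that is already writable.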

Steps/Code to Reproduce

Some findings:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception:
        # Print the full traceback so the failure inside the worker is visible.
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    # Replacing pd.Series with pd.DataFrame solves the issue
    # y_train = pd.DataFrame(rng.random((n_samples, 1)), columns=list('E'))

    futures = set()

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

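    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


# Entry-point guard: required so the script runs and worker processes can
# import this module safely.
if __name__ == "__main__":
    main()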

Expected Results

No error is thrown.

Actual Results

Traceback (most recent call last):
  File "/pr-scikit-learn/main.py", line 48, in train_model
    regressor.fit(X_train, y_train)
  File "/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1292, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1100, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array

Versions

System:
    python: 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ]
executable: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.5.dev0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.2.2
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Nehalem
