@austintlee austintlee commented Nov 5, 2024

We want to run the rerank transform on Ray on GPU, but we found a couple of issues while testing this with test_rerank.py. The tests in test_rerank.py currently run in ExecMode.LOCAL, and I suspect they mostly run on CPU. If you change the execution mode to RAY and run the tests on a GPU machine, you hit the following issues.

Issue 1 - similarity does not properly run on GPU.

Here's the stack trace:

ray.exceptions.RayTaskError(UserCodeException): ray::MapBatches(HuggingFaceTransformersSimilarityScorer)->Map(ray_callable)() (pid=626521, ip=192.168.68.124)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/base.py", line 203, in ray_callable
    return BaseMapTransform._process_ray(ray_input, name, lambda d: f(d, *args, **kwargs), enable_auto_metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/base.py", line 257, in _process_ray
    outputs = f(docs)
              ^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/base.py", line 203, in <lambda>
    return BaseMapTransform._process_ray(ray_input, name, lambda d: f(d, *args, **kwargs), enable_auto_metadata)
                                                                    ^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/utils/import_utils.py", line 46, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/similarity.py", line 153, in __call__
    return self.generate_similarity_scores(doc_batch, query, score_property_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/similarity.py", line 88, in generate_similarity_scores
    scores = self.score(input_pairs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/utils/time_trace.py", line 141, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/similarity.py", line 164, in score
    self._model = AutoModelForSequenceClassification.from_pretrained(self.model_name).to(self.device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
           ^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

The above exception was the direct cause of the following exception:

ray::MapBatches(HuggingFaceTransformersSimilarityScorer)->Map(ray_callable)() (pid=626520, ip=192.168.68.124)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 461, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 134, in _udf_timed_iter
    output = next(input)
             ^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 216, in __call__
    yield from self._row_fn(input, ctx)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 379, in transform_fn
    for row in rows:
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 269, in __call__
    for block in blocks:
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 134, in _udf_timed_iter
    output = next(input)
             ^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 236, in __call__
    yield from self._batch_fn(input, ctx)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 282, in transform_fn
    res = fn(batch)
          ^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 194, in fn
    _handle_debugger_exception(e)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 210, in _handle_debugger_exception
    raise UserCodeException() from e
ray.exceptions.UserCodeException

../../.cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/exceptions.py:87: RayTaskError(UserCodeException)

Following a suggestion from @HenryL27, I mirrored the GPU setup in embed.py, and the problem above went away. This is the diff in similarity.py.
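
For readers without the diff at hand, here is a minimal sketch of the kind of GPU setup described above. It is not the actual change to similarity.py: `choose_device` and `ray_gpu_args` are hypothetical helpers. The assumption is that the scorer needs to pick its device at run time inside the worker and also ask Ray for a GPU (through the `num_gpus` argument that Ray Data's `map_batches` accepts), since a Ray worker that was not assigned a GPU may not see any CUDA device.

```python
from typing import Any


def choose_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" only if torch is importable and reports a usable GPU."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the check runs inside the worker
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def ray_gpu_args(device: str) -> dict[str, Any]:
    """Extra kwargs for map_batches: request one GPU only for CUDA devices."""
    return {"num_gpus": 1} if device.startswith("cuda") else {}


# Hypothetical usage: pass ray_gpu_args(device) along when scheduling the scorer.
device = choose_device()
print(device, ray_gpu_args(device))
```

Falling back to "cpu" keeps the same code path usable in ExecMode.LOCAL on machines without a GPU.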

@austintlee austintlee marked this pull request as ready for review December 13, 2024 17:01
@austintlee
Author

Issue 2 - pickle deserialization gets confused.

Once I got past the above issue, I started getting a different stack trace:

ray.exceptions.RayTaskError(UserCodeException): ray::Map(ray_callable)() (pid=545630, ip=192.168.68.124)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/transforms/sort.py", line 50, in ray_callable
    doc = Document.from_row(input_dict)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/data/document.py", line 237, in from_row
    return Document.deserialize(row["doc"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "aryn/sycamore/lib/sycamore/sycamore/data/document.py", line 224, in deserialize
    data = pickle.loads(raw)  # mapped_loads(raw)
           ^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/storage.py", line 381, in _load_from_bytes
    return torch.load(io.BytesIO(b))
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 1272, in _legacy_load
    result = unpickler.load()
             ^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 1205, in persistent_load
    obj = restore_location(obj, location)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
             ^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 265, in _cuda_deserialize
    device = validate_cuda_device(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/torch/serialization.py", line 249, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

The above exception was the direct cause of the following exception:

ray::Map(ray_callable)() (pid=545630, ip=192.168.68.124)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 461, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 134, in _udf_timed_iter
    output = next(input)
             ^^^^^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 216, in __call__
    yield from self._row_fn(input, ctx)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 380, in transform_fn
    out_row = fn(row)
              ^^^^^^^
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 194, in fn
    _handle_debugger_exception(e)
  File ".cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 210, in _handle_debugger_exception
    raise UserCodeException() from e
ray.exceptions.UserCodeException

../../.cache/pypoetry/virtualenvs/sycamore-monorepo-QRJsh08E-py3.12/lib/python3.12/site-packages/ray/data/exceptions.py:87: RayTaskError(UserCodeException)

Upon further investigation, I found what I believe to be the same issue, reported in pytorch/pytorch#16797. The problem went away once I applied the workaround suggested in that GitHub issue.
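
The workaround discussed in that issue is, roughly, a custom unpickler that intercepts `torch.storage._load_from_bytes` (the frame visible in the trace above) and reloads the bytes with `map_location="cpu"`. A minimal sketch follows. The name `mapped_loads` is borrowed from the comment in the trace, but the body is an assumption and not necessarily what was merged; `CpuUnpickler` is a hypothetical class name.

```python
import io
import pickle
from typing import Any


class CpuUnpickler(pickle.Unpickler):
    """Unpickler that remaps torch storages to the CPU while loading."""

    def find_class(self, module: str, name: str) -> Any:
        if module == "torch.storage" and name == "_load_from_bytes":
            import torch  # only needed when the payload contains tensors

            # Swap in a loader that forces map_location="cpu".
            return lambda b: torch.load(io.BytesIO(b), map_location="cpu")
        return super().find_class(module, name)


def mapped_loads(raw: bytes) -> Any:
    """Stand-in for pickle.loads that does not require a CUDA device."""
    return CpuUnpickler(io.BytesIO(raw)).load()


# Payloads without tensors behave the same as with pickle.loads.
doc = mapped_loads(pickle.dumps({"text": "hello", "score": 0.5}))
print(doc)
```

With this approach, a document that picked up a CUDA tensor in a GPU task can still be deserialized by a later task, such as the sort step in the trace, that runs without a GPU.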

@bsowell bsowell left a comment

Looks good. Thanks!

def score(self, inputs: list[tuple[str, str]]) -> list[float]:
    import torch

    print(f"GPU: {torch.cuda.is_available()}")

minor: Might want to remove this (or change to log).

Author

will do.
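
One way the review suggestion could look, as a sketch only: replace the `print` with a module-level logger at debug level. `log_gpu_availability` is a hypothetical helper and is not taken from the merged change.

```python
import logging

logger = logging.getLogger(__name__)


def log_gpu_availability() -> bool:
    """Log (rather than print) whether torch can see a CUDA device."""
    try:
        import torch

        available = torch.cuda.is_available()
    except ImportError:
        # Without torch installed there is no GPU to report.
        available = False
    logger.debug("GPU available: %s", available)
    return available
```

At debug level the message stays out of normal test output but can be enabled when diagnosing worker placement.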

@austintlee austintlee merged commit 26b14f4 into main Dec 17, 2024
12 of 14 checks passed
@austintlee austintlee deleted the similarity-gpu-ray branch December 17, 2024 01:40