[Task]: Tracking cloudpickle issues #34903

@claudevdm

Description

As of Beam 2.65.0, cloudpickle is the default pickle_library; the previous default was dill. See https://s.apache.org/beam-cloudpickle-next-steps for background.

This can cause breakages where the behavior of dill and cloudpickle diverges.

The tests in cloudpickle_pickler_test demonstrate how cloudpickle behaves in various cases. Notable behavior includes:

  1. Globals defined in __main__ module are pickled by value
  2. Globals defined in importable modules are pickled by reference
  3. Module-aliased globals are pickled by value
  4. All functions and classes defined in __main__ module are pickled by value
  5. All closures and dynamic types are pickled by value.
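
To illustrate the two modes named in the list above, here is a minimal sketch that uses only the standard pickle module (not cloudpickle, and not code from cloudpickle_pickler_test). An importable function is stored as a module and attribute name, while plain data is stored by content and comes back as an independent copy. The `counter` dict is a hypothetical stand-in for a piece of global state.

```python
import json
import pickle

# By reference: the payload only names the module and the attribute,
# so unpickling looks the function up again instead of rebuilding it.
payload = pickle.dumps(json.dumps)
same_function = pickle.loads(payload)
refers_by_name = b"json" in payload and b"dumps" in payload

# By value: the payload carries the object's contents, so unpickling
# builds a new object that is independent of the original.
counter = {"calls": 0}  # hypothetical global state
copy_of_counter = pickle.loads(pickle.dumps(counter))
copy_of_counter["calls"] += 1  # only the copy changes

print(same_function is json.dumps)
print(counter, copy_of_counter)
```

The second half shows why by-value pickling matters for globals: an update made through the unpickled copy is not visible on the original object.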

Known issues include:

  • Unit tests that rely on globals will fail. Cloudpickle assumes that the __main__ module is not available in the unpickling environment, so it redefines globals there. To fix such tests, use the apache_beam.utils.shared module, as shown in
    def test_globals_shared_are_pickled_by_reference(self):
  • Closures and dynamic classes that reference unpicklable objects fail to pickle. To fix this, define functions at the top level and bind arguments with functools.partial where necessary.
  • Types that cloudpickle cannot pickle should instead be defined in an importable module, where they are pickled by reference.
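
The functools.partial workaround for closures can be sketched as follows. This is a simplified illustration with the standard pickle module, not Beam code: `FakeClient`, `make_sender_closure`, `send_record`, and `is_picklable` are hypothetical names, and the lock stands in for any unpicklable resource. The closure captures the client itself, so a pickler that serializes the closure by value has to serialize the client too; the top-level function takes only picklable configuration and creates the client when it is called.

```python
import functools
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for an unpicklable resource (it holds a lock)."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self._lock = threading.Lock()

    def send(self, record):
        with self._lock:
            return f"{self.endpoint}:{record}"


def make_sender_closure(client):
    # Problematic pattern: the inner function captures `client`.
    def send(record):
        return client.send(record)
    return send


def send_record(endpoint, record):
    # Top-level alternative: only the picklable `endpoint` is bound ahead
    # of time; the client is created at call time.
    return FakeClient(endpoint).send(record)


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


closure = make_sender_closure(FakeClient("db"))
captured = closure.__closure__[0].cell_contents  # the FakeClient instance

bound = functools.partial(send_record, "db")
restored = pickle.loads(pickle.dumps(bound))

print(is_picklable(captured))
print(restored("row-1"))
```

In a real pipeline, `send_record` would live in an importable module so that it is pickled by reference.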

Please report any new issues on this tracking bug. If a breakage requires reverting to dill, specify the --pickle_library=dill pipeline option.

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
