claudevdm
Collaborator

@claudevdm claudevdm commented Jul 29, 2025

This PR refactors how "special types" are deterministically encoded in Beam Python, moving from dill to cloudpickle while preserving update compatibility.

Background:

Previously, FastPrimitivesCoder used dill to deterministically encode certain "special types" (e.g., NamedTuple, Enum, dataclasses). However, PipelineOptions, and specifically update_compatibility_version, were not available during the creation of these deterministic coders. This posed a challenge for maintaining update compatibility across Beam SDK versions.

Key Changes and Motivation:

  1. Make update_compatibility_version available to FastPrimitivesCoder.as_deterministic_coder:

    • To correctly choose the deterministic coder for update compatibility, the update_compatibility_version from PipelineOptions needs to be accessible when a deterministic coder is constructed.
    • This PR adds update_compatibility_version as a variable on the coder registry during pipeline construction.
    • FastPrimitivesCoder.as_deterministic_coder looks up registry.update_compatibility_version when deciding which deterministic coder to use.
  2. Transitioning to cloudpickle for Special Type Encoding:

    • The core change involves switching the encoding mechanism for "special types" in FastPrimitivesCoderImpl from dill to cloudpickle.
    • A new force_use_dill parameter is introduced to FastPrimitivesCoderImpl and its Cythonized counterpart, allowing control over which pickler is used.
    • _verify_dill_compat is added to enforce dill==0.3.1.1 when update_compatibility_version=2.67.0 is specified, ensuring backward compatibility with older pipelines.
    • New encode_type_2_67_0 and _unpickle_type_2_67_0 methods are introduced to handle the dill-based encoding/decoding for compatibility with versions <=2.67.0.
    • The encode_type and decode_type methods in FastPrimitivesCoderImpl now use cloudpickle_pickler.dumps and cloudpickle_pickler.loads by default, falling back to dill if force_use_dill is true.
  3. Introducing DeterministicFastPrimitivesCoderV2 for Compatibility Checks:

    • To ensure that update compatibility checks correctly fail when upgrading from SDK versions <=2.67.0 to >2.67.0 (if update_compatibility_version is not explicitly set), a new coder DeterministicFastPrimitivesCoderV2 is introduced.
    • This new coder has a different underlying coder name, so updating a pipeline built with an older SDK to a newer SDK version without setting update_compatibility_version fails with a compatibility error.
    • The _update_compatible_deterministic_fast_primitives_coder helper function is used by FastPrimitivesCoder.as_deterministic_coder to return either DeterministicFastPrimitivesCoder (for dill-compatible versions) or DeterministicFastPrimitivesCoderV2 (for cloudpickle-based versions).
  4. Helper Function for Version Comparison (is_v1_prior_to_v2):

    • A new utility function is_v1_prior_to_v2 is added to apache_beam.transforms.util to simplify version comparison logic, replacing a duplicated pattern.
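
The selection logic described in items 1, 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: CoderRegistrySketch, parse_version, is_prior_or_equal and choose_deterministic_coder are hypothetical names, and the real is_v1_prior_to_v2 helper may have different semantics. The sketch only assumes what the description states, namely that compatibility versions up to and including 2.67.0 select the dill-based coder and everything else selects the V2 coder.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.67.0' into (2, 67, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_prior_or_equal(v1: str, v2: str) -> bool:
    """Hypothetical version check; a stand-in, not Beam's is_v1_prior_to_v2."""
    return parse_version(v1) <= parse_version(v2)


class CoderRegistrySketch:
    """Hypothetical registry carrying the value set at pipeline construction."""
    def __init__(self) -> None:
        self.update_compatibility_version: Optional[str] = None


def choose_deterministic_coder(registry: CoderRegistrySketch) -> str:
    """Return the name of the coder a pipeline would get in this sketch."""
    compat = registry.update_compatibility_version
    if compat is not None and is_prior_or_equal(compat, "2.67.0"):
        # Old pipelines keep the dill-based encoding and the old coder name.
        return "DeterministicFastPrimitivesCoder"
    # Default: cloudpickle-based encoding under a new coder name.
    return "DeterministicFastPrimitivesCoderV2"
```

Comparing parsed integer tuples rather than raw strings matters here: as strings, "2.9.0" sorts after "2.67.0", which would send an old pipeline down the wrong branch.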

Testing:

  • New parameterized tests (test_deterministic_coder, test_deterministic_map_coder_is_update_compatible, etc.) have been added to coders_test_common.py to verify the behavior of deterministic coders with and without update_compatibility_version.
  • The test_cross_process_encoding_of_special_types_is_deterministic test has been updated to be parameterized by compat_version and now explicitly tests the dill fallback for 2.67.0 and earlier.
  • A new test_group_by_key_importable_special_types test confirms GroupByKey functionality with special types and update compatibility versions.
  • A new test_group_by_key_dynamic_special_types test demonstrates support for dynamic types by explicitly using CloudpickleCoder as a fallback.
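
The idea behind a cross-process determinism test can be sketched as below. This is an assumption-laden illustration rather than the test in coders_test_common.py: it uses the standard-library pickle module as a stand-in for the Beam picklers, encodes an importable Enum member (http.HTTPStatus.OK), and the helper name encode_in_fresh_process is hypothetical. Each encoding runs in a fresh interpreter with a different PYTHONHASHSEED, so any dependence on per-process hashing would show up as differing bytes.

```python
import os
import subprocess
import sys

# Code run in each child process: pickle an importable Enum member and
# print the resulting bytes as hex so the parent can compare them.
SNIPPET = (
    "import pickle, http; "
    "print(pickle.dumps(http.HTTPStatus.OK, protocol=4).hex())"
)


def encode_in_fresh_process(hash_seed: str) -> str:
    """Encode the value in a new interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def is_deterministic_across_processes() -> bool:
    """True if two processes with different hash seeds emit the same bytes."""
    return encode_in_fresh_process("1") == encode_in_fresh_process("2")
```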

@claudevdm claudevdm marked this pull request as draft July 29, 2025 13:09
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @claudevdm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new update_compatibility_version parameter to the as_deterministic_coder method across the Apache Beam Python SDK's coder hierarchy. This change allows for the propagation of compatibility version information during the conversion of coders to their deterministic forms, which is crucial for maintaining serialization compatibility across different versions of Beam or user-defined types. The modification ensures that this compatibility context is available throughout the coder conversion process, particularly for complex and nested coder structures.

Highlights

  • API Change: The as_deterministic_coder method across various Coder implementations now accepts an optional update_compatibility_version parameter. This parameter is intended to provide context for compatibility during the conversion process.
  • Parameter Propagation: The newly introduced update_compatibility_version parameter is propagated through nested as_deterministic_coder calls within composite coders such as MapCoder, NullableCoder, TupleCoder, TupleSequenceCoder, and IterableCoder.
  • RowCoder Integration: The RowCoder's constructor (__init__) and its as_deterministic_coder method have been updated to accept and utilize the update_compatibility_version parameter, ensuring it's passed down to its component coders when forcing determinism.
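
The parameter propagation described in these highlights can be pictured with a toy model. This is only a sketch under stated assumptions: LeafCoderSketch and TupleCoderSketch are hypothetical classes, the return values are plain tuples rather than real coders, and the signature shown is not claimed to match Beam's as_deterministic_coder.

```python
from typing import List, Optional


class LeafCoderSketch:
    """Hypothetical leaf coder that records the version it was given."""

    def as_deterministic_coder(
        self, step_label: str, update_compatibility_version: Optional[str] = None
    ):
        return ("leaf", update_compatibility_version)


class TupleCoderSketch:
    """Hypothetical composite coder that forwards the version to its parts."""

    def __init__(self, components: List[object]) -> None:
        self.components = components

    def as_deterministic_coder(
        self, step_label: str, update_compatibility_version: Optional[str] = None
    ):
        # Pass the same compatibility version down to every component so
        # nested coders make a consistent choice.
        return (
            "tuple",
            [
                c.as_deterministic_coder(step_label, update_compatibility_version)
                for c in self.components
            ],
        )
```

Calling as_deterministic_coder on an outer TupleCoderSketch with update_compatibility_version="2.67.0" yields a nested structure in which every leaf carries the same version string.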

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to propagate an update_compatibility_version parameter through as_deterministic_coder methods. The changes are applied to many Coder subclasses.

While the parameter is correctly passed down in several composite coders, there are a few issues:

  • In row_coder.py, there's a syntax error in the definition of RowCoder.as_deterministic_coder that will prevent the code from running.
  • In coders.py, several as_deterministic_coder implementations have been updated to accept the new parameter, but they don't use it when creating the new deterministic coder instance. This makes the change incomplete for those coders.

@claudevdm claudevdm force-pushed the coders1-pass-update-comat branch 3 times, most recently from 8112f51 to a6d8607 Compare July 30, 2025 19:05
@claudevdm claudevdm force-pushed the coders1-pass-update-comat branch 5 times, most recently from 6e9be45 to 4597045 Compare August 2, 2025 11:34
@claudevdm claudevdm force-pushed the coders1-pass-update-comat branch 2 times, most recently from a19eeff to 51e1a59 Compare August 26, 2025 18:39
@claudevdm claudevdm marked this pull request as ready for review August 26, 2025 20:02
@claudevdm
Collaborator Author

R: @tvalentyn
R: @damccorm

Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@claudevdm claudevdm changed the title from "Pass update compat through as_deterministic_coder." to "Pass update compat through as_deterministic_coder and use cloudpickle for deterministic special types." Aug 27, 2025
Contributor

@damccorm damccorm left a comment

This generally LGTM, I'll defer to @tvalentyn here though

@github-actions github-actions bot removed the runners label Aug 29, 2025
@claudevdm claudevdm force-pushed the coders1-pass-update-comat branch from 2728c60 to 73b97c4 Compare August 29, 2025 21:25
@claudevdm claudevdm force-pushed the coders1-pass-update-comat branch from 73b97c4 to 13ab3c0 Compare August 29, 2025 21:36
@tvalentyn tvalentyn merged commit e8fab26 into apache:master Aug 30, 2025
85 of 89 checks passed