@nathaniel-d-ef nathaniel-d-ef commented Aug 28, 2025

Which issue does this PR close?

Rationale for this change

This introduces support for Confluent schema registry ID handling in the arrow-avro crate, adding compatibility with Confluent's wire format. These improvements enable streaming Apache Kafka, Redpanda, and Pulsar messages with Avro schemas directly into arrow-rs.
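For readers unfamiliar with the format, here is a minimal sketch of Confluent's wire-format framing: one zero magic byte, a 4-byte big-endian schema registry ID, then the Avro-encoded payload. The function and constant names are illustrative assumptions and are not taken from the arrow-avro API.

```rust
/// Magic byte that starts every Confluent wire-format message.
pub const CONFLUENT_MAGIC: u8 = 0x00;

/// Splits a message into (schema ID, Avro payload).
/// Returns `None` if the 5-byte prefix is missing or the magic byte is wrong.
/// Hypothetical helper, not part of arrow-avro.
pub fn split_confluent_message(msg: &[u8]) -> Option<(u32, &[u8])> {
    if msg.len() < 5 || msg[0] != CONFLUENT_MAGIC {
        return None;
    }
    // The schema registry ID is stored big-endian in bytes 1..5.
    let id = u32::from_be_bytes([msg[1], msg[2], msg[3], msg[4]]);
    Some((id, &msg[5..]))
}

/// Builds the 5-byte prefix a writer would put in front of the payload.
/// Hypothetical helper, not part of arrow-avro.
pub fn confluent_prefix(schema_id: u32) -> [u8; 5] {
    let mut prefix = [0u8; 5];
    prefix[0] = CONFLUENT_MAGIC;
    prefix[1..].copy_from_slice(&schema_id.to_be_bytes());
    prefix
}
```

A decoder would use the extracted ID to look up the writer schema before decoding the remaining bytes.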

What changes are included in this PR?

  • Adds support for Confluent schema registry IDs and Confluent's wire format
  • Adds initial support for the SHA256 and MD5 fingerprint algorithm types. Rabin remains the default.

Are these changes tested?

Yes, existing tests are all passing, and tests for ID handling have been added. Benchmark results show no appreciable changes.

Are there any user-facing changes?

  • Confluent users must supply the schema registry ID as the fingerprint when calling the set method; the register method, by contrast, generates the fingerprint from the schema on the fly. Existing API behavior is unchanged.

  • SchemaStore TryFrom now accepts a &HashMap<Fingerprint, AvroSchema>, rather than a &[AvroSchema]
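The difference between the two entry points can be sketched with a toy store. Everything below (`ToyStore`, `ToyFingerprint`, the method bodies, and the use of the standard library hasher in place of Rabin) is a hypothetical stand-in to illustrate the idea, not the arrow-avro `SchemaStore` API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical key type: either a hash derived from the schema text or an
/// ID assigned by an external registry.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ToyFingerprint {
    Hashed(u64),
    RegistryId(u32),
}

/// Hypothetical store mapping fingerprints to schema JSON strings.
#[derive(Debug, Default)]
pub struct ToyStore {
    schemas: HashMap<ToyFingerprint, String>,
}

impl ToyStore {
    /// Derives the key from the schema itself (a stand-in hash, not Rabin).
    pub fn register(&mut self, schema: &str) -> ToyFingerprint {
        let mut hasher = DefaultHasher::new();
        schema.hash(&mut hasher);
        let fp = ToyFingerprint::Hashed(hasher.finish());
        self.schemas.insert(fp, schema.to_string());
        fp
    }

    /// Stores the schema under a caller-supplied key, such as a registry ID
    /// that cannot be derived from the schema text.
    pub fn set(&mut self, fp: ToyFingerprint, schema: &str) {
        self.schemas.insert(fp, schema.to_string());
    }

    /// Returns the schema stored under `fp`, if any.
    pub fn lookup(&self, fp: &ToyFingerprint) -> Option<&str> {
        self.schemas.get(fp).map(String::as_str)
    }
}

/// Mirrors the idea of building a store from an existing fingerprint map.
impl From<&HashMap<ToyFingerprint, String>> for ToyStore {
    fn from(map: &HashMap<ToyFingerprint, String>) -> Self {
        Self { schemas: map.clone() }
    }
}
```

A registry-assigned ID carries no information about the schema text, which is why the caller has to provide it explicitly in this model.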

Huge shout out to @jecsand838 for his collaboration on this!

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Aug 28, 2025
Comment on lines +1920 to +1931
b"\x00" as &[u8],
b"\x01" as &[u8],
b"\x02" as &[u8],
b"\x03" as &[u8],
b"\x04" as &[u8],
b"\x05" as &[u8],
b"\x06" as &[u8],
b"\x07" as &[u8],
b"\x08" as &[u8],
b"\t" as &[u8],
b"\n" as &[u8],
b"\x0b" as &[u8],
Contributor

Curious about why these changes were needed?

Contributor Author

This was due to the addition of the sha dependency, which left the compiler unable to decide which AsRef implementation to use.

@jecsand838
Contributor

@nathaniel-d-ef

Thank you so much for getting this up.

Overall this looks really good and your implementation is solid. I did leave a few comments related to ideas I had for improving this a bit more. The biggest one is feature flagging the md5 and sha2 fingerprint hashing.

Contributor

@jecsand838 jecsand838 left a comment

LGTM! Thanks again @nathaniel-d-ef

@alamb @mbrobbel Would one of you be able to give this a look when you get a chance?

Member

@mbrobbel mbrobbel left a comment

I just have some generic questions/suggestions.

/// let schema = avro.schema().unwrap();
/// let fp = generate_fingerprint(&schema, FingerprintAlgorithm::Rabin).unwrap();
/// ```
pub fn generate_fingerprint(
Member

Would it make sense if this was a method of Schema?

Contributor

@jecsand838 jecsand838 Aug 29, 2025

@mbrobbel Just wanted to jump in on this one real fast. This makes 100% sense; however, I'd argue for making generate_fingerprint a method of AvroSchema.

We have a general plan to make most of the enums in the schema.rs file pub(crate) again prior to the public release of arrow-avro, while exposing AvroSchema publicly. Curious what your thoughts are on this direction, though.

nathaniel-d-ef and others added 4 commits August 30, 2025 00:19
…ng in arrow-avro; update Cargo.toml for sha256 feature flag
… and optimize message ID creation in arrow-avro reader.
… `clone()` calls in `try_from` implementation and associated tests.
@alamb alamb merged commit 07e0953 into apache:main Sep 4, 2025
23 of 24 checks passed
Contributor

alamb commented Sep 4, 2025

Thanks @nathaniel-d-ef and @mbrobbel - I apologize for the delay in merging

@jecsand838 jecsand838 deleted the ENG-72 branch September 9, 2025 01:35
alamb pushed a commit that referenced this pull request Sep 22, 2025
# Which issue does this PR close?

- Part of #4886
- Extends work in #8242

# Rationale for this change

This introduces writer-side fingerprint prefix support, replacing the
existing hard-coded Rabin approach with a configurable pattern that
extends the work done on the reader side. In addition to supporting
SHA256 and MD5 (both feature flagged), it also covers compatibility with
Confluent's wire format IDs.
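
As background on the Rabin case, here is a minimal sketch of the prefix defined by Avro single-object encoding: the two-byte marker `C3 01` followed by the 8-byte little-endian CRC-64-AVRO (Rabin) fingerprint of the schema. The function names are illustrative assumptions and are not the arrow-avro writer API. The sketch also assumes the caller passes the schema already in Parsing Canonical Form.

```rust
/// Two-byte marker that starts Avro single-object encoding.
pub const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Initial value ("EMPTY") of the 64-bit Rabin fingerprint in the Avro spec.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table described in the Avro specification.
fn rabin_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // All ones if the low bit is set, otherwise zero.
            let mask = (fp & 1).wrapping_neg();
            fp = (fp >> 1) ^ (EMPTY & mask);
        }
        *slot = fp;
    }
    table
}

/// CRC-64-AVRO (Rabin) fingerprint over raw bytes.
/// The table is rebuilt on every call to keep the sketch short.
pub fn rabin_fingerprint(bytes: &[u8]) -> u64 {
    let table = rabin_table();
    bytes.iter().fold(EMPTY, |fp, &b| {
        (fp >> 8) ^ table[((fp ^ u64::from(b)) & 0xff) as usize]
    })
}

/// Builds the 10-byte single-object prefix for a schema.
/// Assumes `canonical_schema` is already in Parsing Canonical Form.
pub fn single_object_prefix(canonical_schema: &str) -> [u8; 10] {
    let mut prefix = [0u8; 10];
    prefix[..2].copy_from_slice(&SINGLE_OBJECT_MAGIC);
    let fp = rabin_fingerprint(canonical_schema.as_bytes());
    prefix[2..].copy_from_slice(&fp.to_le_bytes());
    prefix
}
```

A configurable writer would swap this prefix for a different one, such as a Confluent registry ID, depending on the selected algorithm.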

# What changes are included in this PR?

- Replaced fixed Rabin fingerprinting with support for configurable
`FingerprintAlgorithm` in schema and writer.
- Removed deprecated methods and unnecessary variable assignments for
single-object encoding.
- Simplified prefix generation logic and encoding workflows.
- Updated benchmarks and added unit tests to validate updated
fingerprinting strategies.

# Are these changes tested?

Yes, existing tests are all passing, and tests have been added to
validate the prefix outputs. Benchmark results show no appreciable
changes.

# Are there any user-facing changes?

- The crate is not yet public
- Confluent users are expected to provide the schema store ID when
configuring a WriterBuilder

---------

Co-authored-by: Connor Sanders