Adds Confluent wire format handling to arrow-avro crate #8242
b"\x00" as &[u8],
b"\x01" as &[u8],
b"\x02" as &[u8],
b"\x03" as &[u8],
b"\x04" as &[u8],
b"\x05" as &[u8],
b"\x06" as &[u8],
b"\x07" as &[u8],
b"\x08" as &[u8],
b"\t" as &[u8],
b"\n" as &[u8],
b"\x0b" as &[u8],
Curious about why these changes were needed?
This was due to the addition of the sha dependency. It left the compiler unable to decide which `AsRef` implementation to use.
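For readers following along, here is a minimal sketch of what the `as &[u8]` cast does. It does not reproduce the ambiguity itself (that needs the extra dependency in scope); `byte_len` is a hypothetical helper standing in for any API that is generic over `AsRef`.

```rust
// Hypothetical helper: any function generic over `AsRef<[u8]>`.
fn byte_len<T: AsRef<[u8]>>(value: T) -> usize {
    value.as_ref().len()
}

fn main() {
    // A byte-string literal has the array-reference type `&[u8; 1]`.
    let literal: &[u8; 1] = b"\x00";

    // The explicit cast pins the value to the slice type `&[u8]`, so
    // inference no longer has to choose between candidate impls.
    let slice = b"\x00" as &[u8];

    assert_eq!(byte_len(literal), 1);
    assert_eq!(byte_len(slice), 1);
}
```

With the cast in place, every element of the test array has the same concrete type, which is presumably why the change touched each literal.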
Thank you so much for getting this up. Overall this looks really good and your implementation is solid. I did leave a few comments related to ideas I had for improving this a bit more. The biggest one is feature flagging the …
…o, refactor benchmarks and prefix handling
Co-authored-by: Connor Sanders <[email protected]>
LGTM! Thanks again @nathaniel-d-ef
@alamb @mbrobbel Would one of you be able to give this a look when you get a chance?
I just have some generic questions/suggestions.
arrow-avro/src/schema.rs
Outdated
/// let schema = avro.schema().unwrap();
/// let fp = generate_fingerprint(&schema, FingerprintAlgorithm::Rabin).unwrap();
/// ```
pub fn generate_fingerprint(
Would it make sense if this was a method of `Schema`?
@mbrobbel Just wanted to jump in on this one real fast. This makes 100% sense, however I'd argue for making `generate_fingerprint` a method of `AvroSchema`.
We have a general plan to make most of the enums in the `schema.rs` file `pub(crate)` again prior to public release of `arrow-avro`. Meanwhile we'd expose `AvroSchema` publicly. Curious what your thoughts are on this direction however.
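A rough sketch of the shape being proposed here, assuming simplified stand-in types (the `AvroSchema`, `Fingerprint`, and `FingerprintAlgorithm` definitions below are hypothetical, not the crate's). The Rabin routine follows the CRC-64-AVRO algorithm from the Avro specification; for brevity it hashes the raw JSON text rather than the parsing canonical form, and it returns the value directly instead of a `Result`.

```rust
/// "Empty" fingerprint seed of CRC-64-AVRO, as given in the Avro specification.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table used by CRC-64-AVRO.
fn rabin_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // Shift right; XOR in the polynomial when the low bit was set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// 64-bit Rabin fingerprint of a byte slice.
fn rabin64(bytes: &[u8]) -> u64 {
    let table = rabin_table();
    bytes.iter().fold(EMPTY, |fp, &b| {
        (fp >> 8) ^ table[((fp ^ u64::from(b)) & 0xff) as usize]
    })
}

// Hypothetical stand-ins for the crate's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FingerprintAlgorithm {
    Rabin,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Fingerprint {
    Rabin(u64),
}

pub struct AvroSchema {
    pub json_string: String,
}

impl AvroSchema {
    /// The free function becomes a method on the public schema wrapper.
    /// Simplification: hashes the raw JSON, not the canonical form.
    pub fn fingerprint(&self, algorithm: FingerprintAlgorithm) -> Fingerprint {
        match algorithm {
            FingerprintAlgorithm::Rabin => Fingerprint::Rabin(rabin64(self.json_string.as_bytes())),
        }
    }
}

fn main() {
    let schema = AvroSchema { json_string: "\"string\"".to_string() };
    let fp = schema.fingerprint(FingerprintAlgorithm::Rabin);
    println!("{:?}", fp);
}
```

Hanging the method off `AvroSchema` means callers never need the internal enums from `schema.rs`, which fits the plan to make those `pub(crate)` again.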
…ng in arrow-avro; update Cargo.toml for sha256 feature flag
… and optimize message ID creation in arrow-avro reader.
… `clone()` calls in `try_from` implementation and associated tests.
Thanks @nathaniel-d-ef and @mbrobbel - I apologize for the delay in merging
# Which issue does this PR close?

- Part of #4886
- Extends work in #8242

# Rationale for this change

This introduces writer-side fingerprint prefix support, replacing the existing hard-coded Rabin approach with a configurable pattern that builds on the work done on the reader side. In addition to supporting SHA256 and MD5 (feature flagged), we also cover compatibility with Confluent's wire format IDs.

# What changes are included in this PR?

- Replaced fixed Rabin fingerprinting with support for configurable `FingerprintAlgorithm` in schema and writer.
- Removed deprecated methods and unnecessary variable assignments for single-object encoding.
- Simplified prefix generation logic and encoding workflows.
- Updated benchmarks and added unit tests to validate updated fingerprinting strategies.

# Are these changes tested?

Yes, existing tests are all passing, and tests have been added to validate the prefix outputs. Benchmark results show no appreciable changes.

# Are there any user-facing changes?

- Crate is not yet public
- Confluent users are expected to provide the schema store ID when registering a WriterBuilder

---------

Co-authored-by: Connor Sanders <[email protected]>
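To make the prefix variants concrete, here is a minimal sketch of writer-side prefix generation. The byte layouts are the published ones: Avro single-object encoding uses the marker `0xC3 0x01` followed by the 8-byte little-endian Rabin fingerprint, and Confluent's wire format uses a zero magic byte followed by a 4-byte big-endian schema ID. The `Prefix` enum and `encode_prefix` function are hypothetical names, not the crate's API, and the SHA256/MD5 variants are left out because their framing is not described in this thread.

```rust
/// Two-byte marker that opens an Avro single-object encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];
/// Magic byte that opens a Confluent wire-format message.
const CONFLUENT_MAGIC: u8 = 0x00;

/// Hypothetical prefix selector (illustrative only).
enum Prefix {
    /// 64-bit Rabin fingerprint of the writer schema.
    Rabin(u64),
    /// Schema registry ID supplied by the caller.
    ConfluentId(u32),
}

/// Builds the bytes that precede the Avro-encoded body.
fn encode_prefix(prefix: &Prefix) -> Vec<u8> {
    match prefix {
        Prefix::Rabin(fp) => {
            let mut out = Vec::with_capacity(10);
            out.extend_from_slice(&SINGLE_OBJECT_MAGIC);
            out.extend_from_slice(&fp.to_le_bytes()); // little-endian fingerprint
            out
        }
        Prefix::ConfluentId(id) => {
            let mut out = Vec::with_capacity(5);
            out.push(CONFLUENT_MAGIC);
            out.extend_from_slice(&id.to_be_bytes()); // big-endian schema ID
            out
        }
    }
}

fn main() {
    let confluent = encode_prefix(&Prefix::ConfluentId(42));
    let single_object = encode_prefix(&Prefix::Rabin(1));
    println!("{:02x?}", confluent);
    println!("{:02x?}", single_object);
}
```

Because the prefix is fixed for a given schema, a writer can compute it once and reuse it for every message.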
Which issue does this PR close?
Rationale for this change
This introduces support for Confluent schema registry ID handling in the arrow-avro crate, adding compatibility with Confluent's wire format. These improvements enable streaming Apache Kafka, Redpanda, and Pulsar messages with Avro schemas directly into arrow-rs.
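As a reader-side illustration, the sketch below splits a Confluent-framed message into its schema ID and Avro body. The layout (zero magic byte, then a 4-byte big-endian ID) comes from Confluent's wire format documentation; `split_confluent_frame` is a hypothetical helper and not the decoder this PR adds.

```rust
/// Magic byte that opens a Confluent wire-format message.
const CONFLUENT_MAGIC: u8 = 0x00;

/// Returns the schema registry ID and the remaining Avro body,
/// or `None` when the message is too short or has the wrong magic byte.
fn split_confluent_frame(message: &[u8]) -> Option<(u32, &[u8])> {
    if message.len() < 5 || message[0] != CONFLUENT_MAGIC {
        return None;
    }
    // Bytes 1..5 hold the schema ID in big-endian order.
    let id = u32::from_be_bytes([message[1], message[2], message[3], message[4]]);
    Some((id, &message[5..]))
}

fn main() {
    // Magic byte, schema ID 42, then two bytes of (pretend) Avro body.
    let frame = [0x00, 0, 0, 0, 42, 0x02, 0x04];
    match split_confluent_frame(&frame) {
        Some((id, body)) => println!("schema id {id}, body {body:?}"),
        None => println!("not a Confluent frame"),
    }
}
```

The ID recovered here is what a reader would use to look up the writer schema before decoding the body.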
What changes are included in this PR?
Are these changes tested?
Yes, existing tests are all passing, and tests for ID handling have been added. Benchmark results show no appreciable changes.
Are there any user-facing changes?
- Confluent users need to provide the ID fingerprint when using the `set` method, unlike the `register` method, which generates it from the schema on the fly. Existing API behavior has been maintained.
- `SchemaStore` `TryFrom` now accepts a `&HashMap<Fingerprint, AvroSchema>`, rather than a `&[AvroSchema]`.
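The difference between `set` and `register` can be sketched as follows. Everything here is a simplified, hypothetical model: the type definitions, the `Hashed` and `Id` variant names, the `lookup` helper, and the empty-map error are assumptions, and `DefaultHasher` is only a stand-in for the real fingerprint algorithms.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::convert::TryFrom;
use std::hash::{Hash, Hasher};

// Hypothetical stand-ins for the crate's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Fingerprint {
    /// Derived from the schema text (stand-in hash, NOT Rabin).
    Hashed(u64),
    /// Schema registry ID supplied by the caller.
    Id(u32),
}

#[derive(Debug, Clone, PartialEq)]
struct AvroSchema {
    json_string: String,
}

#[derive(Debug, Default)]
struct SchemaStore {
    schemas: HashMap<Fingerprint, AvroSchema>,
}

impl SchemaStore {
    /// Computes the fingerprint from the schema on the fly.
    fn register(&mut self, schema: AvroSchema) -> Fingerprint {
        let mut hasher = DefaultHasher::new();
        schema.json_string.hash(&mut hasher);
        let fp = Fingerprint::Hashed(hasher.finish());
        self.schemas.insert(fp, schema);
        fp
    }

    /// Stores the schema under a caller-provided fingerprint,
    /// e.g. an ID obtained from a schema registry.
    fn set(&mut self, fp: Fingerprint, schema: AvroSchema) -> Fingerprint {
        self.schemas.insert(fp, schema);
        fp
    }

    fn lookup(&self, fp: &Fingerprint) -> Option<&AvroSchema> {
        self.schemas.get(fp)
    }
}

impl TryFrom<&HashMap<Fingerprint, AvroSchema>> for SchemaStore {
    type Error = String;

    /// Assumption for illustration: an empty map is rejected.
    fn try_from(schemas: &HashMap<Fingerprint, AvroSchema>) -> Result<Self, Self::Error> {
        if schemas.is_empty() {
            return Err("schema map is empty".to_string());
        }
        Ok(Self { schemas: schemas.clone() })
    }
}

fn main() {
    let schema = AvroSchema { json_string: "\"string\"".to_string() };
    let mut store = SchemaStore::default();
    let derived = store.register(schema.clone());
    let supplied = store.set(Fingerprint::Id(7), schema);
    println!("{:?} {:?}", derived, supplied);
}
```

Taking a map in `TryFrom` lets callers pair each schema with a registry ID up front, which a plain slice of schemas cannot express.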
Huge shout out to @jecsand838 for his collaboration on this!