-
Notifications
You must be signed in to change notification settings - Fork 83
ffi: Add SchemaTree implementation to support the next IR stream format. #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Broadly speaking this looks good to me, I just have a few questions.
- If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?
In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).
- Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.
| * the parent id, the key name, and the node type to locate a unique tree node. This class wraps | ||
| * the location information as a non-integer identifier to locate a unique node in the tree. | ||
| */ | ||
| class TreeNodeLocator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment could probably be rephrased to be more clear. Maybe something like
"When constructing the schema tree we uniquely identify the location of a node being appended to the try by the unique triple of parent id, key name, and node type. This class
stores that triple, and can act as a unique identifier for a node in the tree."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- How about "appended" -> "inserted"? "Appended" is more implementation-specific, and the doc string doesn't necessarily expose this detail.
- Shall we add a sentence to explain why the triple is unique? Essentially, it's because key name + node type should not have any ambiguity for a parent node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those both sound good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pushed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. PR title is also good for commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly docs + style changes with a few concerns about logic.
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few touch-ups. For the PR title, how about:
ffi: Add SchemaTree implementation to support the next IR stream format.
Co-authored-by: kirkrodrigues <[email protected]>
The commit message lgtm |
References
Description
This PR introduces a streaming schema tree implementation designed for IR v2. It has the following components:
SchemaTreeNode: A class that specifies the node information for each node in the schema tree, including a unique ID, a parent ID, key name, type, and children IDs.SchemaTree: A tree built with SchemaTreeNode, with methods to insert/get tree nodes.Notice that we already have a schema tree implementation in
clp-s. The reasons to re-implement the schema tree are the following:absl::flat_hash_mapwhen building our FFI libraries. As a tradeoff, the worst-case time complexity of node finding takes O(n) instead of O(1). However, from existing profiling results, this change has negligible influence on the IR v2 stream serialization/deserialization, even when the tree max depth and max width are large.SchemaTreehas an efficient implementation for this scenario.Validation performed
clang-tidylinter check.unitTest.