Member

@LinZhihao-723 LinZhihao-723 commented May 22, 2024

References

Description

This PR introduces a streaming schema tree implementation designed for IR v2. It has the following components:

  • SchemaTreeNode: A class that specifies the node information for each node in the schema tree, including a unique ID, a parent ID, key name, type, and children IDs.
  • SchemaTree: A tree built with SchemaTreeNode, with methods to insert/get tree nodes.
  • Unit tests: test cases that cover the basic functionality.
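
The two components listed above could look roughly like the following. This is a minimal sketch rather than the code in this PR: the names `NodeType`, `NodeId`, `cRootId`, `find`, `insert`, and `get` are hypothetical, and so is the assumption that the root is an unnamed object node with ID 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical names: an illustration of the description, not the PR's API.
enum class NodeType : uint8_t { Int, Float, Str, Bool, UnstructuredArray, Obj };

using NodeId = size_t;
constexpr NodeId cRootId = 0;  // Assumption: the root always has ID 0.

// Per-node information: unique ID, parent ID, key name, type, and children IDs.
struct SchemaTreeNode {
    NodeId id;
    NodeId parent_id;
    std::string key_name;
    NodeType type;
    std::vector<NodeId> children_ids;
};

class SchemaTree {
public:
    // Assumption: the root is an unnamed object node that is its own parent.
    SchemaTree() { m_nodes.push_back({cRootId, cRootId, "", NodeType::Obj, {}}); }

    // Linear scan over the parent's children; no hash map is maintained.
    std::optional<NodeId>
    find(NodeId parent_id, std::string_view key_name, NodeType type) const {
        if (parent_id >= m_nodes.size()) {
            return std::nullopt;
        }
        for (auto const child_id : m_nodes[parent_id].children_ids) {
            auto const& child = m_nodes[child_id];
            if (child.key_name == key_name && child.type == type) {
                return child_id;
            }
        }
        return std::nullopt;
    }

    // Returns the new node's ID, or nullopt if the parent is missing, the
    // parent is not an object, or the node already exists.
    std::optional<NodeId>
    insert(NodeId parent_id, std::string_view key_name, NodeType type) {
        if (parent_id >= m_nodes.size() || NodeType::Obj != m_nodes[parent_id].type) {
            return std::nullopt;
        }
        if (find(parent_id, key_name, type).has_value()) {
            return std::nullopt;
        }
        NodeId const id = m_nodes.size();
        m_nodes.push_back({id, parent_id, std::string{key_name}, type, {}});
        m_nodes[parent_id].children_ids.push_back(id);
        return id;
    }

    SchemaTreeNode const& get(NodeId id) const { return m_nodes.at(id); }

    size_t size() const { return m_nodes.size(); }

private:
    std::vector<SchemaTreeNode> m_nodes;  // A node's ID is its index here.
};
```

Storing nodes in a single vector indexed by ID keeps lookup by ID cheap and makes the memory footprint easy to estimate, at the cost of a linear scan when searching by key name and type.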

Notice that we already have a schema tree implementation in clp-s. The reasons to re-implement the schema tree are the following:

  • The schema tree node maintains different information to track a tree node:
    • The types of schema tree nodes in the IR format differ from those in the Archive format. The types we support in IR format are the following:
      • Integer
      • Float
      • String
      • Boolean
      • Unstructured Array
      • Object
    • There is no need to track the count of each node.
  • The schema tree is designed to be used in our new IR stream. Compared to the existing implementation, this PR makes it more lightweight:
    • The schema tree does not maintain a hash map of existing nodes. This reduces memory usage and removes the need for absl::flat_hash_map when building our FFI libraries. As a tradeoff, finding a node takes O(n) time in the worst case instead of O(1). However, existing profiling results show that this change has a negligible effect on IR v2 stream serialization/deserialization, even when the tree's maximum depth and width are large.
    • With these changes, the memory used by the schema tree is simpler to estimate. This can be helpful if we need to build a heuristic to determine when to rotate an IR stream.
  • When IR serialization fails, the tree needs to be restored to its state before the serialization started, meaning that all nodes inserted during the failed serialization must be removed. SchemaTree has an efficient implementation for this scenario.
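
One way the recovery described in the last point could work is sketched below. This is an assumption about the approach, not the PR's implementation: the names `RevertibleTree`, `take_snapshot`, and `revert` are hypothetical, and the sketch relies on node IDs being assigned in insertion order so that undoing a failed serialization only has to drop the newest nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: names and layout are illustrative only.
struct Node {
    size_t parent_id;
    std::string key_name;
    std::vector<size_t> children_ids;
};

class RevertibleTree {
public:
    RevertibleTree() { m_nodes.push_back({0, "", {}}); }  // Node 0 is the root.

    // Appends a node; IDs are assigned in insertion order.
    size_t insert(size_t parent_id, std::string key_name) {
        assert(parent_id < m_nodes.size());
        size_t const id = m_nodes.size();
        m_nodes.push_back({parent_id, std::move(key_name), {}});
        m_nodes[parent_id].children_ids.push_back(id);
        return id;
    }

    // Records the current node count so later insertions can be undone.
    void take_snapshot() { m_snapshot_size = m_nodes.size(); }

    // Removes every node inserted since the snapshot, newest first.
    // Returns false if no snapshot was taken.
    bool revert() {
        if (!m_snapshot_size.has_value()) {
            return false;
        }
        while (m_nodes.size() > *m_snapshot_size) {
            // The newest node is always the last child its parent received.
            auto& parent = m_nodes[m_nodes.back().parent_id];
            parent.children_ids.pop_back();
            m_nodes.pop_back();
        }
        m_snapshot_size.reset();
        return true;
    }

    size_t size() const { return m_nodes.size(); }

    size_t num_children(size_t id) const { return m_nodes.at(id).children_ids.size(); }

private:
    std::vector<Node> m_nodes;
    std::optional<size_t> m_snapshot_size;
};
```

In this sketch, the cost of a revert is proportional to the number of nodes inserted since the snapshot, independent of the total tree size.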

Validation performed

  • Passed clang-tidy linter check.
  • Ensured the code can be successfully built with unitTest.
  • Ensured new unit tests passed.

Contributor

@gibber9809 gibber9809 left a comment

Nice work! Broadly speaking this looks good to me, I just have a few questions.

  1. If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?

In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).

  2. Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.

* the parent id, the key name, and the node type to locate a unique tree node. This class wraps
* the location information as a non-integer identifier to locate a unique node in the tree.
*/
class TreeNodeLocator {
Contributor

This comment could probably be rephrased to be clearer. Maybe something like:

"When constructing the schema tree we uniquely identify the location of a node being appended to the tree by the unique triple of parent id, key name, and node type. This class stores that triple, and can act as a unique identifier for a node in the tree."

Member Author

  1. How about "appended" -> "inserted"? "Appended" is more implementation-specific, and the doc string doesn't necessarily expose this detail.
  2. Should we add a sentence explaining why the triple is unique? Essentially, under a given parent node, a key name plus a node type identifies a child without ambiguity.
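
To illustrate the uniqueness argument, here is a minimal sketch of a locator that wraps the triple. The class name `NodeLocator`, its accessors, and the `NodeType` enumerators are hypothetical and only loosely modeled on the `TreeNodeLocator` under review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <tuple>

// Hypothetical names: an illustration of the triple, not the class under review.
enum class NodeType : uint8_t { Int, Float, Str, Bool, UnstructuredArray, Obj };

class NodeLocator {
public:
    NodeLocator(size_t parent_id, std::string_view key_name, NodeType type)
            : m_parent_id{parent_id},
              m_key_name(key_name),
              m_type{type} {}

    size_t get_parent_id() const { return m_parent_id; }

    std::string_view get_key_name() const { return m_key_name; }

    NodeType get_type() const { return m_type; }

    // Two locators refer to the same node only if all three parts match.
    friend bool operator==(NodeLocator const& lhs, NodeLocator const& rhs) {
        return std::tie(lhs.m_parent_id, lhs.m_key_name, lhs.m_type)
               == std::tie(rhs.m_parent_id, rhs.m_key_name, rhs.m_type);
    }

    friend bool operator!=(NodeLocator const& lhs, NodeLocator const& rhs) {
        return !(lhs == rhs);
    }

private:
    size_t m_parent_id;
    std::string m_key_name;
    NodeType m_type;
};
```

Under this sketch, the same key name may appear under one parent with different types (for example, "id" as an integer in one log event and as a string in another), and each combination maps to its own node.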

Contributor

Those both sound good to me.

Member Author

pushed

@LinZhihao-723
Member Author

Nice work! Broadly speaking this looks good to me, I just have a few questions.

  1. If we want to mix structured and unstructured logs in the same stream how would that be represented in this schema tree?

In clp-s we're taking the approach that the root of the tree is a node '-1' that has no type, and each different type of log has an unnamed node of the correct type that is a child of that '-1' node (e.g. for JSON logs this would be an unnamed node of type object, and for unstructured logs this would be an unnamed node of type clp string).

  2. Are tree insertions/lookups all O(1) in the context of reading back a stream? It looks like this should be the case but just want to confirm.

  1. For unstructured logs, our higher-level APIs (FFI) should structure the log event. For example, a normal unstructured log event should be serialized to something like: {"timestamp": 100000, "log_level": "INFO", "log_message": "xxxxx"}. At the IR level, we don't differentiate whether the input source is a structured or unstructured log. The only thing that might need special handling in the future is the timestamp.
  2. What does "reading back a stream" mean? Do you mean deserializing the stream? In general, node insertion and lookup are not O(1). For lookup, we traverse all children of the parent, since we don't keep a hash map from node locations to node IDs. For insertion, we add a sanity check to ensure that no node already exists at the given location, which requires a lookup. Based on my previous benchmark, this shouldn't be the bottleneck for either serialization or deserialization. I don't think this check can be skipped during deserialization, as the stream might be corrupted: users could abuse our format to generate an illegal stream. In both cases, the dominant cost is traversing the children to check whether the {key_name, type} pair already exists. If this becomes a bottleneck in the future, we can optimize the implementation by introducing a hash map once the number of children exceeds some threshold.
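
The possible future optimization mentioned in the last answer could be sketched as follows. This is not part of the PR: `ChildIndex`, its threshold parameter, and the string-based composite key are all hypothetical choices made for illustration. The index scans linearly while a parent has few children and builds a hash map only once the child count exceeds the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a per-parent child index; not part of the PR.
enum class NodeType : uint8_t { Int, Float, Str, Bool, UnstructuredArray, Obj };

class ChildIndex {
public:
    explicit ChildIndex(size_t threshold) : m_threshold{threshold} {}

    // Returns the child's node ID if (key_name, type) exists under this parent.
    std::optional<size_t> find(std::string const& key_name, NodeType type) const {
        if (m_use_map) {
            auto const it = m_map.find(make_key(key_name, type));
            if (m_map.end() == it) {
                return std::nullopt;
            }
            return it->second;
        }
        for (auto const& child : m_children) {
            if (child.key_name == key_name && child.type == type) {
                return child.node_id;
            }
        }
        return std::nullopt;
    }

    // Returns false if a child with the same (key_name, type) already exists.
    bool insert(std::string const& key_name, NodeType type, size_t node_id) {
        if (find(key_name, type).has_value()) {
            return false;
        }
        m_children.push_back({key_name, type, node_id});
        if (m_use_map) {
            m_map.emplace(make_key(key_name, type), node_id);
        } else if (m_children.size() > m_threshold) {
            // Switch to hashed lookup once the child count exceeds the threshold.
            for (auto const& child : m_children) {
                m_map.emplace(make_key(child.key_name, child.type), child.node_id);
            }
            m_use_map = true;
        }
        return true;
    }

    bool uses_hash_map() const { return m_use_map; }

private:
    struct Child {
        std::string key_name;
        NodeType type;
        size_t node_id;
    };

    // The numeric type prefix ends at the first ':', so keys cannot collide.
    static std::string make_key(std::string const& key_name, NodeType type) {
        return std::to_string(static_cast<int>(type)) + ':' + key_name;
    }

    size_t m_threshold;
    bool m_use_map{false};
    std::vector<Child> m_children;
    std::unordered_map<std::string, size_t> m_map;
};
```

A design like this keeps narrow parents free of hash map overhead while bounding lookup cost for very wide ones. The duplicate check on insertion is kept in both modes, since a corrupted stream could otherwise introduce a repeated location.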

@LinZhihao-723 LinZhihao-723 requested a review from gibber9809 June 1, 2024 21:11
gibber9809 previously approved these changes Jun 3, 2024
Contributor

@gibber9809 gibber9809 left a comment

LGTM. PR title is also good for commit message.

Member

@kirkrodrigues kirkrodrigues left a comment

Mainly docs + style changes with a few concerns about logic.

kirkrodrigues previously approved these changes Jun 3, 2024
Member

@kirkrodrigues kirkrodrigues left a comment

A few touch-ups. For the PR title, how about:

ffi: Add SchemaTree implementation to support the next IR stream format.

@LinZhihao-723
Member Author

A few touch-ups. For the PR title, how about:

ffi: Add SchemaTree implementation to support the next IR stream format.

The commit message lgtm

@LinZhihao-723 LinZhihao-723 changed the title from "FFI: Add support for schema tree." to "ffi: Add SchemaTree implementation to support the next IR stream format." Jun 4, 2024
@LinZhihao-723 LinZhihao-723 merged commit 3e00d50 into y-scope:main Jun 4, 2024