optimize openmetrics text parsing (~4x perf) #402
e6d7944 to 7832fed
I don't think this is correct, on several points.
}


def replace_escape_sequence(match):
internal functions and constants should begin with _
labelvalue = []
def _is_character_escaped(s, charpos):
    num_bslashes = 0
    while (charpos > num_bslashes and
Shouldn't this comparison be against 0?
This is also going to be n^2 overall if there's many backslashes.
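One way to read the first point is to bound the backward scan by comparing the index against 0 directly. The sketch below is only an illustration: `is_escaped_at` is a hypothetical name, not the function in this PR, and it still costs time proportional to the run of backslashes on every call.

```python
def is_escaped_at(s, charpos):
    """Return True if s[charpos] is preceded by an odd number of backslashes.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    num_bslashes = 0
    idx = charpos - 1
    # Walk left from the character just before charpos; stop at index 0.
    while idx >= 0 and s[idx] == "\\":
        num_bslashes += 1
        idx -= 1
    # An odd run of backslashes means the last one escapes s[charpos].
    return num_bslashes % 2 == 1
```

Called once per candidate quote, a helper like this rescans the same backslashes repeatedly, which is where the quadratic worst case comes from.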
It's n^2 in the worst case, yes, when many characters are escaped. I've made a change to call the function with a smaller argument: _is_character_escaped(value_substr[:i], i) instead of _is_character_escaped(value_substr, i).
Let me know if you have a better way to optimize this.
# The label name is before the equal
value_start = sub_labels.index("=")
label_name = sub_labels[:value_start]
sub_labels = sub_labels[value_start + 1:].lstrip()
Why are you lstripping here?
value_substr = sub_labels[quote_start:]

# Check for extra commas
if label_name[0] == ',' or value_substr[len(value_substr)-1] == ',':
What if label_name is zero length?
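To make the concern concrete: indexing an empty string with `[0]` raises `IndexError`, while `str.startswith` and `str.endswith` simply return `False`. A minimal sketch, assuming the check only needs to detect a leading or trailing comma (`has_stray_comma` is a hypothetical name, not part of the PR):

```python
def has_stray_comma(label_name, value_substr):
    """Detect a leading comma in the name or a trailing comma in the value.

    Hypothetical helper; safe on empty strings, unlike label_name[0].
    """
    return label_name.startswith(",") or value_substr.endswith(",")


# Demonstrate the failure mode that the comment asks about.
try:
    ""[0]
    empty_index_raises = False
except IndexError:
    empty_index_raises = True
```

Whether an empty label name should be rejected here or elsewhere is a separate question; this only avoids the crash.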
i = 0
while i < len(value_substr):
    i = value_substr.index('"', i)
    if not _is_character_escaped(value_substr, i):
Yeah, this is n^2. Work from the start in one loop.
This is still n^2
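A sketch of what "work from the start in one loop" could look like: scan left to right, and whenever a backslash is seen, skip it together with the character it escapes. Every character is visited at most once, so the scan is linear. The function name is hypothetical, and this is not necessarily the approach the PR ended up with.

```python
def find_closing_quote(s, start=0):
    """Return the index of the first unescaped '"' at or after start, or -1.

    Hypothetical single-pass helper for illustration.
    """
    i = start
    n = len(s)
    while i < n:
        c = s[i]
        if c == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if c == '"':
            return i
        i += 1
    return -1
```

Because the scan never looks backwards, a long run of backslashes costs no more than any other run of characters.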
quote_end = i + 1
label_value = sub_labels[quote_start:quote_end]
# Replace escaping if needed
if escaping:
Couldn't you check each value, rather than the whole string?
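A sketch of the per-value check being suggested: test each label value for a backslash and only run the replacement on values that contain one. The escape table below assumes the three sequences used for label values in the text format (backslash, double quote, line feed); the names are hypothetical and the PR's own `replace_escape_sequence` may differ.

```python
import re

# Assumed escape sequences for label values: \\ , \" and \n.
_ESCAPES = {"\\\\": "\\", '\\"': '"', "\\n": "\n"}
_ESCAPE_RE = re.compile(r'\\[\\"n]')


def unescape_label_value(value):
    """Unescape one label value, skipping the regex when no backslash is present."""
    if "\\" not in value:
        return value
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], value)
```

Checking per value keeps the common case (no escapes at all) on the cheap path even when another label on the same line does need unescaping.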
# `index` and `rindex` methods raise a ValueError with
# `substring not found` message if text doesn't contain label braces
label_start = text.index("{")
label_end = text.rindex("}", 0, text.find(" # "))  # ignore exemplar label braces
What if " # " is part of a label value?
You should also add a testcase for this so it doesn't trip up someone else.
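The case in question can be reproduced with a sample line whose label value contains " # ". The sketch below shows a plain `find` landing inside the quoted value, next to a quote-aware scan that skips quoted regions. The sample line and the helper name are made up for illustration.

```python
def find_exemplar_separator(line):
    """Return the index of the first ' # ' outside double quotes, or -1.

    Hypothetical helper; tracks quote state and skips escaped characters.
    """
    in_quotes = False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes and c == "\\":
            i += 2  # skip the escaped character inside a quoted value
            continue
        if c == '"':
            in_quotes = not in_quotes
        elif not in_quotes and line.startswith(" # ", i):
            return i
        i += 1
    return -1


# Made-up sample: the label value itself contains " # ".
line = 'a_bucket{le="1",note="x # y"} 3 # {trace="t"} 1.0'
naive = line.find(" # ")            # lands inside the quoted label value
aware = find_exemplar_separator(line)  # lands at the real exemplar separator
```

A line like this one would make a reasonable regression test, since the naive search cuts the labels short before the closing brace.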
# Detect what separator is used
separator = " "
if separator not in text:
    separator = "\t"
Tabs aren't supported.
    return Sample(name, labels, value, timestamp, exemplar)

except ValueError as e:
    if str(e).startswith("substring not found"):
This is not a clean way to do this. Use find instead.
label_start = text.index("{")
label_end = text.rindex("}", 0, text.find(" # "))  # ignore exemplar label braces
# The name is before the labels
name = text[:label_start].strip()
Why the strip?
Thank you for reviewing the PR. I've made the requested changes and added a test case; the parsing logic is more solid now.
raise ValueError

# Check for extra commas
if label_name[0] == ',' or value_substr[len(value_substr) - 1] == ',':
value_substr[-1] is more succinct
i = 0
while i < len(value_substr):
    i = value_substr.index('"', i)
    if not _is_character_escaped(value_substr, i):
This is still n^2
# Remove the processed label from the sub-slice for next iteration
sub_labels = sub_labels[quote_end + 1:]
next_comma = sub_labels.find(",") + 1
sub_labels = sub_labels[next_comma:].lstrip()
Why the lstrip?
    return Sample(name, labels, value, timestamp, exemplar)

except ValueError as e:
    if str(e).find("substring not found") > -1:
This is still here, don't look inside error strings
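For illustration, the control flow can branch on the `-1` that `str.find` returns instead of catching `ValueError` and matching its message text. The sketch is simplified: `split_name_and_labels` is a hypothetical helper, and it takes the first "}" after the opening brace, so it ignores braces inside label values and exemplars.

```python
def split_name_and_labels(text):
    """Split 'name{labels} rest' into (name, labels) without inspecting error strings.

    Hypothetical, simplified helper: labels is None when there are no braces.
    """
    label_start = text.find("{")
    if label_start == -1:
        # No label braces: the name is everything before the first space.
        return text.split(" ", 1)[0], None
    label_end = text.find("}", label_start)
    if label_end == -1:
        raise ValueError("Invalid line, unbalanced braces: " + text)
    return text[:label_start], text[label_start + 1:label_end]
```

The "not found" case becomes an ordinary branch, and any `ValueError` that does escape signals a malformed line rather than a missing substring.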
sub_labels = sub_labels[value_start + 1:]

# Find the first quote after the equal
quote_start = sub_labels.index('"') + 1
This is guaranteed to be right after the equals.
value_substr = sub_labels[quote_start:]

# Check for empty label name
if len(label_name) == 0:
Shouldn't the MetricFamily code catch this already?
# Remove the processed label from the sub-slice for next iteration
sub_labels = sub_labels[quote_end + 1:]
next_comma = sub_labels.find(",") + 1
This is guaranteed to be after the ", if present.
value = []
# Detect the labels in the text
try:
    # `index` method raises a ValueError with
Use find instead of index
name = text[:label_start]
seperator = " # "
if not name.endswith("_bucket") or text.count(seperator) == 0:
    # Line doesn't contain an exemplar
What if it (incorrectly) does?
Changed it to if text.count(seperator) == 0: which should guarantee that.
Here is the benchmark after the changes made in this PR; performance is even better now, at ~3.9x.
That looks about right. Could you expand the unit tests to ensure we're covering everything for both the old and new ways of parsing labels? It'd also be great if you could add tests for anything this PR got wrong at any point: if you've made a mistake, others likely will too, and these tests are going to be used as the OpenMetrics test suite.
Added test cases. I guess the PR is ready for a final review :)
    label = text[label_start + 1:label_end]
    labels = _parse_labels(label)
else:
    # Line contains an exemplar
potentially contains
def _parse_labels(text):
    labels = {}
    # Return if we don't have valid labels
    if "=" not in text:
How would this handle something like {a}?
Made the required changes and added some test cases like that.
@unittest.skipIf(sys.version_info < (3, 3), "Test requires Python 3.3+.")
def test_fallback_to_state_machine_label_parsing(self):
    from unittest.mock import patch
As these tests will become the official regression suite for OpenMetrics, it'd be best to test a full line rather than just a function.
Added multiple test cases to assert which functions are called.
Signed-off-by: Ahmed Mezghani <[email protected]>
Thanks!
Great! @brian-brazil :) any plans to release soon?
I'll add it to my todo list.
This PR optimizes the openmetrics parser using the logic introduced in #282 to optimize the prometheus parser.
Here are some benchmarks using timeit: