optimize openmetrics text parsing (~4x perf) #402

Merged
merged 10 commits into from
May 17, 2019

Conversation

Contributor

@ahmed-mez ahmed-mez commented May 6, 2019

This PR optimizes the openmetrics parser using the logic introduced in #282 to optimize the prometheus parser.

Here are some benchmarks using timeit:

call (x100000): _parse_sample('simple_metric 1.513767429e+09')

Simple example with prometheus parser: 0.2489180564880371
Simple example with openmetrics parser: 1.1144659519195557
Simple example with the optimized openmetrics parser: 0.5948491096496582

call (x100000): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungaged-panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')

KSM metric example with prometheus parser: 1.6796550750732422
KSM metric example openmetrics parser: 6.6183180809021
KSM metric example optimized openmetrics parser: 2.0289480686187744
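
For readers who want to reproduce this kind of measurement, a minimal timeit harness could look like the sketch below. It is an assumption about the setup, not the script used for this PR: `toy_parse_sample` is a hypothetical stand-in for a parser's `_parse_sample`, so the snippet runs without prometheus_client installed.

```python
import timeit


def toy_parse_sample(text):
    """Hypothetical stand-in for a parser's _parse_sample (name and value only)."""
    name, _, value = text.partition(" ")
    return name, float(value)


def bench(func, line, number=100000):
    """Return the total seconds taken by `number` calls of func(line)."""
    return timeit.timeit(lambda: func(line), number=number)


# A small `number` keeps the demo quick; the PR used x100000.
elapsed = bench(toy_parse_sample, "simple_metric 1.513767429e+09", number=1000)
print("toy parser, x1000: %.6f s" % elapsed)
```

Swapping `toy_parse_sample` for the real function under test gives comparable totals, as long as `number` is the same for every parser.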

@ahmed-mez ahmed-mez force-pushed the master branch 2 times, most recently from e6d7944 to 7832fed Compare May 6, 2019 10:37
Contributor

@brian-brazil brian-brazil left a comment

I don't think this is correct, on several points.

}


def replace_escape_sequence(match):
Contributor

internal functions and constants should begin with _

labelvalue = []
def _is_character_escaped(s, charpos):
    num_bslashes = 0
    while (charpos > num_bslashes and
Contributor

Shouldn't this comparison be against 0?

This is also going to be n^2 overall if there's many backslashes.

Contributor Author

Yes, it's n^2 in the worst case, when many characters are escaped. I've made a change: the function is now called with a smaller argument, _is_character_escaped(value_substr[:i], i) instead of _is_character_escaped(value_substr, i).
Let me know if you can think of a better way to optimize this 👍
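
To make the thread easier to follow, here is a sketch of how a helper of this shape could work. Only the signature and the first two lines are visible in the diff above, so the rest of the body is an assumption and the name `is_escaped` is hypothetical. It counts the consecutive backslashes immediately before `charpos`; an odd count means the character is escaped.

```python
def is_escaped(s, charpos):
    """Return True if s[charpos] is preceded by an odd number of backslashes."""
    num_bslashes = 0
    # In this sketch, charpos > num_bslashes keeps the index below >= 0.
    while (charpos > num_bslashes and
           s[charpos - 1 - num_bslashes] == '\\'):
        num_bslashes += 1
    return num_bslashes % 2 == 1


print(is_escaped('a\\"b', 2))    # the quote follows one backslash
print(is_escaped('a\\\\"', 3))   # the quote follows two backslashes
```

Note that slicing a Python string copies it, so passing `value_substr[:i]` adds a copy of the prefix on every call.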

# The label name is before the equal
value_start = sub_labels.index("=")
label_name = sub_labels[:value_start]
sub_labels = sub_labels[value_start + 1:].lstrip()
Contributor

Why are you lstripping here?

value_substr = sub_labels[quote_start:]

# Check for extra commas
if label_name[0] == ',' or value_substr[len(value_substr)-1] == ',':
Contributor

What if label_name is zero length?
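
The concern can be illustrated with a small sketch: indexing an empty string raises IndexError, while `startswith` and `endswith` simply return False. The helper name `has_stray_comma` is hypothetical and is not part of the PR.

```python
def has_stray_comma(label_name, value_substr):
    """Return True if the name starts with, or the value ends with, a comma.

    startswith/endswith return False on empty strings, whereas
    label_name[0] would raise IndexError when label_name == "".
    """
    return label_name.startswith(",") or value_substr.endswith(",")


try:
    ""[0]
except IndexError as exc:
    print("indexing an empty name fails:", exc)

print(has_stray_comma("", ""))
```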

i = 0
while i < len(value_substr):
    i = value_substr.index('"', i)
    if not _is_character_escaped(value_substr, i):
Contributor

Yeah, this is n^2. Work from the start in one loop.

Contributor

This is still n^2

Contributor Author

Yes, it's n^2 in the worst case, when many characters are escaped. I've made a change: the function is now called with a smaller argument, _is_character_escaped(value_substr[:i], i) instead of _is_character_escaped(value_substr, i).
Let me know if you can think of a better way to optimize this 👍
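
One way to read the suggestion "work from the start in one loop" is a single forward scan that tracks whether the current character is escaped. The sketch below is a hypothetical illustration of that idea, not the code that was merged; `find_closing_quote` is an invented name.

```python
def find_closing_quote(value_substr):
    """Return the index of the first unescaped '"', or -1 if there is none.

    The string is scanned once from the start, so the cost is linear in its
    length no matter how many backslashes it contains.
    """
    escaped = False
    for i, char in enumerate(value_substr):
        if escaped:
            escaped = False      # this character was escaped; consume it
        elif char == '\\':
            escaped = True       # the next character is escaped
        elif char == '"':
            return i
    return -1


print(find_closing_quote('a\\"b",rest'))
```

Because the escape state is carried forward, no call ever has to look backwards over a run of backslashes.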

quote_end = i + 1
label_value = sub_labels[quote_start:quote_end]
# Replace escaping if needed
if escaping:
Contributor

Couldn't you check each value, rather than the whole string?
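
A per-value check could look like the following sketch, which only runs the regex substitution when the individual value contains a backslash. The escape table and all names here are assumptions made for illustration (they follow the leading-underscore convention requested earlier for internal names).

```python
import re

# Hypothetical escape table: \\ -> \, \n -> newline, \" -> "
_ESCAPE_SEQUENCES = {'\\\\': '\\', '\\n': '\n', '\\"': '"'}
_ESCAPING_RE = re.compile(r'\\[\\n"]')


def _replace_escape_sequence(match):
    return _ESCAPE_SEQUENCES[match.group(0)]


def unescape_label_value(label_value):
    """Unescape one label value, skipping the regex when no backslash is present."""
    if '\\' not in label_value:
        return label_value
    return _ESCAPING_RE.sub(_replace_escape_sequence, label_value)


print(unescape_label_value('say \\"hi\\"'))
```

With this shape, a line in which only one label value is escaped pays the substitution cost for that value alone.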

# `index` and `rindex` methods raise a ValueError with
# `substring not found` message if text doesn't contain label braces
label_start = text.index("{")
label_end = text.rindex("}", 0, text.find(" # ")) # ignore exemplar label braces
Contributor

What if " # " is part of a label value?

You should also add a testcase for this so it doesn't trip up someone else.
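
The failure mode is easy to reproduce with plain string methods. In the sketch below the sample line is made up for illustration; it has no exemplar, but its label value contains " # ".

```python
line = 'a{foo="b # c"} 1'   # valid sample, no exemplar

cut = line.find(" # ")                 # lands inside the quoted label value
label_end = line.rfind("}", 0, cut)    # the real brace is after the cut

print(cut, label_end, line.index("}"))   # prints: 8 -1 13
```

Searching for the closing brace only before the first " # " therefore misses it, which is why a test case with this kind of label value is worth having.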

# Detect what separator is used
separator = " "
if separator not in text:
    separator = "\t"
Contributor

Tabs aren't supported.

    return Sample(name, labels, value, timestamp, exemplar)

except ValueError as e:
    if str(e).startswith("substring not found"):
Contributor

This is not a clean way to do this. Use find instead.
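
As a sketch of the alternative: `str.find` and `str.rfind` return -1 when the substring is missing, so the "no labels" case can be decided from the return value instead of from the text of a ValueError. The helper `label_span` is hypothetical and ignores exemplars for brevity.

```python
def label_span(text):
    """Return (start, end) of the outer label braces, or None if absent or malformed."""
    label_start = text.find("{")
    label_end = text.rfind("}")
    if label_start == -1 or label_end < label_start:
        return None
    return label_start, label_end


print(label_span('a{b="c"} 1'))
print(label_span('simple_metric 1'))
```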

label_start = text.index("{")
label_end = text.rindex("}", 0, text.find(" # ")) # ignore exemplar label braces
# The name is before the labels
name = text[:label_start].strip()
Contributor

Why the strip?

@ahmed-mez
Contributor Author

Thank you for reviewing the PR. I've made the requested changes and added a test case; the parsing logic is more solid now.
I look forward to a second review. Thanks!

raise ValueError

# Check for extra commas
if label_name[0] == ',' or value_substr[len(value_substr) - 1] == ',':
Contributor

value_substr[-1] is more succinct

i = 0
while i < len(value_substr):
    i = value_substr.index('"', i)
    if not _is_character_escaped(value_substr, i):
Contributor

This is still n^2

# Remove the processed label from the sub-slice for next iteration
sub_labels = sub_labels[quote_end + 1:]
next_comma = sub_labels.find(",") + 1
sub_labels = sub_labels[next_comma:].lstrip()
Contributor

Why the lstrip?

    return Sample(name, labels, value, timestamp, exemplar)

except ValueError as e:
    if str(e).find("substring not found") > -1:
Contributor

This is still here, don't look inside error strings

sub_labels = sub_labels[value_start + 1:]

# Find the first quote after the equal
quote_start = sub_labels.index('"') + 1
Contributor

This is guaranteed to be right after the equals.
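
If the opening quote must directly follow the equals sign, there is nothing to search for: the parser can check a fixed position and reject anything else. A minimal sketch of that idea, with a hypothetical helper name:

```python
def split_label(sub_labels):
    """Split 'name="rest' into (name, rest), requiring '"' right after '='."""
    value_start = sub_labels.find("=")
    if value_start == -1 or sub_labels[value_start + 1:value_start + 2] != '"':
        raise ValueError("expected '=\"' in labels: %r" % sub_labels)
    # Skip both '=' and the opening quote.
    return sub_labels[:value_start], sub_labels[value_start + 2:]


print(split_label('foo="bar"'))
```

Input such as `foo= "bar"` raises ValueError here instead of being silently accepted.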

value_substr = sub_labels[quote_start:]

# Check for empty label name
if len(label_name) == 0:
Contributor

Shouldn't the MetricFamily code catch this already?


# Remove the processed label from the sub-slice for next iteration
sub_labels = sub_labels[quote_end + 1:]
next_comma = sub_labels.find(",") + 1
Contributor

This is guaranteed to be after the ", if present.

value = []
# Detect the labels in the text
try:
# `index` method raises a ValueError with
Contributor

Use find instead of index

name = text[:label_start]
seperator = " # "
if not name.endswith("_bucket") or text.count(seperator) == 0:
    # Line doesn't contain an exemplar
Contributor

What if it (incorrectly) does?

Contributor Author

Changed it to if text.count(seperator) == 0:
That should guarantee it 👍

@ahmed-mez
Contributor Author

Here is the benchmark after the changes made in this PR. Performance is even better now, at ~3.9x:

call (x100000): _parse_sample('simple_metric 1.513767429e+09')

Simple example with prometheus parser: 0.24088597297668457
Simple example with openmetrics parser: 1.116285800933838
Simple example with the optimized openmetrics parser: 0.48735499382019043

call (x100000): _parse_sample('kube_service_labels{label_app="kube-state-metrics",label_chart="kube-state-metrics-0.5.0",label_heritage="Tiller",label_release="ungaged-panther",namespace="default",service="ungaged-panther-kube-state-metrics"} 1')

KSM metric example with prometheus parser: 1.608799934387207
KSM metric example openmetrics parser: 6.636054039001465
KSM metric example optimized openmetrics parser: 1.7176191806793213
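
The speedup figures follow directly from the timings above. This short calculation uses only the numbers reported in this comment; the ~3.9x figure corresponds to the KSM example, and the simple example comes out lower.

```python
# Timings in seconds (x100000 calls), copied from the benchmark above.
simple = {"openmetrics": 1.116285800933838, "optimized": 0.48735499382019043}
ksm = {"openmetrics": 6.636054039001465, "optimized": 1.7176191806793213}

for label, t in (("simple", simple), ("KSM", ksm)):
    speedup = t["openmetrics"] / t["optimized"]
    print("%s: %.1fx faster" % (label, speedup))
# prints:
# simple: 2.3x faster
# KSM: 3.9x faster
```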

@brian-brazil
Contributor

That looks about right. Could you expand the unit tests to ensure we're covering everything for both the old and new way of parsing labels? Also, it'd be great if you could add tests for anything this PR got wrong at any point: if you've made a mistake, others likely will too, and these tests are going to be used as the openmetrics test suite.

@ahmed-mez ahmed-mez changed the title optimize openmetrics text parsing (~3.3x perf) optimize openmetrics text parsing (~4x perf) May 15, 2019
@ahmed-mez
Contributor Author

Added test cases, I guess the PR is ready for a final review :)

    label = text[label_start + 1:label_end]
    labels = _parse_labels(label)
else:
    # Line contains an exemplar
Contributor

potentially contains

def _parse_labels(text):
    labels = {}
    # Return if we don't have valid labels
    if "=" not in text:
Contributor

How would this handle something like {a} ?

Contributor Author

@ahmed-mez ahmed-mez May 16, 2019

made the required changes and added some test cases like that
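
The case raised here is label text that is non-empty but contains no "=". A guard for it might look like the sketch below. This is an illustration under assumptions, not the merged change: the helper name is hypothetical, and it assumes empty braces (as in `a{} 1`) should still be accepted.

```python
def check_labels_text(text):
    """Raise ValueError for non-empty label text that has no '='.

    Empty text (as in 'a{} 1') is accepted; 'a' (as in 'a{a} 1') is not.
    """
    if text and "=" not in text:
        raise ValueError("invalid labels: %r" % text)


check_labels_text("")          # accepted
check_labels_text('a="b"')     # accepted
try:
    check_labels_text("a")
except ValueError as exc:
    print("rejected:", exc)
```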


@unittest.skipIf(sys.version_info < (3, 3), "Test requires Python 3.3+.")
def test_fallback_to_state_machine_label_parsing(self):
    from unittest.mock import patch
Contributor

As these tests will become the official regression suite for OpenMetrics, it'd be best to test a full line rather than just a function.

Contributor Author

Added multiple test cases that assert which functions are called 👍
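
A test of this kind feeds a full line to the parser and uses `unittest.mock.patch` to assert which label-parsing path was taken. The sketch below shows the pattern on a toy parser so that it runs on its own; `ToyParser` and its method names are hypothetical stand-ins, not the prometheus_client API.

```python
import unittest
from unittest.mock import patch


class ToyParser(object):
    """Hypothetical parser with a fast path and a state-machine fallback."""

    @staticmethod
    def _parse_labels_fast(text):
        name, _, value = text.partition("=")
        return {name: value.strip('"')}

    @staticmethod
    def _parse_labels_with_state_machine(text):
        raise NotImplementedError("fallback body omitted in this sketch")

    @classmethod
    def parse_line(cls, line):
        labels_text = line[line.index("{") + 1:line.rindex("}")]
        # Escaped characters take the slower, more careful path.
        if "\\" in labels_text:
            return cls._parse_labels_with_state_machine(labels_text)
        return cls._parse_labels_fast(labels_text)


class TestFallback(unittest.TestCase):
    def test_full_line_with_escape_uses_state_machine(self):
        with patch.object(ToyParser, "_parse_labels_with_state_machine") as mock:
            ToyParser.parse_line('a{foo="b\\"c"} 1')
            mock.assert_called_once_with('foo="b\\"c"')

    def test_full_line_without_escape_uses_fast_path(self):
        with patch.object(ToyParser, "_parse_labels_with_state_machine") as mock:
            labels = ToyParser.parse_line('a{foo="bar"} 1')
            mock.assert_not_called()
            self.assertEqual(labels, {"foo": "bar"})
```

Saved to a file, the test case can be run with `python -m unittest`. Testing the whole line rather than the helper means the assertions keep holding if the internal split between the two paths changes.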

ahmed-mez added 10 commits May 16, 2019 15:47
Signed-off-by: Ahmed Mezghani <[email protected]>
@brian-brazil brian-brazil merged commit 6740213 into prometheus:master May 17, 2019
@brian-brazil
Contributor

Thanks!

@ahmed-mez
Contributor Author

Great! @brian-brazil :) any plans to release soon?

@brian-brazil
Contributor

I'll add it to my todo list.
