more ssl stats #1837
Conversation
kazuho left a comment
Thank you for the PR.
The design looks fine and I have only one question to ask (see below).
include/h2o.h (Outdated)

 * summations of handshake latency in microsecond
 */
uint64_t handshake_full_latency;
uint64_t handshake_resume_latency;
What is the intent of collecting these values?
FWIW, I can understand that you might want to track how the average latency changes over time. To do that, I can see that you would periodically collect the stats and calculate (cur_latency - prev_latency) / (cur_handshake_count - prev_handshake_count) to obtain the average latency between the previous stats collection and the current one.
What I wonder is how you would use the data. Even though it is a nice-to-have, I am not sure it can be used for analyzing issues, etc., because the time spent to finish a handshake depends not only on the mode of the handshake (i.e. full vs. resume) but also on the version of TLS, the use of false start, and the use of HRR (in case TLS 1.3 is used).
I think I might be fine with merging the PR as-is (we might want to change the name to "accumulated_latency" or something, though), but I am wondering if the information being exposed here is sufficient.
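As a rough illustration of that delta computation, a minimal C sketch might look like the following; the snapshot struct and field names are hypothetical, not h2o's actual API.

#include <stdint.h>

/* Hypothetical snapshot of the counters reported by the stats handler. */
struct ssl_stats_snapshot {
    uint64_t handshake_full_count;
    uint64_t handshake_full_latency; /* accumulated latency, in microseconds */
};

/* Average full-handshake latency (in microseconds) between two snapshots,
 * i.e. (cur_latency - prev_latency) / (cur_handshake_count - prev_handshake_count). */
static double avg_full_handshake_latency(const struct ssl_stats_snapshot *prev,
                                         const struct ssl_stats_snapshot *cur)
{
    uint64_t handshakes = cur->handshake_full_count - prev->handshake_full_count;
    if (handshakes == 0)
        return 0.0;
    return (double)(cur->handshake_full_latency - prev->handshake_full_latency) / handshakes;
}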
@kazuho even though I agree that the latency measurements may be of limited usefulness, they could be useful in determining whether, say, a lack of resources (e.g. a busy CPU) may be impairing handshakes and increasing latency. So, if a spike in latency is detected externally, these stats may help in determining why (by providing finer-grained detail). Granted, CPU monitoring would accomplish the same, but that may not be available at all times. Also, if something like neverbleed is in the picture (vs. not), its effect on latency, if any, would be more easily measured.
Regarding the implementation, it LGTM as well 👍
However, I have a more generic suggestion around stats, not entirely about this PR: couldn't the whole implementation of stats/counters (or at least the majority of them) be made lock free? Considering these are cumulative stats, which are already per worker and use per-thread data, I reckon the stats could use thread-local variables/arrays (with an aggregator to sum the values when needed). (I have a sample implementation doing just that which I could share.)
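As a rough sketch of that thread-local idea (not the sample implementation mentioned above; all names are hypothetical): each worker registers a counter slot once, increments only its own slot on the hot path, and an aggregator sums the slots on demand.

#include <stdatomic.h>
#include <stdint.h>

#define MAX_STAT_THREADS 256

/* Hypothetical per-thread counter slots; every worker registers one slot at startup. */
static _Atomic uint64_t handshake_slots[MAX_STAT_THREADS];
static _Atomic unsigned num_slots;
static _Thread_local _Atomic uint64_t *my_slot;

/* Called once from each worker thread before it starts serving connections. */
static void stats_register_thread(void)
{
    unsigned idx = atomic_fetch_add(&num_slots, 1);
    my_slot = &handshake_slots[idx];
}

/* Hot path: the owning thread is the only writer of its slot, so there is no contention. */
static void stats_count_handshake(void)
{
    atomic_fetch_add_explicit(my_slot, 1, memory_order_relaxed);
}

/* Aggregator: sums every registered slot when the stats are requested. */
static uint64_t stats_total_handshakes(void)
{
    uint64_t total = 0;
    unsigned n = atomic_load(&num_slots);
    for (unsigned i = 0; i < n; ++i)
        total += atomic_load_explicit(&handshake_slots[i], memory_order_relaxed);
    return total;
}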
@robguima The reason I asked is that a lack of resources is not the only factor that affects the latency. In TLS 1.2 and earlier, I assume the numbers will reflect 1 RTT plus the time spent processing handshake messages. In TLS 1.3, in most cases, the time will just reflect the time spent processing handshake messages (if HRR is not used), or 1 RTT plus the time spent processing handshake messages (if HRR is used).
Considering the fact that the measured value includes RTT, and that we are going to start seeing TLS 1.3 connections for which the measured value does not include RTT, I would be cautious about using this value for detecting issues, though I am not generally against measuring it and using it to support the verification of other metrics.
However, I have a more generic suggestion around stats, and not entirely about this PR: couldn't perhaps the whole implementation of stats/counters (or at least the majority of them) be made lock free?
Actually, it's lock free. The stats are tied to h2o_context_t, which is a per-event-loop (i.e. per-thread) structure. When the stats handler is called, it gathers the information from all the threads (this is when a mutex is used) and reports it back to the client.
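A simplified C sketch of that arrangement (hypothetical types and names, not h2o's actual code): each worker updates counters in its own per-thread context without locking, and a mutex is taken only while the stats handler aggregates them.

#include <pthread.h>
#include <stdint.h>

/* Simplified stand-in for the per-event-loop (per-thread) context. */
struct worker_ctx {
    uint64_t handshake_full;
    uint64_t handshake_resume;
};

/* Aggregation state used only while a stats request is being served. */
struct stats_collector {
    pthread_mutex_t mutex;
    uint64_t total_full;
    uint64_t total_resume;
};

/* Hot path: each worker updates its own context, no locking involved. */
static void on_handshake_done(struct worker_ctx *ctx, int is_resume)
{
    if (is_resume)
        ++ctx->handshake_resume;
    else
        ++ctx->handshake_full;
}

/* Called for every worker context when the stats handler runs; the mutex
 * guards only the aggregation, not the per-thread counters themselves. */
static void collect_from_context(struct stats_collector *c, const struct worker_ctx *ctx)
{
    pthread_mutex_lock(&c->mutex);
    c->total_full += ctx->handshake_full;
    c->total_resume += ctx->handshake_resume;
    pthread_mutex_unlock(&c->mutex);
}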
Ah, I see. Hmm, wondering what the best solution would be then... Perhaps splitting it further into the different categories (especially once 1.3 is in use, and especially 0-RTT)?
On the stats, good to know, thanks for the explanation. 👍 On the other hand, the aggregator could also be made lock free (with preallocated/registered thread-local data). Anyway, it does not seem that important now, considering that the contention around locking the aggregator is probably very low (and infrequent).
I am inclined to keep this property as-is and see how it goes, considering the fact that nobody has argued against having it and maintaining it is easy.
We will have the chance to improve the metrics (by adding new ones).
I think this is enough for now, and when we want more fine-grained metrics, doing something like the following would be nice: ssl.handshake.full-tls13-hrr. The last component is the key for aggregation, and we can add as much detail as we want. And even in such a case, we can provide
after some tweaks the final stats are like the following:

{
  "ssl.errors": 0,                                /* counter for ssl handshake errors */
  "ssl.alpn.h1": 0,                               /* the number of times h2o selected h1 in ALPN */
  "ssl.alpn.h2": 0,                               /* the number of times h2o selected h2 in ALPN */
  "ssl.handshake.full": 1,                        /* the number of times full handshake happened */
  "ssl.handshake.resume": 1,                      /* the number of times resume handshake happened */
  "ssl.handshake.accumlated-time.full": 15748,    /* sum of time for full handshakes in microseconds */
  "ssl.handshake.accumlated-time.resume": 2434    /* sum of time for resume handshakes in microseconds */
}
Thank you for the changes. I like the change of the name.
@i110 I think that the PR is ready for merge.
This PR adds the following properties to the status JSON response.
ssl.errors belongs to the events module, and the others to the newly added ssl module.
One small concern: is uint64_t enough to store the summation of microsecond latencies? Assuming that, in the worst case, the average handshake latency is 100ms (100,000µs) and h2o serves 1,000,000 req/sec, it'll take roughly 2^64 / (100,000 × 1,000,000) ≈ 1.8 × 10^8 seconds (about 5.8 years) to overflow. Is it enough? If not, we should change the summation properties to averages.
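For reference, a small C snippet working out that overflow estimate under the same assumptions (average 100ms per handshake, 1,000,000 handshakes per second):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t avg_latency_us = 100000;      /* 100 ms per handshake, in microseconds */
    const uint64_t handshakes_per_sec = 1000000; /* assumed worst-case handshake rate */
    const uint64_t accum_us_per_sec = avg_latency_us * handshakes_per_sec; /* 1e11 us added per second */
    double seconds = (double)UINT64_MAX / (double)accum_us_per_sec;
    printf("uint64_t overflows after ~%.1e seconds (~%.1f years)\n",
           seconds, seconds / (365.0 * 24 * 3600));
    return 0;
}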