Conversation

@i110 (Contributor) commented Aug 20, 2018

This PR adds the following properties to the status JSON response.

{
    "ssl.errors": 0,  /* counter for ssl handshake errors */
    "ssl.alpn.h1": 0, /* the number of times h2o selected h1 in ALPN */
    "ssl.alpn.h2": 0, /* the number of times h2o selected h2 in ALPN */ 
    "ssl.handshake.full": 1, /* the number of times full handshake happened */
    "ssl.handshake.resume": 1, /* the number of times resume handshake happened */
    "ssl.handshake.full.latency": 15748, /* sum of latencies for full handshakes in microseconds */
    "ssl.handshake.resume.latency": 2434 /* sum of latencies for resume handshakes in microseconds */
}

ssl.errors belongs to the events module, and the others belong to the newly added ssl module.

One small concern: is uint64_t enough to store the sum of microsecond latencies? Assuming that, in the worst case, the average handshake latency is 100ms (100,000µs) and h2o serves 1,000,000 handshakes/sec, it will take

(2 ^ 64) / (100,000µs × 1,000,000 handshakes/sec) / 86,400 sec/day ≈ 2,135 days ≈ 5.85 years

to overflow. Is that enough? If not, we should change the summation properties to averages.
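For illustration, a minimal sketch of that back-of-the-envelope calculation (the rates below are the worst-case assumptions from the estimate above, not measured h2o numbers):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* worst-case assumptions, not measured h2o numbers */
        const double avg_latency_us = 100000.0;      /* 100 ms per handshake, in microseconds */
        const double handshakes_per_sec = 1000000.0; /* one million handshakes per second */

        /* microseconds accumulated into the counter per wall-clock second */
        double accumulated_per_sec = avg_latency_us * handshakes_per_sec;

        /* time until a uint64_t counter overflows */
        double seconds_to_overflow = (double)UINT64_MAX / accumulated_per_sec;
        printf("days to overflow:  %.0f\n", seconds_to_overflow / 86400.0);
        printf("years to overflow: %.2f\n", seconds_to_overflow / 86400.0 / 365.0);
        return 0;
    }

Running this prints roughly 2,135 days, i.e. about 5.85 years, matching the figure above.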

i110 force-pushed the i110/more-ssl-stats branch from 31c4c85 to d41ecab on August 20, 2018 17:06
i110 force-pushed the i110/more-ssl-stats branch from d41ecab to 2e5cccb on August 20, 2018 17:09
@deweerdt (Member)

> is uint64_t enough to store the sum of microsecond latencies?
I believe that's plenty, yes.

@kazuho (Member) left a comment

Thank you for the PR.

The design looks fine and I have only one question to ask (see below).

include/h2o.h (outdated excerpt):

     * summations of handshake latency in microsecond
     */
    uint64_t handshake_full_latency;
    uint64_t handshake_resume_latency;
@kazuho (Member)

What is the intent of collecting these values?

@i110 (Contributor, Author)

@deweerdt @robguima What do you think?

@kazuho (Member)

FWIW, I can understand that you might want to track how the average latency changes over time. To do that, I can see that you would periodically collect the stats, and calculate (cur_latency - prev_latency) / (cur_handshake_count - prev_handshake_count) to obtain the average latency between the previous stats collection and the current stats collection.
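For illustration, here is a minimal sketch of that delta calculation (the snapshot structure and field names are hypothetical stand-ins for values read from two consecutive status responses, not h2o's actual API):

    #include <stdint.h>

    /* hypothetical snapshot of two counters taken from the status JSON */
    struct ssl_stats_snapshot {
        uint64_t handshake_full;         /* ssl.handshake.full */
        uint64_t handshake_full_latency; /* ssl.handshake.full.latency, in microseconds */
    };

    /* average full-handshake latency (microseconds) between two collections */
    static double average_full_latency_us(const struct ssl_stats_snapshot *prev,
                                          const struct ssl_stats_snapshot *cur)
    {
        uint64_t handshakes = cur->handshake_full - prev->handshake_full;
        if (handshakes == 0)
            return 0.0; /* no full handshakes happened in this interval */
        return (double)(cur->handshake_full_latency - prev->handshake_full_latency) / handshakes;
    }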

What I wonder is how you would use the data. Even though it is a nice-to-have, I am not sure if it can be used for analyzing issues, etc., because the time spent to finish a handshake depends not only on the mode of the handshake (i.e. full vs. resume) but also on the version of TLS, the use of False Start, and the use of HRR (in case TLS 1.3 is used).

I think I might be fine with merging the PR as-is (we might want to change the name to "accumulated_latency" or something, though), but I am wondering if the information being exposed here is sufficient.

@robguima (Collaborator)

@kazuho even though I agree that the latency measurements may be of limited usefulness, they could be useful in determining if, say, lack of resources (e.g. a busy CPU) is impairing handshakes and increasing latency. So, if a spike in latency is detected externally, these stats may help in determining why (providing finer-grained detail). Granted, CPU monitoring would accomplish the same, but that may not be available at all times. Also, if something like neverbleed is in the picture (vs. not), its effect on latency, if any, would be more easily measured.

Regarding the impl, it does lgtm as well 👍
However, I have a more generic suggestion around stats, not entirely about this PR: couldn't perhaps the whole implementation of stats/counters (or at least the majority of them) be made lock-free? Considering these are cumulative stats, which are already per worker and with per-thread data, I reckon stats could use thread-local variables/arrays (with an aggregator to sum the values when needed). (I have a sample impl doing just that, which I could share.)

@kazuho (Member)

@robguima The reason I asked is that lack of resources is not the only factor that affects the latency. In TLS 1.2 and earlier, I assume that the numbers will reflect 1 RTT + the time spent processing handshake messages. In TLS 1.3, in most cases, the time will just reflect the time spent processing handshake messages (if HRR is not used), or 1 RTT + the processing time (if HRR is used).

Considering the fact that the measured value includes RTT, and that we are going to start seeing TLS 1.3 connections for which the measured value does not include RTT, I would be cautious about using this value for detecting issues, though I am not generally against measuring and using this value to support the verification of other metrics.

> However, I have a more generic suggestion around stats, not entirely about this PR: couldn't perhaps the whole implementation of stats/counters (or at least the majority of them) be made lock-free?

Actually, it's lock free. The stats are tied to h2o_context_t, which is a per-event-loop (which also means per-thread) structure. When the stats handler is called, it gathers the information from all the threads (this is when a mutex is used) and reports them back to the client.
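To illustrate the pattern described above, a minimal sketch (the structure and function names are simplified stand-ins, not h2o's actual types):

    #include <pthread.h>
    #include <stdint.h>

    /* simplified stand-in for counters held in a per-event-loop (per-thread) context */
    struct ctx_ssl_stats {
        uint64_t handshake_full;
        uint64_t handshake_full_time; /* accumulated microseconds */
    };

    /* called only on the thread that owns the context, hence no locking */
    static void record_full_handshake(struct ctx_ssl_stats *ctx, uint64_t elapsed_us)
    {
        ctx->handshake_full++;
        ctx->handshake_full_time += elapsed_us;
    }

    /* the stats handler sums counters from every context; the mutex is taken only here */
    struct aggregated_stats {
        pthread_mutex_t mutex;
        uint64_t handshake_full;
        uint64_t handshake_full_time;
    };

    static void add_to_aggregate(struct aggregated_stats *agg, const struct ctx_ssl_stats *ctx)
    {
        pthread_mutex_lock(&agg->mutex);
        agg->handshake_full += ctx->handshake_full;
        agg->handshake_full_time += ctx->handshake_full_time;
        pthread_mutex_unlock(&agg->mutex);
    }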

@robguima (Collaborator)

Ah, I see. Hmm, wondering what the best solution would be then... Perhaps splitting it further into the different categories (especially once 1.3 is in use, and especially 0-RTT)?

On the stats, good to know, thanks for the explanation. 👍 On the other hand, the aggregator could also be made lock-free (with preallocated/registered thread-local data). Anyway, it does not seem that important now, considering that the contention around locking the aggregator is probably very low (and infrequent).

@kazuho (Member)

I am inclined to keep this property as-is and see how it goes, considering that nobody has argued against having it and maintaining it is easy.

We will have the chance to improve the metrics (by adding new ones).

@i110 (Contributor, Author) commented Aug 24, 2018

I think this is enough for now; when we want more fine-grained metrics, doing something like the following would be nice.

ssl.handshake.full-tls13-hrr
ssl.handshake.latency.full-tls13-hrr

The last component is the key for aggregation, and we can add as much detail as we want. Even in such a case, we can provide ssl.handshake.full and ssl.handshake.full.latency as aggregated values over all full handshakes.

@i110 (Contributor, Author) commented Aug 24, 2018

After some tweaks, the final stats look like the following:

{
    "ssl.errors": 0,  /* counter for ssl handshake errors */
    "ssl.alpn.h1": 0, /* the number of times h2o selected h1 in ALPN */
    "ssl.alpn.h2": 0, /* the number of times h2o selected h2 in ALPN */ 
    "ssl.handshake.full": 1, /* the number of times full handshake happened */
    "ssl.handshake.resume": 1, /* the number of times resume handshake happened */
    "ssl.handshake.accumlated-time.full": 15748, /* sum of time for full handshakes in microseconds */
    "ssl.handshake.accumlated-time.resume": 2434 /* sum of time for resume handshakes in microseconds */
}

@kazuho @robguima @deweerdt 🙂

@kazuho (Member) commented Aug 24, 2018

Thank you for the changes. I like the change of the name to accumulated-time. It makes it clear that they do not represent "latency", which is a term that refers to the time spent per connection.

@kazuho (Member) commented Aug 24, 2018

@i110 I think that the PR is ready for merge.

i110 merged commit c3b9517 into master on Aug 29, 2018