Conversation

@i110 (Contributor) commented Aug 20, 2018

This PR adds the following properties to the status JSON response.

{
    "ssl.errors": 0,  /* counter for ssl handshake errors */
    "ssl.alpn.h1": 0, /* the number of times h2o selected h1 in ALPN */
    "ssl.alpn.h2": 0, /* the number of times h2o selected h2 in ALPN */ 
    "ssl.handshake.full": 1, /* the number of times full handshake happened */
    "ssl.handshake.resume": 1, /* the number of times resume handshake happened */
    "ssl.handshake.full.latency": 15748, /* sum of latencies for full handshakes in microseconds */
    "ssl.handshake.resume.latency": 2434 /* sum of latencies for resume handshakes in microseconds */
}

ssl.errors belongs to the events module, and the others belong to the newly added ssl module.

One small concern: is uint64_t enough to store the sum of microsecond latencies? Assuming that, in the worst case, the average handshake latency is 100ms (100,000µs) and h2o serves 1,000,000 handshakes/sec, it will take

(2 ^ 64) / (100,000µs × 1,000,000 handshakes/sec) / 86,400 sec/day ≈ 2,135 days ≈ 5.85 years

to overflow. Is that enough? If not, we should change the summation properties to averages.
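For illustration, a minimal sketch of that back-of-the-envelope calculation (the rates below are the worst-case assumptions from the estimate above, not measured h2o numbers):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* worst-case assumptions, not measured h2o numbers */
        const double avg_latency_us = 100000.0;      /* 100 ms per handshake, in microseconds */
        const double handshakes_per_sec = 1000000.0; /* one million handshakes per second */

        /* microseconds accumulated into the counter per wall-clock second */
        double accumulated_per_sec = avg_latency_us * handshakes_per_sec;

        /* time until a uint64_t counter overflows */
        double seconds_to_overflow = (double)UINT64_MAX / accumulated_per_sec;
        printf("days to overflow:  %.0f\n", seconds_to_overflow / 86400.0);
        printf("years to overflow: %.2f\n", seconds_to_overflow / 86400.0 / 365.0);
        return 0;
    }

Running this prints roughly 2,135 days, i.e. about 5.85 years, matching the figure above.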

i110 force-pushed the i110/more-ssl-stats branch from 31c4c85 to d41ecab on August 20, 2018 17:06
i110 force-pushed the i110/more-ssl-stats branch from d41ecab to 2e5cccb on August 20, 2018 17:09
@deweerdt (Member)

> is uint64_t enough to store the sum of microsecond latencies?
I believe that's plenty, yes.

@kazuho (Member) left a comment

Thank you for the PR.

The design looks fine and I have only one question to ask (see below).

include/h2o.h (outdated excerpt):

     * summations of handshake latency in microsecond
     */
    uint64_t handshake_full_latency;
    uint64_t handshake_resume_latency;
@kazuho (Member)

What is the intent of collecting these values?

@i110 (Contributor, Author)

@deweerdt @robguima What do you think?

@kazuho (Member)

FWIW, I can understand that you might want to track how the average latency changes over time. To do that, I can see that you would periodically collect the stats, and calculate (cur_latency - prev_latency) / (cur_handshake_count - prev_handshake_count) to obtain the average latency between the previous stats collection and the current stats collection.
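For illustration, here is a minimal sketch of that delta calculation (the snapshot structure and field names are hypothetical stand-ins for values read from two consecutive status responses, not h2o's actual API):

    #include <stdint.h>

    /* hypothetical snapshot of two counters taken from the status JSON */
    struct ssl_stats_snapshot {
        uint64_t handshake_full;         /* ssl.handshake.full */
        uint64_t handshake_full_latency; /* ssl.handshake.full.latency, in microseconds */
    };

    /* average full-handshake latency (microseconds) between two collections */
    static double average_full_latency_us(const struct ssl_stats_snapshot *prev,
                                          const struct ssl_stats_snapshot *cur)
    {
        uint64_t handshakes = cur->handshake_full - prev->handshake_full;
        if (handshakes == 0)
            return 0.0; /* no full handshakes happened in this interval */
        return (double)(cur->handshake_full_latency - prev->handshake_full_latency) / handshakes;
    }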

What I wonder is how you would use the data. Even though it is a nice-to-have, I am not sure if it can be used for analyzing issues, etc., because the time spent to finish a handshake depends not only on the mode of the handshake (i.e. full vs. resume) but also on the version of TLS, the use of False Start, and the use of HRR (in case TLS 1.3 is used).

I think I might be fine with merging the PR as-is (we might want to change the name to "accumulated_latency" or something, though), but I am wondering if the information being exposed here is sufficient.

@robguima (Collaborator)

@kazuho even though I agree that the latency measurements may be of limited usefulness, they could be useful in determining if, say, lack of resources (e.g. a busy CPU) is impairing handshakes and increasing latency. So, if a spike in latency is detected externally, these stats may help in determining why (providing finer-grained detail). Granted, CPU monitoring would accomplish the same, but that may not be available at all times. Also, if something like neverbleed is in the picture (vs. not), its effect on latency, if any, would be more easily measured.

Regarding the impl, it does lgtm as well 👍
However, I have a more generic suggestion around stats, not entirely about this PR: couldn't perhaps the whole implementation of stats/counters (or at least the majority of them) be made lock-free? Considering these are cumulative stats, which are already per worker and with per-thread data, I reckon stats could use thread-local variables/arrays (with an aggregator to sum the values when needed). (I have a sample impl doing just that, which I could share.)

@kazuho (Member)

@robguima The reason I asked is that lack of resources is not the only factor that affects the latency. In TLS 1.2 and earlier, I assume that the numbers will reflect 1 RTT + the time spent processing handshake messages. In TLS 1.3, in most cases, the time will just reflect the time spent processing handshake messages (if HRR is not used), or 1 RTT + the processing time (if HRR is used).

Considering the fact that the measured value includes RTT, and that we are going to start seeing TLS 1.3 connections for which the measured value does not include RTT, I would be cautious about using this value for detecting issues, though I am not generally against measuring and using this value to support the verification of other metrics.

> However, I have a more generic suggestion around stats, not entirely about this PR: couldn't perhaps the whole implementation of stats/counters (or at least the majority of them) be made lock-free?

Actually, it's lock free. The stats are tied to h2o_context_t, which is a per-event-loop (which also means per-thread) structure. When the stats handler is called, it gathers the information from all the threads (this is when a mutex is used) and reports them back to the client.
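To illustrate the pattern described above, a minimal sketch (the structure and function names are simplified stand-ins, not h2o's actual types):

    #include <pthread.h>
    #include <stdint.h>

    /* simplified stand-in for counters held in a per-event-loop (per-thread) context */
    struct ctx_ssl_stats {
        uint64_t handshake_full;
        uint64_t handshake_full_time; /* accumulated microseconds */
    };

    /* called only on the thread that owns the context, hence no locking */
    static void record_full_handshake(struct ctx_ssl_stats *ctx, uint64_t elapsed_us)
    {
        ctx->handshake_full++;
        ctx->handshake_full_time += elapsed_us;
    }

    /* the stats handler sums counters from every context; the mutex is taken only here */
    struct aggregated_stats {
        pthread_mutex_t mutex;
        uint64_t handshake_full;
        uint64_t handshake_full_time;
    };

    static void add_to_aggregate(struct aggregated_stats *agg, const struct ctx_ssl_stats *ctx)
    {
        pthread_mutex_lock(&agg->mutex);
        agg->handshake_full += ctx->handshake_full;
        agg->handshake_full_time += ctx->handshake_full_time;
        pthread_mutex_unlock(&agg->mutex);
    }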

@robguima (Collaborator)

Ah, I see. Hmm, wondering what the best solution would be then... Perhaps splitting it further into the different categories (especially once 1.3 is in use, and especially 0-RTT)?

On the stats, good to know, thanks for the explanation. 👍 On the other hand, the aggregator could also be made lock-free (with preallocated/registered thread-local data). Anyway, it does not seem that important now, considering that the contention around locking the aggregator is probably very low (and infrequent).

@kazuho (Member)

I am inclined to keep this property as-is and see how it goes, considering that nobody has argued against having it and maintaining it is easy.

We will have the chance to improve the metrics (by adding new ones).

@i110 (Contributor, Author) commented Aug 24, 2018

I think this is enough for now; when we want more fine-grained metrics, doing something like the following would be nice.

ssl.handshake.full-tls13-hrr
ssl.handshake.latency.full-tls13-hrr

The last component is the key for aggregation, and we can add as much detail as we want. Even in such a case, we can provide ssl.handshake.full and ssl.handshake.full.latency as aggregated values over all full handshakes.

@i110 (Contributor, Author) commented Aug 24, 2018

After some tweaks, the final stats look like the following:

{
    "ssl.errors": 0,  /* counter for ssl handshake errors */
    "ssl.alpn.h1": 0, /* the number of times h2o selected h1 in ALPN */
    "ssl.alpn.h2": 0, /* the number of times h2o selected h2 in ALPN */ 
    "ssl.handshake.full": 1, /* the number of times full handshake happened */
    "ssl.handshake.resume": 1, /* the number of times resume handshake happened */
    "ssl.handshake.accumlated-time.full": 15748, /* sum of time for full handshakes in microseconds */
    "ssl.handshake.accumlated-time.resume": 2434 /* sum of time for resume handshakes in microseconds */
}

@kazuho @robguima @deweerdt 🙂

@kazuho (Member) commented Aug 24, 2018

Thank you for the changes. I like the change of the name to accumulated-time. It makes it clear that they do not represent "latency", which is a term that refers to the time spent per connection.

@kazuho (Member) commented Aug 24, 2018

@i110 I think that the PR is ready for merge.

i110 merged commit c3b9517 into master on Aug 29, 2018