CUBIC congestion control in QUIC #443
Conversation
Regarding the 2nd patch: good catch! While I agree that application-level closure shouldn't affect the transport layer, FTR, I simulated such behavior locally with a large response that fits into http3_stream_buffer_size / output_buffers (to close the request ASAP and set the keepalive timer), and 30% packet loss plus a 32k-limited sliding stream window to cause extra round trips.
It's not related to the idle timeout: QUIC packets will be slowly sent to the client and ACKs will come back, so the idle timeout will not expire.
I don't remember the exact setting for this, but a small MTU (1200-1500) and a large response (50-100M) triggered the issue quickly.
For this particular issue I think I used …
Following a report here #442 (comment), I updated the last patch to ignore the case when …
For the record, as discussed elsewhere: I support the idea of falling back to the non-"fast convergence" case if Wmax happens to be lower, i.e. to avoid further reducing Wmax. This was probably overlooked in RFC 9438, Sections 4.6 / 4.7, which suggest "further reducing Wmax" (uncapped) before the window reduction (capped at 2 MSS).
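For illustration, a minimal sketch (not the actual patch; names are made up) of fast convergence per RFC 9438, Sections 4.6/4.7, with the fallback discussed above so that Wmax is never left below the reduced window; beta_cubic = 0.7 is kept as the integer fraction 7/10:

#include <stddef.h>

#define CUBIC_BETA_NUM  7                   /* beta_cubic = 0.7 */
#define CUBIC_BETA_DEN  10

static void
cubic_on_congestion_event(size_t *window, size_t *w_max, size_t mss)
{
    size_t  prior, new_w_max;

    prior = *window;

    if (prior < *w_max) {
        /* fast convergence, RFC 9438 Section 4.6: further reduce W_max */
        new_w_max = prior * (CUBIC_BETA_DEN + CUBIC_BETA_NUM)
                    / (2 * CUBIC_BETA_DEN);
    } else {
        new_w_max = prior;
    }

    /* multiplicative decrease, RFC 9438 Section 4.7, floored at 2 * MSS */
    *window = prior * CUBIC_BETA_NUM / CUBIC_BETA_DEN;
    if (*window < 2 * mss) {
        *window = 2 * mss;
    }

    /* the fallback discussed above: do not leave W_max below the new window */
    if (new_w_max < *window) {
        new_w_max = prior;
    }

    *w_max = new_w_max;
}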
While benchmarking CUBIC on a local Linux machine with tc-netem, I consistently saw the congestion window collapsing without any obvious reason. After debugging this deeper, I realized that nginx considers certain packets lost due to packet reordering, while the window is much below the BDP. RFC 9002 allows up to 3 later packets to be acknowledged before declaring a packet lost, while in my tests I saw reordering of 10-20 packets. I increased the packet threshold …
On which side have you observed the reordering? Given that nginx is the sender, was it the client ACK packets that were reordered? In that regard, see also draft-ietf-quic-ack-frequency for the ACK_FREQUENCY frame extension, used to batch acknowledgment of ack-eliciting packets, which may also improve performance by eliminating ACK-only packet reordering. IIRC, there were reports on IETF mailing lists that reducing the ACK frequency 10-100x improved performance considerably. Otherwise, see below if the reordering happened on the nginx (sender) side, with transient gaps reported in ACK frames from the receiver side. Note that the purpose of the RFC 9002 packet threshold is to tolerate a small amount of reordering without declaring packets lost.
Also, some points to consider when changing it: …
First, I observed the client ACKing packets normally, in ascending order. Then at some point a multi-range ACK comes from the client with a hole in it. There are usually enough packets after the hole (>3) that nginx marks the entire hole as lost (and decreases the congestion window). After that, another ACK comes from the client covering the hole, which means the packets in the hole weren't actually lost, but it's too late by then.
Increasing the value may slow down retransmission, which seems like a lesser evil compared to collapsing the congestion window. If we start with a smaller packet threshold, we'll have to increase it based on spurious loss detection, which seems rather complex.
Pushed one more patch on top of the series. It implements packet threshold auto-detection, since according to my tests a single fixed value does not work well for all conditions.
Update: I keep testing this with larger bandwidths and larger files, and it looks like removing the packet threshold altogether gives a huge speed boost.
Update 2: the worst delay we may experience after removing the packet threshold is RTT/8: a later packet has already been ACKed, which took 1 RTT, and the time threshold is roughly equal to 9/8 RTT, so what remains is RTT/8.
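For reference, the RTT/8 figure follows from the RFC 9002, Section 6.1.2 time threshold of 9/8 * max(smoothed_rtt, latest_rtt). A small sketch, assuming a 100 ms RTT purely for illustration:

#include <stdio.h>

int
main(void)
{
    double  rtt = 100.0;                        /* assumed RTT in ms */
    double  time_threshold = 9.0 / 8.0 * rtt;   /* RFC 9002, Section 6.1.2 */

    /*
     * A later packet has already been acknowledged, which took about 1 RTT;
     * without a packet threshold, loss is declared once the time threshold
     * expires, i.e. only RTT/8 later.
     */
    printf("time threshold: %.1f ms, extra delay: %.1f ms\n",
           time_threshold, time_threshold - rtt);

    return 0;
}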
Could you clarify in the commit log why it is not reliable? Also: s/ngx_currnt_msec/ngx_current_msec/
-            ngx_quic_revert_send(c, ctx, preserved_pnum[i]);
-        }
+        ngx_quic_revert_send(c, preserved_pnum);
Note that preserved_pnum is not fully initialized here if pad < 3, and so may corrupt certain ctx->pnum values with zero. For example, this will overwrite ctx->pnum for the application level in ngx_quic_revert_send(), which is possible if send_ctx[3]->sending is not empty.
That said, preserved_pnum could be precalculated to avoid the corruption:
diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index c2282a5f7..7462e0633 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -127,9 +127,9 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
     cg = &qc->congestion;
     path = qc->path;
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
+    for (i = 0; i < NGX_QUIC_SEND_CTX_LAST; i++) {
+        preserved_pnum[i] = qc->send_ctx[i].pnum;
+    }
     while (cg->in_flight < cg->window) {
@@ -143,8 +143,6 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
         ctx = &qc->send_ctx[i];
-        preserved_pnum[i] = ctx->pnum;
-
         if (ngx_quic_generate_ack(c, ctx) != NGX_OK) {
             return NGX_ERROR;
         }
OTOH, this specific condition to call ngx_quic_revert_send() appears to never be true after 4f3707c, which replaced send-queue-based PTO probes with a direct send, so it can simply be removed instead.
Also, while looking into this, I noticed that PTO probes for Initial packets aren't expanded anymore after 4f3707c.
A quick'n'dirty fix, for the sake of clarity:
diff --git a/src/event/quic/ngx_event_quic_ack.c b/src/event/quic/ngx_event_quic_ack.c
index a6f34348b..6ed411e9f 100644
--- a/src/event/quic/ngx_event_quic_ack.c
+++ b/src/event/quic/ngx_event_quic_ack.c
@@ -928,7 +928,7 @@ ngx_quic_pto_handler(ngx_event_t *ev)
        f->type = NGX_QUIC_FT_PING;
        f->ignore_congestion = 1;
-        if (ngx_quic_frame_sendto(c, f, 0, qc->path) == NGX_ERROR) {
+        if (ngx_quic_frame_sendto(c, f, 1200 * !i, qc->path) == NGX_ERROR) {
            goto failed;
        }
    }
For the record: after the second review, preserved_pnum[] appears to be used correctly here. The two other items remain valid and will be addressed separately.
Improved logging for simpler data extraction when plotting congestion window graphs. In particular, added the current milliseconds value from ngx_current_msec. While here, simplified the logging text and removed irrelevant data.
Previously, the expiration caused QUIC connection finalization even if there were application-terminated streams still finishing sending data. Such finalization terminated these streams. An easy way to trigger this is to request a large file over HTTP/3 with a small MTU; in this case keepalive timeout expiration may abruptly terminate the request stream.
As per RFC 9002, Section B.2, the max_datagram_size used in congestion window computations should be based on the path MTU.
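For context, a hedged sketch of the idea (illustrative names, not the patch code): derive max_datagram_size from the path MTU and size the initial window from it, following RFC 9002, Sections 7.2 and B.2:

#include <stddef.h>

static size_t
quic_initial_window(size_t path_mtu)
{
    size_t  max_datagram_size, w, cap;

    max_datagram_size = path_mtu;               /* UDP payload usable on the path */

    /* min(10 * max_datagram_size, max(2 * max_datagram_size, 14720)) */
    w = 10 * max_datagram_size;
    cap = 2 * max_datagram_size > 14720 ? 2 * max_datagram_size : 14720;

    return w < cap ? w : cap;
}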
Since the recovery_start field was initialized with ngx_current_msec, all congestion events that happened within the same millisecond or cycle iteration were treated as happening in recovery mode. Also, when handling persistent congestion, initializing recovery_start with ngx_current_msec resulted in treating all sent packets as sent in recovery mode, which violates RFC 9002; see the example in Appendix B.8. While here, also fixed recovery_start wrap protection. Previously it used a 2 * max_idle_timeout time frame for all sent frames, which is not reliable protection since max_idle_timeout is unrelated to congestion control. Now the recovery_start <= now condition is enforced. Note that a recovery_start wrap is highly unlikely and can only occur on a 32-bit system if there are no congestion events for 24 days.
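A simplified sketch of the intended behaviour, following the RFC 9002 recovery-period logic rather than the exact nginx change (types and names are illustrative); the signed-difference comparison is what keeps the recovery_start <= now invariant meaningful across a clock wrap:

#include <stddef.h>

typedef unsigned int  msec_t;                   /* stand-in for ngx_msec_t */
typedef int           msec_int_t;

static int
in_recovery(msec_t sent, msec_t recovery_start)
{
    /* wrap-safe "sent <= recovery_start" */
    return ((msec_int_t) (sent - recovery_start) <= 0);
}

static void
on_congestion_event(msec_t sent, msec_t now, msec_t *recovery_start,
    size_t *window, size_t min_window)
{
    if (in_recovery(sent, *recovery_start)) {
        return;                                 /* already reacted to this loss episode */
    }

    *recovery_start = now;

    *window /= 2;                               /* Reno-style reduction, illustration only */
    if (*window < min_window) {
        *window = min_window;
    }
}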
On some systems the value of ngx_current_msec is derived from a monotonic clock, for which POSIX states: "For this clock, the value returned by clock_gettime() represents the amount of time (in seconds and nanoseconds) since an unspecified point in the past." As a result, overflow protection is needed when comparing two ngx_msec_t values. The change adds such protection to the ngx_quic_detect_lost() function.
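For illustration, the usual nginx idiom for such wrap-safe millisecond comparisons, shown here as a generic helper rather than the exact ngx_quic_detect_lost() change:

#include <ngx_config.h>
#include <ngx_core.h>

static ngx_int_t
quic_msec_after(ngx_msec_t a, ngx_msec_t b)
{
    /* true if "a" is later than "b", even if ngx_current_msec has wrapped */
    return ((ngx_msec_int_t) (a - b) > 0);
}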
-    preserved_pnum = ctx->pnum;
+#if (NGX_SUPPRESS_WARN)
+    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
+#endif
+
+    level = 2; /* application */
+    preserved_pnum[level] = ctx->pnum;
The function is always called under the condition that only application-level frames exist. That said, we can initialize only certain parts of preserved_pnum.
diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index e024fd012..a92a539f3 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -331,13 +331,13 @@ ngx_quic_create_segments(ngx_connection_t *c)
     size_t                  len, segsize;
     ssize_t                 n;
     u_char                 *p, *end;
-    uint64_t                preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
     ngx_uint_t              nseg, level;
     ngx_quic_path_t        *path;
     ngx_quic_send_ctx_t    *ctx;
     ngx_quic_congestion_t  *cg;
     ngx_quic_connection_t  *qc;
     static u_char           dst[NGX_QUIC_MAX_UDP_SEGMENT_BUF];
+    static uint64_t         preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
     qc = ngx_quic_get_connection(c);
     cg = &qc->congestion;
@@ -355,11 +355,7 @@ ngx_quic_create_segments(ngx_connection_t *c)
     nseg = 0;
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
-
-    level = 2; /* application */
+    level = ctx - qc->send_ctx;
     preserved_pnum[level] = ctx->pnum;
     for ( ;; ) {
OK, applied.
Previously, these functions operated on a per-level basis. This, however, resulted in excessive logging of in_flight and would also lead to extra work detecting an underutilized congestion window in the follow-up patches.
As per RFC 9002, Section 7.8, the congestion window should not be increased when it is underutilized.
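A minimal illustration of that rule (not the nginx patch itself); the underutilization test here is deliberately simplistic:

#include <stddef.h>

static void
quic_window_increase(size_t *window, size_t ssthresh, size_t in_flight,
    size_t acked, size_t mss)
{
    if (in_flight < *window) {
        /* the window was not fully used; growing it is not justified */
        return;
    }

    if (*window < ssthresh) {
        *window += acked;                       /* slow start */

    } else {
        *window += mss * acked / *window;       /* congestion avoidance */
    }
}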
As per RFC 9000, Section 14.4: "Loss of a QUIC packet that is carried in a PMTU probe is therefore not a reliable indication of congestion and SHOULD NOT trigger a congestion control reaction."
If a connection is network-limited, MTU probes have little chance of being sent, since the congestion window is almost always full. As a result, PMTUD may not be able to reach the real MTU, and the connection may keep operating with a reduced MTU. The solution is to ignore the congestion window when sending MTU probes. This may lead to a temporary increase of the in-flight counter beyond the congestion window.
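An illustrative sketch of this approach with made-up types (cf. the ignore_congestion frame flag visible in the diff earlier in this thread): probe packets bypass the window check, and a lost probe does not shrink the window:

#include <stddef.h>

typedef struct {
    size_t     size;
    unsigned   mtu_probe;                       /* hypothetical flag */
} quic_pkt_t;

static int
quic_may_send(quic_pkt_t *pkt, size_t in_flight, size_t window)
{
    if (pkt->mtu_probe) {
        return 1;            /* may temporarily push in_flight beyond the window */
    }

    return in_flight + pkt->size <= window;
}

static void
quic_on_pkt_lost(quic_pkt_t *pkt, size_t *window, size_t min_window)
{
    if (pkt->mtu_probe) {
        return;              /* RFC 9000, Section 14.4: not a congestion signal */
    }

    *window /= 2;
    if (*window < min_window) {
        *window = min_window;
    }
}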
Previously the threshold was hardcoded at 10000. This value is too low for high-BDP networks. For example, if all frames are STREAM frames and the MTU is 1500, the upper limit for the congestion window would be roughly 15M (10000 * 1500). With a 100 ms RTT this is just a 1.2 Gbps link (15M * 10 * 8). In reality the limit is even lower because of other frame types. Also, the number of frames that can be in use simultaneously depends on the total amount of data buffered in all server streams and on client flow control. The change sets the frame threshold based on the maximum number of concurrent streams and the stream buffer size, whose product is the maximum amount of stream data in flight across all server streams at any moment. The value is divided by 2000 to account for a typical MTU of 1500 and for the fact that not all frames are STREAM frames.
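A sketch of that computation (the parameter names are illustrative, not actual configuration directives):

#include <stddef.h>

static size_t
quic_frames_threshold(size_t max_concurrent_streams, size_t stream_buffer_size)
{
    /*
     * Upper bound on stream data buffered for sending across all server
     * streams, scaled down by ~2000 to account for a typical 1500-byte MTU
     * and for frames other than STREAM.
     */
    return max_concurrent_streams * stream_buffer_size / 2000;
}

For example, 128 concurrent streams with a 64k stream buffer yield a threshold of roughly 4200 frames.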
RFC 9002, Section 6.1.1 defines the packet reordering threshold as 3. Testing shows that such a low value leads to spurious packet losses followed by congestion window collapse. The change implements dynamic packet threshold detection based on the in-flight packet range. The packet threshold is defined as half the number of in-flight packets, with a minimum value of 3. Also, renamed ngx_quic_lost_threshold() to ngx_quic_time_threshold() for better compliance with RFC 9002 terms.
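A minimal sketch of the dynamic threshold; taking the in-flight packet range as (largest sent PN - largest acked PN) is an assumption made purely for illustration:

#include <stdint.h>

static uint64_t
quic_pkt_threshold(uint64_t largest_sent_pn, uint64_t largest_acked_pn)
{
    uint64_t  threshold;

    threshold = (largest_sent_pn - largest_acked_pn) / 2;

    return threshold < 3 ? 3 : threshold;       /* keep the RFC 9002 minimum of 3 */
}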
Description
CUBIC congestion control is described in RFC 9438. Currently nginx implements Reno congestion control for QUIC, as described in RFC 9002. CUBIC is an improvement over Reno, especially for high-BDP and high-RTT paths.
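For reference, the RFC 9438 window growth function the new controller is built around (a plain illustration, not the patch code):

#include <math.h>

static double
w_cubic(double t, double w_max)
{
    /*
     * W_cubic(t) = C * (t - K)^3 + W_max,  K = cbrt(W_max * (1 - beta) / C),
     * with C = 0.4, beta_cubic = 0.7, W measured in MSS units and t in
     * seconds since the last congestion event (RFC 9438).
     */
    const double  c = 0.4, beta = 0.7;
    double        k = cbrt(w_max * (1.0 - beta) / c);

    return c * pow(t - k, 3.0) + w_max;
}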
This PR includes several preparatory patches plus the QUIC congestion control patch itself. The first patch changes nginx debug logging to allow plotting the congestion window. Below are some congestion control graphs.
Closes #442.
Testing environment
All testing is done on a local interface with MTU 1500. Traffic is shaped by tc-netem, graphs are plotted with gnuplot, and gtlsclient is used as the HTTP/3 client. A 50M file is downloaded for testing.
nginx.conf:
After running a test, the following script uses gnuplot to generate a PNG with the graph.
Different parts of the graph are colored according to the congestion control state:
The graph plots the congestion window size over time, from session start to end.
Graphs
Current congestion control (Reno).

Reno after fixing its aggressiveness, which was caused by using the maximum UDP packet size instead of the current MTU (QUIC: use path MTU in congestion window computations).

CUBIC congestion control.
