
CUBIC congestion control in QUIC #443


Merged · 12 commits merged on Apr 15, 2025

Conversation

@arut (Contributor) commented Jan 11, 2025

Description

CUBIC congestion control is described in RFC 9438. Currently nginx implements Reno congestion control for QUIC, as described in RFC 9002. CUBIC is an improvement over Reno, especially for high-BDP and high-RTT paths.
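
For reference, RFC 9438 defines CUBIC's window growth as W_cubic(t) = C*(t - K)^3 + W_max, with K = cbrt((W_max - cwnd_epoch)/C), where cwnd_epoch is the window at the start of the current congestion avoidance phase. A minimal floating-point sketch of that formula (illustration only; the names are made up and this is not the nginx code from this PR):

#include <math.h>

#define CUBIC_C  0.4    /* constant C from RFC 9438 */

/* congestion window (in MSS) t seconds into congestion avoidance;
 * w_max is the window just before the last reduction,
 * cwnd_epoch is the window right after that reduction */
static double
cubic_window(double t, double w_max, double cwnd_epoch)
{
    double  k;

    k = cbrt((w_max - cwnd_epoch) / CUBIC_C);

    return CUBIC_C * pow(t - k, 3.0) + w_max;
}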

This PR includes several preparatory patches plus the CUBIC congestion control patch itself. The first patch changes nginx debug logging to allow plotting the congestion control window. Some congestion control graphs are shown below.

Closes #442.

Testing environment

All testing is done on the local interface with MTU 1500. Traffic is controlled by tc-netem, graphs are plotted with gnuplot, and gtlsclient is used as the HTTP/3 client. A 50M file is downloaded for testing.

$ sudo ifconfig lo mtu 1500
$ sudo tc qdisc add dev lo root netem limit 300 delay 100ms
$ gtlsclient -q --exit-on-first-stream-close 127.0.0.1 8443 https://example.com/50m 

nginx.conf:

master_process off;
daemon off;

error_log logs/error.log debug;

events {
}

http {
    server {
        listen 8443 quic;

        http3_stream_buffer_size 10m;

        ssl_certificate certs/example.com.crt;
        ssl_certificate_key certs/example.com.key;

        location / {
            root html;
        }
    }
}

After running a test, the following script uses gnuplot to generate a png with the graph.

#!/bin/bash

ERROR_LOG=logs/error.log
CUBIC_SCRIPT=/tmp/cubic.gnuplot
CUBIC_DATA=/tmp/cubic.dat
CUBIC_PNG=cubic.png

rm $CUBIC_PNG || true

awk 'BEGIN{t=0} /congestion ack.*win/{match($0, /ack ([^ ]+) /, s); match($0, /win:([0-9]+)/, m); match($0, / t:([0-9]+)/, n); if (t==0) {t=n[1];} print n[1]-t, m[1], s[1]}' $ERROR_LOG >$CUBIC_DATA

cat >$CUBIC_SCRIPT <<END
set title "QUIC congestion control"
set xlabel ""
set ylabel ""
set format y '%.1s%cB'
set format x '%0.0ss'
set yrange [0:]
set term png
set output "$CUBIC_PNG"

map_color(string) = (              \
    string eq 'ss'    ? 0x00ff00 : \
    string eq 'cubic' ? 0x0000ff : \
    string eq 'reno'  ? 0xff0000 : \
    string eq 'idle'  ? 0x000000 : \
    string eq 'rec'   ? 0x444444 : \
    0x888888)

plot "$CUBIC_DATA" using 1:2:(map_color(stringcolumn(3))) title "cwnd" with linespoints lc rgbcolor variable
END

gnuplot $CUBIC_SCRIPT

Different parts of the graph are colored according to the congestion control state:

  • green: slow start
  • blue: CUBIC
  • red: Reno (including Reno-friendliness region in CUBIC)
  • black: application-limited region
  • grey: recovery

The graph plots the congestion window size over time, from session start to end.

Graphs

Current congestion control (Reno):
[graph: original]

Reno with fixed aggressiveness. The excessive aggressiveness was due to using the maximum UDP packet size instead of the current MTU (QUIC: use path MTU in congestion window computations):
[graph: reno]

CUBIC congestion control:
[graph: cubic]

@arut added the feature label Jan 11, 2025
@arut requested a review from pluknet January 11, 2025 13:43
@Maryna-f5 modified the milestones: nginx-1.27.4, nginx-1.27.5 Jan 14, 2025
@Maryna-f5 removed this from the nginx-1.27.5 milestone Feb 6, 2025
@pluknet (Contributor) commented Feb 28, 2025

Regarding the 2nd patch: good catch!

With what keepalive_timeout, response size, and network setup do you observe such premature connection close?

While I agree that application-level closure shouldn't affect the transport layer, AFAIU, to step into this, (re-)transmitting STREAM frames should take longer than the QUIC idle timeout (typically >> 0s), as borrowed from keepalive_timeout, which is quite unusual to observe.

FTR, I simulated such behavior locally with a large response that fits http3_stream_buffer_size / output_buffers (to close the request ASAP and set the keepalive timer), and 30% packet loss / a 32k-limited sliding stream window to cause extra round trips.
This adds roughly 30-45s of extra time to deliver the stream.

@arut (Contributor, Author) commented Mar 3, 2025

> Regarding the 2nd patch: good catch!
>
> With what keepalive_timeout, response size, and network setup do you observe such premature connection close?
>
> While I agree that application-level closure shouldn't affect transport layer, AFAIU, to step into this, (re-)transmitting STREAM frames should take longer than QUIC idle timeout (typically >> 0s), as borrowed from keepalive_timeout, which is quite unusual to observe.

It's not related to the idle timeout. QUIC packets are sent slowly to the client and ACKs keep coming back, so the idle timeout does not expire.

> FTR, I simulated such behavior locally with large response that fits http3_stream_buffer_size / output_buffers to close request ASAP and set keepalive timer, and 30% packet loss / 32k limited sliding stream window to cause extra round trips. This makes near 30-45s extra time to deliver stream.

I don't remember the exact settings for this, but a small MTU (1200-1500) and a large response (50-100M) triggered the issue quickly.

@arut (Contributor, Author) commented Mar 6, 2025

> I don't remember the exact setting for this, but a small MTU (1200-1500) and a large response (50-100M) triggered the issue quickly.

For this particular issue I think I used qdisc netem with low bandwidth and no losses. After the last request finished, the entire response was buffered in the QUIC layer and the stream was closed at the server; it then took nginx quite a long time to send the outstanding data. Hitting the keepalive timeout stopped the sending process.

@arut (Contributor, Author) commented Mar 6, 2025

Following a report in #442 (comment), I updated the last patch to ignore the case when cg->window >= cg->w_max. This can happen when a congestion event occurs repeatedly with a very small window. The cubic time function was not equipped to handle a negative argument, which eventually resulted in division by zero. The fix treats a negative argument as zero and continues with only the concave part of the cubic function.
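
A minimal sketch of the kind of guard described above (the helper name is made up; this is not the actual patch, which operates on nginx's internal fixed-point types):

#include <math.h>

/* hypothetical helper: compute K while treating a negative argument
 * (which arises when window >= w_max) as zero, so only the concave
 * part of the cubic curve is used */
static double
cubic_k(double w_max, double window)
{
    double  arg;

    arg = (w_max - window) / 0.4;   /* 0.4 is the CUBIC constant C */

    if (arg < 0) {
        arg = 0;                    /* clamp the negative argument */
    }

    return cbrt(arg);
}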

@pluknet (Contributor) commented Mar 6, 2025

> Following a report here #442 (comment), I updated the last patch to ignore the case when cg->window >= cg->w_max. This can happen when a congestion event occurs repeatedly with a very small window. The cubic time function was not equipped to handle a negative argument, which eventually resulted in division by zero. The fix is to consider negative argument as zero and continue with only the concave part of the cubic function.

For the record, as discussed elsewhere.

I support the idea of falling back to the non-"fast convergence" case if Wmax happens to be lower, i.e. avoiding a further reduction of Wmax.
Also, such a reduction barely has a physical interpretation.

This was probably overlooked in RFC 9438, sections 4.6 / 4.7, which suggest "further reducing Wmax" (uncapped) before the window reduction (capped to 2 MSS).
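
A sketch of the fast-convergence step with the guard discussed above (assumed names, floating point for brevity; not the nginx patch):

#define CUBIC_BETA  0.7     /* beta_cubic from RFC 9438 */

/* on a congestion event: record w_max, optionally applying fast
 * convergence, then reduce the window multiplicatively */
static void
cubic_congestion_event(double *w_max, double *window)
{
    if (*window < *w_max) {
        /* fast convergence: remember a slightly smaller w_max */
        *w_max = *window * (1.0 + CUBIC_BETA) / 2.0;

    } else {
        /* window >= w_max: skip the extra reduction, just record it */
        *w_max = *window;
    }

    *window *= CUBIC_BETA;      /* multiplicative decrease */
}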

@arut (Contributor, Author) commented Mar 28, 2025

While benchmarking CUBIC on a local Linux machine with tc-netem, I consistently saw the congestion window collapsing for no obvious reason. After debugging this deeper, I realized that nginx considers certain packets lost due to packet reordering, while the window is well below the BDP. RFC 9002 allows 3 later packets to be acknowledged before a packet is declared lost, while in my tests I saw reordering of 10-20 packets.

I increased NGX_QUIC_PKT_THR from 3 to 100 in my code, and the CUBIC window became much smoother, without those spurious packet losses and window collapses. Download time also improved significantly. This change needs to be tested in various environments and can potentially improve QUIC performance. The RFC is not strict about packet reordering conditions and references RACK (RFC 8985), which has only time-based packet loss detection; that is probably the right way to go.
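
For context, the packet-threshold rule from RFC 9002 that NGX_QUIC_PKT_THR controls boils down to a check like the following (sketch with assumed names, not the nginx code):

#include <stdint.h>

#define PKT_THRESHOLD  3    /* RFC 9002 kPacketThreshold; nginx: NGX_QUIC_PKT_THR */

/* a sent packet is declared lost once an acknowledged packet is at
 * least PKT_THRESHOLD packet numbers ahead of it */
static int
lost_by_packet_threshold(uint64_t pnum, uint64_t largest_acked)
{
    return largest_acked >= pnum + PKT_THRESHOLD;
}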

@pluknet (Contributor) commented Apr 2, 2025

> While benchmarking CUBIC on a local Linux machine with tc-netem, I consistently saw congestion window collapsing without any obvious reason. After debugging this deeper, I realized that nginx considers certain packets lost due to packet reordering, while the window is much below the BDP. RFC 9002 allows 3 later packets to be acknowledged, while in my tests I saw 10-20 packet reordering. I increased NGX_QUIC_PKT_THR from 3 to 100 in my code and CUBIC window became much smoother after that without those spurious packet losses and window collapses. Also download time improved significantly.

On which side have you observed the reordering? Given that nginx is the sender, was it the client ACK packets that were reordered?

In that regard, see also draft-ietf-quic-ack-frequency for the ACK_FREQUENCY frame extension, used to batch acknowledgment of ack-eliciting packets, which may also improve performance by eliminating ACK-only packet reordering. IIRC, there were reports on IETF mailing lists that reducing ACK frequency 10-100x improved performance considerably.

Otherwise, see below for the case where the reordering happened on the nginx (sender) side, with transient gaps reported in ACK frames from the receiver.

Note that the purpose of the RFC 9002 kPacketThreshold is to guide the minimum default. Implementers are free to make it larger:

> Implementations SHOULD NOT use a packet threshold less than 3, to keep in line with TCP [RFC5681].

Also, kPacketThreshold aims to provide a good baseline of reordering resilience, balancing the degraded performance caused by spurious retransmits against recovery latency.

Points to consider when changing NGX_QUIC_PKT_THR (drawing on RFC 9002), or rather making it dynamic:

  • Keeping the value low may harm.
    Spuriously declaring packets lost leads to unnecessary retransmissions and may
    result in degraded performance due to the actions of the congestion controller
    upon detecting loss.
    Some networks may exhibit higher degrees of packet reordering, causing a sender
    to detect spurious losses. Additionally, packet reordering could be more common
    with QUIC than TCP because network elements that could observe and reorder TCP
    packets cannot do that for QUIC and also because QUIC packet numbers are
    encrypted.
  • Increasing the value may harm.
    Implementations that detect spurious retransmissions and
    increase the reordering threshold in packets or time MAY choose to start with
    smaller initial reordering thresholds to minimize recovery latency.
    Keeping recovery latency low is important for latency sensitive streams.
  • In early QUIC drafts, time-based loss detection to handle reordering
    was considered as a replacement for a packet reordering threshold.
  • The RACK function in Linux TCP increases the reordering window up to one
    RTT when packet reordering is detected and thus avoids fast retransmits.
  • There are also mentions on IETF lists of implementing dynamic packet reordering thresholds.

@arut (Contributor, Author) commented Apr 2, 2025

> On which side have you observed reordering? Given that nginx is a sender, were that the client ACK packets reordered?

First, I observed the client ACKing packets normally in ascending order. Then at some point a multi-range ACK comes from the client with a hole in it. There are usually enough packets after the hole (>3) that nginx marks the entire hole as lost (and decreases the congestion window). After that another ACK comes from the client which covers the hole, meaning the hole packets weren't actually lost, but by then it's too late.

@arut (Contributor, Author) commented Apr 2, 2025

> Increasing the value may harm.
> Implementations that detect spurious retransmissions and increase the reordering threshold in packets or time MAY choose to start with smaller initial reordering thresholds to minimize recovery latency.
> Keeping recovery latency low is important for latency sensitive streams.

Increasing the value may slow down retransmission, which seems like a lesser evil compared to collapsing the congestion window. If we start with a smaller packet threshold, we'll have to increase it based on spurious loss detection, which seems rather complex.

@arut (Contributor, Author) commented Apr 3, 2025

Pushed one more patch on top of the series. It implements packet threshold auto-detection, since according to my tests 3 is not enough. The improvement is smaller than hardcoding 100 instead of 3, but it's still faster than the original version and contains no hardcoded value.

Update: I keep testing this with larger bandwidths and larger files, and it looks like removing the packet threshold altogether gives a huge speed boost.

Update 2: the worst delay we may experience after removing the packet threshold is RTT/8, since a later packet has already been ACKed, which took 1 RTT. The time threshold is roughly equal to 9/8 RTT, so what remains is RTT/8.
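
A sketch of the two thresholds discussed in this thread (assumed names and types, not the final patch): the RFC 9002 time threshold of 9/8 of the greater of smoothed and latest RTT, and a dynamic packet threshold of half the in-flight packet count with a minimum of 3.

#include <stdint.h>

/* dynamic packet threshold: half the number of in-flight packets,
 * but never below the RFC 9002 minimum of 3 */
static uint64_t
dynamic_pkt_threshold(uint64_t in_flight)
{
    uint64_t  thr;

    thr = in_flight / 2;

    return thr < 3 ? 3 : thr;
}

/* RFC 9002 time threshold: 9/8 * max(smoothed_rtt, latest_rtt) */
static uint64_t
time_threshold_ms(uint64_t smoothed_rtt, uint64_t latest_rtt)
{
    uint64_t  rtt;

    rtt = smoothed_rtt > latest_rtt ? smoothed_rtt : latest_rtt;

    return rtt + rtt / 8;
}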

@Maryna-f5 added this to the nginx-1.27.5 milestone Apr 10, 2025
@pluknet (Contributor) commented Apr 11, 2025

c6cafd1 looks good to me.

c6cafd1 (QUIC: cache MTU until packet loss.) could be merged into 2f67d92 (QUIC: use path MTU in congestion window computations.), with 4a6b446 (QUIC: CUBIC congestion control.) then based on top of it.

This would make the series more consistent, IMHO.

@pluknet (Contributor) commented Apr 11, 2025

faeed91:

While here, also fixed recovery_start wrap protection. Previously it used
2 * max_idle_timeout time frame for all sent frames, whcih is not a
reliable protection.

Could you clarify in the commit log why it is not reliable?

s/ngx_currnt_msec/ngx_current_msec/
s/whcih/which/
s/ocur/occur/

ngx_quic_revert_send(c, ctx, preserved_pnum[i]);
}

ngx_quic_revert_send(c, preserved_pnum);
Contributor commented:

Note that preserved_pnum is not fully initialized here if pad < 3 and so may corrupt certain ctx->pnum with zero value.
For example, this will overwrite ctx->pnum for application level in ngx_quic_revert_send(), which is possible if send_ctx[3]->sending is not empty.

That said, preserved_pnum could be precalculated to avoid corruption:

diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index c2282a5f7..7462e0633 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -127,9 +127,9 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
     cg = &qc->congestion;
     path = qc->path;
 
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
+    for (i = 0; i < NGX_QUIC_SEND_CTX_LAST; i++) {
+        preserved_pnum[i] = qc->send_ctx[i].pnum;
+    }
 
     while (cg->in_flight < cg->window) {
 
@@ -143,8 +143,6 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
 
             ctx = &qc->send_ctx[i];
 
-            preserved_pnum[i] = ctx->pnum;
-
             if (ngx_quic_generate_ack(c, ctx) != NGX_OK) {
                 return NGX_ERROR;
             }

OTOH, this specific condition for calling ngx_quic_revert_send() appears to never be true after 4f3707c, which replaced send-queue-based PTO probes with a direct send, so it can simply be removed instead.

Also, while looking into this, I noticed that PTO probes for Initial packets aren't expanded anymore after 4f3707c.
A quick'n'dirty fix, for the sake of clarity:

diff --git a/src/event/quic/ngx_event_quic_ack.c b/src/event/quic/ngx_event_quic_ack.c
index a6f34348b..6ed411e9f 100644
--- a/src/event/quic/ngx_event_quic_ack.c
+++ b/src/event/quic/ngx_event_quic_ack.c
@@ -928,7 +928,7 @@ ngx_quic_pto_handler(ngx_event_t *ev)
             f->type = NGX_QUIC_FT_PING;
             f->ignore_congestion = 1;
 
-            if (ngx_quic_frame_sendto(c, f, 0, qc->path) == NGX_ERROR) {
+            if (ngx_quic_frame_sendto(c, f, 1200 * !i, qc->path) == NGX_ERROR) {
                 goto failed;
             }
         }

Contributor commented:

For the record:
after the second review, preserved_pnum[] appears to be used correctly here.

The two other items remain valid, to be addressed separately.

arut added 5 commits April 14, 2025 17:25
Improved logging to simplify data extraction for plotting congestion
window graphs.  In particular, added the current millisecond value from
ngx_current_msec.

While here, simplified logging text and removed irrelevant data.
Previously, the expiration caused QUIC connection finalization even if
there were application-terminated streams still finishing sending data.
Such finalization terminated these streams.

An easy way to trigger this is to request a large file over HTTP/3 with
a small MTU.  In this case keepalive timeout expiration may abruptly
terminate the request stream.
As per RFC 9002, Section B.2, max_datagram_size used in congestion window
computations should be based on path MTU.
Since the recovery_start field was initialized with ngx_current_msec, all
congestion events that happened within the same millisecond or cycle
iteration were treated as being in recovery mode.

Also, when handling persistent congestion, initializing recovery_start
with ngx_current_msec resulted in treating all sent packets as being in
recovery mode, which violates RFC 9002; see the example in Appendix B.8.

While here, also fixed recovery_start wrap protection.  Previously it used
2 * max_idle_timeout time frame for all sent frames, which is not a
reliable protection since max_idle_timeout is unrelated to congestion
control.  Now recovery_start <= now condition is enforced.  Note that
recovery_start wrap is highly unlikely and can only occur on a
32-bit system if there are no congestion events for 24 days.
On some systems the value of ngx_current_msec is derived from a monotonic
clock, for which the following is defined by POSIX:

   For this clock, the value returned by clock_gettime() represents
   the amount of time (in seconds and nanoseconds) since an unspecified
   point in the past.

As a result, overflow protection is needed when comparing two ngx_msec_t
values.  The change adds such protection to the ngx_quic_detect_lost()
function.
@arut force-pushed the quic-cubic branch 2 times, most recently from 1172354 to 85927bf on April 14, 2025 16:49
Comment on lines 338 to 359
preserved_pnum = ctx->pnum;
#if (NGX_SUPPRESS_WARN)
ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
#endif

level = 2; /* application */
preserved_pnum[level] = ctx->pnum;
Contributor commented:

The function is always called under the condition that only application-level frames exist.
That said, we can initialize only the relevant parts of preserved_pnum.

diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index e024fd012..a92a539f3 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -331,13 +331,13 @@ ngx_quic_create_segments(ngx_connection_t *c)
     size_t                  len, segsize;
     ssize_t                 n;
     u_char                 *p, *end;
-    uint64_t                preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
     ngx_uint_t              nseg, level;
     ngx_quic_path_t        *path;
     ngx_quic_send_ctx_t    *ctx;
     ngx_quic_congestion_t  *cg;
     ngx_quic_connection_t  *qc;
     static u_char           dst[NGX_QUIC_MAX_UDP_SEGMENT_BUF];
+    static uint64_t         preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
 
     qc = ngx_quic_get_connection(c);
     cg = &qc->congestion;
@@ -355,11 +355,7 @@ ngx_quic_create_segments(ngx_connection_t *c)
 
     nseg = 0;
 
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
-
-    level = 2; /* application */
+    level = ctx - qc->send_ctx;
     preserved_pnum[level] = ctx->pnum;
 
     for ( ;; ) {

Author commented:

OK, applied.

arut added 7 commits April 15, 2025 18:13
Previously, these functions operated on a per-level basis.  This, however,
resulted in excessive logging of in_flight and would also lead to extra
work detecting an underutilized congestion window in the follow-up patches.
As per RFC 9002, Section 7.8, congestion window should not be increased
when it's underutilized.
As per RFC 9000, Section 14.4:

    Loss of a QUIC packet that is carried in a PMTU probe is therefore
    not a reliable indication of congestion and SHOULD NOT trigger a
    congestion control reaction.
If the connection is network-limited, MTU probes have little chance of
being sent since the congestion window is almost always full.  As a result,
PMTUD may not be able to reach the real MTU and the connection may operate
with a reduced MTU.  The solution is to ignore the congestion window when
sending MTU probes.  This may lead to a temporary increase of the in-flight
count beyond the congestion window.
Previously the threshold was hardcoded at 10000.  This value is too low
for high-BDP networks.  For example, if all frames are STREAM frames and
the MTU is 1500, the upper limit for the congestion window would be roughly
15M (10000 * 1500).  With 100ms RTT that's just a 1.2Gbps network
(15M * 10 * 8).  In reality, the limit is even lower because of other frame
types.  Also, the number of frames that can be used simultaneously depends
on the total amount of data buffered in all server streams, and on client
flow control.

The change sets the frame threshold based on max concurrent streams and
stream buffer size, the product of which is the maximum amount of in-flight
stream data across all server streams at any moment.  The value is divided
by 2000 to account for a typical MTU of 1500 and the fact that not all
frames are STREAM frames.
RFC 9002, Section 6.1.1 defines the packet reordering threshold as 3.
Testing shows that such a low value leads to spurious packet losses
followed by congestion window collapse.  The change implements dynamic
packet threshold detection based on the in-flight packet range.  The packet
threshold is defined as half the number of in-flight packets, with a
minimum value of 3.

Also, renamed ngx_quic_lost_threshold() to ngx_quic_time_threshold()
for better compliance with RFC 9002 terms.
@arut merged commit aa49a41 into nginx:master Apr 15, 2025
1 check passed
@arut deleted the quic-cubic branch April 15, 2025 15:01
Successfully merging this pull request may close these issues: CUBIC congestion control for QUIC

3 participants