
CUBIC congestion control in QUIC #443


Merged · 12 commits merged on Apr 15, 2025

Conversation

@arut (Contributor) commented Jan 11, 2025

Description

CUBIC congestion control is described in RFC 9438. Currently nginx implements Reno congestion control for QUIC, as described in RFC 9002. CUBIC is an improvement over Reno, especially for high-BDP and high-RTT paths.
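
For reference, RFC 9438 defines CUBIC's window growth as W_cubic(t) = C*(t - K)^3 + W_max, with K = cbrt((W_max - cwnd_epoch)/C), where cwnd_epoch is the window at the start of the current congestion avoidance phase. A minimal floating-point sketch of that formula (illustration only; the names are made up and this is not the nginx code from this PR):

#include <math.h>

#define CUBIC_C  0.4    /* constant C from RFC 9438 */

/* congestion window (in MSS) t seconds into congestion avoidance;
 * w_max is the window just before the last reduction,
 * cwnd_epoch is the window right after that reduction */
static double
cubic_window(double t, double w_max, double cwnd_epoch)
{
    double  k;

    k = cbrt((w_max - cwnd_epoch) / CUBIC_C);

    return CUBIC_C * pow(t - k, 3.0) + w_max;
}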

This PR includes several preparatory patches plus the CUBIC congestion control patch itself. The first patch changes nginx debug logging to allow plotting the congestion control window. Some congestion control graphs are shown below.

Closes #442.

Testing environment

All testing is done on the local interface with MTU 1500. Traffic is controlled by tc-netem, graphs are plotted with gnuplot, and gtlsclient is used as the HTTP/3 client. A 50M file is downloaded for testing.

$ sudo ifconfig lo mtu 1500
$ sudo tc qdisc add dev lo root netem limit 300 delay 100ms
$ gtlsclient -q --exit-on-first-stream-close 127.0.0.1 8443 https://example.com/50m 

nginx.conf:

master_process off;
daemon off;

error_log logs/error.log debug;

events {
}

http {
    server {
        listen 8443 quic;

        http3_stream_buffer_size 10m;

        ssl_certificate certs/example.com.crt;
        ssl_certificate_key certs/example.com.key;

        location / {
            root html;
        }
    }
}

After running a test, the following script uses gnuplot to generate a png with the graph.

#!/bin/bash

ERROR_LOG=logs/error.log
CUBIC_SCRIPT=/tmp/cubic.gnuplot
CUBIC_DATA=/tmp/cubic.dat
CUBIC_PNG=cubic.png

rm $CUBIC_PNG || true

awk 'BEGIN{t=0} /congestion ack.*win/{match($0, /ack ([^ ]+) /, s); match($0, /win:([0-9]+)/, m); match($0, / t:([0-9]+)/, n); if (t==0) {t=n[1];} print n[1]-t, m[1], s[1]}' $ERROR_LOG >$CUBIC_DATA

cat >$CUBIC_SCRIPT <<END
set title "QUIC congestion control"
set xlabel ""
set ylabel ""
set format y '%.1s%cB'
set format x '%0.0ss'
set yrange [0:]
set term png
set output "$CUBIC_PNG"

map_color(string) = (              \
    string eq 'ss'    ? 0x00ff00 : \
    string eq 'cubic' ? 0x0000ff : \
    string eq 'reno'  ? 0xff0000 : \
    string eq 'idle'  ? 0x000000 : \
    string eq 'rec'   ? 0x444444 : \
    0x888888)

plot "$CUBIC_DATA" using 1:2:(map_color(stringcolumn(3))) title "cwnd" with linespoints lc rgbcolor variable
END

gnuplot $CUBIC_SCRIPT

Different parts of the graph are colored according to the congestion control state:

  • green: slow start
  • blue: CUBIC
  • red: Reno (including Reno-friendliness region in CUBIC)
  • black: application-limited region
  • grey: recovery

The graph plots the congestion window size over time, from session start to end.

Graphs

Current congestion control (Reno):
[graph: original]

Reno with fixed aggressiveness. The excessive aggressiveness was due to using the maximum UDP packet size instead of the current MTU (QUIC: use path MTU in congestion window computations):
[graph: reno]

CUBIC congestion control:
[graph: cubic]

@arut added the feature label Jan 11, 2025
@arut requested a review from pluknet January 11, 2025 13:43
@Maryna-f5 modified the milestones: nginx-1.27.4, nginx-1.27.5 Jan 14, 2025
@Maryna-f5 removed this from the nginx-1.27.5 milestone Feb 6, 2025
@pluknet (Contributor) commented Feb 28, 2025

Regarding the 2nd patch: good catch!

With what keepalive_timeout, response size, and network setup do you observe such premature connection close?

While I agree that application-level closure shouldn't affect the transport layer, AFAIU, to step into this, (re-)transmitting STREAM frames should take longer than the QUIC idle timeout (typically >> 0s), as borrowed from keepalive_timeout, which is quite unusual to observe.

FTR, I simulated such behavior locally with a large response that fits http3_stream_buffer_size / output_buffers (to close the request ASAP and set the keepalive timer), and 30% packet loss / a 32k-limited sliding stream window to cause extra round trips.
This adds roughly 30-45s of extra time to deliver the stream.

@arut (Contributor, Author) commented Mar 3, 2025

> Regarding the 2nd patch: good catch!
>
> With what keepalive_timeout, response size, and network setup do you observe such premature connection close?
>
> While I agree that application-level closure shouldn't affect transport layer, AFAIU, to step into this, (re-)transmitting STREAM frames should take longer than QUIC idle timeout (typically >> 0s), as borrowed from keepalive_timeout, which is quite unusual to observe.

It's not related to the idle timeout. QUIC packets are sent slowly to the client and ACKs keep coming back, so the idle timeout does not expire.

> FTR, I simulated such behavior locally with large response that fits http3_stream_buffer_size / output_buffers to close request ASAP and set keepalive timer, and 30% packet loss / 32k limited sliding stream window to cause extra round trips. This makes near 30-45s extra time to deliver stream.

I don't remember the exact settings for this, but a small MTU (1200-1500) and a large response (50-100M) triggered the issue quickly.

@arut (Contributor, Author) commented Mar 6, 2025

> I don't remember the exact setting for this, but a small MTU (1200-1500) and a large response (50-100M) triggered the issue quickly.

For this particular issue I think I used qdisc netem with low bandwidth and no losses. After the last request finished, the entire response was buffered in the QUIC layer and the stream was closed at the server; it then took nginx quite a long time to send the outstanding data. Hitting the keepalive timeout stopped the sending process.

@arut (Contributor, Author) commented Mar 6, 2025

Following a report in #442 (comment), I updated the last patch to ignore the case when cg->window >= cg->w_max. This can happen when a congestion event occurs repeatedly with a very small window. The cubic time function was not equipped to handle a negative argument, which eventually resulted in division by zero. The fix treats a negative argument as zero and continues with only the concave part of the cubic function.
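
A minimal sketch of the kind of guard described above (the helper name is made up; this is not the actual patch, which operates on nginx's internal fixed-point types):

#include <math.h>

/* hypothetical helper: compute K while treating a negative argument
 * (which arises when window >= w_max) as zero, so only the concave
 * part of the cubic curve is used */
static double
cubic_k(double w_max, double window)
{
    double  arg;

    arg = (w_max - window) / 0.4;   /* 0.4 is the CUBIC constant C */

    if (arg < 0) {
        arg = 0;                    /* clamp the negative argument */
    }

    return cbrt(arg);
}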

@pluknet (Contributor) commented Mar 6, 2025

> Following a report here #442 (comment), I updated the last patch to ignore the case when cg->window >= cg->w_max. This can happen when a congestion event occurs repeatedly with a very small window. The cubic time function was not equipped to handle a negative argument, which eventually resulted in division by zero. The fix is to consider negative argument as zero and continue with only the concave part of the cubic function.

For the record, as discussed elsewhere.

I support the idea of falling back to the non-"fast convergence" case if Wmax happens to be lower, i.e. avoiding a further reduction of Wmax.
Also, such a reduction barely has a physical interpretation.

This was probably overlooked in RFC 9438, sections 4.6 / 4.7, which suggest "further reducing Wmax" (uncapped) before the window reduction (capped to 2 MSS).
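
A sketch of the fast-convergence step with the guard discussed above (assumed names, floating point for brevity; not the nginx patch):

#define CUBIC_BETA  0.7     /* beta_cubic from RFC 9438 */

/* on a congestion event: record w_max, optionally applying fast
 * convergence, then reduce the window multiplicatively */
static void
cubic_congestion_event(double *w_max, double *window)
{
    if (*window < *w_max) {
        /* fast convergence: remember a slightly smaller w_max */
        *w_max = *window * (1.0 + CUBIC_BETA) / 2.0;

    } else {
        /* window >= w_max: skip the extra reduction, just record it */
        *w_max = *window;
    }

    *window *= CUBIC_BETA;      /* multiplicative decrease */
}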

@arut (Contributor, Author) commented Mar 28, 2025

While benchmarking CUBIC on a local Linux machine with tc-netem, I consistently saw the congestion window collapsing for no obvious reason. After debugging this deeper, I realized that nginx considers certain packets lost due to packet reordering, while the window is well below the BDP. RFC 9002 allows 3 later packets to be acknowledged before a packet is declared lost, while in my tests I saw reordering of 10-20 packets.

I increased NGX_QUIC_PKT_THR from 3 to 100 in my code, and the CUBIC window became much smoother, without those spurious packet losses and window collapses. Download time also improved significantly. This change needs to be tested in various environments and can potentially improve QUIC performance. The RFC is not strict about packet reordering conditions and references RACK (RFC 8985), which has only time-based packet loss detection; that is probably the right way to go.
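
For context, the packet-threshold rule from RFC 9002 that NGX_QUIC_PKT_THR controls boils down to a check like the following (sketch with assumed names, not the nginx code):

#include <stdint.h>

#define PKT_THRESHOLD  3    /* RFC 9002 kPacketThreshold; nginx: NGX_QUIC_PKT_THR */

/* a sent packet is declared lost once an acknowledged packet is at
 * least PKT_THRESHOLD packet numbers ahead of it */
static int
lost_by_packet_threshold(uint64_t pnum, uint64_t largest_acked)
{
    return largest_acked >= pnum + PKT_THRESHOLD;
}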

@pluknet (Contributor) commented Apr 2, 2025

> While benchmarking CUBIC on a local Linux machine with tc-netem, I consistently saw congestion window collapsing without any obvious reason. After debugging this deeper, I realized that nginx considers certain packets lost due to packet reordering, while the window is much below the BDP. RFC 9002 allows 3 later packets to be acknowledged, while in my tests I saw 10-20 packet reordering. I increased NGX_QUIC_PKT_THR from 3 to 100 in my code and CUBIC window became much smoother after that without those spurious packet losses and window collapses. Also download time improved significantly.

On which side have you observed the reordering? Given that nginx is the sender, was it the client ACK packets that were reordered?

In that regard, see also draft-ietf-quic-ack-frequency for the ACK_FREQUENCY frame extension, used to batch acknowledgment of ack-eliciting packets, which may also improve performance by eliminating ACK-only packet reordering. IIRC, there were reports on IETF mailing lists that reducing ACK frequency 10-100x improved performance considerably.

Otherwise, see below for the case where the reordering happened on the nginx (sender) side, with transient gaps reported in ACK frames from the receiver.

Note that the purpose of the RFC 9002 kPacketThreshold is to guide the minimum default. Implementers are free to make it larger:

> Implementations SHOULD NOT use a packet threshold less than 3, to keep in line with TCP [RFC5681].

Also, kPacketThreshold aims to provide a good baseline of reordering resilience, balancing the degraded performance caused by spurious retransmits against recovery latency.

Points to consider when changing NGX_QUIC_PKT_THR (drawing on RFC 9002), or rather making it dynamic:

  • Keeping the value low may harm.
    Spuriously declaring packets lost leads to unnecessary retransmissions and may
    result in degraded performance due to the actions of the congestion controller
    upon detecting loss.
    Some networks may exhibit higher degrees of packet reordering, causing a sender
    to detect spurious losses. Additionally, packet reordering could be more common
    with QUIC than TCP because network elements that could observe and reorder TCP
    packets cannot do that for QUIC and also because QUIC packet numbers are
    encrypted.
  • Increasing the value may harm.
    Implementations that detect spurious retransmissions and
    increase the reordering threshold in packets or time MAY choose to start with
    smaller initial reordering thresholds to minimize recovery latency.
    Keeping recovery latency low is important for latency sensitive streams.
  • In early QUIC drafts, time-based loss detection to handle reordering
    was considered as a replacement for a packet reordering threshold.
  • The RACK function in Linux TCP increases the reordering window up to one
    RTT when packet reordering is detected and thus avoids fast retransmits.
  • There are also mentions on IETF lists of implementing dynamic packet reordering thresholds.

@arut (Contributor, Author) commented Apr 2, 2025

> On which side have you observed reordering? Given that nginx is a sender, were that the client ACK packets reordered?

First, I observed the client ACKing packets normally in ascending order. Then at some point a multi-range ACK comes from the client with a hole in it. There are usually enough packets after the hole (>3) that nginx marks the entire hole as lost (and decreases the congestion window). After that another ACK comes from the client which covers the hole, meaning the hole packets weren't actually lost, but by then it's too late.

@arut (Contributor, Author) commented Apr 2, 2025

> Increasing the value may harm.
> Implementations that detect spurious retransmissions and increase the reordering threshold in packets or time MAY choose to start with smaller initial reordering thresholds to minimize recovery latency.
> Keeping recovery latency low is important for latency sensitive streams.

Increasing the value may slow down retransmission, which seems like a lesser evil compared to collapsing the congestion window. If we start with a smaller packet threshold, we'll have to increase it based on spurious loss detection, which seems rather complex.

@arut (Contributor, Author) commented Apr 3, 2025

Pushed one more patch on top of the series. It implements packet threshold auto-detection, since according to my tests 3 is not enough. The improvement is smaller than hardcoding 100 instead of 3, but it's still faster than the original version and contains no hardcoded value.

Update: I keep testing this with larger bandwidths and larger files, and it looks like removing the packet threshold altogether gives a huge speed boost.

Update 2: the worst delay we may experience after removing the packet threshold is RTT/8, since a later packet has already been ACKed, which took 1 RTT. The time threshold is roughly equal to 9/8 RTT, so what remains is RTT/8.
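
A sketch of the two thresholds discussed in this thread (assumed names and types, not the final patch): the RFC 9002 time threshold of 9/8 of the greater of smoothed and latest RTT, and a dynamic packet threshold of half the in-flight packet count with a minimum of 3.

#include <stdint.h>

/* dynamic packet threshold: half the number of in-flight packets,
 * but never below the RFC 9002 minimum of 3 */
static uint64_t
dynamic_pkt_threshold(uint64_t in_flight)
{
    uint64_t  thr;

    thr = in_flight / 2;

    return thr < 3 ? 3 : thr;
}

/* RFC 9002 time threshold: 9/8 * max(smoothed_rtt, latest_rtt) */
static uint64_t
time_threshold_ms(uint64_t smoothed_rtt, uint64_t latest_rtt)
{
    uint64_t  rtt;

    rtt = smoothed_rtt > latest_rtt ? smoothed_rtt : latest_rtt;

    return rtt + rtt / 8;
}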

@Maryna-f5 added this to the nginx-1.27.5 milestone Apr 10, 2025
@pluknet (Contributor) commented Apr 11, 2025

c6cafd1 looks good to me.

c6cafd1 (QUIC: cache MTU until packet loss.) could be merged into 2f67d92 (QUIC: use path MTU in congestion window computations.), with 4a6b446 (QUIC: CUBIC congestion control.) then based on top of it.

This would make the series more consistent, IMHO.

@pluknet (Contributor) commented Apr 11, 2025

faeed91:

While here, also fixed recovery_start wrap protection. Previously it used
2 * max_idle_timeout time frame for all sent frames, whcih is not a
reliable protection.

Could you clarify in the commit log why it is not reliable?

s/ngx_currnt_msec/ngx_current_msec/
s/whcih/which/
s/ocur/occur/

ngx_quic_revert_send(c, ctx, preserved_pnum[i]);
}

ngx_quic_revert_send(c, preserved_pnum);
Contributor commented:

Note that preserved_pnum is not fully initialized here if pad < 3 and so may corrupt certain ctx->pnum with zero value.
For example, this will overwrite ctx->pnum for application level in ngx_quic_revert_send(), which is possible if send_ctx[3]->sending is not empty.

That said, preserved_pnum could be precalculated to avoid corruption:

diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index c2282a5f7..7462e0633 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -127,9 +127,9 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
     cg = &qc->congestion;
     path = qc->path;
 
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
+    for (i = 0; i < NGX_QUIC_SEND_CTX_LAST; i++) {
+        preserved_pnum[i] = qc->send_ctx[i].pnum;
+    }
 
     while (cg->in_flight < cg->window) {
 
@@ -143,8 +143,6 @@ ngx_quic_create_datagrams(ngx_connection_t *c)
 
             ctx = &qc->send_ctx[i];
 
-            preserved_pnum[i] = ctx->pnum;
-
             if (ngx_quic_generate_ack(c, ctx) != NGX_OK) {
                 return NGX_ERROR;
             }

OTOH, this specific condition for calling ngx_quic_revert_send() appears to never be true after 4f3707c, which replaced send-queue-based PTO probes with a direct send, so it can simply be removed instead.

Also, while looking into this, I noticed that PTO probes for Initial packets aren't expanded anymore after 4f3707c.
A quick'n'dirty fix, for the sake of clarity:

diff --git a/src/event/quic/ngx_event_quic_ack.c b/src/event/quic/ngx_event_quic_ack.c
index a6f34348b..6ed411e9f 100644
--- a/src/event/quic/ngx_event_quic_ack.c
+++ b/src/event/quic/ngx_event_quic_ack.c
@@ -928,7 +928,7 @@ ngx_quic_pto_handler(ngx_event_t *ev)
             f->type = NGX_QUIC_FT_PING;
             f->ignore_congestion = 1;
 
-            if (ngx_quic_frame_sendto(c, f, 0, qc->path) == NGX_ERROR) {
+            if (ngx_quic_frame_sendto(c, f, 1200 * !i, qc->path) == NGX_ERROR) {
                 goto failed;
             }
         }

Contributor commented:

For the record:
after the second review, preserved_pnum[] appears to be used correctly here.

The two other items remain valid, to be addressed separately.

arut added 5 commits April 14, 2025 17:25
Improved logging to simplify data extraction for plotting congestion
window graphs.  In particular, added the current millisecond value from
ngx_current_msec.

While here, simplified logging text and removed irrelevant data.
Previously, the expiration caused QUIC connection finalization even if
there were application-terminated streams still finishing sending data.
Such finalization terminated these streams.

An easy way to trigger this is to request a large file over HTTP/3 with
a small MTU.  In this case keepalive timeout expiration may abruptly
terminate the request stream.
As per RFC 9002, Section B.2, max_datagram_size used in congestion window
computations should be based on path MTU.
Since the recovery_start field was initialized with ngx_current_msec, all
congestion events that happened within the same millisecond or cycle
iteration were treated as being in recovery mode.

Also, when handling persistent congestion, initializing recovery_start
with ngx_current_msec resulted in treating all sent packets as being in
recovery mode, which violates RFC 9002; see the example in Appendix B.8.

While here, also fixed recovery_start wrap protection.  Previously it used
2 * max_idle_timeout time frame for all sent frames, which is not a
reliable protection since max_idle_timeout is unrelated to congestion
control.  Now recovery_start <= now condition is enforced.  Note that
recovery_start wrap is highly unlikely and can only occur on a
32-bit system if there are no congestion events for 24 days.
On some systems the value of ngx_current_msec is derived from a monotonic
clock, for which the following is defined by POSIX:

   For this clock, the value returned by clock_gettime() represents
   the amount of time (in seconds and nanoseconds) since an unspecified
   point in the past.

As a result, overflow protection is needed when comparing two ngx_msec_t
values.  The change adds such protection to the ngx_quic_detect_lost()
function.
@arut force-pushed the quic-cubic branch 2 times, most recently from 1172354 to 85927bf on April 14, 2025 16:49
Comment on lines 338 to 359
preserved_pnum = ctx->pnum;
#if (NGX_SUPPRESS_WARN)
ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
#endif

level = 2; /* application */
preserved_pnum[level] = ctx->pnum;
Contributor commented:

The function is always called under the condition that only application-level frames exist.
That said, we can initialize only the relevant parts of preserved_pnum.

diff --git a/src/event/quic/ngx_event_quic_output.c b/src/event/quic/ngx_event_quic_output.c
index e024fd012..a92a539f3 100644
--- a/src/event/quic/ngx_event_quic_output.c
+++ b/src/event/quic/ngx_event_quic_output.c
@@ -331,13 +331,13 @@ ngx_quic_create_segments(ngx_connection_t *c)
     size_t                  len, segsize;
     ssize_t                 n;
     u_char                 *p, *end;
-    uint64_t                preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
     ngx_uint_t              nseg, level;
     ngx_quic_path_t        *path;
     ngx_quic_send_ctx_t    *ctx;
     ngx_quic_congestion_t  *cg;
     ngx_quic_connection_t  *qc;
     static u_char           dst[NGX_QUIC_MAX_UDP_SEGMENT_BUF];
+    static uint64_t         preserved_pnum[NGX_QUIC_SEND_CTX_LAST];
 
     qc = ngx_quic_get_connection(c);
     cg = &qc->congestion;
@@ -355,11 +355,7 @@ ngx_quic_create_segments(ngx_connection_t *c)
 
     nseg = 0;
 
-#if (NGX_SUPPRESS_WARN)
-    ngx_memzero(preserved_pnum, sizeof(preserved_pnum));
-#endif
-
-    level = 2; /* application */
+    level = ctx - qc->send_ctx;
     preserved_pnum[level] = ctx->pnum;
 
     for ( ;; ) {

Author commented:

OK, applied.

arut added 7 commits April 15, 2025 18:13
Previously, these functions operated on a per-level basis.  This, however,
resulted in excessive logging of in_flight and would also lead to extra
work detecting an underutilized congestion window in the follow-up patches.
As per RFC 9002, Section 7.8, congestion window should not be increased
when it's underutilized.
As per RFC 9000, Section 14.4:

    Loss of a QUIC packet that is carried in a PMTU probe is therefore
    not a reliable indication of congestion and SHOULD NOT trigger a
    congestion control reaction.
If the connection is network-limited, MTU probes have little chance of
being sent since the congestion window is almost always full.  As a result,
PMTUD may not be able to reach the real MTU and the connection may operate
with a reduced MTU.  The solution is to ignore the congestion window when
sending MTU probes.  This may lead to a temporary increase of the in-flight
count beyond the congestion window.
Previously the threshold was hardcoded at 10000.  This value is too low
for high-BDP networks.  For example, if all frames are STREAM frames and
the MTU is 1500, the upper limit for the congestion window would be roughly
15M (10000 * 1500).  With 100ms RTT that's just a 1.2Gbps network
(15M * 10 * 8).  In reality, the limit is even lower because of other frame
types.  Also, the number of frames that can be used simultaneously depends
on the total amount of data buffered in all server streams, and on client
flow control.

The change sets the frame threshold based on max concurrent streams and
stream buffer size, the product of which is the maximum amount of in-flight
stream data across all server streams at any moment.  The value is divided
by 2000 to account for a typical MTU of 1500 and the fact that not all
frames are STREAM frames.
RFC 9002, Section 6.1.1 defines the packet reordering threshold as 3.
Testing shows that such a low value leads to spurious packet losses
followed by congestion window collapse.  The change implements dynamic
packet threshold detection based on the in-flight packet range.  The packet
threshold is defined as half the number of in-flight packets, with a
minimum value of 3.

Also, renamed ngx_quic_lost_threshold() to ngx_quic_time_threshold()
for better compliance with RFC 9002 terms.
@arut merged commit aa49a41 into nginx:master Apr 15, 2025
1 check passed
@arut deleted the quic-cubic branch April 15, 2025 15:01
Successfully merging this pull request may close these issues: CUBIC congestion control for QUIC

3 participants