[p2p/simulated, estimator] simulate bandwidth and message size constraints #1452
Conversation
This is SOOOOO cool
Left 2 small notes, otherwise getting very close
Looks like some failing linting (probably from a doc):
There was already a test for this, but only between two peers, which didn't trigger the issue above. I extended the tests with one-to-many and many-to-one sends, and also verified that pipelining works as expected (i.e. only transmission blocks the pipe, not latency).

Will review your updates shortly 👍
p2p/src/simulated/network.rs (outdated):

```rust
// Always update sender's egress (sender uses bandwidth regardless of
// delivery), this reserves the "pipe" for the duration of the transmission
self.peers.get_mut(&origin).unwrap().egress_available_at = transmission_complete_at;
```
While this makes the approximation better, it blocks all egress from a peer based on the slowest peer they are connecting to (i.e. their next egress_available_at is set to transmission_complete_at).
What I think may simplify this fix (and take it over the finish line) is to loosen the pairwise effective_bps approximation and instead [1] consume the sender egress, [2] put the message into "no mans land", [3] deliver to receiver at transmission_complete_at + latency + ingress_available_at + payload/effective_bps (recipient).
I think there is an open question on ordering there but find the current approach (in my local testing) adds too large of a latency penalty because of this "uniform recipient-biased emission".
I suppose an alternative could be a pairwise reservation of bandwidth that can't exceed the global effective_bps but figured that might be a tad harder.
Lmk what you think (happy to take it from here if you are done with refactors 😅 )
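A rough sketch of the three-step proposal above, with hypothetical names (`Peer`, `schedule`, the millisecond timestamps) that are illustrative only, not the crate's actual API:

```rust
// Illustrative only: not the crate's real types. Models [1] consume sender
// egress, [2] "no man's land" latency, [3] drain through the recipient's
// ingress pipe once it is free.
struct Peer {
    egress_available_at: u64,  // ms timestamp when the egress pipe frees up
    ingress_available_at: u64, // ms timestamp when the ingress pipe frees up
    egress_bps: u64,
    ingress_bps: u64,
}

fn schedule(
    now: u64,
    latency_ms: u64,
    payload_bytes: u64,
    sender: &mut Peer,
    receiver: &mut Peer,
) -> u64 {
    // [1] sender reserves its own pipe at full egress speed
    let start = now.max(sender.egress_available_at);
    let tx_ms = payload_bytes * 8 * 1000 / sender.egress_bps;
    let transmission_complete_at = start + tx_ms;
    sender.egress_available_at = transmission_complete_at;

    // [2] the message crosses the wire ("no man's land")
    let arrive = transmission_complete_at + latency_ms;

    // [3] receiver drains it at its own ingress speed once its pipe is free
    let rx_start = arrive.max(receiver.ingress_available_at);
    let rx_ms = payload_bytes * 8 * 1000 / receiver.ingress_bps;
    let delivered_at = rx_start + rx_ms;
    receiver.ingress_available_at = delivered_at;
    delivered_at
}
```

Note this sketch decouples the sender's egress reservation from the slowest recipient: egress_available_at only advances by the sender's own transmission time, never by a slow receiver's drain time.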
I think this will only solve one side of the issue: we will still have the problem of a slow sender blocking a fast receiver. The fundamental problem is that we are making static scheduling decisions at send time. That works for accurately modeling the send side (you can use your own egress bandwidth to reserve your side of the pipe), but it's not possible to model the receive side statically like this, because other competing sends affect the receiver. For the receive side we need to track the available capacity at any point in time and adjust dynamically. I don't think there's a way around this (despite my efforts to avoid it 😅), so I'll try to come up with something that doesn't blow up complexity too much.
Edit: actually reserving the send side of the pipe at full speed is also unrealistic because of TCP backpressure. But since we're going to have to dynamically reserve capacity on the receive side, we can use the same logic for the send side.
I finally decided to bite the bullet and implement a proper transfer scheduler for reserving capacities on both sides appropriately.

The scheduler uses a delta-based approach to track bandwidth usage changes over time, where positive deltas represent bandwidth allocation and negative deltas represent release. It handles scenarios like multiple concurrent transfers competing for bandwidth, enforcing both sender egress and receiver ingress bandwidth limits simultaneously by taking the minimum available bandwidth at each point in time.

It calculates optimal bandwidth reservations upfront when a transfer begins, creating a series of time-bounded allocations that adapt to the changing availability of bandwidth as other transfers start and complete. The algorithm merges sender and receiver schedules chronologically and iterates through the time windows between bandwidth change events, calculating how much data can be transferred in each window.

There is special handling for zero-bandwidth scenarios: when a sender has 0 bandwidth, transfers block indefinitely (simulating network congestion), while a receiver with 0 bandwidth still allows transfers to complete using only sender bandwidth (simulating one-way communication failure).

At this point I think the PR might, hilariously, have too many tests. Some of them may be redundant; I'll clean that up next as needed.
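A minimal sketch of the delta-based idea described above (illustrative only, not the PR's actual code): reservations store a positive delta at their start time and a negative delta at their end time, so a prefix sum over the deltas yields bandwidth usage at any instant, and the capacity available to a new transfer is the minimum of sender-side and receiver-side headroom.

```rust
use std::collections::BTreeMap;

// Illustrative sketch, not the PR's implementation. Deltas: +bps at the
// start of a reservation, -bps at its end; summing up to time t gives usage.
#[derive(Default)]
struct BandwidthSchedule {
    deltas: BTreeMap<u64, i64>, // time -> signed bandwidth delta (bps)
}

impl BandwidthSchedule {
    /// Reserve `bps` over the half-open window [start, end).
    fn reserve(&mut self, start: u64, end: u64, bps: i64) {
        *self.deltas.entry(start).or_insert(0) += bps; // allocation
        *self.deltas.entry(end).or_insert(0) -= bps;   // release
    }

    /// Bandwidth in use at time `t` (prefix sum of deltas up to `t`).
    fn used_at(&self, t: u64) -> i64 {
        self.deltas.range(..=t).map(|(_, d)| d).sum()
    }
}

/// Headroom for a new sender->receiver transfer at time `t`: the minimum of
/// the two sides' unused capacity.
fn available_at(
    sender: &BandwidthSchedule,
    sender_cap: i64,
    receiver: &BandwidthSchedule,
    receiver_cap: i64,
    t: u64,
) -> i64 {
    (sender_cap - sender.used_at(t)).min(receiver_cap - receiver.used_at(t))
}
```

A scheduler built on this would walk the merged delta timestamps of both sides (the "bandwidth change events") and integrate `available_at` over each window until the payload is fully transferred.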
Force-pushed from 446d709 to 1424bad.
This is phenomenal. Left 1 real question/suggestion: why do we take such care to support the 0 bps case? It seems we could prevent that config in the public interface (and I don't see value to supporting a dead channel).
Maybe I'm missing something?
```rust
//! `commonware-p2p::simulated` can be run deterministically when paired with `commonware-runtime::deterministic`.
//! This makes it possible to reproduce an arbitrary order of delivered/dropped messages with a given seed.
//!
//! # Bandwidth Simulation
```
:chefs-kiss:
p2p/src/simulated/network.rs (outdated):

```rust
    let receiver = self.peers.get(&recipient).unwrap();
    (receiver.ingress_bps, receiver.ingress_available_at)
};
let sender_has_bandwidth = sender_peer.egress.bandwidth_bps > 0;
```
I wonder how much handling we could remove if we required any bandwidth specification to take a non-zero BPS? That seems like a very reasonable expectation?
For completeness, I guess it is worth supporting this to ensure a sender burns their egress (may be useful modeling a sybil attack)? If we don't add a link, the message will never be sent to begin with.
That being said, I suppose we could model a sybil as infinite bandwidth rather than 0. So, I guess my point still stands (why do we support 0 bps as a config)?
We decided to keep the special case of 0 egress/ingress bandwidth, since it adds some functionality compared to removing the link altogether: for example, modeling a scenario where a firewall blocks incoming traffic while outgoing still works fine. All comments addressed.
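A tiny sketch of the asymmetric 0-bps semantics settled on above (names and function are illustrative, not the crate's API): a sender with 0 bps blocks forever, while a receiver with 0 bps still lets the transfer complete at the sender's rate, like a firewall dropping inbound traffic.

```rust
// Illustrative only, not the crate's API. Returns the transfer duration in
// ms, or None when the transfer blocks indefinitely.
fn transfer_ms(payload_bits: u64, sender_bps: u64, receiver_bps: u64) -> Option<u64> {
    if sender_bps == 0 {
        return None; // 0 egress: the message is never emitted
    }
    // 0 ingress: the transfer still completes, paced by sender bandwidth only
    let bps = if receiver_bps == 0 {
        sender_bps
    } else {
        sender_bps.min(receiver_bps)
    };
    Some(payload_bits * 1000 / bps)
}
```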
🚀
Codecov Report

```
@@           Coverage Diff            @@
##             main    #1452    +/-   ##
==========================================
+ Coverage   91.80%   91.97%   +0.17%
==========================================
  Files         280      281       +1
  Lines       70653    72307    +1654
==========================================
+ Hits        64865    66507    +1642
- Misses       5788     5800      +12
```

... and 12 files with indirect coverage changes.
This PR implements bandwidth-aware network simulation in two parts: the p2p/simulated network now supports bandwidth constraints with proper message transmission delays and queueing, and the estimator CLI/DSL has been extended to expose this functionality to users.
Closes #1407