@andresilva
Collaborator

@andresilva andresilva commented Aug 21, 2025

This PR implements bandwidth-aware network simulation in two parts: the p2p/simulated network now supports bandwidth constraints with proper message transmission delays and queueing, and the estimator gains CLI/DSL extensions that expose this functionality to users.

Closes #1407

Contributor

This is SOOOOO cool

Contributor

@patrick-ogrady patrick-ogrady left a comment

Left 2 small notes, otherwise getting very close

@patrick-ogrady
Contributor

Looks like some failing linting (probably from a doc):

error[E0308]: mismatched types
   --> p2p/src/utils/mux.rs:287:18
    |
287 |         latency: 0.0,
    |                  ^^^ expected `Duration`, found floating-point number

error[E0308]: mismatched types
   --> p2p/src/utils/mux.rs:288:17
    |
288 |         jitter: 0.0,
    |                 ^^^ expected `Duration`, found floating-point number

    Checking commonware-consensus v0.0.59 (/home/runner/work/monorepo/monorepo/consensus)
    Checking commonware-sync v0.0.59 (/home/runner/work/monorepo/monorepo/examples/sync)
    Checking commonware-deployer v0.0.59 (/home/runner/work/monorepo/monorepo/deployer)
    Checking commonware-coding v0.0.59 (/home/runner/work/monorepo/monorepo/coding)
For more information about this error, try `rustc --explain E0308`.
error: could not compile `commonware-p2p` (lib test) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
Error: Process completed with exit code 101.
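
Both E0308 errors come from initializing `Duration` fields with float literals. A minimal sketch of the likely fix follows. It assumes a hypothetical `LinkConfig` struct standing in for whatever type is constructed at `mux.rs:287`; only the field names `latency` and `jitter` come from the compiler output, and the real struct probably has more fields.

```rust
use std::time::Duration;

// Hypothetical stand-in for the struct built in p2p/src/utils/mux.rs.
// Only the two field names are taken from the error messages above.
#[derive(Debug, PartialEq)]
struct LinkConfig {
    latency: Duration,
    jitter: Duration,
}

// Build a link with no latency and no jitter, using `Duration` values
// instead of the `0.0` float literals that the compiler rejected.
fn zero_delay_link() -> LinkConfig {
    LinkConfig {
        latency: Duration::ZERO,
        jitter: Duration::ZERO,
    }
}
```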

@andresilva
Collaborator Author

> I suspect it may be worthwhile to add a small unit test that ensures a given number of sends completes within a bounded amount of virtual time (using the deterministic runtime's context.current() call).

There was already a test for this, but only between two peers, which didn't trigger the issue above. I have now extended the tests with one-to-many and many-to-one sends, and added checks that pipelining works as expected (i.e. only transmission blocks the pipe, not latency).
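
The pipelining property (transmission occupies the pipe, latency does not) can be illustrated with a small model. This is a sketch, not the simulator's code: `Pipe`, its fields, and `send` are hypothetical names, bandwidth is assumed to be in bytes per second and non-zero, and virtual time is represented as a `Duration` since the start of the run.

```rust
use std::time::Duration;

// Hypothetical model of a single sender's outgoing pipe.
struct Pipe {
    egress_bps: u64,               // assumed: bytes per second, non-zero
    latency: Duration,             // one-way propagation delay
    egress_available_at: Duration, // virtual time at which the pipe is free
}

impl Pipe {
    // Queue `bytes` at virtual time `now` and return the delivery time.
    fn send(&mut self, now: Duration, bytes: u64) -> Duration {
        // Transmission cannot start until earlier sends have left the pipe.
        let start = now.max(self.egress_available_at);
        let nanos = bytes as u128 * 1_000_000_000 / self.egress_bps as u128;
        let transmission_complete_at = start + Duration::from_nanos(nanos as u64);
        // Only the transmission time reserves the pipe...
        self.egress_available_at = transmission_complete_at;
        // ...latency is added on top without delaying the next send.
        transmission_complete_at + self.latency
    }
}
```

With 1000 bytes/s and 50 ms latency, two back-to-back 500-byte sends are delivered at 550 ms and 1050 ms. If latency also blocked the pipe, the second delivery would slip to 1100 ms.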

@patrick-ogrady
Contributor

Will review your updates shortly 👍


// Always update sender's egress (sender uses bandwidth regardless of
// delivery), this reserves the "pipe" for the duration of the transmission
self.peers.get_mut(&origin).unwrap().egress_available_at = transmission_complete_at;
Contributor

While this makes the approximation better, it blocks all egress from a peer based on the slowest peer they are connecting to (i.e. their next egress_available_at is set to transmission_complete_at).

What I think may simplify this fix (and take it over the finish line) is to loosen the pairwise effective_bps approximation and instead [1] consume the sender egress, [2] put the message into "no man's land", [3] deliver to receiver at transmission_complete_at + latency + ingress_available_at + payload/effective_bps (recipient).

I think there is an open question on ordering there, but I find that the current approach (in my local testing) adds too large a latency penalty because of this "uniform recipient-biased emission".

I suppose an alternative could be a pairwise reservation of bandwidth that can't exceed the global effective_bps but figured that might be a tad harder.

Lmk what you think (happy to take it from here if you are done with refactors 😅 )
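
One possible reading of the three steps proposed above, as a sketch. `Peer`, `tx_ms`, and `send_decoupled` are hypothetical names; times are virtual milliseconds, rates are assumed to be non-zero bytes per second, and the sum in step [3] is interpreted as "start draining at the later of arrival and `ingress_available_at`".

```rust
// Hypothetical per-peer state; times are virtual milliseconds.
struct Peer {
    egress_bps: u64,
    ingress_bps: u64,
    egress_available_at: u64,
    ingress_available_at: u64,
}

// Milliseconds needed to move `bytes` at `bps` bytes per second (non-zero).
fn tx_ms(bytes: u64, bps: u64) -> u64 {
    bytes * 1000 / bps
}

// Returns the virtual time at which the message is delivered.
fn send_decoupled(
    sender: &mut Peer,
    receiver: &mut Peer,
    now: u64,
    latency_ms: u64,
    bytes: u64,
) -> u64 {
    // [1] Consume sender egress at the sender's own rate only.
    let start = now.max(sender.egress_available_at);
    let transmission_complete_at = start + tx_ms(bytes, sender.egress_bps);
    sender.egress_available_at = transmission_complete_at;
    // [2] The message is in flight for the link latency.
    let arrival = transmission_complete_at + latency_ms;
    // [3] The receiver drains it once its ingress is free.
    let ingress_start = arrival.max(receiver.ingress_available_at);
    let delivered_at = ingress_start + tx_ms(bytes, receiver.ingress_bps);
    receiver.ingress_available_at = delivered_at;
    delivered_at
}
```

In this model a slow recipient delays only its own delivery: the sender's pipe is released after its own transmission time, so a following send to a fast recipient is not held back.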

Collaborator Author

@andresilva andresilva Aug 27, 2025

I think this will only solve one side of the issue: we will still have the problem of a slow sender blocking a fast receiver. The fundamental problem is that we are making static scheduling decisions at send time. I believe that works for accurately modeling the send side (you can just use your own egress bandwidth to reserve your side of the pipe), but it's not possible to model the receive side statically like this, because other competing sends affect the receiver. For the receive side we need to track the available capacity at any point in time and dynamically adjust based on that. I think there's no way around this (despite my efforts to avoid it 😅), so I'll try to come up with something that doesn't blow up complexity too much.

Edit: actually reserving the send side of the pipe at full speed is also unrealistic because of TCP backpressure. But since we're going to have to dynamically reserve capacity on the receive side, we can use the same logic for the send side.

@andresilva
Collaborator Author

andresilva commented Aug 28, 2025

I finally decided to bite the bullet and implement a proper transfer scheduler for reserving capacities on both sides appropriately. The scheduler uses a delta-based approach to track bandwidth usage changes over time, where positive deltas represent bandwidth allocation and negative deltas represent release. It handles scenarios like multiple concurrent transfers competing for bandwidth.
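
A delta-based usage timeline can be sketched as follows. This is an illustration of the idea, not the PR's `bandwidth.rs`: `UsageTimeline`, `reserve`, and `usage_at` are hypothetical names, and times are virtual milliseconds.

```rust
use std::collections::BTreeMap;

// Hypothetical timeline: each key is a virtual time (ms), each value is the
// net change in reserved bandwidth (bps) that takes effect at that time.
#[derive(Default)]
struct UsageTimeline {
    deltas: BTreeMap<u64, i64>,
}

impl UsageTimeline {
    // Reserve `bps` over [start, end): a positive delta allocates at `start`
    // and a matching negative delta releases at `end`.
    fn reserve(&mut self, start: u64, end: u64, bps: i64) {
        *self.deltas.entry(start).or_insert(0) += bps;
        *self.deltas.entry(end).or_insert(0) -= bps;
    }

    // Bandwidth in use at time `t`: the running sum of all deltas up to `t`.
    fn usage_at(&self, t: u64) -> i64 {
        self.deltas.range(..=t).map(|(_, delta)| *delta).sum()
    }
}
```

Two overlapping reservations, 600 bps over [0, 100) and 300 bps over [50, 150), give a usage of 900 bps at t = 75 and 300 bps at t = 100.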

The scheduler enforces both sender egress and receiver ingress bandwidth limits simultaneously, taking the minimum available bandwidth at each point in time. It calculates optimal bandwidth reservations upfront when a transfer begins, creating a series of time-bounded allocations that adapt to the changing availability of bandwidth as other transfers start and complete. The algorithm merges sender and receiver schedules chronologically and iterates through time windows between bandwidth change events, calculating how much data can be transferred in each window.
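
The window iteration described above might look roughly like this. It is a simplified sketch: it only computes when a transfer would finish and does not record reservations. `rate_at` and `completion_time` are hypothetical names, each schedule is assumed to be a sorted list of `(start_ms, available_rate)` breakpoints, and rates are in bytes per millisecond to keep the arithmetic in integers.

```rust
// Available rate (bytes per ms) at time `t` for a step function given as
// sorted `(start_ms, rate)` breakpoints; zero before the first breakpoint.
fn rate_at(schedule: &[(u64, u64)], t: u64) -> u64 {
    schedule
        .iter()
        .take_while(|(start, _)| *start <= t)
        .last()
        .map_or(0, |(_, rate)| *rate)
}

// Virtual time (ms) at which `bytes` finish transferring, or `None` if the
// transfer never completes under the given schedules.
fn completion_time(
    sender: &[(u64, u64)],
    receiver: &[(u64, u64)],
    start: u64,
    bytes: u64,
) -> Option<u64> {
    if bytes == 0 {
        return Some(start);
    }
    // Merge both sides' bandwidth change events chronologically.
    let mut events: Vec<u64> = sender
        .iter()
        .chain(receiver.iter())
        .map(|(t, _)| *t)
        .filter(|t| *t > start)
        .collect();
    events.sort_unstable();
    events.dedup();

    let mut now = start;
    let mut remaining = bytes;
    for next in events {
        // Within a window the usable rate is the minimum of both sides.
        let rate = rate_at(sender, now).min(rate_at(receiver, now));
        let capacity = rate * (next - now);
        if rate > 0 && remaining <= capacity {
            return Some(now + (remaining + rate - 1) / rate);
        }
        remaining -= capacity;
        now = next;
    }
    // After the last event the rate stays constant forever.
    let rate = rate_at(sender, now).min(rate_at(receiver, now));
    if rate == 0 {
        None
    } else {
        Some(now + (remaining + rate - 1) / rate)
    }
}
```

For example, with a sender at a constant 10 bytes/ms and a receiver that has 4 bytes/ms free until t = 100 and 10 bytes/ms afterwards, a 1000-byte transfer moves 400 bytes in the first window and the remaining 600 bytes in 60 ms, finishing at t = 160.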

There is special handling for zero-bandwidth scenarios: when a sender has 0 bandwidth, transfers block indefinitely (simulating network congestion), while a receiver with 0 bandwidth still allows transfers to complete using only sender bandwidth (simulating one-way communication failure).
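
The zero-bandwidth rules can be summarized as a small decision function. This is a sketch of the behavior as described in this comment, with hypothetical names (`Effective`, `effective_rate`); it says nothing about whether the message is ultimately delivered in the zero-ingress case.

```rust
// Hypothetical result of combining both sides' configured limits.
#[derive(Debug, PartialEq)]
enum Effective {
    // Sender egress is 0: the transfer never makes progress.
    Blocked,
    // The transfer proceeds at this rate (bps).
    Rate(u64),
}

fn effective_rate(sender_egress_bps: u64, receiver_ingress_bps: u64) -> Effective {
    if sender_egress_bps == 0 {
        return Effective::Blocked;
    }
    if receiver_ingress_bps == 0 {
        // Receiver limit is ignored: the transfer completes using only
        // the sender's bandwidth.
        return Effective::Rate(sender_egress_bps);
    }
    // Normal case: the slower side bounds the transfer.
    Effective::Rate(sender_egress_bps.min(receiver_ingress_bps))
}
```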

At this point I think the PR might, hilariously, have too many tests. Some of them may be redundant, I'll clean that up next as needed.

@andresilva andresilva force-pushed the andre/p2p-simulated-bandwidth branch from 446d709 to 1424bad Compare August 28, 2025 17:58
Contributor

@patrick-ogrady patrick-ogrady left a comment

This is phenomenal. Left 1 real question/suggestion: why do we take such care to support the 0 bps case? It seems we could prevent that config in the public interface (and I don't see value in supporting a dead channel).

Maybe I'm missing something?

//! `commonware-p2p::simulated` can be run deterministically when paired with `commonware-runtime::deterministic`.
//! This makes it possible to reproduce an arbitrary order of delivered/dropped messages with a given seed.
//!
//! # Bandwidth Simulation
Contributor

:chefs-kiss:

let receiver = self.peers.get(&recipient).unwrap();
(receiver.ingress_bps, receiver.ingress_available_at)
};
let sender_has_bandwidth = sender_peer.egress.bandwidth_bps > 0;
Contributor

I wonder how much handling we could remove if we required any bandwidth specification to take a non-zero BPS? That seems like a very reasonable expectation?

For completeness, I guess it is worth supporting this to ensure a sender burns their egress (may be useful modeling a sybil attack)? If we don't add a link, the message will never be sent to begin with.

That being said, I suppose we could model a sybil as infinite bandwidth rather than 0. So, I guess my point still stands (why do we support 0 bps as a config)?

@andresilva
Collaborator Author

We decided to still allow the special case of 0 egress/ingress bandwidth, since it adds some functionality compared to removing the link altogether, for example modeling a scenario where a firewall prevents incoming traffic but outgoing still works fine.

All comments addressed.

patrick-ogrady
patrick-ogrady previously approved these changes Aug 29, 2025
Contributor

@patrick-ogrady patrick-ogrady left a comment

🚀

patrick-ogrady
patrick-ogrady previously approved these changes Aug 29, 2025
@patrick-ogrady patrick-ogrady merged commit 4313dba into commonwarexyz:main Aug 29, 2025
37 checks passed
@codecov

codecov bot commented Aug 29, 2025

Codecov Report

❌ Patch coverage is 97.51511% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.97%. Comparing base (4e9f321) to head (3cc8686).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| p2p/src/simulated/mod.rs | 96.91% | 27 Missing ⚠️ |
| p2p/src/simulated/network.rs | 93.49% | 8 Missing ⚠️ |
| p2p/src/simulated/bandwidth.rs | 99.44% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #1452      +/-   ##
==========================================
+ Coverage   91.80%   91.97%   +0.17%     
==========================================
  Files         280      281       +1     
  Lines       70653    72307    +1654     
==========================================
+ Hits        64865    66507    +1642     
- Misses       5788     5800      +12     

| Files with missing lines | Coverage Δ |
|---|---|
| broadcast/src/buffered/mod.rs | 99.52% <100.00%> (ø) |
| collector/src/p2p/mod.rs | 98.79% <ø> (ø) |
| consensus/src/aggregation/mod.rs | 97.91% <100.00%> (ø) |
| consensus/src/marshal/mod.rs | 99.75% <100.00%> (-0.01%) ⬇️ |
| consensus/src/ordered_broadcast/mod.rs | 99.56% <100.00%> (ø) |
| consensus/src/simplex/actors/voter/mod.rs | 98.30% <100.00%> (ø) |
| consensus/src/simplex/mod.rs | 98.20% <100.00%> (ø) |
| ...onsensus/src/threshold_simplex/actors/voter/mod.rs | 98.22% <100.00%> (ø) |
| consensus/src/threshold_simplex/mod.rs | 98.66% <100.00%> (ø) |
| p2p/src/simulated/ingress.rs | 98.97% <100.00%> (+0.19%) ⬆️ |

... and 5 more

... and 12 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Development

Successfully merging this pull request may close these issues.

[estimator] Add Bandwidth Constraints

2 participants