Member

@akerouanton akerouanton commented Aug 4, 2025

- What I did

1st commit: libnet/d/bridge: driver: un-embed mutex

The bridge driver was embedding sync.Mutex, which is unconventional and makes lock ordering harder to analyze.
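To illustrate the difference, here is a minimal sketch comparing an embedded mutex with a named one. The `embedded` and `named` types and their fields are hypothetical stand-ins, not the actual bridge driver struct.

```go
package main

import (
	"fmt"
	"sync"
)

// embedded promotes Lock and Unlock onto the struct itself, so callers
// write e.Lock() and the mutex becomes part of the type's method set.
type embedded struct {
	sync.Mutex
	networks map[string]string
}

// named keeps the mutex as a regular field: every acquisition reads
// n.mu.Lock(), which is easy to search for when auditing lock order.
type named struct {
	mu       sync.Mutex
	networks map[string]string
}

// add stores a network while holding the named mutex.
func (n *named) add(id, name string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.networks[id] = name
}

// count returns the number of stored networks.
func (n *named) count() int {
	n.mu.Lock()
	defer n.mu.Unlock()
	return len(n.networks)
}

func main() {
	e := &embedded{networks: map[string]string{}}
	e.Lock() // promoted method: nothing at the call site names the mutex
	e.networks["id0"] = "bridge0"
	e.Unlock()

	n := &named{networks: map[string]string{}}
	n.add("id1", "bridge1")
	fmt.Println(len(e.networks), n.count())
}
```

With the named field, a search for `mu.Lock()` finds every acquisition, which is what makes an ordering audit practical.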

2nd commit: libnet/d/bridge: handleFirewalldReloadNw: fix deadlock

handleFirewalldReloadNw locks d.mu and then d.configNetwork. However, the rest of the driver locks d.configNetwork first and then d.mu.

This could result in deadlocks if handleFirewalldReloadNw is called while the bridge driver is already holding the d.configNetwork lock.

Other code paths were checked to ensure that they all follow the same locking order.
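As a rough sketch of the ordering rule (not the actual driver code), the example below has every path take `configNetwork` before `mu`. The `driver` type here is a simplified stand-in, and `createNetwork` and `reload` are hypothetical names that only mirror the shape of the real methods.

```go
package main

import (
	"fmt"
	"sync"
)

// driver is a simplified stand-in for the bridge driver.
type driver struct {
	configNetwork sync.Mutex // always acquired first
	mu            sync.Mutex // always acquired second
	networks      map[string]bool
}

// createNetwork takes configNetwork, then mu.
func (d *driver) createNetwork(id string) {
	d.configNetwork.Lock()
	defer d.configNetwork.Unlock()

	d.mu.Lock()
	defer d.mu.Unlock()
	d.networks[id] = true
}

// reload takes the locks in the same order as createNetwork, so the two
// can run concurrently without waiting on each other in a cycle.
func (d *driver) reload() int {
	d.configNetwork.Lock()
	defer d.configNetwork.Unlock()

	d.mu.Lock()
	defer d.mu.Unlock()
	return len(d.networks)
}

func main() {
	d := &driver{networks: map[string]bool{}}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			d.createNetwork(fmt.Sprintf("net%d", i))
		}(i)
		go func() {
			defer wg.Done()
			d.reload()
		}()
	}
	wg.Wait()

	fmt.Println(d.reload())
}
```

Because both paths agree on the order, a goroutine holding `configNetwork` can always obtain `mu` eventually: whoever holds `mu` is never waiting for `configNetwork`.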

Lock ordering analysis

d.configNetwork:

  • (*driver).configure() locks d.configNetwork - d.mu not acquired
  • (*driver).CreateNetwork() locks d.configNetwork - d.mu not acquired
  • (*driver).DeleteNetwork() locks d.configNetwork - d.mu not acquired
  • (*driver).handleFirewalldReloadNw() locks d.configNetwork - d.mu IS acquired
  • (*driver).ProgramExternalConnectivity locks d.configNetwork - d.mu not acquired

d.mu:

  • (*driver).configure() locks d.mu - d.configNetwork not acquired
  • (*driver).CreateEndpoint() locks d.mu - d.configNetwork not acquired
  • (*driver).createNetwork() locks d.mu - d.configNetwork IS acquired by callers:
    • (*driver).CreateNetwork() LOCKS d.configNetwork -> (*driver).createNetwork()
    • (*driver).configure() LOCKS d.configNetwork -> (*driver).initStore() -> (*driver).populateNetworks() -> (*driver).createNetwork()
  • (*driver).CreateNetwork() locks d.mu - d.configNetwork not acquired
  • (*driver).DeleteEndpoint() locks d.mu - d.configNetwork not acquired
  • (*driver).deleteNetwork() locks d.mu - d.configNetwork IS acquired by caller:
    • (*driver).DeleteNetwork() LOCKS d.configNetwork -> (*driver).deleteNetwork()
  • (*driver).EndpointOperInfo() locks d.mu - d.configNetwork not acquired
  • (*driver).getNetwork() locks d.mu - d.configNetwork IS acquired by one of the callers:
    • (*driver).Join() doesn't lock d.configNetwork -> (*driver).getNetwork()
    • (*driver).Leave() doesn't lock d.configNetwork -> (*driver).getNetwork()
    • (*driver).ProgramExternalConnectivity() LOCKS d.configNetwork -> (*driver).getNetwork()
  • (*driver).getNetworks() locks d.mu - d.configNetwork IS acquired by caller:
    • (*driver).CreateNetwork() LOCKS d.configNetwork -> (*driver).checkConflict() -> (*driver).getNetworks()
  • (*driver).handleFirewalldReload() locks d.mu - d.configNetwork not acquired
  • (*driver).handleFirewalldReloadNw() locks d.mu - d.configNetwork locked after d.mu
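
The single inverted entry in this list is enough to deadlock. The sketch below replays that interleaving on two plain mutexes, using `TryLock` (available since Go 1.18) so that the example reports the blocked acquisitions instead of hanging. `simulateInversion` is a hypothetical helper for illustration and is not part of the driver.

```go
package main

import (
	"fmt"
	"sync"
)

// simulateInversion replays the problematic interleaving:
//   - path A (e.g. CreateNetwork) holds configNetwork and then needs mu;
//   - path B (the old handleFirewalldReloadNw) holds mu and then needs
//     configNetwork.
// It reports whether each path's second acquisition would block.
func simulateInversion() (aBlocked, bBlocked bool) {
	var configNetwork, mu sync.Mutex

	configNetwork.Lock() // path A took its first lock
	mu.Lock()            // path B took its first lock

	// Each path now asks for the lock the other one holds.
	aBlocked = !mu.TryLock()
	bBlocked = !configNetwork.TryLock()

	mu.Unlock()
	configNetwork.Unlock()
	return aBlocked, bBlocked
}

func main() {
	a, b := simulateInversion()
	fmt.Println("path A blocked:", a, "path B blocked:", b)
}
```

With real `Lock` calls in place of `TryLock`, both paths would wait forever, since neither releases its first lock before obtaining the second.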

- Human readable description for the release notes

- Fix a deadlock that could happen if a firewalld reload was processed while the bridge networking driver was starting up, or creating or deleting a network, or creating port-mappings

The bridge driver was embedding `sync.Mutex` which is unconventional and
makes it harder to analyze locks ordering.

Signed-off-by: Albin Kerouanton <[email protected]>

handleFirewalldReloadNw locks `d.mu` and then `d.configNetwork`.
However, the rest of the driver locks `d.configNetwork` first and then
`d.mu`.

This could result in deadlocks if `handleFirewalldReloadNw` is called
while the bridge driver is already holding the `d.configNetwork` lock.

Other code paths were checked to ensure that they all follow the same
locking order.

This bug was introduced by commit a527e5a.

Signed-off-by: Albin Kerouanton <[email protected]>
@akerouanton akerouanton added this to the 29.0.0 milestone Aug 4, 2025
@akerouanton akerouanton self-assigned this Aug 4, 2025
@akerouanton akerouanton marked this pull request as ready for review August 4, 2025 10:43
Contributor

@austinvazquez austinvazquez left a comment

Change for consistent lock ordering LGTM.

Comment on lines +165 to +166
// mu is used to protect accesses to config and networks. Do not hold this lock while locking configNetwork.
mu sync.Mutex
Member

FWIW, it's common practice to put the mutex above the fields it protects; in this case (from the changes in this code), it's config and networks, so those could possibly be moved under this one (or the mutex moved to the top).

It's not 100% clear to me what configNetwork protects though. Perhaps worth (could be a follow-up) to add a comment on that as well.
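
One possible layout along those lines, as a sketch only: the field set is reduced, the `configuration` type is a hypothetical placeholder, and the comment on `configNetwork` is a guess based on how the lock is used elsewhere in this PR rather than a confirmed description.

```go
package main

import (
	"fmt"
	"sync"
)

// configuration is a hypothetical placeholder for the driver config.
type configuration struct {
	EnableIPTables bool
}

type driver struct {
	// configNetwork serializes whole-network operations (assumed wording).
	// Acquire it before mu, never after.
	configNetwork sync.Mutex

	// mu protects the fields listed directly below it.
	// Do not hold this lock while locking configNetwork.
	mu       sync.Mutex
	config   configuration
	networks map[string]struct{}
}

// hasNetwork reads networks under mu only.
func (d *driver) hasNetwork(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	_, ok := d.networks[id]
	return ok
}

func main() {
	d := &driver{networks: map[string]struct{}{"net0": {}}}
	fmt.Println(d.hasNetwork("net0"), d.hasNetwork("net1"))
}
```

Grouping the protected fields under their mutex lets a reader see the lock's scope without tracing every call site.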

Comment on lines 1700 to 1706
// Make sure the network isn't being deleted, and ProgramExternalConnectivity
// isn't modifying iptables rules, while restoring the rules.
d.configNetwork.Lock()
defer d.configNetwork.Unlock()

d.mu.Lock()
defer d.mu.Unlock()
Member

FWIW there's other places where we acquire a lock for configNetwork, but don't hold a lock for networks (mu), or at least not for the whole duration:

// start the critical section, from this point onward we are dealing with the list of networks
// so to be consistent we cannot allow that the list changes
d.configNetwork.Lock()
defer d.configNetwork.Unlock()
// check network conflicts
if err = d.checkConflict(config); err != nil {
	return err
}

d.Lock()
if _, ok := d.networks[id]; ok {
	d.Unlock()
	return types.ForbiddenErrorf("network %s exists", id)
}
d.Unlock()
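
For reference, the same shape (outer lock held for the whole operation, inner lock held only briefly) can be written with a small helper so that the inner unlock is a single `defer`. This is a sketch under assumptions: the `driver` type is a reduced stand-in, `networkExists` is a hypothetical helper, and `fmt.Errorf` is swapped in for `types.ForbiddenErrorf` to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// driver is a reduced stand-in for the bridge driver.
type driver struct {
	configNetwork sync.Mutex
	mu            sync.Mutex
	networks      map[string]struct{}
}

// networkExists holds mu only for the duration of the map lookup.
func (d *driver) networkExists(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	_, ok := d.networks[id]
	return ok
}

// createNetwork holds configNetwork for the whole operation and takes mu
// in short sections, always after configNetwork.
func (d *driver) createNetwork(id string) error {
	d.configNetwork.Lock()
	defer d.configNetwork.Unlock()

	if d.networkExists(id) {
		return fmt.Errorf("network %s exists", id)
	}

	d.mu.Lock()
	d.networks[id] = struct{}{}
	d.mu.Unlock()
	return nil
}

func main() {
	d := &driver{networks: map[string]struct{}{}}
	fmt.Println(d.createNetwork("net0"))
	fmt.Println(d.createNetwork("net0"))
}
```

Releasing `mu` between the check and the insert is safe here only because `configNetwork` stays held throughout, so no other creation can slip in between.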

Member Author

Yup, I checked all of them and wrote down my analysis in the PR description. This is the only case where the lock ordering is reversed.

Member

Oh! I missed the collapsed block, nice!

Member

@thaJeztah thaJeztah left a comment

LGTM

but there's definitely improvements to be made; at least the locking strategy seems complex and brittle (and could use touch-ups in documentation)

@thaJeztah thaJeztah merged commit 559c3c7 into moby:master Aug 4, 2025
432 of 448 checks passed
@akerouanton akerouanton deleted the fix-firewalld-reload-deadlock branch August 4, 2025 18:58

Development

Successfully merging this pull request may close these issues.

Deadlock when firewalld is reloaded while the bridge driver is initializing

5 participants