-
Notifications
You must be signed in to change notification settings - Fork 18.9k
libnet/d/bridge: handleFirewalldReloadNw: fix deadlock #50620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libnet/d/bridge: handleFirewalldReloadNw: fix deadlock #50620
Conversation
The bridge driver was embedding `sync.Mutex` which is unconventional and makes it harder to analyze locks ordering. Signed-off-by: Albin Kerouanton <[email protected]>
handleFirewalldReloadNw locks `d.mu` and then `d.configNetworks`. However, the rest of the driver locks `d.configNetworks` first and then `d.mu`. This could result in deadlocks if `handleFirewalldReloadNw` is called while the bridge driver is already holding `d.configNetworks` lock. Other code paths were checked to ensure that they all follow the same locking order. This bug was introduced by commit a527e5a. Signed-off-by: Albin Kerouanton <[email protected]>
austinvazquez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change for consistent lock ordering LGTM.
| // mu is used to protect accesses to config and networks. Do not hold this lock while locking configNetwork. | ||
| mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, it's common practice to put the mutex above the fields it protects; in this case (from the changes in this code), it's config and networks, so those could possibly be moved under this one (or the mutex moved to the top).
It's not 100% clear to me what configNetwork protects though. Perhaps worth (could be a follow-up) to add a comment on that as well.
| // Make sure the network isn't being deleted, and ProgramExternalConnectivity | ||
| // isn't modifying iptables rules, while restoring the rules. | ||
| d.configNetwork.Lock() | ||
| defer d.configNetwork.Unlock() | ||
|
|
||
| d.mu.Lock() | ||
| defer d.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW there's other places where we acquite a lock for configNetwork, but don't hold a lock for networks (mu), or at least not for the whole duration;
moby/daemon/libnetwork/drivers/bridge/bridge_linux.go
Lines 758 to 766 in cf15d5b
| // start the critical section, from this point onward we are dealing with the list of networks | |
| // so to be consistent we cannot allow that the list changes | |
| d.configNetwork.Lock() | |
| defer d.configNetwork.Unlock() | |
| // check network conflicts | |
| if err = d.checkConflict(config); err != nil { | |
| return err | |
| } |
moby/daemon/libnetwork/drivers/bridge/bridge_linux.go
Lines 725 to 730 in cf15d5b
| d.Lock() | |
| if _, ok := d.networks[id]; ok { | |
| d.Unlock() | |
| return types.ForbiddenErrorf("network %s exists", id) | |
| } | |
| d.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I checked all of them and wrote down my analysis in the PR description. This is the only case where the lock ordering is reversed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I missed the collapsed block, nice!
thaJeztah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
but there's definitely improvements to be made; at least the locking strategy seems complex and brittle (and could use touch-ups in documentation)
- What I did
1st commit: libnet/d/bridge: driver: un-embed mutex
The bridge driver was embedding
sync.Mutexwhich is unconventional and makes it harder to analyze locks ordering.2nd commit: libnet/d/bridge: handleFirewalldReloadNw: fix deadlock
handleFirewalldReloadNw locks
d.muand thend.configNetworks. However, the rest of the driver locksd.configNetworksfirst and thend.mu.This could result in deadlocks if
handleFirewalldReloadNwis called while the bridge driver is already holdingd.configNetworkslock.Other code paths were checked to ensure that they all follow the same locking order.
Locks ordering analysis
d.configNetwork:
(*driver).configure()locksd.configNetwork—d.munot acquired(*driver).CreateNetwork()locksd.configNetwork-d.munot acquired(*driver).DeleteNetwork()locksd.configNetwork-d.munot acquired(*driver).handleFirewalldReloadNwlocksd.configNetwork-d.muIS acquired(*driver).ProgramExternalConnectivitylocksd.configNetwork-d.munot acquiredd.mu.Lock:
(*driver).configure()locksd.mu-d.configNetworknot acquired(*driver).CreateEndpoint()locksd.mu-d.configNetworknot acquired(*driver).createNetwork()locksd.mu-d.configNetworkIS acquired by callers:(*driver).CreateNetwork()LOCKSd.configNetwork->(*driver).createNetwork()(*driver).configure()LOCKSd.configNetwork->(*driver).initStore()->(*driver).populateNetworks()->(*driver).createNetwork()(*driver).CreateNetwork()locksd.mu-d.configNetworknot acquired(*driver).DeleteEndpoint()locksd.mu-d.configNetworknot acquired(*driver).deleteNetwork()locksd.mu-d.configNetworkIS acquired by caller:(*driver).DeleteNetwork()LOCKSd.configNetwork->(*driver).deleteNetwork()(*driver).EndpointOperInfo()locksd.mu-d.configNetworknot acquired(*driver).getNetwork()locksd.mu—d.configNetworkIS acquired by caller:(*driver).Join()doesn't lockd.configNetwork->(*driver).getNetwork()(*driver).Leave()doesn't lockd.configNetwork->(*driver).getNetwork()(*driver).ProgramExternalConnectivity()LOCKSd.configNetwork->(*driver).getNetwork()(*driver).getNetworks()locksd.mu—d.configNetworkIS acquired by caller:(*driver).CreateNetwork()LOCKSd.configNetwork->(*driver).checkConflict()->(*driver).getNetworks()(*driver).handleFirewalldReload()locksd.mu—d.configNetworknot acquired(*driver).handleFirewalldReloadNw()locksd.mu-d.configNetworklocked afterd.mu- Human readable description for the release notes
- Fix a deadlock that could happen if a firewalld reload was processed while the bridge networking driver was starting up, or creating or deleting a network, or creating port-mappings