
Conversation

robmry
Contributor

@robmry robmry commented Oct 7, 2024

- What I did

Containers on a network with the option com.docker.network.bridge.gateway_mode_ipv[46]=routed do not have NAT set up for port mappings from the host, but mapped ports are opened in the container's iptables/ip6tables rules, so they can be accessed from a remote host that has routing to the container network (via the docker host).

However, those ports were not accessible from containers on other networks on the same host.

- How I did it

Introduce the use of ipset, with sets containing the subnets of each of the externally-accessible docker networks.

  • This is a new dependency; ipset support needs to be available in the kernel.
  • We should document the new dependency.
  • The Docker-in-Docker image might need to modprobe it (so that a host with an old dockerd that hasn't loaded the module can run a new image). Edit: this isn't necessary, netlink calls from the inner-docker take care of it.

Use those sets in rules matching packets routed to those docker networks so that:

  • only one conntrack RELATED,ESTABLISHED rule is needed in the filter-FORWARD chain
    • previously there was a rule per-bridge
    • the rule can be placed second in the chain (after the unconditional jump to DOCKER-USER), so that related packets don't need to be checked against any other rules.
  • similarly, only one rule is needed for the jump to the DOCKER chain, rather than a rule per-bridge
  • in DOCKER-ISOLATION-STAGE-1:
    • accept conntrack RELATED,ESTABLISHED packets coming from the routed network, so that responses make it back to the network that made the request
    • return early for routed networks, as they're not isolated in the same way as nat networks; port/protocol filtering will still happen in the DOCKER chain.
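
A rough sketch of what this looks like in ipset/iptables terms (illustrative only - the set name, subnets and rule positions here are made up, and dockerd programs the real rules via netlink rather than by shelling out):

# one set per address family, holding the subnets of externally-accessible bridge networks
ipset create docker-ext-bridges-v4 hash:net
ipset add docker-ext-bridges-v4 172.17.0.0/16
ipset add docker-ext-bridges-v4 192.0.2.0/24

# a single conntrack rule and a single jump to DOCKER then cover every network
# in the set, instead of one pair of rules per bridge
iptables -I FORWARD 2 -m set --match-set docker-ext-bridges-v4 dst \
  -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -m set --match-set docker-ext-bridges-v4 dst -j DOCKER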

Also:

  • remove unconditional RETURN rules from the end of the DOCKER-ISOLATION chains
    • they're unnecessary, because the return happens anyway
    • removing them makes it possible to insert rules that accept packets (the conntrack rules allowing responses from routed networks), and append rules that return early or drop packets (the early return for routed networks).
  • Related code-tidying.

- How to verify it

New tests.

- Description for the changelog

- dockerd requires `ipset` support in the Linux kernel
- allow access to `gateway_mode_ipv[46]=routed` networks from other networks on the same docker host

@robmry robmry added this to the 28.0.0 milestone Oct 7, 2024
@robmry robmry self-assigned this Oct 7, 2024
@robmry robmry changed the title from 48526 gateway routed inc to Make containers on routed-mode networks accessible from other bridge networks Oct 7, 2024
@robmry robmry force-pushed the 48526_gateway_routed_inc branch 5 times, most recently from 2d5ccca to 05b29b5 Compare October 10, 2024 18:04
@unclejack
Contributor

ipset would most likely provide the best performance and be the most straightforward to implement. My concern is that this would cause issues for people who have to run Docker in an environment where they can't get ipset and the kernel module.

Perhaps we can take a bit more time to think.
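
(As an aside - not from this PR - a quick way to check whether a host has the necessary ipset support, assuming a typical modular kernel with its config exposed under /boot:)

# loads the core ipset module if it's built as a module
modprobe ip_set
# or inspect the config for the running kernel
grep CONFIG_IP_SET /boot/config-$(uname -r)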

@tianon
Member

tianon commented Oct 11, 2024

For a "routed" network, we're already assuming the system administrator has set up appropriate routes, NAT, forwarding, etc as necessary for that bridge to work, right? And we're just adding a new veth to that pre-existing / pre-configured bridge? (trying to make sure my mental model of the feature is accurate)

If that's accurate, then the only piece we should be managing from the iptables perspective is --publish'd ports, right? In other words, https://github.com/moby/moby/compare/807373c53c4cea18a59bbd02dcc7fd1db64d22ca..05b29b585586e106dd80c2f9cfdee66bb8207d14#diff-1fc4994400997c1865570e0108c76632904b69879d884260455a096c4e38e6e2 should only have rules for the port publishing, and everything else should come directly from the host's preconfiguration, right?

(perhaps I've misunderstood the feature, in which case I might be using it incorrectly in the place I just started using it 😂 😇 ❤️)

@robmry
Contributor Author

robmry commented Oct 14, 2024

For a "routed" network, we're already assuming the system administrator has set up appropriate routes, NAT, forwarding, etc as necessary for that bridge to work, right? And we're just adding a new veth to that pre-existing / pre-configured bridge? (trying to make sure my mental model of the feature is accurate)

If that's accurate, then the only piece we should be managing from the iptables perspective is --publish'd ports, right? In other words, https://github.com/moby/moby/compare/807373c53c4cea18a59bbd02dcc7fd1db64d22ca..05b29b585586e106dd80c2f9cfdee66bb8207d14#diff-1fc4994400997c1865570e0108c76632904b69879d884260455a096c4e38e6e2 should only have rules for the port publishing, and everything else should come directly from the host's preconfiguration, right?

(perhaps I've misunderstood the feature, in which case I might be using it incorrectly in the place I just started using it 😂 😇 ❤️)

Yes, exactly ... in 27.x, routed mode means no NAT mapping from host addresses, just port/protocol filtering to allow traffic for the container-end of -p mappings (the host-port in a port mapping has no effect, host-ip only determines address family). It's up to the user to sort out routing to the container network. https://docs.docker.com/engine/network/packet-filtering-firewalls/#direct-routing

In 28.x, routed mode will also allow access from neighbouring containers on the same host (this PR). It was an oversight that it wasn't allowed before - we only had no-NAT IPv6 with direct routing from remote hosts in mind.
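
For concreteness, a minimal example of the kind of setup being discussed (network and image names are arbitrary):

docker network create \
  --opt com.docker.network.bridge.gateway_mode_ipv4=routed \
  routednet
docker run -d --network routednet -p 80/tcp nginx

With gateway mode routed, the published port is only opened in the container's filter rules - there's no host-side NAT or proxy, so the container is reached on its own address; with this PR, that also works from containers on other networks on the same host.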

@tianon
Member

tianon commented Oct 23, 2024

Ah, thanks for confirming my mental model 🙇 ❤️

So I guess what I'm proposing is that the DOCKER-related chains shouldn't reference bridge1 at all in https://github.com/moby/moby/blob/05b29b585586e106dd80c2f9cfdee66bb8207d14/integration/network/bridge/iptablesdoc/generated/usernet-portmap-routed.md 🤔

(If routing on routed networks is the host/admin's responsibility, then we're almost entirely hands-off / out-of-the-loop / off the hook.)

I'm actually a little confused looking at that example because I'm not seeing where the 8080 mapping gets applied -- is that userland proxy or something? (the only hit for 8080 on that whole page is the docker run command itself 😅)

@tianon
Member

tianon commented Oct 23, 2024

To put that another way, if we didn't add so many bridge1-related rules, the problem this PR is trying to fix would mostly just go away, right? Since all "other container" traffic would pass through the standard rules/routing as if it were any other machine-related traffic.

@robmry
Contributor Author

robmry commented Oct 23, 2024

> Ah, thanks for confirming my mental model 🙇 ❤️

Thank you for taking a look and giving it some thought! Much appreciated.

> So I guess what I'm proposing is that the DOCKER-related chains shouldn't reference bridge1 at all in https://github.com/moby/moby/blob/05b29b585586e106dd80c2f9cfdee66bb8207d14/integration/network/bridge/iptablesdoc/generated/usernet-portmap-routed.md 🤔
>
> (If routing on routed networks is the host/admin's responsibility, then we're almost entirely hands-off / out-of-the-loop / off the hook.)

Routing in the network to get traffic to the docker host is the user's responsibility; we have no control over that.

But direct routing doesn't mean "no firewall". For that, nat-unprotected mode, which will be introduced by #48597, might be closer to the mark.

So, we don't want to leave all ports open on containers on bridge1 - only "published" ports. It's just that, for a routed network, "published" means "opened" rather than "nat'd/proxied from the host".

(Mode routed-unprotected is a possibility too, it'd be no-NAT and no port filtering, but it's not implemented and may not be unless it's asked-for so that we know it's needed. Mode nat-unprotected is equivalent to setting the filter-FORWARD policy to ACCEPT at the moment, but removing the dependency on the default policy in #48724 takes away that option, so nat-unprotected is an escape-hatch for those depending on that behaviour. Mode routed-unprotected has no direct equivalent in today's releases, it'd be somewhat like disabling iptables but just for one bridge network. However, it'd still need some iptables rules to allow responses to other networks - see the notes on DOCKER-ISOLATION-STAGE-1 below.)

> I'm actually a little confused looking at that example because I'm not seeing where the 8080 mapping gets applied -- is that userland proxy or something? (the only hit for 8080 on that whole page is the docker run command itself 😅)

There's no NAT from host addresses, so port 8080 in the -p 8080:80 mapping is irrelevant and ignored - no rules are created for it in the nat or filter tables.

In 27.x, it's currently an error to over-specify the port mapping - if routed mode can't use it, don't specify the host port. But, that error will be downgraded to a warning in the next 27.x release, for the reasons described in #48575.

I'm not sure if that makes it any clearer - but it's what I meant by "Yes, exactly ... in 27.x, routed mode means no NAT mapping from host addresses, just port/protocol filtering to allow traffic for the container-end of -p mappings (the host-port in a port mapping has no effect, host-ip only determines address family). It's up to the user to sort out routing to the container network. https://docs.docker.com/engine/network/packet-filtering-firewalls/#direct-routing".

> To put that another way, if we didn't add so many bridge1-related rules, the problem this PR is trying to fix would mostly just go away, right? Since all "other container" traffic would pass through the standard rules/routing as if it were any other machine-related traffic.

I don't think that's an option. It would be if we didn't want port filtering, and the routed-mode network didn't need to interact with any other networks - but I'll try to explain why each of the bridge1 rules is needed ...

Port 80 from the -p 8080:80 mapping still opens the port on the container; that's done by this rule:

Chain DOCKER (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 ACCEPT     6    --  !bridge1 bridge1  0.0.0.0/0            192.0.2.2            tcp dpt:80

The next bridge1 rule in the DOCKER chain allows ICMP, partly because that would have been allowed before (by what we're now calling the default gateway mode, "nat"), and partly because IPv6 relies on ICMP for all sorts of things.

Chain DOCKER (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
3        0     0 ACCEPT     1    --  *      bridge1  0.0.0.0/0            0.0.0.0/0 

Those ACCEPT rules are necessary because, if the filter-FORWARD chain's policy is DROP (which it should be, because we need IP forwarding enabled in the kernel and don't want the host forwarding unrelated traffic by default), packets are dropped by default. So, traffic for allowed ports/protocols has to be explicitly allowed. A blanket ACCEPT for traffic to bridge1 would open unpublished ports, and we don't want that.

The final bridge1 rule in the DOCKER chain is to drop any packet aimed at an unpublished port on bridge1. It's necessary in case the filter-FORWARD chain's default policy is ACCEPT:

Chain DOCKER (1 references)
num   pkts bytes target     prot opt in     out     source               destination              
4        0     0 DROP       0    --  !bridge1 bridge1  0.0.0.0/0            0.0.0.0/0 

There are two bridge1 rules in DOCKER-ISOLATION-STAGE-1:

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 ACCEPT     0    --  bridge1 *       0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
2        0     0 RETURN     0    --  *      bridge1  0.0.0.0/0            0.0.0.0/0     

The first has nothing to do with what's allowed in to bridge1; it's about getting responses out. The second enables inter-network communication for the routed-mode network by skipping the rest of the isolation rules - that's the objective of this PR, for the reasons given in the linked issue. (The same logic actually applies to normal gw-mode-nat networks, because direct routing is allowed there too. But we don't want to suddenly open up inter-network communication between bridge networks, even though it's odd to allow remote but not local access to published ports. Maybe that'll become another option/mode.) These two DOCKER-ISOLATION-STAGE-1 rules are described in the doc as:

  • Rule 1 accepts outgoing packets related to established connections. This is for responses to containers on NAT networks that would not normally accept packets from another network, and may have port/protocol filtering rules in place that would otherwise drop these responses.
  • Rule 2 skips the jump to DOCKER-ISOLATION-STAGE-2 for any packet routed to the routed-mode network. So, it will accept packets from other networks, if they make it through the port/protocol filtering rules in the DOCKER chain.

The bridge1 rule from DOCKER-ISOLATION-STAGE-2 could be dropped, I think. It'll never be hit because of the return rule in stage-1. Perhaps I should do that, but leaving it in definitely isn't going to break anything.

The remaining two bridge1 rules are in the FORWARD chain itself:

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
5        0     0 ACCEPT     0    --  bridge1 !bridge1  0.0.0.0/0            0.0.0.0/0           
6        0     0 ACCEPT     0    --  bridge1 bridge1  0.0.0.0/0            0.0.0.0/0 

  • The first of those allows traffic out of the bridge1 network, if it hasn't already been dropped by the isolation stages or port-filtering rules for another network in the DOCKER chain. Without this rule, a filter-FORWARD policy of DROP would just drop all outgoing traffic.
  • The second implements inter-container communication in the bridge1 network; anything's allowed within the network - if ICC was disabled, it'd be a DROP rule.

With ICC enabled, these two rules can be combined, and they will be by #48641.
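
For reference, in iptables-save notation those two rules, and the single rule they'd roughly combine into with ICC enabled, look like:

-A FORWARD -i bridge1 ! -o bridge1 -j ACCEPT
-A FORWARD -i bridge1 -o bridge1 -j ACCEPT
# combined (ICC enabled):
-A FORWARD -i bridge1 -j ACCEPT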

@robmry robmry force-pushed the 48526_gateway_routed_inc branch from 05b29b5 to 79a9fa7 Compare October 23, 2024 17:53
@tianon
Member

tianon commented Oct 24, 2024

Ahhhhh, that helps a lot, yes, thank you ❤️

My mental model is wrong, it turns out, because I understood "routed" to be effectively "this bridge already exists, and is already configured with routing (likely for other purposes like Incus/LXD, nspawn, QEMU, etc), and I just want Docker to attach the veths it creates to it", which is actually how I'm using it in practice today, but you envision it to be something still mostly Docker-managed/protected. 🤔

FWIW, #48526 is the only issue I've personally observed with my mental model/setup, and it's "fixed" by removing more rules Docker's putting in place. What you're describing (locking it down to only explicitly published ports) sounds like it's probably going to break my setup actually (I don't publish any ports on my containers on the routed bridge, but their listening ports are still all routed correctly and answer properly to remote hosts on the connected network), so that's a bridge (heh) I'll probably have to cross separately. 🙈

@tianon
Member

tianon commented Oct 28, 2024

Edit 2: @robmry asked me to clarify whether I was indeed on Docker v23 (not a typo for v27.3 or something), and I am - so my com.docker.network.bridge.gateway_mode_ipv4=routed is totally ignored and this is all noise -- sorry! 😭 (perhaps it's still useful though, because I think what I'm trying to accomplish is somewhere near the use case for com.docker.network.bridge.gateway_mode_ipv4=routed 😅)


In the hopes that it helps, here's an attempt at describing my usecase/setup in more detail:

I've got a shared network (a /16) on which a /24 is delegated to me. My access to this shared network is over IPsec, via an "xfrm" device (this is relevant because ipvlan, macvlan, and even bridges are picky about interfaces, and cannot use this interface directly -- the same applies to WireGuard interfaces, which is probably a more common setup than IPsec+xfrm).

On my machine, I need to run a set of services available on different IP addresses within that /24. If this were a physical interface on the box, I'd probably create a bridge, add the physical interface to it, then create TAP interfaces for VMs with the IP addresses I wanted to expose. Alternatively, before gateway_mode_ipv4=routed, I might've added the IPs to the host (which has other unfortunate side effects) and then used --publish to expose ports on containers on the appropriate IPs.

What I managed to get working (thanks to gateway_mode_ipv4=routed), is that I have the /16 assigned to the IPsec interface, and I create a bridge interface that I assign my /24 subset to. I then rely on routing/forwarding to "bridge" the gap between the bridge and the IPsec interface, and have appropriate nftables rules set up to allow that. Then I can create TAP interfaces on that bridge for VMs that need to be "part of" that /24(+/16), and something like what I'll describe below for my containers (so that they're directly on that network too):

  • ipsec /16: 10.123.0.0/16
  • my /24 on that /16: 10.123.1.0/24
  • a /26 in my /24 that I reserve just for Docker: 10.123.1.64/26
  • bridge interface: br-ipsec (on the host, with the /24 assigned to it)
docker network create \
	--gateway 10.123.1.1 \
	--subnet 10.123.1.0/24 \
	--ip-range 10.123.1.64/26 \
	--opt com.docker.network.bridge.name=br-ipsec \
	--opt com.docker.network.bridge.inhibit_ipv4=true \
	--opt com.docker.network.driver.mtu=1400 \
	--opt com.docker.network.container_iface_prefix=ipsec \
	--opt com.docker.network.bridge.gateway_mode_ipv4=routed \
	--opt com.docker.network.bridge.enable_ip_masquerade=false \
	ipsec

(I don't know whether inhibit_ipv4 and enable_ip_masquerade are actually necessary for it to work, and container_iface_prefix is equal parts vanity/simpler for multi-homed containers, but I haven't been brave enough to test removing things after I got it working 😅)
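
(A rough sketch of the kind of host-side nftables forwarding rules this setup relies on - hypothetical table/chain names, with "xfrm0" standing in for the IPsec xfrm interface, and assuming an existing inet filter table with a forward hook chain:)

# allow forwarding between the delegated /24 bridge and the IPsec interface
nft add rule inet filter forward iifname "br-ipsec" oifname "xfrm0" accept
nft add rule inet filter forward iifname "xfrm0" oifname "br-ipsec" accept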

Overall, this has worked really well -- I can choose IPs for containers directly or let Docker assign them, and other systems on the remote end of the IPsec can access them directly without issue. The only real issue I have is that other containers (which can access other hosts on the IPsec network just fine) can't access the same-host IPs (which is what I've understood #48526 to be), and that doing this adds extra firewall rules on my br-ipsec bridge that now block container-unrelated traffic in unexpected ways. My expectation was that, because I'm providing the bridge device, I'm also responsible for providing the routing and filtering for the containers/bridge device (again, I created and provided the bridge device specifically because I also use it for other things).

Edit: I guess I should also note that I'm on Docker v23 and this is all working; no idea if newer versions have already broken this based on bad assumptions I've made 🙈

@robmry robmry force-pushed the 48526_gateway_routed_inc branch 2 times, most recently from 3a523c9 to 51cfe90 Compare November 12, 2024 19:39
@robmry robmry marked this pull request as ready for review November 13, 2024 09:14
@robmry robmry requested a review from akerouanton November 13, 2024 09:14
After an error, there's no need for it to roll back rules
it's created; the caller already does that.

Signed-off-by: Rob Murray <[email protected]>
IPv4 before IPv6, with consistent error paths.

Signed-off-by: Rob Murray <[email protected]>
The default for a user-defined chain is RETURN anyway.

This opens up the possibility of sorting rules into two groups
by using insert or append, without having to deal with appending
after the unconditional RETURN.

Signed-off-by: Rob Murray <[email protected]>
Create ipsets containing the subnet of each non-internal bridge network.

Signed-off-by: Rob Murray <[email protected]>
Add an integration test to check that a container on a network
with gateway-mode=nat can access a container on a network with
gateway-mode=routed, but not vice-versa.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the 48526_gateway_routed_inc branch from 51cfe90 to 223929a Compare November 19, 2024 15:30
@robmry
Contributor Author

robmry commented Nov 19, 2024

Rebased as it was a bit behind - I'll let the tests run, then merge.

@robmry robmry merged commit 069e41a into moby:master Nov 19, 2024
140 checks passed
@robmry robmry deleted the 48526_gateway_routed_inc branch November 19, 2024 16:26
aevesdocker pushed a commit to docker/docs that referenced this pull request Feb 20, 2025
## Description

Updates for moby 28.0 networking.

## Related issues or tickets

Series of commits ...
- Fix description of 'inhibit_ipv4'
- Not changed in moby 28.0, updated to clarify difference from (new)
IPv6-only networks.
- Updates to default bridge address config
  - moby/moby#48319
- Describe IPv6-only network config
  - moby/moby#48271
  - docker/cli#5599
- Update description of gateway modes
  - moby/moby#48594
  - moby/moby#48596
  - moby/moby#48597
- Describe gateway selection in the networking overview.
  - docker/cli#5664
- Describe gateway mode `isolated`
  - moby/moby#49262

## Reviews


- [ ] Technical review
- [ ] Editorial review
- [ ] Product review

---------

Signed-off-by: Rob Murray <[email protected]>
Successfully merging this pull request may close these issues.

Containers on bridge networks with gateway_mode_ipv[46]=routed are inaccessible from other containers