plugin/reload deadlocks with SIGTERM #7314

**What happened**:

I run CoreDNS in a Kubernetes cluster (a derivative build, but the issue also reproduces on the official Docker image). After a manifest update that changed both the Corefile ConfigMap and the Pod specification, the following sequence of events occurred:

1. The CoreDNS reload plugin observed the Corefile update and triggered a reload.
2. Kubernetes decided to recreate the pod and sent SIGTERM to the running pod with a 30s grace period.
3. CoreDNS observed the SIGTERM and started shutting down.
4. After 30 seconds, neither the reload nor the shutdown had finished, and the pod was forcibly killed.

**What you expected to happen**:

Clean shutdown within the grace period, no deadlocks.

**How to reproduce it (as minimally and precisely as possible)**:

I reproduced it without much trouble today on Docker with latest CoreDNS.

Given a Corefile:

```
.:53 {
  errors
  reload 3s
  forward . 192.168.9.10
  pprof 0.0.0.0:6053
}
```

where 192.168.9.10 is an IP address that nothing listens on. Forwarded queries never get a response, which keeps CoreDNS busy so that it cannot reload instantly.

1. Start CoreDNS:
   ```shell
   docker run --rm --name nld-test --ip 192.168.9.2 -it -v $CONFIG_DIR/Corefile:/Corefile:ro coredns/coredns:1.12.1 -conf /Corefile
   ```
2. Start a DNS client to emulate traffic (not quite production, but enough to keep it busy during reload):
   ```shell
   dig +udp @192.168.9.2 +keepopen -f <(while true; do echo google.com; done)
   ```
3. Modify the Corefile slightly (adding a newline inside the server block works).
4. As soon as you observe `[INFO] Reloading` in CoreDNS logs, send the SIGTERM signal:
   ```shell
   docker kill -s TERM nld-test
   ```

If you don't see `[INFO] Reloading complete`, the process is now stuck indefinitely. You can run `curl 'http://192.168.9.2:6053/debug/pprof/goroutine?debug=2'` to fetch the stack traces. From my run, these two look relevant:

```
goroutine 80 [chan send]:
github.com/coredns/coredns/plugin/reload.setup.func2.1()
        /home/runner/work/coredns/coredns/plugin/reload/setup.go:75 +0x25
github.com/coredns/caddy.(*Instance).ShutdownCallbacks(0xc000aa2000)
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/caddy.go:174 +0x193
github.com/coredns/caddy.allShutdownCallbacks()
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/sigtrap.go:95 +0xf3
github.com/coredns/caddy.executeShutdownCallbacks.func1()
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/sigtrap.go:75 +0xaf
sync.(*Once).doSlow(0xc000116310?, 0x40ffe6?)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/sync/once.go:78 +0xab
sync.(*Once).Do(...)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/sync/once.go:69
github.com/coredns/caddy.executeShutdownCallbacks({0x2bd49d0?, 0x0?})
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/sigtrap.go:71 +0x5d
github.com/coredns/caddy.trapSignalsPosix.func1()
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/sigtrap_posix.go:43 +0x7fe
created by github.com/coredns/caddy.trapSignalsPosix in goroutine 1
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/sigtrap_posix.go:28 +0x1a

goroutine 103 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0xc0006b22a0?, 0xbf?, 0xc0006b22a0?)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0x49af510)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /home/runner/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.1.linux-amd64/src/sync/mutex.go:46
github.com/coredns/caddy.(*Instance).Stop(0xc000aa2000)
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/caddy.go:149 +0x1f7
github.com/coredns/caddy.(*Instance).Restart(0xc000aa2000, {0x3126430, 0xc000a8e300})
        /home/runner/go/pkg/mod/github.com/coredns/caddy@v1.1.2-0.20241029205200-8de985351a98/caddy.go:254 +0xb2a
github.com/coredns/coredns/plugin/reload.hook.func1()
        /home/runner/work/coredns/coredns/plugin/reload/reload.go:108 +0x386
created by github.com/coredns/coredns/plugin/reload.hook in goroutine 1
        /home/runner/work/coredns/coredns/plugin/reload/reload.go:84 +0x1f0
```

The signal-handling goroutine (80), in `executeShutdownCallbacks`, acquires the `instancesMu` mutex and runs the reload plugin's shutdown callback, which tries to send on the plugin's unbuffered `exit` channel.

The reload plugin's goroutine (103) is not receiving from the `exit` channel, because it is stuck in the other `case` of the `select` in its main loop, waiting to acquire the `instancesMu` mutex so that it can `Stop` the original instance and finish the reload.

**Anything else we need to know?**:

We observed this issue in kubernetes/dns in the NodeLocalDNS addon. The clean shutdown is important to us, because we have iptables rules to remove in a shutdown callback, and if we don't remove them, all DNS traffic on the node is sent into the void.

The solution we found for ourselves was to stop using the reload plugin and send SIGUSR1 manually: https://github.com/kubernetes/dns/pull/689. Since signal processing in coredns/caddy is a simple for loop with no goroutine magic, the two operations are more or less guaranteed not to deadlock. @johnbelamaric [suggested in that PR](https://github.com/kubernetes/dns/pull/689#issuecomment-2895188022) that rewriting the reload plugin to also send SIGUSR1 is a potential solution.

**Environment**:

- the version of CoreDNS: 1.12.1 in repro, 1.11.3 in NodeLocalDNS
- Corefile: presented in repro section
- logs, if applicable: n/a
- OS (e.g: `cat /etc/os-release`): n/a
- Others:
