plugin/reload deadlocks with SIGTERM #7314

@Michcioperz

What happened:

I was running CoreDNS in a Kubernetes cluster (a derivative build, but I can reproduce the problem with the official Docker image). After a manifest update that changed both the Corefile ConfigMap and the Pod specification, the following sequence of events happened:

  1. The CoreDNS reload plugin observed the Corefile update and triggered a reload.
  2. Kubernetes decided to recreate the pod and sent a SIGTERM to the running pod with a 30s grace period.
  3. CoreDNS observed the SIGTERM and started shutting down.
  4. After 30 seconds neither the reload nor the shutdown had finished, and the pod was forcibly killed.

What you expected to happen:

Clean shutdown within the grace period, no deadlocks.

How to reproduce it (as minimally and precisely as possible):

I reproduced it today without much trouble on Docker with the latest CoreDNS (1.12.1).

Given a Corefile:

.:53 {
  errors
  reload 3s
  forward . 192.168.9.10
  pprof 0.0.0.0:6053
}

where 192.168.9.10 is an IP address with nothing listening on it. Forwarded queries never get a response, which keeps CoreDNS busy so that it cannot reload instantly.

  1. Start CoreDNS:
    docker run --rm --name nld-test --ip 192.168.9.2 -it -v $CONFIG_DIR/Corefile:/Corefile:ro coredns/coredns:1.12.1 -conf /Corefile
  2. Start a DNS client to emulate traffic (not quite production, but enough to keep it busy during reload):
    dig +udp @192.168.9.2 +keepopen -f <(while true; do echo google.com; done)
  3. Modify the Corefile slightly (adding a newline inside the server block is enough).
  4. As soon as you observe [INFO] Reloading in CoreDNS logs, send the SIGTERM signal:
    docker kill -s TERM nld-test

If you don't see [INFO] Reloading complete, the process should now be stuck indefinitely. You can curl 'http://192.168.9.2:6053/debug/pprof/goroutine?debug=2' to fetch the stack traces. From my run, these two look relevant:

goroutine 80 [chan send]:
github.com/coredns/coredns/plugin/reload.setup.func2.1()
        /home/runner/work/coredns/coredns/plugin/reload/setup.go:75 +0x25
github.com/coredns/caddy.(*Instance).ShutdownCallbacks(0xc000aa2000)
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/caddy.go:174 +0x193
github.com/coredns/caddy.allShutdownCallbacks()
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/sigtrap.go:95 +0xf3
github.com/coredns/caddy.executeShutdownCallbacks.func1()
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/sigtrap.go:75 +0xaf
sync.(*Once).doSlow(0xc000116310?, 0x40ffe6?)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/sync/once.go:78 +0xab
sync.(*Once).Do(...)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/sync/once.go:69
github.com/coredns/caddy.executeShutdownCallbacks({0x2bd49d0?, 0x0?})
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/sigtrap.go:71 +0x5d
github.com/coredns/caddy.trapSignalsPosix.func1()
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/sigtrap_posix.go:43 +0x7fe
created by github.com/coredns/caddy.trapSignalsPosix in goroutine 1
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/sigtrap_posix.go:28 +0x1a

goroutine 103 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0xc0006b22a0?, 0xbf?, 0xc0006b22a0?)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0x49af510)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /home/runner/go/pkg/mod/golang.org/[email protected]/src/sync/mutex.go:46
github.com/coredns/caddy.(*Instance).Stop(0xc000aa2000)
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/caddy.go:149 +0x1f7
github.com/coredns/caddy.(*Instance).Restart(0xc000aa2000, {0x3126430, 0xc000a8e300})
        /home/runner/go/pkg/mod/github.com/coredns/[email protected]/caddy.go:254 +0xb2a
github.com/coredns/coredns/plugin/reload.hook.func1()
        /home/runner/work/coredns/coredns/plugin/reload/reload.go:108 +0x386
created by github.com/coredns/coredns/plugin/reload.hook in goroutine 1
        /home/runner/work/coredns/coredns/plugin/reload/reload.go:84 +0x1f0

The signal-handling goroutine (80), in executeShutdownCallbacks, acquires the instancesMu mutex and runs the reload plugin's shutdown callback, which tries to send on the reload plugin's unbuffered exit channel.

The reload plugin's goroutine (103) is not receiving from the exit channel, because it is stuck in the other select case of its main loop, waiting to acquire the instancesMu mutex so that it can Stop the original instance and finish the reload. Each goroutine is therefore waiting for something only the other can provide.

Anything else we need to know?:

We observed this issue in kubernetes/dns in the NodeLocalDNS addon. The clean shutdown is important to us, because we have iptables rules to remove in a shutdown callback, and if we don't remove them, all DNS traffic on the node is sent into the void.

The solution we found for ourselves was to stop using the reload plugin and send SIGUSR1 manually: kubernetes/dns#689. Since signal processing in coredns/caddy is a simple for loop with no goroutine magic, the reload and the shutdown are handled one after the other, so it's somewhat guaranteed that the two operations will not deadlock. @johnbelamaric suggested in that PR that rewriting the reload plugin to also send SIGUSR1 is a potential solution.

Environment:

  • the version of CoreDNS: 1.12.1 in repro, 1.11.3 in NodeLocalDNS
  • Corefile: presented in repro section
  • logs, if applicable: n/a
  • OS (e.g: cat /etc/os-release): n/a
  • Others:
