plugin/reload does not work with SIGUSR1 triggered reload #6263

@cbarbian-sap

Description

What happened:

It seems that the reload plugin stops working after a reload is triggered by sending SIGUSR1 to the coredns process.

What you expected to happen:

The reload plugin should not break due to sending SIGUSR1.

How to reproduce it (as minimally and precisely as possible):

Start a coredns server with the following Corefile:

.:1053 {
        reload 2s 1s
        erratic
        # forward . 8.8.8.8
}

Confirm that it works:

$ dig @localhost -p 1053 dns.google

# (snip)
;; QUESTION SECTION:
;dns.google.			IN	A

;; ANSWER SECTION:
dns.google.		0	IN	A	192.0.2.53

# (snip)

Ok fine, the 192.0.2.53 response comes from the erratic plugin.

Now trigger a reload by sending SIGUSR1 to the coredns process: killall -USR1 coredns. Log output:

[INFO] SIGUSR1: Reloading
[INFO] Reloading
[INFO] Reloading complete

Still fine. Now modify the Corefile to

.:1053 {
	reload 2s 1s
	# erratic
	forward . 8.8.8.8
}

Log output:

[INFO] Reloading
[ERROR] Restart failed: getting old listener file: file tcp [::]:1053: use of closed network connection
[ERROR] plugin/reload: Corefile changed but reload failed: starting with listener file descriptors: getting old listener file: file tcp [::]:1053: use of closed network connection

And indeed it did not reload: running the above dig command again still returns 192.0.2.53 instead of the expected responses 8.8.4.4 and 8.8.8.8 that the forward plugin would have returned.

From that point on, no further reload will work.

Anything else we need to know?:

I tried to understand what happens. First of all it should be noted that the reload plugin registers an event hook (and an OnFinalShutdown callback) in its setup function:

once.Do(func() {
	caddy.RegisterEventHook("reload", hook)
})
// re-register on finalShutDown as the instance most-likely will be changed
shutOnce.Do(func() {
	c.OnFinalShutdown(func() error {
		r.quit <- true
		return nil
	})
})

(see https://github.com/coredns/coredns/blob/master/plugin/reload/setup.go#L69C1-L78C4), and it does that only once.
By the way, doing it only once seems wrong to me for the callback, because the callback is then missing for all future instances that are created by the plugin upon reloading; but that may be a different story.
For the event hook, at least, it seems correct and logical to do it just once.

BUT: the signal handler for the USR1 signal clears all event hooks before restarting the instance:

purgeEventHooks()

// Kick off the restart; our work is done
EmitEvent(InstanceRestartEvent, nil)
_, err = inst.Restart(caddyfileToUse)
if err != nil {
	restoreEventHooks(oldEventHooks)

	log.Printf("[ERROR] SIGUSR1: %v", err)
}

(see https://github.com/coredns/caddy/blob/master/sigtrap_posix.go#L84). It seems this was introduced around caddyserver/caddy#2044. By the way, this makes the EmitEvent(InstanceRestartEvent, nil) statement completely meaningless, because no hook will ever handle it; but that may be a different story, too.

In any case, because of that, the reload hook() function (https://github.com/coredns/coredns/blob/master/plugin/reload/reload.go#L63) will not run, and therefore will not spin up a new goroutine (https://github.com/coredns/coredns/blob/master/plugin/reload/reload.go#L84) for the new instance created through the Restart() invocation in the signal handler.

So, when the Corefile is changed afterwards, the goroutine that was started for the old instance is still active and tries to Restart() the instance it knows; however, that instance was already stopped by the Restart() done in the USR1 signal handler. That, as far as I understand it, explains the error message 'use of closed network connection'.

Now, even if the purgeEventHooks() logic were not in place, another bad thing would happen: a second reloading goroutine would be started while the old one was still running. I guess the new one would then do its reloading job properly, but the old one would forever pollute the logs with error messages (and each reload would add another such zombie goroutine).

Environment:

  • the version of CoreDNS: 1.11.0
  • Corefile: see above
  • logs, if applicable: see above
  • OS (e.g: cat /etc/os-release): Linux, macOS; does not matter
  • Others: n/a
