plugin/reload does not work with SIGUSR1 triggered reload

**What happened**:

It seems that the reload plugin stops working after a reload has been triggered by sending SIGUSR1 to the coredns process.

**What you expected to happen**:

The reload plugin should keep working after a reload triggered by SIGUSR1.

**How to reproduce it (as minimally and precisely as possible)**:

Start a coredns server with the following Corefile:
```
.:1053 {
        reload 2s 1s
        erratic
        # forward . 8.8.8.8
}
```

Confirm that it works:
```
$ dig @localhost -p 1053 dns.google

# (snip)
;; QUESTION SECTION:
;dns.google.			IN	A

;; ANSWER SECTION:
dns.google.		0	IN	A	192.0.2.53

# (snip)
```

Ok fine, the `192.0.2.53` response comes from the erratic plugin.

Now trigger a reload by sending SIGUSR1 to the coredns process: `killall -USR1 coredns`. Log output:

```
[INFO] SIGUSR1: Reloading
[INFO] Reloading
[INFO] Reloading complete
```

Still fine. Now modify the Corefile to

```
.:1053 {
	reload 2s 1s
	# erratic
	forward . 8.8.8.8
}
```

Log output:
```
[INFO] Reloading
[ERROR] Restart failed: getting old listener file: file tcp [::]:1053: use of closed network connection
[ERROR] plugin/reload: Corefile changed but reload failed: starting with listener file descriptors: getting old listener file: file tcp [::]:1053: use of closed network connection
```

And indeed it did not reload: running the above dig command again still returns `192.0.2.53` instead of the expected `8.8.4.4` and `8.8.8.8` that the forward plugin would have returned.

From that point on, no further reload will work.

**Anything else we need to know?**:

I tried to understand what happens. First, note that the reload plugin registers an event hook (and an `OnFinalShutdown` callback) in its `setup` function:

```
once.Do(func() {
	caddy.RegisterEventHook("reload", hook)
})
// re-register on finalShutDown as the instance most-likely will be changed
shutOnce.Do(func() {
	c.OnFinalShutdown(func() error {
		r.quit <- true
		return nil
	})
})
```
(see https://github.com/coredns/coredns/blob/master/plugin/reload/setup.go#L69C1-L78C4), and it does that only once.
For the callback this seems wrong to me, by the way: it is registered only on the first instance, so it is missing on every later instance that the plugin creates when reloading. But that's maybe a different story.
For the event hook, at least, registering it just once seems correct and logical.

BUT: the signal handler for the USR1 signal clears all event hooks before restarting the instance:
```
purgeEventHooks()

// Kick off the restart; our work is done
EmitEvent(InstanceRestartEvent, nil)
_, err = inst.Restart(caddyfileToUse)
if err != nil {
	restoreEventHooks(oldEventHooks)

	log.Printf("[ERROR] SIGUSR1: %v", err)
}
```
(see https://github.com/coredns/caddy/blob/master/sigtrap_posix.go#L84). It seems this was introduced around https://github.com/caddyserver/caddy/issues/2044. By the way, this also makes the `EmitEvent(InstanceRestartEvent, nil)` statement completely meaningless, because no hook is left to handle it, but that's maybe a different story, too.

Because of that, the reload `hook()` function (https://github.com/coredns/coredns/blob/master/plugin/reload/reload.go#L63) does not run, and therefore does not spin up a new goroutine (https://github.com/coredns/coredns/blob/master/plugin/reload/reload.go#L84) for the new instance created by the `Restart()` call in the signal handler.

So when the Corefile is changed afterwards, the goroutine that was started for the old instance is still active and tries to `Restart()` the instance it knows. However, that instance was already stopped by the `Restart()` done in the USR1 signal handler, which, as far as I understand it, explains the error message 'use of closed network connection'.

Even if that `purgeEventHooks()` logic were not in place, another bad thing would happen: a second reloading goroutine would be started while the old one is still running. I guess the new one would then do its reloading job properly, but the old one would pollute the logs with error messages forever (and each reload would add another such zombie goroutine).

**Environment**:

- the version of CoreDNS: 1.11.0
- Corefile: see above
- logs, if applicable: see above
- OS (e.g: `cat /etc/os-release`): Linux, macOS (does not matter)
- Others: n/a


plugin/reload does not work with SIGUSR1 triggered reload #6263