Improve Consul failure handling #1597
@adleong oh, whoops, I can look into that. Sorry, my bad.
this lgtm
nit: spaces
Okay, have tested this against https://github.com/linkerd/linkerd-examples/tree/master/consul and nothing appears to be obviously broken; going to go ahead and merge.
Upon Consul failure, Namerd produces `Addr.Failed` and isn't able to resolve any names. If it has seen valid responses from Consul before entering the failed state, it should instead continue to resolve names based on the last observed known-good state.

I've modified `SvcAddr` in `io.buoyant.namer.consul` to cache the last observed good state and fall back to that state when polling Consul fails. If no good state was previously observed, `SvcAddr` will still produce `Addr.Failed`. I've also added log messages to all failure cases in `SvcAddr`: it logs at the `WARNING` level when an error occurred but it fell back to a previous state, and at the `ERROR` level when an error occurred and no fallback was possible (i.e. `Addr.Failed` was produced).

I've also added calls to `Activity.stabilize()` on the Consul observation activities in `ConsulDtabStore`. This should prevent these activities from entering the `Failed` state if they have previously observed a good state; on observing an error, they fall back to that state instead. Failures that are flagged as interruptions should still close the `Activity`. I've also added error logging to these activities.

I've added some additional tests in `ConsulNamerTest` to ensure that the namer falls back to the previous good state on errors, that the state it falls back to is the most recently observed good state, and that the namer resumes responding to good updates after falling back.

Closes #1593
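The fallback behaviour described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not the code from this PR: `Addr` here is a hypothetical stand-in for Finagle's address type, `LastGoodFallback` and its `observe` method are invented names, and each Consul poll is modelled as a `Try` of endpoint strings.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for the address states; NOT the real Finagle types.
sealed trait Addr
object Addr {
  final case class Bound(endpoints: Set[String]) extends Addr
  final case class Failed(cause: Throwable) extends Addr
}

// Hypothetical helper: remembers the most recent good poll result and
// serves it again whenever a later poll fails. Not thread-safe.
final class LastGoodFallback(log: String => Unit = msg => println(msg)) {
  private var lastGood: Option[Addr.Bound] = None

  def observe(poll: Try[Set[String]]): Addr = poll match {
    case Success(endpoints) =>
      // A good response always replaces the cached state.
      val bound = Addr.Bound(endpoints)
      lastGood = Some(bound)
      bound
    case Failure(e) =>
      lastGood match {
        case Some(bound) =>
          // An error occurred, but a previous good state is available.
          log(s"WARNING: poll failed (${e.getMessage}); using last good state")
          bound
        case None =>
          // An error occurred and there is nothing to fall back to.
          log(s"ERROR: poll failed (${e.getMessage}); no state to fall back to")
          Addr.Failed(e)
      }
  }
}

object LastGoodFallbackDemo {
  def main(args: Array[String]): Unit = {
    val fallback = new LastGoodFallback()
    val outage = new RuntimeException("consul unavailable")
    println(fallback.observe(Failure(outage)))          // nothing cached yet
    println(fallback.observe(Success(Set("10.0.0.1")))) // good update
    println(fallback.observe(Failure(outage)))          // falls back
  }
}
```

The sketch mirrors the three properties the new tests check: a failure after a good response yields the cached state, the cached state is always the most recent good one, and good updates after a failure are served normally.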
## 1.2.0 2017-09-07

* **Breaking Change**: `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin now serve on 127.0.0.1 by default (instead of 0.0.0.0).
* **Breaking Change**: Removed support for PKCS#1-formatted keys. PKCS#1 formatted keys must be converted to PKCS#8 format.
* Added experimental `io.l5d.dnssrv` namer for DNS SRV records (#1611)
* Kubernetes
  * Added an experimental `io.l5d.k8s.configMap` interpreter for reading dtabs from a Kubernetes ConfigMap (#1603). This interpreter will respond to changes in the ConfigMap, allowing for dynamic dtab updates without the need to run Namerd.
  * Made ingress controller's ingress class annotation configurable (#1584).
  * Fixed an issue where Linkerd would continue routing traffic to endpoints of a service after that service was removed (#1622).
  * Major refactoring and performance improvements to `io.l5d.k8s` and `io.l5d.k8s.ns` namers (#1603).
  * Ingress controller now checks all available ingress resources before using a default backend (#1607).
  * Ingress controller now correctly routes requests with host headers that contain ports (#1607).
* HTTP/2
  * Fixed an issue where long-running H2 streams would eventually hang (#1598).
  * Fixed a memory leak on long-running H2 streams (#1598)
  * Added a user-friendly error message when a HTTP/2 router receives a HTTP/1 request (#1618)
* HTTP/1
  * Removed spurious `ReaderDiscarded` exception logged on HTTP/1 retries (#1609)
* Consul
  * Added support for querying Consul by specific service health states (#1601)
  * Consul namers and Dtab store now fall back to a last known good state on Consul observation errors (#1597)
  * Improved log messages for Consul observation errors (#1597)
* TLS
  * Removed support for PKCS#1 keys (#1590)
  * Added validation to prevent incompatible `disableValidation: true` and `clientAuth` settings in TLS client configurations (#1621)
* Changed `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin to serve on 127.0.0.1 by default (instead of 0.0.0.0) (#1366)
* Deprecated `io.l5d.statsd` telemeter.