@hawkw (Contributor) commented Aug 17, 2017

When Consul fails, Namerd produces `Addr.Failed` and cannot resolve any names. If it has seen valid responses from Consul before entering the failed state, it should instead continue to resolve names based on the last observed known-good state.

I've modified the `SvcAddr` in `io.buoyant.namer.consul` to cache the last observed good state and fall back to it when polling Consul fails. If no good state was previously observed, `SvcAddr` will still produce `Addr.Failed`. I've also added log messages to all failure cases in `SvcAddr`: it logs at the `WARNING` level when an error occurred but it fell back to a previous state, and at the `ERROR` level when an error occurred and no fallback was possible (i.e. `Addr.Failed` was produced).
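
For readers skimming the thread, the fallback rule described above can be sketched roughly as follows. This is not the code from this PR: `Addr` here is a simplified stand-in for Finagle's type, and `LastGoodCache` and `onPoll` are hypothetical names used only for illustration.

```scala
// Simplified stand-ins for the address states; NOT the real Finagle types.
sealed trait Addr
object Addr {
  final case class Bound(addresses: Set[String]) extends Addr
  final case class Failed(cause: Throwable) extends Addr
}

// Hypothetical helper: remembers the last good Addr and decides what to
// publish (and at which level to log) for each poll result.
final class LastGoodCache {
  private var lastGood: Option[Addr] = None

  def onPoll(result: Either[Throwable, Addr.Bound]): (Addr, Option[String]) =
    result match {
      case Right(good) =>
        // Good response: remember it and publish it; nothing to log.
        lastGood = Some(good)
        (good, None)
      case Left(err) =>
        lastGood match {
          // Error, but a good state was seen earlier: fall back to it.
          case Some(prev) => (prev, Some("WARNING"))
          // Error with nothing to fall back to: publish the failure.
          case None => (Addr.Failed(err), Some("ERROR"))
        }
    }
}
```

Feeding it an error, then a good response, then another error would yield a failure logged at `ERROR`, then the good state, then the same good state again logged at `WARNING`.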

I've also added calls to `Activity.stabilize()` on the Consul observation activities in `ConsulDtabStore`. This should prevent these activities from entering the `Failed` state if they have previously observed a good state; on observing an error they fall back to that state instead. Failures that are flagged as interruptions should still close the `Activity`. I've also added error logging to these activities.
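
As a rough illustration of the stabilizing behaviour described above, here is a minimal sketch over a plain sequence of states. It does not use the real `com.twitter.util.Activity` API: `State`, `Pending`, `Ok`, `Failed`, and `stabilized` are simplified, hypothetical stand-ins, and the sketch only models the failure case mentioned in this description (it does not model interruptions or closing the `Activity`).

```scala
// Simplified stand-ins for an Activity's states; NOT the real com.twitter.util types.
sealed trait State[+T]
case object Pending extends State[Nothing]
final case class Ok[T](value: T) extends State[T]
final case class Failed(cause: Throwable) extends State[Nothing]

// Hypothetical helper: while the current output is an Ok, an incoming
// failure repeats that Ok; every other update is passed through unchanged.
def stabilized[T](updates: Seq[State[T]]): Seq[State[T]] =
  updates.scanLeft[State[T]](Pending) {
    case (prev @ Ok(_), Failed(_)) => prev // keep the last good state
    case (_, next) => next                 // otherwise take the new state
  }.tail // drop the initial Pending seed
```

A failure that arrives before any `Ok` still surfaces as `Failed`, which matches the "no good state was previously observed" case.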

I've added some additional tests in `ConsulNamerTest` to ensure that the namer will fall back to the previous good state on errors, that the state the namer falls back to is the most recent observed good state, and that the namer will resume responding to good updates after falling back.

Closes #1593

@hawkw added this to the 1.2.0 milestone Aug 17, 2017
@hawkw self-assigned this Aug 17, 2017
@hawkw requested review from adleong and olix0r August 17, 2017 21:24
@hawkw mentioned this pull request Aug 17, 2017

@adleong (Member) commented Aug 18, 2017

@hawkw I believe that #1593 refers to errors with the consul namer, not the consul dtab store. In particular, I think the problem stems from the way that errors are handled in SvcAddr.scala

@hawkw (Contributor, Author) commented Aug 18, 2017

@adleong oh, whoops, I can look into that. Sorry, my bad.

@hawkw (Contributor, Author) commented Aug 23, 2017

Okay @adleong and @ashald, I've updated this PR; hopefully it now makes the correct changes.

@hawkw requested review from esbie and klingerf and removed request for klingerf August 23, 2017 18:29
@esbie (Contributor) left a comment

this lgtm

Contributor

nit: spaces

@hawkw (Contributor, Author) commented Aug 23, 2017

Okay, have tested this against https://github.com/linkerd/linkerd-examples/tree/master/consul and nothing appears to be obviously broken; going to go ahead and merge

@hawkw merged commit ef38c6a into master Aug 23, 2017
hawkw added a commit that referenced this pull request Aug 31, 2017
@hawkw mentioned this pull request Sep 7, 2017
hawkw added a commit that referenced this pull request Sep 7, 2017
## 1.2.0 2017-09-07

* **Breaking Change**: `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd
  admin, and namerd admin now serve on 127.0.0.1 by default (instead of
  0.0.0.0).
* **Breaking Change**: Removed support for PKCS#1-formatted keys. PKCS#1 formatted keys must be converted to PKCS#8 format.
* Added experimental `io.l5d.dnssrv` namer for DNS SRV records (#1611)
* Kubernetes
  * Added an experimental `io.l5d.k8s.configMap` interpreter for reading dtabs from a Kubernetes ConfigMap (#1603). This interpreter will respond to changes in the ConfigMap, allowing for dynamic dtab updates without the need to run Namerd.
  * Made ingress controller's ingress class annotation configurable (#1584).
  * Fixed an issue where Linkerd would continue routing traffic to endpoints of a service after that service was removed (#1622).
  * Major refactoring and performance improvements to `io.l5d.k8s` and `io.l5d.k8s.ns` namers (#1603).
  * Ingress controller now checks all available ingress resources before using a default backend (#1607).
  * Ingress controller now correctly routes requests with host headers that contain ports (#1607).
* HTTP/2
  * Fixed an issue where long-running H2 streams would eventually hang (#1598).
  * Fixed a memory leak on long-running H2 streams (#1598)
  * Added a user-friendly error message when an HTTP/2 router receives an HTTP/1 request (#1618)
* HTTP/1
  * Removed spurious `ReaderDiscarded` exception logged on HTTP/1 retries (#1609)
* Consul
  * Added support for querying Consul by specific service health states (#1601)
  * Consul namers and Dtab store now fall back to a last known good state on Consul observation errors (#1597)
  * Improved log messages for Consul observation errors (#1597)
* TLS
  * Removed support for PKCS#1 keys (#1590)
  * Added validation to prevent incompatible `disableValidation: true` and `clientAuth` settings in TLS client configurations (#1621)
* Changed `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd
  admin, and namerd admin to serve on 127.0.0.1 by default (instead of
  0.0.0.0) (#1366)
* Deprecated `io.l5d.statsd` telemeter.

Linked issue: Improve handling of Consul failures