
CoreDNS resiliency with non-compliant upstream  #5953

@gcs278

Description

What happened:
I'm trying to work through an issue in which an OKD user reported (https://issues.redhat.com/browse/OCPBUGS-6829) CoreDNS failing on upstream/forwarded requests with the following error:

CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size 

The only way this error path can be reached is if the upstream is non-compliant with respect to UDP payload sizes. We specify bufsize 512, and it appears the upstream is coming back with more than 512 bytes without setting the truncated (TC) bit. The user is using a Windows 2016 or 2019 AD DC integrated DNS server for upstream DNS.
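
For context, the advertised payload size travels in the EDNS0 OPT record, and a compliant server must either fit its answer within that size or set the TC bit. Anyone who wants to probe an upstream directly could use something like the following sketch with the miekg/dns library (the same DNS library CoreDNS is built on); the upstream address is a placeholder:

    package main

    import (
        "fmt"
        "log"

        "github.com/miekg/dns"
    )

    func main() {
        m := new(dns.Msg)
        m.SetQuestion("oauth-openshift.apps.okd-admin.muc.lv1871.de.", dns.TypeAAAA)
        m.SetEdns0(512, false) // advertise a 512-byte UDP payload, like bufsize 512

        c := &dns.Client{Net: "udp"}
        r, _, err := c.Exchange(m, "192.0.2.53:53") // placeholder upstream address
        if err != nil {
            // A non-compliant upstream that sends an oversized datagram
            // typically fails here with an unpack error.
            log.Fatal(err)
        }
        // A compliant server either fits the answer in 512 bytes or sets TC=1.
        fmt.Println("truncated:", r.Truncated, "len:", r.Len())
    }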

I've reproduced this error with two CoreDNS binaries: a vanilla CoreDNS (first hop/downstream) and a hacked CoreDNS (upstream) that doesn't truncate and sends UDP responses larger than the requested UDP payload size. I "hacked" CoreDNS to never truncate by putting a return here. So the setup looks like:
Dig Client --> Vanilla Downstream CoreDNS 1.10.0 --> Upstream Hacked-to-not-truncate CoreDNS 1.10.0
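
The same misbehavior can also be emulated without patching CoreDNS at all. Here is a sketch of a deliberately non-compliant UDP server in miekg/dns; the port and the oversized TXT payload are made up for illustration:

    package main

    import (
        "log"
        "strings"

        "github.com/miekg/dns"
    )

    func main() {
        dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
            m := new(dns.Msg)
            m.SetReply(r)
            // Pad the answer well past 512 bytes.
            for i := 0; i < 5; i++ {
                m.Answer = append(m.Answer, &dns.TXT{
                    Hdr: dns.RR_Header{Name: r.Question[0].Name, Rrtype: dns.TypeTXT, Class: dns.ClassINET, Ttl: 30},
                    Txt: []string{strings.Repeat("x", 200)},
                })
            }
            // Deliberately skip m.Truncate(...): the whole oversized message
            // goes out over UDP, and TC is never set.
            _ = w.WriteMsg(m)
        })
        log.Fatal((&dns.Server{Addr: ":5454", Net: "udp"}).ListenAndServe())
    }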

In this reproducer, the chain of events is as follows:

  1. The client requests a larger-than-512-byte DNS record.
  2. The downstream CoreDNS forwards to the upstream CoreDNS, but the upstream hacked CoreDNS doesn't truncate its response.
  3. The downstream CoreDNS repeatedly fails and keeps re-sending requests to the upstream (over UDP) for 5 seconds straight; the error is always one of the "overflow" errors mentioned above (a minimal client-side reproduction is sketched after this list).
  4. The request times out after 5 seconds (hardcoded), and the downstream returns the "overflow" message it had been receiving for 5 seconds of non-stop querying.
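
That repeated failure can be reproduced client-side against the emulated server above; a sketch follows (the exact error string depends on where unpacking runs out of buffer):

    package main

    import (
        "fmt"

        "github.com/miekg/dns"
    )

    func main() {
        m := new(dns.Msg)
        m.SetQuestion("www.example.org.", dns.TypeTXT)
        m.SetEdns0(512, false)

        // The client's UDP read buffer follows the advertised EDNS0 size, so
        // the oversized datagram is cut off at 512 bytes and unpacking fails,
        // e.g. "dns: overflowing header size". TC is never set, so nothing
        // signals the client to fall back to TCP.
        c := &dns.Client{Net: "udp"}
        _, _, err := c.Exchange(m, "127.0.0.1:5454")
        fmt.Println(err)
    }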

What you expected to happen:
I understand the upstream is non-compliant, but non-compliant servers exist in the wild (see #3941 and #3305). The user's workaround here is to increase bufsize (which seems a bit arbitrary) or to turn on force_tcp.
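
For reference, both workarounds are small Corefile changes; the size below is illustrative, and you would pick one approach or the other:

    .:5353 {
        # Workaround 1: advertise a larger buffer. The size is arbitrary; it
        # just has to exceed the largest answer the upstream will send.
        bufsize 1232

        forward . /etc/resolv.conf {
            # Workaround 2: skip UDP entirely for forwarded queries.
            force_tcp
        }
    }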

My question here is: could CoreDNS automatically retry over TCP after a UDP payload failure? Right now it helplessly retries over UDP for 5 seconds; would attempting a TCP retry on one of these failure cases increase resiliency?
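
Something along these lines is what I'm imagining (a sketch with miekg/dns, not actual CoreDNS code; the helper name is made up). The forward plugin already retries over TCP when a response comes back with TC=1; the question is whether an unpack/overflow error on the UDP path should trigger the same fallback:

    package main

    import (
        "fmt"
        "log"

        "github.com/miekg/dns"
    )

    // exchangeWithFallback is a made-up helper. It tries UDP first and
    // retries over TCP when the reply is truncated (TC=1, already handled
    // today) OR when the UDP reply failed to unpack -- the proposed extra
    // case for upstreams that ignore the advertised payload size.
    func exchangeWithFallback(m *dns.Msg, addr string) (*dns.Msg, error) {
        udp := &dns.Client{Net: "udp"}
        r, _, err := udp.Exchange(m, addr)
        if err == nil && !r.Truncated {
            return r, nil // clean UDP answer, nothing to do
        }
        // err != nil covers unpack failures such as "dns: overflowing header
        // size"; a real implementation would match those errors specifically
        // rather than retrying TCP on every UDP error.
        tcp := &dns.Client{Net: "tcp"}
        r, _, err = tcp.Exchange(m, addr)
        return r, err
    }

    func main() {
        m := new(dns.Msg)
        m.SetQuestion("www.example.org.", dns.TypeTXT)
        m.SetEdns0(512, false)

        r, err := exchangeWithFallback(m, "127.0.0.1:5454")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(r)
    }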

How to reproduce it (as minimally and precisely as possible):
See the reproducer described in the "What happened" section above, along with the server sketch that emulates the non-compliant upstream.

Anything else we need to know?:
N/A

Environment:

  • the version of CoreDNS:
    1.10.0
  • Corefile:
  Corefile: |
    .:5353 {
        bufsize 512
        errors
        log . {
            class error
        }
        health {
            lameduck 20s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus 127.0.0.1:9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 900 {
            denial 9984 30
        }
        reload
    }
  • logs, if applicable:
[ERROR] plugin/errors: 2 www.example.org. AAAA: dns: overflowing header size
  • OS (e.g: cat /etc/os-release): RHEL 8.7
  • Others: N/A
