Description
What happened:
I'm trying to work through an issue in which an OKD user reported (https://issues.redhat.com/browse/OCPBUGS-6829) CoreDNS failing on upstream/forwarded requests with this error:
CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size
The only way this logic could be getting executed is if the upstream is non-compliant regarding UDP payload sizes. We specify bufsize 512, and it appears the upstream is coming back with more than 512 bytes without setting the truncated (TC) bit. The user is using a Windows 2016 or 2019 AD DC integrated DNS server for upstream DNS.
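For reference, honoring the requester's advertised UDP payload size looks roughly like this with github.com/miekg/dns (the library CoreDNS is built on). This is a minimal sketch of what a compliant responder does, not CoreDNS's actual code path; the package and function names are mine:

```go
// Package compliantdns sketches what a compliant responder does with
// the client's advertised UDP payload size (illustration only).
package compliantdns

import "github.com/miekg/dns"

// handleQuery answers a query while honoring the client's advertised
// UDP payload size. Register it with dns.HandleFunc(".", handleQuery).
func handleQuery(w dns.ResponseWriter, r *dns.Msg) {
	m := new(dns.Msg)
	m.SetReply(r)
	// ... append answer RRs to m.Answer here ...

	// The classic UDP limit is 512 bytes; a client may advertise a
	// larger payload via an EDNS0 OPT record.
	size := dns.MinMsgSize // 512
	if opt := r.IsEdns0(); opt != nil {
		size = int(opt.UDPSize())
	}
	// Truncate drops records until the message fits in size bytes and
	// sets the TC bit, telling the client to retry over TCP.
	m.Truncate(size)
	w.WriteMsg(m)
}
```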
I've reproduced this error with two CoreDNS binaries: one being a vanilla CoreDNS (first hop/downstream) and the other a hacked CoreDNS (upstream) that doesn't truncate and sends UDP responses larger than the requested UDP payload size. I "hacked" CoreDNS to never truncate by putting a return here. So the setup looks like:
Dig Client --> Vanilla Downstream CoreDNS 1.10.0 --> Upstream Hacked-to-not-truncate CoreDNS 1.10.0
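For anyone reproducing this without patching CoreDNS, a standalone stand-in for the hacked upstream can be written directly against github.com/miekg/dns: it answers every query with enough AAAA records to blow past 512 bytes and never truncates or sets the TC bit. The listen port (5354) and record count are assumptions of this sketch:

```go
// A standalone upstream that never truncates: every query gets an
// oversized answer with no TC bit, mimicking the hacked CoreDNS.
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/miekg/dns"
)

func main() {
	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		m := new(dns.Msg)
		m.SetReply(r)
		// ~40 compressed AAAA records comfortably exceed 512 bytes.
		for i := 0; i < 40; i++ {
			m.Answer = append(m.Answer, &dns.AAAA{
				Hdr:  dns.RR_Header{Name: r.Question[0].Name, Rrtype: dns.TypeAAAA, Class: dns.ClassINET, Ttl: 300},
				AAAA: net.ParseIP(fmt.Sprintf("2001:db8::%x", i+1)),
			})
		}
		// Deliberately no m.Truncate(...) call and no TC bit.
		w.WriteMsg(m)
	})
	log.Fatal((&dns.Server{Addr: ":5354", Net: "udp"}).ListenAndServe())
}
```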
In this reproducer, the chain of events is as follows:
- Client requests a DNS record larger than 512 bytes
- The downstream CoreDNS forwards to the upstream CoreDNS, but the upstream hacked CoreDNS doesn't truncate its response
- The downstream CoreDNS repeatedly fails and keeps re-sending requests to the upstream (over UDP) for 5 seconds straight. The error is always one of the "overflow" errors mentioned above (see the client-side sketch after this list).
- The request times out after 5 seconds (hard-coded), and the downstream returns the "overflow" message it had been receiving for 5 seconds of non-stop querying
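To make the downstream failure mode concrete: a client that advertised a 512-byte payload reads the reply into a 512-byte buffer, the kernel silently drops the rest of the oversized datagram, and unpacking the cut-off message fails. That's my understanding of where the "overflow" errors come from; here's a minimal demonstration against the hacked upstream sketched above (address and query name assumed):

```go
// A minimal client that mimics the downstream's read, assuming the
// non-truncating upstream from the previous sketch on 127.0.0.1:5354.
package main

import (
	"fmt"
	"net"

	"github.com/miekg/dns"
)

func main() {
	q := new(dns.Msg)
	q.SetQuestion("www.example.org.", dns.TypeAAAA)
	q.SetEdns0(512, false) // advertise a 512-byte UDP payload

	conn, err := net.Dial("udp", "127.0.0.1:5354")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	out, _ := q.Pack()
	conn.Write(out)

	// Read with the 512-byte buffer the advertised size implies; the
	// kernel discards the rest of the oversized datagram.
	buf := make([]byte, 512)
	n, _ := conn.Read(buf)

	// Unpacking the cut-off message fails with one of the "overflow"
	// parse errors from the dns library.
	in := new(dns.Msg)
	fmt.Println(in.Unpack(buf[:n]))
}
```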
What you expected to happen:
I understand the upstream is non-compliant, but it seems non-compliant servers exist in the wild (see #3941 and #3305). The user's workaround here is to increase bufsize (which seems a bit arbitrary) or to turn on force_tcp (snippet below).
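For reference, the force_tcp workaround is a one-line addition to the forward block of the Corefile shown under Environment below:

```
forward . /etc/resolv.conf {
    policy sequential
    force_tcp
}
```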
My question here is: could CoreDNS automatically retry over TCP after a UDP payload failure? It currently retries helplessly over UDP for 5 seconds; would attempting a TCP retry on one of these failure cases increase resiliency?
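To illustrate the shape of the idea, here's a minimal sketch of such a fallback using dns.Client from github.com/miekg/dns. This is not how the forward plugin is structured internally, and matching on the error string is crude; the upstream address, timeout, and function name are placeholders:

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/miekg/dns"
)

// exchangeWithFallback tries UDP first and retries once over TCP when
// the reply is truncated or fails to parse (e.g. an oversized UDP reply).
func exchangeWithFallback(m *dns.Msg, upstream string) (*dns.Msg, error) {
	udp := &dns.Client{Net: "udp", UDPSize: 512, Timeout: 2 * time.Second}
	r, _, err := udp.Exchange(m, upstream)

	truncated := r != nil && r.Truncated
	overflow := err != nil && strings.Contains(err.Error(), "overflow")
	if truncated || overflow {
		// Retry over TCP instead of hammering UDP until the timeout.
		tcp := &dns.Client{Net: "tcp", Timeout: 2 * time.Second}
		r, _, err = tcp.Exchange(m, upstream)
	}
	return r, err
}

func main() {
	m := new(dns.Msg)
	m.SetQuestion("www.example.org.", dns.TypeAAAA)
	fmt.Println(exchangeWithFallback(m, "127.0.0.1:5354"))
}
```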
How to reproduce it (as minimally and precisely as possible):
The reproducer is described in the "What happened" section above.
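Concretely, with the hacked upstream sketched above listening on 127.0.0.1:5354 and the downstream Corefile's forward pointed at it instead of /etc/resolv.conf, the error can be triggered with (ports and query name are from my setup):

```
dig @127.0.0.1 -p 5353 AAAA www.example.org +bufsize=512
```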
Anything else we need to know?:
N/A
Environment:
- the version of CoreDNS: 1.10.0
- Corefile:
Corefile: |
  .:5353 {
      bufsize 512
      errors
      log . {
          class error
      }
      health {
          lameduck 20s
      }
      ready
      kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
      }
      prometheus 127.0.0.1:9153
      forward . /etc/resolv.conf {
          policy sequential
      }
      cache 900 {
          denial 9984 30
      }
      reload
  }
- logs, if applicable:
[ERROR] plugin/errors: 2 www.example.org. AAAA: dns: overflowing header size
- OS (e.g: cat /etc/os-release): RHEL 8.7
- Others: