Description
What would you like to be added:
Add an optional show_first flag to the consolidate directive of the errors plugin. With the flag set, the first error is logged immediately and subsequent errors are consolidated.
When show_first is enabled:
- The first matching error is logged immediately with full details (rcode, domain, type, error message) using the configured log level
- Subsequent matching errors are consolidated during the period
- At period end:
  - If only one error occurred, no summary is printed (already logged)
  - If multiple errors occurred, summary shows the total count
Syntax:
consolidate DURATION REGEXP [LEVEL] [show_first]
Example with 3 errors:
[WARNING] 2 example.org. A: read udp 10.0.0.1:53->8.8.8.8:53: i/o timeout
[WARNING] 3 errors like '^read udp .* i/o timeout$' occurred in last 30s
Example with 1 error:
[WARNING] 2 example.org. A: read udp 10.0.0.1:53->8.8.8.8:53: i/o timeout
Why is this needed:
The current consolidate directive in the errors plugin effectively prevents log flooding by aggregating similar errors and showing only a summary. However, this approach has a significant limitation in production environments:
The consolidated summary lacks concrete error details needed for debugging.
For example, when you see:
[WARNING] 15 errors like '.*timeout.*' occurred in last 5m
You know errors occurred, but you don't know:
- Which specific domain triggered the error
- What was the exact error message
- What query type was involved
- Any other contextual information that could help diagnose the root cause
The Production Dilemma
In production environments, operators face a difficult choice:
- Keep consolidate enabled: Prevent log flooding, but lose debugging context
- Disable consolidate: Get full error details, but risk overwhelming the logging system and storage
This is especially problematic because many issues only manifest in production under real traffic patterns, making them impossible to reproduce in development or staging environments.
Solution: The show_first Option
The show_first flag addresses both concerns at once:
- Maintains log hygiene: Only the first error is logged in detail, not all occurrences
- Preserves debugging context: The first error includes full details (rcode, domain, query type, error message)
- Enables production troubleshooting: Operators can identify the specific scenario causing errors
- Minimal log volume increase: Only one additional log entry per consolidation period
Real-World Use Case
Consider a DNS server experiencing intermittent timeout errors to upstream resolvers:
Without show_first:
[WARNING] 247 errors like '.*i/o timeout.*' occurred in last 5m
→ You know there are timeout errors, but which upstream? Which domain? Hard to debug.
With show_first:
[WARNING] 2 example.org. A: read udp 10.0.0.1:53->8.8.8.8:53: i/o timeout
[WARNING] 247 errors like '.*i/o timeout.*' occurred in last 5m
→ Now you can see it's affecting queries to 8.8.8.8, and you can investigate network connectivity to that specific upstream.
Benefits
- Improves observability: Provides both metrics (error count) and examples (actual error)
- Enables faster incident response: No need to disable consolidate or wait for errors to reproduce
- Follows established practice: Similar to exemplars in observability tooling, where a sampled concrete event accompanies an aggregate metric
- Respects log level configuration: First error uses the configured log level (warning/error/info/debug)
- Backward compatible: Completely optional, existing configurations continue to work unchanged
This enhancement makes CoreDNS more production-ready by balancing operational needs (log volume control) with debugging requirements (contextual information).