Description
Proposal Details
The current methods for handling race detector output are fragile and cumbersome, leading to significant developer friction, repeated work, and missed opportunities for automated bug detection.
- Fragile stderr output of TSAN: Race reports from Go programs compiled with -race are long, multiline strings that are easily corrupted or scattered when interleaved with other logs. Parsing this unstructured text is brittle and error-prone. At Uber, our stderr logs are consumed by a logging system that splits logs on the line separator, so race stacks get scattered, making it nearly impossible to reconstruct them from production machines that run race-enabled binaries. (The use case here is running a program built with -race and halt_on_error=0 to find production races.)
- The GORACE=log_path=file option redirects the race output to a file to avoid log corruption. However, it requires a separate, complex process to periodically poll the file, handle partial writes, and reliably parse each report. This adds overhead and complexity to deployment and monitoring pipelines.
- Brittle log parsing: The reliance on external log parsers is inherently fragile. Small changes in the report format can break log-scraping tools, requiring constant maintenance and reducing the reliability of automated race detection in CI/CD and production.
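To illustrate the fragility described above, here is a minimal sketch of the kind of scraper these approaches force teams to write. It assumes reports are bracketed by lines made up only of `=` characters, and the sample stream, paths, and addresses are made up for illustration. The helper names (`isDelimiter`, `extractReports`) are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isDelimiter reports whether a line consists only of '=' characters,
// which this sketch assumes marks the start and end of a race report.
func isDelimiter(line string) bool {
	t := strings.TrimSpace(line)
	return t != "" && strings.Trim(t, "=") == ""
}

// extractReports groups the lines found between delimiter pairs.
// It cannot tell race-report lines apart from unrelated log lines
// that were interleaved into the same stream.
func extractReports(stream string) [][]string {
	var reports [][]string
	var current []string
	inReport := false
	for _, line := range strings.Split(stream, "\n") {
		if isDelimiter(line) {
			if inReport {
				reports = append(reports, current)
				current = nil
			}
			inReport = !inReport
			continue
		}
		if inReport {
			current = append(current, line)
		}
	}
	return reports
}

func main() {
	// Illustrative stream: a report with an application log line
	// interleaved in the middle of a stack frame.
	stream := strings.Join([]string{
		"==================",
		"WARNING: DATA RACE",
		"Read at 0x00c000012345 by goroutine 7:",
		"  main.worker()",
		`{"level":"info","msg":"request served"}`,
		"      /app/main.go:42 +0x3c",
		"==================",
	}, "\n")

	for _, r := range extractReports(stream) {
		fmt.Printf("report with %d lines\n", len(r))
		for _, l := range r {
			fmt.Println(l)
		}
	}
}
```

The extracted "report" silently includes the unrelated JSON log line, and a single lost or reordered delimiter line would merge or drop whole reports. A pipeline that splits on newlines and ships each line separately makes this worse, since the grouping has to be reconstructed downstream.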
Proposed Solution: A Standard, Structured API
A new API is required to provide a robust and durable solution, enabling Go programs to directly consume race reports as structured data. runtime/pprof is the ideal location for this API, as it is the established home for profiling and diagnostic tools.
The API should expose a structured report object that includes, at a minimum:
- A pair of structured stack traces: One for each of the conflicting accesses along with the goroutine ancestry.
- The memory address of the conflicting access (more on this below).
- The goroutine IDs involved.
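As a rough sketch of what such a report object could look like, the following defines hypothetical types covering the fields listed above. None of these names (`Frame`, `Access`, `RaceReport`, `OnRace`) exist in runtime/pprof today; they are assumptions for illustration only, and the report in `main` is constructed by hand rather than produced by the race detector.

```go
package main

import "fmt"

// Frame is one entry in a symbolized stack trace (hypothetical type).
type Frame struct {
	Function string
	File     string
	Line     int
}

// Access describes one of the two conflicting memory accesses (hypothetical type).
type Access struct {
	GoroutineID uint64
	Write       bool    // true for a write, false for a read
	Stack       []Frame // stack at the time of the access
	CreatedBy   []Frame // stack that created the goroutine (ancestry)
}

// RaceReport is a hypothetical structured race report.
type RaceReport struct {
	Addr     uintptr   // memory address of the conflicting access
	Accesses [2]Access // the two conflicting accesses
}

// Summary returns a one-line description suitable for a structured logger.
func (r RaceReport) Summary() string {
	return fmt.Sprintf("race at %#x between goroutines %d and %d",
		r.Addr, r.Accesses[0].GoroutineID, r.Accesses[1].GoroutineID)
}

// handlers holds callbacks registered through OnRace.
var handlers []func(RaceReport)

// OnRace registers a callback to receive reports (hypothetical entry point;
// a real API would be invoked by the runtime, not by user code).
func OnRace(h func(RaceReport)) {
	handlers = append(handlers, h)
}

// dispatch delivers a report to every registered handler.
func dispatch(r RaceReport) {
	for _, h := range handlers {
		h(r)
	}
}

func main() {
	OnRace(func(r RaceReport) {
		// A single-line record survives line-splitting log pipelines.
		fmt.Println(r.Summary())
	})

	// Hand-built example report with made-up values.
	dispatch(RaceReport{
		Addr: 0xc000012345,
		Accesses: [2]Access{
			{GoroutineID: 7, Write: false,
				Stack: []Frame{{Function: "main.worker", File: "/app/main.go", Line: 42}}},
			{GoroutineID: 1, Write: true,
				Stack: []Frame{{Function: "main.main", File: "/app/main.go", Line: 17}}},
		},
	})
}
```

With a shape like this, a program could serialize each report as a single record (for example JSON) and emit it through its normal logging or metrics path, avoiding both stderr scraping and log-file polling. Whether delivery is callback-based or pull-based is a design choice this sketch does not settle.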
Enhancing Race Debugging with Allocation Site Information
While the structured API is the primary objective of this proposal, we must also address a major pain point in race debugging: the difficulty of identifying the object involved. Currently, the race detector can't easily link a race to the stack trace that allocated the racy object.
Therefore, this proposal also suggests the long-term goal of exposing the allocation site stack within the structured report. This would enable a full, end-to-end view of the race, from the object's creation to its conflicting accesses, which would dramatically reduce the time spent on root cause analysis.