@kevwan kevwan commented Oct 13, 2025

Description

Fixes #5242

This PR resolves an issue where the trace agent would reinitialize when running multiple servers (REST + RPC) with different trace endpoints, causing the global TracerProvider to be overwritten.

Problem

When running multiple servers with tracing enabled:

```go
RestConf.SetUp()  // Calls trace.StartAgent with endpoint A
ZRpcConf.SetUp()  // Calls trace.StartAgent with endpoint B
```

The trace agent would reinitialize for each server with different endpoints, overwriting the global TracerProvider. Only the last configuration would take effect.

Root Cause

The StartAgent function only prevented re-initialization when the exact same endpoint was used:

```go
lock.Lock()
if _, ok := agents[c.Endpoint]; !ok {
    agents[c.Endpoint] = lang.Placeholder
    err = startAgent(c)
}
lock.Unlock()
```

When different endpoints were configured (common in multi-server setups), startAgent() would be called again, creating a new TracerProvider and overwriting the global one via otel.SetTracerProvider().
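The root cause can be reproduced without OpenTelemetry. The following is a minimal, standard-library-only sketch, not go-zero's code: `provider`, `legacyAgent`, and `start` are hypothetical stand-ins for the TracerProvider, the `agents` map plus `lock`, and `StartAgent`. It only illustrates that a guard keyed by endpoint lets each distinct endpoint replace the global provider.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical stand-in for an OpenTelemetry TracerProvider.
type provider struct{ endpoint string }

// legacyAgent mimics the pre-fix guard: a mutex plus a set of seen endpoints.
type legacyAgent struct {
	mu     sync.Mutex
	seen   map[string]struct{}
	global *provider // stands in for the provider set via otel.SetTracerProvider
	inits  int       // number of times initialization actually ran
}

func newLegacyAgent() *legacyAgent {
	return &legacyAgent{seen: make(map[string]struct{})}
}

// start initializes once per distinct endpoint and replaces the global
// provider every time it does so.
func (a *legacyAgent) start(endpoint string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if _, ok := a.seen[endpoint]; !ok {
		a.seen[endpoint] = struct{}{}
		a.global = &provider{endpoint: endpoint}
		a.inits++
	}
}

func main() {
	a := newLegacyAgent()
	a.start("endpoint-a") // e.g. the REST server
	a.start("endpoint-b") // e.g. the RPC server: overwrites the global provider
	a.start("endpoint-a") // repeated endpoint: skipped by the guard
	fmt.Println(a.inits, a.global.endpoint) // prints "2 endpoint-b"
}
```

In this sketch the guard only deduplicates identical endpoints, so the provider created for `endpoint-a` is silently replaced by the one for `endpoint-b`.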

Solution

Adopted the sync.Once pattern used by prometheus.StartAgent and logx.SetUp to ensure trace initialization happens exactly once:

```go
var (
    once sync.Once
    tp   *sdktrace.TracerProvider
)

func StartAgent(c Config) {
    if c.Disabled {
        return
    }
    once.Do(func() {
        if err := startAgent(c); err != nil {
            logx.Error(err)
        }
    })
}
```
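To illustrate the "first configuration wins" behavior of the snippet above, here is a small standard-library-only sketch. It is an assumption-laden model rather than the real `core/trace` package: `config`, `provider`, `onceAgent`, and `start` are hypothetical names that mirror `Config`, the TracerProvider, the package-level `once`/`tp` pair, and `StartAgent`.

```go
package main

import (
	"fmt"
	"sync"
)

// config is a hypothetical, trimmed-down stand-in for trace.Config.
type config struct {
	Endpoint string
	Disabled bool
}

// provider is a hypothetical stand-in for the TracerProvider.
type provider struct{ endpoint string }

// onceAgent groups the sync.Once and the provider it guards.
type onceAgent struct {
	once   sync.Once
	global *provider
}

// start mirrors the shape of the fixed StartAgent: disabled configs return
// early, and only the first enabled config initializes the provider.
func (a *onceAgent) start(c config) {
	if c.Disabled {
		return // returns before once.Do, so the Once is not consumed
	}
	a.once.Do(func() {
		a.global = &provider{endpoint: c.Endpoint}
	})
}

func main() {
	a := &onceAgent{}
	a.start(config{Disabled: true})         // ignored, Once still unused
	a.start(config{Endpoint: "endpoint-a"}) // first enabled config wins
	first := a.global
	a.start(config{Endpoint: "endpoint-b"}) // ignored by once.Do
	fmt.Println(a.global == first, a.global.endpoint) // prints "true endpoint-a"
}
```

One consequence worth noting in this model: because the disabled check runs before `once.Do`, a disabled configuration does not use up the single initialization, so a later enabled configuration can still start the agent.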

Changes

  • ✅ Replaced agents map and lock mutex with sync.Once
  • ✅ Updated StartAgent() to wrap initialization in once.Do()
  • ✅ Simplified StopAgent() by removing unnecessary lock
  • ✅ Added documentation explaining the behavior
  • ✅ Updated TestStartAgent to verify new sync.Once behavior

Benefits

  • Consistent with go-zero patterns: Both prometheus.StartAgent and logx.SetUp use sync.Once
  • Predictable behavior: First configuration wins, subsequent calls safely ignored
  • Safe multi-server setup: Works correctly with REST + RPC servers
  • Prevents silent configuration loss: No more overwriting
  • Simpler code: Fewer lines, clearer intent

Testing

  • ✅ All existing tests pass (core/trace, core/service, rest, zrpc)
  • ✅ Created verification tests demonstrating the fix
  • ✅ No breaking changes, fully backward compatible
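The kind of property such tests might check can be sketched as follows. This is not the PR's actual `TestStartAgent`; `countInits` is a hypothetical helper that calls a `sync.Once`-guarded initializer from many goroutines and reports how often the initializer really ran.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countInits invokes a sync.Once-guarded initializer from n goroutines and
// returns how many times the initializer body executed.
func countInits(n int) int32 {
	var (
		once  sync.Once
		inits int32
		wg    sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Every goroutine races to initialize; sync.Once lets one through.
			once.Do(func() { atomic.AddInt32(&inits, 1) })
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&inits)
}

func main() {
	fmt.Println(countInits(100)) // prints "1"
}
```

`sync.Once` guarantees that the function passed to `Do` runs at most once even under concurrent callers, which is why no additional mutex is needed around the initialization.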

Verification

Before fix - TracerProvider being overwritten:

First StartAgent: tp=0x1400023a870
Second StartAgent: tp=0x1400023a990  // Different pointer!

After fix - TracerProvider remains consistent:

First StartAgent: tp=0x140001c6870
Second StartAgent: tp=0x140001c6870  // Same pointer

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/trace/agent.go | 71.42% | 1 Missing and 1 partial ⚠️ |

Fixes zeromicro#5242

Problem:
When running multiple servers (REST + RPC) with tracing enabled, the trace
agent would reinitialize for each server with different endpoints, causing
the global TracerProvider to be overwritten. Only the last configuration
would take effect, and the first configuration was silently lost.

Root Cause:
The StartAgent function only prevented re-initialization when the exact
same endpoint was used. If different endpoints were configured (common in
multi-server setups), startAgent() would be called again, creating a new
TracerProvider and overwriting the global one via otel.SetTracerProvider().

Solution:
Adopted the sync.Once pattern used by prometheus.StartAgent and logx.SetUp
to ensure trace initialization happens only once. The first configuration
wins, and subsequent calls are safely ignored.

Changes:
- Replaced agents map and lock mutex with sync.Once
- Updated StartAgent() to wrap initialization in once.Do()
- Simplified StopAgent() by removing unnecessary lock
- Added documentation explaining the behavior
- Updated TestStartAgent to verify new sync.Once behavior

Benefits:
- Consistent with go-zero patterns (Prometheus, logx)
- Predictable behavior (first config wins)
- Safe multi-server setup
- Prevents silent configuration loss
- Simpler code

Testing:
- All existing tests pass (core/trace, core/service, rest, zrpc)
- Created verification tests demonstrating the fix
- No breaking changes, fully backward compatible

@kevwan kevwan force-pushed the chore/undefined branch 2 times, most recently from e41e015 to ec65a86 October 25, 2025 11:47
@kevwan kevwan merged commit 4e52d77 into zeromicro:master Oct 25, 2025
5 of 6 checks passed
@kevwan kevwan deleted the chore/undefined branch October 25, 2025 12:10

Development

Successfully merging this pull request may close these issues.

starting multiple servers, the trace is initialized multiple times
