Commit b9570ba

feat(otel): Add OpenTelemetry GenAI instrumentation to Copilot Chat (#3917)
* feat: add OTel GenAI instrumentation foundation

  Phase 0 complete:
  - spec.md: Full spec with decisions, GenAI semconv, dual-write, eval signals, lessons from Gemini CLI + Claude Code
  - plan.md: E2E demo plan (chat ext + eval repo + Azure backend)
  - src/platform/otel/: IOTelService, config, attributes, metrics, events, message formatters, NodeOTelService, file exporters
  - package.json: Added @opentelemetry/* dependencies

  OTel is opt-in behind the OTEL_EXPORTER_OTLP_ENDPOINT env var.

* refactor: reorder OTel type imports for consistency

* refactor: reorder OTel type imports for consistency

* feat(otel): wire OTel spans into chat extension — Phase 1 core

  - Register IOTelService in DI (NodeOTelService when enabled, NoopOTelService when disabled)
  - Add OTelContrib lifecycle contribution for OTel init/shutdown
  - Add `chat {model}` inference span in ChatMLFetcherImpl._doFetchAndStreamChat()
  - Add `execute_tool {name}` span in ToolsService.invokeTool()
  - Add `invoke_agent {participant}` parent span in ToolCallingLoop.run()
  - Record gen_ai.client.operation.duration, tool call count/duration, agent metrics
  - Thread IOTelService through all ToolCallingLoop subclasses
  - Update test files with NoopOTelService
  - Zero overhead when OTel is disabled (noop providers, no dynamic imports)

* feat(otel): add embeddings span, config UI settings, and unit tests

  - Add `embeddings {model}` span in RemoteEmbeddingsComputer.computeEmbeddings()
  - Add VS Code settings under github.copilot.chat.otel.* in package.json (enabled, exporterType, otlpEndpoint, captureContent, outfile)
  - Wire VS Code settings into resolveOTelConfig in services.ts
  - Add unit tests for:
    - resolveOTelConfig: env precedence, kill switch, all config paths (16 tests)
    - NoopOTelService: zero-overhead noop behavior (8 tests)
    - GenAiMetrics: metric recording with correct attributes (7 tests)

* test(otel): add unit tests for messageFormatters, genAiEvents, fileExporters

  - messageFormatters: 18 tests covering toInputMessages, toOutputMessages, toSystemInstructions, toToolDefinitions (edge cases, empty inputs, invalid JSON)
  - genAiEvents: 9 tests covering all 4 event emitters, content capture on/off
  - fileExporters: 5 tests covering write/read round-trip for span, log, metric exporters plus aggregation temporality

  Total OTel test suite: 63 tests across 6 files

* feat(otel): record token usage and time-to-first-token metrics

  Add gen_ai.client.token.usage (input/output) and copilot_chat.time_to_first_token histogram metrics at the fetchMany success path where token counts and TTFT are available from the processSuccessfulResponse result.

* docs: finalize sprint plan with completion status

* style: apply formatter changes to OTel files

* feat(otel): emit gen_ai.client.inference.operation.details event with token usage

  Wire emitInferenceDetailsEvent into the fetchMany success path where full token usage (prompt_tokens, completion_tokens), resolved model, request ID, and finish reasons are available from processSuccessfulResponse. This follows the OTel GenAI spec pattern:
  - Spans: timing + hierarchy + error tracking
  - Events: full request/response details including token counts

  The data mirrors what RequestLogger captures for chat-export-logs.json.

* feat(otel): add aggregated token usage to invoke_agent span

  Per the OTel GenAI agent spans spec, add gen_ai.usage.input_tokens and gen_ai.usage.output_tokens as Recommended attributes on the invoke_agent span. Tokens are accumulated across all LLM turns by listening to onDidReceiveResponse events during the agent loop, then set on the span before it ends.

  Ref: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/

* feat(otel): add token usage attributes to chat inference span

  Defer the `chat {model}` span completion from _doFetchAndStreamChat to fetchMany where processSuccessfulResponse has extracted token counts.
  The chat span now carries:
  - gen_ai.usage.input_tokens (prompt_tokens)
  - gen_ai.usage.output_tokens (completion_tokens)
  - gen_ai.response.model (resolved model)

  The span handle is returned from _doFetchAndStreamChat via the result object so fetchMany can set attributes and end it after tokens are known. This matches the chat-export-logs.json pattern where each request entry carries full usage data alongside the response.

* style: apply formatter changes

* fix: correct import paths in otelContrib and add IOTelService to test

* feat: add diagnostic span exporter to log first successful export and failures

* feat: add content capture to OTel spans (messages, responses, tool args/results)

  - Chat spans: add copilot.debug_name attribute for identifying orphan spans
  - Chat spans: capture gen_ai.input.messages and gen_ai.output.messages when captureContent enabled
  - Tool spans: capture gen_ai.tool.call.arguments and gen_ai.tool.call.result when captureContent enabled
  - Extension chat endpoint: capture input/output messages when captureContent enabled
  - Add CopilotAttr.DEBUG_NAME constant

* fix: register IOTelService in chatLib setupServices for NES test

* fix: register OTel ConfigKey settings in Advanced namespace for configurations test

* fix: register IOTelService in shared test services (createExtensionUnitTestingServices)

* fix: register IOTelService in platform test services

* feat(otel): enhance GenAI span attributes per OTel semantic conventions

  - Change gen_ai.provider.name from 'openai' to 'github' for CAPI models
  - Rename CopilotAttr to CopilotChatAttr, prefix values with copilot_chat.*
  - Add GITHUB to GenAiProviderName enum
  - Replace copilot.debug_name with gen_ai.agent.name on chat spans
  - Add gen_ai.request.temperature, gen_ai.request.top_p to chat spans
  - Add gen_ai.response.id, gen_ai.response.finish_reasons on success
  - Add gen_ai.usage.cache_read.input_tokens from cached_tokens
  - Add copilot_chat.request.max_prompt_tokens and copilot_chat.time_to_first_token
  - Add gen_ai.tool.description to execute_tool spans
  - Fix gen_ai.tool.call.id to read chatStreamToolCallId (was reading nonexistent prop)
  - Fix tool result capture to handle PromptTsxPart and DataPart (not just TextPart)
  - Add gen_ai.input.messages and gen_ai.output.messages to invoke_agent span (opt-in)
  - Move gen_ai.tool.definitions from chat spans to invoke_agent span (opt-in)
  - Add gen_ai.system_instructions to chat spans (opt-in)
  - Fix error.type raw strings to use StdAttr.ERROR_TYPE constant
  - Centralize hardcoded copilot.turn_count and copilot.endpoint_type into CopilotChatAttr
  - Add COPILOT_OTEL_CAPTURE_CONTENT=true to launch.json for testing
  - Document span hierarchy fixes needed in plan.md

* feat(otel): connect subagent spans to parent trace via context propagation

  - Add TraceContext type and getActiveTraceContext() to IOTelService
  - Add storeTraceContext/getStoredTraceContext for cross-boundary propagation
  - Add parentTraceContext option to SpanOptions for explicit parent linking
  - Implement in NodeOTelService using OTel remote span context
  - Capture trace context when execute_tool runSubagent fires (keyed by toolCallId)
  - Restore parent context in subagent invoke_agent span (via subAgentInvocationId)
  - Auto-cleanup stored contexts after 5 minutes to prevent memory leaks
  - Update test mocks with new IOTelService methods
  - Update plan.md with investigation findings

* fix(otel): fix subagent trace context key to use parentRequestId

  The previous implementation stored trace context keyed by chatStreamToolCallId (model-assigned tool call ID), but looked it up by subAgentInvocationId (VS Code internal invocation.callId UUID). These are different IDs that don't match across the IPC boundary.

  Fix: key by chatRequestId on the store side (available on invocation options), and look up by parentRequestId on the subagent side (same value, available on ChatRequest). Both reference the parent agent's request ID.
  Verified: 21-span trace with subagent correctly nested under parent agent.

* fix(otel): add model attrs to invoke_agent and max_prompt_tokens to BYOK chat

  - Set gen_ai.request.model on invoke_agent span from endpoint
  - Track gen_ai.response.model from last LLM response resolvedModel
  - Add copilot_chat.request.max_prompt_tokens to BYOK chat spans
  - Document upstream gaps in plan.md (BYOK token usage, programmatic tool IDs)

* test(otel): add trace context propagation tests for subagent linkage

  Tests verify:
  - storeTraceContext/getStoredTraceContext round-trip and single-use semantics
  - getActiveTraceContext returns context inside startActiveSpan
  - parentTraceContext makes child span inherit traceId from parent
  - Independent spans get different traceIds without parentTraceContext
  - Full subagent flow: store context in tool call, retrieve in subagent

* fix(otel): add finish_reasons and ttft to BYOK chat spans, document orphan spans

  - Set gen_ai.response.finish_reasons on BYOK chat success
  - Set copilot_chat.time_to_first_token on BYOK chat success
  - Document Gap 4: duplicate orphan spans from CopilotLanguageModelWrapper
  - Identify all orphan span categories (title, progressMessages, promptCategorization, wrapper)

* docs(otel): update Gap 4 analysis — wrapper spans have actual token usage data

  The copilotLanguageModelWrapper orphan spans are the actual CAPI HTTP handlers, not duplicates. They contain real token usage, cache read tokens, resolved model names, and temperature — all missing from the consumer-side extChatEndpoint spans due to VS Code LM API limitations.

  Updated plan.md with:
  - Side-by-side attribute comparison table
  - Three fix approaches (context propagation, span suppression, enrichment)
  - Recommendation: Option 1 (propagate trace context through IPC)

* feat(otel): propagate trace context through BYOK IPC to link wrapper spans

  - Pass _otelTraceContext through modelOptions alongside _capturingTokenCorrelationId
  - Inject IOTelService into CopilotLanguageModelWrapper
  - Wrap makeRequest in startActiveSpan with parentTraceContext when available
  - This creates a byok-provider bridge span that makes chatMLFetcher's chat span a child of the original invoke_agent trace, bringing real token usage data into the agent trace hierarchy

* debug(otel): add debug attribute to verify trace context capture in BYOK path

* fix(otel): remove debug attribute, BYOK trace context propagation verified working

  Verified: 63-span trace with Azure BYOK (gpt-5) correctly shows:
  - byok-provider bridge spans linking wrapper chat spans into agent trace
  - Real token usage (in:21458 out:1730 cache:19072) visible on wrapper chat spans
  - hasCtx:true on all extChatEndpoint spans confirming context capture
  - Two subagent invoke_agent spans correctly nested under main agent
  - Zero orphan copilotLanguageModelWrapper spans

* refactor(otel): replace byok-provider bridge span with invisible context propagation

  Add runWithTraceContext() to IOTelService — sets parent trace context without creating a visible span. The wrapper's chat spans now appear directly as children of invoke_agent, eliminating the noisy byok-provider intermediary span.

  Before: invoke_agent → byok-provider → chat (wrapper)
  After: invoke_agent → chat (wrapper)

* refactor(otel): remove duplicate BYOK consumer-side chat span

  The extChatEndpoint no longer creates its own chat span. The wrapper's chatMLFetcher span (via CopilotLanguageModelWrapper) is the single source of truth with full token usage, cache data, and resolved model.

  Before: invoke_agent → chat (empty, extChatEndpoint) + chat (rich, wrapper)
  After: invoke_agent → chat (rich, wrapper only)

* fix(otel): restore chat span for non-wrapper BYOK providers (Anthropic, Gemini)

  The previous commit removed the extChatEndpoint chat span, which was correct for Azure/OpenAI BYOK (served by CopilotLanguageModelWrapper via chatMLFetcher). But Anthropic and Gemini BYOK providers call their native SDKs directly, bypassing CopilotLanguageModelWrapper — so they need the consumer-side span.

  Now: always create a chat span in extChatEndpoint with basic metadata (model, provider, response.id, finish_reasons). For wrapper-based providers, the chatMLFetcher also creates a richer sibling span with token usage.

* fix(otel): skip consumer chat span for wrapper-based BYOK providers

  Only create the extChatEndpoint chat span for non-wrapper providers (Anthropic, Gemini) that need it as their only span. Wrapper-based providers (Azure, OpenAI, OpenRouter, Ollama, xAI) get a single rich span from chatMLFetcher via CopilotLanguageModelWrapper.

  Result: 1 chat span per LLM call for all provider types.

* fix: remove unnecessary 'google' from non-wrapper vendor set

* feat(otel): add rich chat span with usage data for Anthropic BYOK provider

  Move chat span creation into AnthropicLMProvider where actual API response data (token usage, cache reads) is available. The span is linked to the agent trace via runWithTraceContext and enriched with:
  - gen_ai.usage.input_tokens / output_tokens
  - gen_ai.usage.cache_read.input_tokens
  - gen_ai.response.model / response.id / finish_reasons

  Remove the consumer-side extChatEndpoint span for all vendors (nonWrapperVendors now empty) since both wrapper-based and Anthropic providers create their own spans with full data.

  Next: apply same pattern to Gemini provider.
* feat(otel): add rich chat span for Gemini BYOK, clean up extChatEndpoint

  - Add OTel chat span with full usage data to GeminiNativeBYOKLMProvider
  - Remove all consumer-side span code from extChatEndpoint (dead code)
  - Each provider now owns its chat span with real API response data:
    * CAPI: chatMLFetcher
    * OpenAI-compat BYOK: CopilotLanguageModelWrapper → chatMLFetcher
    * Anthropic: AnthropicLMProvider
    * Gemini: GeminiNativeBYOKLMProvider
  - Fix Gemini test to pass IOTelService

* feat(otel): enrich Anthropic/Gemini chat spans with full metadata

  Add to both providers:
  - copilot_chat.request.max_prompt_tokens (model.maxInputTokens)
  - server.address (api.anthropic.com / generativelanguage.googleapis.com)
  - gen_ai.conversation.id (requestId)
  - copilot_chat.time_to_first_token (result.ttft)

  Now matches CAPI chat span attribute parity.

* feat(otel): add server.address to CAPI/Azure BYOK chat spans

  Extract hostname from urlOrRequestMetadata when it's a URL string and set as server.address on the chat span. Works for both CAPI and CopilotLanguageModelWrapper (Azure BYOK) paths.

* feat(otel): add max_tokens and output_messages to Anthropic/Gemini chat spans

  - gen_ai.request.max_tokens from model.maxOutputTokens
  - gen_ai.output.messages (opt-in) from response text
  - Closes remaining attribute gaps vs CAPI/Azure BYOK spans

* fix(otel): capture tool calls in output_messages for chat spans

  When the model responds with tool calls instead of text, the output_messages attribute was empty. Now captures both text parts and tool call parts in the output_messages, matching the OTel GenAI output messages schema.

  Also: Azure BYOK invoke_agent zero tokens is a known upstream gap — extChatEndpoint returns hardcoded usage:0 since the VS Code LM API doesn't expose actual usage from the provider side.

* fix(otel): capture tool calls in output_messages for Anthropic/Gemini BYOK spans

  Same fix as CAPI — when the model responds with tool calls, include them in gen_ai.output.messages alongside text parts. All three provider paths (CAPI, Anthropic, Gemini) now consistently capture both text and tool call parts in output messages.

* fix(otel): add input_messages and agent_name to Anthropic/Gemini chat spans

  - gen_ai.input.messages (opt-in) captured from provider messages parameter
  - gen_ai.agent.name set to AnthropicBYOK / GeminiBYOK for identification

  Closes the last attribute gaps vs CAPI/Azure BYOK chat spans.

* fix(otel): fix input_messages serialization for Anthropic/Gemini BYOK

  - Map enum role values to names (1→user, 2→assistant, 3→system)
  - Extract text from LanguageModelTextPart content arrays instead of showing '[complex]' for all messages
  - Use OTel GenAI input messages schema with role + parts format

* docs(otel): add remaining metrics/events work to plan.md

  Coverage matrix showing:
  - Anthropic/Gemini BYOK missing: operation.duration, token.usage, time_to_first_token metrics, and inference.details event
  - CAPI and Azure BYOK (via wrapper) fully covered
  - Tool/agent/session metrics covered across all providers
  - 4 tasks (M1-M4) to close the gap

* feat(otel): add metrics and inference events to Anthropic/Gemini BYOK providers

  Both providers now record:
  - gen_ai.client.operation.duration histogram
  - gen_ai.client.token.usage histograms (input + output)
  - copilot_chat.time_to_first_token histogram
  - gen_ai.client.inference.operation.details log event

  All metrics/events now have full parity across CAPI, Azure BYOK, Anthropic BYOK, and Gemini BYOK.

* fix(otel): fix LoggerProvider constructor — use 'processors' key (SDK v2)

  The OTel SDK v2 changed the LoggerProvider constructor option from 'logRecordProcessors' to 'processors'. The old key was silently ignored, causing all log records to be dropped. This is why logs never appeared in Loki despite traces working fine.
* docs: add agent monitoring guide with OTel usage and Claude/Gemini comparison

* docs: remove Claude/Gemini comparison from monitoring guide

* docs: add OTel comparison with Claude Code and Gemini CLI

* docs: reorganize monitoring docs — user guide + dev architecture

  - agent_monitoring.md: polished user-facing guide (for VS Code website)
  - agent_monitoring_arch.md: developer-facing architecture & instrumentation guide
  - Removed internal plan/spec/comparison files from repo (moved to ~/Documents)

* fix(otel): restore _doFetchViaHttp body and _fetchWithInstrumentation after rebase

* fix(otel): propagate otelSpan through WebSocket/HTTP routing paths

  The otelSpan was created in _doFetchAndStreamChat but not included in returns from _doFetchViaWebSocket and _doFetchViaHttp, causing the caller (fetchMany) to always receive undefined for otelSpan.

  Fix: await both routing paths and spread otelSpan into the result.

* docs(otel): improve monitoring docs, add collector setup, fix trace context

  - Expand agent_monitoring.md with detailed span/metric/event attribute tables
  - Add BYOK provider coverage, subagent trace propagation docs
  - Add Backend Considerations: Azure App Insights (via collector), Langfuse, Grafana
  - Add End-to-End Setup & Verification section with KQL examples
  - Add OTel Collector config + docker-compose for Azure App Insights
  - Fix: emit inference details event before span.end() in chatMLFetcher (fixes 'No trace ID' log records in App Insights)
  - Fix: pass active context in emitLogRecord for trace correlation
  - Update launch.json to point at OTel Collector (localhost:4328)

* docs(otel): merge Backend Considerations and E2E sections to remove redundancy

* docs(otel): remove internal dev debug reference from user-facing guide

* docs(otel): remove Grafana section and Jaeger refs from App Insights section

* docs(otel): trim Backend section to factual setup guides, remove claims

* docs(otel): final accuracy audit — fix false claims against code

  - Mark copilot_chat.session.start event as 'not yet emitted' (defined but no call site)
  - Mark copilot_chat.agent.turn event as 'not yet emitted' (defined but no call site)
  - Mark copilot_chat.session.count metric as 'not yet wired up'
  - Fix OTEL_EXPORTER_OTLP_PROTOCOL desc: only 'grpc' changes behavior
  - Fix telemetry kill switch claim: vscodeTelemetryLevel not wired in services.ts
  - Remove false toolCalling.tsx instrumentation point from arch doc
  - Fix docker-compose comments: wrong port numbers (16686→16687, 4318→4328)
  - Add reference to full collector config file from inline snippet

* docs(otel): remove telemetry.telemetryLevel references — OTel is independent

* feat(otel): wire up session.start event, agent.turn event, and session.count metric

  - emitSessionStartEvent + incrementSessionCount at invoke_agent start (top-level only)
  - emitAgentTurnEvent per LLM response in onDidReceiveResponse listener
  - Remove 'not yet wired' markers from docs

* chore: untrack .playwright-mcp/ and add to .gitignore

* chore: remove otel spec reference files

* chore(otel): remove OpenTelemetry environment variables from launch configurations

* fix(otel): add 64KB truncation limit for content capture attributes

  Prevents OTLP batch export failures when large prompts/responses are captured. Aligned with gemini-cli's limitTotalLength pattern. Applied truncateForOTel() to all JSON.stringify calls feeding span attributes across chatMLFetcher, toolCallingLoop, toolsService, anthropicProvider, geminiNativeProvider, and genAiEvents.

* refactor(otel): make GenAiMetrics methods static to avoid per-call allocations

  Aligned with the gemini-cli pattern of module-level metric functions. Eliminates 17+ throwaway GenAiMetrics instances per agent run.
* fix(otel): fix timer leak, cap buffered ops, rate-limit export logs

  - storeTraceContext: track timers for clearTimeout on retrieval/shutdown, add 100-entry max with LRU eviction
  - BufferedSpanHandle: cap _ops at 200 to prevent unbounded growth
  - DiagnosticSpanExporter: rate-limit failure logs to once per 60s

* docs(otel): fix Jaeger UI port to match docker-compose (16687)

* chore(otel): update sprint plan — mark P0/P1 tasks done

* fix(otel): remove as any casts in BYOK provider content capture

  Use proper Array.isArray + instanceof checks instead of as any[] casts for LanguageModelChatMessage.content iteration.

* refactor(otel): extract OTelModelOptions shared interface

  Replaces 3 duplicated inline type assertions for _otelTraceContext and _capturingTokenCorrelationId with a single shared interface.

* refactor(otel): route OTel logs through ILogService output channel

  Replace console.info/error/warn in NodeOTelService with a log callback. OTelContrib logs essential status to the Copilot Chat output channel for user troubleshooting (enabled/disabled, exporter config, shutdown).

* fix(otel): remove orphaned OTel ConfigKey definitions

  OTel config is read via workspace.getConfiguration in services.ts, not through IConfigurationService.get(ConfigKey). These constants were unused dead code.

* test(otel): add comprehensive OTel instrumentation tests

  - Agent trace hierarchy (invoke_agent → chat → execute_tool, subagent propagation, error states, metrics, events)
  - BYOK provider span emission (CLIENT kind, token usage, error.type, content capture gating, parentTraceContext linking)
  - chatMLFetcher two-phase span lifecycle (create → enrich → end, error path, operation duration metric)
  - Service robustness (runWithTraceContext, startActiveSpan error lifecycle, storeTraceContext overwrite)
  - CapturingOTelService reusable test mock for all OTel assertions

* chore: apply formatter import sorting

* chore: remove outdated sprint plan document

* feat(otel): add OTel configuration settings for tracing and logging

* fix(otel): ensure metric reader is flushed and shutdown properly
1 parent c065da3 commit b9570ba

52 files changed

Lines changed: 5751 additions & 216 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -40,3 +40,6 @@ test/aml/out
 
 # claude
 .claude/settings.local.json
+
+# playwright
+.playwright-mcp/
```

docs/monitoring/agent_monitoring.md

Lines changed: 506 additions & 0 deletions
Large diffs are not rendered by default.
docs/monitoring/agent_monitoring_arch.md

Lines changed: 297 additions & 0 deletions
# OTel Instrumentation — Developer Guide

This document describes the architecture, code structure, and conventions for the OpenTelemetry instrumentation in the Copilot Chat extension. It is intended for developers contributing to or maintaining this codebase.

For user-facing configuration and usage, see [agent_monitoring.md](agent_monitoring.md).

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                  VS Code Copilot Chat Extension                  │
│                                                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────┐  ┌──────────┐   │
│  │   ChatML    │  │ Tool Calling │  │  Tools   │  │ Prompts  │   │
│  │   Fetcher   │  │     Loop     │  │ Service  │  │          │   │
│  └──────┬──────┘  └──────┬───────┘  └────┬─────┘  └────┬─────┘   │
│         │                │               │             │         │
│         ▼                ▼               ▼             ▼         │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │                    IOTelService (DI)                     │    │
│  │  ┌─────────┐  ┌──────────┐  ┌─────────┐  ┌───────────┐   │    │
│  │  │ Tracer  │  │  Meter   │  │ Logger  │  │ Semantic  │   │    │
│  │  │ (spans) │  │ (metrics)│  │ (events)│  │  Helpers  │   │    │
│  │  └────┬────┘  └────┬─────┘  └────┬────┘  └───────────┘   │    │
│  └───────┼────────────┼─────────────┼───────────────────────┘    │
│          ▼            ▼             ▼                            │
│        ┌─────────────────────────────────────────────┐           │
│        │ OTel SDK (BatchSpanProcessor,               │           │
│        │          BatchLogRecordProcessor,           │           │
│        │          PeriodicExportingMetricReader)     │           │
│        └──────────────────┬──────────────────────────┘           │
│                           ▼                                      │
│        ┌─────────────────────────────────────────────┐           │
│        │ Exporters: OTLP/HTTP | OTLP/gRPC |          │           │
│        │            Console | File (JSON-lines)      │           │
│        └─────────────────────────────────────────────┘           │
└──────────────────────────────────────────────────────────────────┘
```

---

## File Structure

```
src/platform/otel/
├── common/
│   ├── otelService.ts        # IOTelService interface + ISpanHandle
│   ├── otelConfig.ts         # Config resolution (env → settings → defaults)
│   ├── noopOtelService.ts    # Zero-cost no-op implementation
│   ├── genAiAttributes.ts    # GenAI semantic convention attribute keys
│   ├── genAiEvents.ts        # Event emitter helpers
│   ├── genAiMetrics.ts       # GenAiMetrics class (metric recording)
│   ├── messageFormatters.ts  # Message → OTel JSON schema converters
│   ├── index.ts              # Public API barrel export
│   └── test/                 # Unit tests
└── node/
    ├── otelServiceImpl.ts    # NodeOTelService (real SDK implementation)
    ├── fileExporters.ts      # File-based span/log/metric exporters
    └── test/                 # Unit tests

src/extension/otel/
└── vscode-node/
    └── otelContrib.ts        # Lifecycle contribution (shutdown hook)
```

### Instrumentation Points

| File | What Gets Instrumented |
|---|---|
| `src/extension/prompt/node/chatMLFetcher.ts` | `chat` spans — one per LLM API call. Used by standard CAPI endpoints **and** all OpenAI-compatible BYOK providers (Azure, OpenAI, Ollama, OpenRouter, xAI, CustomOAI) via `CopilotLanguageModelWrapper` → `endpoint.makeChatRequest` |
| `src/extension/byok/vscode-node/anthropicProvider.ts` | `chat` spans — BYOK Anthropic requests (native SDK, instrumented directly) |
| `src/extension/byok/vscode-node/geminiNativeProvider.ts` | `chat` spans — BYOK Gemini requests (native SDK, instrumented directly) |
| `src/extension/intents/node/toolCallingLoop.ts` | `invoke_agent` spans — wraps agent orchestration |
| `src/extension/tools/vscode-node/toolsService.ts` | `execute_tool` spans — one per tool invocation |
| `src/extension/extension/vscode-node/services.ts` | Service registration (config → NodeOTelService or NoopOTelService) |

---

## Service Layer

### `IOTelService` Interface

The core abstraction. All consumers depend on this interface, never on the OTel SDK directly. It exposes methods for starting spans, recording metrics, emitting log records, managing trace context propagation, and lifecycle (`flush`/`shutdown`).
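
As a mental model, the surface can be sketched as follows. This is a simplified, illustrative sketch, not the real declaration: only the member names called out in this guide (`config`, `startActiveSpan`, `getActiveTraceContext`, `storeTraceContext`, `getStoredTraceContext`, `flush`, `shutdown`) come from the text; the exact signatures and option types are assumptions.

```typescript
// Illustrative sketch of the IOTelService surface (not the real signatures).
interface TraceContext {
  traceId: string;
  spanId: string;
}

interface ISpanHandle {
  setAttribute(key: string, value: string | number | boolean): void;
  setStatus(code: number, message?: string): void;
  end(): void;
}

interface IOTelService {
  readonly config: { enabled: boolean; captureContent: boolean };
  startActiveSpan<T>(
    name: string,
    options: { kind?: number; attributes?: Record<string, unknown>; parentTraceContext?: TraceContext },
    fn: (span: ISpanHandle) => Promise<T>,
  ): Promise<T>;
  getActiveTraceContext(): TraceContext | undefined;
  storeTraceContext(key: string, ctx: TraceContext): void;
  getStoredTraceContext(key: string): TraceContext | undefined;
  flush(): Promise<void>;
  shutdown(): Promise<void>;
}

// Minimal no-op variant in the spirit of NoopOTelService: every method is
// empty or a pass-through, which is why disabled mode is effectively free.
class NoopService implements IOTelService {
  readonly config = { enabled: false, captureContent: false };
  async startActiveSpan<T>(_name: string, _opts: object, fn: (span: ISpanHandle) => Promise<T>): Promise<T> {
    return fn({ setAttribute() { }, setStatus() { }, end() { } });
  }
  getActiveTraceContext(): TraceContext | undefined { return undefined; }
  storeTraceContext(): void { }
  getStoredTraceContext(): TraceContext | undefined { return undefined; }
  async flush(): Promise<void> { }
  async shutdown(): Promise<void> { }
}
```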

### Implementations

| Class | When Used | Characteristics |
|---|---|---|
| `NoopOTelService` | OTel disabled (default) | All methods are empty. Zero cost. |
| `NodeOTelService` | OTel enabled | Full SDK with dynamic imports, buffering, batched processors. |

### Registration

In `services.ts`, the config is resolved from env + settings, then the appropriate implementation is registered:

```typescript
const otelConfig = resolveOTelConfig({ env: process.env, ... });
if (otelConfig.enabled) {
  const { NodeOTelService } = require('.../otelServiceImpl');
  builder.define(IOTelService, new NodeOTelService(otelConfig));
} else {
  builder.define(IOTelService, new NoopOTelService(otelConfig));
}
```

The `require()` (not `import()`) is intentional here — it avoids loading the SDK at all when disabled, while the `NodeOTelService` constructor internally uses `import()` for all OTel packages.

---

## Configuration Resolution

`resolveOTelConfig()` in `otelConfig.ts` implements layered precedence:

1. `COPILOT_OTEL_*` env vars (highest)
2. `OTEL_EXPORTER_OTLP_*` standard env vars
3. VS Code settings (`github.copilot.chat.otel.*`)
4. Defaults (lowest)

Kill switch: If `telemetry.telemetryLevel === 'off'`, the config resolver returns a disabled config. Note: `vscodeTelemetryLevel` must be passed by the call site — currently not wired in `services.ts`.

Endpoint parsing: gRPC → origin only (`scheme://host:port`). HTTP → full href.
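
The precedence chain and the endpoint rule can be modeled in a few lines. The helper below is a standalone sketch, not the real `resolveOTelConfig`; the env var name `COPILOT_OTEL_ENDPOINT` and the option shape are illustrative assumptions.

```typescript
// Sketch: layered endpoint resolution plus protocol-dependent parsing.
function resolveEndpointSketch(opts: {
  env: Record<string, string | undefined>;
  settings: { otlpEndpoint?: string };
  protocol: 'grpc' | 'http';
}): string | undefined {
  // Precedence: COPILOT_OTEL_* > OTEL_EXPORTER_OTLP_* > settings > default (none).
  const raw =
    opts.env['COPILOT_OTEL_ENDPOINT'] ??
    opts.env['OTEL_EXPORTER_OTLP_ENDPOINT'] ??
    opts.settings.otlpEndpoint;
  if (!raw) {
    return undefined; // no endpoint configured anywhere
  }
  const url = new URL(raw);
  // gRPC exporters take only the origin; HTTP exporters take the full href.
  return opts.protocol === 'grpc' ? url.origin : url.href;
}
```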

---

## Span Conventions

### Naming

Follow the OTel GenAI conventions:

| Operation | Span Name | Kind |
|---|---|---|
| Agent orchestration | `invoke_agent {agent_name}` | `INTERNAL` |
| LLM API call | `chat {model}` | `CLIENT` |
| Tool execution | `execute_tool {tool_name}` | `INTERNAL` |

### Attributes

Use the constants from `genAiAttributes.ts`:

```typescript
import { GenAiAttr, GenAiOperationName, CopilotChatAttr, StdAttr } from '../../platform/otel/common/index';

span.setAttributes({
  [GenAiAttr.OPERATION_NAME]: GenAiOperationName.CHAT,
  [GenAiAttr.REQUEST_MODEL]: model,
  [GenAiAttr.USAGE_INPUT_TOKENS]: inputTokens,
  [StdAttr.ERROR_TYPE]: error.constructor.name,
});
```

### Error Handling

On error, set both status and `error.type`:

```typescript
span.setStatus(SpanStatusCode.ERROR, error.message);
span.setAttribute(StdAttr.ERROR_TYPE, error.constructor.name);
```

### Content Capture

Always gate content capture on `otel.config.captureContent`:

```typescript
if (this._otelService.config.captureContent) {
  span.setAttribute(GenAiAttr.INPUT_MESSAGES, JSON.stringify(messages));
}
```
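
Captured content is also size-bounded: the commit history adds a 64KB truncation guard (`truncateForOTel()`) around every `JSON.stringify` that feeds a span attribute, to avoid OTLP batch export failures on large prompts and responses. A minimal sketch of such a guard, with an assumed marker string:

```typescript
// Sketch of a truncation guard for content-capture attributes.
// The 64KB limit comes from the truncateForOTel change; the marker is assumed.
const MAX_ATTR_LENGTH = 64 * 1024;

function truncateForOTelSketch(value: string, limit: number = MAX_ATTR_LENGTH): string {
  if (value.length <= limit) {
    return value;
  }
  const marker = '…[truncated]';
  // Keep the result exactly at the limit, ending with a visible marker.
  return value.slice(0, limit - marker.length) + marker;
}
```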

---

## Adding Instrumentation to New Code
176+
177+
### Pattern: Wrapping an Operation with a Span
178+
179+
```typescript
180+
class MyService {
181+
constructor(@IOTelService private readonly _otel: IOTelService) {}
182+
183+
async doWork(): Promise<Result> {
184+
return this._otel.startActiveSpan(
185+
'execute_tool myTool',
186+
{ kind: SpanKind.INTERNAL, attributes: { [GenAiAttr.TOOL_NAME]: 'myTool' } },
187+
async (span) => {
188+
try {
189+
const result = await this._actualWork();
190+
span.setStatus(SpanStatusCode.OK);
191+
return result;
192+
} catch (err) {
193+
span.setStatus(SpanStatusCode.ERROR, err instanceof Error ? err.message : String(err));
194+
span.setAttribute(StdAttr.ERROR_TYPE, err instanceof Error ? err.constructor.name : 'Error');
195+
throw err;
196+
}
197+
},
198+
);
199+
}
200+
}
201+
```
202+
203+
### Pattern: Recording Metrics
204+
205+
Use `GenAiMetrics` for standard metric recording:
206+
207+
```typescript
208+
const metrics = new GenAiMetrics(this._otelService);
209+
metrics.recordTokenUsage(1500, 'input', {
210+
operationName: GenAiOperationName.CHAT,
211+
providerName: GenAiProviderName.GITHUB,
212+
requestModel: 'gpt-4o',
213+
});
214+
metrics.recordToolCallCount('readFile', true);
215+
metrics.recordTimeToFirstToken('gpt-4o', 0.45);
216+
```
217+
218+
### Pattern: Emitting Events

```typescript
import { emitToolCallEvent, emitInferenceDetailsEvent } from '../../platform/otel/common/index';

emitToolCallEvent(this._otelService, 'readFile', 50, true);
emitInferenceDetailsEvent(this._otelService, { model: 'gpt-4o' }, { inputTokens: 1500 });
```

### Pattern: Cross-Boundary Trace Propagation

When spawning a subagent, store the current trace context and retrieve it in the child:

```typescript
// Parent: store context before spawning subagent
const traceContext = this._otelService.getActiveTraceContext();
if (traceContext) {
	this._otelService.storeTraceContext(`subagent:${requestId}`, traceContext);
}

// Child: retrieve and use as parent
const parentCtx = this._otelService.getStoredTraceContext(`subagent:${requestId}`);
return this._otelService.startActiveSpan('invoke_agent child', { parentTraceContext: parentCtx }, async (span) => {
	// child spans are now part of the same trace
});
```

---

## Buffering & Initialization

`NodeOTelService` buffers operations during async SDK initialization. Once init completes, the buffer is drained in order; on failure, it is discarded and all future calls become no-ops. `BufferedSpanHandle` captures span mutations during this window and replays them onto the real span once available.
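The drain-or-discard behavior can be sketched with a minimal, hypothetical `InitBuffer` (the real `NodeOTelService` and `BufferedSpanHandle` internals may differ; this only illustrates the buffering pattern described above):

```typescript
// Hypothetical sketch of the buffer-and-drain pattern: operations queue
// while the SDK initializes, replay in order on success, and are dropped
// (with all future calls becoming no-ops) on failure.
type Op = () => void;

class InitBuffer {
	private pending: Op[] = [];
	private state: 'buffering' | 'ready' | 'failed' = 'buffering';

	enqueue(op: Op): void {
		if (this.state === 'ready') {
			op(); // init already succeeded: run immediately
		} else if (this.state === 'buffering') {
			this.pending.push(op); // init in flight: queue in order
		}
		// 'failed': drop silently; every call is a no-op from here on
	}

	drain(success: boolean): void {
		if (success) {
			this.state = 'ready';
			for (const op of this.pending) {
				op(); // replay in enqueue order
			}
		} else {
			this.state = 'failed';
		}
		this.pending = [];
	}
}
```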

---

## Exporters

Four exporter types are supported: OTLP/HTTP (default), OTLP/gRPC, Console (stdout), and File (JSON-lines). All OTel SDK packages are dynamically imported — none are loaded when OTel is disabled. `DiagnosticSpanExporter` wraps the span exporter to log the first successful export (confirms connectivity).
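The "nothing loads when disabled" guarantee boils down to gating the dynamic `import()` behind the enabled check. A minimal sketch, with a generic `loadModule` callback standing in for the actual OTel SDK imports (the function name and shape here are assumptions, not the service's real API):

```typescript
// Sketch of the lazy-load gate: when OTel is disabled, the loader is
// never invoked, so the SDK modules never enter the module graph.
async function initIfEnabled<T>(
	enabled: boolean,
	loadModule: () => Promise<T>,
): Promise<T | undefined> {
	if (!enabled) {
		return undefined; // disabled: zero module-load cost
	}
	return loadModule(); // enabled: e.g. () => import('@opentelemetry/sdk-trace-node')
}
```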

---

## GenAI Semantic Convention Reference

All attribute names follow [OTel GenAI Semantic Conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/).

Constants are defined in `genAiAttributes.ts`:

- `GenAiAttr.*` — Standard `gen_ai.*` attribute keys
- `CopilotChatAttr.*` — Extension-specific `copilot_chat.*` keys
- `StdAttr.*` — Standard OTel keys (`error.type`, `server.address`, `server.port`)
- `GenAiOperationName.*` — Operation name values (`chat`, `invoke_agent`, `execute_tool`)
- `GenAiProviderName.*` — Provider values (`github`, `openai`, `anthropic`)
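
The span names used throughout (`chat {model}`, `execute_tool {name}`, `invoke_agent {participant}`) combine an operation constant with its target. A minimal sketch, assuming the constant values listed above (`spanName` itself is illustrative, not a real helper in the codebase):

```typescript
// Sketch: GenAI span names follow the "{operation} {target}" convention.
// The values mirror the GenAiOperationName.* constants described above.
const GenAiOperationName = {
	CHAT: 'chat',
	INVOKE_AGENT: 'invoke_agent',
	EXECUTE_TOOL: 'execute_tool',
} as const;

function spanName(op: string, target: string): string {
	return `${op} ${target}`;
}

spanName(GenAiOperationName.CHAT, 'gpt-4o'); // → "chat gpt-4o"
spanName(GenAiOperationName.EXECUTE_TOOL, 'readFile'); // → "execute_tool readFile"
```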

Message formatting helpers in `messageFormatters.ts` convert internal message types to the OTel JSON schema:

- `toInputMessages()` — CAPI messages → OTel input format
- `toOutputMessages()` — Model response choices → OTel output format
- `toSystemInstructions()` — System message → OTel system instruction format
- `toToolDefinitions()` — Tool schemas → OTel tool definition format
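
For orientation, the output of `toInputMessages()` would serialize to something like the role-plus-parts shape used by the GenAI conventions. This literal is illustrative only; the exact schema is defined by the semconv documents, not by this sketch:

```typescript
// Illustrative only: the OTel GenAI conventions model chat input as an
// array of role-tagged messages with typed content parts.
const inputMessages = [
	{ role: 'system', parts: [{ type: 'text', content: 'You are a coding assistant.' }] },
	{ role: 'user', parts: [{ type: 'text', content: 'Explain this function.' }] },
];

// The span attribute carries the JSON-serialized array
// (and is only set when captureContent is enabled).
const attrValue = JSON.stringify(inputMessages);
```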

---

## Testing

Unit tests live alongside the source:

```
src/platform/otel/common/test/
├── genAiEvents.spec.ts
├── genAiMetrics.spec.ts
├── messageFormatters.spec.ts
├── noopOtelService.spec.ts
└── otelConfig.spec.ts

src/platform/otel/node/test/
├── fileExporters.spec.ts
└── traceContextPropagation.spec.ts
```

Run with: `npm test -- --grep "OTel"`

---

`docs/monitoring/docker-compose.yaml`:

```yaml
# Copilot Chat OTel monitoring stack
#
# Starts an OpenTelemetry Collector that accepts OTLP on host ports :4328 (HTTP)
# and :4327 (gRPC), then forwards traces/metrics/logs to Azure Application
# Insights and a local Jaeger instance.
#
# Usage:
#   # Set your App Insights connection string:
#   export APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=...;IngestionEndpoint=..."
#
#   # Start the stack:
#   docker compose up -d
#
#   # View traces in Jaeger:
#   open http://localhost:16687
#
#   # Then launch VS Code with:
#   COPILOT_OTEL_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4328 code .

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - "4327:4317" # OTLP gRPC (host:4327 → container:4317)
      - "4328:4318" # OTLP HTTP (host:4328 → container:4318)
    environment:
      - APPLICATIONINSIGHTS_CONNECTION_STRING=${APPLICATIONINSIGHTS_CONNECTION_STRING:-}
    restart: unless-stopped

  jaeger:
    image: jaegertracing/jaeger:latest
    ports:
      - "16687:16686" # Jaeger UI (host:16687 to avoid conflict)
    restart: unless-stopped
```

`otel-collector-config.yaml`:

```yaml
# OpenTelemetry Collector configuration for Copilot Chat
# Receives OTLP from Copilot Chat and exports to multiple backends.
#
# Usage:
#   docker compose -f docs/monitoring/docker-compose.yaml up -d
#
# Then set in VS Code or launch.json (4328 is the host port mapped to the
# collector's OTLP/HTTP port in docker-compose.yaml):
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4328

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
    send_batch_size: 256

exporters:
  # Azure Application Insights; reads the connection string from the
  # APPLICATIONINSIGHTS_CONNECTION_STRING environment variable
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"

  # Debug exporter — prints to collector stdout (useful for troubleshooting)
  debug:
    verbosity: basic

  # Local Jaeger for trace visualization
  otlphttp/jaeger:
    endpoint: http://jaeger:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, otlphttp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [azuremonitor, debug]
```