Closed
Labels
enhancement, observability, plugins, python, triage
Feature Request: Phoenix Observability Integration
Summary
Integrate Arize Phoenix as an observability plugin for MCP Gateway to provide comprehensive LLM tracing, evaluation, and monitoring capabilities.
Note: due to the licensing of Arize Phoenix, we should NOT include it as part of MCP Gateway, but rather look at compatibility with their SaaS service.
Motivation
As identified in the roadmap (Release 0.7.0), MCP Gateway requires core observability features. Phoenix provides a mature, open-source AI observability platform that aligns perfectly with the gateway's needs for:
- OpenTelemetry-based tracing for MCP tool invocations
- LLM performance evaluation and benchmarking
- Request/response monitoring across virtual servers
- Prompt management and optimization tracking
Proposed Implementation
1. Phoenix Observability Plugin
Create a new plugin at plugins/phoenix_observability/ that implements:
Hook Integration Points
- tool_pre_invoke / tool_post_invoke: Capture tool execution traces with input/output data
- prompt_pre_fetch / prompt_post_fetch: Track prompt template usage and performance
- resource_pre_fetch / resource_post_fetch: Monitor resource access patterns
- Request/Response hooks: Trace complete MCP request lifecycle
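One way the pre/post hook pairs could become spans is to record a start time in the pre hook and close the span in the matching post hook. The sketch below is framework-independent and uses only the standard library; `SpanRecorder`, `Span`, the method names, and the `request_id` correlation key are all hypothetical and do not reflect the gateway's actual plugin API.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Span:
    """One finished unit of work, e.g. a single tool invocation."""
    name: str
    request_id: str
    start: float
    end: float
    attributes: Dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


class SpanRecorder:
    """Pairs *_pre_* and *_post_* hook calls into spans (hypothetical helper)."""

    def __init__(self) -> None:
        # Open spans keyed by (kind, name, request_id) -> (start time, input payload).
        self._open: Dict[Tuple[str, str, str], Tuple[float, Any]] = {}
        self.finished: List[Span] = []

    def on_pre(self, kind: str, name: str, request_id: str, payload: Any = None) -> None:
        """Called from a pre hook such as tool_pre_invoke."""
        self._open[(kind, name, request_id)] = (time.monotonic(), payload)

    def on_post(self, kind: str, name: str, request_id: str,
                result: Any = None, error: Optional[str] = None) -> Optional[Span]:
        """Called from a post hook; returns None if no matching pre hook was seen."""
        opened = self._open.pop((kind, name, request_id), None)
        if opened is None:
            return None
        start, payload = opened
        span = Span(
            name=f"{kind}:{name}",
            request_id=request_id,
            start=start,
            end=time.monotonic(),
            attributes={"input": payload, "output": result, "error": error},
        )
        self.finished.append(span)
        return span
```

In a real plugin the finished spans would be handed to an OpenTelemetry tracer or exporter rather than kept in a list; the list here only keeps the sketch self-contained.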
Plugin Configuration (plugins/config.yaml)
- name: "PhoenixObservabilityPlugin"
  kind: "plugins.phoenix_observability.phoenix_plugin.PhoenixObservabilityPlugin"
  description: "AI observability with tracing, evaluation, and monitoring via Arize Phoenix"
  version: "1.0.0"
  author: "MCP Gateway Team"
  hooks:
    - "tool_pre_invoke"
    - "tool_post_invoke"
    - "prompt_pre_fetch"
    - "prompt_post_fetch"
    - "resource_pre_fetch"
    - "resource_post_fetch"
  tags: ["observability", "tracing", "monitoring", "opentelemetry", "phoenix"]
  mode: "permissive"
  priority: 200 # Lower priority to not interfere with security plugins
  config:
    phoenix_endpoint: "${PHOENIX_ENDPOINT:-http://localhost:6006}"
    enable_tracing: true
    enable_evaluation: false # Can be enabled for automatic quality checks
    sample_rate: 1.0 # Trace sampling rate
    export_batch_size: 100
    export_interval_ms: 5000
    # OpenTelemetry configuration
    otel_service_name: "mcp-gateway"
    otel_resource_attributes:
      deployment.environment: "${DEPLOYMENT_ENV:-development}"
      service.version: "${MCPGATEWAY_VERSION}"
    # Trace enrichment
    capture_request_headers: ["X-Request-ID", "X-Tenant-ID", "User-Agent"]
    capture_response_headers: ["X-Response-Time"]
    redact_sensitive_fields: ["password", "api_key", "secret", "token"]
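To illustrate how two of the options above might be applied, here is a minimal sketch of `redact_sensitive_fields` and `sample_rate` handling. The function names are hypothetical, and two behaviours are assumptions rather than anything the proposal specifies: field names are matched case-insensitively as substrings (so `token` also masks `access_token`), and sampling is decided deterministically from the low 64 bits of a 32-hex-digit trace ID.

```python
from typing import Any, Iterable

REDACTED = "[REDACTED]"


def redact(value: Any, sensitive: Iterable[str]) -> Any:
    """Return a copy of `value` with sensitive keys masked at any nesting depth."""
    needles = tuple(s.lower() for s in sensitive)

    def _walk(node: Any) -> Any:
        if isinstance(node, dict):
            return {
                key: (REDACTED if any(n in str(key).lower() for n in needles) else _walk(val))
                for key, val in node.items()
            }
        if isinstance(node, (list, tuple)):
            return [_walk(item) for item in node]
        return node

    return _walk(value)


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    # Compare the low 64 bits of the trace ID against a rate-scaled bound.
    return int(trace_id[-16:], 16) < int(sample_rate * (1 << 64))
```

Deciding from the trace ID rather than a random draw means every gateway that sees the same trace makes the same keep-or-drop choice, which matters once traces cross federated gateways.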
2. Core Features
A. Distributed Tracing
- Implement OpenTelemetry instrumentation using Phoenix's OpenInference spec
- Create spans for:
- Virtual server requests
- Individual tool invocations
- Prompt rendering operations
- Resource fetching
- Federation calls between gateways
- Include context propagation for distributed traces across federated gateways
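Context propagation between federated gateways is commonly done with the W3C Trace Context `traceparent` header, which OpenTelemetry uses by default. The sketch below builds and parses version `00` of that header with the standard library only; the helper names are hypothetical, and a real implementation would more likely rely on the OpenTelemetry propagator than hand-rolled parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - 16-byte trace id - 8-byte parent span id - flags, all lowercase hex
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def new_context(sampled: bool = True) -> TraceContext:
    """Start a new trace at the edge gateway."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: TraceContext) -> TraceContext:
    """Same trace, new span: used when calling a downstream gateway."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def to_header(ctx: TraceContext) -> str:
    """Serialize to a version-00 traceparent header value."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def from_header(value: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent value; return None if it is malformed."""
    match = _TRACEPARENT.match(value.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A downstream gateway would call `from_header` on the incoming request, then `child_of` before making its own outbound calls, so that all spans share one trace ID.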
B. Performance Metrics
- Tool execution latency by server/tool
- Request throughput per virtual server
- Error rates and success rates
- Resource usage patterns
- Cache hit/miss ratios
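As a rough illustration of the first and third metrics above, the following sketch aggregates latency and error rate per (server, tool) pair in memory. `LatencyTable` and `ToolStats` are hypothetical names, and the nearest-rank percentile is a simplifying choice; a production setup would more plausibly derive these figures from exported spans or a metrics backend.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Tuple


@dataclass
class ToolStats:
    latencies_ms: List[float] = field(default_factory=list)
    errors: int = 0

    @property
    def calls(self) -> int:
        return len(self.latencies_ms)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile; returns 0.0 when no calls were recorded."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
        return ordered[rank - 1]


class LatencyTable:
    """Per-(server, tool) latency and error bookkeeping (hypothetical helper)."""

    def __init__(self) -> None:
        self._stats: DefaultDict[Tuple[str, str], ToolStats] = defaultdict(ToolStats)

    def record(self, server: str, tool: str, latency_ms: float, ok: bool = True) -> None:
        stats = self._stats[(server, tool)]
        stats.latencies_ms.append(latency_ms)
        if not ok:
            stats.errors += 1

    def get(self, server: str, tool: str) -> ToolStats:
        return self._stats[(server, tool)]
```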
C. LLM-Specific Observability
- Token usage tracking (if available from MCP servers)
- Prompt template effectiveness metrics
- Tool selection patterns
- Chain-of-thought execution traces
D. Evaluation Integration
- Support for Phoenix's evaluation framework to:
- Assess tool output quality
- Monitor prompt effectiveness
- Detect anomalies in responses
- Track drift in tool behavior over time
3. Integration with Existing Infrastructure
Alignment with Roadmap
- Release 0.7.0: Core implementation as part of "Core Observability"
- Complements planned Prometheus metrics (#218) with detailed traces
- Works alongside OpenLLMetry integration (#175)
- Enhances structured JSON logging (#300) with correlation IDs
Plugin Framework Utilization
- Leverages existing PluginContext for request tracking
- Uses GlobalContext for tenant/user attribution
- Implements async operations for non-blocking trace export
- Respects plugin priority system (runs after security plugins)
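The non-blocking export mentioned above could look roughly like the following asyncio sketch: hooks enqueue spans without awaiting, and a background task flushes them in batches. `BatchExporter` and the `send` callback are hypothetical; the `batch_size` and `interval_s` parameters are meant to mirror `export_batch_size` and `export_interval_ms` from the plugin configuration. In practice the OpenTelemetry SDK's batch span processor already provides this behaviour, so this is only meant to show the idea.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class BatchExporter:
    """Buffers spans and flushes them in batches, off the request path."""

    def __init__(self, send: Callable[[List[Any]], Awaitable[None]],
                 batch_size: int = 100, interval_s: float = 5.0,
                 max_queue: int = 10_000) -> None:
        self._send = send
        self._batch_size = batch_size
        self._interval_s = interval_s
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
        self.dropped = 0

    def submit(self, span: Any) -> None:
        """Never blocks the caller; drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except asyncio.QueueFull:
            self.dropped += 1

    async def _next_batch(self) -> List[Any]:
        # Wait for the first span, then fill the batch until size or deadline.
        batch = [await self._queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self._interval_s
        while len(batch) < self._batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(self._queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        return batch

    async def run(self) -> None:
        """Background loop; run as a task and cancel it on shutdown."""
        while True:
            batch = await self._next_batch()
            try:
                await self._send(batch)
            except Exception:
                # Export failures must not affect request handling.
                self.dropped += len(batch)
```

The exporter should be created inside a running event loop and started with `asyncio.create_task(exporter.run())`. Cancelling the task may lose a partially filled batch, so a real implementation would add an explicit flush on shutdown.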
4. Deployment Options
Docker Compose Configuration
Option A: PostgreSQL Backend (Recommended for Production)
# docker-compose.phoenix.yaml - Separate compose file for Phoenix stack
version: '3.8'
services:
  phoenix:
    image: arizephoenix/phoenix:v4.0.0 # Pin to specific version
    container_name: mcp-phoenix
    restart: unless-stopped
    depends_on:
      - phoenix-db
    ports:
      - "6006:6006" # Phoenix UI and OTLP HTTP collector
      - "4317:4317" # OTLP gRPC collector
    environment:
      - PHOENIX_SQL_DATABASE_URL=postgresql://phoenix:phoenix_secret@phoenix-db:5432/phoenix
      - PHOENIX_ENABLE_AUTH=false # Set to true for production
      - PHOENIX_SECRET_KEY=${PHOENIX_SECRET_KEY:-your-secret-key-here}
      - PHOENIX_LOG_LEVEL=info
      - PHOENIX_GRPC_PORT=4317
      - PHOENIX_HTTP_PORT=6006
    networks:
      - phoenix-net
      - mcpgateway-net # Connect to MCP Gateway network
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6006/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
  phoenix-db:
    image: postgres:15-alpine
    container_name: mcp-phoenix-db
    restart: unless-stopped
    environment:
      - POSTGRES_USER=phoenix
      - POSTGRES_PASSWORD=phoenix_secret
      - POSTGRES_DB=phoenix
      - POSTGRES_INITDB_ARGS=--encoding=UTF8
    volumes:
      - phoenix-postgres-data:/var/lib/postgresql/data
    networks:
      - phoenix-net
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U phoenix"]
      interval: 10s
      timeout: 5s
      retries: 5
networks:
  phoenix-net:
    driver: bridge
  mcpgateway-net:
    external: true # Assuming MCP Gateway network exists
volumes:
  phoenix-postgres-data:
    driver: local
Option B: SQLite Backend (Development/Testing)
# docker-compose.phoenix-dev.yaml - Lightweight option for development
version: '3.8'
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    container_name: mcp-phoenix-dev
    restart: unless-stopped
    ports:
      - "6006:6006" # Phoenix UI and OTLP HTTP
      - "4317:4317" # OTLP gRPC
    environment:
      - PHOENIX_WORKING_DIR=/mnt/data
      - PHOENIX_ENABLE_AUTH=false
      - PHOENIX_LOG_LEVEL=debug
    volumes:
      - phoenix-sqlite-data:/mnt/data
    networks:
      - mcpgateway-net
volumes:
  phoenix-sqlite-data:
    driver: local
networks:
  mcpgateway-net:
    external: true
MCP Gateway Integration Configuration
Update the main docker-compose.yaml to connect with Phoenix:
# Addition to mcpgateway service in main docker-compose.yaml
services:
  mcpgateway:
    # ... existing configuration ...
    environment:
      # Phoenix Observability
      - PHOENIX_ENDPOINT=http://mcp-phoenix:6006
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://mcp-phoenix:4317
      - OTEL_SERVICE_NAME=mcp-gateway
      - OTEL_TRACES_EXPORTER=otlp
      - OTEL_METRICS_EXPORTER=otlp
    networks:
      - mcpgateway-net
    depends_on:
      - phoenix # If running in same compose file
networks:
  mcpgateway-net:
    driver: bridge
Deployment Commands
# Development: Start Phoenix with SQLite
docker-compose -f docker-compose.phoenix-dev.yaml up -d
# Production: Start Phoenix with PostgreSQL
docker-compose -f docker-compose.phoenix.yaml up -d
# Start both MCP Gateway and Phoenix together
docker-compose -f docker-compose.yaml -f docker-compose.phoenix.yaml up -d
# View logs
docker-compose -f docker-compose.phoenix.yaml logs -f phoenix
# Stop services
docker-compose -f docker-compose.phoenix.yaml down
Kubernetes Deployment
# phoenix-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phoenix
  namespace: mcp-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phoenix
  template:
    metadata:
      labels:
        app: phoenix
    spec:
      containers:
        - name: phoenix
          image: arizephoenix/phoenix:v4.0.0
          ports:
            - containerPort: 6006
              name: http
            - containerPort: 4317
              name: grpc
          env:
            - name: PHOENIX_SQL_DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: phoenix-db-secret
                  key: connection-string
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
  name: phoenix
  namespace: mcp-gateway
spec:
  selector:
    app: phoenix
  ports:
    - port: 6006
      targetPort: 6006
      name: http
    - port: 4317
      targetPort: 4317
      name: grpc
Cloud Deployment Options
- Phoenix Cloud: Managed SaaS offering with API key authentication
- AWS ECS: Container deployment with RDS PostgreSQL backend
- GCP Cloud Run: Serverless deployment with Cloud SQL
- Azure Container Instances: With Azure Database for PostgreSQL
5. Benefits
- Complete Visibility: End-to-end tracing of MCP requests across all components
- Performance Optimization: Identify bottlenecks in tool execution and federation
- Quality Assurance: Automatic evaluation of tool outputs and prompt effectiveness
- Debugging: Detailed traces for troubleshooting complex multi-tool workflows
- Compliance: Audit trail of all MCP operations with user/tenant attribution
- Scalability Insights: Understand system behavior under load with distributed tracing
6. Implementation Phases
Phase 1: Basic Tracing (2 weeks)
- OpenTelemetry setup and configuration
- Basic span creation for tool invocations
- Phoenix endpoint integration
Phase 2: Enhanced Observability (2 weeks)
- Prompt and resource tracking
- Federation tracing
- Custom attributes and metadata
Phase 3: Evaluation & Analytics (1 week)
- LLM evaluation integration
- Performance baselines
- Anomaly detection setup
Phase 4: Production Hardening (1 week)
- Error handling and retry logic
- Performance optimization
- Documentation and examples
Technical Requirements
Dependencies
# pyproject.toml additions
opentelemetry-api = "^1.20.0"
opentelemetry-sdk = "^1.20.0"
opentelemetry-exporter-otlp = "^1.20.0"
openinference-instrumentation = "^0.1.0"
arize-phoenix = "^4.0.0" # Optional for embedded mode
Environment Variables
PHOENIX_ENDPOINT=http://localhost:6006
PHOENIX_API_KEY=<optional-for-cloud>
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=mcp-gateway
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
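A plugin could read these variables into a typed settings object at startup. The sketch below is one possible shape, assuming the defaults shown above; `PhoenixSettings`, `load_settings`, and `parse_resource_attributes` are hypothetical names. The comma-separated `key=value` format for `OTEL_RESOURCE_ATTRIBUTES` follows the OpenTelemetry convention.

```python
import os
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class PhoenixSettings:
    endpoint: str
    api_key: Optional[str]
    otlp_endpoint: str
    service_name: str
    resource_attributes: Dict[str, str]


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, skipping malformed entries."""
    attrs: Dict[str, str] = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def load_settings(env: Mapping[str, str] = os.environ) -> PhoenixSettings:
    """Build settings from environment variables, falling back to local defaults."""
    return PhoenixSettings(
        endpoint=env.get("PHOENIX_ENDPOINT", "http://localhost:6006"),
        api_key=env.get("PHOENIX_API_KEY") or None,  # only needed for the cloud service
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        service_name=env.get("OTEL_SERVICE_NAME", "mcp-gateway"),
        resource_attributes=parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", "")),
    )
```

Passing the environment as a mapping keeps the loader easy to unit-test without touching the real process environment.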
Testing Strategy
- Unit Tests: Mock Phoenix client, verify span creation
- Integration Tests: Test with local Phoenix instance
- Load Tests: Validate performance impact < 5% overhead
- E2E Tests: Complete trace verification across federated gateways
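For the load-test criterion, the overhead figure could be computed by comparing latency samples with tracing disabled and enabled. The helper below is a small sketch of that arithmetic; the function names are hypothetical, and using the median is an assumption made to reduce the influence of outliers, not something the proposal prescribes.

```python
import statistics
from typing import Sequence


def overhead_percent(baseline_ms: Sequence[float], traced_ms: Sequence[float]) -> float:
    """Relative increase of median latency with tracing enabled, in percent."""
    base = statistics.median(baseline_ms)
    traced = statistics.median(traced_ms)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (traced - base) / base * 100.0


def within_budget(baseline_ms: Sequence[float], traced_ms: Sequence[float],
                  budget_percent: float = 5.0) -> bool:
    """True if tracing overhead stays below the stated budget."""
    return overhead_percent(baseline_ms, traced_ms) < budget_percent
```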
Documentation Requirements
- Setup Guide: Phoenix deployment and configuration
- User Guide: Interpreting traces and metrics
- Troubleshooting: Common issues and solutions
- Best Practices: Sampling strategies, sensitive data handling
Success Criteria
- Complete request tracing with < 5% performance impact
- Tool invocation visibility across all virtual servers
- Federation trace correlation working
- Phoenix UI showing meaningful insights
- Documentation and examples complete
- All tests passing with 90%+ coverage
Related Issues
- #175 - Add OpenLLMetry Integration for Observability
- #218 - Prometheus Metrics Instrumentation using prometheus-fastapi-instrumentator
- #300 - Structured JSON Logging with Correlation IDs
- #319 - AI Middleware Integration / Plugin Framework for extensible gateway capabilities