mcp-name: io.github.OtherVibes/mcp-as-a-judge
MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.
MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:
- Research, system design, and planning
- Code changes, testing, and task-completion verification
It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.
If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.
- Treat LLM output as ground truth; skip research and use outdated information
- Reinvent the wheel instead of reusing libraries and existing code
- Cut corners: ship code below engineering standards, backed by weak tests
- Make unilateral decisions when requirements are ambiguous or plans change
- Leave security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming
- Evidence‑based research and reuse (best practices, libraries, existing code)
- Plan‑first delivery aligned to user requirements
- Human‑in‑the‑loop decisions for ambiguity and blockers
- Quality gates on code and tests (security, performance, maintainability)
- Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
- Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
- User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
- Security validation in system design and code changes
| Tool | What it solves | 
|---|---|
| set_coding_task | Creates/updates task metadata; classifies task_size; returns next-step workflow guidance | 
| get_current_coding_task | Recovers the latest task_id and metadata to resume work safely | 
| judge_coding_plan | Validates plan/design; requires library selection and internal reuse maps; flags risks | 
| judge_code_change | Reviews unified Git diffs for correctness, reuse, security, and code quality | 
| judge_testing_implementation | Validates tests using real runner output and optional coverage | 
| judge_coding_task_completion | Final gate ensuring plan, code, and tests approvals before completion | 
| raise_missing_requirements | Elicits missing details and decisions to unblock progress | 
| raise_obstacle | Engages the user on trade‑offs, constraints, and enforced changes | 
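The judge tools above line up with the staged workflow (plan → code → test → completion). As a rough illustration only, the sketch below models that gate order on the client side. `TaskState`, `GATES`, and the stage names are hypothetical and are not the server's actual data model; in practice the server itself returns the next-step guidance.

```python
from dataclasses import dataclass, field

# Tool names come from the table above; the tracking logic is an assumption.
GATES = [
    ("plan", "judge_coding_plan"),
    ("code", "judge_code_change"),
    ("tests", "judge_testing_implementation"),
]
FINAL_GATE = "judge_coding_task_completion"


@dataclass
class TaskState:
    """Hypothetical record of which stages the Judge has approved so far."""

    approved: set = field(default_factory=set)

    def approve(self, stage: str) -> None:
        self.approved.add(stage)

    def next_tool(self) -> str:
        """Return the first gate not yet approved, or the final gate."""
        for stage, tool in GATES:
            if stage not in self.approved:
                return tool
        return FINAL_GATE


state = TaskState()
print(state.next_tool())  # plan has not been approved yet
state.approve("plan")
print(state.next_tool())  # code review is the next gate
```

The point of the ordering is that the completion gate is reachable only after the plan, code, and test gates have each been approved.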
MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality:
- MCP Sampling - Required for AI-powered code evaluation and judgment
- MCP Elicitation - Required for interactive user decision prompts
- Docker Desktop / Python 3.13+ - Required for running the MCP server
| AI Assistant | Platform | MCP Support | Status | Notes | 
|---|---|---|---|---|
| GitHub Copilot | Visual Studio Code | ✅ Full | Recommended | Complete MCP integration with sampling and elicitation | 
| Claude Code | - | Limited | Requires LLM API key | Sampling Support feature request; Elicitation Support feature request |
| Cursor | - | Limited | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Augment | - | Limited | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Qodo | - | Limited | Requires LLM API key | MCP support available, but sampling/elicitation limited |
✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.
Assistants without full MCP sampling support (Claude Code, Cursor, Augment, Qodo) must set LLM_API_KEY. Without it, the server cannot evaluate plans or code. See LLM API Configuration.
💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.
For troubleshooting, visit the FAQs section.
Configure MCP as a Judge in your MCP-enabled client:
Notes:
- VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.
Configure MCP Settings: Add this to your MCP client configuration file:
{
  "command": "docker",
  "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-4o-mini"
  }
}
📝 Configuration Options (All Optional):
- LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
- The --pull=always flag ensures you always get the latest version automatically
To update manually when needed, pull the latest image:
# Pull the latest version
docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
Install the package:
uv tool install mcp-as-a-judge
Configure MCP Settings: The MCP server may be automatically detected by your MCP‑enabled client.
📝 Notes:
- No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
- LLM_API_KEY is optional and can be set via environment variable if needed
To update to the latest version:
# Update MCP as a Judge to the latest version
uv tool upgrade mcp-as-a-judge
- Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
- Select the configured server “mcp-as-a-judge”
- Choose “Configure Model Access”
- Check your preferred model(s) to enable sampling
For AI assistants without full MCP sampling support you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.
- Set LLM_API_KEY (unified key). Vendor is auto-detected; optionally set LLM_MODEL_NAME to override the default.
| Rank | Provider | API Key Format | Default Model | Notes | 
|---|---|---|---|---|
| 1 | OpenAI | sk-... | gpt-4.1 | Fast and reliable model optimized for speed | 
| 2 | Anthropic | sk-ant-... | claude-sonnet-4-20250514 | High-performance with exceptional reasoning | 
| 3 | Google | AIza... | gemini-2.5-pro | Most advanced model with built-in thinking |
| 4 | Azure OpenAI | [a-f0-9]{32} | gpt-4.1 | Same as OpenAI but via Azure | 
| 5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic | 
| 6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud | 
| 7 | Groq | gsk_... | deepseek-r1 | Best reasoning model with speed advantage | 
| 8 | OpenRouter | sk-or-... | deepseek/deepseek-r1 | Best reasoning model available | 
| 9 | xAI | xai-... | grok-code-fast-1 | Latest coding-focused model (Aug 2025) | 
| 10 | Mistral | [a-f0-9]{64} | pixtral-large | Most advanced model (124B params) | 
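To make the auto-detection idea concrete, here is a minimal sketch that maps a key to a vendor using only the key formats and default models listed in the table above. The function names `detect_vendor` and `resolve_model` are hypothetical, and the server's real detection logic may differ. AWS Bedrock and Vertex AI are left out because they use credentials rather than a recognizable key prefix.

```python
import re
from typing import Optional

# Key formats copied from the table above. Order matters: the more specific
# "sk-ant-" and "sk-or-" prefixes must be tried before the generic "sk-".
KEY_PATTERNS = [
    ("Anthropic", re.compile(r"sk-ant-.+")),
    ("OpenRouter", re.compile(r"sk-or-.+")),
    ("OpenAI", re.compile(r"sk-.+")),
    ("Google", re.compile(r"AIza.+")),
    ("Groq", re.compile(r"gsk_.+")),
    ("xAI", re.compile(r"xai-.+")),
    ("Mistral", re.compile(r"[a-f0-9]{64}")),
    ("Azure OpenAI", re.compile(r"[a-f0-9]{32}")),
]

# Default models copied from the same table.
DEFAULT_MODELS = {
    "OpenAI": "gpt-4.1",
    "Anthropic": "claude-sonnet-4-20250514",
    "Google": "gemini-2.5-pro",
    "Azure OpenAI": "gpt-4.1",
    "Groq": "deepseek-r1",
    "OpenRouter": "deepseek/deepseek-r1",
    "xAI": "grok-code-fast-1",
    "Mistral": "pixtral-large",
}


def detect_vendor(api_key: str) -> Optional[str]:
    """Return the first vendor whose pattern matches the whole key."""
    for vendor, pattern in KEY_PATTERNS:
        if pattern.fullmatch(api_key):
            return vendor
    return None


def resolve_model(api_key: str, override: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit LLM_MODEL_NAME override, else the vendor default."""
    if override:
        return override
    vendor = detect_vendor(api_key)
    return DEFAULT_MODELS.get(vendor) if vendor else None


print(detect_vendor("sk-ant-example"), resolve_model("sk-ant-example"))
```

Prefix matching is inherently ambiguous for keys that share a prefix, which is why the specific patterns are listed first in this sketch.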
Open Cursor Settings:
- Go to File → Preferences → Cursor Settings
- Navigate to the MCP tab
- Click + Add to add a new MCP server
Add MCP Server Configuration:
{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-4.1"
  }
}
📝 Configuration Options:
- LLM_API_KEY: Required for Cursor (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Add MCP Server via CLI:
# Set environment variables first (optional model override)
export LLM_API_KEY="your_api_key_here"
export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model
# Add MCP server
claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge
Alternative: Manual Configuration: Create or edit ~/.config/claude-code/mcp_servers.json:
{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-anthropic-api-key-here",
    "LLM_MODEL_NAME": "claude-3-5-haiku"
  }
}
📝 Configuration Options:
- LLM_API_KEY: Required for Claude Code (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
For other MCP-compatible clients, use the standard MCP server configuration:
{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-5"
  }
}
📝 Configuration Options:
- LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
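If you maintain several client configurations, generating the entry can help avoid typos. The sketch below builds the same JSON shape shown in this README for either launcher (Docker or uv). The helper name `build_server_entry` is hypothetical, and where the entry goes inside a given client's configuration file is client-specific and not covered here.

```python
import json
from typing import Optional

# Launch commands copied from the configuration examples in this README.
LAUNCHERS = {
    "docker": ("docker", ["run", "--rm", "-i", "--pull=always",
                          "ghcr.io/othervibes/mcp-as-a-judge:latest"]),
    "uv": ("uv", ["tool", "run", "mcp-as-a-judge"]),
}


def build_server_entry(launcher: str,
                       api_key: Optional[str] = None,
                       model: Optional[str] = None) -> dict:
    """Build a server entry; the env block is added only when needed."""
    if launcher not in LAUNCHERS:
        raise ValueError(f"unknown launcher: {launcher!r}")
    command, args = LAUNCHERS[launcher]
    entry = {"command": command, "args": list(args)}
    env = {}
    if api_key:
        env["LLM_API_KEY"] = api_key
    if model:
        env["LLM_MODEL_NAME"] = model
    if env:
        entry["env"] = env
    return entry


# Placeholder key; substitute your own and keep it out of version control.
print(json.dumps(build_server_entry("uv", "your-openai-api-key-here", "gpt-5"), indent=2))
```

Omitting both `api_key` and `model` yields an entry without an `env` block, which matches the GitHub Copilot + VS Code case where no API key is needed.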
Primary Mode: MCP Sampling
- All judgments are performed using MCP Sampling capability
- No need to configure or pay for external LLM API services
- Works directly with your MCP-compatible client's existing AI model
- Currently supported by: GitHub Copilot + VS Code
Fallback Mode: LLM API Key
- When MCP sampling is not available, the server can use LLM API keys
- Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI
- Automatic vendor detection from API key patterns
- Default model selection per vendor when no model is specified
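The two modes above amount to a simple precedence rule: use MCP sampling when the client offers it, otherwise fall back to an API key. A minimal sketch of that rule follows. The function name, the mode labels, and the `client_supports_sampling` flag are assumptions for illustration, not the server's actual interface.

```python
import os
from typing import Mapping


def choose_judgment_mode(client_supports_sampling: bool,
                         env: Mapping[str, str] = os.environ) -> str:
    """Pick the judgment backend: MCP sampling first, API key second."""
    if client_supports_sampling:
        return "mcp_sampling"
    if env.get("LLM_API_KEY"):
        return "llm_api_key"
    # Neither path is available, so no plan or code can be evaluated.
    raise RuntimeError(
        "MCP sampling is unavailable and LLM_API_KEY is not set"
    )


print(choose_judgment_mode(True, {}))
print(choose_judgment_mode(False, {"LLM_API_KEY": "sk-example"}))
```

Passing the environment as a parameter keeps the rule easy to test without touching real environment variables.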
- The server runs locally on your machine
- No data collection - your code and conversations stay private
- No external API calls when using MCP Sampling. If you set LLM_API_KEY for fallback, the server will call your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
- Complete control over your development workflow and sensitive information
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge
# Install dependencies with uv
uv sync --all-extras --dev
# Install pre-commit hooks
uv run pre-commit install
# Run tests
uv run pytest
# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src
© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.
While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.
How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?
| Feature | IDE Rules | Subagents | MCP as a Judge | 
|---|---|---|---|
| Static behavior guidance | ✓ | ✓ | ✗ | 
| Custom system prompts | ✓ | ✓ | ✓ | 
| Project context integration | ✓ | ✓ | ✓ | 
| Specialized task handling | ✗ | ✓ | ✓ | 
| Active quality gates | ✗ | ✗ | ✓ | 
| Evidence-based validation | ✗ | ✗ | ✓ | 
| Approve/reject with feedback | ✗ | ✗ | ✓ | 
| Workflow enforcement | ✗ | ✗ | ✓ | 
| Cross-assistant compatibility | ✗ | ✗ | ✓ | 
- Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
- Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence (e.g., unified Git diffs and raw test output) and returns structured approvals and required improvements.
- Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates.
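For readers unfamiliar with the unified diff format mentioned above, the sketch below produces one with Python's standard `difflib` module. The file name `calc.py` and its contents are made up for illustration. Real `git diff` output adds extra header lines (such as `diff --git` and `index`), so in practice you would pass Git's own output to the Judge rather than a diff built this way.

```python
import difflib

# Hypothetical before/after contents of a file under review.
before = ["def add(a, b):\n", "    return a - b\n"]
after = ["def add(a, b):\n", "    return a + b\n"]

# unified_diff yields header lines, a hunk marker, and +/- change lines.
diff = "".join(
    difflib.unified_diff(before, after, fromfile="a/calc.py", tofile="b/calc.py")
)
print(diff)
```

Lines starting with `-` are removals and lines starting with `+` are additions, so a reviewer can see exactly what changed.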
- In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
- VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
- Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.
- Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
- Select "mcp-as-a-judge" → "Configure Model Access"
- Check your preferred model(s) to enable sampling
This project is licensed under the MIT License (see LICENSE).
- Model Context Protocol by Anthropic
- LiteLLM for unified LLM API integration