AI SDK benchmarking tool that tests AI agents with MCP (Model Context Protocol) integration using the Vercel AI Gateway. Automatically discovers and runs all tests in the tests/ directory, verifying LLM-generated Svelte components against test suites.
To install dependencies:
./scripts/install.sh # installs the correct bun version
bun install
Configure your Vercel OIDC token using bun.secrets:
- Install Vercel CLI if you haven't already
- Run bun run vercel:link and link the benchmark to a project that has AI Gateway enabled
- Store your VERCEL_OIDC_TOKEN securely:
# Get your token from Vercel project settings
bun run secrets set VERCEL_OIDC_TOKEN your_token_here
- VERCEL_OIDC_TOKEN: Required for Vercel AI Gateway (stored in bun.secrets)
- Other API keys (Anthropic, OpenAI, OpenRouter) are configured in the Vercel dashboard when using AI Gateway
API keys are stored securely using your OS credential manager:
# Check if token is set
bun run secrets
# Set token
bun run secrets set VERCEL_OIDC_TOKEN your_token_here
# Get token
bun run secrets get VERCEL_OIDC_TOKEN
Security Benefits:
- Encrypted storage using OS credential manager (Keychain, libsecret, Windows Credential Manager)
- No plaintext API keys in files
- User-level access control
To run the benchmark:
bun run index.ts
The benchmark features an interactive CLI that will prompt you for configuration:
- Model Selection: Choose one or more models from the Vercel AI Gateway
  - Select from available models in your configured providers
  - Optionally add custom model IDs
  - Can test multiple models in a single run
- MCP Integration: Choose your MCP configuration
  - No MCP Integration: Run without external tools
  - MCP over HTTP: Use HTTP-based MCP server (default: https://mcp.svelte.dev/mcp)
  - MCP over StdIO: Use local MCP server via command (default: npx -y @sveltejs/mcp)
  - Option to provide custom MCP server URL or command
- TestComponent Tool: Enable/disable the testing tool for models
  - Allows models to run tests during component development
  - Enabled by default
After configuration, the benchmark will:
- Discover all tests in the tests/ directory
- For each selected model and test:
  - Run the AI agent with the test's prompt
  - Extract the generated Svelte component
  - Verify the component against the test suite
- Generate a combined report with all results
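The steps above can be sketched as a nested loop over models and tests. This is only an illustration of the control flow, not the project's actual implementation: `TestCase`, `RunResult`, `GenerateFn`, `VerifyFn`, and `runBenchmark` are hypothetical names, and the agent call and test verification are passed in as stand-in callbacks.

```typescript
// Hypothetical shapes for illustration; the real project's types may differ.
interface TestCase {
  name: string;
  prompt: string; // contents of the test's prompt.md
}

interface RunResult {
  model: string;
  test: string;
  passed: boolean;
}

// Stand-ins for "run the agent and extract the component" and
// "verify the component against the test suite".
type GenerateFn = (model: string, prompt: string) => Promise<string>;
type VerifyFn = (test: TestCase, component: string) => Promise<boolean>;

async function runBenchmark(
  models: string[],
  tests: TestCase[],
  generate: GenerateFn,
  verify: VerifyFn,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const model of models) {
    for (const test of tests) {
      // One agent run per (model, test) pair.
      const component = await generate(model, test.prompt);
      const passed = await verify(test, component);
      results.push({ model, test: test.name, passed });
    }
  }
  // The combined list is what a single report would be built from.
  return results;
}
```

Running the pairs sequentially keeps the sketch simple; whether the real tool runs them in parallel is not stated here.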
Results are saved to the results/ directory with timestamped filenames:
- results/result-2024-12-07-14-30-45.json - Full execution trace with all test results
- results/result-2024-12-07-14-30-45.html - Interactive HTML report with expandable test sections
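As a rough illustration of the naming scheme, the following sketch builds a base path in the same `result-YYYY-MM-DD-HH-MM-SS` shape. The helper name `resultBaseName` is hypothetical, and the use of local time (rather than UTC) is an assumption; the tool itself may format the timestamp differently.

```typescript
// Build "results/result-YYYY-MM-DD-HH-MM-SS" from a Date.
// Assumption: local time components are used; the real tool may use UTC.
function resultBaseName(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp = [
    String(date.getFullYear()),
    pad(date.getMonth() + 1), // getMonth() is zero-based
    pad(date.getDate()),
    pad(date.getHours()),
    pad(date.getMinutes()),
    pad(date.getSeconds()),
  ].join("-");
  return `results/result-${stamp}`;
}

// The JSON and HTML files share the same base name.
const base = resultBaseName(new Date(2024, 11, 7, 14, 30, 45));
const jsonPath = `${base}.json`;
const htmlPath = `${base}.html`;
```

Because every field is zero-padded, sorting these filenames alphabetically also sorts them chronologically.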
The HTML report includes:
- Summary bar showing passed/failed/skipped counts
- Expandable sections for each test
- Step-by-step execution trace
- Generated component code
- Test verification results with pass/fail details
- Token usage statistics
- MCP status badge
- Dark/light theme toggle
To regenerate an HTML report from a JSON file:
# Regenerate most recent result
bun run generate-report.ts
# Regenerate specific result
bun run generate-report.ts results/result-2024-12-07-14-30-45.json
Each test in the tests/ directory should have:
tests/
  {test-name}/
    Reference.svelte  - Reference implementation (known-good solution)
    test.ts           - Vitest test file (imports "./Component.svelte")
    prompt.md         - Prompt for the AI agent
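When adding a new test, it can help to check that the directory has all three files. The sketch below is one way to do that given a plain list of file names; `REQUIRED_FILES` and `missingTestFiles` are hypothetical names, not part of the benchmark itself.

```typescript
// The three files each test directory is expected to contain.
const REQUIRED_FILES = ["Reference.svelte", "test.ts", "prompt.md"] as const;

// Given the file names found in one test directory, return the
// required files that are absent (an empty array means the layout is complete).
function missingTestFiles(entries: readonly string[]): string[] {
  const present = new Set(entries);
  return REQUIRED_FILES.filter((file) => !present.has(file));
}
```

For example, `missingTestFiles(["test.ts", "prompt.md"])` returns `["Reference.svelte"]`.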
The benchmark:
- Reads the prompt from prompt.md
- Asks the agent to generate a component
- Writes the generated component to a temporary location
- Runs the tests against the generated component
- Reports pass/fail status
To verify that all reference implementations pass their tests:
bun run verify-tests
This copies each Reference.svelte to Component.svelte temporarily and runs the tests.
The tool supports optional integration with MCP (Model Context Protocol) servers through the interactive CLI. When running the benchmark, you'll be prompted to choose:
- No MCP Integration: Run without external tools
- MCP over HTTP: Connect to an HTTP-based MCP server
  - Default: https://mcp.svelte.dev/mcp
  - Option to provide a custom URL
- MCP over StdIO: Connect to a local MCP server via command
  - Default: npx -y @sveltejs/mcp
  - Option to provide a custom command
MCP status, transport type, and server configuration are recorded in the JSON metadata and displayed as a badge in the HTML report.
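The three MCP choices and their defaults could be modeled roughly as follows. This is a sketch under assumptions: the `McpConfig` type, its field names, and `resolveMcpConfig` are hypothetical and may not match the shape stored in the JSON metadata; only the two default values come from the options listed above.

```typescript
// Hypothetical configuration shape, one variant per CLI choice.
type McpConfig =
  | { transport: "none" }
  | { transport: "http"; url: string }
  | { transport: "stdio"; command: string };

// Defaults as listed in the CLI options.
const DEFAULT_MCP_URL = "https://mcp.svelte.dev/mcp";
const DEFAULT_MCP_COMMAND = "npx -y @sveltejs/mcp";

// Resolve a choice plus an optional custom URL or command.
// Blank or missing custom input falls back to the default.
function resolveMcpConfig(
  transport: McpConfig["transport"],
  custom?: string,
): McpConfig {
  const value = custom?.trim();
  switch (transport) {
    case "none":
      return { transport };
    case "http":
      return { transport, url: value || DEFAULT_MCP_URL };
    case "stdio":
      return { transport, command: value || DEFAULT_MCP_COMMAND };
  }
}
```

A tagged union like this lets the compiler check that a URL is only read for the HTTP transport and a command only for StdIO.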
- 0: All tests passed
- 1: One or more tests failed
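A minimal sketch of how such a status code could be derived from a list of outcomes is shown below. `TestOutcome` and `exitCodeFor` are hypothetical names, and treating an empty run as success is an assumption rather than documented behavior.

```typescript
// Hypothetical per-test outcome; the real result format may carry more fields.
interface TestOutcome {
  name: string;
  passed: boolean;
}

// 0 when every test passed, 1 when at least one failed.
// Assumption: an empty list counts as "all passed".
function exitCodeFor(outcomes: readonly TestOutcome[]): 0 | 1 {
  return outcomes.every((outcome) => outcome.passed) ? 0 : 1;
}
```

In a script, the returned value would typically be handed to `process.exit`, which lets CI pipelines fail a job when any test fails.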
See AGENTS.md for detailed documentation on:
- Architecture and components
- Environment variables and model configuration
- MCP integration details
- Development commands
- Multi-test result format
This project was created using bun init in bun v1.3.3. Bun is a fast all-in-one JavaScript runtime.