A web application for testing agentic server-side prompts against multiple LLMs with automated scoring and prompt improvement capabilities.
- Multi-LLM Testing: Test prompts against multiple LLM providers and models (OpenAI, Bedrock, Deepseek)
- Parallel Execution: Run tests in parallel across all LLMs
- 10x Repeatability: Each test case runs 10 times per LLM to measure consistency
- Auto-Improvement: LLMs automatically suggest and test prompt improvements
- Version Control: All prompts are versioned with full history
Getting started:

- Install dependencies: `bun install`
- Start the development server: `bun dev`
- Open the app: Navigate to http://localhost:3000
- Configure API Keys:
  - Go to the Configuration page
  - Add your OpenAI, Bedrock, and/or Deepseek API keys
  - At least one provider must be configured to run tests (keys can also be set through the config API; see the sketch below)
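API keys can also be set programmatically through the config endpoint documented further below. A minimal sketch, assuming hypothetical body field names such as `openaiApiKey` (the real request shape lives in `src/server.ts`):

```ts
// Sketch: configure provider keys via POST /api/config.
// The body field names are assumptions, not the documented contract.
await fetch("http://localhost:3000/api/config", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    openaiApiKey: process.env.OPENAI_API_KEY,
    deepseekApiKey: process.env.DEEPSEEK_API_KEY,
  }),
});
// Keys are masked when read back via GET /api/config.
```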
Create a prompt:

- Navigate to the "Prompts" page
- Enter a name and the system prompt content
- Prompts are automatically versioned

Add test cases:

- Navigate to the "Test Cases" page
- Select a prompt from the dropdown
- Add test cases with:
  - Input: The user message to send to the LLM
  - Expected Output: Valid JSON that the LLM should return (an illustrative test case follows this list)
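For illustration, a test case for a hypothetical order-extraction prompt could look like this (the field names and the JSON shape are made up for the example):

```ts
// Hypothetical test case: both the prompt domain and the JSON shape are invented.
const exampleTestCase = {
  // Input: the user message sent along with the system prompt
  input: "I'd like two large pizzas delivered to 12 Main St.",
  // Expected Output: the exact JSON the LLM is expected to return
  expectedOutput: {
    item: "pizza",
    size: "large",
    quantity: 2,
    deliveryAddress: "12 Main St",
  },
};
```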
Run tests:

- Navigate to the "Run Tests" page
- Select a prompt with test cases
- Click "Run Tests"
- Watch progress bars as tests execute
- View detailed results with per-LLM scores (runs can also be started through the API, as sketched below)
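A minimal sketch of driving the same flow over HTTP, assuming the run endpoint accepts a `promptId` and the status payload exposes a `status` field (neither shape is documented here, so treat the field names as assumptions):

```ts
// Sketch: start a test run and poll until it finishes.
// The request body ({ promptId }) and the status payload shape are assumptions;
// see src/server.ts and src/services/test-runner.ts for the real contracts.
const base = "http://localhost:3000";

const { jobId } = await fetch(`${base}/api/test/run`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ promptId: 1 }),
}).then((r) => r.json());

let progress: { status?: string } = {};
do {
  await new Promise((resolve) => setTimeout(resolve, 1000)); // poll once per second
  progress = await fetch(`${base}/api/test/status/${jobId}`).then((r) => r.json());
} while (progress.status === "running");

console.log(progress); // per-LLM scores and details once the job is done
```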
Auto-improve a prompt:

- Navigate to the "Auto-Improve" page
- Select a prompt with test cases
- Set max iterations (how many improvement attempts)
- Click "Start Improvement"
- Watch the log as LLMs analyze failures and suggest improvements
- Best improvements are automatically saved as new versions (the equivalent API calls are sketched below)
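The same job can be started and monitored through the improvement endpoints. A rough sketch, with the request fields (`promptId`, `maxIterations`) assumed from the UI options rather than documented:

```ts
// Sketch: start an auto-improvement job and poll its progress.
// Body fields and the status payload shape are assumptions.
const base = "http://localhost:3000";

const { jobId } = await fetch(`${base}/api/improve/start`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ promptId: 1, maxIterations: 5 }),
}).then((r) => r.json());

const timer = setInterval(async () => {
  const progress = await fetch(`${base}/api/improve/status/${jobId}`).then((r) => r.json());
  console.log(progress); // iteration log and best score so far (shape assumed)
  if (progress.status !== "running") clearInterval(timer);
}, 1000);
```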
How test scoring works:

- For each test case × LLM combination:
  - Send the system prompt + test input to the LLM
  - Parse the response as JSON
  - Compare against expected output (exact match after normalization)
- Repeat 10 times to measure consistency
- Calculate scores per LLM and overall (a sketch of the comparison idea follows)
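In spirit, the comparison step boils down to parsing both sides as JSON and checking structural equality so that key order and formatting do not matter. The sketch below illustrates that idea; it is not the project's actual `src/utils/json-comparison.ts`:

```ts
// Illustrative only: exact structural match after normalization.
function normalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalize);
  if (value && typeof value === "object") {
    // Sort keys so {"a":1,"b":2} and {"b":2,"a":1} compare as equal.
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([key, val]) => [key, normalize(val)]),
    );
  }
  return value; // primitives (and null) compare as-is
}

function responsesMatch(llmResponse: string, expected: string): boolean {
  try {
    return (
      JSON.stringify(normalize(JSON.parse(llmResponse))) ===
      JSON.stringify(normalize(JSON.parse(expected)))
    );
  } catch {
    return false; // an unparseable response counts as a failure
  }
}
```

A test case's per-LLM score is then presumably the fraction of the 10 repeats that match, aggregated into per-LLM and overall scores.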
How auto-improvement works:

- Test the original prompt and record failures
- For each iteration:
  - Send failed test results to all LLMs
  - Ask each LLM to suggest an improved prompt
  - Test each suggestion
  - Keep the best-scoring version
  - Revert if no improvement found
- Save the best version automatically (see the loop sketch below)
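As pseudocode, the loop described above looks roughly like this; every helper name here is hypothetical and stands in for the real logic in `src/services/improvement-service.ts`:

```ts
// Rough sketch of the improvement loop. All helpers below are hypothetical.
type TestResult = { score: number; failures: string[] };

declare const llms: {
  suggestImprovement(prompt: string, failures: string[]): Promise<string>;
}[];
declare function runTests(prompt: string): Promise<TestResult>;
declare function savePromptVersion(prompt: string): Promise<void>;

async function autoImprove(prompt: string, maxIterations: number) {
  // Baseline: test the original prompt and record its failures.
  let best = { prompt, result: await runTests(prompt) };

  for (let i = 0; i < maxIterations; i++) {
    // Each configured LLM sees the failing cases and proposes a revised prompt.
    const suggestions = await Promise.all(
      llms.map((llm) => llm.suggestImprovement(best.prompt, best.result.failures)),
    );

    // Test every suggestion; keep the best-scoring version.
    for (const suggestion of suggestions) {
      const result = await runTests(suggestion);
      if (result.score > best.result.score) best = { prompt: suggestion, result };
    }
    // If nothing beats the current best, `best` is left unchanged (the revert case).
  }

  await savePromptVersion(best.prompt); // best version is saved automatically
  return best;
}
```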
Configuration endpoints:

- `GET /api/config` - Get all config (API keys masked)
- `POST /api/config` - Update API keys
- `GET /api/config/improvement-prompt` - Get improvement prompt template
- `PUT /api/config/improvement-prompt` - Update improvement prompt template
- `GET /api/models` - Get all available models from configured providers

Prompt endpoints:

- `GET /api/prompts` - Get latest version of each prompt
- `POST /api/prompts` - Create a new prompt
- `GET /api/prompts/:id` - Get a specific prompt by ID
- `GET /api/prompts/:id/versions` - Get version history for a prompt by ID
- `DELETE /api/prompts/:id` - Delete a specific prompt version
- `DELETE /api/prompts/:id/all-versions` - Delete all versions of a prompt (by ID)

Test case endpoints:

- `GET /api/prompts/:id/test-cases` - Get test cases for a prompt
- `POST /api/prompts/:id/test-cases` - Create a test case (a combined usage sketch follows the endpoint list)
- `PUT /api/test-cases/:id` - Update a test case
- `DELETE /api/test-cases/:id` - Delete a test case

Test run endpoints:

- `POST /api/test/run` - Start test run (returns jobId)
- `GET /api/test/status/:jobId` - Poll test progress

Improvement endpoints:

- `POST /api/improve/start` - Start improvement job
- `GET /api/improve/status/:jobId` - Poll improvement progress
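Putting the prompt and test-case endpoints together, creating a prompt and attaching a test case might look like the sketch below. The request body field names (`name`, `content`, `input`, `expectedOutput`) and the `id` field on the response are assumptions, not documented API:

```ts
// Sketch: create a prompt, then attach a test case to it.
// Field names are assumptions; check src/server.ts for the actual shapes.
const base = "http://localhost:3000";

const prompt = await fetch(`${base}/api/prompts`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    name: "order-extractor",
    content: "Extract the order from the user message and reply with JSON only.",
  }),
}).then((r) => r.json());

await fetch(`${base}/api/prompts/${prompt.id}/test-cases`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    input: "Two large pizzas to 12 Main St, please.",
    expectedOutput: JSON.stringify({ item: "pizza", size: "large", quantity: 2 }),
  }),
});
```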
├── src/
│   ├── server.ts              # Express server and routes
│   ├── database.ts            # Database operations
│   ├── db/
│   │   ├── index.ts           # Database initialization
│   │   └── schema.ts          # Drizzle schema
│   ├── llm-clients/           # LLM provider clients
│   │   ├── llm-client.ts      # Unified interface
│   │   ├── openai-client.ts
│   │   ├── bedrock-client.ts
│   │   └── deepseek-client.ts
│   ├── services/
│   │   ├── test-runner.ts     # Test execution service
│   │   └── improvement-service.ts
│   └── utils/
│       └── json-comparison.ts
├── public/                    # Frontend files
│   ├── index.html
│   ├── prompts.html
│   ├── test-cases.html
│   ├── test-runs.html
│   ├── improve.html
│   └── styles.css
├── drizzle/                   # Database migrations
└── data/                      # SQLite database (auto-created)
- `bun dev` - Start the development server
- `bun run build` - Build the app
- `bun start` - Start the production server
- `bun run format` - Format the code
- `bun run db:generate` - Generate database migrations
- `bun run db:studio` - Open Drizzle Studio

License: MIT