A lightweight, flexible evaluation ('eval') framework for testing models with automated judging, supporting Gemini, Anthropic, OpenAI, and Ollama.
- SQLite database for saving history
- Specify LLM provider for both model and judge
- Batch evaluations to multiple providers/models
- API endpoints for developers to consume
- Built-in GUI and results dashboard
- Additional evaluation criteria options (exact match, semantic similarity, etc.)
- Python SDK available
- Real-time WebSocket updates
If you want to use 'evaluate' via your own Python scripts or Jupyter Notebooks, you can use the SDK:
https://pypi.org/project/llmeval-sdk/
(Example usage with Python is shown on that PyPI page.)
You'll need:
- Docker (recommended) OR Rust/Cargo
- API keys for your LLM provider(s)
If you use Ollama, pull a model first:
ollama pull llama3
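To double-check that the model pulled successfully, you can list what Ollama has locally:
ollama list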
Create a .env file in your project root (see env.example):
DATABASE_URL=sqlite:./data/evals.db
GEMINI_API_BASE=https://generativelanguage.googleapis.com
GEMINI_API_KEY=AIzaxxxxxxxxxxxxxxxxxxxxxxxxxxc
GEMINI_MODELS=gemini-2.5-pro,gemini-2.5-flash
OLLAMA_API_BASE=http://host.docker.internal:11434
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx
OPENAI_MODELS=gpt-4o,gpt-4o-mini,gpt-3.5-turbo
ANTHROPIC_API_KEY=sk-placeholder-ant-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6
ANTHROPIC_MODELS=claude-opus-4,claude-sonnet-4-5,claude-haiku-4
RUST_LOG=info
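One way to set this up is to copy the template and then fill in your real keys (the values above are placeholders):
cp env.example .env
# edit .env and replace the placeholder API keys with your own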
Build the image:
docker build -t evaluate:latest .
Run on Linux:
docker run --rm -it \
--network host \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
-e OLLAMA_API_BASE=http://localhost:11434 \
evaluate:latest
Run on Mac:
docker run --rm -it -p 8080:8080 \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
evaluate:latest
Run on Windows (PowerShell):
docker run --rm -it -p 8080:8080 `
--env-file .env `
-v ${PWD}/data:/usr/local/bin/data `
evaluate:latest
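On any platform, you can check that the container came up by hitting the health endpoint (documented under API Endpoints below):
curl http://localhost:8080/api/v1/health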
# 1. Clone the repository
git clone git@github.com:RGGH/evaluate.git
# 2. Navigate into the project directory
cd evaluate
# 3. Run with Cargo (requires Rust/Cargo installed)
cargo run
You should see output similar to:
[INFO] Starting database migration...
[INFO] Starting server at 127.0.0.1:8080
Access the application at http://localhost:8080
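To confirm that the providers and models from your .env were picked up, list what the server exposes:
curl http://localhost:8080/api/v1/models | jq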
Gemini Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini-2.5-pro",
"criteria": "Does the output correctly name the capital city?"
}' | jq
Response:
{
"id": "619cd32a-4376-4969-ac48-0f25b37bc933",
"status": "passed",
"result": {
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"model_output": "The capital of France is **Paris**.",
"expected": "Paris",
"judge_result": {
"judge_model": "gemini-2.5-pro",
"verdict": "Pass",
"reasoning": "Verdict: PASS\n\nThe actual output correctly names Paris as the capital city...",
"confidence": null
},
"timestamp": "2024-07-29T10:30:00.123456789+00:00"
},
"error": null
}
Ollama Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "ollama:llama3",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "ollama:llama3",
"criteria": "Does the output correctly name the capital city?"
}' | jq
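The same request shape works for the other providers; for example, an OpenAI run (assuming openai:gpt-4o-mini and openai:gpt-4o are configured via your OPENAI_* settings):
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "openai:gpt-4o-mini",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "openai:gpt-4o",
"criteria": "Does the output correctly name the capital city?"
}' | jq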
For batch evals, set the provider in the JSON file using the generic provider:model_name syntax:
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini:gemini-2.5-pro-latest"
}
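The batch endpoint expects an array of these configs, so a file such as qa_sample.json might look roughly like this (the sample files shipped with the repo may differ):
[
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini:gemini-2.5-pro-latest"
},
{
"model": "ollama:llama3",
"prompt": "What is 2+2?",
"expected": "4",
"judge_model": "gemini:gemini-2.5-pro-latest",
"criteria": "The output should be mathematically correct"
}
]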
Call the api/v1/evals/batch endpoint:
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_sample.json' | jq
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_f1.json' | jq
Base URL: http://localhost:8080/api/v1
Method | Endpoint | Description | Response |
---|---|---|---|
GET | /health | Health check endpoint | {"status": "healthy", "service": "eval-api", "version": "..."} |
GET | /models | List all available models | {"models": ["gemini:model-name", "ollama:model-name", ...]} |
Method | Endpoint | Description | Request Body |
---|---|---|---|
POST | /evals/run | Run a single evaluation | RunEvalRequest |
POST | /evals/batch | Run multiple evaluations concurrently | Array of EvalConfig |
GET | /evals/history | Get all evaluation history | - |
GET | /evals/{id} | Get specific evaluation result | - |
GET | /evals/{id}/status | Get evaluation status | - |
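For example, to fetch past results (the UUID below is the one returned in the earlier example response; substitute your own):
curl http://localhost:8080/api/v1/evals/history | jq
curl http://localhost:8080/api/v1/evals/619cd32a-4376-4969-ac48-0f25b37bc933 | jq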
Method | Endpoint | Description | Request Body |
---|---|---|---|
GET | /judge-prompts | Get all judge prompt versions | - |
GET | /judge-prompts/active | Get the currently active judge prompt | - |
GET | /judge-prompts/{version} | Get a specific judge prompt by version | - |
POST | /judge-prompts | Create a new judge prompt version | CreateJudgePromptRequest |
PUT | /judge-prompts/active | Set a judge prompt version as active | {"version": 2} |
Get all judge prompts:
curl http://localhost:8080/api/v1/judge-prompts
Create a new judge prompt:
curl -X POST http://localhost:8080/api/v1/judge-prompts \
-H "Content-Type: application/json" \
-d '{
"name": "Strict Evaluator",
"template": "Compare:\nExpected: {{expected}}\nActual: {{actual}}\nVerdict: PASS or FAIL",
"description": "Requires exact semantic match",
"set_active": true
}'
Set a version as active:
curl -X PUT http://localhost:8080/api/v1/judge-prompts/active \
-H "Content-Type: application/json" \
-d '{"version": 2}'
Method | Endpoint | Description | Request Body |
---|---|---|---|
POST | /experiments | Create a new experiment | CreateExperimentRequest |
GET | /experiments/{id} | Get experiment details | - |
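For example, to fetch experiment details (the id is a placeholder; see the full documentation for the CreateExperimentRequest body used to create one):
curl http://localhost:8080/api/v1/experiments/<experiment-id> | jq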
Protocol | Endpoint | Description |
---|---|---|
WS | /ws | Real-time evaluation updates |
Connect to WebSocket:
const ws = new WebSocket('ws://localhost:8080/api/v1/ws');
ws.onmessage = (event) => {
const update = JSON.parse(event.data);
console.log('Eval update:', update);
};
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"expected": "4",
"judge_model": "gemini:gemini-1.5-pro-latest",
"criteria": "The output should be mathematically correct"
}
Fields:
- model (required): Model identifier in the format provider:model_name
- prompt (required): The prompt to send to the model
- expected (optional): Expected output for comparison
- judge_model (optional): Judge model for LLM-as-a-judge evaluation
- criteria (optional): Custom evaluation criteria
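Since only model and prompt are required, a minimal request can omit the expected answer and the judge entirely, for example:
curl -X POST http://localhost:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{"model": "ollama:llama3", "prompt": "What is the capital of France?"}' | jq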
{
"model": "openai:gpt-4o",
"prompt": "Explain quantum computing",
"expected": "Quantum computing uses quantum bits...",
"judge_model": "gemini:gemini-2.5-pro-latest",
"criteria": "The explanation should be accurate and accessible",
"tags": ["physics", "computing"],
"metadata": {
"category": "science",
"difficulty": "advanced"
}
}
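If this extended shape matches the EvalConfig accepted by the batch endpoint, an array of such configs can also be posted inline rather than from a file:
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '[
{"model": "openai:gpt-4o", "prompt": "Explain quantum computing", "judge_model": "gemini:gemini-2.5-pro-latest", "criteria": "The explanation should be accurate and accessible", "tags": ["physics", "computing"]},
{"model": "openai:gpt-4o-mini", "prompt": "What is 2+2?", "expected": "4"}
]' | jq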
{
"id": "uuid-string",
"status": "passed",
"result": {
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"model_output": "2+2 equals 4",
"expected": "4",
"judge_result": {
"judge_model": "gemini:gemini-2.5-pro-latest",
"verdict": "Pass",
"reasoning": "The output correctly identifies that 2+2 equals 4...",
"confidence": null
},
"timestamp": "2025-10-15T12:34:56Z",
"latency_ms": 450,
"judge_latency_ms": 320,
"total_latency_ms": 770
},
"error": null
}
Status values: "passed", "failed", "uncertain", "completed", "error"
Verdict values: "Pass", "Fail", "Uncertain"
{
"batch_id": "uuid-string",
"status": "completed",
"total": 10,
"completed": 10,
"passed": 8,
"failed": 2,
"average_model_latency_ms": 425,
"average_judge_latency_ms": 315,
"results": []
}
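Assuming you have saved a batch response to batch.json, a quick pass-rate check with jq:
jq '.passed / .total' batch.json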
See full documentation for:
- HistoryResponse
- CreateExperimentRequest
- ExperimentResponse
- CreateJudgePromptRequest
- JudgePrompt
- EvalUpdate (WebSocket)
Models are specified in the format provider:model_name:

Gemini:
- gemini:gemini-2.5-flash-latest
- gemini:gemini-2.5-pro-latest

Ollama:
- ollama:llama3
- ollama:gemma

OpenAI:
- openai:gpt-4o
- openai:gpt-4o-mini
- openai:gpt-3.5-turbo

Anthropic:
- anthropic:claude-opus-4
- anthropic:claude-sonnet-4
- anthropic:claude-sonnet-4-5
- anthropic:claude-haiku-4

If no provider is specified, gemini is used as the default.
One major limitation of LLMs is knowledge recency. Since these models are trained on fixed datasets that quickly become outdated, they often struggle with topics that rely on the latest information, such as new laws, policies, or medical guidance. This means their judgements can be based on old or irrelevant data, leading to unreliable results. To keep them up to date, techniques like retrieval-augmented generation (RAG), regular fine-tuning, and continual learning can help ensure LLMs-as-judges have access to the most current knowledge when making decisions.
Another key weakness is hallucination, where LLMs confidently generate information that isn't true. In an evaluation context, this could mean inventing fake references, misinterpreting facts, or fabricating evidence, all of which can undermine trust in their output. Building in robust fact-checking systems that verify claims against reliable sources is essential to reduce the impact of these errors and maintain fairness in judgement.
Lastly, LLMs often face domain-specific knowledge gaps. While they're great generalists, they can lack the deep understanding needed for complex areas like law, finance, or medicine. Integrating domain-specific knowledge graphs or using RAG to pull in expert information can help bridge this gap, allowing them to deliver more accurate and context-aware evaluations.
Thank you for your interest in contributing!
We welcome contributions of all kinds: bug fixes, improvements, documentation, examples, or new features. Rust, Python, and front-end JS/TS contributions are all welcome. See the current issues for ideas.
- Fork the repository and create a new branch for your changes
- Make your changes with clear, descriptive commit messages
- Open a Pull Request explaining what you've done and why
Please make sure your code follows the existing style and passes any tests. For larger changes, feel free to open an issue first to discuss your approach.
By contributing, you agree that your work will be licensed under this project's license.
Thank you for helping make this project better!
- https://arxiv.org/html/2412.05579v2
- https://github.com/openai/evals
- https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-generative-ai-app#query-and-response-metric-requirements
- Image Classifier Evals