A lightweight, flexible evaluation ('eval') framework for testing models with automated judging, supporting Gemini, Anthropic, OpenAI, and Ollama.
- SQLite database for saving history
- Specify LLM provider for both model and judge
- Batch evaluations to multiple providers/models
- API endpoints for developers to consume
- Built-in GUI and results dashboard
- Additional evaluation criteria options (exact match, semantic similarity, etc.)
- Python SDK available
- Real-time WebSocket updates
If you want to use 'evaluate' via your own Python scripts or Jupyter Notebooks, you can use the SDK:
https://pypi.org/project/llmeval-sdk/
(Example usage with Python is shown on that PyPI page.)
You'll need:
- Docker (recommended) OR Rust/Cargo
- API keys for your LLM provider(s)
If you use Ollama, pull a model first:
ollama pull llama3
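To double-check that the model pulled successfully, you can list what Ollama has locally:
ollama list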
Create a .env file in your project root (see env.example):
DATABASE_URL=sqlite:./data/evals.db
GEMINI_API_BASE=https://generativelanguage.googleapis.com
GEMINI_API_KEY=AIzaxxxxxxxxxxxxxxxxxxxxxxxxxxc
GEMINI_MODELS=gemini-2.5-pro,gemini-2.5-flash
OLLAMA_API_BASE=http://host.docker.internal:11434
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx
OPENAI_MODELS=gpt-4o,gpt-4o-mini,gpt-3.5-turbo
ANTHROPIC_API_KEY=sk-placeholder-ant-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6
ANTHROPIC_MODELS=claude-opus-4,claude-sonnet-4-5,claude-haiku-4
RUST_LOG=info
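One way to set this up is to copy the template and then fill in your real keys (the values above are placeholders):
cp env.example .env
# edit .env and replace the placeholder API keys with your own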
Build the image:
docker build -t evaluate:latest .
Run on Linux:
docker run --rm -it \
--network host \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
-e OLLAMA_API_BASE=http://localhost:11434 \
evaluate:latest
Run on Mac:
docker run --rm -it -p 8080:8080 \
--env-file .env \
-v $(pwd)/data:/usr/local/bin/data \
evaluate:latest
Run on Windows (PowerShell):
docker run --rm -it -p 8080:8080 `
--env-file .env `
-v ${PWD}/data:/usr/local/bin/data `
evaluate:latest
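On any platform, you can check that the container came up by hitting the health endpoint (documented under API Endpoints below):
curl http://localhost:8080/api/v1/health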
# 1. Clone the repository
git clone git@github.com:RGGH/evaluate.git
# 2. Navigate into the project directory
cd evaluate
# 3. Run with Cargo (requires Rust/Cargo installed)
cargo run
You should see output similar to:
[INFO] Starting database migration...
[INFO] Starting server at 127.0.0.1:8080
Access the application at http://localhost:8080
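To confirm that the providers and models from your .env were picked up, list what the server exposes:
curl http://localhost:8080/api/v1/models | jq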
Gemini Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini-2.5-pro",
"criteria": "Does the output correctly name the capital city?"
}' | jq
Response:
{
"id": "619cd32a-4376-4969-ac48-0f25b37bc933",
"status": "passed",
"result": {
"model": "gemini-2.5-pro",
"prompt": "What is the capital of France?",
"model_output": "The capital of France is **Paris**.",
"expected": "Paris",
"judge_result": {
"judge_model": "gemini-2.5-pro",
"verdict": "Pass",
"reasoning": "Verdict: PASS\n\nThe actual output correctly names Paris as the capital city...",
"confidence": null
},
"timestamp": "2024-07-29T10:30:00.123456789+00:00"
},
"error": null
}
Ollama Example:
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "ollama:llama3",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "ollama:llama3",
"criteria": "Does the output correctly name the capital city?"
}' | jq
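The same request shape works for the other providers; for example, an OpenAI run (assuming openai:gpt-4o-mini and openai:gpt-4o are configured via your OPENAI_* settings):
curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
"model": "openai:gpt-4o-mini",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "openai:gpt-4o",
"criteria": "Does the output correctly name the capital city?"
}' | jq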
For batch evals, set the provider in the JSON file using the generic provider:model_name syntax:
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini:gemini-2.5-pro-latest"
}
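The batch endpoint expects an array of these configs, so a file such as qa_sample.json might look roughly like this (the sample files shipped with the repo may differ):
[
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is the capital of France?",
"expected": "Paris",
"judge_model": "gemini:gemini-2.5-pro-latest"
},
{
"model": "ollama:llama3",
"prompt": "What is 2+2?",
"expected": "4",
"judge_model": "gemini:gemini-2.5-pro-latest",
"criteria": "The output should be mathematically correct"
}
]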
Call the api/v1/evals/batch endpoint:
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_sample.json' | jq
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_f1.json' | jq
Base URL: http://localhost:8080/api/v1
Method | Endpoint | Description | Response |
---|---|---|---|
GET | /health | Health check endpoint | {"status": "healthy", "service": "eval-api", "version": "..."} |
GET | /models | List all available models | {"models": ["gemini:model-name", "ollama:model-name", ...]} |
Method | Endpoint | Description | Request Body |
---|---|---|---|
POST | /evals/run | Run a single evaluation | RunEvalRequest |
POST | /evals/batch | Run multiple evaluations concurrently | Array of EvalConfig |
GET | /evals/history | Get all evaluation history | - |
GET | /evals/{id} | Get specific evaluation result | - |
GET | /evals/{id}/status | Get evaluation status | - |
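For example, to fetch past results (the UUID below is the one returned in the earlier example response; substitute your own):
curl http://localhost:8080/api/v1/evals/history | jq
curl http://localhost:8080/api/v1/evals/619cd32a-4376-4969-ac48-0f25b37bc933 | jq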
Method | Endpoint | Description | Request Body |
---|---|---|---|
GET | /judge-prompts | Get all judge prompt versions | - |
GET | /judge-prompts/active | Get the currently active judge prompt | - |
GET | /judge-prompts/{version} | Get a specific judge prompt by version | - |
POST | /judge-prompts | Create a new judge prompt version | CreateJudgePromptRequest |
PUT | /judge-prompts/active | Set a judge prompt version as active | {"version": 2} |
Get all judge prompts:
curl http://localhost:8080/api/v1/judge-prompts
Create a new judge prompt:
curl -X POST http://localhost:8080/api/v1/judge-prompts \
-H "Content-Type: application/json" \
-d '{
"name": "Strict Evaluator",
"template": "Compare:\nExpected: {{expected}}\nActual: {{actual}}\nVerdict: PASS or FAIL",
"description": "Requires exact semantic match",
"set_active": true
}'
Set a version as active:
curl -X PUT http://localhost:8080/api/v1/judge-prompts/active \
-H "Content-Type: application/json" \
-d '{"version": 2}'
Method | Endpoint | Description | Request Body |
---|---|---|---|
POST | /experiments | Create a new experiment | CreateExperimentRequest |
GET | /experiments/{id} | Get experiment details | - |
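For example, to fetch experiment details (the id is a placeholder; see the full documentation for the CreateExperimentRequest body used to create one):
curl http://localhost:8080/api/v1/experiments/<experiment-id> | jq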
Protocol | Endpoint | Description |
---|---|---|
WS | /ws | Real-time evaluation updates |
Connect to WebSocket:
const ws = new WebSocket('ws://localhost:8080/api/v1/ws');
ws.onmessage = (event) => {
const update = JSON.parse(event.data);
console.log('Eval update:', update);
};
{
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"expected": "4",
"judge_model": "gemini:gemini-1.5-pro-latest",
"criteria": "The output should be mathematically correct"
}
Fields:
- model (required): Model identifier in the format provider:model_name
- prompt (required): The prompt to send to the model
- expected (optional): Expected output for comparison
- judge_model (optional): Judge model for LLM-as-a-judge evaluation
- criteria (optional): Custom evaluation criteria
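Since only model and prompt are required, a minimal request can omit the expected answer and the judge entirely, for example:
curl -X POST http://localhost:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{"model": "ollama:llama3", "prompt": "What is the capital of France?"}' | jq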
{
"model": "openai:gpt-4o",
"prompt": "Explain quantum computing",
"expected": "Quantum computing uses quantum bits...",
"judge_model": "gemini:gemini-2.5-pro-latest",
"criteria": "The explanation should be accurate and accessible",
"tags": ["physics", "computing"],
"metadata": {
"category": "science",
"difficulty": "advanced"
}
}
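If this extended shape matches the EvalConfig accepted by the batch endpoint, an array of such configs can also be posted inline rather than from a file:
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '[
{"model": "openai:gpt-4o", "prompt": "Explain quantum computing", "judge_model": "gemini:gemini-2.5-pro-latest", "criteria": "The explanation should be accurate and accessible", "tags": ["physics", "computing"]},
{"model": "openai:gpt-4o-mini", "prompt": "What is 2+2?", "expected": "4"}
]' | jq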
{
"id": "uuid-string",
"status": "passed",
"result": {
"model": "gemini:gemini-2.5-flash-latest",
"prompt": "What is 2+2?",
"model_output": "2+2 equals 4",
"expected": "4",
"judge_result": {
"judge_model": "gemini:gemini-2.5-pro-latest",
"verdict": "Pass",
"reasoning": "The output correctly identifies that 2+2 equals 4...",
"confidence": null
},
"timestamp": "2025-10-15T12:34:56Z",
"latency_ms": 450,
"judge_latency_ms": 320,
"total_latency_ms": 770
},
"error": null
}
Status values: "passed", "failed", "uncertain", "completed", "error"
Verdict values: "Pass", "Fail", "Uncertain"
{
"batch_id": "uuid-string",
"status": "completed",
"total": 10,
"completed": 10,
"passed": 8,
"failed": 2,
"average_model_latency_ms": 425,
"average_judge_latency_ms": 315,
"results": []
}
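Assuming you have saved a batch response to batch.json, a quick pass-rate check with jq:
jq '.passed / .total' batch.json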
See full documentation for:
- HistoryResponse
- CreateExperimentRequest
- ExperimentResponse
- CreateJudgePromptRequest
- JudgePrompt
- EvalUpdate (WebSocket)
Models are specified in the format provider:model_name:

Gemini:
- gemini:gemini-2.5-flash-latest
- gemini:gemini-2.5-pro-latest

Ollama:
- ollama:llama3
- ollama:gemma

OpenAI:
- openai:gpt-4o
- openai:gpt-4o-mini
- openai:gpt-3.5-turbo

Anthropic:
- anthropic:claude-opus-4
- anthropic:claude-sonnet-4
- anthropic:claude-sonnet-4-5
- anthropic:claude-haiku-4

If no provider is specified, gemini is used as the default.
One major limitation of LLMs is knowledge recency. Since these models are trained on fixed datasets that quickly become outdated, they often struggle with topics that rely on the latest information, such as new laws, policies, or medical guidance. This means their judgements can be based on old or irrelevant data, leading to unreliable results. To keep them up to date, techniques like retrieval-augmented generation (RAG), regular fine-tuning, and continual learning can help ensure LLMs-as-judges have access to the most current knowledge when making decisions.
Another key weakness is hallucination, where LLMs confidently generate information that isn't true. In an evaluation context, this could mean inventing fake references, misinterpreting facts, or fabricating evidence, all of which can undermine trust in their output. Building in robust fact-checking systems that verify claims against reliable sources is essential to reduce the impact of these errors and maintain fairness in judgement.
Lastly, LLMs often face domain-specific knowledge gaps. While they're great generalists, they can lack the deep understanding needed for complex areas like law, finance, or medicine. Integrating domain-specific knowledge graphs or using RAG to pull in expert information can help bridge this gap, allowing them to deliver more accurate and context-aware evaluations.
Thank you for your interest in contributing!
We welcome contributions of all kinds: bug fixes, improvements, documentation, examples, or new features. Rust, Python, and front-end JS/TS contributions are all welcome. See the current issues for ideas.
- Fork the repository and create a new branch for your changes
- Make your changes with clear, descriptive commit messages
- Open a Pull Request explaining what you've done and why
Please make sure your code follows the existing style and passes any tests. For larger changes, feel free to open an issue first to discuss your approach.
By contributing, you agree that your work will be licensed under this project's license.
Thank you for helping make this project better!
- https://arxiv.org/html/2412.05579v2
- https://github.com/openai/evals
- https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/evaluate-generative-ai-app#query-and-response-metric-requirements
- Image Classifier Evals