

evaluate

Evaluate - An LLM Testing Framework

A lightweight, flexible evaluation ('eval') framework for testing LLMs, with automated judging and support for Gemini, Anthropic, OpenAI, and Ollama.

▶️ Watch on YouTube

Features

  • SQLite database for saving history
  • Specify LLM provider for both model and judge
  • Batch evaluations to multiple providers/models
  • API endpoints for developers to consume
  • Built-in GUI and results dashboard
  • Additional evaluation criteria options (exact match, semantic similarity, etc.)
  • Python SDK available
  • Real-time WebSocket updates

Python SDK

If you want to use 'evaluate' via your own Python scripts or Jupyter Notebooks, you can use the SDK:

https://pypi.org/project/llmeval-sdk/

(Example usage with Python is shown on that PyPI page)
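If you would rather call the service directly from a script or notebook without the SDK, the documented REST API (see API Reference below) works with any HTTP client. A minimal sketch using the requests package, assuming the server from the Quick Start is running on localhost:8080:

import requests

# Run a single evaluation against the documented /evals/run endpoint.
payload = {
    "model": "gemini:gemini-2.5-flash-latest",
    "prompt": "What is the capital of France?",
    "expected": "Paris",
    "judge_model": "gemini:gemini-2.5-pro-latest",
    "criteria": "Does the output correctly name the capital city?",
}
resp = requests.post("http://localhost:8080/api/v1/evals/run", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["status"])  # e.g. "passed"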

Quick Start

Prerequisites

You'll need:

  • Docker (recommended) OR Rust/Cargo
  • API keys for your LLM provider(s)

If you use Ollama:

ollama pull llama3

Setup Environment Variables

Create a .env file in your project root (see env.example):

DATABASE_URL=sqlite:./data/evals.db
GEMINI_API_BASE=https://generativelanguage.googleapis.com
GEMINI_API_KEY=AIzaxxxxxxxxxxxxxxxxxxxxxxxxxxc
GEMINI_MODELS=gemini-2.5-pro,gemini-2.5-flash
OLLAMA_API_BASE=http://host.docker.internal:11434
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx
OPENAI_MODELS=gpt-4o,gpt-4o-mini,gpt-3.5-turbo
ANTHROPIC_API_KEY=sk-placeholder-ant-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6-a1b2c3d4e5f6
ANTHROPIC_MODELS=claude-opus-4,claude-sonnet-4-5,claude-haiku-4
RUST_LOG=info

Installation Options

Option 1: Docker (Recommended)

Build the image:

docker build -t evaluate:latest .

Run on Linux:

docker run --rm -it \
  --network host \
  --env-file .env \
  -v $(pwd)/data:/usr/local/bin/data \
  -e OLLAMA_API_BASE=http://localhost:11434 \
  evaluate:latest

Run on Mac:

docker run --rm -it -p 8080:8080 \
  --env-file .env \
  -v $(pwd)/data:/usr/local/bin/data \
  evaluate:latest

Run on Windows (PowerShell):

docker run --rm -it -p 8080:8080 `
  --env-file .env `
  -v ${PWD}/data:/usr/local/bin/data `
  evaluate:latest

Option 2: Install from Source

# 1. Clone the repository
git clone git@github.com:RGGH/evaluate.git

# 2. Navigate into the project directory
cd evaluate

# 3. Run with Cargo (requires Rust/Cargo installed)
cargo run

You should see output similar to:

[INFO] Starting database migration...
[INFO] Starting server at 127.0.0.1:8080

Access the application at http://localhost:8080
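To confirm the server is up, you can hit the health endpoint. A quick sketch in Python (curl works just as well; requests is not a project dependency):

import requests

# Liveness check against the documented /health endpoint.
resp = requests.get("http://localhost:8080/api/v1/health", timeout=5)
print(resp.json())  # e.g. {"status": "healthy", "service": "eval-api", "version": "..."}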

Usage Examples

Single Evaluation (API)

Gemini Example:

curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
  "model": "gemini-2.5-pro",             
  "prompt": "What is the capital of France?", 
  "expected": "Paris",
  "judge_model": "gemini-2.5-pro",             
  "criteria": "Does the output correctly name the capital city?"
}' | jq

Response:

{
  "id": "619cd32a-4376-4969-ac48-0f25b37bc933",
  "status": "passed",
  "result": {
    "model": "gemini-2.5-pro",
    "prompt": "What is the capital of France?",
    "model_output": "The capital of France is **Paris**.",
    "expected": "Paris",
    "judge_result": {
      "judge_model": "gemini-2.5-pro",
      "verdict": "Pass",
      "reasoning": "Verdict: PASS\n\nThe actual output correctly names Paris as the capital city...",
      "confidence": null
    },
    "timestamp": "2024-07-29T10:30:00.123456789+00:00"
  },
  "error": null
}

Ollama Example:

curl -X POST http://127.0.0.1:8080/api/v1/evals/run \
-H "Content-Type: application/json" \
-d '{
  "model": "ollama:llama3",
  "prompt": "What is the capital of France?",
  "expected": "Paris",
  "judge_model": "ollama:llama3",
  "criteria": "Does the output correctly name the capital city?"
}' | jq

Batch Evaluations

For batch evals, you can set the provider directly in the JSON file using the generic provider:model_name syntax; each entry looks like:

{
  "model": "gemini:gemini-2.5-flash-latest",
  "prompt": "What is the capital of France?",
  "expected": "Paris",
  "judge_model": "gemini:gemini-2.5-pro-latest"
}

Call the /api/v1/evals/batch endpoint:

curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_sample.json' | jq
curl -X POST http://127.0.0.1:8080/api/v1/evals/batch \
-H "Content-Type: application/json" \
-d '@qa_f1.json' | jq
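The batch endpoint can also be driven programmatically by posting the array of EvalConfig entries inline instead of from a file. A hedged sketch in Python (the two entries below are illustrative, not the contents of qa_sample.json or qa_f1.json):

import requests

# Build a small batch of EvalConfig entries and post them to /evals/batch.
batch = [
    {
        "model": "gemini:gemini-2.5-flash-latest",
        "prompt": "What is the capital of France?",
        "expected": "Paris",
        "judge_model": "gemini:gemini-2.5-pro-latest",
    },
    {
        "model": "ollama:llama3",
        "prompt": "What is 2+2?",
        "expected": "4",
        "judge_model": "ollama:llama3",
    },
]
resp = requests.post("http://localhost:8080/api/v1/evals/batch", json=batch, timeout=600)
resp.raise_for_status()
summary = resp.json()
print(f"{summary['passed']} of {summary['total']} evals passed")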

Built-in GUI

Single Eval Interface


History View


Results Dashboard


API Reference

Base URL: http://localhost:8080/api/v1

Health & System

Method | Endpoint | Description | Response
GET | /health | Health check endpoint | {"status": "healthy", "service": "eval-api", "version": "..."}
GET | /models | List all available models | {"models": ["gemini:model-name", "ollama:model-name", ...]}

Evaluations

Method | Endpoint | Description | Request Body
POST | /evals/run | Run a single evaluation | RunEvalRequest
POST | /evals/batch | Run multiple evaluations concurrently | Array of EvalConfig
GET | /evals/history | Get all evaluation history | -
GET | /evals/{id} | Get specific evaluation result | -
GET | /evals/{id}/status | Get evaluation status | -
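For longer runs you can poll the status endpoint until the evaluation settles. A sketch under the assumption that the status payload includes the status field documented under EvalResponse below; the exact shape may differ:

import time
import requests

BASE = "http://localhost:8080/api/v1"
TERMINAL = {"passed", "failed", "uncertain", "completed", "error"}

def wait_for_eval(eval_id: str, timeout_s: float = 120.0) -> dict:
    """Poll /evals/{id}/status until a terminal status, then fetch the full result."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{BASE}/evals/{eval_id}/status", timeout=10).json()
        if status.get("status") in TERMINAL:
            return requests.get(f"{BASE}/evals/{eval_id}", timeout=10).json()
        time.sleep(1.0)
    raise TimeoutError(f"evaluation {eval_id} did not finish within {timeout_s}s")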

Judge Prompts

Method | Endpoint | Description | Request Body
GET | /judge-prompts | Get all judge prompt versions | -
GET | /judge-prompts/active | Get the currently active judge prompt | -
GET | /judge-prompts/{version} | Get a specific judge prompt by version | -
POST | /judge-prompts | Create a new judge prompt version | CreateJudgePromptRequest
PUT | /judge-prompts/active | Set a judge prompt version as active | {"version": 2}

Judge Prompt Examples

Get all judge prompts:

curl http://localhost:8080/api/v1/judge-prompts

Create a new judge prompt:

curl -X POST http://localhost:8080/api/v1/judge-prompts \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Strict Evaluator",
    "template": "Compare:\nExpected: {{expected}}\nActual: {{actual}}\nVerdict: PASS or FAIL",
    "description": "Requires exact semantic match",
    "set_active": true
  }'

Set a version as active:

curl -X PUT http://localhost:8080/api/v1/judge-prompts/active \
  -H "Content-Type: application/json" \
  -d '{"version": 2}'

Experiments

Method | Endpoint | Description | Request Body
POST | /experiments | Create a new experiment | CreateExperimentRequest
GET | /experiments/{id} | Get experiment details | -

WebSocket

Protocol | Endpoint | Description
WS | /ws | Real-time evaluation updates

Connect to WebSocket:

const ws = new WebSocket('ws://localhost:8080/api/v1/ws');
ws.onmessage = (event) => {
  const update = JSON.parse(event.data);
  console.log('Eval update:', update);
};
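The same stream can be consumed from Python. A minimal sketch using the third-party websockets package (an assumption for illustration, not a dependency of this project):

import asyncio
import json
import websockets  # pip install websockets

async def watch_updates():
    # Subscribe to real-time evaluation updates on the documented /ws endpoint.
    async with websockets.connect("ws://localhost:8080/api/v1/ws") as ws:
        async for message in ws:
            update = json.loads(message)
            print("Eval update:", update)

asyncio.run(watch_updates())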

Request/Response Schemas

RunEvalRequest

{
  "model": "gemini:gemini-2.5-flash-latest",
  "prompt": "What is 2+2?",
  "expected": "4",
  "judge_model": "gemini:gemini-1.5-pro-latest",
  "criteria": "The output should be mathematically correct"
}

Fields:

  • model (required): Model identifier in format provider:model_name
  • prompt (required): The prompt to send to the model
  • expected (optional): Expected output for comparison
  • judge_model (optional): Judge model for LLM-as-a-judge evaluation
  • criteria (optional): Custom evaluation criteria

EvalConfig

{
  "model": "openai:gpt-4o",
  "prompt": "Explain quantum computing",
  "expected": "Quantum computing uses quantum bits...",
  "judge_model": "gemini:gemini-2.5-pro-latest",
  "criteria": "The explanation should be accurate and accessible",
  "tags": ["physics", "computing"],
  "metadata": {
    "category": "science",
    "difficulty": "advanced"
  }
}

EvalResponse

{
  "id": "uuid-string",
  "status": "passed",
  "result": {
    "model": "gemini:gemini-2.5-flash-latest",
    "prompt": "What is 2+2?",
    "model_output": "2+2 equals 4",
    "expected": "4",
    "judge_result": {
      "judge_model": "gemini:gemini-2.5-pro-latest",
      "verdict": "Pass",
      "reasoning": "The output correctly identifies that 2+2 equals 4...",
      "confidence": null
    },
    "timestamp": "2025-10-15T12:34:56Z",
    "latency_ms": 450,
    "judge_latency_ms": 320,
    "total_latency_ms": 770
  },
  "error": null
}

Status values: "passed", "failed", "uncertain", "completed", "error"

Verdict values: "Pass", "Fail", "Uncertain"

BatchEvalResponse

{
  "batch_id": "uuid-string",
  "status": "completed",
  "total": 10,
  "completed": 10,
  "passed": 8,
  "failed": 2,
  "average_model_latency_ms": 425,
  "average_judge_latency_ms": 315,
  "results": []
}
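A typical use of this response is gating a test run on its pass rate. A small illustrative helper based only on the passed and total fields documented above:

def pass_rate(batch: dict) -> float:
    """Return the fraction of passed evals in a BatchEvalResponse."""
    total = batch.get("total", 0)
    return batch.get("passed", 0) / total if total else 0.0

assert pass_rate({"total": 10, "passed": 8}) == 0.8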

Other Schemas

See full documentation for:

  • HistoryResponse
  • CreateExperimentRequest
  • ExperimentResponse
  • CreateJudgePromptRequest
  • JudgePrompt
  • EvalUpdate (WebSocket)

Supported Models

Models are specified in the format provider:model_name:

Gemini:

  • gemini:gemini-2.5-flash-latest
  • gemini:gemini-2.5-pro-latest

Ollama:

  • ollama:llama3
  • ollama:gemma

OpenAI:

  • openai:gpt-4o
  • openai:gpt-4o-mini
  • openai:gpt-3.5-turbo

Anthropic:

  • anthropic:claude-opus-4
  • anthropic:claude-sonnet-4
  • anthropic:claude-sonnet-4-5
  • anthropic:claude-haiku-4

If no provider is specified, gemini is used as the default.
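The provider prefix is just a convention on the model string. A small illustrative sketch of how a client might split it, with gemini as the fallback when no prefix is given (the server's own parsing may differ):

def split_model_id(model: str, default_provider: str = "gemini") -> tuple[str, str]:
    """Split 'provider:model_name' into (provider, model_name)."""
    provider, sep, name = model.partition(":")
    return (provider, name) if sep else (default_provider, model)

assert split_model_id("ollama:llama3") == ("ollama", "llama3")
assert split_model_id("gemini-2.5-pro") == ("gemini", "gemini-2.5-pro")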

Understanding LLM-as-Judge Limitations

Knowledge Recency

One major limitation of LLMs is knowledge recency. Since these models are trained on fixed datasets that quickly become outdated, they often struggle with topics that rely on the latest information, such as new laws, policies, or medical guidance. This means their judgements can be based on old or irrelevant data, leading to unreliable results. To keep them up to date, techniques like retrieval-augmented generation (RAG), regular fine-tuning, and continual learning can help ensure LLMs-as-judges have access to the most current knowledge when making decisions.

Hallucination

Another key weakness is hallucination, where LLMs confidently generate information that isn't true. In an evaluation context, this could mean inventing fake references, misinterpreting facts, or fabricating evidence, all of which can undermine trust in their output. Building in robust fact-checking systems that verify claims against reliable sources is essential to reduce the impact of these errors and maintain fairness in judgement.

Domain-Specific Knowledge Gaps

Lastly, LLMs often face domain-specific knowledge gaps. While they're great generalists, they can lack the deep understanding needed for complex areas like law, finance, or medicine. Integrating domain-specific knowledge graphs or using RAG to pull in expert information can help bridge this gap, allowing them to deliver more accurate and context-aware evaluations.

Contributing

Thank you for your interest in contributing! 🎉

We welcome contributions of all kinds: bug fixes, improvements, documentation, examples, or new features. 🦀 Rust, Python, and front-end JS/TS contributions are all welcome. See current issues for ideas.

How to Contribute

  1. Fork the repository and create a new branch for your changes
  2. Make your changes with clear, descriptive commit messages
  3. Open a Pull Request explaining what you've done and why

Please make sure your code follows the existing style and passes any tests. For larger changes, feel free to open an issue first to discuss your approach.

By contributing, you agree that your work will be licensed under this project's license.

Thank you for helping make this project better! 💡

Roadmap

  • Image Classifier Evals

Application Demo