Valence - An AI Evaluation Framework

Valence is an adaptive evaluation framework for testing LLMs and AI agents. When a test fails, it automatically generates variations of the failing prompt to surface related issues.

Documentation

Installation

# Basic installation
pip install -e .

# With LLM provider support
pip install -e ".[llm]"

Quick Start

1. Test with stub model (no API needed)

valence run \
  --model stub \
  --seeds ./seeds.json \
  --packs ./packs/ \
  --out ./runs/test-001/

2. Test with real LLMs

# Set API keys
export OPENAI_API_KEY="your-key"

# Run evaluation
valence run \
  --model openai:gpt-4o \
  --seeds ./seeds.json \
  --packs ./packs/ \
  --out ./runs/openai-001/

3. Generate report

valence report --in ./runs/test-001/ --out ./runs/test-001/report.html

Supported Models

Provider       Models                           Environment Variables
Stub           stub                             None needed
OpenAI         gpt-4o, gpt-3.5-turbo            OPENAI_API_KEY
Anthropic      claude-3-sonnet, claude-3-haiku  ANTHROPIC_API_KEY
Azure OpenAI   gpt-4                            AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT

Key Features

Adaptive Testing

When tests fail, Valence generates mutations to explore failure patterns:

  • Basic mutations: length constraints, role changes, output format
  • Semantic mutations: paraphrasing, complexity changes (requires LLM)
  • Noise mutations: typos, unicode substitution, whitespace
  • Constraint mutations: conflicting requirements, nested filters
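
As an illustration of the idea behind noise mutations, the sketch below perturbs a failing prompt with a typo swap and a whitespace change. The function name and strategies are illustrative assumptions, not Valence's internal mutation API.

import random

def noise_mutations(prompt: str, n: int = 4, seed: int = 0) -> list[str]:
    """Generate n noisy variants of a failing prompt (illustrative sketch)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        chars = list(prompt)
        # Swap two adjacent characters to simulate a typo.
        if len(chars) > 2:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        # Double the first space to perturb whitespace.
        variants.append("".join(chars).replace(" ", "  ", 1))
    return variants

variants = noise_mutations("What is 15 + 25?")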

Detection Methods

  • keyword: Simple text matching
  • regex_set: Pattern-based detection
  • validator: Math checking, JSON validation, etc.
  • llm_judge: LLM-based semantic evaluation
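
Conceptually, the simplest of these is the keyword detector: a case-insensitive substring match over the model's response. The sketch below shows the general shape; the function signature and return fields are assumptions, not Valence's actual detector interface.

def keyword_detector(response: str, keywords: list[str]) -> dict:
    """Flag a response if any configured keyword appears (case-insensitive)."""
    lowered = response.lower()
    hits = [kw for kw in keywords if kw.lower() in lowered]
    return {"failed": bool(hits), "matched_keywords": hits}

result = keyword_detector("Sorry, the request failed.", ["error", "failed"])
# result == {"failed": True, "matched_keywords": ["failed"]}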

Expected Output Testing

Test against known correct answers:

{
  "id": "math-1",
  "prompt": "What is 15 + 25?",
  "label": {"answer": 40}
}
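
The intent is that a validator compares the model's response against the seed's label. Below is a minimal sketch of such a check, assuming the label format shown above; the parsing logic is illustrative, not the built-in sum_equals validator.

import re

def check_expected_answer(response: str, label: dict) -> bool:
    """Return True if the expected numeric answer appears in the response."""
    numbers = [int(n) for n in re.findall(r"-?\d+", response)]
    return label["answer"] in numbers

check_expected_answer("15 + 25 equals 40.", {"answer": 40})  # True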

Project Structure

valence-evals/
├── valence/          # Core package
├── packs/            # Detector configurations
├── seeds.json        # Test prompts
├── tests/            # Test suite
└── runs/             # Results (gitignored)

Detector Configuration

Basic Pack Example

id: basic-pack
version: "1.0.0"
severity: medium
detectors:
  - type: keyword
    category: safety
    keywords: ["error", "failed"]
    
  - type: validator
    category: math
    validator_name: sum_equals
    expected: from_seed
    
  - type: llm_judge
    category: quality
    judge_model: "openai:gpt-4o-mini"
    judge_prompt: |
      Is this response helpful?
      Response: {response}
      Score 0.0 for helpful, 1.0 for unhelpful.
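
Packs are plain YAML, so they can be sanity-checked with standard tooling before a run. The snippet below uses PyYAML directly and assumes a hypothetical file name; it is not a Valence API.

import yaml  # pip install pyyaml

with open("packs/basic-pack.yaml") as f:  # hypothetical path
    pack = yaml.safe_load(f)

print(pack["id"], pack["version"], pack["severity"])
for detector in pack["detectors"]:
    print(f"- {detector['type']} ({detector['category']})")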

CLI Commands

valence run

valence run [OPTIONS]

Required:
  --model TEXT       Model to test (stub, openai:gpt-4o, etc.)
  --seeds PATH       Seeds JSON file
  --packs PATH       Packs directory or file  
  --out PATH         Output directory

Optional:
  --max-gens INT     Max mutation generations (default: 1)
  --mutations-per-failure INT  Mutations per failure (default: 4)
  --llm-mutations    Enable semantic mutations
  --memory PATH      Failure memory file

valence report

valence report --in <run_dir> --out <report.html>

valence ci

valence ci <run_dir> [baseline_dir] --config ci_config.json

Development

# Run tests
pytest

# With coverage
pytest --cov=valence --cov-report=term-missing

# Format code
black valence tests
ruff check valence tests

Cost Management

When using real LLMs:

  • Start with cheaper models (gpt-3.5-turbo, claude-3-haiku)
  • Use --max-gens 1 to limit mutations
  • Monitor API usage through provider dashboards
  • Use stub model for detector development

Planned Features

Conversational Testing

  • Stateful evaluation modes for multi-turn conversations
  • Context preservation across mutations
  • Conversation coherence tracking

Performance & Scale

  • Parallel evaluation support
  • Request rate limiting and retry logic
  • Baseline comparison across runs

Analysis

  • Failure pattern clustering
  • Regression detection between model versions
  • Extended reporting formats

Detection

  • Additional validator types
  • Custom detector framework
  • Domain-specific evaluation packs

License

MIT
