Note: This project is an adapted version of jmanhype/ace-playbook. We extend our gratitude to the original authors for their foundational work on the Generator-Reflector-Curator pattern for self-improving LLM systems.
Self-improving LLM system using the Generator-Reflector-Curator pattern for online learning from execution feedback.
- Architecture
- Key Features
- Guardrails as High-Precision Sensors
- Quick Start
- REST API
- MCP Server (Claude Integration)
- Benchmarking & Runtime Adaptation
- Release Notes
- Project Structure
- Development
- Documentation
- Acknowledgments
Generator-Reflector-Curator Pattern:
- Generator: DSPy ReAct/CoT modules that execute tasks using playbook strategies
- Reflector: Analyzes outcomes and extracts labeled insights (Helpful/Harmful/Neutral)
- Curator: Pure Python semantic deduplication with FAISS (0.8 cosine similarity threshold)
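As a rough sketch of the curation step (hypothetical helper and field names; the real curator goes through FAISS and the repository layer), an incoming insight is embedded, compared against existing bullets, and either merged by bumping counters or appended as a new bullet:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # cosine similarity cut-off for deduplication

def curate(text: str, vec: np.ndarray, playbook: list[dict]) -> None:
    """Append-only curation: near-duplicates only bump counters, content is never rewritten."""
    for bullet in playbook:
        other = np.asarray(bullet["vec"])
        cos = float(vec @ other) / (np.linalg.norm(vec) * np.linalg.norm(other))
        if cos >= SIMILARITY_THRESHOLD:
            bullet["helpful"] += 1  # simplified: real deltas carry Helpful/Harmful/Neutral labels
            return
    playbook.append({"text": text, "vec": vec, "helpful": 1, "harmful": 0})
```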
- Append-only playbook: Never rewrite bullet content, only increment counters
- Semantic deduplication: 0.8 cosine similarity threshold prevents context collapse
- Staged rollout: shadow → staging → prod with automated promotion gates
- Multi-domain isolation: Per-tenant namespaces with separate FAISS indices
- Rollback procedures: <5 minute automated rollback on regression detection
- Performance budgets: ≤10ms P50 playbook retrieval, ≤+15% end-to-end overhead
- Observability metrics: Prometheus-format metrics for monitoring (T065)
- Guardrail monitoring: Automated rollback on performance regression (T066)
- Docker support: Full containerization with Docker Compose (T067)
- E2E testing: Comprehensive smoke tests for production readiness (T068)
- Runtime adaptation: Merge coordinator + runtime adapter enable in-flight learning with optional benchmark harness
- Async task queue: SQLite-backed queue for non-blocking memory formation (~50ms response)
- FAISS IVF+PQ optimization: Sub-millisecond queries at 100K+ vectors with automatic index upgrades
- Multi-user authentication: API key auth with usage tracking and rate limiting
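For the staged rollout, the promotion gates can be pictured with a small sketch (thresholds match the defaults shown in the Quick Start configuration; the function and stage names are illustrative, not the exact API):

```python
STAGING_HELPFUL_MIN, STAGING_RATIO_MIN = 3, 3.0
PROD_HELPFUL_MIN, PROD_RATIO_MIN = 5, 5.0

def next_stage(stage: str, helpful: int, harmful: int) -> str:
    """Promote shadow -> staging -> prod once helpful counts and helpful/harmful ratios clear the gates."""
    ratio = helpful / max(harmful, 1)  # avoid division by zero when nothing harmful was recorded
    if stage == "shadow" and helpful >= STAGING_HELPFUL_MIN and ratio >= STAGING_RATIO_MIN:
        return "staging"
    if stage == "staging" and helpful >= PROD_HELPFUL_MIN and ratio >= PROD_RATIO_MIN:
        return "prod"
    return stage
```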
ACE turns tiny heuristic checks into reusable guardrails without manual babysitting:
- Detect: Domain heuristics (e.g., ±0.4% drift, missing "%") label a generator trajectory as a precise failure mode.
- Distill: The reflector converts that signal into a lesson (“round to whole percent and append %”).
- Persist: The curator records a typed delta with helpful/harmful counters and merges it into the playbook.
- Reuse: Runtime adapter + merge coordinator surface the tactic immediately so later tasks cannot repeat the mistake.
This loop mirrors the +8.6% improvements reported on FiNER/XBRL benchmarks—subtle finance errors become actionable context upgrades instead of one-off patches.
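For the percent-formatting example above, the detect step can be a few lines of plain Python (a hedged sketch; the shipped heuristics live in ace/utils/finance_guardrails.py and differ in detail):

```python
def detect_percent_failure(raw_answer: str, expected: float, tolerance: float = 0.4) -> str | None:
    """Label a trajectory with a precise failure mode, or return None if the answer passes."""
    text = raw_answer.strip()
    if not text.endswith("%"):
        return "missing-percent-sign"
    try:
        value = float(text.rstrip("%"))
    except ValueError:
        return "unparseable-percent"
    if abs(value - expected) > tolerance:
        return "percent-drift"
    return None  # nothing for the reflector to distill
```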
- Python 3.11 or higher
- pip or uv package manager
- OpenAI API key, Anthropic API key, or OpenRouter API key
Copy .env.example to .env and configure the following:
Required API Keys (at least one):
OPENAI_API_KEY=sk-... # OpenAI API
ANTHROPIC_API_KEY=sk-ant-... # Anthropic API
OPENROUTER_API_KEY=sk-or-... # OpenRouter (multi-model gateway)

Database Configuration:
DATABASE_URL=sqlite:///ace_playbook.db # SQLite (default) or PostgreSQL
DATABASE_WAL_MODE=true # SQLite WAL for better concurrency

Embedding & Search:
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 # Embedding model
EMBEDDING_DIMENSION=384 # Vector dimensions
FAISS_INDEX_TYPE=IndexFlatIP # FAISS index type
SIMILARITY_THRESHOLD=0.8 # Deduplication threshold

Promotion Gates (bullet lifecycle):
SHADOW_HELPFUL_MIN=0 # Min helpful for shadow stage
STAGING_HELPFUL_MIN=3 # Min helpful for staging promotion
PROD_HELPFUL_MIN=5 # Min helpful for production promotion
STAGING_RATIO_MIN=3.0 # Min helpful/harmful ratio for staging
PROD_RATIO_MIN=5.0 # Min helpful/harmful ratio for production

Monitoring & Logging:
LOG_LEVEL=INFO # Logging verbosity
LOG_FORMAT=json # Log format (json or text)
METRICS_ENABLED=true # Enable Prometheus metrics
METRICS_PORT=9090 # Metrics endpoint port

# Clone the repository
git clone https://github.com/your-username/ace-playbook.git
cd ace-playbook
# Option 1: Install with pip (standard)
pip install -e ".[dev]"
# Option 2: Install with uv (faster)
pip install uv
uv pip install -e ".[dev]"
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# OPENROUTER_API_KEY=sk-or-... (alternative)
# Initialize the database
alembic upgrade head
# Verify installation with smoke tests
pytest tests/e2e/test_smoke.py -v
# Run a simple example
python examples/arithmetic_learning.py

The Agent Learning (Early Experience) harness now lives in this repository under
ace/agent_learning. It reuses the ACE runtime client, curator, and metrics
stack to run a live loop that streams experience back into the playbook. See
docs/combined_quickstart.md for a walkthrough
and run the demo script with:
python examples/live_loop_quickstart.py
# Or run with your configured DSPy backend
python examples/live_loop_quickstart.py --backend dspy --episodes 10

Environment checklist:
- `OPENROUTER_API_KEY` (preferred), `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY`
- `DATABASE_URL` (defaults to `sqlite:///ace_playbook.db`)
- Optional: `OPENROUTER_MODEL` if you want to experiment with different hosted LLMs
# Create .env file with your API keys
echo "OPENAI_API_KEY=sk-..." > .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
# Start services
docker-compose up -d
# View logs
docker-compose logs -f ace
# Stop services
docker-compose down

# Export Prometheus metrics
from ace.ops import get_metrics_collector
collector = get_metrics_collector()
print(collector.export_prometheus())

# Check for performance regressions
from ace.ops import create_guardrail_monitor
monitor = create_guardrail_monitor(session)
trigger = monitor.check_guardrails("customer-acme")
if trigger:
print(f"Rollback triggered: {trigger.reason}")ACE Playbook provides a FastAPI-based REST API for memory formation and retrieval.
# Option 1: Use the convenience script (recommended for development)
python run_api_server.py
# Runs on http://localhost:8000 with auto-reload enabled
# Option 2: Run with uvicorn directly
python -m uvicorn ace.api.app:app --host 0.0.0.0 --port 8000
# Option 3: With auto-reload for development
python -m uvicorn ace.api.app:app --reload

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with service status |
| `/api/memory/form` | POST | Synchronous memory formation (~30s) |
| `/api/memory/form-async` | POST | Async memory formation (~50ms response) |
| `/api/memory/status/{task_id}` | GET | Poll async task status |
| `/api/memory/queue-stats` | GET | Task queue statistics |
| `/api/memory/retrieve` | POST | Retrieve relevant memories |
For production use, the async endpoints provide non-blocking memory formation:
# 1. Queue a memory formation task (returns immediately)
curl -X POST http://localhost:8000/api/memory/form-async \
-H "Authorization: Bearer ace_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"task_id": "task-001",
"domain_id": "user-123-python",
"reasoning_trace": ["Step 1: Analyze", "Step 2: Implement"],
"answer": "def foo(): pass",
"ground_truth": "def foo(): pass",
"confidence": 0.9,
"domain": "python-coding"
}'
# Response: {"queue_task_id": "abc-123", "status": "queued"}
# 2. Poll for completion
curl http://localhost:8000/api/memory/status/abc-123 \
-H "Authorization: Bearer ace_your_api_key"
# Response: {"status": "completed", "result": {"bullets_added": 2, ...}}
# 3. Check queue health
curl http://localhost:8000/api/memory/queue-stats \
-H "Authorization: Bearer ace_your_api_key"
# Response: {"pending": 0, "processing": 1, "completed": 50, "failed": 0, "worker_running": true}Benefits of Async Processing:
- 50ms response time instead of 30s blocking
- Automatic retries on transient failures (up to 3 attempts)
- Queue persistence across API restarts
- Concurrent processing via background worker thread
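The queue-and-poll cycle from the curl commands above can be wrapped in a small client helper; this sketch assumes the `requests` package is installed and reuses the response fields shown earlier (`queue_task_id`, `status`):

```python
import time
import requests

BASE_URL = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer ace_your_api_key"}

def form_memory_async(payload: dict, timeout_s: float = 60.0) -> dict:
    """Queue a memory-formation task, then poll /api/memory/status/{task_id} until it settles."""
    queued = requests.post(f"{BASE_URL}/api/memory/form-async", json=payload, headers=HEADERS).json()
    task_id = queued["queue_task_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE_URL}/api/memory/status/{task_id}", headers=HEADERS).json()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(1.0)
    raise TimeoutError(f"task {task_id} still pending after {timeout_s}s")
```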
ACE Playbook uses FAISS for high-performance vector similarity search with automatic index optimization:
| Vector Count | Index Type | Query Time | Memory |
|---|---|---|---|
| <1,000 | IndexFlatIP (exact) | ~1ms | Full |
| 1,000-10,000 | IndexIVFFlat | ~0.5ms | Full |
| >10,000 | IndexIVFPQ | ~0.3ms | 10-100x reduced |
Installing FAISS (optional but recommended for production):
# With pip (may require compilation)
pip install faiss-cpu
# With conda (pre-built binaries)
conda install -c pytorch faiss-cpu
# On macOS with Homebrew
brew install faiss && pip install faiss-cpu

Without FAISS, the system automatically falls back to NumPy-based search (suitable for <10K vectors).
ACE Playbook includes an MCP (Model Context Protocol) server that exposes memory functionality as tools for Claude agents.
| Tool | Description |
|---|---|
| `recall_memories` | Retrieve relevant memories via semantic search |
| `teach_memory` | Store learned patterns from interactions |
| `forget_memory` | Delete or quarantine specific memories |
# Option 1: Use the convenience script (recommended)
python run_mcp_server.py
# Option 2: With SSE transport for development/testing
python run_mcp_server.py --transport sse --port 8765
# Option 3: With debug mode and custom settings
python run_mcp_server.py --debug --log-level DEBUG
# Option 4: Run with module directly
python -m ace.mcp.server

| Option | Default | Description |
|---|---|---|
| `--transport` | `stdio` | Transport mode: stdio (Claude Desktop) or sse (development) |
| `--host` | `127.0.0.1` | Host to bind for SSE transport |
| `--port` | `8765` | Port to bind for SSE transport |
| `--debug` | `false` | Enable debug mode |
| `--log-level` | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--database-url` | `sqlite:///ace_playbook.db` | Database connection URL |
| `--reflector-model` | `gpt-4o-mini` | LLM model for Reflector |
| `--similarity-threshold` | `0.8` | Semantic deduplication threshold |
All settings can be configured via environment variables with the `ACE_MCP_` prefix:
# Database
ACE_MCP_DATABASE_URL=sqlite:///custom.db
# Server settings
ACE_MCP_SERVER_HOST=127.0.0.1
ACE_MCP_SERVER_PORT=8765
ACE_MCP_TRANSPORT=stdio
ACE_MCP_DEBUG=false
ACE_MCP_LOG_LEVEL=INFO
# LLM settings
ACE_MCP_REFLECTOR_MODEL=gpt-4o-mini
ACE_MCP_SIMILARITY_THRESHOLD=0.8
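Since ace/mcp/config.py builds on pydantic-settings (see the project structure below), the prefix handling looks roughly like this; the field list is a sketch inferred from the variables above, not the exact class:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class MCPServerConfig(BaseSettings):
    """Reads ACE_MCP_* environment variables; unset fields fall back to these defaults."""
    model_config = SettingsConfigDict(env_prefix="ACE_MCP_")

    database_url: str = "sqlite:///ace_playbook.db"
    server_host: str = "127.0.0.1"
    server_port: int = 8765
    transport: str = "stdio"
    debug: bool = False
    log_level: str = "INFO"
    reflector_model: str = "gpt-4o-mini"
    similarity_threshold: float = 0.8

config = MCPServerConfig()  # e.g. ACE_MCP_SERVER_PORT=9000 overrides server_port
```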
Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py"],
      "cwd": "/path/to/ace-playbook",
      "env": {
        "ACE_MCP_DATABASE_URL": "sqlite:///ace_playbook.db",
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Add to your Claude Code settings or .mcp.json:
{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py", "--debug"],
      "cwd": "/path/to/ace-playbook"
    }
  }
}

Use the benchmark harness to compare variants and capture guardrail activity. Detailed notes live in docs/runtime_benchmarks.rst; aggregated numbers are tracked in benchmarks/RESULTS.md alongside links to the GitHub Action artifacts.
# Baseline: Chain-of-Thought generator only
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl baseline --output results/baseline_finance_subset.json
# Full ACE stack: ReAct generator + runtime adapter + merge coordinator + refinement scheduler
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full --output results/ace_full_finance_subset.json
# ACE vs baseline live loop comparison (ACE + EE harness)
python benchmarks/run_live_loop_benchmark.py --backend dspy --episodes 10
# Trigger the CI workflow (optional)
gh workflow run ace-benchmark.yml
# The matrix covers finance (easy + hard, GT/no-GT), agent-hard, and finance ablations.
# Each job uploads `ace-benchmark-<matrix.name>` under `results/actions/<run-id>/`.
# Audit agent heuristics locally (sample 20 tasks)
python scripts/audit_agent_scoring.py benchmarks/agent_small.jsonl --sample 20
# Hard finance split (Table 2 replication)
ACE_BENCHMARK_TEMPERATURE=1.3 \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl baseline \
--output results/benchmark/baseline_finance_hard.json
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_gt.json
ACE_BENCHMARK_USE_GROUND_TRUTH=false ACE_BENCHMARK_TEMPERATURE=0.6 \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_gt.json
# Finance ablations (Table 2 component analysis)
ACE_ENABLE_REFLECTOR=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_reflector.json
ACE_MULTI_EPOCH=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_multiepoch.json
ACE_OFFLINE_WARMUP=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_warmup.json
# Agent/AppWorld hard split with conservative heuristics
ACE_BENCHMARK_TEMPERATURE=1.3 \
python scripts/run_benchmark.py benchmarks/agent_hard.jsonl baseline \
--output results/benchmark/baseline_agent_hard.json
python scripts/run_benchmark.py benchmarks/agent_hard.jsonl ace_full \
--output results/benchmark/ace_agent_hard.json
# Quickly sanity-check heuristic thresholds on harder agent tasks
python scripts/audit_agent_scoring.py benchmarks/agent_hard.jsonl --sample 20

Key metrics in the JSON output:
- `correct` / `total` – benchmark score
- `promotions`, `new_bullets`, `increments` – curator activity
- `auto_corrections` – guardrail canonical replacements (e.g., finance rounding)
- `format_corrections` – post-process clamps that strip extra words but retain the raw answer for reflection
- `agent_feedback_log` – path to the per-task ledger (`*.feedback.jsonl`) emitted for every run
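Assuming the run output is a flat JSON object with those keys (a sketch of the shape, not a guaranteed schema), a quick summary can be printed like this:

```python
import json
from pathlib import Path

result = json.loads(Path("results/ace_full_finance_subset.json").read_text())
print(f"accuracy: {result['correct']}/{result['total']}")
print(f"curator: {result['promotions']} promotions, {result['new_bullets']} new bullets, {result['increments']} increments")
print(f"guardrails: {result['auto_corrections']} auto-corrections, {result['format_corrections']} format clamps")
print(f"per-task ledger: {result['agent_feedback_log']}")
```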
Populate or refresh benchmarks/RESULTS.md with the numbers emitted by these commands (or the CI artifacts). The guardrails and heuristics default to a fail-closed posture: when they cannot certify an answer they mark it unknown, mirroring the safety constraint highlighted in the paper.
- Edit `ace/utils/finance_guardrails.py` and add an entry to `FINANCE_GUARDRAILS` with `instructions`, `calculator`, and `decimals` (a sketch of such an entry follows this list).
- Set `auto_correct=True` if the calculator should override the raw answer.
- Re-run `scripts/run_benchmark.py` for the relevant dataset.
- Inspect `results/*.json` to confirm the guardrail triggered and push the refreshed artifact.
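For orientation, a new entry could look roughly like the following; the field layout is inferred from the step above, so treat `FINANCE_GUARDRAILS` in ace/utils/finance_guardrails.py as the source of truth:

```python
from ace.utils.finance_guardrails import FINANCE_GUARDRAILS  # assumed import path

FINANCE_GUARDRAILS["gross-margin-pct"] = {
    "instructions": "Report gross margin as a whole percentage with a trailing % sign.",
    "calculator": lambda revenue, cogs: round(100 * (revenue - cogs) / revenue),
    "decimals": 0,
    "auto_correct": True,  # let the calculator's canonical value override the raw answer
}
```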
Pro tip: keep regenerated results in source control so regressions surface in diffs.
- Scaffold stubs – `python scripts/scaffold_domain.py claims-processing`. This creates:
  - `benchmarks/claims-processing.jsonl`
  - `ace/utils/claims-processing_guardrails.py`
  - `docs/domains/claims-processing.rst`
- Populate ground truth – Fill the benchmark file with real tasks (one JSON per line).
- Implement guardrails – Update the guardrail module with instructions, calculators, and `auto_correct` flags.
- Run the benchmark – `python scripts/run_benchmark.py benchmarks/claims-processing.jsonl ace_full --output results/ace_full_claims-processing.json`
- Document & commit – Summarize behavior in the docs stub, review `results/*.json`, and push the changes.
Tip: repeat the harness run periodically (or in CI) so regressions surface immediately.
See docs/release_notes.md for the changelog and upgrade
instructions for the unified ACE + Agent Learning stack. Tag v1.0.0
corresponds to the integration referenced in the companion papers.
ace-playbook/
├── ace/ # Core ACE framework
│ ├── generator/ # DSPy Generator modules (CoT, ReAct)
│ ├── reflector/ # Reflector insight extraction
│ ├── curator/ # Semantic deduplication with FAISS
│ ├── models/ # SQLAlchemy ORM models
│ │ ├── playbook.py # PlaybookBullet with user_id FK
│ │ ├── user.py # User, UserUsage, AuditLog
│ │ └── task_queue.py # TaskQueue for async processing
│ ├── repositories/ # Database access layer
│ ├── services/ # Business logic layer
│ │ ├── auth_service.py # API key authentication
│ │ ├── task_queue_service.py # Async queue management
│ │ └── memory_formation_service.py # G-R-C orchestration
│ ├── workers/ # Background task processing
│ │ └── memory_worker.py # Async memory formation worker
│ ├── api/ # FastAPI REST endpoints
│ │ ├── app.py # API routes (sync + async)
│ │ ├── models.py # Pydantic request/response schemas
│ │ └── middleware.py # Auth + rate limiting
│ ├── mcp/ # MCP Server for Claude integration
│ │ ├── server.py # ACEMemoryMCPServer implementation
│ │ ├── config.py # MCPServerConfig (pydantic-settings)
│ │ └── tools.py # MCP tool implementations
│ ├── utils/ # Infrastructure utilities
│ │ ├── faiss_index.py # FAISS IVF+PQ optimization
│ │ ├── numpy_vector_search.py # NumPy fallback
│ │ └── embeddings.py # Sentence-transformer wrapper
│ └── ops/ # Operations (metrics, guardrails)
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── e2e/ # End-to-end smoke tests
├── examples/ # Usage examples
├── config/ # Configuration files
├── alembic/ # Database migrations
│ └── versions/ # Migration scripts
├── run_api_server.py # Convenience script for API server
├── run_mcp_server.py # Convenience script for MCP server
├── Dockerfile # Container image definition
├── docker-compose.yml # Local development stack
└── docs/ # Additional documentation
Pre-commit hooks automatically run code quality checks before each commit:
# Install pre-commit hooks (one-time setup)
pre-commit install
pre-commit install --hook-type commit-msg
# Run manually on all files
pre-commit run --all-files
# Skip hooks for a specific commit (use sparingly)
git commit --no-verify -m "WIP: temporary commit"

Installed Hooks:
- Code Quality: Black formatting, Ruff linting, isort import sorting, autoflake (unused imports)
- Type Safety: mypy static type checking
- Security: Bandit vulnerability scanning, detect-secrets, Safety (dependency vulnerabilities)
- Documentation: Docstring coverage (interrogate), markdown linting
- Standards: Conventional commits validation, trailing whitespace, end-of-file fixes
- Infrastructure: YAML/JSON/TOML validation, Dockerfile linting, SQL linting
- Testing: pytest coverage ≥80% (on push)
- Complexity: Radon cyclomatic complexity and maintainability index (on push)
- Dead Code: Dead code detection
# Run tests
pytest tests/ -v
# Type checking
mypy ace/
# Code formatting
black ace/ tests/
ruff check ace/ tests/
# Security scan
bandit -r ace/
# Docstring coverage
interrogate -vv ace/

Build and view the complete documentation:
# Build HTML documentation
make docs
# Serve documentation locally
make docs-serve # http://localhost:8000

Available Documentation:
- 📚 API Reference: Auto-generated Sphinx docs for all modules
- 🏗️ Architecture Guide: System design with Mermaid diagrams (docs/architecture.md)
- 🎓 Developer Onboarding: Setup, workflows, and best practices (docs/onboarding.md)
- ⚠️ Edge Cases: Error handling and recovery procedures (docs/edge_cases.md)
- 🚀 Tutorials: Step-by-step guides (docs/tutorials/01-quick-start.rst)
- 📖 Getting Started: Quick installation guide (docs/getting_started.rst)
- Architecture Guide: `docs/architecture.md` - System design and component diagrams
- API Documentation: `docs/api/` - Detailed API reference
- Runbook: `docs/RUNBOOK.md` - Operations guide
- Edge Cases: `docs/edge_cases.md` - Error handling documentation
This project is an adapted version of the original ace-playbook by jmanhype. The original implementation provided the foundation for the Generator-Reflector-Curator pattern used in this system.
Key contributions from the original project:
- Generator-Reflector-Curator architecture design
- Semantic deduplication with FAISS
- Staged rollout (shadow → staging → prod) concept
- Multi-domain isolation approach
MIT