Note: This project is an adapted version of jmanhype/ace-playbook. We extend our gratitude to the original authors for their foundational work on the Generator-Reflector-Curator pattern for self-improving LLM systems.
Self-improving LLM system using the Generator-Reflector-Curator pattern for online learning from execution feedback.
- Architecture
- Key Features
- Guardrails as High-Precision Sensors
- Quick Start
- REST API
- MCP Server (Claude Integration)
- Benchmarking & Runtime Adaptation
- Release Notes
- Project Structure
- Development
- Documentation
- Acknowledgments
Generator-Reflector-Curator Pattern:
- Generator: DSPy ReAct/CoT modules that execute tasks using playbook strategies
- Reflector: Analyzes outcomes and extracts labeled insights (Helpful/Harmful/Neutral)
- Curator: Pure Python semantic deduplication with FAISS (0.8 cosine similarity threshold)
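As a rough sketch of the curation step (hypothetical helper and field names; the real curator goes through FAISS and the repository layer), an incoming insight is embedded, compared against existing bullets, and either merged by bumping counters or appended as a new bullet:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # cosine similarity cut-off for deduplication

def curate(text: str, vec: np.ndarray, playbook: list[dict]) -> None:
    """Append-only curation: near-duplicates only bump counters, content is never rewritten."""
    for bullet in playbook:
        other = np.asarray(bullet["vec"])
        cos = float(vec @ other) / (np.linalg.norm(vec) * np.linalg.norm(other))
        if cos >= SIMILARITY_THRESHOLD:
            bullet["helpful"] += 1  # simplified: real deltas carry Helpful/Harmful/Neutral labels
            return
    playbook.append({"text": text, "vec": vec, "helpful": 1, "harmful": 0})
```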
- Append-only playbook: Never rewrite bullet content, only increment counters
- Semantic deduplication: 0.8 cosine similarity threshold prevents context collapse
- Staged rollout: shadow → staging → prod with automated promotion gates
- Multi-domain isolation: Per-tenant namespaces with separate FAISS indices
- Rollback procedures: <5 minute automated rollback on regression detection
- Performance budgets: ≤10ms P50 playbook retrieval, ≤+15% end-to-end overhead
- Observability metrics: Prometheus-format metrics for monitoring (T065)
- Guardrail monitoring: Automated rollback on performance regression (T066)
- Docker support: Full containerization with Docker Compose (T067)
- E2E testing: Comprehensive smoke tests for production readiness (T068)
- Runtime adaptation: Merge coordinator + runtime adapter enable in-flight learning with optional benchmark harness
- Async task queue: SQLite-backed queue for non-blocking memory formation (~50ms response)
- FAISS IVF+PQ optimization: Sub-millisecond queries at 100K+ vectors with automatic index upgrades
- Multi-user authentication: API key auth with usage tracking and rate limiting
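For the staged rollout, the promotion gates can be pictured with a small sketch (thresholds match the defaults shown in the Quick Start configuration; the function and stage names are illustrative, not the exact API):

```python
STAGING_HELPFUL_MIN, STAGING_RATIO_MIN = 3, 3.0
PROD_HELPFUL_MIN, PROD_RATIO_MIN = 5, 5.0

def next_stage(stage: str, helpful: int, harmful: int) -> str:
    """Promote shadow -> staging -> prod once helpful counts and helpful/harmful ratios clear the gates."""
    ratio = helpful / max(harmful, 1)  # avoid division by zero when nothing harmful was recorded
    if stage == "shadow" and helpful >= STAGING_HELPFUL_MIN and ratio >= STAGING_RATIO_MIN:
        return "staging"
    if stage == "staging" and helpful >= PROD_HELPFUL_MIN and ratio >= PROD_RATIO_MIN:
        return "prod"
    return stage
```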
ACE turns tiny heuristic checks into reusable guardrails without manual babysitting:
- Detect: Domain heuristics (e.g., ±0.4% drift, missing "%") label a generator trajectory as a precise failure mode.
- Distill: The reflector converts that signal into a lesson (“round to whole percent and append %”).
- Persist: The curator records a typed delta with helpful/harmful counters and merges it into the playbook.
- Reuse: Runtime adapter + merge coordinator surface the tactic immediately so later tasks cannot repeat the mistake.
This loop mirrors the +8.6% improvements reported on FiNER/XBRL benchmarks—subtle finance errors become actionable context upgrades instead of one-off patches.
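For the percent-formatting example above, the detect step can be a few lines of plain Python (a hedged sketch; the shipped heuristics live in ace/utils/finance_guardrails.py and differ in detail):

```python
def detect_percent_failure(raw_answer: str, expected: float, tolerance: float = 0.4) -> str | None:
    """Label a trajectory with a precise failure mode, or return None if the answer passes."""
    text = raw_answer.strip()
    if not text.endswith("%"):
        return "missing-percent-sign"
    try:
        value = float(text.rstrip("%"))
    except ValueError:
        return "unparseable-percent"
    if abs(value - expected) > tolerance:
        return "percent-drift"
    return None  # nothing for the reflector to distill
```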
- Python 3.11 or higher
- pip or uv package manager
- OpenAI API key, Anthropic API key, or OpenRouter API key
Copy .env.example to .env and configure the following:
Required API Keys (at least one):
OPENAI_API_KEY=sk-... # OpenAI API
ANTHROPIC_API_KEY=sk-ant-... # Anthropic API
OPENROUTER_API_KEY=sk-or-... # OpenRouter (multi-model gateway)

Database Configuration:
DATABASE_URL=sqlite:///ace_playbook.db # SQLite (default) or PostgreSQL
DATABASE_WAL_MODE=true # SQLite WAL for better concurrency

Embedding & Search:
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 # Embedding model
EMBEDDING_DIMENSION=384 # Vector dimensions
FAISS_INDEX_TYPE=IndexFlatIP # FAISS index type
SIMILARITY_THRESHOLD=0.8 # Deduplication threshold

Promotion Gates (bullet lifecycle):
SHADOW_HELPFUL_MIN=0 # Min helpful for shadow stage
STAGING_HELPFUL_MIN=3 # Min helpful for staging promotion
PROD_HELPFUL_MIN=5 # Min helpful for production promotion
STAGING_RATIO_MIN=3.0 # Min helpful/harmful ratio for staging
PROD_RATIO_MIN=5.0 # Min helpful/harmful ratio for production

Monitoring & Logging:
LOG_LEVEL=INFO # Logging verbosity
LOG_FORMAT=json # Log format (json or text)
METRICS_ENABLED=true # Enable Prometheus metrics
METRICS_PORT=9090 # Metrics endpoint port

# Clone the repository
git clone https://github.com/your-username/ace-playbook.git
cd ace-playbook
# Option 1: Install with pip (standard)
pip install -e ".[dev]"
# Option 2: Install with uv (faster)
pip install uv
uv pip install -e ".[dev]"
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys:
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# OPENROUTER_API_KEY=sk-or-... (alternative)
# Initialize the database
alembic upgrade head
# Verify installation with smoke tests
pytest tests/e2e/test_smoke.py -v
# Run a simple example
python examples/arithmetic_learning.py

The Agent Learning (Early Experience) harness now lives in this repository under
ace/agent_learning. It reuses the ACE runtime client, curator, and metrics
stack to run a live loop that streams experience back into the playbook. See
docs/combined_quickstart.md for a walkthrough
and run the demo script with:
python examples/live_loop_quickstart.py
# Or run with your configured DSPy backend
python examples/live_loop_quickstart.py --backend dspy --episodes 10

Environment checklist:
- `OPENROUTER_API_KEY` (preferred), `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY`
- `DATABASE_URL` (defaults to `sqlite:///ace_playbook.db`)
- Optional: `OPENROUTER_MODEL` if you want to experiment with different hosted LLMs
# Create .env file with your API keys
echo "OPENAI_API_KEY=sk-..." > .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env
# Start services
docker-compose up -d
# View logs
docker-compose logs -f ace
# Stop services
docker-compose down

# Export Prometheus metrics
from ace.ops import get_metrics_collector
collector = get_metrics_collector()
print(collector.export_prometheus())

# Check for performance regressions
from ace.ops import create_guardrail_monitor
monitor = create_guardrail_monitor(session)
trigger = monitor.check_guardrails("customer-acme")
if trigger:
print(f"Rollback triggered: {trigger.reason}")ACE Playbook provides a FastAPI-based REST API for memory formation and retrieval.
# Option 1: Use the convenience script (recommended for development)
python run_api_server.py
# Runs on http://localhost:8000 with auto-reload enabled
# Option 2: Run with uvicorn directly
python -m uvicorn ace.api.app:app --host 0.0.0.0 --port 8000
# Option 3: With auto-reload for development
python -m uvicorn ace.api.app:app --reload

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check with service status |
| `/api/memory/form` | POST | Synchronous memory formation (~30s) |
| `/api/memory/form-async` | POST | Async memory formation (~50ms response) |
| `/api/memory/status/{task_id}` | GET | Poll async task status |
| `/api/memory/queue-stats` | GET | Task queue statistics |
| `/api/memory/retrieve` | POST | Retrieve relevant memories |
For production use, the async endpoints provide non-blocking memory formation:
# 1. Queue a memory formation task (returns immediately)
curl -X POST http://localhost:8000/api/memory/form-async \
-H "Authorization: Bearer ace_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"task_id": "task-001",
"domain_id": "user-123-python",
"reasoning_trace": ["Step 1: Analyze", "Step 2: Implement"],
"answer": "def foo(): pass",
"ground_truth": "def foo(): pass",
"confidence": 0.9,
"domain": "python-coding"
}'
# Response: {"queue_task_id": "abc-123", "status": "queued"}
# 2. Poll for completion
curl http://localhost:8000/api/memory/status/abc-123 \
-H "Authorization: Bearer ace_your_api_key"
# Response: {"status": "completed", "result": {"bullets_added": 2, ...}}
# 3. Check queue health
curl http://localhost:8000/api/memory/queue-stats \
-H "Authorization: Bearer ace_your_api_key"
# Response: {"pending": 0, "processing": 1, "completed": 50, "failed": 0, "worker_running": true}Benefits of Async Processing:
- 50ms response time instead of 30s blocking
- Automatic retries on transient failures (up to 3 attempts)
- Queue persistence across API restarts
- Concurrent processing via background worker thread
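The queue-and-poll cycle from the curl commands above can be wrapped in a small client helper; this sketch assumes the `requests` package is installed and reuses the response fields shown earlier (`queue_task_id`, `status`):

```python
import time
import requests

BASE_URL = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer ace_your_api_key"}

def form_memory_async(payload: dict, timeout_s: float = 60.0) -> dict:
    """Queue a memory-formation task, then poll /api/memory/status/{task_id} until it settles."""
    queued = requests.post(f"{BASE_URL}/api/memory/form-async", json=payload, headers=HEADERS).json()
    task_id = queued["queue_task_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE_URL}/api/memory/status/{task_id}", headers=HEADERS).json()
        if status["status"] in ("completed", "failed"):
            return status
        time.sleep(1.0)
    raise TimeoutError(f"task {task_id} still pending after {timeout_s}s")
```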
ACE Playbook uses FAISS for high-performance vector similarity search with automatic index optimization:
| Vector Count | Index Type | Query Time | Memory |
|---|---|---|---|
| <1,000 | IndexFlatIP (exact) | ~1ms | Full |
| 1,000-10,000 | IndexIVFFlat | ~0.5ms | Full |
| >10,000 | IndexIVFPQ | ~0.3ms | 10-100x reduced |
Installing FAISS (optional but recommended for production):
# With pip (may require compilation)
pip install faiss-cpu
# With conda (pre-built binaries)
conda install -c pytorch faiss-cpu
# On macOS with Homebrew
brew install faiss && pip install faiss-cpu

Without FAISS, the system automatically falls back to NumPy-based search (suitable for <10K vectors).
ACE Playbook includes an MCP (Model Context Protocol) server that exposes memory functionality as tools for Claude agents.
| Tool | Description |
|---|---|
| `recall_memories` | Retrieve relevant memories via semantic search |
| `teach_memory` | Store learned patterns from interactions |
| `forget_memory` | Delete or quarantine specific memories |
# Option 1: Use the convenience script (recommended)
python run_mcp_server.py
# Option 2: With SSE transport for development/testing
python run_mcp_server.py --transport sse --port 8765
# Option 3: With debug mode and custom settings
python run_mcp_server.py --debug --log-level DEBUG
# Option 4: Run with module directly
python -m ace.mcp.server

| Option | Default | Description |
|---|---|---|
| `--transport` | `stdio` | Transport mode: stdio (Claude Desktop) or sse (development) |
| `--host` | `127.0.0.1` | Host to bind for SSE transport |
| `--port` | `8765` | Port to bind for SSE transport |
| `--debug` | `false` | Enable debug mode |
| `--log-level` | `INFO` | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL |
| `--database-url` | `sqlite:///ace_playbook.db` | Database connection URL |
| `--reflector-model` | `gpt-4o-mini` | LLM model for Reflector |
| `--similarity-threshold` | `0.8` | Semantic deduplication threshold |
All settings can be configured via environment variables with the `ACE_MCP_` prefix:
# Database
ACE_MCP_DATABASE_URL=sqlite:///custom.db
# Server settings
ACE_MCP_SERVER_HOST=127.0.0.1
ACE_MCP_SERVER_PORT=8765
ACE_MCP_TRANSPORT=stdio
ACE_MCP_DEBUG=false
ACE_MCP_LOG_LEVEL=INFO
# LLM settings
ACE_MCP_REFLECTOR_MODEL=gpt-4o-mini
ACE_MCP_SIMILARITY_THRESHOLD=0.8
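Since ace/mcp/config.py builds on pydantic-settings (see the project structure below), the prefix handling looks roughly like this; the field list is a sketch inferred from the variables above, not the exact class:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class MCPServerConfig(BaseSettings):
    """Reads ACE_MCP_* environment variables; unset fields fall back to these defaults."""
    model_config = SettingsConfigDict(env_prefix="ACE_MCP_")

    database_url: str = "sqlite:///ace_playbook.db"
    server_host: str = "127.0.0.1"
    server_port: int = 8765
    transport: str = "stdio"
    debug: bool = False
    log_level: str = "INFO"
    reflector_model: str = "gpt-4o-mini"
    similarity_threshold: float = 0.8

config = MCPServerConfig()  # e.g. ACE_MCP_SERVER_PORT=9000 overrides server_port
```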
Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py"],
      "cwd": "/path/to/ace-playbook",
      "env": {
        "ACE_MCP_DATABASE_URL": "sqlite:///ace_playbook.db",
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Add to your Claude Code settings or .mcp.json:
{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py", "--debug"],
      "cwd": "/path/to/ace-playbook"
    }
  }
}

Use the benchmark harness to compare variants and capture guardrail activity. Detailed notes live in docs/runtime_benchmarks.rst; aggregated numbers are tracked in benchmarks/RESULTS.md alongside links to the GitHub Action artifacts.
# Baseline: Chain-of-Thought generator only
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl baseline --output results/baseline_finance_subset.json
# Full ACE stack: ReAct generator + runtime adapter + merge coordinator + refinement scheduler
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full --output results/ace_full_finance_subset.json
# ACE vs baseline live loop comparison (ACE + EE harness)
python benchmarks/run_live_loop_benchmark.py --backend dspy --episodes 10
# Trigger the CI workflow (optional)
gh workflow run ace-benchmark.yml
# The matrix covers finance (easy + hard, GT/no-GT), agent-hard, and finance ablations.
# Each job uploads `ace-benchmark-<matrix.name>` under `results/actions/<run-id>/`.
# Audit agent heuristics locally (sample 20 tasks)
python scripts/audit_agent_scoring.py benchmarks/agent_small.jsonl --sample 20
# Hard finance split (Table 2 replication)
ACE_BENCHMARK_TEMPERATURE=1.3 \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl baseline \
--output results/benchmark/baseline_finance_hard.json
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_gt.json
ACE_BENCHMARK_USE_GROUND_TRUTH=false ACE_BENCHMARK_TEMPERATURE=0.6 \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_gt.json
# Finance ablations (Table 2 component analysis)
ACE_ENABLE_REFLECTOR=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_reflector.json
ACE_MULTI_EPOCH=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_multiepoch.json
ACE_OFFLINE_WARMUP=false \
python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
--output results/benchmark/ace_finance_hard_no_warmup.json
# Agent/AppWorld hard split with conservative heuristics
ACE_BENCHMARK_TEMPERATURE=1.3 \
python scripts/run_benchmark.py benchmarks/agent_hard.jsonl baseline \
--output results/benchmark/baseline_agent_hard.json
python scripts/run_benchmark.py benchmarks/agent_hard.jsonl ace_full \
--output results/benchmark/ace_agent_hard.json
# Quickly sanity-check heuristic thresholds on harder agent tasks
python scripts/audit_agent_scoring.py benchmarks/agent_hard.jsonl --sample 20

Key metrics in the JSON output:
- `correct` / `total` – benchmark score
- `promotions`, `new_bullets`, `increments` – curator activity
- `auto_corrections` – guardrail canonical replacements (e.g., finance rounding)
- `format_corrections` – post-process clamps that strip extra words but retain the raw answer for reflection
- `agent_feedback_log` – path to the per-task ledger (`*.feedback.jsonl`) emitted for every run
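Assuming the run output is a flat JSON object with those keys (a sketch of the shape, not a guaranteed schema), a quick summary can be printed like this:

```python
import json
from pathlib import Path

result = json.loads(Path("results/ace_full_finance_subset.json").read_text())
print(f"accuracy: {result['correct']}/{result['total']}")
print(f"curator: {result['promotions']} promotions, {result['new_bullets']} new bullets, {result['increments']} increments")
print(f"guardrails: {result['auto_corrections']} auto-corrections, {result['format_corrections']} format clamps")
print(f"per-task ledger: {result['agent_feedback_log']}")
```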
Populate or refresh benchmarks/RESULTS.md with the numbers emitted by these commands (or the CI artifacts). The guardrails and heuristics default to a fail-closed posture: when they cannot certify an answer they mark it unknown, mirroring the safety constraint highlighted in the paper.
- Edit `ace/utils/finance_guardrails.py` and add an entry to `FINANCE_GUARDRAILS` with `instructions`, `calculator`, and `decimals` (a sketch of such an entry follows this list).
- Set `auto_correct=True` if the calculator should override the raw answer.
- Re-run `scripts/run_benchmark.py` for the relevant dataset.
- Inspect `results/*.json` to confirm the guardrail triggered and push the refreshed artifact.
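For orientation, a new entry could look roughly like the following; the field layout is inferred from the step above, so treat `FINANCE_GUARDRAILS` in ace/utils/finance_guardrails.py as the source of truth:

```python
from ace.utils.finance_guardrails import FINANCE_GUARDRAILS  # assumed import path

FINANCE_GUARDRAILS["gross-margin-pct"] = {
    "instructions": "Report gross margin as a whole percentage with a trailing % sign.",
    "calculator": lambda revenue, cogs: round(100 * (revenue - cogs) / revenue),
    "decimals": 0,
    "auto_correct": True,  # let the calculator's canonical value override the raw answer
}
```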
Pro tip: keep regenerated results in source control so regressions surface in diffs.
- Scaffold stubs – `python scripts/scaffold_domain.py claims-processing`. This creates:
  - `benchmarks/claims-processing.jsonl`
  - `ace/utils/claims-processing_guardrails.py`
  - `docs/domains/claims-processing.rst`
- Populate ground truth – Fill the benchmark file with real tasks (one JSON per line).
- Implement guardrails – Update the guardrail module with instructions, calculators, and `auto_correct` flags.
- Run the benchmark – `python scripts/run_benchmark.py benchmarks/claims-processing.jsonl ace_full --output results/ace_full_claims-processing.json`
- Document & commit – Summarize behavior in the docs stub, review `results/*.json`, and push the changes.
Tip: repeat the harness run periodically (or in CI) so regressions surface immediately.
See docs/release_notes.md for the changelog and upgrade
instructions for the unified ACE + Agent Learning stack. Tag v1.0.0
corresponds to the integration referenced in the companion papers.
ace-playbook/
├── ace/ # Core ACE framework
│ ├── generator/ # DSPy Generator modules (CoT, ReAct)
│ ├── reflector/ # Reflector insight extraction
│ ├── curator/ # Semantic deduplication with FAISS
│ ├── models/ # SQLAlchemy ORM models
│ │ ├── playbook.py # PlaybookBullet with user_id FK
│ │ ├── user.py # User, UserUsage, AuditLog
│ │ └── task_queue.py # TaskQueue for async processing
│ ├── repositories/ # Database access layer
│ ├── services/ # Business logic layer
│ │ ├── auth_service.py # API key authentication
│ │ ├── task_queue_service.py # Async queue management
│ │ └── memory_formation_service.py # G-R-C orchestration
│ ├── workers/ # Background task processing
│ │ └── memory_worker.py # Async memory formation worker
│ ├── api/ # FastAPI REST endpoints
│ │ ├── app.py # API routes (sync + async)
│ │ ├── models.py # Pydantic request/response schemas
│ │ └── middleware.py # Auth + rate limiting
│ ├── mcp/ # MCP Server for Claude integration
│ │ ├── server.py # ACEMemoryMCPServer implementation
│ │ ├── config.py # MCPServerConfig (pydantic-settings)
│ │ └── tools.py # MCP tool implementations
│ ├── utils/ # Infrastructure utilities
│ │ ├── faiss_index.py # FAISS IVF+PQ optimization
│ │ ├── numpy_vector_search.py # NumPy fallback
│ │ └── embeddings.py # Sentence-transformer wrapper
│ └── ops/ # Operations (metrics, guardrails)
├── tests/ # Test suite
│ ├── unit/ # Unit tests
│ ├── integration/ # Integration tests
│ └── e2e/ # End-to-end smoke tests
├── examples/ # Usage examples
├── config/ # Configuration files
├── alembic/ # Database migrations
│ └── versions/ # Migration scripts
├── run_api_server.py # Convenience script for API server
├── run_mcp_server.py # Convenience script for MCP server
├── Dockerfile # Container image definition
├── docker-compose.yml # Local development stack
└── docs/ # Additional documentation
Pre-commit hooks automatically run code quality checks before each commit:
# Install pre-commit hooks (one-time setup)
pre-commit install
pre-commit install --hook-type commit-msg
# Run manually on all files
pre-commit run --all-files
# Skip hooks for a specific commit (use sparingly)
git commit --no-verify -m "WIP: temporary commit"

Installed Hooks:
- Code Quality: Black formatting, Ruff linting, isort import sorting, autoflake (unused imports)
- Type Safety: mypy static type checking
- Security: Bandit vulnerability scanning, detect-secrets, Safety (dependency vulnerabilities)
- Documentation: Docstring coverage (interrogate), markdown linting
- Standards: Conventional commits validation, trailing whitespace, end-of-file fixes
- Infrastructure: YAML/JSON/TOML validation, Dockerfile linting, SQL linting
- Testing: pytest coverage ≥80% (on push)
- Complexity: Radon cyclomatic complexity and maintainability index (on push)
- Dead Code: Dead code detection
# Run tests
pytest tests/ -v
# Type checking
mypy ace/
# Code formatting
black ace/ tests/
ruff check ace/ tests/
# Security scan
bandit -r ace/
# Docstring coverage
interrogate -vv ace/

Build and view the complete documentation:
# Build HTML documentation
make docs
# Serve documentation locally
make docs-serve # http://localhost:8000

Available Documentation:
- 📚 API Reference: Auto-generated Sphinx docs for all modules
- 🏗️ Architecture Guide: System design with Mermaid diagrams (docs/architecture.md)
- 🎓 Developer Onboarding: Setup, workflows, and best practices (docs/onboarding.md)
- ⚠️ Edge Cases: Error handling and recovery procedures (docs/edge_cases.md)
- 🚀 Tutorials: Step-by-step guides (docs/tutorials/01-quick-start.rst)
- 📖 Getting Started: Quick installation guide (docs/getting_started.rst)
- Architecture Guide: `docs/architecture.md` - System design and component diagrams
- API Documentation: `docs/api/` - Detailed API reference
- Runbook: `docs/RUNBOOK.md` - Operations guide
- Edge Cases: `docs/edge_cases.md` - Error handling documentation
This project is an adapted version of the original ace-playbook by jmanhype. The original implementation provided the foundation for the Generator-Reflector-Curator pattern used in this system.
Key contributions from the original project:
- Generator-Reflector-Curator architecture design
- Semantic deduplication with FAISS
- Staged rollout (shadow → staging → prod) concept
- Multi-domain isolation approach
MIT