
ACE Playbook - Adaptive Code Evolution

Note: This project is an adapted version of jmanhype/ace-playbook. We extend our gratitude to the original authors for their foundational work on the Generator-Reflector-Curator pattern for self-improving LLM systems.

Self-improving LLM system using the Generator-Reflector-Curator pattern for online learning from execution feedback.

Architecture

Generator-Reflector-Curator Pattern:

  • Generator: DSPy ReAct/CoT modules that execute tasks using playbook strategies
  • Reflector: Analyzes outcomes and extracts labeled insights (Helpful/Harmful/Neutral)
  • Curator: Pure Python semantic deduplication with FAISS (0.8 cosine similarity threshold)
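
For orientation, here is a minimal sketch of how one learning iteration composes the three roles (class and method names are illustrative only; the real modules live under ace/generator, ace/reflector, and ace/curator):

# Illustrative sketch of a single Generator-Reflector-Curator iteration.
# Names are hypothetical; see ace/generator, ace/reflector, ace/curator for the real code.
def learning_step(task, playbook, generator, reflector, curator):
    # Generator executes the task, conditioned on current playbook strategies
    trajectory = generator.run(task, context=playbook.top_bullets(task.domain_id))

    # Reflector labels what helped or hurt in the trajectory
    insights = reflector.analyze(trajectory)  # [(text, "Helpful" | "Harmful" | "Neutral"), ...]

    # Curator appends new bullets or increments counters on near-duplicates
    # (semantic deduplication at the 0.8 cosine-similarity threshold)
    for text, label in insights:
        curator.apply_delta(playbook, text, label)

    return trajectory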

Key Features

  • Append-only playbook: Never rewrite bullet content, only increment counters
  • Semantic deduplication: 0.8 cosine similarity threshold prevents context collapse
  • Staged rollout: shadow → staging → prod with automated promotion gates
  • Multi-domain isolation: Per-tenant namespaces with separate FAISS indices
  • Rollback procedures: <5 minute automated rollback on regression detection
  • Performance budgets: ≤10ms P50 playbook retrieval, ≤+15% end-to-end overhead
  • Observability metrics: Prometheus-format metrics for monitoring (T065)
  • Guardrail monitoring: Automated rollback on performance regression (T066)
  • Docker support: Full containerization with Docker Compose (T067)
  • E2E testing: Comprehensive smoke tests for production readiness (T068)
  • Runtime adaptation: Merge coordinator + runtime adapter enable in-flight learning with optional benchmark harness
  • Async task queue: SQLite-backed queue for non-blocking memory formation (~50ms response)
  • FAISS IVF+PQ optimization: Sub-millisecond queries at 100K+ vectors with automatic index upgrades
  • Multi-user authentication: API key auth with usage tracking and rate limiting

Guardrails as High-Precision Sensors

ACE turns tiny heuristic checks into reusable guardrails without manual babysitting:

  • Detect: Domain heuristics (e.g., ±0.4% drift, missing "%") label a generator trajectory as a precise failure mode.
  • Distill: The reflector converts that signal into a lesson (“round to whole percent and append %”).
  • Persist: The curator records a typed delta with helpful/harmful counters and merges it into the playbook.
  • Reuse: Runtime adapter + merge coordinator surface the tactic immediately so later tasks cannot repeat the mistake.

This loop mirrors the +8.6% improvements reported on FiNER/XBRL benchmarks—subtle finance errors become actionable context upgrades instead of one-off patches.
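
As a rough illustration of the "Detect" step (hedged; this is not the code in ace/utils/finance_guardrails.py), a percent-formatting heuristic could look like:

# Illustrative heuristic: flag answers that omit "%" or drift more than
# 0.4 percentage points from a reference value. Hypothetical helper only.
def check_percent_answer(raw_answer: str, reference: float, tolerance: float = 0.4):
    text = raw_answer.strip()
    if not text.endswith("%"):
        return "missing-percent-sign"      # precise failure mode handed to the Reflector
    try:
        value = float(text.rstrip("%"))
    except ValueError:
        return "unparseable"
    if abs(value - reference) > tolerance:
        return "drift"                     # outside the ±0.4% band
    return None                            # passes the guardrail

Each non-None label is exactly the kind of narrow, high-precision signal the Reflector can distill into a playbook lesson.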

Quick Start

Prerequisites

  • Python 3.11 or higher
  • pip or uv package manager
  • OpenAI API key, Anthropic API key, or OpenRouter API key

Environment Configuration

Copy .env.example to .env and configure the following:

Required API Keys (at least one):

OPENAI_API_KEY=sk-...           # OpenAI API
ANTHROPIC_API_KEY=sk-ant-...    # Anthropic API
OPENROUTER_API_KEY=sk-or-...    # OpenRouter (multi-model gateway)

Database Configuration:

DATABASE_URL=sqlite:///ace_playbook.db  # SQLite (default) or PostgreSQL
DATABASE_WAL_MODE=true                   # SQLite WAL for better concurrency

Embedding & Search:

EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2  # Embedding model
EMBEDDING_DIMENSION=384                                  # Vector dimensions
FAISS_INDEX_TYPE=IndexFlatIP                             # FAISS index type
SIMILARITY_THRESHOLD=0.8                                 # Deduplication threshold

Promotion Gates (bullet lifecycle):

SHADOW_HELPFUL_MIN=0      # Min helpful for shadow stage
STAGING_HELPFUL_MIN=3     # Min helpful for staging promotion
PROD_HELPFUL_MIN=5        # Min helpful for production promotion
STAGING_RATIO_MIN=3.0     # Min helpful/harmful ratio for staging
PROD_RATIO_MIN=5.0        # Min helpful/harmful ratio for production
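
To make the gate semantics concrete, here is a simplified sketch of how these thresholds decide a bullet's stage (illustrative helper; the actual promotion logic lives in the curator/ops code):

# Simplified promotion-gate check mirroring the environment variables above.
# Hypothetical helper for illustration only.
def promotion_stage(helpful: int, harmful: int) -> str:
    ratio = helpful / max(harmful, 1)
    if helpful >= 5 and ratio >= 5.0:      # PROD_HELPFUL_MIN / PROD_RATIO_MIN
        return "prod"
    if helpful >= 3 and ratio >= 3.0:      # STAGING_HELPFUL_MIN / STAGING_RATIO_MIN
        return "staging"
    return "shadow"                        # SHADOW_HELPFUL_MIN defaults to 0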

Monitoring & Logging:

LOG_LEVEL=INFO            # Logging verbosity
LOG_FORMAT=json           # Log format (json or text)
METRICS_ENABLED=true      # Enable Prometheus metrics
METRICS_PORT=9090         # Metrics endpoint port

Local Installation

# Clone the repository
git clone https://github.com/your-username/ace-playbook.git
cd ace-playbook

# Option 1: Install with pip (standard)
pip install -e ".[dev]"

# Option 2: Install with uv (faster)
pip install uv
uv pip install -e ".[dev]"

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys:
#   OPENAI_API_KEY=sk-...
#   ANTHROPIC_API_KEY=sk-ant-...
#   OPENROUTER_API_KEY=sk-or-... (alternative)

# Initialize the database
alembic upgrade head

# Verify installation with smoke tests
pytest tests/e2e/test_smoke.py -v

# Run a simple example
python examples/arithmetic_learning.py

Agent Learning Live Loop

The Agent Learning (Early Experience) harness now lives in this repository under ace/agent_learning. It reuses the ACE runtime client, curator, and metrics stack to run a live loop that streams experience back into the playbook. See docs/combined_quickstart.md for a walkthrough and run the demo script with:

python examples/live_loop_quickstart.py
# Or run with your configured DSPy backend
python examples/live_loop_quickstart.py --backend dspy --episodes 10

Environment checklist

  • OPENROUTER_API_KEY (preferred), OPENAI_API_KEY, or ANTHROPIC_API_KEY
  • DATABASE_URL (defaults to sqlite:///ace_playbook.db)
  • Optional: OPENROUTER_MODEL if you want to experiment with different hosted LLMs

Docker Compose (Recommended for Production)

# Create .env file with your API keys
echo "OPENAI_API_KEY=sk-..." > .env
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

# Start services
docker-compose up -d

# View logs
docker-compose logs -f ace

# Stop services
docker-compose down

Observability

# Export Prometheus metrics
from ace.ops import get_metrics_collector

collector = get_metrics_collector()
print(collector.export_prometheus())

Guardrail Monitoring

# Check for performance regressions
from ace.ops import create_guardrail_monitor

# session: an active database session (assumption: the SQLAlchemy session used by ACE's repositories)
monitor = create_guardrail_monitor(session)
trigger = monitor.check_guardrails("customer-acme")
if trigger:
    print(f"Rollback triggered: {trigger.reason}")

REST API

ACE Playbook provides a FastAPI-based REST API for memory formation and retrieval.

Start the API Server

# Option 1: Use the convenience script (recommended for development)
python run_api_server.py
# Runs on http://localhost:8000 with auto-reload enabled

# Option 2: Run with uvicorn directly
python -m uvicorn ace.api.app:app --host 0.0.0.0 --port 8000

# Option 3: With auto-reload for development
python -m uvicorn ace.api.app:app --reload

API Endpoints

Endpoint                       Method   Description
/health                        GET      Health check with service status
/api/memory/form               POST     Synchronous memory formation (~30s)
/api/memory/form-async         POST     Async memory formation (~50ms response)
/api/memory/status/{task_id}   GET      Poll async task status
/api/memory/queue-stats        GET      Task queue statistics
/api/memory/retrieve           POST     Retrieve relevant memories

Async Task Queue

For production use, the async endpoints provide non-blocking memory formation:

# 1. Queue a memory formation task (returns immediately)
curl -X POST http://localhost:8000/api/memory/form-async \
  -H "Authorization: Bearer ace_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "task-001",
    "domain_id": "user-123-python",
    "reasoning_trace": ["Step 1: Analyze", "Step 2: Implement"],
    "answer": "def foo(): pass",
    "ground_truth": "def foo(): pass",
    "confidence": 0.9,
    "domain": "python-coding"
  }'

# Response: {"queue_task_id": "abc-123", "status": "queued"}

# 2. Poll for completion
curl http://localhost:8000/api/memory/status/abc-123 \
  -H "Authorization: Bearer ace_your_api_key"

# Response: {"status": "completed", "result": {"bullets_added": 2, ...}}

# 3. Check queue health
curl http://localhost:8000/api/memory/queue-stats \
  -H "Authorization: Bearer ace_your_api_key"

# Response: {"pending": 0, "processing": 1, "completed": 50, "failed": 0, "worker_running": true}

Benefits of Async Processing:

  • 50ms response time instead of 30s blocking
  • Automatic retries on transient failures (up to 3 attempts)
  • Queue persistence across API restarts
  • Concurrent processing via background worker thread
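
For programmatic use, a minimal Python client for the same flow might look like this (a sketch using the requests library; the API key placeholder and request fields mirror the curl examples above):

# Minimal async-client sketch using the requests library; mirrors the curl calls above.
import time
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer ace_your_api_key"}

payload = {
    "task_id": "task-001",
    "domain_id": "user-123-python",
    "reasoning_trace": ["Step 1: Analyze", "Step 2: Implement"],
    "answer": "def foo(): pass",
    "ground_truth": "def foo(): pass",
    "confidence": 0.9,
    "domain": "python-coding",
}

# Queue the task (returns in ~50ms)
resp = requests.post(f"{BASE}/api/memory/form-async", json=payload, headers=HEADERS)
queue_task_id = resp.json()["queue_task_id"]

# Poll until the background worker finishes
while True:
    status = requests.get(f"{BASE}/api/memory/status/{queue_task_id}", headers=HEADERS).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(1)

print(status)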

Vector Search Optimization

ACE Playbook uses FAISS for high-performance vector similarity search with automatic index optimization:

Vector Count    Index Type            Query Time   Memory
<1,000          IndexFlatIP (exact)   ~1ms         Full
1,000-10,000    IndexIVFFlat          ~0.5ms       Full
>10,000         IndexIVFPQ            ~0.3ms       10-100x reduced
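
These tiers map onto the standard FAISS index constructors. A hedged sketch of such an upgrade policy (illustrative, not the exact logic in ace/utils/faiss_index.py):

# Illustrative FAISS index selection by vector count (inner-product metric, 384-dim embeddings).
# Not the exact logic in ace/utils/faiss_index.py.
import faiss

def build_index(dim: int, n_vectors: int):
    if n_vectors < 1_000:
        return faiss.IndexFlatIP(dim)                     # exact search
    quantizer = faiss.IndexFlatIP(dim)
    nlist = 100                                           # number of IVF clusters
    if n_vectors <= 10_000:
        return faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    # IVF + product quantization: compressed vectors, roughly 10-100x less memory
    return faiss.IndexIVFPQ(quantizer, dim, nlist, 8, 8)  # 8 subquantizers, 8 bits each

Note that IVF-based indices must be trained on a representative sample (index.train(vectors)) before vectors are added.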

Installing FAISS (optional but recommended for production):

# With pip (may require compilation)
pip install faiss-cpu

# With conda (pre-built binaries)
conda install -c pytorch faiss-cpu

# On macOS with Homebrew
brew install faiss && pip install faiss-cpu

Without FAISS, the system automatically falls back to NumPy-based search (suitable for <10K vectors).
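
That fallback is essentially a brute-force cosine scan over normalized embeddings; a rough sketch (the real implementation is ace/utils/numpy_vector_search.py):

# NumPy brute-force fallback: exact nearest-neighbour scan, fine below ~10K vectors.
# Illustrative only; see ace/utils/numpy_vector_search.py for the real implementation.
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    q = query / np.linalg.norm(query)                              # normalize so dot product = cosine
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]                                        # scores >= 0.8 count as near-duplicates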

MCP Server (Claude Integration)

ACE Playbook includes an MCP (Model Context Protocol) server that exposes memory functionality as tools for Claude agents.

Available Tools

Tool              Description
recall_memories   Retrieve relevant memories via semantic search
teach_memory      Store learned patterns from interactions
forget_memory     Delete or quarantine specific memories

Running the MCP Server

# Option 1: Use the convenience script (recommended)
python run_mcp_server.py

# Option 2: With SSE transport for development/testing
python run_mcp_server.py --transport sse --port 8765

# Option 3: With debug mode and custom settings
python run_mcp_server.py --debug --log-level DEBUG

# Option 4: Run with module directly
python -m ace.mcp.server

MCP Server CLI Options

Option                   Default                      Description
--transport              stdio                        Transport mode: stdio (Claude Desktop) or sse (development)
--host                   127.0.0.1                    Host to bind for SSE transport
--port                   8765                         Port to bind for SSE transport
--debug                  false                        Enable debug mode
--log-level              INFO                         Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
--database-url           sqlite:///ace_playbook.db    Database connection URL
--reflector-model        gpt-4o-mini                  LLM model for Reflector
--similarity-threshold   0.8                          Semantic deduplication threshold

MCP Environment Variables

All settings can be configured via environment variables with ACE_MCP_ prefix:

# Database
ACE_MCP_DATABASE_URL=sqlite:///custom.db

# Server settings
ACE_MCP_SERVER_HOST=127.0.0.1
ACE_MCP_SERVER_PORT=8765
ACE_MCP_TRANSPORT=stdio
ACE_MCP_DEBUG=false
ACE_MCP_LOG_LEVEL=INFO

# LLM settings
ACE_MCP_REFLECTOR_MODEL=gpt-4o-mini
ACE_MCP_SIMILARITY_THRESHOLD=0.8

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py"],
      "cwd": "/path/to/ace-playbook",
      "env": {
        "ACE_MCP_DATABASE_URL": "sqlite:///ace_playbook.db",
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Claude Code Configuration

Add to your Claude Code settings or .mcp.json:

{
  "mcpServers": {
    "ace-memory": {
      "command": "python",
      "args": ["run_mcp_server.py", "--debug"],
      "cwd": "/path/to/ace-playbook"
    }
  }
}

Benchmarking & Runtime Adaptation

Use the benchmark harness to compare variants and capture guardrail activity. Detailed notes live in docs/runtime_benchmarks.rst; aggregated numbers are tracked in benchmarks/RESULTS.md alongside links to the GitHub Actions artifacts.

Run Baseline vs ACE

# Baseline: Chain-of-Thought generator only
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl baseline --output results/baseline_finance_subset.json

# Full ACE stack: ReAct generator + runtime adapter + merge coordinator + refinement scheduler
python scripts/run_benchmark.py benchmarks/finance_subset.jsonl ace_full --output results/ace_full_finance_subset.json

# ACE vs baseline live loop comparison (ACE + EE harness)
python benchmarks/run_live_loop_benchmark.py --backend dspy --episodes 10

# Trigger the CI workflow (optional)
gh workflow run ace-benchmark.yml
# The matrix covers finance (easy + hard, GT/no-GT), agent-hard, and finance ablations.
# Each job uploads `ace-benchmark-<matrix.name>` under `results/actions/<run-id>/`.

# Audit agent heuristics locally (sample 20 tasks)
python scripts/audit_agent_scoring.py benchmarks/agent_small.jsonl --sample 20

# Hard finance split (Table 2 replication)
ACE_BENCHMARK_TEMPERATURE=1.3 \
  python scripts/run_benchmark.py benchmarks/finance_hard.jsonl baseline \
  --output results/benchmark/baseline_finance_hard.json

python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
  --output results/benchmark/ace_finance_hard_gt.json

ACE_BENCHMARK_USE_GROUND_TRUTH=false ACE_BENCHMARK_TEMPERATURE=0.6 \
  python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
  --output results/benchmark/ace_finance_hard_no_gt.json

# Finance ablations (Table 2 component analysis)
ACE_ENABLE_REFLECTOR=false \
  python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
  --output results/benchmark/ace_finance_hard_no_reflector.json

ACE_MULTI_EPOCH=false \
  python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
  --output results/benchmark/ace_finance_hard_no_multiepoch.json

ACE_OFFLINE_WARMUP=false \
  python scripts/run_benchmark.py benchmarks/finance_hard.jsonl ace_full \
  --output results/benchmark/ace_finance_hard_no_warmup.json

# Agent/AppWorld hard split with conservative heuristics
ACE_BENCHMARK_TEMPERATURE=1.3 \
  python scripts/run_benchmark.py benchmarks/agent_hard.jsonl baseline \
  --output results/benchmark/baseline_agent_hard.json

python scripts/run_benchmark.py benchmarks/agent_hard.jsonl ace_full \
  --output results/benchmark/ace_agent_hard.json

# Quickly sanity-check heuristic thresholds on harder agent tasks
python scripts/audit_agent_scoring.py benchmarks/agent_hard.jsonl --sample 20

Key metrics in the JSON output:

  • correct / total – benchmark score
  • promotions, new_bullets, increments – curator activity
  • auto_corrections – guardrail canonical replacements (e.g., finance rounding)
  • format_corrections – post-process clamps that strip extra words but retain the raw answer for reflection
  • agent_feedback_log – path to the per-task ledger (*.feedback.jsonl) emitted for every run

Populate or refresh benchmarks/RESULTS.md with the numbers emitted by these commands (or the CI artifacts). The guardrails and heuristics default to a fail-closed posture: when they cannot certify an answer they mark it unknown, mirroring the safety constraint highlighted in the paper.
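
A small helper along these lines can summarize a run before copying numbers into benchmarks/RESULTS.md (a sketch; it assumes the keys above appear at the top level of the JSON):

# Summarize a benchmark result file using the keys described above (assumed top-level).
import json, sys

with open(sys.argv[1]) as f:            # e.g., results/ace_full_finance_subset.json
    r = json.load(f)

print(f"score: {r['correct']}/{r['total']}")
print(f"curator: {r['promotions']} promotions, {r['new_bullets']} new bullets, {r['increments']} increments")
print(f"guardrails: {r['auto_corrections']} auto-corrections, {r['format_corrections']} format corrections")
print(f"feedback ledger: {r['agent_feedback_log']}")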

Add a New Finance Guardrail

  1. Edit ace/utils/finance_guardrails.py and add an entry to FINANCE_GUARDRAILS with instructions, calculator, and decimals.
  2. Set auto_correct=True if the calculator should override the raw answer.
  3. Re-run scripts/run_benchmark.py for the relevant dataset.
  4. Inspect results/*.json to confirm the guardrail triggered and push the refreshed artifact.

Pro tip: keep regenerated results in source control so regressions surface in diffs.
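
For illustration, a new entry might look roughly like this (the field names follow step 1 above, but the exact schema is defined in ace/utils/finance_guardrails.py, so treat this as a hypothetical sketch):

# Hypothetical FINANCE_GUARDRAILS entry; adapt to the actual schema in ace/utils/finance_guardrails.py.
FINANCE_GUARDRAILS["gross_margin_pct"] = {
    "instructions": "Report gross margin as a whole percent and append '%'.",
    "calculator": lambda revenue, cogs: round(100 * (revenue - cogs) / revenue),
    "decimals": 0,
    "auto_correct": True,   # let the calculator override the raw answer
}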

Add a New Domain in 5 Steps

  1. Scaffold stubs

    python scripts/scaffold_domain.py claims-processing

    This creates:

    • benchmarks/claims-processing.jsonl
    • ace/utils/claims-processing_guardrails.py
    • docs/domains/claims-processing.rst
  2. Populate ground truth – Fill the benchmark file with real tasks (one JSON per line).

  3. Implement guardrails – Update the guardrail module with instructions, calculators, and auto_correct flags.

  4. Run the benchmark – python scripts/run_benchmark.py benchmarks/claims-processing.jsonl ace_full --output results/ace_full_claims-processing.json

  5. Document & commit – Summarize behavior in the docs stub, review results/*.json, and push the changes.

Tip: repeat the harness run periodically (or in CI) so regressions surface immediately.
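
As a purely illustrative example for step 2, one line of benchmarks/claims-processing.jsonl could look like the following; the field names here are assumptions, so check an existing benchmark file such as benchmarks/finance_subset.jsonl for the real schema:

{"task_id": "claims-001", "input": "Claim CLM-1042: policy covers water damage up to $5,000; assessed damage is $7,200. What amount is payable?", "ground_truth": "$5,000", "domain": "claims-processing"}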

Release Notes

See docs/release_notes.md for the changelog and upgrade instructions for the unified ACE + Agent Learning stack. Tag v1.0.0 corresponds to the integration referenced in the companion papers.

Project Structure

ace-playbook/
├── ace/                    # Core ACE framework
│   ├── generator/         # DSPy Generator modules (CoT, ReAct)
│   ├── reflector/         # Reflector insight extraction
│   ├── curator/           # Semantic deduplication with FAISS
│   ├── models/            # SQLAlchemy ORM models
│   │   ├── playbook.py    # PlaybookBullet with user_id FK
│   │   ├── user.py        # User, UserUsage, AuditLog
│   │   └── task_queue.py  # TaskQueue for async processing
│   ├── repositories/      # Database access layer
│   ├── services/          # Business logic layer
│   │   ├── auth_service.py        # API key authentication
│   │   ├── task_queue_service.py  # Async queue management
│   │   └── memory_formation_service.py  # G-R-C orchestration
│   ├── workers/           # Background task processing
│   │   └── memory_worker.py  # Async memory formation worker
│   ├── api/               # FastAPI REST endpoints
│   │   ├── app.py         # API routes (sync + async)
│   │   ├── models.py      # Pydantic request/response schemas
│   │   └── middleware.py  # Auth + rate limiting
│   ├── mcp/               # MCP Server for Claude integration
│   │   ├── server.py      # ACEMemoryMCPServer implementation
│   │   ├── config.py      # MCPServerConfig (pydantic-settings)
│   │   └── tools.py       # MCP tool implementations
│   ├── utils/             # Infrastructure utilities
│   │   ├── faiss_index.py # FAISS IVF+PQ optimization
│   │   ├── numpy_vector_search.py  # NumPy fallback
│   │   └── embeddings.py  # Sentence-transformer wrapper
│   └── ops/               # Operations (metrics, guardrails)
├── tests/                  # Test suite
│   ├── unit/              # Unit tests
│   ├── integration/       # Integration tests
│   └── e2e/               # End-to-end smoke tests
├── examples/               # Usage examples
├── config/                 # Configuration files
├── alembic/                # Database migrations
│   └── versions/          # Migration scripts
├── run_api_server.py       # Convenience script for API server
├── run_mcp_server.py       # Convenience script for MCP server
├── Dockerfile              # Container image definition
├── docker-compose.yml      # Local development stack
└── docs/                   # Additional documentation

Development

Pre-commit Hooks

Pre-commit hooks automatically run code quality checks before each commit:

# Install pre-commit hooks (one-time setup)
pre-commit install
pre-commit install --hook-type commit-msg

# Run manually on all files
pre-commit run --all-files

# Skip hooks for a specific commit (use sparingly)
git commit --no-verify -m "WIP: temporary commit"

Installed Hooks:

  • Code Quality: Black formatting, Ruff linting, isort import sorting, autoflake (unused imports)
  • Type Safety: mypy static type checking
  • Security: Bandit vulnerability scanning, detect-secrets, Safety (dependency vulnerabilities)
  • Documentation: Docstring coverage (interrogate), markdown linting
  • Standards: Conventional commits validation, trailing whitespace, end-of-file fixes
  • Infrastructure: YAML/JSON/TOML validation, Dockerfile linting, SQL linting
  • Testing: pytest coverage ≥80% (on push)
  • Complexity: Radon cyclomatic complexity and maintainability index (on push)
  • Dead Code: Detection of unused (dead) code

Manual Testing

# Run tests
pytest tests/ -v

# Type checking
mypy ace/

# Code formatting
black ace/ tests/
ruff check ace/ tests/

# Security scan
bandit -r ace/

# Docstring coverage
interrogate -vv ace/

Documentation

Comprehensive Documentation (v1.14.0+)

Build and view the complete documentation:

# Build HTML documentation
make docs

# Serve documentation locally
make docs-serve  # http://localhost:8000


Acknowledgments

This project is an adapted version of the original ace-playbook by jmanhype. The original implementation provided the foundation for the Generator-Reflector-Curator pattern used in this system.

Key contributions from the original project:

  • Generator-Reflector-Curator architecture design
  • Semantic deduplication with FAISS
  • Staged rollout (shadow → staging → prod) concept
  • Multi-domain isolation approach

License

MIT
