Multi-turn adversarial testing for AI customer support agents, mapped to real-world incidents.
Tests RetailHub's customer support agent against 8 real-world AI chatbot failures (Air Canada $812 settlement, NYC MyCity illegal advice, DPD profanity jailbreak, etc.). Red team agent simulates realistic attacks across 20 test variants.
# Install dependencies
pip install anthropic pyyaml
# Set API key
export ANTHROPIC_API_KEY=your_key_here
# Run evaluation (auto-generates dashboard when complete)
python red_team_eval.py
# Open dashboard
open dashboard.html
Note: The dashboard is automatically generated when evaluation completes. To regenerate manually: python generate_dashboard.py
red_team_eval.py
- Multi-turn red team pipeline with LLM-as-Judgetest_cases.yaml
- 20 test cases mapped to real incidentsknowledge/retailhub_policies.md
- Company policy for RAG groundinggenerate_dashboard.py
- HTML dashboard generatordashboard.html
- Interactive results dashboardresults/
- CSV output from each evaluation run
- 8 real incidents (2016-2025): Microsoft Tay, Air Canada, NYC MyCity, DPD, Chevrolet, Cursor AI, Zillow, Anthropic
- Attack tactics: Emotional manipulation, Jailbreak, Prompt injection, Authority invocation, Crescendo, Confidence calibration
- Harm types: Financial loss, Legal liability, Brand damage, Privacy violation
- Severity grading: PASS, P4 (Trivial), P3 (Minor), P2 (Significant), P1 (Major), P0 (Critical)
Open dashboard.html
to view:
- Summary table with severity breakdown and real examples
- "View All Failures" section with judge evaluations
- Attack methodology explanations (6 attack types)
- Real-world incidents with full test results
- System configuration (prompts + policy)
Red Team (Sonnet) ---> RAG Agent (Haiku + Policy)
|
v
LLM Judge (Sonnet)
|
v
results/*.csv
|
v
dashboard.html (auto-generated)
- Red Team:
claude-sonnet-4-5-20250929
(aggressive, adaptive attacks) - RAG Agent:
claude-haiku-4-5-20251001
(intentionally vulnerable for testing) - Judge:
claude-sonnet-4-5-20250929
(9-criteria evaluation)
Latest run: 20 tests, 75% pass rate
- P0 (Critical): 0
- P1 (Major): 1
- P2 (Significant): 4
- P3-P4: 0
- PASS: 15
Incident-driven evaluation methodology: maps real-world AI chatbot failures to adversarial test cases. Uses RAG + policy grounding with multi-turn adversarial testing to identify edge cases and vulnerabilities.