Website · Docs · Discord · Changelog
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.
Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.
Single-turn for Q&A validation. Conversation simulation for dialogue flows.
Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
Polyphemus Agent proactively finds vulnerabilities:
- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing
Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.
| Framework | Example Metrics |
|---|---|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |
All metrics include LLM-as-Judge reasoning explanations.
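A rough sketch of a domain-specific judge is shown below. The `rhesis.sdk.metrics` import path and constructor arguments are assumptions for illustration, not the documented API; see the metrics docs for the real NumericJudge and CategoricalJudge signatures.

```python
# Hypothetical sketch: module path and argument names are assumptions, not the
# documented Rhesis API. A numeric LLM-as-Judge scores an output against a
# natural-language rubric and returns the score with the judge's reasoning.
from rhesis.sdk.metrics import NumericJudge  # assumed import path

tone_metric = NumericJudge(
    name="clinical_tone",
    evaluation_prompt=(
        "Rate from 1 (casual) to 5 (clinical) how appropriate the tone is "
        "for a medical support chatbot."
    ),
    min_score=1,
    max_score=5,
)
```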
Monitor your LLM applications with OpenTelemetry-based tracing:
```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    response = ...  # your LLM call here
    return response
```

Track LLM calls, latency, token usage, and link traces to test results for debugging.
Use any LLM provider for test generation and evaluation:
Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
Local/Self-hosted: Ollama, vLLM, LiteLLM
See Model Configuration Docs for setup instructions.
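For illustration only, the sketch below assumes provider selection works by passing a provider-prefixed model string to the synthesizer; the `model` keyword and the string format are assumptions, so check the Model Configuration Docs for the actual parameters.

```python
from rhesis.sdk.synthesizers import PromptSynthesizer

# Hypothetical: the `model` keyword and "ollama/llama3" naming are assumptions;
# the Model Configuration Docs describe the real provider setup.
synthesizer = PromptSynthesizer(
    prompt="Generate tests for a travel-booking assistant",
    model="ollama/llama3",  # e.g. a locally hosted model served by Ollama
)
```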
Platform for teams. SDK for developers.
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.
Six integrated phases from project setup to team collaboration:
| Phase | What You Do |
|---|---|
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), covering concerns raised by product, marketing, customer support, legal, and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
| Instead of... | Rhesis gives you... |
|---|---|
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Built-in handling of non-deterministic LLM outputs |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |
| Use Case | What Rhesis Tests |
|---|---|
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |
Test your Python functions directly with the @endpoint decorator:
```python
from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    response = ...  # your LLM logic here
    return response
```

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).
Generate tests programmatically:
```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```
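To share the generated tests with the rest of the team on the platform, a test set can typically be uploaded; the `upload()` call below is a sketch, so confirm the exact method in the SDK reference. It assumes the SDK is already authenticated (e.g. via RHESIS_API_KEY).

```python
# Push the generated tests to the Rhesis platform so teammates can review and
# run them from the UI. Assumes SDK credentials (e.g. RHESIS_API_KEY) are set.
test_set.upload()
```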
Choose how to run Rhesis:

| Option | Best For | Setup Time |
|---|---|---|
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |
Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app
Option 2: Self-host with Docker
```bash
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```

Access: Frontend at localhost:3000, API at localhost:8080/docs
Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete
Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.
Option 3: Python SDK
```bash
pip install rhesis-sdk
```

Connect Rhesis to your LLM stack:
| Integration | Languages | Description |
|---|---|---|
| Rhesis SDK | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
| OpenAI | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. |
| Anthropic | Python | Native support for Claude models with automatic tracing. |
| LangChain | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
| LangGraph | Python | Built-in integration for LangGraph agent workflows with full observability. |
| AutoGen | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
| HuggingFace | Python | Direct integration with HuggingFace models. |
| REST API | Any | Direct API access for custom integrations. OpenAPI spec available. |
See Integration Docs for setup instructions.
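For example, the `observe` decorator shown earlier can wrap a LiteLLM call to get one traced entry point across providers; this is a minimal sketch and assumes your Rhesis credentials and provider API keys are already configured.

```python
import litellm

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4o-mini")
def ask(prompt: str) -> str:
    # LiteLLM exposes a single completion interface for 100+ providers
    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Summarize the refund policy in one sentence."))
```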
MIT licensed. No plans to relicense core features. Enterprise version will live in ee/ folders and remain separate.
We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.
See CONTRIBUTING.md for guidelines.
Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions
- Documentation - Guides and API reference
- Discord - Community support
- GitHub Issues - Bug reports and feature requests
We take data security seriously. See our Privacy Policy for details.
Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.
- Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
- Cloud: Telemetry enabled as part of Terms & Conditions