# vLLora - Debug your agents in realtime
> Your AI Agent Debugger
This file contains all documentation content in a single document following the llmstxt.org standard.
## Clone and Experiment with Requests
Use **Clone Request** to turn any finished trace into an isolated **Experiment**, so you can safely try new prompts, models, and parameters without touching the original run.
This is the fastest way to A/B test ideas, compare models, and iterate on prompts directly from your existing traces.
## What are Clone Request and Experiment?
- **Clone Request**: the action of taking a finished trace and creating an Experiment from it.
- **Experiment**: the editable copy of the original LLM request (messages, tools, model, temperature, max tokens, etc.) that you can rerun and tweak independently of the original trace.
## How to Clone a Request

1. **Open Traces and select a request**: In the vLLora UI at `http://localhost:9091`, go to the **Traces** view and click the specific trace (span) you want to clone and experiment with. This opens the **Experiment** view for that request.
2. **Create the clone**: Click the **Clone** tab/button. vLLora creates a new experiment based on that trace while keeping the original trace and output frozen on the right as **Original Output**.
The new experiment becomes your **New Output** area where you can safely change the request and re‑run it as many times as you like.
## Editing the Cloned Request
The cloned request is a full OpenAI‑compatible payload with the same messages, tools, **model**, and parameters as the original. You can edit it in two main ways:
- **Visual mode (`INPUT` tab)**
- Edit **system** and **user** messages in a structured, form-like UI.
- Add or remove messages, tools, and tool calls to change how the assistant behaves.
- Switch the **model** used for the Experiment to compare behaviour across providers or versions.
- Great when you want to tweak prompts or tool wiring without touching raw JSON.

- **JSON mode (`JSON` tab)**
- Edit the raw request body exactly as your app would send it.
- Change fields like `model`, `temperature`, `max_tokens`, `tools`, `tool_choice`, and other advanced options.
- Ideal for precise parameter tuning or reproducing a request from your own code.

When you’re ready, click **Run**. Each run of the cloned Experiment creates a **new trace**, so you can A/B test and iterate freely without ever mutating the original request.
In the **Output** panel you can compare the cloned Experiment’s **New Output** against the **Original Output** at a glance:
- **Tokens & context**: see how many prompt + completion tokens were used.
- **Cost**: compare the estimated cost of the original vs the experiment (and how much higher/lower it is, e.g. `<1%`).
- **Trace**: every run appears as its own trace in the **Traces** view, tagged as an **Experiment**, so you can quickly spot and inspect all your experimental runs and dive deeper into timing, tool calls, and other details.
## Use Cases
### 1. Prompt Engineering
Test different phrasings, instructions, or prompt structures to find the most effective version:
```text
Original: "Summarize this article"
Cloned & Modified: "Provide a concise 3-sentence summary of the key points in this article"
```
### 2. Model Comparison
Compare how different models handle the same request:
- Clone a request that used `openai/gpt-4o-mini`
- Change the model to `anthropic/claude-3-5-sonnet`
- Compare outputs side-by-side
### 3. Parameter Tuning
Experiment with different parameter values to optimize performance:
- **Temperature**: Adjust creativity vs. consistency (0.0 to 2.0)
- **Max Tokens**: Control response length
- **Top P**: Fine-tune sampling behavior
- **Frequency Penalty**: Reduce repetition
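For example, in JSON mode you might edit the cloned payload along these lines (an illustrative sketch; the specific values below are placeholders, not recommendations):
```json
{
  "model": "openai/gpt-4o-mini",
  "messages": [
    {"role": "user", "content": "Summarize this article"}
  ],
  "temperature": 0.3,
  "max_tokens": 200,
  "top_p": 0.9,
  "frequency_penalty": 0.5
}
```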
### 4. A/B Testing
Create multiple clones of the same request with different configurations to systematically test which approach works best for your use case.
### 5. Iterative Debugging
When debugging agent behavior:
1. Clone a request that produced unexpected results
2. Modify specific parameters or prompts
3. Test the changes without affecting the original trace
4. Compare results to understand what caused the issue
---
The Clone Request feature makes it easy to experiment and optimize your AI agent interactions without losing your original requests. Use it to refine prompts, compare models, and fine-tune parameters until you achieve the best results for your use case.
---
## Configuration
vLLora can be configured via a `config.yaml` file or through command-line arguments. CLI arguments take precedence over config file settings.
## Config File
Create a `config.yaml` file in your working directory to configure vLLora.
### HTTP Configuration
Configure the backend API server settings:
```yaml
http:
host: "0.0.0.0" # Host address to bind to
port: 9090 # Backend API port (default: 9090)
cors_allowed_origins: # List of allowed CORS origins
- "*" # Default: ["*"] (all origins)
```
### UI Configuration
Configure the web UI server settings:
```yaml
ui:
  port: 9091               # UI port (default: 9091)
  open_on_startup: true    # Auto-open browser on startup (default: true)
```
### OTEL Configuration
Configure the OpenTelemetry gRPC collector settings:
```yaml
otel:
host: "[::]" # Host for OTEL gRPC collector (default: "[::]")
port: 4317 # OTEL port (default: 4317)
```
## Example Config File
Here's a complete example `config.yaml`:
```yaml
http:
host: "0.0.0.0"
port: 9090
cors_allowed_origins:
- "http://localhost:3000"
- "https://example.com"
ui:
port: 9091
open_on_startup: true
otel:
host: "[::]"
port: 4317
```
### Environment Variable Substitution
You can use environment variables in your config file:
```yaml
http:
host: "{{ HTTP_HOST }}"
port: 9090
```
Set the environment variable before starting vLLora:
```bash
export HTTP_HOST="127.0.0.1"
vllora serve
```
## Command-Line Arguments
All configuration options can also be set via CLI arguments. Run `vllora serve --help` to see all available options:
```bash
vllora serve --help
```
**Available CLI options:**
- `--host` - Host address to bind to
- `--port` - Backend API port
- `--ui-port` - UI server port
- `--cors-origins` - Comma-separated list of allowed CORS origins
- `--open-ui-on-startup` - Control browser auto-open on startup
CLI arguments override corresponding config file settings when both are specified.
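For example, a typical invocation combining several of these options might look like the sketch below (the exact value syntax is an assumption; run `vllora serve --help` for the authoritative usage):
```bash
# Start vLLora with an explicit backend port, UI port, and CORS allow-list
vllora serve --port 9090 --ui-port 9091 --cors-origins "http://localhost:3000,https://example.com"
```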
## Port Conflicts
If a configured port is already in use, vLLora will automatically find the next available port and prompt you to accept the change. You can accept or reject the new port assignment.
---
## Custom Endpoints
Connect your own endpoint to any provider in vLLora. This allows you to use custom API gateways, self-hosted models, or OpenAI-compatible proxies.
## Configuration
To configure a custom endpoint:
1. Navigate to **Providers** in the settings.
2. Select the provider you want to configure
3. Select **"Use Custom Endpoint"**
4. Enter your **Endpoint URL** (e.g., `https://api.example.com`)
5. Enter your **API Key**

That's it! vLLora will now use your custom endpoint instead of the default provider endpoint.
## Using Custom Endpoints
No changes are needed when calling models. You continue to use the same model naming convention. For example, if you configured a custom endpoint for OpenAI, you would still call models as:
```text
openai/your-model-name
```
The custom endpoint is used automatically in the background—your API calls remain the same.
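For illustration, a request through vLLora with a custom OpenAI endpoint configured looks exactly like a normal call; the rerouting happens behind the gateway (the model name below is a placeholder):
```bash
# Sent to vLLora as usual; vLLora forwards it to the configured custom endpoint
curl -X POST http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/your-model-name",
    "messages": [{"role": "user", "content": "Hello through my custom endpoint!"}]
  }'
```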
---
## Custom Providers and Models
vLLora is designed to be agnostic and flexible, allowing you to register **Custom Providers** (your own API endpoints) and **Custom Models** (specific model identifiers).
This architecture enables "bring your own endpoint" scenarios, such as connecting to self-hosted inference engines (like Ollama or LocalAI), private enterprise gateways, or standard OpenAI-compatible services.
## The Namespace System
To prevent collisions between different services, vLLora organizes resources using a namespaced format:
```text
provider-name/model-id
```
This structure ensures that a model ID like `llama-3` from a local provider is distinct from `llama-3` hosted on a remote gateway.
Example:
```text
my-gateway/llama-3.3-70b
openai/gpt-4.1
anthropic/claude-3-5-sonnet
```
## Configuration Entry Points
You can configure custom providers and models in two locations within the application:
- Settings: The centralized hub for managing all provider connections and model definitions.
- Chat Model Selector: A quick-action menu allowing you to add new models and providers on the fly without leaving your current thread.
## Adding a Custom Provider

To connect an external service, click Add Provider in the Settings menu. This opens the configuration modal where you define the connection details and register initial models.
| Field | Required | Description |
| ------ | -------- | ----------- |
| Provider Name | Required | A unique identifier that becomes the namespace for your models (e.g., entering `ollama` results in `ollama/model-id`). |
| Description | Optional | A short note to help you identify the purpose of this provider (e.g., "Local dev server" or "Company Gateway"). |
| API Type | Required | The communication protocol used by the upstream API. Select OpenAI-compatible for most standard integrations (Ollama, vLLM, LocalAI). See the full list of [Supported API Protocols](#supported-api-protocols). |
| Base Endpoint URL | Required | The full URL to the upstream API. Ensure this includes the version suffix if required (e.g., `http://localhost:11434/v1`). |
| API Key | Optional | The authentication token. This is stored securely and used for all requests to this provider. Leave blank for local tools that do not require auth. |
### Registering Models Inline
The Models section at the bottom of the modal allows you to register Model IDs immediately while creating the provider.
- Add Model ID: Type the exact ID used by the upstream API (e.g., `llama3.2:70b` or `gpt-4-turbo`) and press Enter (or click the + button).
- Configure Details: You can add more details, such as the context size and capabilities like `tools` and `reasoning` support.
## Adding a Custom Model
If you already have a provider set up—or want to quickly add a single model—use the Add Custom Model button (found in Settings or the Chat Model Selector).

### Configuration Flow
1. Provider: Select the upstream provider.
- Existing: Choose a provider you have already configured.
- Create New: Select "Create New Provider" to open the full Provider configuration modal described above.
2. Model ID: Enter the specific identifier (e.g., gpt-4o, deepseek-coder).
3. Model Name (Optional): A friendly display name for the UI.
### Advanced Settings
- Context Size: Define the token limit.
- Capabilities: Toggle Tools or Reasoning support.
- Custom Endpoint: Enter a URL here only if this specific model uses a different API endpoint than the provider's default Base URL.
## Using Your Custom Models
Once added, no code changes are required. Models are accessed using the namespaced format:
```text
provider-name/model-id
```
Examples:
- `ollama-local/llama3.2`
- `my-gateway/gpt-4.1`
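Because requests go through vLLora's OpenAI-compatible endpoint, calling a custom model from code is the same as calling any built-in provider. A minimal sketch with the OpenAI Python SDK, assuming a provider registered as `ollama-local` with a model ID of `llama3.2`:
```python
from openai import OpenAI

# Point the SDK at the vLLora gateway (the API key is not validated by vLLora)
client = OpenAI(base_url="http://localhost:9090/v1", api_key="no_key")

# Use the namespaced provider-name/model-id format
response = client.chat.completions.create(
    model="ollama-local/llama3.2",
    messages=[{"role": "user", "content": "Hello from my custom provider!"}],
)
print(response.choices[0].message.content)
```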
### Practical Patterns
- **One provider, many models**: A single gateway entry (e.g., `openai`) hosting multiple IDs (`gpt-4`, `gpt-3.5`).
- **Model-level overrides**: Using the "Custom Endpoint" field in the Add Custom Model flow to point specific models to different URLs while sharing the same API key.
- **Quick add from chat**: Use the link in the Chat Model Selector to add a model while experimenting, then refine its settings later.
## Supported API Protocols
- OpenAI-compatible
- Anthropic
---
## Debugging LLM Requests
vLLora supports interactive debugging for LLM requests. Debug Mode inserts a breakpoint on every outgoing LLM request: vLLora pauses each request before it is sent to the model so you can inspect, edit, and continue it. This lets you debug agent prompts, model selection, tool schemas, and parameters in real time.

With Debug Mode you can:
- **Inspect** the model, messages, parameters, and tool schemas
- **Continue** with the original request
- **Modify** the request and send your edited version instead
## Enable Debug Mode
You can enable Debug Mode directly in the vLLora UI:

1. Open the UI at `http://localhost:9091/chat`.
2. Click the debug mode (bug) icon to turn on Debug Mode.
Once enabled, vLLora pauses every outgoing LLM request so you can inspect, edit, or continue it.
Toggle the icon again to disable Debug Mode and let requests flow normally.
:::tip Debug Mode Scope
Debug Mode affects all requests that flow through vLLora. There are no per-route or per-endpoint settings—when enabled, every LLM request is intercepted.
:::
## Paused Requests
When Debug Mode intercepts a request, vLLora pauses execution and displays a detailed view of the request payload. Here's what you'll see:

The paused view gives you a clear snapshot of the request:
- The selected model
- All messages in the request (user, system, assistant)
- Parameters like temperature or max tokens
- Any additional fields your application included
This data appears exactly as vLLora will send it to the provider.
## Inspecting the Request
When you hover over a paused request in the trace view, the request payload is shown in a JSON viewer so you can quickly read through the structure. This makes it simple to confirm:
- What message content the model is receiving
- Whether parameters are set as expected
- How frameworks or middleware have transformed the request
- Whether the prompt is what you intended to send
## Editing the Request
Click **Edit** to unlock the JSON panel. You can update any part of the payload:

- Change the model
- Modify message content
- Adjust parameters
- Remove or add fields
Edits apply only to this request and do not modify your application code.
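As a purely illustrative sketch, an edited payload might swap the model and tighten the sampling parameters before continuing (all values below are hypothetical):
```json
{
  "model": "openai/gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Plan a 5-day trip to Tokyo in April"}
  ],
  "temperature": 0.2,
  "max_tokens": 512
}
```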
### What Changes Happen
It's important to understand what happens when you edit a request:
- **Only that specific request is modified**: Your edits affect only the current paused request. The changes are not saved to your application code or configuration.
- **The agent continues normally afterward**: After you continue with the edited request, your agent or workflow proceeds as if the request was sent normally—just with your modifications applied.
- **Does not modify application code**: Your source code remains unchanged. This is purely a runtime debugging feature.
:::important Temporary Changes
All edits are temporary and apply only to the single request being debugged. To make permanent changes, you'll need to update your application code or configuration files.
:::
## Continue Execution
After inspecting or editing a paused request, you can continue execution in two ways:
- **From the top panel**: Click the **Continue** button in the paused request panel to send the request (with any edits you made).
- **From the trace view**: Click **"Click to continue"** on a specific paused request in the trace timeline.

### What Continue Does
When you click **Continue**:
1. **Sends the edited request to the model**: The request payload (whether edited or unchanged) is sent to the LLM provider.
2. **Provides the real response from the provider**: You receive the actual model response as if the request had been sent normally.
3. **Resumes the agent or workflow**: Your agent or application continues executing as if nothing changed, using the response from the modified request.
4. **Shows the final output below the editor**: The model's response appears in the trace view, allowing you to see the result of your edits.
:::important
Clicking **Continue** sends the edited request to the provider and your application resumes normally with the new response.
:::
To let requests flow normally again, click the **Stop** button to turn off Debug Mode.
---
Interactive debugging with Debug Mode gives you complete control over LLM requests in real time. Use it to quickly diagnose issues, test parameter changes, and understand exactly what your agents are sending to models without modifying your code.
---
## Installation
vLLora can be installed via Homebrew, the Rust crate, or by building from source.
## Homebrew (macOS & Linux)
Easy install on Linux and macOS using Homebrew.
```bash
brew tap vllora/vllora
brew install vllora
```
Launch vLLora:
```bash
vllora
```
This starts the gateway and opens the UI in your browser.
:::tip Homebrew Setup
New to Homebrew? Check these guides:
- [Homebrew Installation](https://docs.brew.sh/Installation)
- [Homebrew on Linux](https://docs.brew.sh/Homebrew-on-Linux)
:::
## Install with Rust Crate
Install vLLora directly from [crates.io](https://crates.io/crates/vllora) using Cargo. This is the recommended installation method for Linux users who have Rust installed.
:::tip Prerequisites
Make sure you have Rust and Cargo installed. Visit [rustup.rs](https://rustup.rs/) to get started.
:::
### Install Command
Install vLLora using Cargo:
```bash
cargo install vllora
```
This will download and compile vLLora from the published crate on crates.io.
### Launch vLLora
After installation, launch vLLora:
```bash
vllora
```
This starts the gateway and opens the UI in your browser at [http://localhost:9091](http://localhost:9091).
## Build from Source
Want to contribute or run the latest development version? Clone the [GitHub repository](https://github.com/vllora/vllora) (including submodules) and build from source:
```bash
git clone https://github.com/vllora/vllora.git
cd vllora
git submodule update --init --recursive
cargo run serve
```
This will start the gateway on `http://localhost:9090` with the UI available at `http://localhost:9091` in your browser.
:::tip Development Setup
Make sure you have Rust installed. Visit [rustup.rs](https://rustup.rs/) to get started.
:::
---
## Introduction
Debug your AI agents with complete visibility into every request. vLLora works out of the box with OpenAI-compatible endpoints, supports 300+ models with your own keys, and captures deep traces on latency, cost, and model output.

## Installation
Easy install on Linux and macOS using Homebrew.
```bash
brew tap vllora/vllora
brew install vllora
```
Launch vLLora:
```bash
vllora
```
For more installation options (Rust crate, build from source, etc.), see the [Installation](./installation) page.
## Send your First Request
After starting vLLora, visit http://localhost:9091 to configure your API keys through the UI. Once configured, point your application to http://localhost:9090 as the base URL.
vLLora works as a drop-in replacement for the OpenAI API, so you can use any OpenAI-compatible client or SDK. Every request will be captured and visualized in the UI with full tracing details.
```bash
curl -X POST http://localhost:9090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "Hello, vLLora!"
}
]
}'
```
Now check `http://localhost:9091` to see your first trace with full request details, costs, and timing breakdowns.
---
## License
vLLora is [fair-code](https://faircode.io/) distributed under the **Elastic License 2.0 (ELv2)**.
By using vLLora, you agree to all of the terms and conditions of the Elastic License 2.0.
## Rust Crate (`vllora_llm`)
The `vllora_llm` Rust crate is a separate, embeddable component released under the **Apache License 2.0**. This crate is intentionally licensed separately to ensure that when you embed it in your own product, you're not inheriting any license-key or licensing concerns in your codebase.
See the [Rust crate license documentation](./vllora-llm/license) for more details.
## Enterprise
Proprietary licenses are available for enterprise customers. Please reach out via [email](mailto:hello@vllora.dev).
## Full License Text
The complete Elastic License 2.0 text is available at [www.elastic.co/licensing/elastic-license](https://www.elastic.co/licensing/elastic-license).
---
## Lucy
**Lucy** is an in-app AI assistant that inspects your traces to diagnose errors, latency issues, and high costs. It replaces manual scrolling with automated root-cause analysis.
## Quick Start
Lucy is available globally in the vLLora dashboard.
1. **Open any Trace Details or Thread**.
2. **Click the Lucy icon** in the bottom right corner.
3. **Ask a question** (e.g., "What went wrong here?").
### Example Commands
* **General Debugging:** *"What is wrong with this thread?"*
* **Performance:** *"Show me the slowest operations."*
* **Cost Analysis:** *"Why is this run costing so much?"*
* **Specific Errors:** *"Why did the `research_flights` tool fail?"*
## Failure Detection Capabilities
Lucy automatically detects specific patterns common in agentic workflows:
| Issue Type | Description |
| :--- | :--- |
| **Schema Mismatches** | The model is hallucinating arguments (e.g., `checkinDate` vs `check_in_date`) or sending wrong data types. |
| **Prompt Contradictions** | The system prompt contains conflicting instructions (e.g., "Use tools" vs "Do not use external data"). |
| **Silent Truncation** | The model output was cut off by `max_tokens`, but the HTTP request appeared successful. |
| **Retry Loops** | The agent is repeatedly failing the same step without changing its approach. |
## Example Analysis
Lucy provides a structured breakdown of issues found in the trace, sorted by severity.

In the example above, Lucy detected:
1. **High Severity:** A schema mismatch where the model used invalid arguments for a tool call.
2. **Medium Severity:** A logic conflict in the system prompt causing the agent to hesitate.
3. **Low Severity:** Output truncation in a data extraction step.
---
## MCP Support
vLLora provides full support for **Model Context Protocol (MCP)** servers, enabling seamless integration with external tools over HTTP and SSE connections. When your model requests a tool call, vLLora automatically executes the MCP tool call on your behalf and returns the results to the model, allowing your AI models to dynamically access external data sources, APIs, databases, and tools during conversations.
## What is MCP?
**Model Context Protocol (MCP)** is an open standard that enables AI models to seamlessly communicate with external systems. It allows models to dynamically process contextual data, ensuring efficient, adaptive, and scalable interactions. MCP simplifies request orchestration across distributed AI systems, enhancing interoperability and context-awareness.
With native tool integrations, MCP connects AI models to APIs, databases, local files, automation tools, and remote services through a standardized protocol. Developers can effortlessly integrate MCP with IDEs, business workflows, and cloud platforms, while retaining the flexibility to switch between LLM providers. This enables the creation of intelligent, multi-modal workflows where AI securely interacts with real-world data and tools.
For more details, visit the [Model Context Protocol official page](https://modelcontextprotocol.io/introduction) and explore [Anthropic MCP documentation](https://docs.anthropic.com/en/docs/build-with-claude/mcp).
## Using MCP with vLLora
vLLora supports two ways to use MCP servers:
1. **Configure MCP servers in settings** - Set up MCP servers through the vLLora UI and use them in Chat
2. **Send MCP servers in request body** - Include MCP server configuration directly in your chat completions API request
## Method 1: Configure MCP Servers in Settings
You can configure MCP servers through the vLLora settings. Once configured, these servers will be available for use in the chat interface.
1. Navigate to the Settings section in the vLLora UI
2. Add your MCP server configuration

3. Use the configured servers in your chat conversations

:::tip Settings Configuration
MCP servers configured in settings are persistent and available across all your projects. This is ideal for frequently used MCP servers.
:::
## Method 2: Send MCP Servers in Request Body
You can include MCP server configuration directly in your chat completions request body. This method gives you full control over which MCP servers to use for each request.
### Request Format
Add an `mcp_servers` array to your chat completions request body:
```json
{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "use deepwiki and get information about java"
}
],
"stream": true,
"mcp_servers": [
{
"type": "http",
"server_url": "https://mcp.deepwiki.com/mcp",
"headers": {},
"env": null
}
]
}
```
### MCP Server Configuration
Each MCP server in the `mcp_servers` array supports the following configuration:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | Yes | Connection type for MCP server. Must be one of: `"ws"` (WebSocket), `"http"`, or `"sse"` (Server-Sent Events) |
| `server_url` | string | Yes | URL for the MCP server connection. Supports WebSocket (wss://), HTTP (https://), and SSE (https://) endpoints |
| `headers` | object | No | Custom HTTP headers to send with requests to the MCP server (default: `{}`) |
| `env` | object/null | No | Environment variables for the MCP server (default: `null`) |
| `filter` | array | No | Optional filter to limit which tools/resources are available from this server. Each item should have a `name` field (and optionally `description`). Supports regex patterns in the name field |
### Complete Example
Here's a complete example using multiple MCP servers:
```json
{
"model": "openai/gpt-4o-mini",
"messages": [
{
"role": "user",
"content": "use deepwiki and get information about java"
}
],
"stream": true,
"mcp_servers": [
{
"type": "http",
"server_url": "https://mcp.deepwiki.com/mcp",
"headers": {},
"env": null
},
{
"type": "http",
"server_url": "https://remote.mcpservers.org/edgeone-pages/mcp",
"headers": {},
"env": null
}
]
}
```
### Using Filters
You can optionally filter which tools or resources are available from an MCP server by including a `filter` array:
```json
{
"mcp_servers": [
{
"filter": [
{
"name": "read_wiki_structure"
},
{
"name": "read_wiki_contents"
},
{
"name": "ask_question"
}
],
"type": "http",
"server_url": "https://mcp.deepwiki.com/mcp",
"headers": {},
"env": null
}
]
}
```
When `filter` is specified, only the tools/resources matching the filter criteria will be available to the model.
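Because the `name` field supports regex patterns, a single filter entry can match a whole family of tools. A hedged sketch (assuming the upstream server exposes tools whose names start with `read_wiki_`):
```json
{
  "mcp_servers": [
    {
      "filter": [
        {
          "name": "read_wiki_.*"
        }
      ],
      "type": "http",
      "server_url": "https://mcp.deepwiki.com/mcp"
    }
  ]
}
```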
## How MCP Tool Execution Works
When you include MCP servers in your request, vLLora:
1. **Connects to the MCP server** - Establishes a connection using the specified transport type (HTTP, SSE, or WebSocket)
2. **Discovers available tools** - Retrieves the list of tools and resources exposed by the MCP server
3. **Makes tools available to the model** - The model can see and request these tools during the conversation
4. **Executes tool calls automatically** - When the model requests a tool call, vLLora executes it on the MCP server and returns the results
5. **Traces all interactions** - All MCP tool calls, their parameters, and results are captured in vLLora's tracing system
This means you don't need to handle tool execution yourself—vLLora manages the entire MCP workflow, from connection to execution to result delivery.
## Code Examples
```python title="Python (OpenAI SDK)"
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9090/v1",
    api_key="no_key",
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "use deepwiki and get information about java"
        }
    ],
    stream=True,
    extra_body={
        "mcp_servers": [
            {
                "type": "http",
                "server_url": "https://mcp.deepwiki.com/mcp",
            }
        ]
    }
)
```
```bash title="curl"
curl -X POST 'http://localhost:9090/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "use deepwiki and get information about java"
      }
    ],
    "mcp_servers": [
      {
        "type": "http",
        "server_url": "https://mcp.deepwiki.com/mcp"
      }
    ]
  }'
```
```typescript title="TypeScript (OpenAI SDK)"
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:9090/v1',
  apiKey: 'no_key',
});

const response = await client.chat.completions.create({
  model: 'openai/gpt-4o-mini',
  messages: [
    {
      role: 'user',
      content: 'use deepwiki and get information about java',
    },
  ],
  // @ts-expect-error mcp_servers is a vLLora extension
  mcp_servers: [
    {
      type: 'http',
      server_url: 'https://mcp.deepwiki.com/mcp'
    },
  ],
});
```
---
## Quickstart
Get up and running with vLLora in minutes. This guide will help you install vLLora, set up a provider, and start debugging your AI agents immediately.
## Step 1: Install vLLora
Follow the Installation guide in [Introduction](/docs/#installation) (Homebrew or [Build from Source](/docs/installation#build-from-source)).
## Step 2: Set up vLLora with the provider of your choice
Let’s take OpenAI as an example: open the UI at http://localhost:9091, select the OpenAI card, and paste your API key. Once saved, you’re ready to send requests. Other providers follow the same flow.

## Step 3: Start the Chat
Go to the Chat Section to send your first request. You can use either the Chat UI or the curl request provided there.

## Step 4: Using vLLora with your existing AI Agents
vLLora is OpenAI-compatible, so you can point your existing agent frameworks (LangChain, CrewAI, Google ADK, custom apps, etc.) to vLLora without code changes beyond the base URL.
### Code Examples
```python title="Python (OpenAI SDK)"
from openai import OpenAI

client = OpenAI(
    # highlight-next-line
    base_url="http://localhost:9090/v1",
    api_key="no_key",  # vLLora does not validate this token
)

completion = client.chat.completions.create(
    # highlight-next-line
    model="openai/gpt-4o-mini",  # Use ANY model supported by vLLora
    messages=[
        {"role": "system", "content": "You are a senior AI engineer. Output two parts: SUMMARY (bullets) and JSON matching {service, endpoints, schema}. Keep it concise."},
        {"role": "user", "content": "Design a minimal text-analytics microservice: word_count, unique_words, top_tokens, sentiment; include streaming; note auth and rate limits."},
    ],
)
```
```python title="LangChain (Python)"
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(
    # highlight-next-line
    base_url="http://localhost:9090/v1",
    # highlight-next-line
    model="openai/gpt-4o-mini",
    api_key="no_key",
    temperature=0.2,
)

response = llm.invoke([HumanMessage(content="Hello, vLLora!")])
print(response)
```
```bash title="curl"
curl -X POST 'http://localhost:9090/v1/chat/completions' \
  -H 'x-project-id: 61a94de7-7d37-4944-a36a-f1a8a093db51' \
  -H 'x-thread-id: 56fe0e65-f87c-4dde-b053-b764e52571a0' \
  -H 'content-type: application/json' \
  -d '{
    "model": "openai/gpt-4.1-nano",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": true
  }'
```
### Traces View
After running, you'll see the full trace.

---
You're all set! vLLora is now capturing every request, showing you token usage, costs, and execution timelines. Click on any trace in the UI to view detailed breakdowns of each step. Keep the UI open while you build to debug your AI agents in real-time.
For more advanced tracing support (custom spans, nested operations, metadata), check out the vLLora Python library in the documentation.
---
## Roadmap
- [x] Realtime Tracing
- [x] MCP Server Config: Configure MCP servers (ws, http, sse) for capabilities like web search
- [x] Experiment with traces: Rerun requests with modified prompts, tools, and configs from existing traces
- [x] Debug Mode: Pause and inspect execution at breakpoints during trace analysis
- [ ] OpenAI Responses API: Full support for OpenAI API compatibility
- [ ] Evaluations and finetuning: Built-in evaluation tools and fine-tuning support (we are testing internally)
- [ ] Support for Agno: Full tracing and observability for Agno framework
- [ ] Support for LangGraph: Full tracing and observability for LangGraph
- [ ] Support for CrewAI: Full tracing and observability for CrewAI
---
## vLLora CLI
Use vLLora from the terminal and scripts. The CLI is designed for **fast iteration**, **local reproduction**, and **automation workflows**—perfect when you need to query traces, export data, or check recent failures without opening a browser or IDE.
The CLI is not "MCP but in terminal." It's built for **non-agent, non-editor workflows** where you want direct command-line access to trace data.
## Quick Start
The core workflow is:
### **Find a trace**
```bash
vllora traces list --last-n-minutes 60 --limit 20
```
```text
+--------------------------------------+----------------------+------------------+--------+---------------+---------------------+--------------------------------------+--------------------------------------+
| Trace ID | Span ID | Operation | Status | Duration (ms) | Start Time | Run ID | Thread ID |
+--------------------------------------+----------------------+------------------+--------+---------------+---------------------+--------------------------------------+--------------------------------------+
| a7838793-6421-43b9-9dcb-0bc08fc6ab6f | 13919283956904092872 | openai | ✓ OK | 14312 | 2025-12-23 05:04:38 | 4ea18f79-4c4c-4d2c-b628-20d510af7181 | 7510b431-109c-42b2-a858-f05c29a4f952 |
+--------------------------------------+----------------------+------------------+--------+---------------+---------------------+--------------------------------------+--------------------------------------+
| a7838793-6421-43b9-9dcb-0bc08fc6ab6f | 314675728497877876 | run | ✓ OK | 14320 | 2025-12-23 05:04:38 | 4ea18f79-4c4c-4d2c-b628-20d510af7181 | 7510b431-109c-42b2-a858-f05c29a4f952 |
+--------------------------------------+----------------------+------------------+--------+---------------+---------------------+--------------------------------------+--------------------------------------+
... truncated ...
```
### **Inspect the run**
```bash
vllora traces run-info --run-id <RUN_ID>
```
```text
Run Overview:
+--------------+--------------------------------------+
| Field | Value |
+--------------+--------------------------------------+
| Run ID | 4ea18f79-4c4c-4d2c-b628-20d510af7181 |
| Status | ok |
| Start Time | 2025-12-23T05:02:52.801745+00:00 |
| Duration | 120114 ms |
| Root Span ID | 10384579106551160164 |
+--------------+--------------------------------------+
LLM Calls (18):
+----------------------+----------+--------------+----------+-------+
| Span ID | Provider | Model | Messages | Tools |
+----------------------+----------+--------------+----------+-------+
| 12495210593948314377 | openai | gpt-4.1-mini | 30 | 0 |
+----------------------+----------+--------------+----------+-------+
... truncated ...
```
### **Inspect an LLM call**
```bash
vllora traces call-info --span-id <SPAN_ID>
```
```json
{
"span_id": "12495210593948314377",
"trace_id": "40c1a59d-5d10-47c5-8e68-65dcf7a31668",
"run_id": "4ea18f79-4c4c-4d2c-b628-20d510af7181",
"thread_id": "7510b431-109c-42b2-a858-f05c29a4f952",
"duration_ms": 1515,
"costs": "0.0016456000245213508",
"raw_request": "{\"messages\":[{\"role\":\"system\",\"content\":\"...\"},{\"role\":\"user\",\"content\":[{\"type\":\"text\",\"text\":\"Plan a 5-day trip to Tokyo in April\"}]}],\"model\":\"gpt-4.1-mini\",\"stream\":false,\"temperature\":0.7,\"tool_choice\":\"auto\",\"tools\":[...]}",
"raw_response": "{\"id\":\"chatcmpl_...\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"tool_calls\":[{\"id\":\"call_...\",\"type\":\"function\",\"function\":{\"name\":\"research_destination\",\"arguments\":\"{\\\"destination\\\":\\\"Tokyo\\\"}\"}}]},\"finish_reason\":\"tool_calls\"}],\"model\":\"gpt-4.1-mini-2025-04-14\",\"usage\":{\"prompt_tokens\":3910,\"completion_tokens\":51,\"total_tokens\":3961}}"
}
```
## Commands
### `traces list`
Search/list traces by various criteria.
```bash
vllora traces list [OPTIONS]
```
**Options:**
- `--limit` - Limit number of results (default: 20)
- `--offset` - Offset for pagination (default: 0)
- `--run-id` - Filter by run ID
- `--thread-id` - Filter by thread ID
- `--operation-name` - Filter by operation name: `run`, `agent`, `task`, `tools`, `openai`, `anthropic`, `bedrock`, `gemini`, `model_call`
- `--text` - Text search query
- `--last-n-minutes` - Filter traces from last N minutes
- `--sort-by` - Sort by field (default: `start_time`)
- `--sort-order` - Sort order: `asc` or `desc` (default: `desc`)
- `--output