dilagent 🔍

Agentic root cause analysis through hypothesis-driven debugging

dilagent automates the tedious process of reproducing, diagnosing, and fixing bugs through systematic hypothesis testing and experimentation. It combines LLM agents with structured scientific methodology to identify root causes and validate fixes.

When to Use dilagent

Ideal for:

Complex, hard-to-reproduce bugs (race conditions, timing issues)
Performance regressions without obvious cause
Issues that occur only in specific environments
Bugs with multiple potential root causes
Intermittent or flaky test failures

Not suitable for:

Simple syntax errors or compilation issues
Obvious logic errors that are easily spotted
Issues with clear error messages pointing to the problem

How It Works

dilagent follows a structured, multi-stage approach to debugging:

1. 🔬 Reproduction Stage

First, dilagent attempts to reproduce the issue:

Creates minimal reproducible test cases (repro.ts)
Measures timing characteristics (performance, timeouts, race conditions)
Documents setup requirements
Asks clarifying questions when needed

2. 🧠 Hypothesis Generation

Based on successful reproduction, dilagent generates targeted hypotheses:

Analyzes reproduction data and codebase context
Creates multiple hypotheses (H001, H002, H003...)
Tailors hypotheses to the type of issue (performance, concurrency, logic)
Outputs structured hypotheses.json for tracking

3. 🧪 Hypothesis Testing

Each hypothesis is tested in parallel:

Independent testing in isolated worktrees
Each hypothesis can result in:
- ✅ Proven: Root cause identified
- ❌ Disproven: Not the issue, move on
- ❓ Inconclusive: Requires additional data
Counter-experiments validate findings

4. 💬 Interactive Exploration (Optional)

For complex issues requiring human insight:

REPL-based interactive debugging session
Agent-assisted exploration with full context
Direct manipulation and testing

The process continues iteratively until the root cause is found and validated.

System Architecture

Overall Manager Flow

Hypothesis Testing Loop

Key Features

Automated Reproduction: Generates minimal test cases from bug reports
Parallel Hypothesis Testing: Tests multiple theories simultaneously in isolated environments
Counter-Experiment Validation: Prevents false positives through negative testing
Interactive Fallback: REPL mode for complex cases requiring human expertise
Evidence-Based: Every conclusion backed by reproducible experiments
MCP Integration: Leverages Model Context Protocol for tool orchestration

Workflow Stages & Key Files

Stage 0: Setup

dilagent manager setup --working-directory ./debug-session --context-directory ./my-project

Creates .dilagent/ directory structure
Generates context.md with codebase information and issue description

Stage 1: Reproduce

dilagent manager repro --working-directory ./debug-session --llm claude

Creates reproduction.md with steps to reproduce the issue
Generates diagnostic information and error details

Stage 2: Generate Hypotheses

dilagent manager generate-hypotheses --working-directory ./debug-session --hypothesis-count 3 --llm claude

Creates numbered hypothesis directories: H001-config-issue/, H002-race-condition/, etc.
Each contains:
- hypothesis.md - The specific theory about the bug
- instructions.md - Steps to test the hypothesis

Stage 3: Test Hypotheses

dilagent manager run-hypotheses --working-directory ./debug-session --llm claude

Creates git worktrees for parallel testing (e.g., worktree-H001-config-issue/)
AI agents test each hypothesis independently
Updates report.md in each hypothesis directory with findings
Logs stored in .dilagent/H{NNN}-{slug}/hypothesis.log

Example report.md output (click to expand)

# Hypothesis H003: Race Condition in Connection Pool

## Status: ✅ PROVEN

## Initial Analysis
The intermittent timeout errors in production suggested a potential race condition
in the database connection pool management. The error pattern showed:
- Errors occur only under high concurrent load (>100 req/s)  
- Error rate increases exponentially with load
- Database logs show connection pool exhaustion messages
- Issue started appearing after the connection pooling refactor in commit abc123

## Investigation Steps

### Step 1: Reproduce the Issue
Created load test script that successfully reproduced the issue:
```bash
# Load test that triggers the race condition
ab -n 10000 -c 50 http://localhost:3000/api/users/search
# Result: 3.2% failure rate with "connection pool timeout" errors

Step 2: Code Analysis

Identified suspicious code in src/db/pool.ts:42-58:

// PROBLEMATIC: Race condition between check and increment
if (this.activeConnections < this.maxConnections) {
  // Gap here - another request could increment activeConnections
  this.activeConnections++;
  return this.createConnection();
}

Step 3: Root Cause Identification

The race condition occurs when multiple requests simultaneously:

Check activeConnections < maxConnections (both see same value)
Both increment activeConnections
Both attempt to create connections
Total connections exceed maxConnections
Database rejects excess connections
Pool state becomes inconsistent

Fix Applied

Implemented atomic operation using mutex lock:

// FIXED: Atomic check-and-increment
async acquireConnection(): Promise<Connection> {
  return this.mutex.acquire(async () => {
    if (this.activeConnections >= this.maxConnections) {
      throw new PoolExhaustedError();
    }
    this.activeConnections++;
    return this.createConnection();
  });
}

Validation Results

Load Testing

✅ 10,000 requests at 50 concurrent: 0% failure rate
✅ 50,000 requests at 200 concurrent: 0% failure rate
✅ Connection count never exceeds maxConnections under load

Code Review

✅ All database operations now use atomic operations
✅ Added connection pool metrics and monitoring
✅ Updated connection pool tests to include concurrency scenarios

Counter-Experiments

✅ Reverting to old code: Issue reproduces immediately
✅ Artificially increasing load beyond capacity: Proper error handling
✅ Simulating connection failures: Pool recovers correctly

Performance Impact

No measurable latency increase (<1ms)
Memory usage unchanged
CPU overhead negligible

Additional Observations

This pattern exists in 3 other services using the same pooling library
Similar race conditions found in Redis connection pool
Recommended: Audit all resource pooling implementations company-wide

Files Modified

src/db/pool.ts - Fixed race condition
tests/db/pool.test.ts - Added concurrency tests
package.json - Added async-mutex dependency

</details>

### Stage 4: Summary
```bash
dilagent manager summary --working-directory ./debug-session

Aggregates all hypothesis reports
Generates final summary with likely root causes and fixes

Quick Start

# Install
npm install -g dilagent

# Run the complete workflow in one command
dilagent manager all \
  --context-directory ./my-project \
  --working-directory ./debug-session \
  --hypothesis-count 3 \
  --llm claude

# Key options:
# --llm claude|codex - Choose AI model (Claude recommended)
# --working-directory - Where dilagent stores its files
# --context-directory - The codebase to debug
# --hypothesis-count - Number of hypotheses to generate (default: 3)
# --flaky - Use this flag for intermittent/flaky issues
# --repl - Start interactive mode for complex debugging

Troubleshooting

Issue: Reproduction fails

Ensure the bug description is clear and specific
Check that all required dependencies are installed
Verify the issue occurs in the provided codebase

Issue: All hypotheses are inconclusive

Try increasing --hypothesis-count to generate more hypotheses
Add more context to context.md about recent changes
Use --repl mode for manual exploration

Issue: Worktree creation fails

Ensure you're in a git repository
Check that git worktree is supported (Git 2.5+)
Verify sufficient disk space for multiple worktrees

Configuration

Environment Variables:

DILAGENT_CLI_PATH - Automatically set to CLI location for MCP proxy

LLM Tools:

Requires either claude or codex command in your PATH
Claude recommended for best results
Configure your API keys according to your LLM tool's documentation

Working Directory Best Practices:

Use a dedicated directory (e.g., ./debug-sessions/issue-123/)
Keep separate from your main codebase
Clean up old sessions periodically to save disk space

Requirements

Bun 1.2+
Git (for worktree isolation)
Local LLM tool (claude or codex command available in PATH)

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.vscode		.vscode
apps/dilagent-cli		apps/dilagent-cli
diagrams		diagrams
scripts		scripts
specs		specs
test-context-dirs		test-context-dirs
.envrc		.envrc
.gitignore		.gitignore
NOTES.md		NOTES.md
README.md		README.md
biome.json		biome.json
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml
tsconfig.all.json		tsconfig.all.json
tsconfig.base.json		tsconfig.base.json
vitest.config.ts		vitest.config.ts

schickling/dilagent

Folders and files

Latest commit

History

Repository files navigation