A comprehensive framework for evaluating LLM performance on application modernization tasks. Automatically generates test cases from Konveyor rules, validates code migrations, and produces professional reports with migration complexity analysis.
- 🤖 Automatic test generation from 2,680+ Konveyor rules with LLM code generation
- 📊 Migration complexity classification (TRIVIAL → EXPERT) with expected AI success rates
- 🔄 Agentic workflows for self-correcting code generation and fixing
- 📈 Professional HTML reports with Grafana-style design and complexity breakdowns
- 🎯 Multi-dimensional evaluation: Functional correctness, security, code quality, explainability
- ⚡ Graceful interruption with progress saving and resume capability
- 🔍 Real Java compilation with 200+ JARs from Maven Central
- 💾 Database storage (SQLite/PostgreSQL) for historical tracking and trend analysis
# 1. Install dependencies
pip install -r requirements.txt
# 2. Configure API keys
cp config.example.yaml config.yaml
# Edit config.yaml with your OpenAI/Anthropic API keys
# 3. Setup Java dependencies (one-time)
cd evaluators/stubs && mvn clean package && cd ../..
# 4. Generate test suite with auto-generated code
python scripts/generate_tests.py --all-rulesets --target quarkus \
--auto-generate --validate --model gpt-4-turbo
# 5. Classify by complexity
python scripts/classify_rule_complexity.py benchmarks/test_cases/generated/quarkus.yaml
# 6. Run evaluation
python evaluate.py --benchmark benchmarks/test_cases/generated/quarkus.yaml
# 7. View results
open results/evaluation_report_*.html

See WORKFLOW.md for the complete recommended workflow.
Automatically categorizes rules by difficulty to set appropriate AI expectations:
| Complexity | AI Success | Example | Automation Strategy |
|---|---|---|---|
| TRIVIAL | 95%+ | javax.* → jakarta.* namespace changes | ✅ Full automation |
| LOW | 80%+ | @Stateless → @ApplicationScoped | ✅ Automation with light review |
| MEDIUM | 60%+ | JMS → Reactive Messaging patterns | ⚡ AI acceleration + developer completion |
| HIGH | 30-50% | Spring Security → Quarkus Security | 🔍 AI scaffolding + expert review |
| EXPERT | <30% | Custom security realms, internal APIs | 📋 AI checklist + human migration |
Why it matters: Segmented analysis reveals AI value at each difficulty level, not just overall pass rate.
See docs/COMPLEXITY_CLASSIFICATION.md for details.
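As a rough illustration of how rules map to tiers, the hypothetical heuristic below keys off a rule's message text; it is a sketch only, not the actual logic in scripts/classify_rule_complexity.py:

```python
# Illustrative sketch -- the real heuristics live in scripts/classify_rule_complexity.py.
def classify_rule(rule: dict) -> str:
    """Assign a complexity tier to a Konveyor rule (hypothetical heuristic)."""
    message = (rule.get("message") or "").lower()
    if "javax" in message and "jakarta" in message:
        return "TRIVIAL"   # pure namespace rename
    if "security" in message or "realm" in message:
        return "HIGH"      # behavior-sensitive, needs expert review
    if any(k in message for k in ("jms", "messaging", "transaction")):
        return "MEDIUM"    # pattern change, developer completes the work
    return "LOW"           # simple annotation/API swap by default
```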
Generate test cases from Konveyor rulesets with automatic code generation:
# Generate from specific migration path (Java EE → Quarkus)
python scripts/generate_tests.py --all-rulesets --source java-ee --target quarkus \
--auto-generate --validate --model gpt-4-turbo
# With local rulesets (100x faster, no API rate limits)
git clone https://github.com/konveyor/rulesets.git ~/projects/rulesets
python scripts/generate_tests.py --all-rulesets --target quarkus \
--local-rulesets ~/projects/rulesets \
--auto-generate --validate --model gpt-4-turbo
# Process in batches for large test suites
python scripts/generate_tests.py --all-rulesets --target quarkus \
--auto-generate --model gpt-4-turbo \
--batch-size 20

Features:
- ✅ Auto-generates `code_snippet` and `expected_fix` using LLM
- ✅ `--validate` flag: Agentic workflow compiles and auto-fixes errors
- ✅ Ctrl+C to pause and save progress - resume by re-running same command
- ✅ Auto-detects language from rules (Java, XML, YAML, properties)
- ✅ Includes Konveyor migration guidance in prompts
Output: benchmarks/test_cases/generated/quarkus.yaml
See docs/generating_tests.md for full documentation.
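As a rough sketch of the output shape, each generated entry pairs a violating `code_snippet` with an `expected_fix` plus rule metadata. Field names other than those two are illustrative here; the authoritative schema is in docs/generating_tests.md:

```python
# Rough shape of one generated test case. Only code_snippet/expected_fix are
# confirmed field names; the rest are illustrative.
test_case = {
    "rule_id": "javax-to-jakarta-00001",               # hypothetical Konveyor rule id
    "complexity": "TRIVIAL",                           # added by classify_rule_complexity.py
    "language": "java",
    "code_snippet": "import javax.ws.rs.GET;\n...",    # violating code generated by the LLM
    "expected_fix": "import jakarta.ws.rs.GET;\n...",  # fix validated against the stub JARs
}
```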
# Check compilation status
python scripts/validate_expected_fixes.py benchmarks/test_cases/generated/quarkus.yaml

# Fix by complexity level (recommended: start with TRIVIAL/LOW)
python scripts/fix_expected_fixes.py \
--file benchmarks/test_cases/generated/quarkus.yaml \
--complexity trivial,low
# Preview fixes without applying
python scripts/fix_expected_fixes.py \
--file benchmarks/test_cases/generated/quarkus.yaml \
--dry-run

Features:
- ✅ Agentic workflow: Iterates with compilation error feedback (max 3 attempts)
- ✅ Migration-specific guidance (Java EE → Quarkus patterns)
- ✅ Ctrl+C to pause - progress saved incrementally
- ✅ Auto-skips non-compilable tests (marked with reasons)
- ✅ Validates each fix before applying
See docs/automating_test_case_fixes.md for details.
# Evaluate all complexity levels
python evaluate.py --benchmark benchmarks/test_cases/generated/quarkus.yaml
# Filter by complexity
python evaluate.py \
--benchmark benchmarks/test_cases/generated/quarkus.yaml \
--filter-complexity trivial,low
# Exclude EXPERT level
python evaluate.py \
--benchmark benchmarks/test_cases/generated/quarkus.yaml \
--exclude-complexity expert
# Parallel execution for multiple models
python evaluate.py \
--benchmark benchmarks/test_cases/generated/quarkus.yaml \
--parallel 4

📊 View Live Sample Report - Interactive demo (no installation required)
Professional Grafana-style reports with:
- 🎨 Dark theme with Konveyor branding
- 📊 Complexity breakdown table - Pass rates by difficulty level (TRIVIAL → EXPERT)
- 🏆 Model ranking with composite scoring across all dimensions
- 🔍 Individual test cards with complexity badges
- 📈 Interactive charts - Response time, cost analysis, performance metrics
- 🔒 Security analysis - Vulnerability detection with severity levels
- 💡 Explainability metrics - Comment density, explanation quality scores
- ⚠️ Detailed failure analysis - Diff highlighting, compilation errors
Output: results/evaluation_report_*.html
Tests evaluate LLM performance across six weighted dimensions (see the composite-score sketch after this list):

- **Functional Correctness (40% weight)**
  - Compiles successfully
  - Resolves original violation
  - No new violations introduced
- **Compilation Rate (15% weight)**
  - Valid Java syntax
  - Correct imports and dependencies
- **Code Quality (15% weight)**
  - Cyclomatic complexity
  - Maintainability index
  - Style adherence
- **Security (15% weight)**
  - No SQL injection vulnerabilities
  - No hardcoded credentials
  - Migration-specific checks (lost @RolesAllowed, etc.)
  - Optional Semgrep integration for comprehensive analysis
- **Explainability (10% weight)**
  - Explanation quality score (0-10)
  - Comment density (10-30% ideal)
- **Performance Metrics (5% weight)**
  - Response time
  - Cost efficiency
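A minimal sketch of how those weights can be combined into the composite score used for model ranking; the dimension keys and 0–1 normalization here are illustrative:

```python
# Weights match the dimension list above; per-dimension scores normalized to 0..1.
WEIGHTS = {
    "functional_correctness": 0.40,
    "compilation_rate": 0.15,
    "code_quality": 0.15,
    "security": 0.15,
    "explainability": 0.10,
    "performance": 0.05,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores; missing dimensions count as 0."""
    return sum(weight * scores.get(dim, 0.0) for dim, weight in WEIGHTS.items())
```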
Hybrid approach for comprehensive security scanning:
# Install Semgrep (optional but recommended)
pip install semgrep
# Uncomment in config.yaml:
security:
  tools:
    - semgrep

Without Semgrep: Fast pattern-based checks for common vulnerabilities
With Semgrep: Comprehensive taint analysis and framework-specific rules
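As a hedged sketch of what a pattern-based check looks like (the regexes and names here are illustrative, not the rules shipped in evaluators/security.py):

```python
import re

# Illustrative patterns only; evaluators/security.py ships its own rules.
PATTERNS = {
    "hardcoded_credentials": re.compile(r'password\s*=\s*"[^"]+"', re.IGNORECASE),
    "sql_concatenation": re.compile(r'createQuery\([^)]*\+\s*\w+'),
}

def lost_roles_allowed(original: str, fixed: str) -> bool:
    """Migration-specific check: did the fix drop an @RolesAllowed annotation?"""
    return "@RolesAllowed" in original and "@RolesAllowed" not in fixed

def scan(code: str) -> list[str]:
    """Return the names of all patterns that match the generated code."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(code)]
```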
konveyor-iq/
├── benchmarks/ # Test datasets
│ └── test_cases/
│ └── generated/ # Auto-generated from Konveyor rules
├── config/ # Configuration files
│ └── migration_guidance.yaml # Migration-specific prompts
├── docs/ # Documentation
│ ├── SETUP.md
│ ├── COMPLEXITY_CLASSIFICATION.md
│ ├── generating_tests.md
│ └── automating_test_case_fixes.md
├── evaluators/ # Metric implementations
│ ├── functional.py # Compilation & pattern validation
│ ├── security.py # Security analysis
│ ├── quality.py # Code quality metrics
│ └── stubs/ # Java compilation dependencies
│ ├── pom.xml # Maven dependencies (200+ JARs)
│ └── lib/ # Downloaded JARs
├── models/ # LLM adapters
│ ├── openai_adapter.py
│ ├── anthropic_adapter.py
│ └── google_adapter.py
├── reporters/ # Report generation
│ └── html_reporter.py # Grafana-style HTML reports
├── scripts/ # Utility scripts
│ ├── generate_tests.py # Generate from Konveyor rules
│ ├── classify_rule_complexity.py # Classify difficulty
│ ├── validate_expected_fixes.py # Validate compilation
│ └── fix_expected_fixes.py # Auto-fix errors
├── config.yaml # Main configuration
├── evaluate.py # Main evaluation script
├── WORKFLOW.md # Recommended workflow
└── requirements.txt # Python dependencies
- WORKFLOW.md - Complete workflow from generation to evaluation
- WORKFLOW_GENERATING_TESTS.md - Technical architecture and JAR system
- docs/SETUP.md - Installation and configuration
- docs/COMPLEXITY_CLASSIFICATION.md - Classification system details
- docs/generating_tests.md - Test generation guide
- docs/automating_test_case_fixes.md - Auto-fixing workflow
- docs/validating_expected_fixes.md - Validation guide
Both test generation and fixing use agentic workflows for self-correction:
- Generate code → Compile → Get errors
- Fix errors with LLM (passes error feedback)
- Validate fix → Compile → Repeat if needed (max 3 attempts)
This dramatically improves compilation success rates (50% → 85%+).
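A minimal sketch of that loop, with the LLM and compiler calls passed in as hypothetical helpers rather than the framework's real functions:

```python
MAX_ATTEMPTS = 3  # the cap described above

def generate_with_self_correction(prompt, llm_generate, llm_fix, compile_java):
    """Generate code, compile it, and feed compiler errors back to the LLM.

    llm_generate, llm_fix, and compile_java are caller-supplied callables
    (hypothetical signatures); compile_java returns (ok, errors).
    """
    code = llm_generate(prompt)
    for _ in range(MAX_ATTEMPTS):
        ok, errors = compile_java(code)   # e.g. javac against the stub JARs
        if ok:
            return code, True
        code = llm_fix(code, errors)      # compiler output goes back into the prompt
    return code, False                    # still failing after MAX_ATTEMPTS
```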
Press Ctrl+C to pause long-running scripts:
- Finishes current operation
- Saves all progress to YAML
- Shows summary
- Resume by re-running same command
Supported in: generate_tests.py, fix_expected_fixes.py
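Conceptually, each script installs a SIGINT handler that lets the in-flight item finish and then flushes progress. A minimal sketch under that assumption, with hypothetical stand-ins for the work queue and persistence helpers:

```python
import signal

def run_with_graceful_interrupt(pending_cases, process, save_progress):
    """Finish the in-flight item on Ctrl+C, save progress, then stop.

    pending_cases, process, and save_progress are hypothetical stand-ins for
    the script's own work queue and YAML persistence helpers.
    """
    interrupted = {"flag": False}

    def _on_sigint(signum, frame):
        interrupted["flag"] = True       # don't abort mid-operation

    signal.signal(signal.SIGINT, _on_sigint)

    results = []
    for case in pending_cases:
        results.append(process(case))    # finish the current operation
        save_progress(results)           # progress written incrementally
        if interrupted["flag"]:
            break                        # re-running the command resumes from saved output
    return results
```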
Centralized, technology-specific guidance in config/migration_guidance.yaml:
- Java EE → Quarkus patterns
- Spring Boot → Quarkus patterns
- Messaging migrations
- Security migrations
Automatically injected into prompts based on test suite metadata.
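A hedged sketch of how that injection might look; the key format inside migration_guidance.yaml is an assumption:

```python
import yaml

def migration_guidance(source: str, target: str) -> str:
    """Look up guidance for a source→target pair (key format assumed)."""
    with open("config/migration_guidance.yaml") as f:
        guidance = yaml.safe_load(f) or {}
    return guidance.get(f"{source}-to-{target}", "")

# Injected into the generation prompt, e.g.:
prompt = "Migrate the code below.\n\nMigration guidance:\n" + migration_guidance("java-ee", "quarkus")
```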
When test cases reference Konveyor rules via source URL:
- Framework fetches official rule from GitHub
- Extracts migration guidance (the `message` field)
- Includes it in the LLM prompt under "Konveyor Migration Guidance"
Ensures LLM receives authoritative migration patterns.
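A minimal sketch of that lookup, assuming the source URL points at a raw ruleset YAML in konveyor/rulesets and that the file follows the Konveyor rule schema (`ruleID`, `message`):

```python
import requests
import yaml

def konveyor_message(ruleset_url: str, rule_id: str) -> str:
    """Fetch a ruleset YAML and return the matching rule's message field."""
    text = requests.get(ruleset_url, timeout=30).text  # raw.githubusercontent.com URL assumed
    for rule in yaml.safe_load(text) or []:
        if rule.get("ruleID") == rule_id:
            return rule.get("message", "")
    return ""

# The returned text is appended to the prompt under "Konveyor Migration Guidance".
```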
Edit config.yaml to specify:
- Models: OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Google (Gemini)
- Evaluation dimensions: Enable/disable metrics
- Security tools: Pattern-based or Semgrep
- Report format: HTML, Markdown
- Parallelization: Number of concurrent workers
- Storage: File (JSON), SQLite, or PostgreSQL
Track model performance over time with relational database storage:
# Install dependencies
pip install sqlalchemy
# Enable in config.yaml
storage:
  type: "sqlite"
  path: "konveyor_iq.db"

reporting:
  write_to_database: true
# Initialize database
python db_cli.py init
# Run evaluations (results auto-saved to database)
python evaluate.py --benchmark benchmarks/test_cases/generated/quarkus.yaml
# Query historical data
python db_cli.py query runs # Recent evaluations
python db_cli.py query models --days 30 # Model comparison
python db_cli.py query rules # All rules by pass rate
python db_cli.py query complexity # Pass rates by difficulty
python db_cli.py query regressions      # Detect performance drops

Features:
- 📊 Track model performance trends over time
- 🔍 Identify rules needing prompt engineering
- 💰 Cost analysis and forecasting
- 📈 Regression detection and alerting
- 👥 Team collaboration via PostgreSQL
Backends:
- SQLite - Local database, zero setup (recommended)
- PostgreSQL - Team/production use, multi-user support
- File - Default JSON storage (no historical queries)
See docs/database_storage.md for complete guide.
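For ad-hoc analysis outside db_cli.py, the SQLite file can also be queried directly. The table and column names below are assumptions, so check the schema that `python db_cli.py init` actually creates:

```python
import sqlite3

# Table/column names are assumptions; verify against the schema from `db_cli.py init`.
conn = sqlite3.connect("konveyor_iq.db")
rows = conn.execute(
    """
    SELECT model, complexity,
           AVG(passed) AS pass_rate,
           COUNT(*)    AS tests
    FROM test_results
    GROUP BY model, complexity
    ORDER BY model, complexity
    """
).fetchall()
for model, complexity, pass_rate, tests in rows:
    print(f"{model:20s} {complexity:8s} {pass_rate:.0%} ({tests} tests)")
```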
Pass Rate by Complexity (gpt-4-turbo):
- TRIVIAL: 96% (24/25 tests)
- LOW: 84% (21/25 tests)
- MEDIUM: 68% (17/25 tests)
- HIGH: 44% (11/25 tests)
- EXPERT: 20% (5/25 tests)
Overall: 62.4% (78/125 tests)
This segmentation demonstrates AI value at each difficulty level.
Contributions welcome! Areas of interest:
- New migration patterns for `evaluators/functional.py`
- Additional migration guidance for `config/migration_guidance.yaml`
- Enhanced security rules
- New LLM adapters
Apache 2.0
Quick Links:
- WORKFLOW.md - Recommended workflow
- docs/SETUP.md - Installation guide
- docs/COMPLEXITY_CLASSIFICATION.md - Classification details
- docs/database_storage.md - Database storage guide
- docs/database_setup_guide.md - Database deployment