Comprehensive benchmarking system for @just-every/ensemble that automatically tests models across relevant datasets based on their model class.
The benchmark system provides automated performance testing for LLM models across industry-standard datasets. It intelligently selects appropriate benchmarks based on model capabilities (reasoning, code, vision, etc.) and provides detailed performance metrics including accuracy, latency, and cost analysis.
Perfect for comparing models, tracking performance over time, and making informed decisions about which models to use for specific tasks.
- Automatic Dataset Selection - Matches benchmarks to model capabilities
- Comprehensive Metrics - ROUGE scores, F1, exact match, latency stats
- Winner Determination - Task-specific weighted scoring
- Cost Analysis - Track API costs across providers
- Model Class Support - Specialized benchmarks for each model type
- Detailed Reporting - Export results to JSON for analysis
- Node.js 18.x or higher
- API keys for LLM providers you want to benchmark
- @just-every/ensemble (installed as a dependency)
Copy `.env.example` to `.env` and add your API keys:

```bash
cp .env.example .env
# Edit .env and add your API keys
```

Supported providers:

- OpenAI (`OPENAI_API_KEY`)
- Anthropic (`ANTHROPIC_API_KEY`)
- Google (`GOOGLE_API_KEY`)
- DeepSeek (`DEEPSEEK_API_KEY`)
- xAI (`XAI_API_KEY`)
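
For reference, a filled-in `.env` might look like the snippet below. The values are placeholders, and only the providers you actually plan to benchmark need a key:

```
# Only the providers you plan to benchmark need a key
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
DEEPSEEK_API_KEY=your-deepseek-key
XAI_API_KEY=your-xai-key
```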
```bash
npm install
npm run build
```

The benchmark system automatically selects appropriate datasets based on the model class:
```bash
# Benchmark all models in the 'summary' class
npx tsx src/cli.ts benchmark --class summary

# Benchmark all models in the 'reasoning' class with more samples
npx tsx src/cli.ts benchmark --class reasoning --samples 20

# Test a specific model (even if not in the class)
npx tsx src/cli.ts benchmark --class code --model gpt-4o

# Save results to file
npx tsx src/cli.ts benchmark --class standard --output results/standard-benchmark.json
```

```bash
npx tsx src/cli.ts list-classes
```

This shows all model classes and their associated datasets.
Each model class is automatically tested on relevant datasets:
- standard: MMLU, HellaSwag (general knowledge & reasoning)
- mini: Same as standard but with smaller samples
- reasoning: GSM8K, ARC Challenge (math & logic)
- reasoning_mini: Smaller reasoning datasets
- code: HumanEval, MBPP (code generation)
- writing: Writing prompts (creative writing)
- summary: XSum, CNN/DailyMail (summarization)
- vision: VQA, COCO Captions (multimodal)
- search: Natural Questions (retrieval-augmented QA)
- And more...
- `standard` - General purpose models
- `mini` - Small, fast models
- `reasoning` - Models optimized for reasoning tasks
- `reasoning_mini` - Smaller reasoning models
- `code` - Code generation models
- `writing` - Writing and content creation
- `summary` - Summarization models
- `vision` - Vision-capable models
- `vision_mini` - Smaller vision models
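
Conceptually, the automatic selection is just a lookup from model class to dataset list. The sketch below illustrates the idea using the lists above; the names (`datasetsByClass`, the dataset identifiers, the `search` entry) are assumptions for illustration, not the benchmark's actual source:

```typescript
// Hypothetical sketch of the class → dataset mapping described above.
// The real mapping lives in the benchmark source; names here are illustrative.
type ModelClass =
  | 'standard' | 'mini'
  | 'reasoning' | 'reasoning_mini'
  | 'code' | 'writing' | 'summary'
  | 'vision' | 'vision_mini' | 'search';

const datasetsByClass: Record<ModelClass, string[]> = {
  standard: ['mmlu', 'hellaswag'],
  mini: ['mmlu', 'hellaswag'],            // same as standard, smaller samples
  reasoning: ['gsm8k', 'arc-challenge'],
  reasoning_mini: ['gsm8k', 'arc-challenge'],
  code: ['humaneval', 'mbpp'],
  writing: ['writing-prompts'],
  summary: ['xsum', 'cnn-dailymail'],
  vision: ['vqa', 'coco-captions'],
  vision_mini: ['vqa', 'coco-captions'],
  search: ['natural-questions'],
};

// Selecting benchmarks for a class is then a simple lookup.
export function datasetsFor(modelClass: ModelClass): string[] {
  return datasetsByClass[modelClass];
}
```

In practice you only pass `--class` on the CLI and this selection happens for you.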
```bash
# Run directly with tsx
npx tsx src/cli.ts run --dataset cnn-dailymail --model-class summary

# List available datasets
npx tsx src/cli.ts list
```

Run a simple test to ensure ensemble is working:

```bash
npx tsx test.ts
```

The benchmark will display:
- Performance metrics (ROUGE scores, F1, exact match)
- Latency statistics (average, p95)
- Error rates
- Winner determination based on task-specific weights
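
Winner determination weights each metric according to the task. The real weights live in the benchmark source; the sketch below only shows the idea of a task-specific weighted score, and the weight values are invented for illustration:

```typescript
// Hypothetical illustration of task-weighted scoring used to pick a winner.
// Metric names mirror the report (rouge1, rouge2, rougeL, avgLatency);
// the weights are invented examples, not taken from the benchmark source.
interface ModelScores {
  model: string;
  metrics: Record<string, number>; // e.g. { rouge1: 0.25, avgLatency: 4500 }
}

const summaryWeights: Record<string, number> = {
  rouge1: 0.4,
  rouge2: 0.2,
  rougeL: 0.4,
};

function weightedScore(scores: ModelScores, weights: Record<string, number>): number {
  return Object.entries(weights).reduce(
    (total, [metric, weight]) => total + (scores.metrics[metric] ?? 0) * weight,
    0,
  );
}

function pickWinner(results: ModelScores[], weights: Record<string, number>): string {
  return results.reduce((best, current) =>
    weightedScore(current, weights) > weightedScore(best, weights) ? current : best,
  ).model;
}
```

For the summary class, ROUGE metrics would carry most of the weight; other classes would presumably emphasize exact match, F1, or latency instead.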
Results can be saved to JSON for later analysis:
```bash
npm run benchmark run --dataset squad --model-class qa --output squad-results.json
```

Benchmark Results for Model Class: summary
Average Scores Across All Datasets:

| Model            | rouge1 | rouge2 | rougeL | avgLatency | p95Latency |
|------------------|--------|--------|--------|------------|------------|
| summary          | 0.245  | 0.082  | 0.216  | 4521ms     | 5123ms     |
| gpt-4o-mini      | 0.221  | 0.072  | 0.196  | 3890ms     | 4234ms     |
| claude-3-haiku   | 0.238  | 0.079  | 0.208  | 2145ms     | 2456ms     |
| gemini-1.5-flash | 0.252  | 0.085  | 0.223  | 3234ms     | 3567ms     |
Detailed Results by Dataset:

Dataset: xsum

| Model       | rouge1 | rouge2 | rougeL | Errors | Duration |
|-------------|--------|--------|--------|--------|----------|
| summary     | 0.251  | 0.089  | 0.224  | 0      | 8.2s     |
| gpt-4o-mini | 0.232  | 0.078  | 0.205  | 0      | 7.1s     |
Best Overall Model: gemini-1.5-flash
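
If you saved results with `--output`, the JSON can also be analyzed programmatically. The schema isn't documented in this section, so the field names in this sketch (`model`, `rouge1`, `avgLatency`, a top-level `results` array) are assumptions to adjust after inspecting a real output file:

```typescript
// Sketch of post-hoc analysis on a saved results file. The JSON shape is an
// assumption here; check an actual --output file and adapt the field access.
import { readFileSync } from 'node:fs';

interface ResultRow {
  model: string;
  rouge1?: number;
  avgLatency?: number;
}

const raw = JSON.parse(readFileSync('results/standard-benchmark.json', 'utf8'));
const rows: ResultRow[] = Array.isArray(raw) ? raw : raw.results ?? [];

// Rank models by ROUGE-1, treating missing scores as 0.
const ranked = [...rows].sort((a, b) => (b.rouge1 ?? 0) - (a.rouge1 ?? 0));
for (const row of ranked) {
  console.log(`${row.model}: rouge1=${row.rouge1 ?? 'n/a'} latency=${row.avgLatency ?? 'n/a'}ms`);
}
```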
```bash
# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Lint code
npm run lint
```

Contributions are welcome! To add new datasets or improve benchmarking:
- Fork the repository
- Create a feature branch
- Add your dataset in `src/datasets/` (see the sketch after this list)
- Update model class mappings
- Submit a pull request
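
The dataset interface isn't documented in this section, so the sketch below is only a plausible shape for a new module under `src/datasets/`; every name in it (`BenchmarkDataset`, `loadSamples`, and so on) is hypothetical:

```typescript
// Hypothetical shape for a dataset module under src/datasets/.
// The actual interface in this repo may differ; treat this as a starting point.
export interface BenchmarkSample {
  input: string;            // prompt sent to the model
  expected: string;         // reference answer used for scoring
}

export interface BenchmarkDataset {
  name: string;             // e.g. 'my-new-dataset'
  modelClasses: string[];   // classes this dataset should be mapped to
  metrics: string[];        // e.g. ['rouge1', 'rougeL'] or ['exact_match', 'f1']
  loadSamples(limit: number): Promise<BenchmarkSample[]>;
}

export const myNewDataset: BenchmarkDataset = {
  name: 'my-new-dataset',
  modelClasses: ['summary'],
  metrics: ['rouge1', 'rouge2', 'rougeL'],
  async loadSamples(limit) {
    // Load or download your data here; return at most `limit` samples.
    return [{ input: 'Summarize: ...', expected: 'A short summary.' }].slice(0, limit);
  },
};
```

If the repository already defines its own dataset type, follow that instead; the point is that a dataset bundles samples, the metrics to score them with, and the model classes it applies to.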
- Ensure your `.env` file has valid API keys
- Check provider-specific rate limits
- Verify API key permissions
- Reduce `--samples` for faster testing
- Use `--model` to test specific models
- Check network connectivity to API endpoints
- Ensure datasets are properly downloaded
- Check dataset format compatibility
- Verify model supports the dataset type
MIT