@just-every/benchmark

Comprehensive benchmarking system for @just-every/ensemble that automatically tests models across relevant datasets based on their model class.


Overview

The benchmark system provides automated performance testing for LLM models across industry-standard datasets. It intelligently selects appropriate benchmarks based on model capabilities (reasoning, code, vision, etc.) and provides detailed performance metrics including accuracy, latency, and cost analysis.

Perfect for comparing models, tracking performance over time, and making informed decisions about which models to use for specific tasks.

Features

  • 🎯 Automatic Dataset Selection - Matches benchmarks to model capabilities
  • 📊 Comprehensive Metrics - ROUGE scores, F1, exact match, latency stats
  • 🏆 Winner Determination - Task-specific weighted scoring
  • 💰 Cost Analysis - Track API costs across providers
  • 🔄 Model Class Support - Specialized benchmarks for each model type
  • 📈 Detailed Reporting - Export results to JSON for analysis

Prerequisites

  • Node.js 18.x or higher
  • API keys for LLM providers you want to benchmark
  • @just-every/ensemble (installed as a dependency)

Environment Setup

Copy .env.example to .env and add your API keys:

cp .env.example .env
# Edit .env and add your API keys

Supported providers:

  • OpenAI (OPENAI_API_KEY)
  • Anthropic (ANTHROPIC_API_KEY)
  • Google (GOOGLE_API_KEY)
  • DeepSeek (DEEPSEEK_API_KEY)
  • xAI (XAI_API_KEY)
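
A filled-in .env might look like the sketch below; the variable names come from the list above, the values are placeholders, and you only need keys for the providers you actually plan to benchmark:

OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
DEEPSEEK_API_KEY=your-deepseek-key
XAI_API_KEY=your-xai-key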

Installation

npm install
npm run build

Usage

Primary Usage - Class-Based Benchmarking

The benchmark system automatically selects appropriate datasets based on the model class:

# Benchmark all models in the 'summary' class
npx tsx src/cli.ts benchmark --class summary

# Benchmark all models in the 'reasoning' class with more samples
npx tsx src/cli.ts benchmark --class reasoning --samples 20

# Test a specific model (even if not in the class)
npx tsx src/cli.ts benchmark --class code --model gpt-4o

# Save results to file
npx tsx src/cli.ts benchmark --class standard --output results/standard-benchmark.json

List Available Model Classes

npx tsx src/cli.ts list-classes

This shows all model classes and their associated datasets.

Model Class → Dataset Mapping

Each model class is automatically tested on relevant datasets (a rough code sketch of this mapping follows the list):

  • standard: MMLU, HellaSwag (general knowledge & reasoning)
  • mini: Same datasets as standard, but with fewer samples
  • reasoning: GSM8K, ARC Challenge (math & logic)
  • reasoning_mini: Smaller reasoning datasets
  • code: HumanEval, MBPP (code generation)
  • writing: Writing prompts (creative writing)
  • summary: XSum, CNN/DailyMail (summarization)
  • vision: VQA, COCO Captions (multimodal)
  • search: Natural Questions (retrieval-augmented QA)
  • And more...
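
Purely as an illustration, this mapping can be thought of as a simple lookup table like the sketch below. The variable name, dataset identifiers, and shape are assumptions for readability, not the repository's actual code:

// Illustrative only - see the benchmark source for the real mapping
const datasetsByClass: Record<string, string[]> = {
  standard: ['mmlu', 'hellaswag'],
  mini: ['mmlu', 'hellaswag'],            // same datasets, fewer samples
  reasoning: ['gsm8k', 'arc-challenge'],
  reasoning_mini: ['gsm8k', 'arc-challenge'],
  code: ['humaneval', 'mbpp'],
  writing: ['writing-prompts'],
  summary: ['xsum', 'cnn-dailymail'],
  vision: ['vqa', 'coco-captions'],
  search: ['natural-questions'],
};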

Available Model Classes

  • standard - General purpose models
  • mini - Small, fast models
  • reasoning - Models optimized for reasoning tasks
  • reasoning_mini - Smaller reasoning models
  • code - Code generation models
  • writing - Writing and content creation
  • summary - Summarization models
  • vision - Vision-capable models
  • vision_mini - Smaller vision models

Development Mode

# Run directly with tsx
npx tsx src/cli.ts run --dataset cnn-dailymail --model-class summary

# List available datasets
npx tsx src/cli.ts list

Testing

Run a simple test to ensure ensemble is working:

npx tsx test.ts

Output

The benchmark will display:

  • Performance metrics (ROUGE scores, F1, exact match)
  • Latency statistics (average, p95)
  • Error rates
  • Winner determination based on task-specific weights

Results can be saved to JSON for later analysis:

npm run benchmark run --dataset squad --model-class qa --output squad-results.json
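
As a rough sketch of how a saved results file could be analyzed afterwards, the TypeScript below loads a hypothetical summary-results.json and ranks models by a task-specific weighted score. The field names (rouge1, rougeL, avgLatency) are inferred from the example output in the next section and the weights are arbitrary; the actual JSON schema and scoring weights may differ:

import { readFileSync } from 'node:fs';

// Hypothetical shape, inferred from the example output below - not the actual schema.
interface ModelResult {
  model: string;
  rouge1: number;
  rouge2: number;
  rougeL: number;
  avgLatency: number; // milliseconds
}

const results: ModelResult[] = JSON.parse(readFileSync('summary-results.json', 'utf8'));

// Task-specific weighted score: favour ROUGE, lightly penalise latency.
const score = (r: ModelResult) =>
  0.5 * r.rouge1 + 0.3 * r.rougeL + 0.2 * r.rouge2 - 0.0001 * r.avgLatency;

for (const r of [...results].sort((a, b) => score(b) - score(a))) {
  console.log(`${r.model}: ${score(r).toFixed(3)}`);
}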

Example Output

📊 Benchmark Results for Model Class: summary

Average Scores Across All Datasets:
┌─────────────────┬────────┬────────┬────────┬────────────┬────────────┐
│ Model           │ rouge1 │ rouge2 │ rougeL │ avgLatency │ p95Latency │
├─────────────────┼────────┼────────┼────────┼────────────┼────────────┤
│ summary         │ 0.245  │ 0.082  │ 0.216  │ 4521ms     │ 5123ms     │
│ gpt-4o-mini     │ 0.221  │ 0.072  │ 0.196  │ 3890ms     │ 4234ms     │
│ claude-3-haiku  │ 0.238  │ 0.079  │ 0.208  │ 2145ms     │ 2456ms     │
│ gemini-1.5-flash│ 0.252  │ 0.085  │ 0.223  │ 3234ms     │ 3567ms     │
└─────────────────┴────────┴────────┴────────┴────────────┴────────────┘

📈 Detailed Results by Dataset:

Dataset: xsum
┌─────────────────┬────────┬────────┬────────┬────────┬─────────┐
│ Model           │ rouge1 │ rouge2 │ rougeL │ Errors │ Duration│
├─────────────────┼────────┼────────┼────────┼────────┼─────────┤
│ summary         │ 0.251  │ 0.089  │ 0.224  │ 0      │ 8.2s    │
│ gpt-4o-mini     │ 0.232  │ 0.078  │ 0.205  │ 0      │ 7.1s    │
└─────────────────┴────────┴────────┴────────┴────────┴─────────┘

πŸ† Best Overall Model: gemini-1.5-flash

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Lint code
npm run lint

Contributing

Contributions are welcome! To add new datasets or improve benchmarking:

  1. Fork the repository
  2. Create a feature branch
  3. Add your dataset in src/datasets/ (see the sketch after this list)
  4. Update model class mappings
  5. Submit a pull request
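
As a loose sketch of step 3, a new dataset module might look something like the following. The interface names, fields, and export pattern are assumptions for illustration; mirror an existing module in src/datasets/ for the real shape:

// Hypothetical shape - copy the structure of an existing dataset module rather than this sketch.
interface BenchmarkSample {
  id: string;
  input: string;      // prompt sent to the model
  reference: string;  // expected answer used for scoring
}

interface BenchmarkDataset {
  name: string;
  modelClasses: string[];   // which model classes should run this dataset
  loadSamples(limit: number): Promise<BenchmarkSample[]>;
}

export const myDataset: BenchmarkDataset = {
  name: 'my-dataset',
  modelClasses: ['summary'],
  async loadSamples(limit) {
    // Load or download your data here and return at most `limit` samples.
    const samples: BenchmarkSample[] = [
      { id: '1', input: 'Summarize: ...', reference: '...' },
    ];
    return samples.slice(0, limit);
  },
};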

Troubleshooting

API Key Issues

  • Ensure your .env file has valid API keys
  • Check provider-specific rate limits
  • Verify API key permissions

Performance Issues

  • Reduce --samples for faster testing
  • Use --model to test specific models
  • Check network connectivity to API endpoints

Dataset Errors

  • Ensure datasets are properly downloaded
  • Check dataset format compatibility
  • Verify model supports the dataset type

License

MIT
