A comprehensive benchmark system for evaluating large language model performance on structured data question-answering tasks across multiple formats (JSON, XML, HTML, Markdown, TTL, TXT) and question types.
- Multi-Provider Support: OpenAI (GPT-4, GPT-3.5), Google (Gemini), AWS Bedrock (Llama, Claude, Mistral)
- Multiple Data Formats: JSON, XML, HTML, Markdown, TTL, TXT
- Diverse Tasks: Answer lookup, reverse lookup, aggregation, multi-hop inference, counting, rule-based querying
- Prompt Variants: Test different prompt strategies (with/without role prompting, formatting, etc.)
- Self-Augmentation: Enhanced prompts with structural information
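For intuition, self-augmentation here means prepending a short description of the data's structure to the question prompt. The sketch below is purely illustrative, with hypothetical field names; the actual logic lives in `scripts/generate_prompt_self_augmentation.py` and may work differently.

```python
# Hypothetical illustration of the self-augmentation idea (not the repo's code).
import json

def self_augment(prompt: str, record: dict) -> str:
    """Prepend simple structural information about the data to the prompt."""
    fields = sorted(record.keys())
    structure_note = (
        f"The data is a JSON object with {len(fields)} fields: "
        + ", ".join(fields) + ".\n\n"
    )
    return structure_note + prompt

record = json.loads('{"age": 34, "diagnosis": "asthma", "gender": "F"}')
print(self_augment("What is the respondent's diagnosis?", record))
```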
Setup:

```bash
pip install -r requirements.txt
cp .env.example .env

# Edit .env with your API keys:
# - OPENAI_API_KEY for GPT models
# - GOOGLE_API_KEY for Gemini models
# - AWS_BEARER_TOKEN_BEDROCK for Bedrock models
# - AWS_REGION (e.g., us-east-1)
```

```bash
# Generate base prompts
python scripts/generate_prompts.py
# Optional: Generate prompt variants
python scripts/generate_prompt_variants.py
# Optional: Generate self-augmentation prompts
python scripts/generate_prompt_self_augmentation.py
```

```bash
# Full benchmark with GPT-4o-mini
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini
# Test specific dataset/task
python benchmark_pipeline.py \
--model openai \
--openai-model gpt-4o-mini \
--dataset healthcare-dataset \
--task answer_lookup \
--format json \
--max-cases 5
```

```bash
# Full benchmark with Gemini
python benchmark_pipeline.py --model google --google-model gemini-1.5-flash
# Test specific dataset
python benchmark_pipeline.py \
--model google \
--google-model gemini-1.5-flash \
--dataset healthcare-dataset \
--max-cases 10
```

```bash
# Llama 3.3 70B (us-east-1)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model us.meta.llama3-3-70b-instruct-v1:0 \
--dataset healthcare-dataset \
--task answer_lookup \
--max-cases 5
# Claude 3 Haiku (check region availability)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model anthropic.claude-3-haiku-20240307-v1:0 \
--dataset isbar \
--max-cases 5
# Mistral Large (eu-west-1)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model mistral.mistral-large-2402-v1:0 \
--dataset sus-uta7 \
--max-cases 5
```

```bash
# Analyze results for a specific model
python scripts/benchmark_analysis.py --model gpt-4o-mini
# Analyze Bedrock model results
python scripts/benchmark_analysis.py --model us.meta.llama3-3-70b-instruct-v1:0
```

Available datasets:

- healthcare-dataset
- isbar
- self-reported-mental-health
- stack-overflow-2022
- sus-uta7
Task types:

- `answer_lookup` - Find specific values
- `answer_reverse_lookup` - Find respondents matching criteria
- `conceptual_aggregation` - Aggregate/count values
- `multi_hop_relational_inference` - Multi-step reasoning
- `respondent_count` - Count respondents matching conditions
- `rule_based_querying` - Apply rules to find answers
Supported formats: JSON, XML, HTML, Markdown, TTL, TXT
Command-line options:

```
--dataset DATASET # Specific dataset to process
--task TASK # Specific task type
--format FORMAT # Data format (json, xml, html, md, ttl, txt)
--model {openai,google,bedrock} # Model provider
--openai-model MODEL # OpenAI model name (default: gpt-3.5-turbo)
--google-model MODEL # Google model name (default: gemini-1.5-flash)
--bedrock-model MODEL # Bedrock model ID
--max-cases N # Limit number of cases per file
--start-case N # Start from specific case number (default: 2)
--variants TYPE # Use prompt variants
--self_aug TYPE # Use self-augmentation prompts
--list # List available options
```

Note: Model availability depends on your AWS region and permissions.
US Regions (us-east-1, us-west-2):
- `us.meta.llama3-3-70b-instruct-v1:0` - Llama 3.3 70B
- `us.anthropic.claude-3-5-sonnet-20240620-v1:0` - Claude 3.5 Sonnet
- `us.anthropic.claude-3-opus-20240229-v1:0` - Claude 3 Opus
EU Regions (eu-west-1):
- `mistral.mistral-large-2402-v1:0` - Mistral Large
- `mistral.mistral-7b-instruct-v0:2` - Mistral 7B
- `amazon.titan-text-express-v1` - Amazon Titan
- `anthropic.claude-3-haiku-20240307-v1:0` - Claude 3 Haiku
Check model availability in your region using the AWS Console or contact your administrator.
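If you have the AWS CLI configured, it can also list the foundation models visible to your account in a region. This is a general AWS CLI call (not a script bundled with this repo), and it uses standard AWS credentials rather than the Bedrock bearer token:

```bash
aws bedrock list-foundation-models --region eu-west-1 \
  --query 'modelSummaries[].modelId' --output text
```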
Project structure:

```
QASU/
├── benchmark_pipeline.py # Main benchmark runner
├── scripts/ # Utility scripts
│ ├── generate_prompts.py # Base prompt generator
│ ├── generate_prompt_variants.py # Variant prompt generator
│ ├── generate_prompt_self_augmentation.py # Self-augmentation prompt generator
│ └── benchmark_analysis.py # Results analyzer
├── utils/ # Core utilities
│ ├── bedrock_client.py # AWS Bedrock API client
│ ├── evaluation.py # Enhanced evaluation functions
│ └── csv_parser.py # Robust CSV parser
├── advanced_prompts/ # Source questionnaire data
├── requirements.txt # Python dependencies
├── .env.example # Environment template
└── README.md # This file
```
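`utils/bedrock_client.py` wraps the AWS Bedrock API. As a rough sketch of what such a client does (not the repository's actual implementation), a minimal call through boto3's Converse API looks like this, assuming AWS credentials are configured (recent boto3 versions can also pick up `AWS_BEARER_TOKEN_BEDROCK`):

```python
# Minimal Bedrock call sketch; the real utils/bedrock_client.py may differ.
import os
import boto3

client = boto3.client("bedrock-runtime", region_name=os.getenv("AWS_REGION", "us-east-1"))

response = client.converse(
    modelId="us.meta.llama3-3-70b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "Answer briefly: what is 2 + 2?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```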
Output directories:

- `converted_prompts/` - Generated prompts
- `benchmark_results/` - Test results
- `analysis_results/` - Analysis outputs
- `benchmark_cache/` - Cached benchmark data
Create a `.env` file with:

```bash
# OpenAI
OPENAI_API_KEY=sk-...

# Google Gemini
GOOGLE_API_KEY=AIza...

# AWS Bedrock
AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
AWS_REGION=us-east-1
```

Tips:

- Start Small: Use `--max-cases 5` for initial tests
- Check Region: Verify model availability in your AWS region
- Monitor Costs: LLM API calls can be expensive for large benchmarks
- Parallel Processing: Run different datasets in parallel for faster results (see the example after this list)
- Review Logs: Check console output for errors or warnings
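A rough illustration of the parallel-processing tip, using two of the datasets listed above (adjust models and options to your setup):

```bash
# Run two datasets in parallel, logging each run separately.
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini \
  --dataset healthcare-dataset --max-cases 5 > healthcare.log 2>&1 &
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini \
  --dataset isbar --max-cases 5 > isbar.log 2>&1 &
wait  # block until both background runs finish
```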
Troubleshooting:

403 Forbidden Error:
- Your Bearer token may not have access to that region
- Check `AWS_REGION` in `.env` matches your token's permissions
404 Model Not Found:
- Model may not be available in your region
- Try a different model or contact AWS support
Empty Responses:
- Check API credentials are valid
- Verify model has access permissions
- Review CloudWatch logs (AWS)
"No prompts found":
- Run `python scripts/generate_prompts.py` first
"Client initialization failed":
- Check API keys in the `.env` file
- Verify API key format is correct
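If credentials still seem wrong, a quick sanity check is to confirm the keys are actually being read from `.env`. This assumes `python-dotenv` is installed (the project reads a `.env` file, so it is likely already a dependency):

```python
# check_env.py - report which API keys are visible without printing their values.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION"):
    value = os.getenv(key)
    status = f"set ({len(value)} characters)" if value else "MISSING"
    print(f"{key}: {status}")
```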