A comprehensive benchmark system for evaluating large language model performance on structured data question-answering tasks across multiple formats (JSON, XML, HTML, Markdown, TTL, TXT) and question types.
- Multi-Provider Support: OpenAI (GPT-4, GPT-3.5), Google (Gemini), AWS Bedrock (Llama, Claude, Mistral)
- Multiple Data Formats: JSON, XML, HTML, Markdown, TTL, TXT
- Diverse Tasks: Answer lookup, reverse lookup, aggregation, multi-hop inference, counting, rule-based querying
- Prompt Variants: Test different prompt strategies (with/without role prompting, formatting, etc.)
- Self-Augmentation: Enhanced prompts with structural information
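For intuition, self-augmentation here means prepending a short description of the data's structure to the question prompt. The sketch below is purely illustrative, with hypothetical field names; the actual logic lives in `scripts/generate_prompt_self_augmentation.py` and may work differently.

```python
# Hypothetical illustration of the self-augmentation idea (not the repo's code).
import json

def self_augment(prompt: str, record: dict) -> str:
    """Prepend simple structural information about the data to the prompt."""
    fields = sorted(record.keys())
    structure_note = (
        f"The data is a JSON object with {len(fields)} fields: "
        + ", ".join(fields) + ".\n\n"
    )
    return structure_note + prompt

record = json.loads('{"age": 34, "diagnosis": "asthma", "gender": "F"}')
print(self_augment("What is the respondent's diagnosis?", record))
```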
Setup:

```bash
pip install -r requirements.txt
cp .env.example .env

# Edit .env with your API keys:
# - OPENAI_API_KEY for GPT models
# - GOOGLE_API_KEY for Gemini models
# - AWS_BEARER_TOKEN_BEDROCK for Bedrock models
# - AWS_REGION (e.g., us-east-1)
```

```bash
# Generate base prompts
python scripts/generate_prompts.py
# Optional: Generate prompt variants
python scripts/generate_prompt_variants.py
# Optional: Generate self-augmentation prompts
python scripts/generate_prompt_self_augmentation.py
```

```bash
# Full benchmark with GPT-4o-mini
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini
# Test specific dataset/task
python benchmark_pipeline.py \
--model openai \
--openai-model gpt-4o-mini \
--dataset healthcare-dataset \
--task answer_lookup \
--format json \
--max-cases 5
```

```bash
# Full benchmark with Gemini
python benchmark_pipeline.py --model google --google-model gemini-1.5-flash
# Test specific dataset
python benchmark_pipeline.py \
--model google \
--google-model gemini-1.5-flash \
--dataset healthcare-dataset \
--max-cases 10
```

```bash
# Llama 3.3 70B (us-east-1)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model us.meta.llama3-3-70b-instruct-v1:0 \
--dataset healthcare-dataset \
--task answer_lookup \
--max-cases 5
# Claude 3 Haiku (check region availability)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model anthropic.claude-3-haiku-20240307-v1:0 \
--dataset isbar \
--max-cases 5
# Mistral Large (eu-west-1)
python benchmark_pipeline.py \
--model bedrock \
--bedrock-model mistral.mistral-large-2402-v1:0 \
--dataset sus-uta7 \
--max-cases 5
```

```bash
# Analyze results for a specific model
python scripts/benchmark_analysis.py --model gpt-4o-mini
# Analyze Bedrock model results
python scripts/benchmark_analysis.py --model us.meta.llama3-3-70b-instruct-v1:0
```

Available datasets:

- healthcare-dataset
- isbar
- self-reported-mental-health
- stack-overflow-2022
- sus-uta7
Task types:

- `answer_lookup` - Find specific values
- `answer_reverse_lookup` - Find respondents matching criteria
- `conceptual_aggregation` - Aggregate/count values
- `multi_hop_relational_inference` - Multi-step reasoning
- `respondent_count` - Count respondents matching conditions
- `rule_based_querying` - Apply rules to find answers
Supported formats: JSON, XML, HTML, Markdown, TTL, TXT
Command-line options:

```
--dataset DATASET # Specific dataset to process
--task TASK # Specific task type
--format FORMAT # Data format (json, xml, html, md, ttl, txt)
--model {openai,google,bedrock} # Model provider
--openai-model MODEL # OpenAI model name (default: gpt-3.5-turbo)
--google-model MODEL # Google model name (default: gemini-1.5-flash)
--bedrock-model MODEL # Bedrock model ID
--max-cases N # Limit number of cases per file
--start-case N # Start from specific case number (default: 2)
--variants TYPE # Use prompt variants
--self_aug TYPE # Use self-augmentation prompts
--list # List available options
```

Note: Model availability depends on your AWS region and permissions.
US Regions (us-east-1, us-west-2):
- `us.meta.llama3-3-70b-instruct-v1:0` - Llama 3.3 70B
- `us.anthropic.claude-3-5-sonnet-20240620-v1:0` - Claude 3.5 Sonnet
- `us.anthropic.claude-3-opus-20240229-v1:0` - Claude 3 Opus
EU Regions (eu-west-1):
- `mistral.mistral-large-2402-v1:0` - Mistral Large
- `mistral.mistral-7b-instruct-v0:2` - Mistral 7B
- `amazon.titan-text-express-v1` - Amazon Titan
- `anthropic.claude-3-haiku-20240307-v1:0` - Claude 3 Haiku
Check model availability in your region using the AWS Console or contact your administrator.
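If you have the AWS CLI configured, it can also list the foundation models visible to your account in a region. This is a general AWS CLI call (not a script bundled with this repo), and it uses standard AWS credentials rather than the Bedrock bearer token:

```bash
aws bedrock list-foundation-models --region eu-west-1 \
  --query 'modelSummaries[].modelId' --output text
```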
Project structure:

```
QASU/
├── benchmark_pipeline.py # Main benchmark runner
├── scripts/ # Utility scripts
│ ├── generate_prompts.py # Base prompt generator
│ ├── generate_prompt_variants.py # Variant prompt generator
│ ├── generate_prompt_self_augmentation.py # Self-augmentation prompt generator
│ └── benchmark_analysis.py # Results analyzer
├── utils/ # Core utilities
│ ├── bedrock_client.py # AWS Bedrock API client
│ ├── evaluation.py # Enhanced evaluation functions
│ └── csv_parser.py # Robust CSV parser
├── advanced_prompts/ # Source questionnaire data
├── requirements.txt # Python dependencies
├── .env.example # Environment template
└── README.md # This file
```
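`utils/bedrock_client.py` wraps the AWS Bedrock API. As a rough sketch of what such a client does (not the repository's actual implementation), a minimal call through boto3's Converse API looks like this, assuming AWS credentials are configured (recent boto3 versions can also pick up `AWS_BEARER_TOKEN_BEDROCK`):

```python
# Minimal Bedrock call sketch; the real utils/bedrock_client.py may differ.
import os
import boto3

client = boto3.client("bedrock-runtime", region_name=os.getenv("AWS_REGION", "us-east-1"))

response = client.converse(
    modelId="us.meta.llama3-3-70b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "Answer briefly: what is 2 + 2?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```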
Output directories:

- `converted_prompts/` - Generated prompts
- `benchmark_results/` - Test results
- `analysis_results/` - Analysis outputs
- `benchmark_cache/` - Cached benchmark data
Create a `.env` file with:

```bash
# OpenAI
OPENAI_API_KEY=sk-...

# Google Gemini
GOOGLE_API_KEY=AIza...

# AWS Bedrock
AWS_BEARER_TOKEN_BEDROCK=your-bearer-token
AWS_REGION=us-east-1
```

Tips:

- Start Small: Use `--max-cases 5` for initial tests
- Check Region: Verify model availability in your AWS region
- Monitor Costs: LLM API calls can be expensive for large benchmarks
- Parallel Processing: Run different datasets in parallel for faster results (see the example after this list)
- Review Logs: Check console output for errors or warnings
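A rough illustration of the parallel-processing tip, using two of the datasets listed above (adjust models and options to your setup):

```bash
# Run two datasets in parallel, logging each run separately.
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini \
  --dataset healthcare-dataset --max-cases 5 > healthcare.log 2>&1 &
python benchmark_pipeline.py --model openai --openai-model gpt-4o-mini \
  --dataset isbar --max-cases 5 > isbar.log 2>&1 &
wait  # block until both background runs finish
```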
Troubleshooting:

403 Forbidden Error:
- Your Bearer token may not have access to that region
- Check `AWS_REGION` in `.env` matches your token's permissions
404 Model Not Found:
- Model may not be available in your region
- Try a different model or contact AWS support
Empty Responses:
- Check API credentials are valid
- Verify model has access permissions
- Review CloudWatch logs (AWS)
"No prompts found":
- Run `python scripts/generate_prompts.py` first
"Client initialization failed":
- Check API keys in the `.env` file
- Verify API key format is correct
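If credentials still seem wrong, a quick sanity check is to confirm the keys are actually being read from `.env`. This assumes `python-dotenv` is installed (the project reads a `.env` file, so it is likely already a dependency):

```python
# check_env.py - report which API keys are visible without printing their values.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION"):
    value = os.getenv(key)
    status = f"set ({len(value)} characters)" if value else "MISSING"
    print(f"{key}: {status}")
```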