Peking University & Quwan Ability Evaluation Framework
Quwan: Guangzhou Quwan Network Technology
Hello, future master of LLM evaluation! 👋
Are you still scratching your head over how to scientifically and systematically evaluate the capabilities of Large Language Models? Look no further than PQAEF!
PQAEF is a highly extensible evaluation framework built upon a "Four-Dimensional" concept: Ability -> Level-3 Task -> Data -> Method. Whether you want to benchmark Qwen, GPT-4, or your own local model, PQAEF makes it a breeze.
Get your LLM evaluation running in just three simple steps!
There's no cooler way to start than with a git clone!
Bash
git clone https://github.com/your-repo/PQAEF.git
cd PQAEF
One command to rule them all (the dependencies, that is).
Bash
# Install core dependencies
pip install -r requirements.txt
# Install the framework itself
pip install -e .
- Download a dataset: Grab any dataset you want from places like Hugging Face, Kaggle, etc.
- Configure the path: Create a .yaml config file for your dataset to tell PQAEF where to find it.
We've prepared an example using the AGNews dataset to get you up and running in seconds. Just download the data and fill in the path in the config file!
👉 Example file: ./test/test_AGNews.yaml
YAML
data_loaders:
AGNewsDataLoader:
class: CSVDataLoader
paths:
- /path/to/your/ag_news_csv/test.csv # ✍️ Fill in your own dataset path here!
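Don't have the AGNews CSV on disk yet? One quick way to fetch it is via the Hugging Face datasets library, sketched below. The dataset id and output path here are assumptions, and the exported columns (text, label) may differ from the classic ag_news_csv layout that AGNewsFormatter expects, so double-check the formatter before running.
Python
# Sketch: download the AGNews test split and export it as a CSV file.
# Adjust the output path to match whatever you put in the YAML config above.
from datasets import load_dataset

test_split = load_dataset("ag_news", split="test")       # columns: text, label
test_split.to_csv("/path/to/your/ag_news_csv/test.csv")  # write a plain CSV for CSVDataLoader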
We support both remote APIs and local giants!
In the model_configs.json file, you can configure as many models as you want. The script will automatically call them up one by one to take on the challenge.
JSON
[
{
"model_type": "api",
"model_name": "openai_evaluator",
"class": "ApiModel",
"config": {
"provider": "url",
"model_identifier": "YOUR_MODEL",
"api_key": "YOUR_API_KEY",
"base_url": "YOUR_BASE_URL",
"concurrency": 1
}
},
{
"model_type": "local",
"model_name": "qwen_evaluator",
"class": "LocalModel",
"config": {
"model_path": "/path/to/model/Qwen2.5-7B-Instruct",
"batch_size": 32,
"device_ids": [6, 7],
"model_kwargs": {
"torch_dtype": "bfloat16",
"attn_implementation": "sdpa"
},
"generation_kwargs": {
"max_new_tokens": 300,
"temperature": 0.1,
"top_p": 0.95
}
}
}
]
Everything is ready. Run the script below, grab a coffee, and wait for the results!
Bash
sh ./run_all_tests_with_multi_models.sh
🎉 Results are in: You can find the detailed evaluation reports in the result_analyze/results/ directory.
Want to challenge LLMs with your own unique dataset? No problem! Integrating new data is as easy as building with LEGOs.
Taking the AGNews dataset as an example, here's how to do it:
- Specify the Task Type 📌
In your .yaml config file, choose a suitable task type from the src/PQAEF/tasks directory.
YAML
tasks:
  - task_class: SingleChoiceTask    # For example, this is a single-choice task
    module_path: PQAEF.tasks.single_choice.single_choice_task    # The Python module path for the task class
- Configure the Data Loader 📦
Choose or create a DataLoader from the src/PQAEF/data_ops/dataloader directory.
YAML
data_loaders:
  AGNewsDataLoader:                     # Give your loader a name
    class: CSVDataLoader                # Specify which class to use for loading
    paths:
      - path/to/ag_news_csv/test.csv    # Path to the data
    formatter_name: AGNewsFormatter     # Specify a formatter
- Write a Formatter 🎨
This is the key step for data adaptation! Add your dataset's formatting logic in src/PQAEF/data_ops/formatters/formatters.py to tell the framework how to parse your data (see the Python sketch after this list).
- Create a Prompt Template ✍️
A good prompt is half the battle. Refer to the src/PQAEF/tasks/single_choice/prompt directory to craft the perfect prompt for your task.
- Define Evaluation Metrics 📈
Do you care about accuracy, recall, or F1-score? Specify it in your task configuration.
YAML
tasks:
  - task_class: SingleChoiceTask
    eval_tool:
      - accuracy    # For a single-choice task, we care most about accuracy!
- Choose an Analyzer 🔬
Finally, select a suitable analyzer from src/PQAEF/statistics/analysis to process and present the results.
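To make the formatter step concrete, here is a minimal sketch of what AGNews formatting logic could look like. Apart from the raw CSV layout (a 1-based class index, a title, and a description), everything is an assumption for illustration: the function name, the output field names, and how formatters get registered all depend on the real interface in src/PQAEF/data_ops/formatters/formatters.py, so mirror an existing formatter there.
Python
# Hypothetical sketch only -- mirror an existing formatter in
# src/PQAEF/data_ops/formatters/formatters.py for the real interface.
from typing import Dict, List

# The classic ag_news_csv files store four classes as 1-based indices.
AGNEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def format_agnews_row(row: List[str]) -> Dict[str, object]:
    """Turn one raw CSV row (class index, title, description) into a single-choice sample."""
    label_idx = int(row[0]) - 1                    # raw labels are 1-based
    title, description = row[1], row[2]
    return {
        "question": f"{title}. {description}",     # text the model will see
        "choices": AGNEWS_LABELS,                  # candidate answers A-D
        "answer": chr(ord("A") + label_idx),       # gold option letter, e.g. "B"
    }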
And you're done! You now have a fully functional configuration file tailored for your custom dataset.
Want to become a power user? Dive deep into the .yaml configuration file to unlock the full potential of PQAEF!
👉 Click to expand/collapse the full configuration guide
YAML
# ------------------ 📊 Data Loaders Configuration ------------------
data_loaders:
  AGNewsDataLoader:                      # Define a data loader instance
    class: CSVDataLoader                 # Specify the loading class
    paths:
      - path/to/ag_news_csv/test.csv     # List of data file paths
    recursive: false                     # Whether to recursively search subdirectories
    num: 300                             # Number of samples to load, -1 means load all
    formatter_name: AGNewsFormatter      # Name of the tool to preprocess and format raw data
    encoding: utf-8
    skip_header: true                    # Whether to skip the first row (header) of the CSV file
    seed: 42                             # Random seed for reproducibility

# ------------------ 🤖 Models Configuration ------------------
# Note: Settings here will be overridden by model_configs.json if it exists!
models:
  # Example for a local model
  qwen_evaluator:
    class: LocalModel                    # LocalModel means the model is loaded from a local path
    name: qwen_evaluator_llm
    config:
      model_path: /path/to/model/Qwen2.5-7B-Instruct   # Storage path of the model on the server
      batch_size: 32
      device_ids: [6, 7]
      model_kwargs:
        torch_dtype: bfloat16
        attn_implementation: sdpa        # Attention mechanism implementation
      generation_kwargs:
        max_new_tokens: 50
        temperature: 0.1
        top_p: 0.95

  # Example for an API model (can be used alongside or instead of a local model)
  openai_evaluator:
    class: "ApiModel"
    name: "openai_evaluator"
    config:
      provider: "url"
      model_identifier: "YOUR_MODEL_IDENTIFIER"   # e.g., "gpt-4-turbo", "qwen-max"
      api_key: "YOUR_API_KEY"                     # Highly recommended to set the API key using environment variables
      base_url: "YOUR_BASE_URL"                   # Specify this if using VLLM or other compatible APIs
      concurrency: 6                              # Key parameter: max concurrent requests, adjust based on your API rate limits

# ------------------ 📝 Tasks Configuration ------------------
tasks:
  - task_class: SingleChoiceTask                                # Task type
    module_path: PQAEF.tasks.single_choice.single_choice_task   # The Python module path for the task class
    loader_names:                                               # The data loader to be used by this task
      - AGNewsDataLoader
    config:
      llm_model_name: openai_evaluator                          # Specify which model will execute this task
      prompt_path: ./src/PQAEF/tasks/single_choice/prompt/agnews.txt   # The prompt template file for the task
      eval_tool:                                                # Evaluation metrics
        - accuracy

# ------------------ 💾 Data Dumper Configuration ------------------
data_dumper:
  output_dir: ./output/test
  file_prefix: cped_annotated
  chunk_size: 5000

# ------------------ 📈 Statistics Generator Configuration ------------------
statistics_generator:
  analyses_to_run:                       # A list of analysis types to run on the evaluation results
    - single_choice                      # Run statistical analysis for the single-choice task
Our journey has just begun! To make PQAEF even more powerful and user-friendly, we are planning to:
- Enrich Evaluation Dimensions: Continuously add more diverse datasets (especially for multi-lingual and multi-modal scenarios 🖼️🗣️) and introduce new task types like dialogue, summarization, and multi-method evaluation.
- Build an Open Community: Improve developer documentation and tutorials, and foster an active community. We encourage you to contribute new datasets, models, and evaluation modules to grow with us!
This code repository is licensed under the Apache-2.0 license; the corresponding dataset is licensed under CC BY-NC-SA 4.0.
Found a bug? Have a brilliant idea? Or developed a cool new feature?
We warmly welcome all forms of contributions! Whether it's submitting a new DataLoader, Task, or a more powerful Analyzer, we're excited to see it.
Please share your ideas, suggestions, or bug reports by opening a GitHub Issue. Let's work together to build a better PQAEF for LLM evaluation!