Peking University & Quwan Ability Evaluation Framework
Quwan: Guangzhou Quwan Network Technology
Hello, future master of LLM evaluation! 👋
Are you still scratching your head over how to scientifically and systematically evaluate the capabilities of Large Language Models? Look no further than PQAEF!
PQAEF is a highly extensible evaluation framework built upon a "Four-Dimensional" concept: Ability -> Level-3 Task -> Data -> Method. Whether you want to benchmark Qwen, GPT-4, or your own local model, PQAEF makes it a breeze.
Get your LLM evaluation running in just three simple steps!
There's no cooler way to start than with a git clone!
Bash
git clone https://github.com/your-repo/PQAEF.git
cd PQAEF
One command to rule them all (the dependencies, that is).
Bash
# Install core dependencies
pip install -r requirements.txt
# Install the framework itself
pip install -e .
- Download a dataset: Grab any dataset you want from places like Hugging Face, Kaggle, etc.
- Configure the path: Create a .yaml config file for your dataset to tell PQAEF where to find it.
We've prepared an example using the AGNews dataset to get you up and running in seconds. Just download the data and fill in the path in the config file!
👉 Example file: ./test/test_AGNews.yaml
YAML
data_loaders:
AGNewsDataLoader:
class: CSVDataLoader
paths:
- /path/to/your/ag_news_csv/test.csv # ✍️ Fill in your own dataset path here!
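Don't have the AGNews CSV on disk yet? One quick way to fetch it is via the Hugging Face datasets library, sketched below. The dataset id and output path here are assumptions, and the exported columns (text, label) may differ from the classic ag_news_csv layout that AGNewsFormatter expects, so double-check the formatter before running.
Python
# Sketch: download the AGNews test split and export it as a CSV file.
# Adjust the output path to match whatever you put in the YAML config above.
from datasets import load_dataset

test_split = load_dataset("ag_news", split="test")       # columns: text, label
test_split.to_csv("/path/to/your/ag_news_csv/test.csv")  # write a plain CSV for CSVDataLoader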
We support both remote APIs and local giants!
In the model_configs.json file, you can configure as many models as you want. The script will automatically call them up one by one to take on the challenge.
JSON
[
{
"model_type": "api",
"model_name": "openai_evaluator",
"class": "ApiModel",
"config": {
"provider": "url",
"model_identifier": "YOUR_MODEL",
"api_key": "YOUR_API_KEY",
"base_url": "YOUR_BASE_URL",
"concurrency": 1
}
},
{
"model_type": "local",
"model_name": "qwen_evaluator",
"class": "LocalModel",
"config": {
"model_path": "/path/to/model/Qwen2.5-7B-Instruct",
"batch_size": 32,
"device_ids": [6, 7],
"model_kwargs": {
"torch_dtype": "bfloat16",
"attn_implementation": "sdpa"
},
"generation_kwargs": {
"max_new_tokens": 300,
"temperature": 0.1,
"top_p": 0.95
}
}
}
]
Everything is ready. Run the script below, grab a coffee, and wait for the results!
Bash
sh ./run_all_tests_with_multi_models.sh
🎉 Results are in: You can find the detailed evaluation reports in the result_analyze/results/ directory.
Want to challenge LLMs with your own unique dataset? No problem! Integrating new data is as easy as building with LEGOs.
Taking the AGNews dataset as an example, here's how to do it:
- Specify the Task Type 📌
In your .yaml config file, choose a suitable task type from the src/PQAEF/tasks directory.
YAML
tasks:
  - task_class: SingleChoiceTask    # For example, this is a single-choice task
    module_path: PQAEF.tasks.single_choice.single_choice_task    # The Python module path for the task class
- Configure the Data Loader 📦
Choose or create a DataLoader from the src/PQAEF/data_ops/dataloader directory.
YAML
data_loaders:
  AGNewsDataLoader:                     # Give your loader a name
    class: CSVDataLoader                # Specify which class to use for loading
    paths:
      - path/to/ag_news_csv/test.csv    # Path to the data
    formatter_name: AGNewsFormatter     # Specify a formatter
- Write a Formatter 🎨
This is the key step for data adaptation! Add your dataset's formatting logic in src/PQAEF/data_ops/formatters/formatters.py to tell the framework how to parse your data (see the Python sketch after this list).
- Create a Prompt Template ✍️
A good prompt is half the battle. Refer to the src/PQAEF/tasks/single_choice/prompt directory to craft the perfect prompt for your task.
- Define Evaluation Metrics 📈
Do you care about accuracy, recall, or F1-score? Specify it in your task configuration.
YAML
tasks:
  - task_class: SingleChoiceTask
    eval_tool:
      - accuracy    # For a single-choice task, we care most about accuracy!
- Choose an Analyzer 🔬
Finally, select a suitable analyzer from src/PQAEF/statistics/analysis to process and present the results.
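To make the formatter step concrete, here is a minimal sketch of what AGNews formatting logic could look like. Apart from the raw CSV layout (a 1-based class index, a title, and a description), everything is an assumption for illustration: the function name, the output field names, and how formatters get registered all depend on the real interface in src/PQAEF/data_ops/formatters/formatters.py, so mirror an existing formatter there.
Python
# Hypothetical sketch only -- mirror an existing formatter in
# src/PQAEF/data_ops/formatters/formatters.py for the real interface.
from typing import Dict, List

# The classic ag_news_csv files store four classes as 1-based indices.
AGNEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def format_agnews_row(row: List[str]) -> Dict[str, object]:
    """Turn one raw CSV row (class index, title, description) into a single-choice sample."""
    label_idx = int(row[0]) - 1                    # raw labels are 1-based
    title, description = row[1], row[2]
    return {
        "question": f"{title}. {description}",     # text the model will see
        "choices": AGNEWS_LABELS,                  # candidate answers A-D
        "answer": chr(ord("A") + label_idx),       # gold option letter, e.g. "B"
    }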
And you're done! You now have a fully functional configuration file tailored for your custom dataset.
Want to become a power user? Dive deep into the .yaml configuration file to unlock the full potential of PQAEF!
👉 Click to expand/collapse the full configuration guide
YAML
# ------------------ 📊 Data Loaders Configuration ------------------
data_loaders:
  AGNewsDataLoader:                      # Define a data loader instance
    class: CSVDataLoader                 # Specify the loading class
    paths:
      - path/to/ag_news_csv/test.csv     # List of data file paths
    recursive: false                     # Whether to recursively search subdirectories
    num: 300                             # Number of samples to load, -1 means load all
    formatter_name: AGNewsFormatter      # Name of the tool to preprocess and format raw data
    encoding: utf-8
    skip_header: true                    # Whether to skip the first row (header) of the CSV file
    seed: 42                             # Random seed for reproducibility

# ------------------ 🤖 Models Configuration ------------------
# Note: Settings here will be overridden by model_configs.json if it exists!
models:
  # Example for a local model
  qwen_evaluator:
    class: LocalModel                    # LocalModel means the model is loaded from a local path
    name: qwen_evaluator_llm
    config:
      model_path: /path/to/model/Qwen2.5-7B-Instruct   # Storage path of the model on the server
      batch_size: 32
      device_ids: [6, 7]
      model_kwargs:
        torch_dtype: bfloat16
        attn_implementation: sdpa        # Attention mechanism implementation
      generation_kwargs:
        max_new_tokens: 50
        temperature: 0.1
        top_p: 0.95

  # Example for an API model (can be used alongside or instead of a local model)
  openai_evaluator:
    class: "ApiModel"
    name: "openai_evaluator"
    config:
      provider: "url"
      model_identifier: "YOUR_MODEL_IDENTIFIER"   # e.g., "gpt-4-turbo", "qwen-max"
      api_key: "YOUR_API_KEY"                     # Highly recommended to set the API key using environment variables
      base_url: "YOUR_BASE_URL"                   # Specify this if using VLLM or other compatible APIs
      concurrency: 6                              # Key parameter: max concurrent requests, adjust based on your API rate limits

# ------------------ 📝 Tasks Configuration ------------------
tasks:
  - task_class: SingleChoiceTask                                # Task type
    module_path: PQAEF.tasks.single_choice.single_choice_task   # The Python module path for the task class
    loader_names:                                               # The data loader to be used by this task
      - AGNewsDataLoader
    config:
      llm_model_name: openai_evaluator                          # Specify which model will execute this task
      prompt_path: ./src/PQAEF/tasks/single_choice/prompt/agnews.txt   # The prompt template file for the task
      eval_tool:                                                # Evaluation metrics
        - accuracy

# ------------------ 💾 Data Dumper Configuration ------------------
data_dumper:
  output_dir: ./output/test
  file_prefix: cped_annotated
  chunk_size: 5000

# ------------------ 📈 Statistics Generator Configuration ------------------
statistics_generator:
  analyses_to_run:                       # A list of analysis types to run on the evaluation results
    - single_choice                      # Run statistical analysis for the single-choice task
Our journey has just begun! To make PQAEF even more powerful and user-friendly, we are planning to:
- Enrich Evaluation Dimensions: Continuously add more diverse datasets (especially for multi-lingual and multi-modal scenarios 🖼️🗣️) and introduce new task types like dialogue, summarization, and multi-method evaluation.
- Build an Open Community: Improve developer documentation and tutorials, and foster an active community. We encourage you to contribute new datasets, models, and evaluation modules to grow with us!
This code repository is licensed under the Apache-2.0 license; the corresponding dataset is licensed under CC BY-NC-SA 4.0.
Found a bug? Have a brilliant idea? Or developed a cool new feature?
We warmly welcome all forms of contributions! Whether it's submitting a new DataLoader, Task, or a more powerful Analyzer, we're excited to see it.
Please share your ideas, suggestions, or bug reports by opening a GitHub Issue. Let's work together to build a better PQAEF for LLM evaluation!