
🚀 Welcome to PQAEF

Peking University & Quwan Ability Evaluation Framework

Quwan: Guangzhou Quwan Network Technology

Hello, future master of LLM evaluation! 👋

Are you still scratching your head over how to scientifically and systematically evaluate the capabilities of Large Language Models? Look no further than PQAEF!

PQAEF is a highly extensible evaluation framework built upon a "Four-Dimensional" concept: Ability -> Level-3 Task -> Data -> Method. Whether you want to benchmark Qwen, GPT-4, or your own local model, PQAEF makes it a breeze.
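
To make the four dimensions concrete, here is one way a single evaluation could be described along them (a purely illustrative sketch; PQAEF's actual wiring is done via the YAML configs shown below, not this dict):

Python

# Purely illustrative: one evaluation described along PQAEF's four dimensions.
# The real configuration lives in the YAML files shown later, not in this dict.
evaluation = {
    "ability": "text classification",                       # which capability is probed
    "task": "SingleChoiceTask",                             # the Level-3 task type
    "data": "AGNews (news-topic single-choice questions)",  # the dataset
    "method": "accuracy over model-selected options",       # how answers are scored
}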

🛠️ Quick Start

Get your LLM evaluation running in just three simple steps!

1. Download the Code

There's no cooler way to start than with a git clone!

Bash

git clone https://github.com/QuwanAI/PQAEF.git
cd PQAEF

2. Setup Your Environment ⚙️

Two commands to rule them all (the dependencies, that is).

Bash

# Install core dependencies
pip install -r requirements.txt

# Install the framework itself
pip install -e .
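
Want to make sure the install took? A quick import check never hurts (a minimal sanity sketch, assuming the package is importable as PQAEF, which matches the module paths used in the configs below):

Python

# Sanity check: confirm the editable install is visible to your interpreter.
# Assumes the importable package name is PQAEF, matching module paths such as
# PQAEF.tasks.single_choice.single_choice_task used later in the configs.
import PQAEF
print("PQAEF loaded from:", PQAEF.__file__)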

3. Get Ready to Run!

📊 Prepare Your Dataset

  1. Download a dataset: Grab any dataset you want from places like Hugging Face, Kaggle, etc.
  2. Configure the path: Create a .yaml config file for your dataset to tell PQAEF where to find it.

We've prepared an example using the AGNews dataset to get you up and running in seconds. Just download the data and fill in the path in the config file!

👉 Example file: ./test/test_AGNews.yaml

YAML

data_loaders:
  AGNewsDataLoader:
    class: CSVDataLoader
    paths:
      - /path/to/your/ag_news_csv/test.csv # ✍️ Fill in your own dataset path here!
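
Not sure the path or encoding is right? A quick peek at the file before a full run can save a debugging round trip (a minimal standard-library sketch; the usual AGNews CSV layout is three columns: class index, title, description):

Python

import csv

# Quick sanity check that the configured path exists and parses as CSV.
# The usual AGNews CSV layout is: class index (1-4), title, description.
path = "/path/to/your/ag_news_csv/test.csv"  # same path as in the YAML above
with open(path, encoding="utf-8") as f:
    for i, row in enumerate(csv.reader(f)):
        print(row)
        if i >= 2:  # the first three rows are enough for a spot check
            break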

🤖 Configure Models for Evaluation

We support both remote APIs and local giants!

In the model_configs.json file, you can configure as many models as you want. The script will automatically call them up one by one to take on the challenge.

JSON

[
    {
        "model_type": "api",
        "model_name": "openai_evaluator",
        "class": "ApiModel",
        "config": {
            "provider": "url",
            "model_identifier": "YOUR_MODEL",
            "api_key": "YOUR_API_KEY",
            "base_url": "YOUR_BASE_URL",
            "concurrency": 1
        }
    },
    {
        "model_type": "local",
        "model_name": "qwen_evaluator",
        "class": "LocalModel",
        "config": {
            "model_path": "/path/to/model/Qwen2.5-7B-Instruct",
            "batch_size": 32,
            "device_ids": [6, 7],
            "model_kwargs": {
                "torch_dtype": "bfloat16",
                "attn_implementation": "sdpa"
            },
            "generation_kwargs": {
                "max_new_tokens": 300,
                "temperature": 0.1,
                "top_p": 0.95
            }
        }
    }
]
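
Before kicking off a long run, it can be worth checking that the file parses and that each entry carries the fields shown above (a hedged sketch; the required-key list mirrors this example and is not an official PQAEF schema):

Python

import json

# Illustrative sanity check for model_configs.json. The required-key list
# mirrors the example above; it is not the framework's official schema.
with open("model_configs.json", encoding="utf-8") as f:
    models = json.load(f)

for m in models:
    missing = {"model_type", "model_name", "class", "config"} - m.keys()
    if missing:
        raise ValueError(f"{m.get('model_name', '<unnamed>')} is missing: {missing}")
    print(f"OK: {m['model_name']} ({m['model_type']})")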

▶️ Run it!

Everything is ready. Run the script below, grab a coffee, and wait for the results!

Bash

sh ./run_all_tests_with_multi_models.sh

🎉 Results are in: You can find the detailed evaluation reports in the result_analyze/results/ directory.
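
The exact report layout depends on which analyzers ran, so treat the snippet below as a starting point only (a minimal sketch that lists whatever landed in the results directory; file names and formats are assumptions, not a documented contract):

Python

from pathlib import Path

# List whatever reports the run produced. The directory comes from the docs
# above; file names and formats depend on the analyzers you enabled.
results_dir = Path("result_analyze/results")
for report in sorted(p for p in results_dir.rglob("*") if p.is_file()):
    print(report.relative_to(results_dir), f"({report.stat().st_size} bytes)")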


🧩 New Dataset Integration Guide

Want to challenge LLMs with your own unique dataset? No problem! Integrating new data is as easy as building with LEGOs.

Using the AGNews dataset as an example, here's how to do it:

  1. Specify the Task Type 📌

    In your .yaml config file, choose a suitable task type from the src/PQAEF/tasks directory.

    YAML

    tasks:
      - task_class: SingleChoiceTask # For example, this is a single-choice task
        module_path: PQAEF.tasks.single_choice.single_choice_task # The Python module path for the task class
    
  2. Configure the Data Loader 📦

    Choose or create a DataLoader from the src/PQAEF/data_ops/dataloader directory.

    YAML

    data_loaders:
      AGNewsDataLoader: # Give your loader a name
        class: CSVDataLoader # Specify which class to use for loading
        paths:
          - path/to/ag_news_csv/test.csv # Path to the data
        formatter_name: AGNewsFormatter # Specify a formatter
    
  3. Write a Formatter 🎨

    This is the key step for data adaptation! Add your dataset's formatting logic in src/PQAEF/data_ops/formatters/formatters.py to tell the framework how to parse your data. (A sketch of what such a formatter might look like follows this list.)

  4. Create a Prompt Template ✍️

    A good prompt is half the battle. Refer to the src/PQAEF/tasks/single_choice/prompt directory to craft the perfect prompt for your task.

  5. Define Evaluation Metrics 📈

    Do you care about accuracy, recall, or F1-score? Specify it in your task configuration.

    YAML

    tasks:
      - task_class: SingleChoiceTask
        eval_tool:
          - accuracy # For a single-choice task, we care most about accuracy!
    
  6. Choose an Analyzer 🔬

    Finally, select a suitable analyzer from src/PQAEF/statistics/analysis to process and present the results.

And you're done! You now have a fully functional configuration file tailored for your custom dataset.
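
As promised in step 3, here is a sketch of what an AGNews formatter could look like. The real signature and registration mechanism are defined by src/PQAEF/data_ops/formatters/formatters.py, and the sample-dict layout below is an assumption, so treat this as an illustration rather than the framework's actual API:

Python

# Hypothetical AGNews formatter sketch. The real signature and registration
# mechanism live in src/PQAEF/data_ops/formatters/formatters.py; the sample
# dict layout below is illustrative, not PQAEF's actual schema.
AG_NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def agnews_formatter(row: list[str]) -> dict:
    """Turn one raw CSV row [class_index, title, description] into a sample."""
    class_index, title, description = row[0], row[1], row[2]
    return {
        "question": f"{title}\n{description}",
        "choices": list(AG_NEWS_LABELS.values()),
        "answer": AG_NEWS_LABELS[int(class_index)],
    }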


⚙️ The Ultimate Configuration Guide

Want to become a power user? Dive deep into the .yaml configuration file to unlock the full potential of PQAEF!

👉 The full annotated configuration guide:

YAML
# ------------------ 📊 Data Loaders Configuration ------------------
data_loaders:
  AGNewsDataLoader: # Define a data loader instance
    class: CSVDataLoader # Specify the loading class
    paths:
      - path/to/ag_news_csv/test.csv # List of data file paths
    recursive: false # Whether to recursively search subdirectories
    num: 300 # Number of samples to load, -1 means load all
    formatter_name: AGNewsFormatter # Name of the tool to preprocess and format raw data
    encoding: utf-8
    skip_header: true # Whether to skip the first row (header) of the CSV file
    seed: 42 # Random seed for reproducibility

# ------------------ 🤖 Models Configuration ------------------
# Note: Settings here will be overridden by model_configs.json if it exists!
models:
  # Example for a local model
  qwen_evaluator:
    class: LocalModel # LocalModel means the model is loaded from a local path
    name: qwen_evaluator_llm
    config:
      model_path: /path/to/model/Qwen2.5-7B-Instruct # Storage path of the model on the server
      batch_size: 32
      device_ids: [6, 7]
      model_kwargs:
        torch_dtype: bfloat16
        attn_implementation: sdpa # Attention mechanism implementation
      generation_kwargs:
        max_new_tokens: 50
        temperature: 0.1
        top_p: 0.95

  # Example for an API model (can be used alongside or instead of a local model)
  openai_evaluator:
    class: "ApiModel"
    name: "openai_evaluator"
    config:
      provider: "url"
      model_identifier: "YOUR_MODEL_IDENTIFIER" # e.g., "gpt-4-turbo", "qwen-max"
      api_key: "YOUR_API_KEY" # Highly recommended to set the API key via environment variables
      base_url: "YOUR_BASE_URL" # Specify this if using vLLM or other compatible APIs
      concurrency: 6 # Key parameter: max concurrent requests, adjust based on your API rate limits

# ------------------ 📝 Tasks Configuration ------------------
tasks:
  - task_class: SingleChoiceTask # Task type
    module_path: PQAEF.tasks.single_choice.single_choice_task # The Python module path for the task class
    loader_names: # The data loader to be used by this task
      - AGNewsDataLoader
    config:
      llm_model_name: openai_evaluator # Specify which model will execute this task
      prompt_path: ./src/PQAEF/tasks/single_choice/prompt/agnews.txt # The prompt template file for the task
    eval_tool: # Evaluation metrics
      - accuracy

# ------------------ 💾 Data Dumper Configuration ------------------
data_dumper:
  output_dir: ./output/test
  file_prefix: cped_annotated
  chunk_size: 5000

# ------------------ 📈 Statistics Generator Configuration ------------------
statistics_generator:
  analyses_to_run: # A list of analysis types to run on the evaluation results
    - single_choice # Run statistical analysis for the single-choice task
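
Once your config grows this large, a tiny pre-flight check can catch typos before a run (a minimal sketch; the section names mirror this guide, and PyYAML is assumed to be installed):

Python

import yaml  # PyYAML, assumed to be installed alongside the framework

# Pre-flight check: the config parses and the sections described above exist.
with open("./test/test_AGNews.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

for section in ("data_loaders", "models", "tasks", "data_dumper", "statistics_generator"):
    print(f"{section}: {'present' if section in cfg else 'MISSING'}")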


🗺️ Future Roadmap

Our journey has just begun! To make PQAEF even more powerful and user-friendly, we are planning to:

  • Enrich Evaluation Dimensions: Continuously add more diverse datasets (especially for multi-lingual and multi-modal scenarios 🖼️🗣️) and introduce new task types like dialogue, summarization, and multi-method evaluation.
  • Build an Open Community: Improve developer documentation and tutorials, and foster an active community. We encourage you to contribute new datasets, models, and evaluation modules to grow with us!

👏🏻 License

This code repository is licensed under the Apache-2.0 license; the corresponding dataset is licensed under CC BY-NC-SA 4.0.

🤝 Join Us & Contribute

Found a bug? Have a brilliant idea? Or developed a cool new feature?

We warmly welcome all forms of contributions! Whether it's submitting a new DataLoader, Task, or a more powerful Analyzer, we're excited to see it.

Please share your ideas, suggestions, or bug reports by opening a GitHub Issue. Let's work together to build a better PQAEF for LLM evaluation!
