lighteval library logo

Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.



Documentation


Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends, whether your model is being served behind an API or already loaded in memory. Dive deep into your model's performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack up.

Customization is at your fingertips: browse all our existing tasks and metrics, or create your own custom task and custom metric tailored to your needs.

Available Tasks

Lighteval supports 7,000+ evaluation tasks across multiple domains and languages. Here's an overview of some popular benchmarks:

📚 Knowledge

  • General Knowledge: MMLU, MMLU-Pro, MMMU, BIG-Bench
  • Question Answering: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
  • Specialized: GPQA, AGIEval

🧮 Math and Code

  • Math Problems: GSM8K, GSM-Plus, MATH, MATH500
  • Competition Math: AIME24, AIME25
  • Multilingual Math: MGSM (Grade School Math in 10+ languages)
  • Coding Benchmarks: LCB (LiveCodeBench)

🎯 Chat Model Evaluation

  • Instruction Following: IFEval, IFEval-fr
  • Reasoning: MUSR, DROP (discrete reasoning)
  • Long Context: RULER
  • Dialogue: MT-Bench
  • Holistic Evaluation: HELM, BIG-Bench

🌍 Multilingual Evaluation

  • Cross-lingual: XTREME, Flores200 (200 languages), XCOPA, XQuAD
  • Language-specific:
    • Arabic: ArabicMMLU
    • Filipino: FilBench
    • French: IFEval-fr, GPQA-fr, BAC-fr
    • German: German RAG Eval
    • Serbian: Serbian LLM Benchmark, OZ Eval
    • Turkic: TUMLU (9 Turkic languages)
    • Chinese: CMMLU, CEval, AGIEval
    • Russian: RUMMLU, Russian SQuAD
    • And many more...

🧠 Core Language Understanding

  • NLU: GLUE, SuperGLUE, TriviaQA, Natural Questions
  • Commonsense: HellaSwag, WinoGrande, ProtoQA
  • Natural Language Inference: XNLI
  • Reading Comprehension: SQuAD, XQuAD, MLQA, Belebele

⚡️ Installation

Note: lighteval is currently untested on Windows and not yet supported there; it should be fully functional on macOS and Linux.

pip install lighteval

Lighteval offers many optional extras at install time; see here for the complete list.
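
For example, installing with a backend-specific extra would look like the line below (the extra name is illustrative; check the list linked above for the extras your version actually ships):

pip install lighteval[vllm]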

If you want to push results to the Hugging Face Hub, log in with your access token:

huggingface-cli login
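
Alternatively, you can provide the same token through the standard HF_TOKEN environment variable (placeholder value shown):

export HF_TOKEN=<your_hf_access_token>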

🚀 Quickstart

Lighteval offers several entry points for model evaluation, including:

  • lighteval accelerate: Evaluate models with the Accelerate backend (see the quick command below)
  • lighteval custom: Evaluate custom models (can be anything)

Did not find what you need? You can always make your own custom model API by following this guide.

Here's a quick command to evaluate using the Accelerate backend:

lighteval accelerate \
    "model_name=gpt2" \
    "leaderboard|truthfulqa:mc|0"

Or use the Python API to run a model already loaded in memory!

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "lighteval|gsm8k|0"

# Track and save evaluation results under output_dir
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2  # limit to 2 samples per task for a quick smoke test
)

# Load the model yourself, then wrap it for lighteval
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

pipeline.evaluate()
pipeline.show_results()  # print the results table
results = pipeline.get_results()  # retrieve aggregated results for further processing
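
For reference, a roughly equivalent run through the CLI would be the following (a sketch combining the two examples above; output handling may differ slightly across lighteval versions):

lighteval accelerate \
    "model_name=meta-llama/Meta-Llama-3-8B-Instruct" \
    "lighteval|gsm8k|0"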

🙏 Acknowledgements

Lighteval started as an extension of the fantastic Eleuther AI Harness (which powers the Open LLM Leaderboard) and draws inspiration from the amazing HELM framework.

As Lighteval has evolved into its own standalone tool, we remain grateful to the Harness and HELM teams for their pioneering work on LLM evaluations.

🌟 Contributions Welcome 💙💚💛💜🧡

Got ideas? Found a bug? Want to add a task or metric? Contributions are warmly welcomed!

If you're adding a new feature, please open an issue first.

If you open a PR, don't forget to run the style checks:

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

📜 Citation

@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.10.0},
  url = {https://github.com/huggingface/lighteval}
}
