Welcome! This repo is a modified fork of the original MMLU benchmark, updated to support open-source Hugging Face models like `google/gemma-3b-it` and `stabilityai/stable-code-3b`.
MMLU (Massive Multitask Language Understanding) is a benchmark designed to test language models across 57 subjects.
For a deep dive, check out my blog post:
Paper Breakdown #1 – MMLU: LLMs Have Exams Too!
- Modified `evaluate.py` to work with Hugging Face models
- A random subset of 20 subjects from MMLU (because Colab runtimes aren't infinite)
- Scripts to run few-shot evaluation (see the prompt sketch below)
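The few-shot setup follows the usual MMLU recipe: a handful of solved dev-set questions are prepended before the test question. Here's a minimal sketch of that prompt assembly; the helper names (`format_example`, `build_prompt`) are illustrative assumptions, not this repo's actual functions.

```python
# Illustrative sketch of MMLU-style few-shot prompt construction.
# Helper names here are hypothetical, not the repo's actual API.

LETTERS = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Format one question; include the answer only for few-shot demonstrations."""
    prompt = question
    for letter, option in zip(LETTERS, options):
        prompt += f"\n{letter}. {option}"
    prompt += "\nAnswer:"
    if answer is not None:
        prompt += f" {answer}\n\n"
    return prompt

def build_prompt(subject, dev_examples, test_question, test_options, k=5):
    """Prepend k solved dev-set examples before the unanswered test question."""
    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for question, options, answer in dev_examples[:k]:
        prompt += format_example(question, options, answer)
    prompt += format_example(test_question, test_options)
    return prompt
```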
You can swap in any Hugging Face causal LM (AutoModelForCausalLM compatible).
| Model | Description |
|---|---|
| `google/gemma-3b-it` | General-purpose instruction-tuned LLM |
| `stabilityai/stable-code-3b` | Code-first model, tested just for fun |
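To try a different checkpoint, the pattern looks roughly like the sketch below: load the model with `AutoModelForCausalLM`, then pick the answer letter whose token gets the highest next-token logit. Aside from the model names in the table above, everything here is an assumption about the general approach, not a guarantee of how `evaluate.py` is wired.

```python
# Minimal sketch: load any AutoModelForCausalLM-compatible checkpoint and
# choose the answer letter with the highest next-token logit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3b-it"  # swap in any Hugging Face causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def predict_letter(prompt: str) -> str:
    """Return the letter (A-D) whose token scores highest after the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    letter_ids = [
        tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in "ABCD"
    ]
    return "ABCD"[int(torch.argmax(logits[letter_ids]))]
```

Scoring only the four letter tokens keeps the comparison fair across instruction-tuned and base models, since no free-form generation or answer parsing is needed.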
Abstract Algebra, Anatomy, College Biology, College Chemistry, College Mathematics, Global Facts, High School Biology, High School Computer Science, High School Government and Politics, High School World History, Human Sexuality, Management, Medical Genetics, Miscellaneous, Moral Disputes, Professional Accounting, Public Relations, Sociology, Virology, World Religions
LLMs are smart, but they're not magic. This repo exists to help you measure just how smart (or not) they really are.
Got suggestions? Found a bug? Want to run it on another model? Open an issue or shoot me a message. Let’s benchmark responsibly.