Factly is a modern CLI tool designed to evaluate the factuality of Large Language Models (LLMs) on the Massive Multitask Language Understanding (MMLU) benchmark. It provides a robust framework for prompt engineering experiments and factual accuracy assessment.
- Evaluate LLM factuality on the MMLU benchmark with detailed results
- Support for various prompt engineering experiments via configurable system instructions
- Generate comparative visualizations of factuality scores across models and prompts
- Structured output for easy analysis and comparison
- Built with modern Python tooling (Python 3.12, uv, click, pydantic)
- Extensible and reproducible evaluation workflows
Note
Currently, only OpenAI models are supported.
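Because evaluations call the OpenAI API, an API key must be available before running anything. A minimal setup sketch, assuming Factly follows the standard OpenAI SDK convention of reading the OPENAI_API_KEY environment variable (the Installation Guide documents the exact configuration the tool expects):
# Make an OpenAI API key available (assumption: standard OPENAI_API_KEY convention)
export OPENAI_API_KEY="your-api-key"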
# Run MMLU evaluation with default settings
factly mmlu
# Run MMLU evaluation and generate plots
factly mmlu --plot
# Get help on all available options
factly mmlu --help
# Get help on all available commands
factly --help
That's it! The tool uses optimized default parameters and saves all outputs to the output directory.
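After a run, you can check what was written, for example (the literal directory name output/ is an assumption based on the description above):
# List generated results and plots (output path assumed from the text above)
ls output/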
Note
For detailed installation instructions, see the Installation Guide; for usage instructions, use cases, examples, and advanced configuration options, see the Usage Guide.
Factly is released under the MIT License. Its documentation lives on Read the Docs, the code on GitHub, and the latest release on PyPI. It is rigorously tested on Python 3.12+.
If you'd like to contribute to Factly, you're most welcome!
If you have a question or remark, find a bug, or run into something you can't do with Factly, please open an issue.