This repository is the official implementation of the paper "FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models" (NAACL 2025 Findings).
## Installation

To install the lm-eval package from the GitHub repository, run:

```shell
git clone https://github.com/dhaabb55/FLEX/
cd FLEX
pip install -e .
```

## Run

To run the FLEX benchmark, execute:

```shell
bash run_FLEX.sh
```

## Acknowledgements

Our code is based on the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

Our data is based on "On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning".
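If you prefer to invoke the Evaluation Harness directly rather than through the provided script, a typical `lm_eval` command looks like the sketch below. Note that the task name `flex` and the model checkpoint are placeholders, not names confirmed by this repository; substitute the task names registered by this codebase and the model you want to evaluate.

```shell
# Hypothetical direct invocation of the lm-eval CLI.
# "--tasks flex" is a placeholder task name -- replace it with the
# task(s) this repository registers; the pretrained model is likewise
# only an example.
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks flex \
  --batch_size 8 \
  --output_path results/
```

Running through `run_FLEX.sh` is the supported path; the command above is only useful if you want to vary the model or batch size without editing the script.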
## Citation

```bibtex
@misc{jung2025flexbenchmarkevaluatingrobustness,
  title={FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models},
  author={Dahyun Jung and Seungyoon Lee and Hyeonseok Moon and Chanjun Park and Heuiseok Lim},
  year={2025},
  eprint={2503.19540},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.19540},
}
```