PoSh is an interpretable, replicable metric for detailed description evaluation that produces both granular and coarse scores for the mistakes, omissions and overall quality of a generation in comparison to a reference. It does so in three steps:
- Given a generated description and its reference, PoSh extracts scene graphs that reduce each text's surface diversity to its objects, attributes and relations.
- Using each scene graph as a structured rubric, PoSh produces granular scores for the presence of its components in the other text via question answering (QA).
- PoSh aggregates these granular scores for each scene graph to produce interpretable, coarse scores for mistakes and omissions.
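The final aggregation step can be pictured with a minimal sketch. The scene-graph components, presence scores, and `aggregate` helper below are illustrative assumptions for intuition, not PoSh's actual implementation:

```python
from statistics import mean

def aggregate(granular_scores):
    """Collapse per-component presence scores (0.0-1.0) into one coarse score.

    granular_scores maps each scene-graph component (an object, attribute,
    or relation) to how well QA found it supported by the other text.
    """
    return mean(granular_scores.values())

# Scoring the reference's scene graph against the generation surfaces omissions;
# scoring the generation's scene graph against the reference surfaces mistakes.
reference_graph_scores = {"dog": 1.0, "dog-brown": 0.5, "dog-on-rug": 0.0}
generation_graph_scores = {"cat": 0.0, "rug": 1.0}

omission_score = aggregate(reference_graph_scores)   # coverage of reference content
mistake_score = aggregate(generation_graph_scores)   # accuracy of generated content
print(round(omission_score, 2), round(mistake_score, 2))
```

Because the coarse scores are simple aggregates over named scene-graph components, each one can be traced back to the specific objects, attributes, and relations that were missed or hallucinated.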
To validate PoSh, we collect DOCENT, a new benchmark of artwork from the U.S. National Gallery of Art with expert-written references, paired with both granular and coarse judgments of model generations from art history students. DOCENT allows evaluating both detailed image description metrics and detailed image descriptions themselves. To learn more, please see DOCENT.
In our evaluations, PoSh is a better proxy for the human judgments in DOCENT than existing open-weight metrics (and GPT-4o-as-a-Judge). Moreover, PoSh is robust to image type and source model, performing well on CapArena. Finally, we find that PoSh is an effective reward function: training with it outperforms SFT on the 1,000 DOCENT training images.
To learn more about PoSh and DOCENT, please read our paper, "PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions": https://arxiv.org/abs/2510.19060
To replicate our evaluation of PoSh on DOCENT and CapArena, please run the following on a single H100 GPU running CUDA 12.7:
```shell
conda env create -f environment.yml
conda activate posh
python evaluate_posh_coarse.py --benchmark docent
python evaluate_posh_coarse.py --benchmark caparena
```
To use PoSh, simply instantiate a `PoSh` instance and call its `evaluate` method. The argument values below are placeholders mirroring the CLI flags; adjust them for your hardware:

```python
from posh.posh import PoSh

posh = PoSh(
    qa_gpu_memory_utilization=0.9,  # mirrors the --gpu_memory_utilization flag
    qa_tensor_parallel_size=1,      # mirrors --tensor_parallel_size
    qa_enable_prefix_caching=True,  # mirrors --enable_prefix_caching
    verbosity=1,                    # mirrors --verbosity
)

generations = [...]  # your generated descriptions
references = [...]   # the corresponding reference descriptions

coarse_scores = posh.evaluate(generations=generations, references=references)
```
If you find either PoSh or DOCENT useful in your work, please cite:
```bibtex
@misc{ananthram2025poshusingscenegraphs,
  title={PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions},
  author={Amith Ananthram and Elias Stengel-Eskin and Lorena A. Bradford and Julia Demarest and Adam Purvis and Keith Krut and Robert Stein and Rina Elster Pantalony and Mohit Bansal and Kathleen McKeown},
  year={2025},
  eprint={2510.19060},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.19060},
}
```