PoSh is an interpretable, replicable metric for detailed description evaluation that produces both granular and coarse scores for the mistakes, omissions and overall quality of a generation in comparison to a reference. It does so in three steps:
- Given a generated description and its reference, PoSh extracts scene graphs that reduce each text's surface diversity to its objects, attributes and relations.
- Using each scene graph as a structured rubric, PoSh produces granular scores for the presence of its components in the other text via question answering (QA).
- PoSh aggregates these granular scores for each scene graph to produce interpretable, coarse scores for mistakes and omissions.
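The final aggregation step can be pictured with a minimal sketch. The scene-graph components, presence scores, and `aggregate` helper below are illustrative assumptions for intuition, not PoSh's actual implementation:

```python
from statistics import mean

def aggregate(granular_scores):
    """Collapse per-component presence scores (0.0-1.0) into one coarse score.

    granular_scores maps each scene-graph component (an object, attribute,
    or relation) to how well QA found it supported by the other text.
    """
    return mean(granular_scores.values())

# Scoring the reference's scene graph against the generation surfaces omissions;
# scoring the generation's scene graph against the reference surfaces mistakes.
reference_graph_scores = {"dog": 1.0, "dog-brown": 0.5, "dog-on-rug": 0.0}
generation_graph_scores = {"cat": 0.0, "rug": 1.0}

omission_score = aggregate(reference_graph_scores)   # coverage of reference content
mistake_score = aggregate(generation_graph_scores)   # accuracy of generated content
print(round(omission_score, 2), round(mistake_score, 2))
```

Because the coarse scores are simple aggregates over named scene-graph components, each one can be traced back to the specific objects, attributes, and relations that were missed or hallucinated.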
To validate PoSh, we collect DOCENT, a new benchmark of artwork from the U.S. National Gallery of Art with expert-written references, paired with both granular and coarse judgments of model generations from art history students. DOCENT allows evaluating both detailed image description metrics and detailed image descriptions themselves. To learn more, please see DOCENT.
In our evaluations, PoSh is a better proxy for the human judgments in DOCENT than existing open-weight metrics (and GPT-4o-as-a-Judge). Moreover, PoSh is robust to image type and source model, performing well on CapArena. Finally, we find that PoSh is an effective reward function: training with it outperforms SFT on the 1,000 DOCENT training images.
To learn more about PoSh and DOCENT, please read our paper, "PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions": https://arxiv.org/abs/2510.19060
To replicate our evaluation of PoSh on DOCENT and CapArena, please run the following on a single H100 GPU running CUDA 12.7:
```shell
conda env create -f environment.yml
conda activate posh
python evaluate_posh_coarse.py --benchmark docent
python evaluate_posh_coarse.py --benchmark caparena
```
To use PoSh, simply instantiate a `PoSh` instance and call its `evaluate` method. The argument values below are placeholders mirroring the CLI flags; adjust them for your hardware:

```python
from posh.posh import PoSh

posh = PoSh(
    qa_gpu_memory_utilization=0.9,  # mirrors the --gpu_memory_utilization flag
    qa_tensor_parallel_size=1,      # mirrors --tensor_parallel_size
    qa_enable_prefix_caching=True,  # mirrors --enable_prefix_caching
    verbosity=1,                    # mirrors --verbosity
)

generations = [...]  # your generated descriptions
references = [...]   # the corresponding reference descriptions

coarse_scores = posh.evaluate(generations=generations, references=references)
```
If you find either PoSh or DOCENT useful in your work, please cite:
```bibtex
@misc{ananthram2025poshusingscenegraphs,
  title={PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions},
  author={Amith Ananthram and Elias Stengel-Eskin and Lorena A. Bradford and Julia Demarest and Adam Purvis and Keith Krut and Robert Stein and Rina Elster Pantalony and Mohit Bansal and Kathleen McKeown},
  year={2025},
  eprint={2510.19060},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.19060},
}
```