VERSE 📄🔍👀

Visual Embedding Reduction and Space Exploration — Clustering-guided Insights for Training Data Enhancement in VrDU


Introduction ℹ️

Visually-rich Document Understanding (VrDU) requires a model to synthesize or select information from documents (images with text) to answer questions, classify data, or extract information. VrDU tasks are multimodal, i.e., models use text, images, or even the document layout to solve them.

We usually train VLMs on synthetic visual data that we (as humans) label as photorealistic. We argue that this is an anthropocentric perspective imposed on a model that might not synthesize visual information as we do. VERSE helps visualize the latent space and overlay visual features to detect poor-performance regions and act on them by including better-suited training sets that boost model performance.
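The core idea can be sketched in a few lines: project the model's visual embeddings into a low-dimensional space and colour each point by its downstream score, so that poor-performance regions stand out. The snippet below is a minimal sketch, assuming per-document embeddings and F1 scores are already available as NumPy arrays; the file names and the choice of UMAP are illustrative, not the exact VERSE implementation.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Hypothetical inputs: one visual embedding per document plus its F1 score.
embeddings = np.load("visual_embeddings.npy")   # shape: (n_docs, dim)
f1_scores = np.load("f1_scores.npy")            # shape: (n_docs,)

# Project the high-dimensional embeddings onto a 2D Reduced Embedding Space.
reducer = umap.UMAP(n_components=2, random_state=42)
points = reducer.fit_transform(embeddings)

# Overlay performance on the reduced space: low-scoring regions stand out
# as the places where better-suited training data is needed.
plt.scatter(points[:, 0], points[:, 1], c=f1_scores, cmap="viridis", s=10)
plt.colorbar(label="F1")
plt.title("Reduced Embedding Space coloured by F1")
plt.show()
```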

Figure 1. Traditionally, the quality of synthetic images in a dataset is assessed from an anthropocentric perspective, answering the question of whether such images appear photorealistic. In contrast, this work proposes evaluating the images from the model’s perspective through visual embedding analysis.

Models 👾

We use different Vision Language Models (VLMs) with varying visual signal strength and diverse visual world models. We demonstrate VERSE on the models with stronger visual representations (Donut and Idefics2).

| Model | Pre-trained 🤗 version | Input data | Task |
| --- | --- | --- | --- |
| LayoutLMv2 | microsoft/layoutlmv2-base-uncased | I + T + L | Token classification |
| LayoutXLM | microsoft/layoutxlm-base | I + T + L | Token classification |
| LayoutLMv3 | microsoft/layoutlmv3-base | I + T + L | Token classification |
| Donut | naver-clova-ix/donut-base | I + T | Sequence Generation |
| Idefics2 | HuggingFaceM4/idefics2-8b | I + T | Sequence Generation |
| PaliGemma | google/paligemma-3b-pt-224 | I + T | Sequence Generation |
| LLaVA | LLaVA-hf/LLaVA-1.5-7b-hf | I + T | Sequence Generation |
Table 1. Models initially considered in this research.
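As an illustration of where the visual embeddings come from, the sketch below pulls a document-level embedding from the vision encoder of Donut, one of the models in Table 1. The sample path and the mean-pooling over patch tokens are assumptions for illustration, not necessarily how VERSE extracts its embeddings.

```python
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

image = Image.open("document.png").convert("RGB")   # hypothetical sample
pixel_values = processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Run only the vision encoder and average its patch tokens into a single
    # document-level embedding.
    patch_tokens = model.encoder(pixel_values=pixel_values).last_hidden_state
    embedding = patch_tokens.mean(dim=1)            # shape: (1, hidden_dim)
```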

Datasets 📑

Training Dataset 📄

We use the Spanish partition of the MERIT Dataset. The MERIT Dataset is a synthetic multimodal dataset (Image + Text + Layout) crafted for Visually-rich Document Understanding tasks. You can find more about the MERIT Dataset here:

Figure 2. Training samples used. We employ the Spanish-language subsets of the MERIT Dataset across its different versions (A). Each version comprises data from seven different schools (B). New versions complement the vanilla MERIT Dataset, composed of digital document samples (C) and their rendered versions (D). More information available in the MERIT Dataset paper.
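A minimal sketch of loading the training data with the 🤗 datasets library follows; the repository identifier and configuration name are placeholders, so check the MERIT Dataset card on the Hub for the exact values.

```python
from datasets import load_dataset

# The repository id and configuration below are placeholders; the MERIT
# Dataset card on the Hugging Face Hub lists the exact names.
merit_es = load_dataset("de-Rodrigo/merit", name="es-digital-seq", split="train")

print(merit_es)              # number of rows and column names
print(merit_es[0].keys())    # fields of a single training sample
```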

Test-Dev Dataset 📜

We use MERIT Secret (a real dataset under a Non-Disclosure Agreement) as the test-dev dataset.

Methodology 🔄

Figure 3. VERSE methodology. More information available in the VERSE paper.

Results 📈

Figure 4. We detect conflictive clusters and the main features driving them so we can adjust our training data. In this example, we explore the Idefics2 Reduced Embedding Space and boost its performance by adding resonating data that targets the conflictive clusters.
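The cluster-level diagnosis behind this figure can be sketched as follows: cluster the reduced embeddings, rank clusters by mean F1, and flag the weakest ones as candidates for targeted training data. KMeans, the number of clusters, and the 0.7 threshold are illustrative assumptions, not the exact VERSE configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.load("reduced_embeddings.npy")   # (n_docs, 2), e.g. UMAP output
f1_scores = np.load("f1_scores.npy")         # (n_docs,)

labels = KMeans(n_clusters=8, random_state=42).fit_predict(points)

# Mean F1 per cluster: low-scoring clusters are the conflictive regions where
# additional, better-matched training samples should be injected.
for cluster_id in np.unique(labels):
    mask = labels == cluster_id
    mean_f1 = f1_scores[mask].mean()
    flag = "  <- conflictive" if mean_f1 < 0.7 else ""
    print(f"cluster {cluster_id}: n={mask.sum():4d}  mean F1={mean_f1:.3f}{flag}")
```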

Figure 5. Synthetic training samples moving across the Reduced Embedding Space (RES) of Donut. Every step shows the same sample under an increasing level of visual information (purple). In the background, PC maps show F1 scores of the target (test-dev) samples.
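The trajectory view in Figure 5 can be reproduced with a short plotting sketch: the same sample is embedded once per dataset version (increasing visual information) and its path across the RES is drawn. The input file and its layout are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical array: (n_versions, n_samples, 2) reduced embeddings of the
# same training samples, ordered from least to most visual information.
trajectories = np.load("reduced_embeddings_per_version.npy")

sample_id = 0
path = trajectories[:, sample_id, :]                 # (n_versions, 2)

plt.plot(path[:, 0], path[:, 1], "-o", color="purple")
for step, (x, y) in enumerate(path):
    plt.annotate(str(step), (x, y))                  # label each version step
plt.title("One sample moving across the RES of Donut")
plt.show()
```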

| Model | Deployment | Fine-tuned | F1 |
| --- | --- | --- | --- |
| Idefics2 | On-premise | VERSE | 0.8101 |
| GPT4-O | API fine-tune | API fine-tune | 0.7821 |
| Donut | On-premise | VERSE | 0.7607 |
| Pixtral | API-Based | N/A | 0.7267 |
Table 2. Comparison of the best-performing models. After applying the VERSE methodology, on-premise models achieve performance comparable to API-based solutions.

Resources 🧭

Explore Reduced Embedding Spaces: VERSE Space @ Hugging Face 🤗

Team 🤜🤛

We are researchers from Comillas Pontifical University.

Citation 📃✒️

If you find our research interesting, please cite our works. 📃✒️

VERSE

@article{WIP,
  title={WIP},
  author={WIP},
  journal={arXiv preprint arXiv:WIP},
  year={2025}
}

MERIT Dataset

@article{deRodrigo2025merit,
  title = {The MERIT dataset: Modelling and efficiently rendering interpretable transcripts},
  journal = {Pattern Recognition},
  volume = {172},
  pages = {112502},
  year = {2026},
  issn = {0031-3203},
  doi = {https://doi.org/10.1016/j.patcog.2025.112502},
  url = {https://www.sciencedirect.com/science/article/pii/S0031320325011653},
  author = {Ignacio {de Rodrigo} and Alberto Sanchez-Cuadrado and Jaime Boal and Alvaro J. Lopez-Lopez}
}
