DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Thus, leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics: \textit{1) Completeness}: It is the first dataset to structure data from all modalities including 15 layout categories along with their LaTex source codes. \textit{2) Logicality}: It provides the logical relationships between different regions within each scientific document. \textit{3) Diversity}: It covers various document-oriented tasks, including document classification, visual grounding, document transformation, table QA, open-ended singe-page QA and multi-page QA. \textit{4) Correctness}: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets		assets
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages