Alpha-Innovator/DocGenome

DocGenome: An Open Large-scale Scientific Document Benchmark for Training Next-generation Large Models

Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging the multi-modal data extracted from these documents, and assessing how well large models handle scientific document-oriented tasks, is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four characteristics. 1) Completeness: It is the first dataset to structure data from all modalities, including 15 layout categories along with their LaTeX source code. 2) Logicality: It provides the logical relationships between different regions within each scientific document. 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document transformation, table QA, open-ended single-page QA, and multi-page QA. 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of current large models on our benchmark.

About

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models
