The following example shows the file structure of the DocGenome dataset for the discipline `math.GM`.
math.GM
├── 0906.1099
│   ├── layout_annotation.json
│   ├── order_annotation.json
│   ├── page_xxxx.jpg
│   ├── quality_report.json
│   └── reading_annotation.json
└── 2103.02443
    ├── layout_annotation.json
    ├── order_annotation.json
    ├── page_xxxx.jpg
    ├── quality_report.json
    └── reading_annotation.json

Each paper folder, for example `math.GM/2103.02443`, contains five parts:
- `page_xxxx.jpg`: one image per page of the corresponding paper, with the page index contained in the filename. Note that the page index might differ from that of the original paper.
- `layout_annotation.json`: the layout annotations, i.e. the bounding box of each category region, in COCO format.
- `reading_annotation.json`: the LaTeX source code of each block (except for the Figure category). Note that the LaTeX source code may contain macros.
- `order_annotation.json`: the relationships between blocks, stored under the key `orders` as a list of triplets. Each triplet specifies the relation type, the source block, and the destination block.
- `quality_report.json`: the quality metrics computed for each page and for the whole paper, provided for further use.
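
As a minimal sketch of how one paper folder could be read, the helper below collects the page images and parses whichever of the four JSON files are present. The function name `load_paper` is hypothetical, and the assumption that page indices in `page_xxxx.jpg` are zero-padded (so that sorting by name sorts by page) is not confirmed by the description above.

```python
import json
from pathlib import Path

# File names taken from the folder layout described above.
ANNOTATION_FILES = (
    "layout_annotation.json",
    "reading_annotation.json",
    "order_annotation.json",
    "quality_report.json",
)


def load_paper(paper_dir):
    """Return (page image paths, parsed JSON files) for one paper folder."""
    paper_dir = Path(paper_dir)
    # Assumption: page indices are zero-padded, so name order equals page order.
    pages = sorted(paper_dir.glob("page_*.jpg"))
    annotations = {}
    for name in ANNOTATION_FILES:
        path = paper_dir / name
        if path.exists():  # skip files that are absent instead of raising
            with path.open(encoding="utf-8") as f:
                annotations[name] = json.load(f)
    return pages, annotations
```

A call such as `load_paper("math.GM/2103.02443")` would then return the sorted page paths and a dictionary keyed by file name.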
| Index | Category | Notes |
|---|---|---|
| 0 | Algorithm | |
| 1 | Caption | Titles of Images, Tables, and Algorithms |
| 2 | Equation | |
| 3 | Figure | |
| 4 | Footnote | |
| 5 | List | |
| 7 | Table | |
| 8 | Text | |
| 9 | Text-EQ | Text block with inline equations |
| 10 | Title | Section titles |
| 12 | PaperTitle | |
| 13 | Code | |
| 14 | Abstract | |
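
The category table can be carried into code as a plain lookup, sketched below. The mapping is copied from the table; indices 6 and 11 are not listed there, so they are deliberately left out. The helper names and the COCO-style `category_id` key are assumptions for illustration.

```python
from collections import Counter

# Index-to-name mapping copied from the table above.
# Indices 6 and 11 do not appear in the table and are therefore omitted.
CATEGORY_NAMES = {
    0: "Algorithm",
    1: "Caption",
    2: "Equation",
    3: "Figure",
    4: "Footnote",
    5: "List",
    7: "Table",
    8: "Text",
    9: "Text-EQ",
    10: "Title",
    12: "PaperTitle",
    13: "Code",
    14: "Abstract",
}


def category_name(index):
    """Return the category name, or a placeholder for unlisted indices."""
    return CATEGORY_NAMES.get(index, f"unknown({index})")


def count_categories(coco_annotations):
    """Count boxes per category name.

    Assumption: each item carries a COCO-style "category_id" field.
    """
    return Counter(category_name(a["category_id"]) for a in coco_annotations)
```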
- The IoU of the bounding boxes is too large. This happens when the paper template is too complex.
- The category of a bounding box is incorrect. This happens when user-defined macros are used. For example, some authors define `\newcommand{\beq}{\begin{equation}}` and `\newcommand{\eeq}{\end{equation}}`; in this case, the equation may be detected as the `Text` class.
- A bounding box is missing. This happens when rare packages are used, since some rare packages may not be identified by our rule-based methods.
- The bounding boxes are correct but overlap slightly with adjacent bounding boxes. This happens because of layout adjustments, for example the `vspace` and `input` commands.
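
To illustrate the macro issue, the sketch below scans LaTeX source for `\newcommand` or `\renewcommand` definitions whose body is exactly a `\begin{...}` or `\end{...}`. It is not part of the DocGenome pipeline; the regular expression and the function name `find_env_aliases` are assumptions, and other ways of defining aliases (such as `\def`) are not covered.

```python
import re

# Matches \newcommand{\name}{\begin{env}} and \newcommand{\name}{\end{env}},
# including the \renewcommand, starred, and unbraced \newcommand\name{...} forms.
ALIAS = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?\s*(\\[A-Za-z]+)\s*\}?\s*"
    r"\{\s*\\(begin|end)\{([A-Za-z*]+)\}\s*\}"
)


def find_env_aliases(tex):
    """Map each alias macro to the ("begin" or "end", environment) it stands for."""
    return {m.group(1): (m.group(2), m.group(3)) for m in ALIAS.finditer(tex)}
```

Papers for which this returns aliases of `equation` are candidates for the misclassification described in the list above.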
| Category | Description | Example |
|---|---|---|
| identical | two blocks correspond to the same LaTeX code chunk | paragraphs that cross columns or pages |
| peer | both blocks belong to the Title category | \section{introduction}, \section{method} |
| sub | one block is logically a child of another block | \section{introduction} and the first paragraph of the Introduction section |
| adj | two adjacent Text blocks | Paragraph1 and Paragraph2 |
| explicit-cite | one block cites another block with \ref | As shown in \ref{Fig: 5}. |
| implicit-cite | the caption block and the corresponding float environment | \begin{table}\caption{A}\begin{tabular}B\end{tabular}\end{table}, then A implicit-cite B |
Each `order_annotation.json` contains two keys:
- `annotations`: contains the block information for each block; the `block_id` of each block is used to represent the relationships.
- `orders`: contains a list of triples. Each triple has the following fields:
  - `type`: the category of the relationship; see the table above for details.
  - `from`: the `block_id` of the starting block of the relationship.
  - `to`: the `block_id` of the ending block of the relationship.
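
The structure above can be traversed as in the following sketch, which assumes each triple is a JSON object with the keys `type`, `from`, and `to`, and that `annotations` is a list of objects carrying `block_id`. The helper names and the sample data are hypothetical; real files carry more fields.

```python
from collections import defaultdict


def group_orders(order_annotation):
    """Group relation triples by type as {type: [(from_id, to_id), ...]}."""
    grouped = defaultdict(list)
    # .get() tolerates files in which the key is absent.
    for triple in order_annotation.get("orders", []):
        grouped[triple["type"]].append((triple["from"], triple["to"]))
    return dict(grouped)


def relations_of(order_annotation, block_id):
    """Return every triple in which block_id is the starting or ending block."""
    return [
        t for t in order_annotation.get("orders", [])
        if block_id in (t["from"], t["to"])
    ]


# Hand-written sample for illustration only.
sample = {
    "annotations": [{"block_id": i} for i in range(4)],
    "orders": [
        {"type": "peer", "from": 0, "to": 3},
        {"type": "adj", "from": 1, "to": 2},
    ],
}
print(group_orders(sample))
```

Using `.get("orders", [])` keeps the helpers usable on files where a key is missing.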
- The `reading_annotation.json` file of some papers may not contain the field `annotations`, for unknown reasons.
- `reading_annotation.json` doesn't contain the `implicit-cite` relationship; the `implicit-cite` relationship is used in the test dataset, for efficiency reasons.
- `explicit-cite` only supports `Equation`; support for `Table` and `Figure` was developed after the training dataset was completed.
This file contains the results of the rule-based quality check, provided for further use. Its fields are as follows:
- `num_pages`: the number of pages of the corresponding paper.
- `num_columns`: 1 (single column) or 2 (two columns), determined from the last page of the paper.
- `category_quality`: for each category, we record the number of rendered LaTeX code chunks as `reading_count` and the number of detected bounding boxes as `geometry_count`; `missing_rate` is then computed as `(reading_count - geometry_count) / reading_count`. Finally, the `Total` category summarizes all other categories.
- `page_quality`: contains the IoU information of each page and of the whole paper:
  - `page`: the page index.
  - `num_blocks`: the number of bounding boxes on this page.
  - `area`: the sum of the areas of all blocks, $\sum_i \text{area}(\text{bbox}_i)$.
  - `overlap`: the sum of the intersection areas of all pairs of blocks, $\sum_i\sum_{j>i} \text{area}(\text{bbox}_i \cap \text{bbox}_j)$.
  - `ratio`: the ratio between `overlap` and `area`. Note that this ratio may be very large if there is a template issue.
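
The quantities above can be reproduced with a few lines of code. The sketch below assumes boxes are given as COCO-style `[x, y, width, height]` lists and reads `ratio` as `overlap / area`; both are assumptions, and the function names are hypothetical rather than taken from the dataset tooling.

```python
def box_area(box):
    """Area of one [x, y, width, height] box."""
    _, _, w, h = box
    return max(w, 0) * max(h, 0)


def intersection_area(a, b):
    """Area of the intersection of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)  # zero when the boxes do not overlap


def page_quality(boxes):
    """Compute num_blocks, area, overlap, and ratio for one page."""
    area = sum(box_area(b) for b in boxes)
    # Sum over unordered pairs (i, j) with j > i, as in the formula above.
    overlap = sum(
        intersection_area(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
    ratio = overlap / area if area else 0.0
    return {"num_blocks": len(boxes), "area": area, "overlap": overlap, "ratio": ratio}


def missing_rate(reading_count, geometry_count):
    """(reading_count - geometry_count) / reading_count, or 0.0 for an empty category."""
    if reading_count == 0:
        return 0.0
    return (reading_count - geometry_count) / reading_count
```

Because `overlap` sums over every pair of boxes, heavily stacked boxes can push `ratio` above 1, which is consistent with the note that the ratio may be very large for problematic templates.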