176 changes: 176 additions & 0 deletions datasets/README.md
# S2-VLUE

- [S2-VLUE](#s2-vlue)
  - [Overview](#overview)
  - [Download & Usage](#download--usage)
    - [Download the exported JSON (for training language models)](#download-the-exported-json-for-training-language-models)
    - [Download the source PDFs or screenshots](#download-the-source-pdfs-or-screenshots)
  - [Datasets Details](#datasets-details)
    - [The S2-VL dataset](#the-s2-vl-dataset)
      - [Recreating the dataset from PDFs and annotations](#recreating-the-dataset-from-pdfs-and-annotations)
      - [Dataset Curation Details](#dataset-curation-details)
    - [The VILA-enhanced DocBank Dataset](#the-vila-enhanced-docbank-dataset)
  - [Dataset Details](#dataset-details)
    - [Statistics of the Datasets](#statistics-of-the-datasets)
    - [File Structures](#file-structures)
  - [Reference](#reference)
  - [Citation](#citation)

## Overview

S2-VLUE, the Semantic Scholar **V**isual **L**ayout-enhanced Scientific Text **U**nderstanding **E**valuation benchmark suite, is created to evaluate scientific document understanding and parsing with visual layout information.

It consists of three datasets: GROTOAP2, DocBank, and S2-VL. We modify the existing GROTOAP2 [1] and DocBank [2] datasets, adding visual layout information and converting them to a format compatible with [HuggingFace Datasets](https://huggingface.co/docs/datasets/).
The S2-VL dataset is newly curated and addresses three major drawbacks of existing work: 1) annotation quality, 2) VILA creation, and 3) domain coverage.
It contains human annotations for papers from 19 scientific disciplines.
We provide scripts for downloading the source PDF files as well as for converting them to the same HuggingFace Datasets format.

## Download & Usage

### Download the exported JSON (for training language models)

```bash
cd <vila-root>/datasets
bash ./download.sh <dataset-name>  # grotoap2, docbank, s2-vl, or all
```
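
The exported files are meant to plug into [HuggingFace Datasets](https://huggingface.co/docs/datasets/). Below is a minimal, non-authoritative loading sketch; it assumes the default `../data` download location used by `download.sh` and that each `*-token.json` file is a flat list of page-level records (see [schema-token.json](schema-token.json) for the authoritative layout):

```python
# Minimal sketch, not the official loading code: the paths assume the
# default ../data download location, and treating each *-token.json as
# a flat list of page-level records is an assumption.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={
        "train": "../data/grotoap2/train-token.json",
        "validation": "../data/grotoap2/dev-token.json",
        "test": "../data/grotoap2/test-token.json",
    },
)
print(dataset["train"][0].keys())
```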

### Download the source PDFs or screenshots

- GROTOAP2 (downloading paper PDFs)
- Please follow the instructions from the [GROTOAP2 Project README](http://cermine.ceon.pl/grotoap2/README).
- DocBank (downloading paper page screenshots)
- Please follow the instructions from the [home page of the DocBank Project](https://doc-analysis.github.io/docbank-page/index.html).
- S2-VL (downloading paper PDFs)
- Please check the instructions in [s2-vl-utils/README.md](s2-vl-utils/README.md).

## Datasets Details

### The S2-VL dataset

During the data release process, we unfortunately found that a small portion of the PDFs in our dataset (22 out of 87) carried additional copyright constraints of which we had been unaware, so we could not directly release the data corresponding to these papers. As such, the downloaded version contains only the data created from the remaining 65 papers.

If you are interested in the version of the dataset used for training and evaluation in our paper, please fill out this [Google Form](https://forms.gle/M1g9tQLrUtKSsDYA7) to request access.

#### Recreating the dataset from PDFs and annotations

We also provide the full code to help you recreate the dataset from PDFs and annotation files to the JSON files for training models. Please check the instructions in [s2-vl-utils/README.md](s2-vl-utils/README.md).

#### Dataset Curation Details

Please find a detailed description of the labeling schemas and categories in the following documents:
- [Labeling Instruction](https://docs.google.com/document/d/1DsIDKNEi8GBxrqQuYRy86lCKhksgvyRaGhXPCheGgG0/edit?usp=sharing)
- [S2-VL Category Definition](https://docs.google.com/document/d/1frGmzYOHnVRWAwTOuuPfc3KVAwu-XKdkFSbpLfy78RI/edit?usp=sharing)
- We labeled both layout and semantic categories in S2-VL (see the document above), but only the 16 Layout categories will be used in this evaluation benchmark.
- [The 19 Scientific Disciplines](https://docs.google.com/document/d/1ytJkYhswp4Wlx8tT1iRe-jdjx5A-nqisvUikgmqSQKc/edit?usp=sharing)

### The VILA-enhanced DocBank Dataset

## Dataset Details

### Statistics of the Datasets

| | GROTOAP2 | DocBank | S2-VL-ver1 |
| ----------------- | ------------ | --------------- | ------------------------------ |
| Train/Dev/Test Split | 83k/18k/18k | 398k/50k/50k | * |
| Annotation Method | Automatic | Automatic | Human Annotation |
| Paper Domain | Life Science | Math/Physics/CS | 19 Disciplines |
| VILA Structure | PDF parsing | Vision model | Gold Label / Detection methods |
| # of Categories | 22 | 12 | 15 |

|                           | GROTOAP2 | DocBank | S2-VL-ver1* |
| ------------------------- | -------- | ------- | ----------- |
| **Tokens per Page**       |          |         |             |
| Average                   | 1203     | 838     | 790         |
| Std                       | 591      | 503     | 453         |
| 95th Percentile           | 2307     | 1553    | 1591        |
| **Text Lines per Page**   |          |         |             |
| Average                   | 90       | 60      | 64          |
| Std                       | 51       | 34      | 54          |
| 95th Percentile           | 171      | 125     | 154         |
| **Text Blocks per Page**  |          |         |             |
| Average                   | 12       | 15      | 22          |
| Std                       | 16       | 8       | 36          |
| 95th Percentile           | 37       | 30      | 68          |
| **Tokens per Text Line**  |          |         |             |
| Average                   | 17       | 16      | 14          |
| Std                       | 12       | 43      | 10          |
| 95th Percentile           | 38       | 38      | 30          |
| **Tokens per Text Block** |          |         |             |
| Average                   | 90       | 57      | 48          |
| Std                       | 184      | 138     | 121         |
| 95th Percentile           | 431      | 210     | 249         |

\* Statistics are calculated on S2-VL-ver1 with all 87 papers.

### File Structures

1. The organization of the dataset files:
   ```bash
   grotoap2   # DocBank follows the same structure
   ├─ labels.json
   ├─ train-token.json
   ├─ dev-token.json
   ├─ test-token.json
   └─ train-test-split.json
   ```
2. What's in each file?
   1. `labels.json`
      ```json
      {
          "0": "Title",
          "1": "Author",
          ...
      }
      ```
   2. `train-test-split.json`
      ```json
      {
          "train": ["pdf-file-name", ...],
          "test": ["pdf-file-name", ...]
      }
      ```
   3. `train-token.json`, `dev-token.json`, or `test-token.json`
      Please see the detailed schema explanation in the [schema-token.json](schema-token.json) file; a minimal loading sketch follows this list.
3. A special note on the folder structure for S2-VL: since the dataset is small, we use 5-fold cross-validation in the paper. The released version follows the same structure:
   ```bash
   s2-vl-ver1
   ├─ 0   # fold 0 of the 5-fold cross-validation
   │  ├─ labels.json
   │  ├─ test-token.json
   │  ├─ train-test-split.json
   │  └─ train-token.json
   ├─ 1   # each fold contains the same set of files
   │  ├─ labels.json
   │  ├─ test-token.json
   │  ├─ train-test-split.json
   │  └─ train-token.json
   ├─ 2
   ├─ 3
   └─ 4
   ```
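
A minimal loading sketch in Python, assuming the default `../data` download location and that each `*-token.json` file is a list of page-level records (the exact fields are governed by [schema-token.json](schema-token.json)):

```python
import json
from pathlib import Path

# Minimal sketch: read the label map and the training tokens of fold 0.
# Assumption: each *-token.json holds a list of page-level records whose
# exact fields are defined in schema-token.json.
fold = Path("../data/s2-vl-ver1/0")

with open(fold / "labels.json") as f:
    id2label = json.load(f)   # e.g. {"0": "Title", "1": "Author", ...}

with open(fold / "train-token.json") as f:
    pages = json.load(f)

print(f"{len(pages)} training pages, {len(id2label)} categories")
```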

## Reference

1. The GROTOAP2 Dataset:
- Paper: https://www.dlib.org/dlib/november14/tkaczyk/11tkaczyk.html
- Original download link: http://cermine.ceon.pl/grotoap2/
- License: Open Access

2. The Original DocBank Dataset:
- Paper: https://arxiv.org/pdf/2006.01038.pdf
- Original download link: https://github.com/doc-analysis/DocBank
- License: Apache-2.0

## Citation

```bibtex
@article{Shen2021IncorporatingVL,
title={Incorporating Visual Layout Structures for Scientific Text Classification},
author={Zejiang Shen and Kyle Lo and Lucy Lu Wang and Bailey Kuehl and Daniel S. Weld and Doug Downey},
journal={ArXiv},
year={2021},
volume={abs/2106.00676},
url={https://arxiv.org/abs/2106.00676}
}
```
44 changes: 44 additions & 0 deletions datasets/download.sh
#!/bin/bash

dataset_name="$1"
base_save_path="../data"
mkdir -p "$base_save_path"

S3_BASE_LINK="https://ai2-s2-research.s3.us-west-2.amazonaws.com/s2-vlue"
GROTOAP2_S3_NAME="grotoap2.zip"
DOCBANK_S3_NAME="docbank.zip"
S2_VL_VER1_S3_NAME="s2-vl-ver1-public.zip"

# Download a compiled dataset zip from S3, unpack it into the target
# folder, and clean up the zip afterwards.
download_compiled_dataset () {
    target_path="$1"
    s3_name="$2"
    wget "$S3_BASE_LINK/$s3_name" -O "$base_save_path/$s3_name"
    unzip "$base_save_path/$s3_name" -d "$base_save_path/$target_path"
    rm "$base_save_path/$s3_name"
}

case $dataset_name in

    grotoap2)
        download_compiled_dataset "grotoap2" "$GROTOAP2_S3_NAME"
        ;;

    docbank)
        download_compiled_dataset "docbank" "$DOCBANK_S3_NAME"
        ;;

    s2-vl)
        download_compiled_dataset "s2-vl-ver1" "$S2_VL_VER1_S3_NAME"
        ;;

    all)
        download_compiled_dataset "grotoap2" "$GROTOAP2_S3_NAME"
        download_compiled_dataset "docbank" "$DOCBANK_S3_NAME"
        download_compiled_dataset "s2-vl-ver1" "$S2_VL_VER1_S3_NAME"
        ;;

    *)
        echo "Unknown dataset: $dataset_name"
        exit 1
        ;;
esac
69 changes: 69 additions & 0 deletions datasets/s2-vl-utils/README.md
# Recreating the S2-VL Dataset

- [Recreating the S2-VL Dataset](#recreating-the-s2-vl-dataset)
- [STEP0: Install extra dependencies for creating the dataset](#step0-install-extra-dependencies-for-creating-the-dataset)
- [STEP1: Download the papers & annotations](#step1-download-the-papers--annotations)
- [STEP2: Parse token data using CERMINE](#step2-parse-token-data-using-cermine)
- [STEP3: Run visual layout detectors for getting the text block and line blocks](#step3-run-visual-layout-detectors-for-getting-the-text-block-and-line-blocks)
- [STEP4: Assemble the annotations and export the dataset](#step4-assemble-the-annotations-and-export-the-dataset)

## STEP0: Install extra dependencies for creating the dataset

```bash
cd <vila-root>/datasets/s2-vl-utils
# activate the corresponding environment
pip install -r requirements.txt
```

## STEP1: Download the papers & annotations

```bash
python download.py --base-path sources/s2-vl-ver1
```
This will download the PDF files to `sources/s2-vl-ver1/pdfs` and the annotation files to `sources/s2-vl-ver1/annotations`.
The script checks and reports PDFs whose SHA1 does not match the expected value or that cannot be downloaded; a sketch of the check appears below.
Note: an incompatible SHA for a PDF does not necessarily mean the PDF contents are different.
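
For reference, the SHA check boils down to hashing each downloaded file and comparing it against the expected value. A minimal sketch follows; the actual logic lives in `download.py`, and the expected-hash bookkeeping here is hypothetical:

```python
import hashlib
from pathlib import Path

def pdf_sha1(path: Path) -> str:
    """Compute the SHA1 of a downloaded PDF, chunk by chunk."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The expected hashes would come from the annotation metadata;
# printing them here is purely illustrative.
for pdf in sorted(Path("sources/s2-vl-ver1/pdfs").glob("*.pdf")):
    print(pdf.name, pdf_sha1(pdf))
```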

## STEP2: Parse token data using CERMINE

1. Install Java and CERMINE following the instructions in [this repo](https://github.com/CeON/CERMINE#using-cermine). (The easiest approach is to download CERMINE v1.13 directly from [JFrog](http://maven.ceon.pl/artifactory/webapp/#/artifacts/browse/simple/General/kdd-releases/pl/edu/icm/cermine/cermine-impl).)


2. Run CERMINE on the set of papers to parse the token data, and convert the raw CERMINE output to CSV format:
```bash
python cermine_loader.py \
--base-path sources/s2-vl-ver1 \
--cermine-path /path/to/cermine-impl-1.13-jar-with-dependencies.jar
```
It will create a token table for each page, saved as a `<sha>-<pid>.csv` file in the `sources/tokens` folder; a sketch of the underlying CERMINE invocation follows.
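
If you want to invoke CERMINE by hand, the call that `cermine_loader.py` wraps looks roughly like the following sketch. The `ContentExtractor` entry point and the `-path` flag come from CERMINE's own README; driving it from Python's `subprocess` is just an illustration:

```python
import subprocess

# Sketch of running CERMINE's ContentExtractor over the PDF folder.
# The jar location is a placeholder; point it at your downloaded jar.
CERMINE_JAR = "/path/to/cermine-impl-1.13-jar-with-dependencies.jar"

subprocess.run(
    [
        "java", "-cp", CERMINE_JAR,
        "pl.edu.icm.cermine.ContentExtractor",
        "-path", "sources/s2-vl-ver1/pdfs",
    ],
    check=True,
)
```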

## STEP3: Run visual layout detectors for getting the text blocks and lines

```bash
python vision_model_loader.py --base-path sources
```
It will:
1. run visual layout detection for both text blocks and lines, and save the results as `<pdf-sha>-<page-id>.csv` files in the `sources/blocks` and `sources/lines` folders;
2. combine the text block, line, and token information to create a refined version of the visual layout detection, saved as `<pdf-sha>-<page-id>.csv` files in the `sources/condensed` folder. (A sketch of a possible detection call follows this list.)
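
The detection step is model-dependent; as a non-authoritative sketch, a [layoutparser](https://layout-parser.github.io/)-style detector could be driven as below. The PubLayNet model, the score threshold, and the page image path are all assumptions, not necessarily what `vision_model_loader.py` uses:

```python
import layoutparser as lp
from PIL import Image

# Sketch only: a PubLayNet-trained Detectron2 detector standing in for
# whatever model vision_model_loader.py actually loads.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
)

page_image = Image.open("page-0.png")  # a pre-rendered page screenshot (hypothetical path)
layout = model.detect(page_image)
for block in layout:
    print(block.type, block.coordinates)
```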

## STEP4: Assemble the annotations and export the dataset

```bash
python condense_dataset.py \
--annotation-folder 'sources/s2-vl-ver1/annotations' \
--annotation-table 'sources/s2-vl-ver1/annotation_table.csv' \
--cermine-pdf-dir 'sources/s2-vl-ver1/pdfs' \
--cermine-csv-dir 'sources/s2-vl-ver1/tokens' \
--vision-csv-dir 'sources/s2-vl-ver1/condensed' \
--export-folder 'export/s2-vl-ver1' \
--config './config.json'
```

It will convert all the source data in the source folder to a format that can be used directly for training the language models. By default, it splits the dataset into 5 folds for cross-validation. The output folder is specified via the `--export-folder` option. Several options control how the training dataset is created; perhaps the most important is which notion of blocks and lines to use when constructing the dataset. The available options are:

| Source of blocks | Source of lines | Option               |
| ---------------- | --------------- | -------------------- |
| CERMINE          | CERMINE         | - (default behavior) |
| Vision Model     | CERMINE         | `--use-vision-box`   |
| Vision Model     | Vision Model    | `--use-vision-line`  |
| Ground-Truth     | Vision Model    | `--use-gt-box`       |