📂 Dataset Setup Guide
This project is compatible with three official datasets required for T1:
CVL
Historical-WI
HisIR19
⚠️ Note: Datasets are not included in this repository due to size. Please
download them manually from their official sources and place them as
described below.
1. CVL Dataset
Download: CVL Database
Expected folder structure after extraction:
data/raw/cvl/
pages_all/ # contains all page images (.tif/.jpg)
cvl-database-1-1/ # meta info, optional
Preprocessing command:
python scripts/preprocess_binarize.py cvl \
--in_dir data/raw/cvl/pages_all \
--out_root data/images/cvl \
--method sauvola --win 51
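For readers unfamiliar with the --method sauvola --win 51 options: Sauvola binarization picks a separate threshold for every pixel, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the gray values in a win x win window around that pixel. The sketch below is a pure-Python illustration of that formula, not the code in scripts/preprocess_binarize.py. The function names are hypothetical, and k = 0.2 and R = 128 are assumed values, since this guide does not say what the script uses.

```python
import math

def sauvola_threshold(gray, win=51, k=0.2, r=128.0):
    """Per-pixel Sauvola thresholds for a 2-D list of gray values (0-255).

    k and r are assumed values; the project script may use different ones.
    """
    h, w = len(gray), len(gray[0])
    half = win // 2
    # Integral images of values and squared values (one extra zero row/column)
    # so that any window sum costs four lookups.
    s1 = [[0.0] * (w + 1) for _ in range(h + 1)]
    s2 = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            v = float(gray[y][x])
            s1[y + 1][x + 1] = v + s1[y][x + 1] + s1[y + 1][x] - s1[y][x]
            s2[y + 1][x + 1] = v * v + s2[y][x + 1] + s2[y + 1][x] - s2[y][x]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            # Windows are clipped at the image border.
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            total = s1[y1][x1] - s1[y0][x1] - s1[y1][x0] + s1[y0][x0]
            total_sq = s2[y1][x1] - s2[y0][x1] - s2[y1][x0] + s2[y0][x0]
            mean = total / n
            std = math.sqrt(max(0.0, total_sq / n - mean * mean))
            out[y][x] = mean * (1.0 + k * (std / r - 1.0))
    return out

def binarize(gray, win=51, k=0.2, r=128.0):
    """Map pixels above their local threshold to 255 (paper), others to 0 (ink)."""
    thr = sauvola_threshold(gray, win, k, r)
    return [
        [255 if gray[y][x] > thr[y][x] else 0 for x in range(len(gray[0]))]
        for y in range(len(gray))
    ]
```

A larger window averages over more context, which suits large handwriting and uneven backgrounds; a smaller one follows local contrast more closely.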
2. Historical-WI Dataset
Download: ICDAR2017 Historical-WI Competition
- icdar17-historicalwi-training-binarized.zip → train split
- ScriptNet-HistoricalWI-2017-binarized.zip → test split
Expected folder structure after extraction:
data/raw/historical-wi/
train/ # extracted training pages
test/ # extracted test pages
Preprocessing command:
python scripts/preprocess_binarize.py split \
--train_dir data/raw/historical-wi/train \
--test_dir data/raw/historical-wi/test \
--out_root data/images/historical-wi \
--method sauvola --win 51
3. HisIR19 Dataset
Download: HisIR19 Competition
- train_gt.csv and test_gt.csv (ground-truth CSVs)
- images/ (all page images in one folder)
Expected folder structure after extraction:
data/raw/hisir19/
images/ # all page images
train_gt.csv # official training split
test_gt.csv # official test split
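Before running any preprocessing command, it can help to confirm that the raw data sits where this guide expects it. The following is a minimal sketch that checks the three folder layouts described above; the helper name find_missing and the EXPECTED_LAYOUT table are illustrative and not part of the repository.

```python
from pathlib import Path

# Entries each dataset folder is expected to contain, taken from this guide.
# The optional CVL meta-info folder is deliberately left out.
EXPECTED_LAYOUT = {
    "cvl": ["pages_all"],
    "historical-wi": ["train", "test"],
    "hisir19": ["images", "train_gt.csv", "test_gt.csv"],
}

def find_missing(raw_root="data/raw", layout=EXPECTED_LAYOUT):
    """Return the expected paths (relative to raw_root) that do not exist."""
    root = Path(raw_root)
    missing = []
    for dataset, entries in layout.items():
        for entry in entries:
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing
```

Calling find_missing() from the repository root returns an empty list when everything is in place; any returned path names a folder or file that still needs to be downloaded or moved.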
Preprocessing commands (run once per ground-truth CSV: first train_gt.csv, then test_gt.csv):
python scripts/preprocess_binarize.py hisir19 \
--csv data/raw/hisir19/train_gt.csv \
--in_dir data/raw/hisir19/images \
--out_root data/images/hisir19 \
--method sauvola --win 51
python scripts/preprocess_binarize.py hisir19 \
--csv data/raw/hisir19/test_gt.csv \
--in_dir data/raw/hisir19/images \
--out_root data/images/hisir19 \
--method sauvola --win 51
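The two HisIR19 invocations differ only in the CSV they read, so they can be generated from one template. The sketch below builds the same argument lists as the commands above and runs them in order; the function names are hypothetical, and only flags that appear in this guide are used.

```python
import subprocess
import sys

def hisir19_command(split, method="sauvola", win=51):
    """Build the argument list for one HisIR19 split ("train" or "test")."""
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return [
        sys.executable, "scripts/preprocess_binarize.py", "hisir19",
        "--csv", f"data/raw/hisir19/{split}_gt.csv",
        "--in_dir", "data/raw/hisir19/images",
        "--out_root", "data/images/hisir19",
        "--method", method, "--win", str(win),
    ]

def run_both_splits():
    """Run preprocessing for train, then test; stop at the first failure."""
    for split in ("train", "test"):
        subprocess.run(hisir19_command(split), check=True)
```

Run run_both_splits() from the repository root so that the relative paths resolve.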
4. After Preprocessing
Preprocessing creates data/train.csv and data/val.csv automatically.
Normalized and binarized page images are stored in
data/images/<dataset>/.
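A quick sanity check before training is to confirm that these outputs exist and are not empty. The sketch below is one way to do that, assuming only the paths named above; the column layout of the CSV files is not documented here, so the code counts rows without interpreting them, and the helper names are illustrative.

```python
import csv
from pathlib import Path

def count_rows(csv_path):
    """Count non-empty rows in a CSV file (any header row is included)."""
    with open(csv_path, newline="") as f:
        return sum(1 for row in csv.reader(f) if row)

def summarize(data_root="data"):
    """Report row counts for the split CSVs and file counts per image folder."""
    root = Path(data_root)
    report = {}
    for name in ("train.csv", "val.csv"):
        path = root / name
        # None marks a file that preprocessing has not produced yet.
        report[name] = count_rows(path) if path.is_file() else None
    images_root = root / "images"
    if images_root.is_dir():
        for dataset_dir in sorted(p for p in images_root.iterdir() if p.is_dir()):
            n_files = sum(1 for p in dataset_dir.rglob("*") if p.is_file())
            report[f"images/{dataset_dir.name}"] = n_files
    return report
```

A None entry or a zero count in the result of summarize() points to a preprocessing step that did not run or wrote its output elsewhere.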
You can now train:
python -m src.train --config configs/train-official.yaml
For multi-GPU:
torchrun --nproc_per_node=4 -m src.train --config configs/train-official.yaml
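With --nproc_per_node=4, torchrun starts four worker processes on the machine, typically one per GPU, and tells each its position through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. How src.train consumes them is not described in this guide; the sketch below, with a hypothetical helper name, only shows how a script could read them and fall back to single-process values when launched with plain python.

```python
import os

def distributed_context(env=None):
    """Read the process layout set by torchrun, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this machine
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }
```

Set --nproc_per_node to the number of GPUs actually available on the machine.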