📂 Dataset Setup Guide

This project is compatible with three official datasets required for T1:

• CVL
• Historical-WI
• HisIR19

⚠️ Note: The datasets are not included in this repository because of their size. Download them manually from their official sources and place them as described below.

1. CVL Dataset

• Download: CVL Database
• Expected folder structure after extraction:

    data/raw/cvl/
        pages_all/           # contains all page images (.tif/.jpg)
        cvl-database-1-1/    (meta info, optional)

• Preprocessing command:

    python scripts/preprocess_binarize.py cvl \
        --in_dir data/raw/cvl/pages_all \
        --out_root data/images/cvl \
        --method sauvola --win 51
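The `--method sauvola --win 51` options presumably select Sauvola adaptive thresholding over a 51-pixel window. The script's own implementation is not shown in this guide, so the following is only a minimal pure-Python sketch of the standard Sauvola rule, T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation. The function name `sauvola_binarize` and the values k = 0.2 and R = 128 are assumptions chosen for illustration, not the script's confirmed defaults.

```python
from math import sqrt


def sauvola_binarize(gray, win=51, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0..255 ints.

    Each pixel is compared with T = m * (1 + k * (s / r - 1)), where m and s
    are the mean and standard deviation of the win x win neighbourhood
    (clipped at the image border). Returns rows of 0 (ink) / 255 (paper).
    """
    height, width = len(gray), len(gray[0])
    half = win // 2

    # Summed-area tables of values and squared values, padded with a zero
    # row and column so that window sums need no special cases.
    sums = [[0] * (width + 1) for _ in range(height + 1)]
    squares = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            v = gray[y][x]
            sums[y + 1][x + 1] = v + sums[y][x + 1] + sums[y + 1][x] - sums[y][x]
            squares[y + 1][x + 1] = (
                v * v + squares[y][x + 1] + squares[y + 1][x] - squares[y][x]
            )

    out = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            total = sums[y1][x1] - sums[y0][x1] - sums[y1][x0] + sums[y0][x0]
            total_sq = (
                squares[y1][x1] - squares[y0][x1] - squares[y1][x0] + squares[y0][x0]
            )
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0)
            threshold = mean * (1 + k * (sqrt(variance) / r - 1))
            row.append(255 if gray[y][x] > threshold else 0)
        out.append(row)
    return out
```

Because the threshold follows the local mean and contrast, a larger window such as 51 smooths over individual strokes, while a smaller one reacts more strongly to local noise. For real page images a vectorized library routine would be far faster than this loop-based sketch.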

2. Historical-WI Dataset

• Download: ICDAR2017 Historical-WI Competition
    ◦ icdar17-historicalwi-training-binarized.zip → train split
    ◦ ScriptNet-HistoricalWI-2017-binarized.zip → test split
• Expected folder structure after extraction:

    data/raw/historical-wi/
        train/    # extracted training pages
        test/     # extracted test pages

• Preprocessing command:

    python scripts/preprocess_binarize.py split \
        --train_dir data/raw/historical-wi/train \
        --test_dir data/raw/historical-wi/test \
        --out_root data/images/historical-wi \
        --method sauvola --win 51
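The extraction step for this dataset maps two archives onto two split folders, which is easy to get backwards by hand. As a rough sketch, it could be scripted with the standard-library `zipfile` module. The archive names and target folders come from this guide; the helper name `extract_splits` and the idea of a separate download folder are assumptions.

```python
import zipfile
from pathlib import Path

# Archive names come from the guide; where they were downloaded to is an
# assumption and is passed in by the caller.
ARCHIVE_TO_SPLIT = {
    "icdar17-historicalwi-training-binarized.zip": "train",
    "ScriptNet-HistoricalWI-2017-binarized.zip": "test",
}


def extract_splits(download_dir, raw_root="data/raw/historical-wi"):
    """Extract each known archive found in download_dir into its split folder.

    Returns the list of splits that were extracted.
    """
    extracted = []
    for archive_name, split in ARCHIVE_TO_SPLIT.items():
        archive = Path(download_dir) / archive_name
        if not archive.is_file():
            continue  # skip archives that have not been downloaded yet
        target = Path(raw_root) / split
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(split)
    return extracted
```

If an archive contains its own top-level folder, the pages will end up one level deeper than `train/` or `test/`. In that case, either move them up or point `--train_dir` and `--test_dir` at the nested folder.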

3. HisIR19 Dataset

• Download: HisIR19 Competition
    ◦ train_gt.csv and test_gt.csv (ground-truth CSVs)
    ◦ images/ (all page images in one folder)
• Expected folder structure after extraction:

    data/raw/hisir19/
        images/         # all page images
        train_gt.csv    # official training split
        test_gt.csv     # official test split

• Preprocessing commands:

    python scripts/preprocess_binarize.py hisir19 \
        --csv data/raw/hisir19/train_gt.csv \
        --in_dir data/raw/hisir19/images \
        --out_root data/images/hisir19 \
        --method sauvola --win 51

    python scripts/preprocess_binarize.py hisir19 \
        --csv data/raw/hisir19/test_gt.csv \
        --in_dir data/raw/hisir19/images \
        --out_root data/images/hisir19 \
        --method sauvola --win 51
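Both commands read from the same `images/` folder and differ only in the CSV they are given, so it may be worth confirming the layout once before running them. The helper below is a small sketch of such a check. The name `missing_entries` is hypothetical, and it only tests that the three entries listed above exist, not that the CSV contents match the images.

```python
from pathlib import Path

# Entries the guide expects under data/raw/hisir19/ after extraction.
EXPECTED_ENTRIES = ("images", "train_gt.csv", "test_gt.csv")


def missing_entries(root="data/raw/hisir19", expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_entries()` from the repository root returns an empty list when the layout matches. Otherwise it names what still has to be downloaded or moved.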

4. After Preprocessing

• data/train.csv and data/val.csv will be created automatically.
• Normalized and binarized page images will be stored in data/images/<dataset>/.

• You can now train:

    python -m src.train --config configs/train-official.yaml
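Before committing to a long run, a quick look at the generated data/train.csv and data/val.csv can catch an empty or truncated split. This guide does not document the column layout of those files, so the sketch below only reports the first row and the number of remaining rows. The helper name `summarize_csv` and the assumption that the first row is a header are illustrative.

```python
import csv


def summarize_csv(path):
    """Return (first_row, number_of_remaining_rows) for a CSV file.

    The first row is treated as a header, which is an assumption about the
    generated files rather than something this guide specifies.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_csv("data/train.csv")` and `summarize_csv("data/val.csv")` should both report a non-zero row count once preprocessing has finished.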

For multi-GPU training (this example launches 4 processes on one node, typically one per GPU):

    torchrun --nproc_per_node=4 -m src.train --config configs/train-official.yaml
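`torchrun` starts one process per requested worker and tells each one its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. How `src.train` consumes them is not shown in this guide. The following is only a sketch of how a training script might read them, with the helper name `distributed_context` being hypothetical and with single-process defaults when the variables are absent.

```python
import os


def distributed_context(env=None):
    """Read the rank variables torchrun exports, defaulting to one process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }
```

With `--nproc_per_node=4` on a single machine, each process would see a `world_size` of 4 and a distinct `local_rank` from 0 to 3, which is commonly used to pick the GPU for that process. Under a plain `python -m src.train` launch the defaults describe a single process.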
