Dataset Setup Guide

Uploaded by

Hafeez ullah Jamro

📂 Dataset Setup Guide

This project is compatible with three official datasets required for T1:

• CVL
• Historical-WI
• HisIR19

⚠️ Note: The datasets are not included in this repository because of their size. Download them manually from their official sources and place them as described below.
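Before preprocessing, it can help to confirm that the raw folders are where the scripts will look for them. The following is a minimal sketch, not part of the repository: the paths are the ones listed in the per-dataset sections of this guide, while the helper name `find_missing` and the idea of running it from the project root are assumptions made for illustration.

```python
from pathlib import Path

# Raw-data paths taken from the folder structures described in this guide.
EXPECTED_PATHS = {
    "cvl": ["data/raw/cvl/pages_all"],
    "historical-wi": [
        "data/raw/historical-wi/train",
        "data/raw/historical-wi/test",
    ],
    "hisir19": [
        "data/raw/hisir19/images",
        "data/raw/hisir19/train_gt.csv",
        "data/raw/hisir19/test_gt.csv",
    ],
}


def find_missing(project_root="."):
    """Return {dataset: [missing paths]} for every dataset with gaps.

    `find_missing` is a hypothetical helper, not a script shipped with
    the project.
    """
    root = Path(project_root)
    missing = {}
    for dataset, rel_paths in EXPECTED_PATHS.items():
        absent = [p for p in rel_paths if not (root / p).exists()]
        if absent:
            missing[dataset] = absent
    return missing


if __name__ == "__main__":
    problems = find_missing()
    for dataset, paths in problems.items():
        print(f"{dataset}: missing {', '.join(paths)}")
    if not problems:
        print("All expected raw dataset paths are present.")
```

An empty result means every folder and CSV named in this guide exists. The check does not look inside the folders.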

1. CVL Dataset

• Download: CVL Database

• Expected folder structure after extraction:

data/raw/cvl/
    pages_all/            # contains all page images (.tif/.jpg)
    cvl-database-1-1/     # meta info, optional

• Preprocessing command:

python scripts/preprocess_binarize.py cvl \
    --in_dir data/raw/cvl/pages_all \
    --out_root data/images/cvl \
    --method sauvola --win 51
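The `--method sauvola --win 51` options select Sauvola's local thresholding with a 51-pixel window. This guide does not show how `preprocess_binarize.py` implements it, so the following is only an illustrative sketch of the standard formula, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the window around a pixel. The values k = 0.2 and R = 128 are common defaults and are assumptions here, as are the function names.

```python
import math


def sauvola_threshold(mean, std, k=0.2, r=128.0):
    """Sauvola's local threshold: T = m * (1 + k * (s / R - 1)).

    k and r are assumed defaults, not values taken from the project.
    """
    return mean * (1.0 + k * (std / r - 1.0))


def binarize_sauvola(image, win=51, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    Each pixel is compared with a threshold computed from the mean and
    standard deviation of the win x win window centred on it (the window
    is clipped at the image border). Returns 0 for ink, 255 for background.
    """
    height, width = len(image), len(image[0])
    half = win // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            threshold = sauvola_threshold(mean, math.sqrt(var), k, r)
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result


# Tiny example: a light page with one dark pixel in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 20
for line in binarize_sauvola(page, win=3):
    print(line)
```

This pure-Python loop is far too slow for full page scans. In practice a library routine such as scikit-image's `threshold_sauvola` would be the more likely choice, although whether the project uses it is not stated in this guide.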

2. Historical-WI Dataset

• Download: ICDAR2017 Historical-WI Competition
    o icdar17-historicalwi-training-binarized.zip → train split
    o ScriptNet-HistoricalWI-2017-binarized.zip → test split

• Expected folder structure after extraction:

data/raw/historical-wi/
    train/    # extracted training pages
    test/     # extracted test pages

• Preprocessing command:

python scripts/preprocess_binarize.py split \
    --train_dir data/raw/historical-wi/train \
    --test_dir data/raw/historical-wi/test \
    --out_root data/images/historical-wi \
    --method sauvola --win 51
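The mapping from the two downloaded archives to the `train/` and `test/` folders above could be scripted roughly as follows. This is a sketch under assumptions: `download_dir` is a hypothetical location for the downloaded zip files, and the helper `extract_splits` is not part of the repository. The archives may also contain a top-level folder of their own, in which case the pages end up one level deeper than shown in the expected structure and may need to be moved.

```python
import zipfile
from pathlib import Path

# Archive names and target folders as listed in this guide.
ARCHIVE_TO_SPLIT = {
    "icdar17-historicalwi-training-binarized.zip": "train",
    "ScriptNet-HistoricalWI-2017-binarized.zip": "test",
}


def extract_splits(download_dir, raw_root="data/raw/historical-wi"):
    """Extract each known archive found in download_dir into its split folder.

    Hypothetical helper: returns {split: target_path} for the archives
    that were actually found and extracted.
    """
    extracted = {}
    for archive_name, split in ARCHIVE_TO_SPLIT.items():
        archive = Path(download_dir) / archive_name
        if not archive.is_file():
            continue  # skip archives that have not been downloaded yet
        target = Path(raw_root) / split
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted[split] = target
    return extracted
```

Calling `extract_splits("downloads")` from the project root would fill `data/raw/historical-wi/train` and `data/raw/historical-wi/test` for whichever archives are present in `downloads/`.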

3. HisIR19 Dataset

• Download: HisIR19 Competition
    o train_gt.csv and test_gt.csv (ground-truth CSVs)
    o images/ (all page images in one folder)

• Expected folder structure after extraction:

data/raw/hisir19/
    images/         # all page images
    train_gt.csv    # official training split
    test_gt.csv     # official test split

• Preprocessing commands (run once per ground-truth CSV):

python scripts/preprocess_binarize.py hisir19 \
    --csv data/raw/hisir19/train_gt.csv \
    --in_dir data/raw/hisir19/images \
    --out_root data/images/hisir19 \
    --method sauvola --win 51

python scripts/preprocess_binarize.py hisir19 \
    --csv data/raw/hisir19/test_gt.csv \
    --in_dir data/raw/hisir19/images \
    --out_root data/images/hisir19 \
    --method sauvola --win 51
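The two HisIR19 commands differ only in the `--csv` argument, so they can be generated from one template. The sketch below only builds and prints the command lines, using the flags and paths shown above; the helper name `build_hisir19_command` is an assumption made for illustration.

```python
import shlex

SCRIPT = "scripts/preprocess_binarize.py"
RAW_ROOT = "data/raw/hisir19"


def build_hisir19_command(split, method="sauvola", win=51):
    """Build the argv list for one split ('train' or 'test').

    Hypothetical helper; the flags mirror the commands in this guide.
    """
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return [
        "python", SCRIPT, "hisir19",
        "--csv", f"{RAW_ROOT}/{split}_gt.csv",
        "--in_dir", f"{RAW_ROOT}/images",
        "--out_root", "data/images/hisir19",
        "--method", method,
        "--win", str(win),
    ]


# Print both commands so they can be reviewed before running them.
for split in ("train", "test"):
    print(shlex.join(build_hisir19_command(split)))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)` from the project root.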

4. After Preprocessing

• data/train.csv and data/val.csv will be created automatically.
• Normalized + binarized page images will be stored in data/images/<dataset>/.
• You can now train:

python -m src.train --config configs/train-official.yaml

For multi-GPU training (this example launches 4 processes on one node, typically one per GPU):

torchrun --nproc_per_node=4 -m src.train --config configs/train-official.yaml
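A quick check of the preprocessing outputs listed in this section, run before launching training, might look like the sketch below. It relies only on the paths named above (`data/train.csv`, `data/val.csv`, `data/images/<dataset>/`). The helper `summarize_outputs` is hypothetical, and the guide does not say whether the CSVs have a header row, so the counts include every record.

```python
import csv
from pathlib import Path


def summarize_outputs(project_root="."):
    """Report row counts of the split CSVs and file counts per dataset.

    Hypothetical helper, not part of the repository.
    """
    root = Path(project_root)
    summary = {}
    for name in ("train.csv", "val.csv"):
        path = root / "data" / name
        if path.is_file():
            with path.open(newline="") as fh:
                # Counts every CSV record, including a header row if present.
                summary[name] = sum(1 for _ in csv.reader(fh))
        else:
            summary[name] = None  # not created yet
    images_root = root / "data" / "images"
    if images_root.is_dir():
        for dataset_dir in sorted(p for p in images_root.iterdir() if p.is_dir()):
            files = [p for p in dataset_dir.rglob("*") if p.is_file()]
            summary[f"images/{dataset_dir.name}"] = len(files)
    return summary


if __name__ == "__main__":
    for key, value in summarize_outputs().items():
        print(f"{key}: {value if value is not None else 'missing'}")
```

A `None` entry for either CSV, or a dataset folder with zero files, suggests that the corresponding preprocessing step has not run or has failed.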
