SynthDog-RTL

SynthDog-RTL is a specialized tool for generating synthetic datasets tailored for RTL (Right-to-Left) languages such as Arabic, Urdu, Persian, Hebrew, Hindi, and other similar languages. This repository is based on the original SynthDog designed for LTR (Left-to-Right) languages, and includes modifications to support RTL text rendering for OCR applications, particularly using the Donut OCR model.

Overview

SynthDog-RTL enables Machine Learning Engineers to create synthetic datasets that are crucial for training OCR models for RTL languages. This is especially useful for document analysis tasks, data extraction, and multilingual document processing using Donut OCR or similar models.

Key Features:

Full support for RTL languages.
Configurable synthetic data generation.
Adjustable parameters for dataset quality, layout, and image effects.
Capability to incorporate custom fonts and language-specific text corpora.

Installation

To get started, first, clone this repository and install the required dependencies:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL
pip install synthtiger Pillow==9.5.0

Project Structure

/Synthdog-RTL
  ├── resources/
  │   ├── background/
  │   ├── paper/
  │   ├── font/
  │   │   └── ur/         # Folder for Urdu fonts
  │   └── corpus/
  │       └── urdu_sample.txt  # Sample text for Urdu
  └── config_ur.yaml       # Configuration file for Urdu

Explanation:

background/: Background images for the synthetic documents.
paper/: Paper texture images.
font/: Font files for target RTL languages.
corpus/: Sample text for target languages.

Usage

To generate synthetic datasets, follow these steps:

Step 1: Setup Configuration

Create a text file (urdu_sample.txt) with sample paragraphs in your target language (e.g., Urdu).
Download fonts from Google Fonts that support your language and place them in the appropriate directory under resources/font/.

For instructions on downloading fonts from Google Fonts, check the Google Fonts Documentation.
Modify the configuration file (e.g., config_ur.yaml) to specify language-specific settings, font paths, and corpus paths.

Step 2: Generate Dataset

Run the following command to generate the dataset:

synthtiger -o ./outputs/SynthDoG_ur -c 10 -w 2 -v template.py SynthDoG config_ur.yaml

Parameters:

-o: Output directory for generated images.
-c: Number of samples to generate.
-w: Number of workers to use.
config_ur.yaml: Configuration file with dataset settings.

For more detailed instructions on generating datasets and customizing configurations, check the Complete Tutorial.

Customization Guide

Below are some common attributes you might want to adjust in the configuration file (config_ur.yaml) for customization:

Document Layout:
- landscape: Adjust to 1 for landscape orientation or 0 for portrait.
- fullscreen: Set to 1 to fill the entire document background with text.
Text Adjustments:
- font.bold: Set to 1 to enable bold text.
- align: Modify alignment (e.g., right for RTL languages).
Effects:
- Elastic distortion, noise, and blur parameters can be adjusted for creating more diverse synthetic datasets.

Adding New Languages

To extend support for other RTL languages, follow these steps:

Create a new corpus file with sample text in the target language.
Add relevant fonts to the resources/font/<language-code>/ directory.
Update the configuration file paths to reflect the new language resources.

References & Acknowledgments

This repository is based on the original SynthDoG for LTR Languages developed by Clova AI. Any copyright-related queries should refer to the original repository.

For additional information on using SynthDog-RTL and customizing configurations, see our Complete Tutorial.

License

Refer to the LICENSE file for detailed license information.

Contributions

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions for improvement or new features.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
elements		elements
layouts		layouts
outputs/SynthDoG_ur		outputs/SynthDoG_ur
resources		resources
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config_en.yaml		config_en.yaml
config_ja.yaml		config_ja.yaml
config_ko.yaml		config_ko.yaml
config_ur.yaml		config_ur.yaml
config_zh.yaml		config_zh.yaml
template.py		template.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SynthDog-RTL

Overview

Key Features:

Installation

Project Structure

Explanation:

Usage

Step 1: Setup Configuration

Step 2: Generate Dataset

Parameters:

Customization Guide

Adding New Languages

References & Acknowledgments

License

Contributions

About

Uh oh!

Releases

Packages

Languages

License

aiviewz/Synthdog-RTL

Folders and files

Latest commit

History

Repository files navigation

SynthDog-RTL

Overview

Key Features:

Installation

Project Structure

Explanation:

Usage

Step 1: Setup Configuration

Step 2: Generate Dataset

Parameters:

Customization Guide

Adding New Languages

References & Acknowledgments

License

Contributions

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages