uzbek-tokenizer

Efficient Uzbek Text Processing & Byte-Level BPE Tokenizer – a streamlined pipeline for cleaning 15GB of Uzbek text, removing Cyrillic-script content, and training a custom Byte-Level BPE tokenizer for NLP applications. Keywords: Uzbek NLP, text processing, tokenizer training, Cyrillic removal, large-scale dataset

Overview

This project demonstrates:

  • Data Acquisition: Loading datasets from the Hugging Face Hub and downloading additional corpora (e.g., Leipzig Wortschatz); see the sketch after this list.
  • Data Preprocessing: Cleaning raw text data, removing excessive newlines, and filtering out Cyrillic characters.
  • Tokenizer Training: Building a Byte-Level BPE tokenizer customized for Uzbek language processing.
  • Scalability: Efficient handling of large-scale (15GB) text data to support real-world NLP applications.
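
As an illustration, the acquisition step might look like the Python sketch below. The dataset id and Leipzig archive name here are hypothetical placeholders; the project's actual sources are the ones named in FILES_UZ_CORPORA and the Acknowledgments.

import os
import requests
from datasets import load_dataset

RAW_DIR = "./raw_dataset"
os.makedirs(RAW_DIR, exist_ok=True)

# Load a corpus from the Hugging Face Hub (dataset id is a hypothetical example).
books = load_dataset("tahrirchi/uz-books", split="train")

# Download Leipzig Wortschatz archives into ./raw_dataset
# (archive name is a hypothetical example; use the repository's FILES_UZ_CORPORA list).
FILES_UZ_CORPORA = ["uzb_news_2020_1M.tar.gz"]
for name in FILES_UZ_CORPORA:
    url = "https://downloads.wortschatz-leipzig.de/corpora/" + name
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    with open(os.path.join(RAW_DIR, name), "wb") as f:
        f.write(resp.content)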

Features

  • Large-scale Data Processing: Handles 15GB of raw Uzbek text data.
  • Custom Preprocessing: Removes Cyrillic-script text and normalizes the remainder for cleaner tokenization; a minimal sketch follows this list.
  • Byte-Level BPE Tokenizer: Trains a Byte-Level BPE tokenizer using the Hugging Face tokenizers library.
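
A minimal sketch of the preprocessing step, assuming the cleaning drops Cyrillic-containing lines and collapses repeated newlines (the function and file names are illustrative, not the repository's actual code):

import re

CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic Unicode block

def clean_text(text):
    # Drop lines containing Cyrillic characters, then collapse runs of blank lines.
    lines = [ln for ln in text.splitlines() if not CYRILLIC_RE.search(ln)]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

# Hypothetical file names; the pipeline reads from ./raw_dataset
# and writes cleaned text to ./processed_dataset.
with open("./raw_dataset/sample.txt", encoding="utf-8") as f:
    raw = f.read()
with open("./processed_dataset/sample.txt", "w", encoding="utf-8") as f:
    f.write(clean_text(raw))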

Requirements

Python 3 with pip; all Python dependencies are listed in requirements.txt and are installed in the step below.

Installation

Clone the repository and install the dependencies:

git clone https://github.com/bekki3/uzbek-tokenizer.git
cd uzbek-tokenizer
pip install -r requirements.txt

Setup steps

Download the Uzbek corpora from Leipzig Wortschatz listed in FILES_UZ_CORPORA and place them inside the ./raw_dataset folder. Then create a ./processed_dataset folder; the cleaned files written there are what the tokenizer is trained on, as sketched below.
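
For reference, training with the Hugging Face tokenizers library typically follows the pattern below. The vocabulary size, special tokens, and output path are illustrative assumptions, not the repository's exact settings:

import os
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Train on every cleaned text file in ./processed_dataset.
files = [str(p) for p in Path("./processed_dataset").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=32000,  # illustrative choice
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

os.makedirs("./tokenizer", exist_ok=True)
tokenizer.save_model("./tokenizer")  # writes vocab.json and merges.txt

Because byte-level BPE operates on raw UTF-8 bytes, the resulting tokenizer never produces out-of-vocabulary tokens, which is convenient for Uzbek Latin text with modifier letters such as oʻ and gʻ.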

Acknowledgments

Thanks to Tahrirchi for the 40K Uzbek books and 2.8GB of Uzbek crawled news, and to the Leipzig Corpora Collection for their invaluable datasets.

Important

Keep pushing the boundaries of Uzbek NLP! Every contribution, no matter how small, advances language technology and preserves Uzbek cultural heritage. Also, don't forget to leave a star.
