Efficient Uzbek Text Processing & Byte-Level BPE Tokenizer – A streamlined pipeline for cleaning 15GB of Uzbek text, removing Cyrillic-script content, and training a custom Byte-Level BPE tokenizer for downstream NLP applications. Keywords: Uzbek NLP, text processing, tokenizer training, Cyrillic removal, large-scale dataset
This project demonstrates:
- Data Acquisition: Loading datasets from the Hugging Face Hub and downloading additional corpora (e.g., Leipzig Wortschatz).
- Data Preprocessing: Cleaning raw text data, removing excessive newlines, and filtering out Cyrillic characters (see the cleaning sketch after the feature list below).
- Tokenizer Training: Building a Byte-Level BPE tokenizer customized for Uzbek language processing.
- Scalability: Efficient handling of large-scale (15GB) text data to support real-world NLP applications.
- Large-scale Data Processing: Handles 13GB of Uzbek text data.
- Custom Preprocessing: Removes Cyrillic scripts and normalizes text for cleaner tokenization.
- State-of-the-art Tokenizer: Trains a Byte-Level BPE tokenizer using the Hugging Face `tokenizers` library.
- Keywords: "Uzbek NLP", "Byte-Level BPE", "text processing", "tokenizer training", and "natural language processing".
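The acquisition and cleaning steps can be combined into a single pass over the raw corpora. The following is a minimal sketch rather than the project's exact script: the dataset id `tahrirchi/uz-crawl`, the `text` column, and the output filename are placeholders to be replaced with the corpora this project actually uses.

```python
import os
import re

from datasets import load_dataset

# Any character from the Cyrillic Unicode block; lines containing one are dropped
# so only Latin-script Uzbek text reaches the tokenizer.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def clean_text(text: str) -> str:
    """Collapse excessive newlines and drop lines containing Cyrillic characters."""
    lines = (line.strip() for line in text.splitlines())
    kept = [line for line in lines if line and not CYRILLIC_RE.search(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    os.makedirs("./processed_dataset", exist_ok=True)

    # Placeholder dataset id and column name -- substitute the Hugging Face
    # datasets and Leipzig Wortschatz files actually listed in the project.
    dataset = load_dataset("tahrirchi/uz-crawl", split="train", streaming=True)

    with open("./processed_dataset/uz_clean.txt", "w", encoding="utf-8") as out:
        for record in dataset:
            cleaned = clean_text(record["text"])
            if cleaned:
                out.write(cleaned + "\n")
```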
Requirements:
- Python 3.7+
- datasets
- tokenizers
- Standard library modules: `os`, `re`
Clone the repository and install the dependencies:

```bash
git clone https://github.com/bekki3/uzbek-tokenizer.git
cd uzbek-tokenizer
pip install -r requirements.txt
```

Download the Uzbek corpora from Leipzig Wortschatz listed in `FILES_UZ_CORPORA` and place them inside the `./raw_dataset` folder. Then create a `./processed_dataset` folder; the cleaned files written there will be used to train the tokenizer.
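With the cleaned files in `./processed_dataset`, the Byte-Level BPE tokenizer can be trained with the Hugging Face `tokenizers` library. The snippet below is a sketch: the vocabulary size, minimum frequency, special tokens, and output directory are illustrative defaults, not the project's fixed configuration.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Gather every cleaned corpus file produced by the preprocessing step.
corpus_files = [
    os.path.join("./processed_dataset", name)
    for name in os.listdir("./processed_dataset")
    if name.endswith(".txt")
]

tokenizer = ByteLevelBPETokenizer()

# Illustrative hyperparameters; adjust vocab_size and special tokens as needed.
tokenizer.train(
    files=corpus_files,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt to the output directory.
os.makedirs("./uzbek_tokenizer", exist_ok=True)
tokenizer.save_model("./uzbek_tokenizer")

# Quick sanity check on a short Uzbek sentence.
print(tokenizer.encode("O'zbek tili juda boy tildir.").tokens)
```

Because a byte-level BPE falls back to raw bytes, it never emits unknown tokens, which suits the apostrophe-heavy Latin Uzbek orthography (e.g. oʻ, gʻ).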
Thanks to Tahrirchi for the 40K Uzbek books and 2.8GB of Uzbek crawl-news, and to the Leipzig Corpora Collection for their invaluable datasets.
Keep pushing the boundaries of Uzbek NLP! Every contribution, no matter how small, advances language technology and preserves Uzbek cultural heritage. Also, don't forget to leave a star.