Efficient Uzbek Text Processing & Byte-Level BPE Tokenizer – A streamlined pipeline for cleaning 15GB of Uzbek text, removing Cyrillic-script content, and training a custom Byte-Level BPE tokenizer for downstream NLP applications. Keywords: Uzbek NLP, text processing, tokenizer training, Cyrillic removal, large-scale dataset
This project demonstrates:
- Data Acquisition: Loading datasets from the Hugging Face Hub and downloading additional corpora (e.g., Leipzig Wortschatz).
- Data Preprocessing: Cleaning raw text data, removing excessive newlines, and filtering out Cyrillic characters (see the cleaning sketch after the feature list below).
- Tokenizer Training: Building a Byte-Level BPE tokenizer customized for Uzbek language processing.
- Scalability: Efficient handling of large-scale (15GB) text data to support real-world NLP applications.
- Large-scale Data Processing: Handles 13GB of Uzbek text data.
- Custom Preprocessing: Removes Cyrillic scripts and normalizes text for cleaner tokenization.
- State-of-the-art Tokenizer: Trains a Byte-Level BPE tokenizer using the Hugging Face `tokenizers` library.
- Keywords: "Uzbek NLP", "Byte-Level BPE", "text processing", "tokenizer training", and "natural language processing".
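The acquisition and cleaning steps can be combined into a single pass over the raw corpora. The following is a minimal sketch rather than the project's exact script: the dataset id `tahrirchi/uz-crawl`, the `text` column, and the output filename are placeholders to be replaced with the corpora this project actually uses.

```python
import os
import re

from datasets import load_dataset

# Any character from the Cyrillic Unicode block; lines containing one are dropped
# so only Latin-script Uzbek text reaches the tokenizer.
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def clean_text(text: str) -> str:
    """Collapse excessive newlines and drop lines containing Cyrillic characters."""
    lines = (line.strip() for line in text.splitlines())
    kept = [line for line in lines if line and not CYRILLIC_RE.search(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    os.makedirs("./processed_dataset", exist_ok=True)

    # Placeholder dataset id and column name -- substitute the Hugging Face
    # datasets and Leipzig Wortschatz files actually listed in the project.
    dataset = load_dataset("tahrirchi/uz-crawl", split="train", streaming=True)

    with open("./processed_dataset/uz_clean.txt", "w", encoding="utf-8") as out:
        for record in dataset:
            cleaned = clean_text(record["text"])
            if cleaned:
                out.write(cleaned + "\n")
```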
Requirements:
- Python 3.7+
- datasets
- tokenizers
- Standard library modules: `os`, `re`
Clone the repository and install the dependencies:

```bash
git clone https://github.com/bekki3/uzbek-tokenizer.git
cd uzbek-tokenizer
pip install -r requirements.txt
```

Download the Uzbek corpora from Leipzig Wortschatz listed in `FILES_UZ_CORPORA` and place them inside the `./raw_dataset` folder. Then create a `./processed_dataset` folder; the cleaned files written there will be used to train the tokenizer.
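With the cleaned files in `./processed_dataset`, the Byte-Level BPE tokenizer can be trained with the Hugging Face `tokenizers` library. The snippet below is a sketch: the vocabulary size, minimum frequency, special tokens, and output directory are illustrative defaults, not the project's fixed configuration.

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Gather every cleaned corpus file produced by the preprocessing step.
corpus_files = [
    os.path.join("./processed_dataset", name)
    for name in os.listdir("./processed_dataset")
    if name.endswith(".txt")
]

tokenizer = ByteLevelBPETokenizer()

# Illustrative hyperparameters; adjust vocab_size and special tokens as needed.
tokenizer.train(
    files=corpus_files,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt to the output directory.
os.makedirs("./uzbek_tokenizer", exist_ok=True)
tokenizer.save_model("./uzbek_tokenizer")

# Quick sanity check on a short Uzbek sentence.
print(tokenizer.encode("O'zbek tili juda boy tildir.").tokens)
```

Because a byte-level BPE falls back to raw bytes, it never emits unknown tokens, which suits the apostrophe-heavy Latin Uzbek orthography (e.g. oʻ, gʻ).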
Thanks to Tahrirchi for the 40K Uzbek books and 2.8GB of Uzbek crawl-news, and to the Leipzig Corpora Collection for their invaluable datasets.
Keep pushing the boundaries of Uzbek NLP! Every contribution, no matter how small, advances language technology and preserves Uzbek cultural heritage. Also, don't forget to leave a star.