TextPrepX is an interactive, Streamlit-based web application that cleans and prepares English and Persian text for natural language processing (NLP) tasks.
It supports a wide range of preprocessing options, including:
- Lowercasing
- Removing punctuation, numbers, and emojis
- Expanding contractions
- Spell correction using TextBlob (for English) and Parsivar (for Persian)
- Stopword removal (customizable for Persian)
- Lemmatization and stemming
- Tokenization (word and sentence level)
- Repetition reduction and slang replacement
- Unicode normalization and formatting cleanup
The Persian module uses the Parsivar library, while the English module uses NLTK, TextBlob, and the contractions package for more nuanced cleaning. Users can either upload .txt files or enter raw text directly. Results are displayed in a styled, readable format.
This toolkit is ideal for data preprocessing in NLP pipelines, educational purposes, and rapid text cleaning for bilingual corpora.
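As a rough illustration of how several of the English cleaning steps chain together, here is a minimal sketch using only the Python standard library. It is not the code in TEP.py: the function name `clean_english`, the tiny `CONTRACTIONS` and `SLANG` tables, and the emoji ranges are assumptions made for this example, standing in for the contractions package and the app's own slang handling.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables (illustration only).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
SLANG = {"gonna": "going to", "wanna": "want to"}
LOOKUP = {**CONTRACTIONS, **SLANG}

# Approximate emoji coverage: pictograph blocks plus misc symbols and dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[a-z']+")
REPEAT_RE = re.compile(r"([a-z])\1{2,}")  # three or more of the same letter


def clean_english(text: str) -> str:
    """Lowercase, expand contractions/slang, and strip emojis, digits, punctuation."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = EMOJI_RE.sub(" ", text)
    # Expand contractions and slang before punctuation (the apostrophe) is removed.
    text = WORD_RE.sub(lambda m: LOOKUP.get(m.group(0), m.group(0)), text)
    text = REPEAT_RE.sub(r"\1\1", text)  # "sooooo" becomes "soo"
    text = re.sub(r"\d", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse extra whitespace


if __name__ == "__main__":
    print(clean_english("I'm gonna win 100 times!!! Sooooo good"))
```

The order matters in this sketch: contractions are expanded before punctuation is stripped, because removing the apostrophe first would turn "can't" into "cant" and the lookup would miss it.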
English:
- Lowercasing
- Removing numbers and punctuation
- Handling contractions (e.g., can't → cannot)
- Removing emojis
- Spell correction using TextBlob
- Stopword removal
- Lemmatization + Stemming
- Reducing repeated characters and slang normalization (e.g., gonna → going to)

Persian:
- Normalization using Parsivar
- Custom stopword removal
- Tokenization (words & sentences)
- Stemming
- Spell correction using Parsivar
- Removing punctuation, numbers, and extra whitespaces
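
The Persian cleanup steps can be approximated in the same spirit. The sketch below is a standard-library stand-in, not Parsivar's normalizer and not the app's actual code: `clean_persian`, `load_stopwords`, and the character tables are hypothetical names chosen for this example. It unifies Arabic and Persian letter variants, strips punctuation and digits, and filters a custom stopword list such as persianstopwords.txt (assumed here to hold one word per line).

```python
import re
import string
from pathlib import Path

# Map Arabic yeh and kaf to their Persian forms (a small subset of what a
# full normalizer would handle).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# ASCII punctuation plus Persian comma, semicolon, question mark, guillemets.
PERSIAN_PUNCT = "\u060C\u061B\u061F\u00AB\u00BB"
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + PERSIAN_PUNCT) + "]")

# ASCII, Persian, and Arabic-Indic digits.
DIGIT_RE = re.compile("[0-9\u06F0-\u06F9\u0660-\u0669]")


def load_stopwords(path):
    """Read one stopword per line from a UTF-8 text file, skipping blanks."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().translate(CHAR_MAP) for line in lines if line.strip()}


def clean_persian(text, stopwords=frozenset()):
    """Normalize characters, drop punctuation and digits, remove stopwords."""
    text = text.translate(CHAR_MAP)
    text = PUNCT_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    # str.split() does not split on the zero-width non-joiner, so compound
    # Persian words stay intact.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)
```

Stopwords are passed through the same character map when loaded, so a list written with Arabic letter variants still matches normalized text.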
- Clone this repository or download the code.
- Install dependencies:
pip install -r requirements.txt
For Persian spell correction, first create a spell folder in this path:
venv\Lib\site-packages\parsivar\resource
Then copy these two files into the spell folder (overwriting any existing copies):
- onegram.pckl
- mybigram_lm.pckl
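
Before launching the app, it can help to confirm that both files landed in the right place. The following is a small optional check, not part of the project: `missing_spell_files` is a hypothetical helper, and the lookup assumes Parsivar is installed as a regular package with a resource folder next to its modules, as the path above suggests.

```python
import importlib.util
from pathlib import Path

REQUIRED_FILES = ("onegram.pckl", "mybigram_lm.pckl")


def missing_spell_files(resource_dir):
    """Return the required spell files that are absent from <resource_dir>/spell."""
    spell_dir = Path(resource_dir) / "spell"
    return [name for name in REQUIRED_FILES if not (spell_dir / name).is_file()]


if __name__ == "__main__":
    # Locate the installed package without importing it.
    spec = importlib.util.find_spec("parsivar")
    if spec is None or not spec.submodule_search_locations:
        print("parsivar is not installed in this environment")
    else:
        resource = Path(list(spec.submodule_search_locations)[0]) / "resource"
        missing = missing_spell_files(resource)
        print("missing: " + ", ".join(missing) if missing else "spell resources found")
```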
- Run the app:
streamlit run TEP.py
TextPrepX/
├── TEP.py                       # Main Streamlit app
├── persianstopwords.txt         # Custom Persian stopword list
├── models/
│   └── cnn-lstm-probwordnoise/  # (Optional) NeuSpell model folder for advanced spellcheck
└── requirements.txt
Persian spell correction is handled by Parsivar.
For advanced English spell correction with NeuSpell, set up the model separately in models/cnn-lstm-probwordnoise/.
You can extend the tool further by adding Named Entity Recognition (NER) or keyword extraction.
Created by [Farhad Ghaherdoost] – Feel free to fork and customize.😄