TextPrepX (Multilingual Text Preprocessing)

TextPrepX is a Streamlit-based web application for preprocessing text data in both English and Persian. It supports common preprocessing steps like lowercasing, removing punctuation and emojis, handling contractions, stemming, spell correction, and more.

📄 Description:

This project is an interactive text preprocessing tool built with Streamlit, designed to clean and prepare both English and Persian texts for natural language processing (NLP) tasks.

It supports a wide range of preprocessing options, including:

Lowercasing

Removing punctuation, numbers, and emojis

Expanding contractions

Spell correction using TextBlob (for English) and Parsivar (for Persian)

Stopword removal (customizable for Persian)

Lemmatization and stemming

Tokenization (word and sentence level)

Repetition reduction and slang replacement

Unicode normalization and formatting cleanup

The Persian module leverages the Parsivar library, while the English module utilizes NLTK, TextBlob, and contractions for more nuanced cleaning. Users can either upload .txt files or enter raw text directly. Results are displayed in a styled, readable format.
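As a rough illustration of how selectable options might be chained over uploaded or typed text, here is a minimal Python sketch. The names `STEPS`, `run_pipeline`, and `read_uploaded_text` and the three example steps are hypothetical and not taken from the app. The app's actual Streamlit code is not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical step registry: names and functions are illustrative only.
STEPS: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "strip": str.strip,
    "collapse_spaces": lambda text: " ".join(text.split()),
}


def run_pipeline(text: str, selected: List[str]) -> str:
    """Apply the selected steps to the text in the order given."""
    for name in selected:
        text = STEPS[name](text)
    return text


def read_uploaded_text(raw: bytes) -> str:
    """Decode the bytes of an uploaded .txt file, assuming UTF-8 encoding."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(run_pipeline("  Hello   WORLD ", ["lowercase", "collapse_spaces"]))
```

Keeping each option as a plain text-to-text function makes it easy to turn steps on and off without changing the rest of the code.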

This toolkit is ideal for data preprocessing in NLP pipelines, educational purposes, and rapid text cleaning for bilingual corpora.

✨ Features

✅ English Text

Lowercasing
Removing numbers and punctuation
Handling contractions (e.g., can't → cannot)
Removing emojis
Spell correction using TextBlob
Stopword removal
Lemmatization + Stemming
Reducing repeated characters and slang normalization (e.g., gonna → going to)
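
A minimal sketch of several of the English steps above, using only the Python standard library. This is not the app's code: the app relies on NLTK, TextBlob, and the contractions package, while the small `CONTRACTIONS` and `SLANG` maps, the emoji ranges, and the rule of shortening repeated characters to two are assumptions made for illustration.

```python
import re
import string

# Tiny illustrative maps (assumptions); the app uses the `contractions`
# package and its own slang list, which are not reproduced here.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
SLANG = {"gonna": "going to", "wanna": "want to"}

# Rough emoji ranges (an approximation, not a complete emoji definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_english(text: str) -> str:
    text = text.lower()
    text = EMOJI_RE.sub("", text)
    # Expand contractions and slang first, because punctuation removal
    # below would strip the apostrophes they depend on.
    words = [CONTRACTIONS.get(w, SLANG.get(w, w)) for w in text.split()]
    text = " ".join(words)
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Shorten runs of three or more identical characters to two.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Collapse any whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_english("I can't wait, gonna be soooo good in 2024!!!"))
```

Stopword removal, lemmatization, stemming, and spell correction are left out here because they need the external libraries named above.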

✅ Persian Text

Normalization using Parsivar
Custom stopword removal
Tokenization (words & sentences)
Stemming
Spell correction using Parsivar
Removing punctuation, numbers, and extra whitespace
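
A minimal standard-library sketch of the simpler Persian steps above (character normalization, digit and punctuation removal, whitespace cleanup, custom stopword removal). It is a stand-in, not the app's code: the app uses Parsivar for normalization, tokenization, stemming, and spell correction. The function names and the assumption that persianstopwords.txt holds one word per line are hypothetical.

```python
import re
from pathlib import Path

# Arabic yeh and kaf mapped to their Persian equivalents.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# ASCII, Persian, and Arabic-Indic digits.
DIGITS_RE = re.compile(r"[0-9\u06F0-\u06F9\u0660-\u0669]")

# Anything that is not a word character, whitespace, or the zero-width
# non-joiner (kept because it joins parts of a single Persian word).
PUNCT_RE = re.compile(r"[^\w\s\u200c]")


def load_stopwords(path: Path) -> set:
    """Read stopwords, assuming one word per line in a UTF-8 file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def clean_persian(text: str, stopwords: set) -> list:
    """Return the cleaned word tokens of text, with stopwords removed."""
    text = text.translate(CHAR_MAP)
    text = DIGITS_RE.sub("", text)
    text = PUNCT_RE.sub(" ", text)
    # split() also collapses repeated whitespace.
    return [token for token in text.split() if token not in stopwords]
```

A call might look like `clean_persian(text, load_stopwords(Path("persianstopwords.txt")))`.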

🚀 How to Run

Clone this repository or download the code.
Install dependencies:

pip install -r requirements.txt

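Create a folder named spell inside the Parsivar resource directory (the path below is for a Windows virtual environment):

venv\Lib\site-packages\parsivar\resource

Download the following two files and place them in the spell folder, replacing any existing copies:

- onegram.pckl
- mybigram_lm.pckl

🔽Download two files from here

Start the app:

streamlit run TEP.py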
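
The spell folder setup can also be scripted. The following is a minimal sketch, assuming the two files have already been downloaded to a local folder; `install_spell_files` and its parameters are hypothetical names, not part of Parsivar or this project.

```python
import shutil
from pathlib import Path

# The two language-model files named in the steps above.
SPELL_FILES = ("onegram.pckl", "mybigram_lm.pckl")


def install_spell_files(download_dir: Path, resource_dir: Path) -> Path:
    """Copy the downloaded files into <resource_dir>/spell, creating it if needed."""
    spell_dir = resource_dir / "spell"
    spell_dir.mkdir(parents=True, exist_ok=True)
    for name in SPELL_FILES:
        source = download_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"Missing downloaded file: {source}")
        # copy2 overwrites an existing file with the same name.
        shutil.copy2(source, spell_dir / name)
    return spell_dir
```

For the Windows layout shown above, `resource_dir` would be `Path("venv/Lib/site-packages/parsivar/resource")`. Other platforms place site-packages elsewhere, so adjust the path to match your environment.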

TextPrepX/
├── TEP.py                       # Main Streamlit app
├── persianstopwords.txt         # Custom Persian stopword list
├── models/
│   └── cnn-lstm-probwordnoise/  # (Optional) NeuSpell model folder for advanced spellcheck
└── requirements.txt

📌 Notes

Persian spell correction is handled by Parsivar.

For advanced English spell correction (NeuSpell), set up the model separately.

You can enhance further by adding Named Entity Recognition (NER) or keyword extraction.

📷 Screenshots

🧑‍💻 Author

Created by Farhad Ghaherdoost. Feel free to fork and customize. 😄
