farhad-here/TextPrepX

TextPrepX (Multilingual Text Preprocessing)

TextPrepX is a Streamlit-based web application for preprocessing text data in both English and Persian. It supports common preprocessing steps like lowercasing, removing punctuation and emojis, handling contractions, stemming, spell correction, and more.

📄 Description:

This project is an interactive text preprocessing tool built with Streamlit, designed to clean and prepare both English and Persian texts for natural language processing (NLP) tasks.

It supports a wide range of preprocessing options, including:

Lowercasing

Removing punctuation, numbers, and emojis

Expanding contractions

Spell correction using TextBlob (for English) and Parsivar (for Persian)

Stopword removal (customizable for Persian)

Lemmatization and stemming

Tokenization (word and sentence level)

Repetition reduction and slang replacement

Unicode normalization and formatting cleanup

The Persian module leverages the Parsivar library, while the English module utilizes NLTK, TextBlob, and contractions for more nuanced cleaning. Users can either upload .txt files or enter raw text directly. Results are displayed in a styled, readable format.

This toolkit is ideal for data preprocessing in NLP pipelines, educational purposes, and rapid text cleaning for bilingual corpora.
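As a rough illustration of how optional steps like those listed above can be chained, here is a minimal sketch that uses only the Python standard library. The `STEPS` table, the `preprocess` function, and the step names are hypothetical and are not taken from TEP.py; the real app relies on NLTK, TextBlob, contractions, and Parsivar for the heavier steps.

```python
import re
import string
import unicodedata


def normalize_unicode(text: str) -> str:
    # NFKC folds compatibility characters (e.g. the "ﬁ" ligature) into plain forms.
    return unicodedata.normalize("NFKC", text)


def remove_punctuation(text: str) -> str:
    # Only ASCII punctuation is covered here.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text: str) -> str:
    # \d matches any Unicode decimal digit, including Persian digits.
    return re.sub(r"\d+", "", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical step registry; dict order defines the order of application.
STEPS = {
    "unicode": normalize_unicode,
    "lowercase": str.lower,
    "punctuation": remove_punctuation,
    "numbers": remove_numbers,
    "whitespace": collapse_whitespace,
}


def preprocess(text: str, enabled: set) -> str:
    """Apply only the steps whose names appear in `enabled`."""
    for name, step in STEPS.items():
        if name in enabled:
            text = step(text)
    return text


if __name__ == "__main__":
    chosen = {"lowercase", "punctuation", "numbers", "whitespace"}
    print(preprocess("Hello,   World 123!", chosen))
```

Keeping each step as a small function makes it straightforward to map a set of user-selected options onto the functions to run.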

✨ Features

✅ English Text

  • Lowercasing
  • Removing numbers and punctuation
  • Handling contractions (e.g., can't → cannot)
  • Removing emojis
  • Spell correction using TextBlob
  • Stopword removal
  • Lemmatization + Stemming
  • Reducing repeated characters and slang normalization (e.g., gonna → going to)
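The contraction, emoji, slang, and repeated-character steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the app's code: the `CONTRACTIONS` and `SLANG` dictionaries are tiny hypothetical stand-ins for the contractions package and the app's slang list, the emoji ranges are approximate, and only the straight apostrophe (') is handled.

```python
import re

# Hypothetical mini-dictionaries standing in for real resources.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
SLANG = {"gonna": "going to", "wanna": "want to"}

# Approximate emoji coverage: main pictograph blocks, misc symbols,
# dingbats, and the variation selector that often follows an emoji.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def replace_words(text: str, mapping: dict) -> str:
    """Replace whole-word occurrences of each key with its value."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def clean_english(text: str) -> str:
    text = text.lower()
    text = EMOJI_PATTERN.sub("", text)
    text = replace_words(text, CONTRACTIONS)
    text = replace_words(text, SLANG)
    # Reduce runs of three or more identical characters to two ("sooooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_english("I can't wait, I'm gonna be sooooo happy 😄"))
```

Reducing runs to two characters rather than one keeps legitimate double letters such as "happy" intact; a spell-correction pass would then be responsible for turning "soo" into "so".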

✅ Persian Text

  • Normalization using Parsivar
  • Custom stopword removal
  • Tokenization (words & sentences)
  • Stemming
  • Spell correction using Parsivar
  • Removing punctuation, numbers, and extra whitespace
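The cleanup and custom stopword steps for Persian can be illustrated without Parsivar. The sketch below is a simplified assumption of how such a step might look: it assumes persianstopwords.txt holds one stopword per line, and the names `load_stopwords` and `clean_persian` are hypothetical. Normalization, stemming, and spell correction are left to Parsivar and are not shown.

```python
import re
import string

# ASCII punctuation plus a few common Persian marks (assumed, not exhaustive).
PERSIAN_PUNCT = "،؛؟«»٪"
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation + PERSIAN_PUNCT})


def load_stopwords(path: str) -> set:
    """Read a UTF-8 stopword file, assuming one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def clean_persian(text: str, stopwords: set) -> list:
    """Strip punctuation and digits, then drop stopwords; returns tokens."""
    text = text.translate(PUNCT_TABLE)
    # \d also matches Persian (۰-۹) and Arabic-Indic (٠-٩) digits.
    text = re.sub(r"\d+", " ", text)
    # split() collapses extra whitespace but does not split on the
    # zero-width non-joiner, so half-space compounds stay in one token.
    return [tok for tok in text.split() if tok not in stopwords]


if __name__ == "__main__":
    sample_stopwords = {"من", "در", "به"}
    print(clean_persian("من در سال ۱۴۰۲ به تهران رفتم.", sample_stopwords))
```

In practice the stopword set would come from `load_stopwords("persianstopwords.txt")`, which makes the list easy to customize without touching code.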

🚀 How to Run

  1. Clone this repository or download the code.
  2. Install dependencies:
     pip install -r requirements.txt
  3. Create a folder named spell inside the Parsivar resource directory:
     venv\Lib\site-packages\parsivar\resource
  4. Place these two files in the spell folder, replacing any existing copies:
     - onegram.pckl
     - mybigram_lm.pckl
  5. Start the app:
     streamlit run TEP.py
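Before launching the app, it can help to confirm that the spell folder exists and contains both files. The helper below is a hypothetical convenience script, not part of the repository; `prepare_spell_folder` is an illustrative name, and the resource path is passed in because it depends on where your virtual environment lives.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_FILES = ("onegram.pckl", "mybigram_lm.pckl")


def prepare_spell_folder(resource_dir) -> list:
    """Create <resource_dir>/spell if needed and return the missing file names."""
    spell_dir = Path(resource_dir) / "spell"
    spell_dir.mkdir(parents=True, exist_ok=True)
    return [name for name in REQUIRED_FILES if not (spell_dir / name).is_file()]


if __name__ == "__main__":
    # Path from the setup steps; adjust it to match your environment.
    resource = Path("venv") / "Lib" / "site-packages" / "parsivar" / "resource"
    missing = prepare_spell_folder(resource)
    if missing:
        print("Copy these files into the spell folder:", ", ".join(missing))
    else:
        print("Spell resources look complete.")
```

The script only checks for the files; it does not download or generate them.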

TextPrepX/
├── TEP.py                        # Main Streamlit app
├── persianstopwords.txt          # Custom Persian stopword list
├── models/
│   └── cnn-lstm-probwordnoise/   # (Optional) NeuSpell model folder for advanced spellcheck
└── requirements.txt

📌 Notes

Persian spell correction is handled by Parsivar.

For advanced English spell correction (NeuSpell), set up the model separately.

You can extend the tool further by adding Named Entity Recognition (NER) or keyword extraction.

📷 Screenshots

[Five screenshots of the app: tt1, tt2, tt3, tt4, tt5]

🧑‍💻 Author

Created by Farhad Ghaherdoost. Feel free to fork and customize. 😄
