Welcome to the NLP Preprocessing Cookbook! This repository is a collection of essential techniques for cleaning and preparing text data for Machine Learning models. Think of it as a practical guide with code examples for the most common preprocessing steps in any NLP project.
This entire project is demonstrated within a Jupyter Notebook, making it easy to follow along and experiment with each technique.
This notebook is a step-by-step guide through the foundational techniques of NLP data preparation:
Tokenization (Sentence and Word):
- The first and most crucial step. I've shown how to break down a large block of text into its basic units (sentences and individual words) using the NLTK library.
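For example, here is a minimal sketch of this step (the sample paragraph is illustrative and not taken from the notebook; it assumes the `punkt` tokenizer data has been downloaded as described in the setup section):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time download; newer NLTK releases may also ask for 'punkt_tab'
nltk.download('punkt', quiet=True)

# Illustrative sample text (not from the notebook)
paragraph = "Natural Language Processing is fascinating. It powers chatbots and search engines."

sentences = sent_tokenize(paragraph)  # split into sentences
words = word_tokenize(paragraph)      # split into word-level tokens

print(sentences)
# ['Natural Language Processing is fascinating.', 'It powers chatbots and search engines.']
print(words)
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.', 'It', 'powers', ...]
```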
Stemming and Lemmatization:
- To reduce words to their root forms, I've explored two popular techniques:
  - Stemming: A faster, rule-based approach that chops off word endings.
  - Lemmatization: A more advanced, dictionary-based approach that considers the context to find the true root word (lemma).
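A small sketch contrasting the two approaches (the example words are illustrative; lemmatization assumes the `wordnet` corpus has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Illustrative words (not from the notebook)
words = ["history", "running", "studies", "better"]

# Stemming: rule-based suffix chopping; fast, but can produce non-words
print([stemmer.stem(w) for w in words])
# ['histori', 'run', 'studi', 'better']

# Lemmatization: WordNet-based lookup; slower, but returns real words
print([lemmatizer.lemmatize(w, pos='v') for w in words])
# e.g. 'running' -> 'run', 'studies' -> 'study'
```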
Stopwords Removal:
- I've demonstrated how to remove common words (like "is", "the", and "a") that add little semantic value, which helps the model focus on the more important keywords.
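For instance, a sketch using NLTK's built-in English stopword list (assumes the `stopwords` corpus has been downloaded; the sentence is illustrative):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

# Illustrative sentence (not from the notebook)
sentence = "This is a simple example of removing the common words"
tokens = word_tokenize(sentence.lower())

# Keep only tokens that are not in the stopword list
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['simple', 'example', 'removing', 'common', 'words']
```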
Text Cleaning with RegEx:
- Real-world text is messy! I've used Regular Expressions (RegEx) to clean the text by removing punctuation, numbers, and other unwanted characters.
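A rough sketch of this kind of clean-up (the sample text and exact patterns are illustrative; the notebook's patterns may differ):

```python
import re

# Illustrative messy text (not from the notebook)
raw = "Order #42 shipped!!  Visit https://example.com for 100% details :)"

text = raw.lower()
text = re.sub(r'http\S+', ' ', text)      # drop URLs
text = re.sub(r'[^a-z\s]', ' ', text)     # remove punctuation, numbers, and symbols
text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace

print(text)
# 'order shipped visit for details'
```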
Text Vectorization (Bag-of-Words):
- Finally, to make the text understandable for a machine learning model, I've converted the cleaned words into numerical vectors using Scikit-learn's CountVectorizer. This technique, also known as Bag-of-Words, is a fundamental method for feature extraction in NLP.
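A minimal Bag-of-Words sketch with CountVectorizer (the two-document corpus is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (not from the notebook)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```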
This project is built with the following tools:
- NLTK (Natural Language Toolkit): For tokenization, stemming, lemmatization, and stopwords.
- Scikit-learn: For text vectorization with CountVectorizer.
- Jupyter Notebook: For interactive development and demonstration.
- Clone the repository:

```bash
git clone https://github.com/jsonusuman351/nlp-preprocessing-cookbook.git
cd nlp-preprocessing-cookbook
```
- Create and activate a virtual environment:

```bash
# It is recommended to use Python 3.10 or higher
python -m venv venv
.\venv\Scripts\activate
```
- Install the required packages:

```bash
pip install notebook nltk scikit-learn
```
- One-Time NLTK Downloads: The first time you run this, you'll need to download some necessary models from NLTK. Open a Python interpreter from your activated environment and run:

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
exit()
```
The entire workflow is contained within a single Jupyter Notebook.
- Launch Jupyter: Make sure your virtual environment is active, then run:

```bash
jupyter notebook
```
- Open the Notebook: In the Jupyter interface, click on NLTK-CountVectorizer.ipynb.
- Run the Cells: Run each cell sequentially to see how each preprocessing technique transforms the text step by step.
I've organized this entire project into a single notebook to make it easy to follow my experiments with different NLP preprocessing techniques.
Click to view the code layout
```
nlp-preprocessing-cookbook/
│
└── NLTK-CountVectorizer.ipynb    # The complete end-to-end workflow is in this single notebook
```