Text Pre-processing for Natural Language Processing (NLP)
Natural Language Processing (NLP) involves a series of pre-processing steps to transform raw text data into a format suitable for analysis or machine learning models. These steps help improve the quality of the data and make it easier for algorithms to understand and process the text. Below are the key pre-processing steps used in NLP, along with explanations and example code.
33 Pre-processing Steps Commonly Used Before Feeding Data into an NLP Model
1. Lowercasing
2. Tokenization
3. Removing Punctuation
4. Removing Stopwords
5. Stemming
6. Lemmatization
7. Removing Numbers
8. Removing Extra Spaces
9. Handling Contractions
10. Removing Special Characters
11. Part-of-Speech (POS) Tagging
12. Named Entity Recognition (NER)
13. Vectorization
14. Handling Missing Data
15. Normalization
16. Spelling Correction
17. Handling Emojis and Emoticons
18. Removing HTML Tags
19. Handling URLs
20. Handling Mentions and Hashtags
21. Sentence Segmentation
22. Handling Abbreviations
23. Language Detection
24. Text Encoding
25. Handling Whitespace Tokens
26. Handling Dates and Times
27. Text Augmentation
28. Handling Negations
29. Dependency Parsing
30. Handling Rare Words
31. Text Chunking
32. Handling Synonyms
33. Text Normalization for Social Media
A detailed explanation of each pre-processing step commonly used before feeding data into an NLP model, and during its use:
1. Lowercasing
Purpose: Converts all text to lowercase to ensure uniformity.
Why: Reduces the vocabulary size and avoids treating the same
word in different cases as different tokens (e.g., "Apple" vs. "apple").
text = "Hello World! This is NLP."
text = text.lower()
print(text)
2. Tokenization
Purpose: Splits text into individual words, phrases, or sentences
(tokens).
Why: Breaks down text into manageable units for further processing.
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
text = "Hello World! This is NLP."
tokens = word_tokenize(text)
print(tokens)
3. Removing Punctuation
Purpose: Removes punctuation marks like commas, periods,
exclamation marks, etc.
Why: Punctuation often doesn’t contribute to the meaning in many
NLP tasks and can add noise.
import string
text = "Hello, World! This is NLP."
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)
4. Removing Stopwords
Purpose: Removes common words like "the," "is," "and," which
don’t carry significant meaning.
Why: Reduces noise and focuses on meaningful words.
import nltk
# Download the 'stopwords' dataset
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
5. Stemming
Purpose: Reduces words to their root form by chopping off suffixes
(e.g., "running" → "run").
Why: Simplifies words to their base form, reducing vocabulary size.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
6. Lemmatization
Purpose: Converts words to their base or dictionary form (e.g.,
"better" → "good").
Why: More accurate than stemming as it uses vocabulary and
morphological analysis.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)
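The example above lemmatizes verbs with pos='v'; the "better" → "good" case mentioned in the description needs the adjective part-of-speech tag. A short sketch:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos='a'))  # adjective POS tag -> "good"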
7. Removing Numbers
Purpose: Removes numeric values from the text.
Why: Numbers may not be relevant in certain NLP tasks like
sentiment analysis.
import re
text = "There are 3 apples and 5 oranges."
text = re.sub(r'\d+', '', text)
print(text)
8. Removing Extra Spaces
Purpose: Eliminates multiple spaces, tabs, or newlines.
Why: Ensures clean and consistent text formatting.
text = " This is a sentence. "
text = ' '.join(text.split())
print(text)
9. Handling Contractions
Purpose: Expands contractions (e.g., "can't" → "cannot").
Why: Standardizes text for better processing.
!pip install contractions
from contractions import fix
text = "I can't do this."
text = fix(text)
print(text)
10. Removing Special Characters
Purpose: Removes non-alphanumeric characters like @, #, $, etc.
Why: Reduces noise and irrelevant symbols.
import re
text = "This is a #sample text with @special characters!"
text = re.sub(r'[^\w\s]', '', text)
print(text)
11. Part-of-Speech (POS) Tagging
Purpose: Assigns grammatical tags to words (e.g., noun, verb,
adjective).
Why: Helps in understanding the syntactic structure of sentences.
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
# Download the required resource
nltk.download('averaged_perceptron_tagger_eng')
tokens = word_tokenize("This is a sample sentence.")
pos_tags = pos_tag(tokens)
print(pos_tags)
12. Named Entity Recognition (NER)
Purpose: Identifies and classifies entities like names, dates, locations,
etc.
Why: Useful for tasks like information extraction.
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
# Download the required resources
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
# Download the 'maxent_ne_chunker_tab' resource (required by ne_chunk in newer NLTK versions)
nltk.download('maxent_ne_chunker_tab')
tokens = word_tokenize("John works at Google in New York.")
pos_tags = pos_tag(tokens)
ner_tags = ne_chunk(pos_tags)
print(ner_tags)
13. Vectorization
Purpose: Converts text into numerical vectors (e.g., Bag of Words,
TF-IDF, Word Embeddings).
Why: Machine learning models require numerical input.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is a sample sentence.", "Another example sentence."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())  # [[0 0 1 1 1 1], [1 1 0 0 1 0]]
print(vectorizer.get_feature_names_out())
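The example above builds a Bag-of-Words representation; the step also mentions TF-IDF. A minimal TF-IDF sketch on the same corpus, using scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is a sample sentence.", "Another example sentence."]
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X_tfidf.toarray())  # each row is a TF-IDF-weighted vector for one document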
14. Handling Missing Data
Purpose: Fills or removes missing or incomplete text data.
Why: Ensures the dataset is complete and consistent.
import pandas as pd
data = {"text": ["Hello", None, "World"]}
df = pd.DataFrame(data)
df["text"].fillna("My Dear", inplace=True) # Fill missing values
print(df)
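The step also covers removing incomplete rows; a minimal sketch of the dropping alternative:
import pandas as pd
df = pd.DataFrame({"text": ["Hello", None, "World"]})
df = df.dropna(subset=["text"])  # drop rows whose "text" value is missing
print(df)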
15. Normalization
Purpose: Standardizes text (e.g., converting all dates to a single
format).
Why: Ensures consistency in the dataset.
import unicodedata
text = "Café"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
print(text)
16. Spelling Correction
Purpose: Corrects spelling errors in the text.
Why: Improves the quality of the text for analysis.
from textblob import TextBlob
text = "I made a many mistakes in Artificial intellengence"
blob = TextBlob(text)
corrected_text = blob.correct()
print(corrected_text)
17. Handling Emojis and Emoticons
Purpose: Converts emojis and emoticons into text or removes them.
Why: Emojis can carry sentiment or meaning that needs to be
captured.
!pip install emoji
import emoji
text = "I love Python! �"
# Convert emojis to text
text = emoji.demojize(text)
print(text) # Output: "I love Python!
:smiling_face_with_smiling_eyes:"
# Remove emojis
text = emoji.replace_emoji(text, replace="")
print(text)
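The emoji library handles Unicode emojis but not text emoticons such as ":)"; a minimal sketch using a small hand-made mapping (the dictionary below is an illustrative assumption, not a standard resource):
emoticon_map = {":)": "smile", ":(": "sad", ":D": "laugh"}  # illustrative emoticon mapping
text = "Great job :)"
for emoticon, meaning in emoticon_map.items():
    text = text.replace(emoticon, meaning)
print(text)  # "Great job smile"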
18. Removing HTML Tags
Purpose: Removes HTML tags from web scraped text.
Why: HTML tags are irrelevant for most NLP tasks.
from bs4 import BeautifulSoup
text = "<p>This is a <b>sample</b> text.</p>"
soup = BeautifulSoup(text, "html.parser")
clean_text = soup.get_text()
print(clean_text)
19. Handling URLs
Purpose: Removes or replaces URLs in the text.
Why: URLs are often irrelevant for text analysis.
import re
text = "Visit my website at https://example.com."
text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
print(text)
20. Handling Mentions and Hashtags
Purpose: Processes or removes social media mentions (@user) and
hashtags (#topic).
Why: Useful for social media text analysis.
text = "Hey @user, check out #NLP!"
text = re.sub(r'@\w+|#\w+', '', text)
print(text)
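Removal is shown above; when mentions and hashtags carry useful signal, they can instead be extracted before removal. A minimal sketch:
import re
text = "Hey @user, check out #NLP!"
mentions = re.findall(r'@\w+', text)
hashtags = re.findall(r'#\w+', text)
print(mentions, hashtags)  # ['@user'] ['#NLP']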
21. Sentence Segmentation
Purpose: Splits text into individual sentences.
Why: Important for tasks like machine translation or summarization.
from nltk.tokenize import sent_tokenize
text = "This is the first sentence. This is the second
sentence."
sentences = sent_tokenize(text)
print(sentences)
22. Handling Abbreviations
Purpose: Expands abbreviations (e.g., "ASAP" → "as soon as
possible").
Why: Ensures clarity and consistency.
!pip install contractions
import contractions
# contractions.fix expands contractions such as "I'll"; abbreviations like "ASAP"
# can be registered as custom entries first.
contractions.add('ASAP', 'as soon as possible')
text = "I'll be there ASAP."
expanded_text = contractions.fix(text)
print(expanded_text)
23. Language Detection
Purpose: Identifies the language of the text.
Why: Ensures the correct NLP model is applied.
!pip install langdetect
from langdetect import detect
text = "Ceci est un texte en français."
language = detect(text)
print(language)
24. Text Encoding
Purpose: Converts text into a specific encoding format (e.g., UTF-8).
Why: Ensures compatibility with NLP tools and models.
text = "Café"
text = text.encode('utf-8').decode('utf-8')
print(text)
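Encoding and immediately decoding the same UTF-8 string is a no-op; the step usually matters when bytes arrive in a different encoding. A minimal sketch, assuming the source bytes are Latin-1 (an assumption for illustration):
raw_bytes = "Café".encode('latin-1')  # bytes as they might arrive from a Latin-1 source
text = raw_bytes.decode('latin-1')    # decode using the source encoding
print(text.encode('utf-8').decode('utf-8'))  # store/process the text consistently as UTF-8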
25. Handling Whitespace Tokens
Purpose: Removes or processes tokens that are just spaces or empty
strings.
Why: Ensures clean and meaningful tokens.
tokens = ["This", " ", "is", " ", "a", " ", "sample", " "]
tokens = [token for token in tokens if token.strip()]
print(tokens)
26. Handling Dates and Times
Purpose: Standardizes or extracts date and time formats.
Why: Useful for time-sensitive analysis.
import dateutil.parser as dparser
text = "The event is on 2023-10-15."
date = dparser.parse(text, fuzzy=True)
print(date)
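Standardizing the extracted date to one consistent format is then a matter of reformatting the parsed datetime:
import dateutil.parser as dparser
date = dparser.parse("The event is on 2023-10-15.", fuzzy=True)
print(date.strftime("%Y-%m-%d"))  # render every extracted date in a single format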
27. Text Augmentation
Purpose: Generates additional training data by modifying existing text
(e.g., synonym replacement).
Why: Improves model robustness and performance.
#!pip install nlpaug # Install the nlpaug library
from nlpaug.augmenter.word import SynonymAug
aug = SynonymAug(aug_src='wordnet')
text = "This is a sample text."
augmented_text = aug.augment(text)
print(augmented_text)
28. Handling Negations
Purpose: Identifies and processes negations (e.g., "not good").
Why: Important for sentiment analysis and understanding context.
from nltk import word_tokenize
text = "This is not good."
tokens = word_tokenize(text)
for i, token in enumerate(tokens):
    if token == "not" and i + 1 < len(tokens):
        tokens[i + 1] = "not_" + tokens[i + 1]
print(tokens)
29. Dependency Parsing
Purpose: Analyzes the grammatical structure of a sentence.
Why: Helps in understanding relationships between words.
import spacy
!python -m spacy download en_core_web_sm  # download the model if not already installed
nlp = spacy.load("en_core_web_sm")  # load the model
text = "This is a sample sentence."
doc = nlp(text)
for token in doc:
    print(token.text, token.dep_, token.head.text)
30. Handling Rare Words
Purpose: Replaces or removes rare words that occur infrequently.
Why: Reduces noise and improves model efficiency.
from collections import Counter
tokens = ["this", "is", "a", "rare", "word", "word"]
word_counts = Counter(tokens)
rare_words = {word for word, count in word_counts.items() if count < 2}
tokens = [token if token not in rare_words else "<UNK>" for token in tokens]
print(tokens)
31. Text Chunking
Purpose: Groups words into "chunks" based on POS tags (e.g., noun
phrases).
Why: Useful for information extraction.
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser
text = "This is a sample sentence."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)
print(tree)
32. Handling Synonyms
Purpose: Replaces words with their synonyms.
Why: Helps in text augmentation and reducing redundancy.
from nltk.corpus import wordnet
word = "happy"
synonyms = wordnet.synsets(word)
print([syn.lemmas()[0].name() for syn in synonyms])
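The snippet above only lists synonyms; a minimal sketch of actually substituting a word with a differing WordNet lemma (a naive approach, since real synonym replacement should respect part of speech and context):
from nltk.corpus import wordnet

def naive_synonym(word):
    # Return the first WordNet lemma that differs from the word itself, if any
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.lower() != word.lower():
                return name
    return word

text = "I am happy today"
print(' '.join(naive_synonym(t) if t == "happy" else t for t in text.split()))
# e.g., "I am felicitous today"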
33. Text Normalization for Social Media
Purpose: Processes informal text (e.g., "u" → "you", "gr8" →
"great").
Why: Social media text often contains informal language and slang.
import re
text = "I loooove this!"
text = re.sub(r'(.)\1+', r'\1', text)
print(text)
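Collapsing repeated characters handles "loooove"; slang such as "u" → "you" or "gr8" → "great" can be mapped with a small dictionary (the mapping below is an illustrative assumption, not a standard lexicon):
slang_map = {"u": "you", "r": "are", "gr8": "great"}  # illustrative slang dictionary
text = "u r gr8"
normalized = ' '.join(slang_map.get(token, token) for token in text.split())
print(normalized)  # "you are great"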
These pre-processing steps are crucial for cleaning, standardizing,
and transforming raw text into a format suitable for NLP models.
The specific steps used depend on the task (e.g., sentiment
analysis, machine translation) and the nature of the text (e.g.,
formal documents, social media posts).
The importance of pre-processing steps in NLP depends on the specific
task, type of text data, and the NLP model being used. However, some steps
are generally considered more critical across most NLP tasks. Here's a
breakdown:
Most Important Pre-processing Steps for NLP
1. Tokenization
o Why: Tokenization is the foundation of NLP. It breaks text into
meaningful units (words, sentences, etc.), which are necessary for
any further processing.
o When: Always required, regardless of the task.
2. Lowercasing
o Why: Ensures consistency by treating words like "Apple" and "apple" as the same. Reduces vocabulary size and computational complexity.
o When: Important for tasks like text classification, sentiment
analysis, and information retrieval.
3. Removing Stopwords
o Why: Stopwords (e.g., "the," "is," "and") add noise and don't contribute much to the meaning in many tasks.
o When: Useful for tasks like text classification, topic modeling, and
search engines.
4. Handling Missing Data
o Why: Incomplete or missing data can lead to poor model
performance.
o When: Critical for all tasks, especially when working with real-world
datasets.
5. Vectorization
o Why: Converts text into numerical representations (e.g., Bag of Words, TF-IDF, Word Embeddings) that machine learning models can process.
o When: Essential for all tasks involving machine learning or deep
learning models.
6. Removing Punctuation and Special Characters
o Why: Punctuation and special characters often don't contribute to the meaning and can add noise.
o When: Important for tasks like sentiment analysis, text
classification, and machine translation.
7. Lemmatization or Stemming
o Why: Reduces words to their base forms, simplifying the vocabulary and improving consistency.
o When: Useful for tasks like information retrieval, text
classification, and topic modeling.
8. Handling Contractions and Abbreviations
o Why: Expands contractions (e.g., "can't" → "cannot") and abbreviations (e.g., "ASAP" → "as soon as possible") for better understanding.
o When: Important for tasks involving informal text (e.g., social media
analysis).
9. Handling URLs, Mentions, and Hashtags
o Why: Social media text often contains URLs, mentions (@user), and
hashtags (#topic), which need to be processed or removed.
o When: Critical for social media text analysis.
10. Text Normalization
o Why: Standardizes text (e.g., converting dates, times, and numbers
to a consistent format).
o When: Important for tasks involving structured data or time-
sensitive analysis.
Task-Specific Importance
Sentiment Analysis: Handling negations, emojis, and emoticons is crucial.
Machine Translation: Sentence segmentation and POS tagging are
important.
Named Entity Recognition (NER): Handling dates, times, and special
characters is critical.
Social Media Analysis: Handling emojis, hashtags, and informal language is
essential.
Text Classification: Removing stopwords, lowercasing, and vectorization
are key.
Summary
While tokenization, lowercasing, stopword removal, and vectorization are
universally important, the relevance of other steps depends on the task and
dataset. Always analyze your data and task requirements to determine the most
critical preprocessing steps.
Prepared by: Syed Afroz Ali