nlp-practical-three
April 9, 2025
[1]: import pandas as pd
import numpy as np
import pickle
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import save_npz
import matplotlib.pyplot as plt
import seaborn as sns
[14]: import warnings
warnings.filterwarnings('ignore')  # suppress library warnings for cleaner notebook output
[4]: print("--- Downloading NLTK resources (if needed) ---")
try:
nltk.data.find('corpora/wordnet')
print("WordNet resource found.")
except LookupError:
print("WordNet resource not found. Downloading...")
nltk.download('wordnet', quiet=True)
try:
nltk.data.find('corpora/omw-1.4')
print("OMW-1.4 resource found.")
except LookupError:
print("OMW-1.4 resource not found. Downloading...")
nltk.download('omw-1.4', quiet=True)
try:
nltk.data.find('tokenizers/punkt')
print("Punkt tokenizer resource found.")
except LookupError:
print("Punkt tokenizer resource not found. Downloading...")
nltk.download('punkt', quiet=True)
try:
nltk.data.find('corpora/stopwords')
print("Stopwords resource found.")
except LookupError:
print("Stopwords resource not found. Downloading...")
nltk.download('stopwords', quiet=True)
print("NLTK resources checked/downloaded.")
--- Downloading NLTK resources (if needed) ---
WordNet resource not found. Downloading...
OMW-1.4 resource not found. Downloading...
Punkt tokenizer resource found.
Stopwords resource found.
NLTK resources checked/downloaded.
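The four try/except blocks above repeat one pattern; a minimal sketch of an equivalent loop, using the same nltk.data.find/nltk.download calls and the same packages:

resources = {
    'corpora/wordnet': 'wordnet',
    'corpora/omw-1.4': 'omw-1.4',
    'tokenizers/punkt': 'punkt',
    'corpora/stopwords': 'stopwords',
}
for path, package in resources.items():
    try:
        nltk.data.find(path)  # raises LookupError if the resource is absent
    except LookupError:
        nltk.download(package, quiet=True)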
[5]: print("\n--- Loading Data ---")
path_df = "/content/News_dataset.pickle"
with open(path_df, 'rb') as data:
df = pickle.load(data)
print("DataFrame loaded successfully.")
print(f"Shape of DataFrame: {df.shape}")
--- Loading Data ---
DataFrame loaded successfully.
Shape of DataFrame: (2225, 6)
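For reference, pandas provides a one-line equivalent of the explicit open/pickle.load pair used above:

df = pd.read_pickle(path_df)  # same result as pickle.load on the opened file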
[6]: print("\n--- Exploratory Data Analysis ---")
# 4.1 Basic Info
print("DataFrame Info:")
df.info()
print("\nDataFrame Head:")
print(df.head())
print("\nCheck for Missing Values:")
print(df.isnull().sum())
--- Exploratory Data Analysis ---
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 File_Name 2225 non-null object
1 Content 2225 non-null object
2 Category 2225 non-null object
3 Complete_Filename 2225 non-null object
4 id 2225 non-null int64
5 News_length 2225 non-null int64
dtypes: int64(2), object(4)
memory usage: 104.4+ KB
DataFrame Head:
File_Name Content Category \
0 001.txt Ad sales boost Time Warner profit\r\n\r\nQuart... business
1 002.txt Dollar gains on Greenspan speech\r\n\r\nThe do... business
2 003.txt Yukos unit buyer faces loan claim\r\n\r\nThe o... business
3 004.txt High fuel prices hit BA's profits\r\n\r\nBriti... business
4 005.txt Pernod takeover talk lifts Domecq\r\n\r\nShare... business
Complete_Filename id News_length
0 001.txt-business 1 2569
1 002.txt-business 1 2257
2 003.txt-business 1 1557
3 004.txt-business 1 2421
4 005.txt-business 1 1575
Check for Missing Values:
File_Name 0
Content 0
Category 0
Complete_Filename 0
id 0
News_length 0
dtype: int64
[15]: print("\nCategory Distribution:")
category_counts = df['Category'].value_counts()
print(category_counts)
print("\n")
plt.figure(figsize=(10, 6))
sns.countplot(data=df, y='Category', order=category_counts.index, palette='viridis')
plt.title('Distribution of News Categories')
plt.xlabel('Number of Articles')
plt.ylabel('Category')
plt.tight_layout()
plt.show()
# 4.3 News Length Analysis (using existing 'News_length' column)
print("\nNews Length Analysis:\n\n")
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='News_length', kde=True, bins=50)
plt.title('Distribution of News Article Lengths (Original)')
plt.xlabel('Number of Characters')
plt.ylabel('Frequency')
plt.show()
print(f"Average news length: {df['News_length'].mean():.2f} characters")
Category Distribution:
Category
sport 511
business 510
politics 417
tech 401
entertainment 386
Name: count, dtype: int64
News Length Analysis:
Average news length: 2274.36 characters
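A natural follow-up on the same column is a per-category length summary; this sketch only regroups the existing 'News_length' values:

print(df.groupby('Category')['News_length'].agg(['mean', 'median', 'max']).round(2))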
[9]: nltk.download("punkt_tab")  # newer NLTK releases store the Punkt tokenizer tables as 'punkt_tab'
print("\n--- Performing Text Preprocessing ---")
# Initialize Lemmatizer and Stopwords list
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
# 1. Lowercase
text = text.lower()
# 2. Remove punctuation and numbers (keep only letters and spaces)
text = re.sub(r'[^a-z\s]', '', text)
# 3. Tokenize
words = word_tokenize(text)
# 4. Remove Stop Words and Lemmatize
lemmatized_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and len(word) > 2]  # Keep words > 2 chars
# 5. Join back into string
return ' '.join(lemmatized_words)
# Apply the preprocessing function to the 'Content' column
# Using .copy() to avoid SettingWithCopyWarning
df_processed = df.copy()
print("Applying preprocessing to 'Content' column...")
df_processed['Cleaned_Content'] = df_processed['Content'].apply(preprocess_text)
print("Preprocessing complete.")
# Display comparison for one example
print("\nOriginal Content (first row):")
print(df.iloc[0]['Content'][:500] + "...") # Show first 500 chars
print("\nCleaned Content (first row):")
print(df_processed.iloc[0]['Cleaned_Content'][:500] + "...")
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
--- Performing Text Preprocessing ---
Applying preprocessing to 'Content' column...
Preprocessing complete.
Original Content (first row):
Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m)
for the three months to December, from $639m year-earlier.
The firm, which is now one of the biggest investors in Google, benefited from
sales of high-speed internet connections and higher advert sales. TimeWarner
said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were
buoyed by one-off gains which offset a profit dip at Warner Bros, and less users
for AOL.
...
Cleaned Content (first row):
sale boost time warner profit quarterly profit medium giant timewarner jumped
three month december yearearlier firm one biggest investor google benefited sale
highspeed internet connection higher advert sale timewarner said fourth quarter
sale rose profit buoyed oneoff gain offset profit dip warner bros less user aol
time warner said friday owns searchengine google internet business aol mixed
fortune lost subscriber fourth quarter profit lower preceding three quarter
however company said aols un...
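A quick sanity check of preprocess_text on a made-up sentence (the sentence is illustrative, not from the dataset; note that the default noun-POS lemmatization leaves 'running' untouched):

sample = "The 3 dogs were running faster than expected!"
print(preprocess_text(sample))  # -> 'dog running faster expected'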
[10]: print("\n--- Performing Label Encoding on 'Category' ---")
label_encoder = LabelEncoder()
# Apply Label Encoding
df_processed['Category_Encoded'] = label_encoder.fit_transform(df_processed['Category'])
# Display mapping
category_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Category to Encoded Label Mapping:")
print(category_mapping)
print("\nDataFrame Head with Processed Columns:")
print(df_processed[['Category', 'Category_Encoded', 'Cleaned_Content']].head())
--- Performing Label Encoding on 'Category' ---
Category to Encoded Label Mapping:
{'business': np.int64(0), 'entertainment': np.int64(1), 'politics': np.int64(2),
'sport': np.int64(3), 'tech': np.int64(4)}
DataFrame Head with Processed Columns:
Category Category_Encoded \
0 business 0
1 business 0
2 business 0
3 business 0
4 business 0
Cleaned_Content
0 sale boost time warner profit quarterly profit...
1 dollar gain greenspan speech dollar hit highes...
2 yukos unit buyer face loan claim owner embattl...
3 high fuel price hit ba profit british airway b...
4 pernod takeover talk lift domecq share drink f...
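A sketch of the inverse direction, useful later for decoding model predictions back to category names (the integer array here is illustrative, not model output):

sample_preds = np.array([0, 3, 4])
print(label_encoder.inverse_transform(sample_preds))  # -> ['business' 'sport' 'tech']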
[11]: print("\n--- Creating TF-IDF Representations ---")
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit features to top 5000 for efficiency
# Fit and transform the cleaned text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df_processed['Cleaned_Content'])
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
print(f"(Number of documents: {tfidf_matrix.shape[0]}, Number of unique terms/features: {tfidf_matrix.shape[1]})")
--- Creating TF-IDF Representations ---
Shape of TF-IDF matrix: (2225, 5000)
(Number of documents: 2225, Number of unique terms/features: 5000)
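A hedged sketch for inspecting what the matrix encodes: the ten highest-weighted terms of the first document, read off the fitted vocabulary (get_feature_names_out requires scikit-learn >= 1.0):

feature_names = tfidf_vectorizer.get_feature_names_out()
row = tfidf_matrix[0].toarray().ravel()  # densify one document row
for i in row.argsort()[::-1][:10]:       # indices of the largest weights
    print(f"{feature_names[i]}: {row[i]:.3f}")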
[13]: print("\n--- Saving Processed Data and TF-IDF Objects ---")
# Define output paths (adjust as needed, Kaggle uses /kaggle/working/)
output_dir = "/content/"  # trailing slash so the string concatenation below yields valid paths
processed_df_path = output_dir + "processed_news_data.pkl"
tfidf_matrix_path = output_dir + "tfidf_matrix.npz"
tfidf_vectorizer_path = output_dir + "tfidf_vectorizer.pkl"
label_encoder_path = output_dir + "label_encoder.pkl"
# Save the processed DataFrame
df_processed.to_pickle(processed_df_path)
print(f"Processed DataFrame saved to: {processed_df_path}")
# Save the TF-IDF matrix (sparse format)
save_npz(tfidf_matrix_path, tfidf_matrix)
print(f"TF-IDF matrix saved to: {tfidf_matrix_path}")
# Save the TF-IDF vectorizer
with open(tfidf_vectorizer_path, 'wb') as f:
pickle.dump(tfidf_vectorizer, f)
print(f"TF-IDF vectorizer saved to: {tfidf_vectorizer_path}")
# Save the Label Encoder
with open(label_encoder_path, 'wb') as f:
pickle.dump(label_encoder, f)
print(f"Label encoder saved to: {label_encoder_path}")
print("\n--- All Steps Completed Successfully ---")
--- Saving Processed Data and TF-IDF Objects ---
Processed DataFrame saved to: /content/processed_news_data.pkl
TF-IDF matrix saved to: /content/tfidf_matrix.npz
TF-IDF vectorizer saved to: /content/tfidf_vectorizer.pkl
Label encoder saved to: /content/label_encoder.pkl
--- All Steps Completed Successfully ---
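Finally, a minimal sketch of reloading the saved artifacts in a downstream notebook, assuming the paths defined above:

from scipy.sparse import load_npz
df_reloaded = pd.read_pickle(processed_df_path)   # processed DataFrame
X = load_npz(tfidf_matrix_path)                   # sparse TF-IDF matrix
with open(tfidf_vectorizer_path, 'rb') as f:
    vec = pickle.load(f)                          # fitted TfidfVectorizer
with open(label_encoder_path, 'rb') as f:
    le = pickle.load(f)                           # fitted LabelEncoder
print(df_reloaded.shape, X.shape, len(le.classes_))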