
This notebook analyzes hate speech on Twitter with natural language processing techniques. It loads a Twitter dataset labeled for hate speech and cleans the text by removing URLs, user handles, hashtag symbols, and stop words. It then builds a word cloud, counts term frequencies, vectorizes the text with TF-IDF, and trains a logistic regression classifier to identify hate speech. Hyperparameters are tuned with a grid search, and the selected model is evaluated with stratified k-fold cross-validation.

Uploaded by

rajat raina


12/5/2020 TwitterHate_NLP.ipynb - Colaboratory

Importing Packages

from collections import Counter

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import regex as re
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from wordcloud import STOPWORDS, WordCloud

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data] Package stopwords is already up-to-date!
True

Loading Twitter Dataset

sentiment_data = pd.read_csv('/content/TwitterHate.csv')
print(len(sentiment_data))
sentiment_data.head()

31962
   id  label  tweet
0   1      0  @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0  bihday your majesty
3   4      0  #model i love u take with u all the time in ...
4   5      0  factsguide: society now #motivation


sentiment_data['label'].value_counts()
#Imbalanced Dataset

0 29720
1 2242
Name: label, dtype: int64
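
The counts above show roughly 13 non-hate tweets for every hate tweet. As a minimal sketch of what the class_weight='balanced' option used later in the notebook does with such counts, the snippet below applies scikit-learn's documented heuristic n_samples / (n_classes * count) in plain Python. The helper name balanced_class_weights is hypothetical and exists only for this illustration.

```python
def balanced_class_weights(counts):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Label counts printed by value_counts() above
label_counts = {0: 29720, 1: 2242}
weights = balanced_class_weights(label_counts)

for label, weight in weights.items():
    print(label, round(weight, 3))
```

With these weights each class contributes the same total weight to the loss, so errors on the rare class are no longer cheap to make.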

# from imblearn.over_sampling import RandomOverSampler

/usr/local/lib/python3.6/dist-packages/sklearn/externals/six.py:31: FutureWarning: The m

"(https://pypi.org/project/six/).", FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning:
warnings.warn(message, FutureWarning)

Cleaning Text using Regex

def textcleanup(data):
    """Clean and tokenize every tweet in data['tweet']."""
    tk = TweetTokenizer()
    stop_words = set(stopwords.words('english'))
    tweet_list = []
    word_list = []
    for tweet in list(data['tweet']):
        tweet = tweet.encode('ascii', 'ignore').decode('ascii')  # Drop non-ASCII characters
        tweet = re.sub(r'[^ ]+\.[^ ]+', '', tweet)  # Remove URLs (any token containing a dot)
        tweet = re.sub("[#'']", '', tweet)  # Remove # and apostrophes
        tweet = re.sub(r'@\w+', '', tweet)  # Remove user handles
        tweet = re.sub(r'^RT\b', '', tweet)  # Remove a leading RT tag
        tweet = re.sub(r"\W+\+[A-Za-z0-9]+\d+\D|\+[A-Za-z0-9]+\d+\D+\w", '', tweet)  #Remove redu
        tweet = re.sub(r"\b[a]+[m]+[p]\b", '', tweet)  # Remove "amp" left over from &amp;
        tweet = tweet.lower().strip()
        tweet = tk.tokenize(tweet)
        tweet = [word for word in tweet if word not in stop_words]
        tweet = [word for word in tweet if len(word) > 1]  # Drop single-character tokens
        tweet_list.append(tweet)
        word_list.extend(tweet)

    return tweet_list, word_list, stop_words

cleantext, wordlist, stop_words = textcleanup(sentiment_data)
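
To see what this kind of cleanup does to a single tweet, here is a minimal sketch that uses only the standard-library re module. The function name clean_tweet, the sample tweet, and the simplified patterns (for example, matching only URLs that start with http:// or https://) are assumptions made for illustration; they are not the notebook's exact expressions.

```python
import re


def clean_tweet(text):
    """Strip URLs, handles, '#' symbols and HTML '&amp;' from one tweet."""
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = re.sub(r"https?://\S+", "", text)  # scheme-prefixed URLs only
    text = re.sub(r"@\w+", "", text)          # user handles
    text = text.replace("#", "")              # keep the hashtag word itself
    text = text.replace("&amp;", "")          # HTML-escaped ampersand
    text = re.sub(r"\s+", " ", text)          # collapse repeated whitespace
    return text.strip().lower()


sample = "@user Thanks for #lyft credit &amp; more https://t.co/abc"
print(clean_tweet(sample))
```

Removing the handle, the hashtag symbol, and the link leaves only the words that carry the tweet's content, which is what the tokenizer and stop-word filter then work on.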

Getting 10 most common terms after cleaning the text


word_count = Counter(wordlist)
word_count.most_common(10)

[('love', 2725),
('day', 2247),
('happy', 1673),
('im', 1155),
('time', 1115),
('life', 1114),
('like', 1089),
('today', 993),
('new', 989),
('positive', 934)]

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(' '.join(wordlist))

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


Joining the tokens back to form strings

detokenizer = TreebankWordDetokenizer()
clean_sentiments = []

for sent in cleantext:
    detokenized_sent = detokenizer.detokenize(sent)
    clean_sentiments.append(detokenized_sent)

clean_sentiments[0]

'father dysfunctional selfish drags kids dysfunction run'

newframe = {'labels' : sentiment_data['label'], 'clean_sentiments' : clean_sentiments }

sentiments_frame = pd.DataFrame(newframe)
sentiments_frame.head()

   labels  clean_sentiments
0       0  father dysfunctional selfish drags kids dysfun...
1       0  thanks lyft credit cant use cause dont offer w...
2       0  bihday majesty
3       0  model love take time ur
4       0  factsguide society motivation

Using TF-IDF values for the terms as features in a vector space model

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=10,
    strip_accents='unicode',
    max_features=5000
)

tfidf_data = tfidf_vectorizer.fit_transform(sentiments_frame['clean_sentiments'])
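
As a rough illustration of the weights TF-IDF assigns, the sketch below computes them by hand for two tiny documents. It follows the smoothed formula that TfidfVectorizer documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, but it is a simplified stand-in: the tfidf helper and the two sample documents are hypothetical, tokens are split on whitespace only, and the max_df, min_df, and max_features filters are not applied.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


docs = ["love day", "love life"]  # hypothetical cleaned tweets
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in sorted(row.items())})
```

The word "love" appears in both documents, so it receives a lower weight than "day" or "life", which each appear in only one.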

Splitting Data into train, test and Creating Model

#Splitting Data
X_train, X_test, y_train, y_test = train_test_split(tfidf_data,sentiments_frame['labels'],

#Creating Model
model = LogisticRegression()
model.fit(X_train,y_train)

train_score = model.score(X_train,y_train)
test_score = model.score(X_test,y_test)

print(train_score)
print(test_score)

0.9557276389377762
0.9510402002189895
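
An accuracy near 0.95 needs context on a dataset this imbalanced. The short sketch below computes the accuracy of a trivial baseline that always predicts the majority label, using the label counts reported by value_counts() earlier. It is plain arithmetic and not part of the original notebook.

```python
# Label counts reported by value_counts() earlier in the notebook
label_counts = {0: 29720, 1: 2242}

majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / sum(label_counts.values())

print(majority_label, round(baseline_accuracy, 4))
```

Always predicting label 0 is already correct for about 93% of the tweets, so the per-class precision and recall are more informative than accuracy here.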

#Generating and Plotting Confusion Matrix

cf_matrix = confusion_matrix(y_test, model.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cf_matrix, annot=True, fmt='d')  # fmt='d' prints the counts as integers

<matplotlib.axes._subplots.AxesSubplot at 0x7f2f86c36320>

#Classification Report for Test Data

print(classification_report(y_test, model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      5937
           1       0.90      0.35      0.51       456

    accuracy                           0.95      6393
   macro avg       0.93      0.68      0.74      6393
weighted avg       0.95      0.95      0.94      6393
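
To make the relationship between the columns of this report concrete, the sketch below derives precision, recall, F1, and accuracy for the positive class from confusion-matrix counts. The source shows the confusion matrix only as a plot, so the four counts are hypothetical: they were chosen to agree with the supports (5937 and 456) and the rounded scores in the report, and other nearby values would fit as well.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical counts: tn + fp = 5937 (class 0), tp + fn = 456 (class 1)
precision, recall, f1, accuracy = binary_metrics(tp=161, fp=18, fn=295, tn=5919)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

Under these counts the model misses 295 of the 456 hate tweets while accuracy stays at 0.95, which is why the later tuning step optimizes recall.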

#Classification Report for Train Data

print(classification_report(y_train, model.predict(X_train)))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     23783
           1       0.94      0.39      0.55      1786

    accuracy                           0.96     25569
   macro avg       0.95      0.69      0.76     25569
weighted avg       0.96      0.96      0.95     25569

Using Grid Search and Stratified K-Fold for Hyperparameter Tuning

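# 'l1' needs a solver that supports it (liblinear or saga); the default lbfgs handles only 'l2'.
# None (no weighting) replaces 'auto', which current scikit-learn does not accept.
parameters = [{'penalty': ['l1', 'l2'],
               'C': [1, 10, 100, 1000],
               'class_weight': [None, 'balanced']}]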

grid_sr = GridSearchCV(
LogisticRegression(class_weight="balanced"), parameters, scoring='recal
)
grid_sr.fit(X_train, y_train)

grid_sr.best_params_

{'C': 1, 'class_weight': 'balanced', 'penalty': 'l2'}
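
A grid search fits one model per parameter combination and per cross-validation fold. The sketch below enumerates a grid shaped like the one above in plain Python to show how many candidates that is. The expand_grid helper is hypothetical, None stands in for the unweighted option, and the fold count of 5 is an assumption (it is the GridSearchCV default in recent scikit-learn releases; the cv argument is cut off in the source).

```python
from itertools import product

# Grid shaped like the one searched above (values are illustrative)
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [1, 10, 100, 1000],
    "class_weight": [None, "balanced"],
}


def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]


candidates = expand_grid(param_grid)
n_folds = 5  # assumption: the cv setting is not visible in the source

print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Each candidate is scored by its mean validation score across the folds, and the best-scoring combination is what best_params_ reports.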

kfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
labels = sentiments_frame['labels']

# Enumerate the splits and fit the tuned model on each fold
for train_ix, test_ix in kfold.split(tfidf_data, labels):
    # Select rows
    train_X, test_X = tfidf_data[train_ix], tfidf_data[test_ix]
    train_y, test_y = labels.iloc[train_ix], labels.iloc[test_ix]
    model_test = LogisticRegression(C=1, class_weight='balanced', penalty='l2')
    # Fit on this fold's training rows so the held-out fold stays unseen
    model_test.fit(train_X, train_y)

    train_score1 = model_test.score(train_X, train_y)
    test_score1 = model_test.score(test_X, test_y)

    print('Train Score', train_score1)
    print('Test Score', test_score1)

    # Stop at the first fold where the test score exceeds the train score
    if test_score1 > train_score1:
        break

Train Score 0.9289141045429894

Test Score 0.9300463020898511
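
Stratified splitting keeps the share of hate tweets roughly equal in every fold. The sketch below shows the idea in plain Python by dealing the row indices of each class round-robin across the folds. The stratified_folds helper is hypothetical, and scikit-learn's StratifiedKFold with shuffle=True assigns individual rows differently, but the per-fold class counts follow the same principle.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Same class counts as the dataset: 29720 zeros and 2242 ones
labels = [0] * 29720 + [1] * 2242
folds = stratified_folds(labels, n_splits=4)

fold_sizes = [len(fold) for fold in folds]
fold_positives = [sum(labels[i] for i in fold) for fold in folds]
print(fold_sizes, fold_positives)
```

With 4 folds every held-out fold holds about 7,990 tweets and about 560 hate tweets, which keeps the evaluation on the rare class comparable from fold to fold.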

print(classification_report(test_y,model_test.predict(test_X)))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96      7430
           1       0.50      0.91      0.65       561

    accuracy                           0.93      7991
   macro avg       0.75      0.92      0.80      7991
weighted avg       0.96      0.93      0.94      7991

print(classification_report(train_y,model_test.predict(train_X)))

              precision    recall  f1-score   support

           0       0.99      0.93      0.96     22290
           1       0.50      0.93      0.65      1681

    accuracy                           0.93     23971
   macro avg       0.75      0.93      0.80     23971
weighted avg       0.96      0.93      0.94     23971

Best Parameters : (C= 1, class_weight= 'balanced', penalty= 'l2')

Recall on the test set for the toxic comments : 0.91
F1 score on the test set for the toxic comments : 0.65
