Miniproject NLP
A REPORT ON
SENTIMENT ANALYSIS USING BERT
TRANSFORMER
SUBMITTED BY
Problem Statement:
In today's digital world, a massive amount of textual data is generated daily through social media,
product reviews, news articles, and more. Understanding the sentiment behind this data is crucial for
businesses and researchers to make informed decisions. Traditional machine learning approaches
often fall short in capturing context and semantics effectively. Hence, there is a need for a more
robust, context-aware model like BERT (Bidirectional Encoder Representations from Transformers) to
accurately perform sentiment analysis on textual data.
Objectives:
Outcomes:
1. Successfully built and fine-tuned a BERT-based sentiment analysis model on a publicly
available dataset of Google Play app reviews.
2. Visualized results using a confusion matrix and a classification report (precision, recall,
and F1-score), showing where the model performs well and where it still struggles with
ambiguous, neutral inputs.
Code:
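If running in a fresh environment such as Google Colab, the Hugging Face library must be installed first (this install cell is assumed here; it was not part of the original listing):
!pip install -q transformers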
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
%matplotlib inline
Data Exploration
In [0]:
df = pd.read_csv("reviews.csv")
df.head()
Out[0]:
[df.head() output: the first five rows of the Google Play app reviews dataset. Columns:
userName, userImage, content, score, thumbsUpCount, reviewCreatedVersion, at,
replyContent, repliedAt, sortOrder, appId.]
df.shape
(15746, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15746 entries, 0 to 15745
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userName 15746 non-null object
1 userImage 15746 non-null object
2 content 15746 non-null object
3 score 15746 non-null int64
4 thumbsUpCount 15746 non-null int64
5 reviewCreatedVersion 13533 non-null object
6 at 15746 non-null object
7 replyContent 7367 non-null object
8 repliedAt 7367 non-null object
9 sortOrder 15746 non-null object
10 appId 15746 non-null object
dtypes: int64(2), object(9)
memory usage: 1.3+ MB
In [0]:
sns.countplot(df.score)
plt.xlabel('review score');
In [0]:
def to_sentiment(rating):
  rating = int(rating)
  if rating <= 2:
    return 0
  elif rating == 3:
    return 1
  else:
    return 2

df['sentiment'] = df.score.apply(to_sentiment)
In [0]:
class_names = ['negative', 'neutral', 'positive']
In [0]:
ax = sns.countplot(df.sentiment)
plt.xlabel('review sentiment')
ax.set_xticklabels(class_names);
Data Preprocessing
• Add special tokens to separate sentences and do classification
• Pass sequences of constant length (introduce padding)
• Create an array of 0s (pad tokens) and 1s (real tokens) called the attention mask
All three steps are handled by the tokenizer's encode_plus() method, shown below.
BERT
We are using BERT BASE CASED, i.e. the case-sensitive BERT BASE model with 12 stacked
Transformer encoder layers.
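As a quick sanity check (an illustrative snippet, not part of the original notebook), the pretrained configuration can be inspected to confirm this architecture:
from transformers import BertConfig
config = BertConfig.from_pretrained('bert-base-cased')
# Expect 12 encoder layers, 12 attention heads, 768-dimensional hidden states
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)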
In [0]:
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
In [0]:
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
Special Tokens
[SEP] - marker for the end of a sentence
[CLS] - we must add this token to the start of each sentence, so BERT knows we're doing
classification
There is also a special token for padding: [PAD]
BERT understands only tokens that were in its training vocabulary. Everything else can be
encoded using the [UNK] (unknown) token.
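These special tokens and their vocabulary ids can be inspected directly (a short illustrative snippet; the ids shown in the comments hold for bert-base-cased):
print(tokenizer.sep_token, tokenizer.sep_token_id) # [SEP] 102
print(tokenizer.cls_token, tokenizer.cls_token_id) # [CLS] 101
print(tokenizer.pad_token, tokenizer.pad_token_id) # [PAD] 0
print(tokenizer.unk_token, tokenizer.unk_token_id) # [UNK] 100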
All of that work can be done using the encode_plus() method:
In [0]:
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
encoding = tokenizer.encode_plus(
  sample_txt,
  max_length=32,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  pad_to_max_length=True,
  return_attention_mask=True, # returns 0 for padding tokens
  return_tensors='pt', # Return PyTorch tensors
)
encoding.keys()
Out[0]:
dict_keys(['input_ids', 'attention_mask'])
The token ids are now stored in a Tensor and padded to a length of 32:
In [0]:
print(len(encoding['input_ids'][0]))
encoding['input_ids'][0]
32
Out[0]:
tensor([ 101, 1332, 1108,  146, 1314, 1796,  136,  146, 1821, 5342, 1120, 1313,
        1111,  123, 2277,  119,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])
The attention mask has the same length:
In [0]:
print(len(encoding['attention_mask'][0]))
encoding['attention_mask']
32
Out[0]:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
We can invert the tokenization to have a look at the special tokens:
In [0]:
tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
Out[0]:
['[CLS]',
'When',
'was',
'I',
'last',
'outside',
'?',
'I',
'am',
'stuck',
'at',
'home',
'for',
'2',
'weeks',
'.',
'[SEP]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]',
'[PAD]']
In [0]:
token_lens = []
for txt in df.content:
  tokens = tokenizer.encode(txt, max_length=512)
  token_lens.append(len(tokens))
sns.distplot(token_lens)
plt.xlim([0, 256]);
plt.xlabel('Token count');
Most of the reviews seem to contain fewer than 128 tokens, but to be on the safe side we'll
choose a maximum length of 160.
In [0]:
MAX_LEN = 160
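To check that 160 is generous enough, one can compute the fraction of reviews that would be truncated (a quick supplementary check, assuming the token_lens list computed above):
# Fraction of reviews longer than MAX_LEN tokens; expected to be small
print(np.mean(np.array(token_lens) > MAX_LEN))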
In [0]:
class GPReviewDataset(Dataset):
  def __init__(self, reviews, targets, tokenizer, max_len):
    self.reviews = reviews
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.reviews)

  def __getitem__(self, item):
    review = str(self.reviews[item])
    target = self.targets[item]
    encoding = self.tokenizer.encode_plus(
      review,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      pad_to_max_length=True,
      return_attention_mask=True,
      return_tensors='pt',
    )
    return {
      'review_text': review,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }
The tokenizer is doing most of the heavy lifting for us. We also return the review texts, so it'll be
easier to evaluate the predictions from our model. Let's split the data:
In [0]:
df_train, df_test = train_test_split(df, test_size=0.1, random_state=42)
df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=42)

def create_data_loader(df, tokenizer, max_len, batch_size):
  ds = GPReviewDataset(df.content.to_numpy(), df.sentiment.to_numpy(), tokenizer, max_len)
  return DataLoader(
    ds,
    batch_size=batch_size,
    num_workers=4
  )
In [0]:
BATCH_SIZE = 16
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)
data = next(iter(train_data_loader))
data.keys()
Out[0]:
dict_keys(['review_text', 'input_ids', 'attention_mask', 'targets'])
In [0]:
print(data['input_ids'].shape)
print(data['attention_mask'].shape)
print(data['targets'].shape)
torch.Size([16, 160])
torch.Size([16, 160])
torch.Size([16])
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
In [0]:
# Note: in transformers v4+, pass return_dict=False to unpack a tuple like this
last_hidden_state, pooled_output = bert_model(
  input_ids=encoding['input_ids'],
  attention_mask=encoding['attention_mask']
)
last_hidden_state.shape
Out[0]:
torch.Size([1, 32, 768])
bert_model.config.hidden_size
Out[0]:
768
This is the dimensionality of the hidden representation BERT produces for each token (and the width of the encoder's feed-forward sub-layers).
pooled_output.shape
Out[0]:
torch.Size([1, 768])
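The classifier definition itself was elided from this report. Below is a minimal reconstruction of the standard setup for this kind of fine-tuning: dropout applied to BERT's pooled [CLS] output, followed by a linear layer over the three sentiment classes. The dropout rate of 0.3 is an assumption, not taken from the original listing:
class SentimentClassifier(nn.Module):
  def __init__(self, n_classes):
    super().__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3) # assumed rate; regularizes the pooled output
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

  def forward(self, input_ids, attention_mask):
    # pooled_output is the [CLS] representation after BERT's pooler
    # (in transformers v4+, pass return_dict=False to unpack a tuple like this)
    _, pooled_output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    return self.out(self.drop(pooled_output))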
In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = SentimentClassifier(len(class_names))
model = model.to(device)
We'll move the example batch of our training data to the GPU:
In [0]:
input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
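Applying the still-untrained classifier to this batch and softmax-ing the logits gives per-class probabilities (an illustrative one-liner, not in the original listing):
# Convert raw logits to probabilities; each row sums to 1
F.softmax(model(input_ids, attention_mask), dim=1)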
Training
We'll use the AdamW optimizer provided by Hugging Face, which fixes how weight decay
interacts with Adam, following the original AdamW paper. We'll also use a linear learning-rate
scheduler with no warmup steps:
In [0]:
EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
In [0]:
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
  model = model.train()
  losses = []
  correct_predictions = 0
  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)
    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    _, preds = torch.max(outputs, dim=1)
    loss = loss_fn(outputs, targets)
    correct_predictions += torch.sum(preds == targets)
    losses.append(loss.item())
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
  return correct_predictions.double() / n_examples, np.mean(losses)
In [0]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)
      loss = loss_fn(outputs, targets)
      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses)
%%time
history = defaultdict(list)
best_accuracy = 0
for epoch in range(EPOCHS):
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)
  train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, device, scheduler, len(df_train))
  print(f'Train loss {train_loss} accuracy {train_acc}')
  val_acc, val_loss = eval_model(model, val_data_loader, loss_fn, device, len(df_val))
  print(f'Val loss {val_loss} accuracy {val_acc}')
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)
  if val_acc > best_accuracy:
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_accuracy = val_acc
Epoch 2/10
----------
Train loss 0.4158683338330777 accuracy 0.8420012701997036
Val loss 0.5365073362737894 accuracy 0.832274459974587
Epoch 3/10
----------
Train loss 0.24015077009679367 accuracy 0.922023851527768
Val loss 0.5074492372572422 accuracy 0.8716645489199493
Epoch 4/10
----------
Train loss 0.16012676668187295 accuracy 0.9546962105708843
Val loss 0.6009970247745514 accuracy 0.8703939008894537
Epoch 5/10
----------
Train loss 0.11209654617575301 accuracy 0.9675393409074872
Val loss 0.7367783848941326 accuracy 0.8742058449809403
Epoch 6/10
----------
Train loss 0.08572274737026433 accuracy 0.9764307388328276
Val loss 0.7251267762482166 accuracy 0.8843710292249047
Epoch 7/10
----------
Train loss 0.06132202987342602 accuracy 0.9833462705525369
Val loss 0.7083295831084251 accuracy 0.889453621346887
Epoch 8/10
----------
Train loss 0.050604159273123096 accuracy 0.9849693035071626
Val loss 0.753860274553299 accuracy 0.8907242693773825
Epoch 9/10
----------
Train loss 0.04373276197092931 accuracy 0.9862395032107826
Val loss 0.7506809896230697 accuracy 0.8919949174078781
Epoch 10/10
----------
Train loss 0.03768671146314381 accuracy 0.9880036694658105
Val loss 0.7431786182522774 accuracy 0.8932655654383737
CPU times: user 29min 54s, sys: 13min 28s, total: 43min 23s
Wall time: 43min 43s
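The accuracy curves can be plotted from history to see the growing train/validation gap (a supplementary sketch, not in the original listing; accuracies are cast to float in case they are CUDA tensors):
# Training vs. validation accuracy per epoch
plt.plot([float(a) for a in history['train_acc']], label='train accuracy')
plt.plot([float(a) for a in history['val_acc']], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);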
In [0]:
test_acc, _ = eval_model(model, test_data_loader, loss_fn, device, len(df_test))
test_acc.item()
Out[0]:
0.883248730964467
In [0]:
def get_predictions(model, data_loader):
  model = model.eval()
  review_texts = []
  predictions = []
  prediction_probs = []
  real_values = []
  with torch.no_grad():
    for d in data_loader:
      texts = d["review_text"]
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)
      probs = F.softmax(outputs, dim=1) # class probabilities (missing in the original listing)
      review_texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(probs)
      real_values.extend(targets)
  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return review_texts, predictions, prediction_probs, real_values
In [0]:
y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(
model,
test_data_loader
)
Let's have a look at the classification report:
In [0]:
print(classification_report(y_test, y_pred, target_names=class_names))
precision recall f1-score support
In [0]:
def show_confusion_matrix(confusion_matrix):
  hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
  hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
  hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
  plt.ylabel('True sentiment')
  plt.xlabel('Predicted sentiment');

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=class_names, columns=class_names)
show_confusion_matrix(df_cm)
This confirms that our model has difficulty classifying neutral reviews: it mistakes them for
negative and positive at roughly equal rates.
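One way to quantify this (a small supplementary check, assuming cm from above) is per-class recall, read off the diagonal of the confusion matrix:
# Recall per class: correct predictions / true instances of that class
recall_per_class = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(class_names, recall_per_class):
  print(f'{name}: {r:.2f}')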
In [0]:
idx = 2 # index of an example test review to inspect (the choice is arbitrary)
review_text = y_review_texts[idx]
true_sentiment = y_test[idx]
pred_df = pd.DataFrame({
  'class_names': class_names,
  'values': y_pred_probs[idx]
})
In [0]:
print("\n".join(wrap(review_text)))
print()
print(f'True sentiment: {class_names[true_sentiment]}')
I used to use Habitica, and I must say this is a great step up. I'd
like to see more social features, such as sharing tasks - only one
person has to perform said task for it to be checked off, but only
giving that person the experience and gold. Otherwise, the price for
subscription is too steep, thus resulting in a sub-perfect score. I
could easily justify $0.99/month or eternal subscription for $15. If
that price could be met, as well as fine tuning, this would be easily
worth 5 stars.
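The model's predicted probability for each class can be visualized from pred_df (a sketch using the DataFrame built above):
# Horizontal bar plot of class probabilities for this review
sns.barplot(x='values', y='class_names', data=pred_df, orient='h')
plt.ylabel('sentiment')
plt.xlabel('probability')
plt.xlim([0, 1]);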
Conclusion:
This assignment demonstrates the effectiveness of the BERT Transformer for sentiment analysis.
Unlike traditional models, BERT captures context from both directions of a sentence, resulting in
more accurate sentiment predictions. The fine-tuned model reaches about 88% accuracy on the
held-out test set of Google Play reviews, with neutral reviews remaining the hardest class, and
proves to be a reliable approach for real-world sentiment classification problems.