Assignment 3
Title: Perform text cleaning, lemmatization (any method), stop word removal (any method), and
label encoding. Create representations using TF-IDF. Save the outputs. Dataset: BBC Sports
Learning Objectives:
To learn data preprocessing and embedding techniques
To work with real-world data in natural language processing
Learning Outcome:
Apply preprocessing techniques to real-world data
Prepare data for Machine Learning algorithms
Theory:
Bag of Words (BoW)
Bag of Words is a Natural Language Processing technique for text modeling.
A problem with modeling text is that it is messy, and techniques like machine learning algorithms
prefer well-defined, fixed-length inputs and outputs. Machine learning algorithms cannot work with
raw text directly; the text must be converted into numbers, specifically vectors of numbers. This is
called feature extraction or feature encoding.
The bag-of-words model is a way of representing text data when modeling text with machine
learning algorithms. It is a popular and simple method of feature extraction from text data.
A bag-of-words is a representation of text that describes the occurrence of words within a document.
It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the
document is discarded. The model is only concerned with whether known words occur in the
document, not where in the document.
The most common kind of feature calculated from the bag-of-words model is term frequency,
which is essentially the number of times a term appears in the text. Term frequency is not
necessarily the best representation of the text, but it still finds successful applications in areas
like email filtering. It is not ideal because common words such as "the", "a", and "to" are almost
always the terms with the highest frequency in the text, which shows that a high raw count does
not necessarily indicate that the corresponding word is more important.
Advantages of the BoW Approach
The most significant advantage of the bag-of-words model is its simplicity and ease of use. It can be
used to create an initial draft model before proceeding to more sophisticated word embeddings.
Disadvantages of the BoW Approach
Vocabulary: The vocabulary requires careful design, specifically to manage its size, which impacts
the sparsity of the document representations.
Sparsity: Sparse representations are harder to model, both for computational reasons (space and
time complexity) and for information reasons, since the model must harness so little information
in such a large representational space.
Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the
document (semantics). Context and meaning can offer a lot to the model; if modeled, they could
tell the difference between the same words arranged differently ("this is interesting" vs
"is this interesting"), between synonyms ("old bike" vs "used bike"), and much more. A short
sketch after this list illustrates the word-order point.
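As a quick illustration of the word-order point, here is a minimal sketch (assuming scikit-learn is
installed) showing that a bag-of-words representation assigns identical vectors to two sentences
that contain the same words in a different order:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["this is interesting", "is this interesting"]
vec = CountVectorizer()
bow = vec.fit_transform(sentences).toarray()

print(vec.get_feature_names_out())  # ['interesting' 'is' 'this']
print(bow[0])  # [1 1 1]
print(bow[1])  # [1 1 1]  -> same vector, the word order is lost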
Bag-of-words example
Let's assume we have three sentences in our corpus.
Sentence 1: Data science is fun and interesting
Sentence 2: Data science is fun
Sentence 3: science is interesting
The unique words in the sentences are : [data, science, is, fun, and, interesting]. Hence, the bag of
words vectors for the above sentences will be
Sentence 1: [1, 1, 1, 1, 1, 1]
Sentence 2: [1, 1, 1, 1, 0, 0]
Sentence 3: [0, 1, 1, 0, 0, 1]
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in
natural language processing and information retrieval. It measures how important a term is within a
document relative to a collection of documents (i.e., relative to a corpus). Words within a text
document are transformed into importance numbers by a text vectorization process. There are many
different text vectorization scoring schemes, with TF-IDF being one of the most common.
Term Frequency: The TF of a term is the number of times the term appears in a document divided
by the total number of terms in the document:
TF(t, d) = (number of times t appears in d) / (total number of terms in d)
Inverse Document Frequency: The IDF of a term reflects how rare the term is across the corpus; it
is based on the inverse of the proportion of documents that contain the term:
IDF(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number
of documents containing t.
Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher
importance values than words common across all documents (e.g., a, the, and).
The TF-IDF of a term is calculated by multiplying its TF and IDF scores:
TF-IDF(t, d) = TF(t, d) * IDF(t)
Importance of a term is high when it occurs a lot in a given document and rarely in others. In short,
commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the
corpus.
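As a small worked example, here is a sketch that computes TF-IDF by hand for two terms from the
three sentences used in the bag-of-words example above (using log base 10 for IDF, as in the
implementation later in this assignment):

import numpy as np

docs = ["data science is fun and interesting",
        "data science is fun",
        "science is interesting"]
N = len(docs)

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term):
    df = sum(1 for d in docs if term in d.split())
    return np.log10(N / df)

# "and" occurs only in the first document: high IDF, non-zero TF-IDF there
print(tf("and", docs[0]) * idf("and"))   # (1/6) * log10(3/1) ~ 0.0795
# "is" occurs in every document: IDF = log10(3/3) = 0, so TF-IDF = 0
print(tf("is", docs[0]) * idf("is"))     # 0.0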
TF-IDF is useful in many natural language processing applications. For example, Search Engines
use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text
classification, text summarization, and topic modeling.
Bag of Words Algorithm Implementation
''' vectorize() takes the list of tokens in a sentence as input and
returns a vector the size of filtered_vocab. Each entry is the count
of the corresponding vocabulary word in tokens (0 if absent).'''
def vectorize(tokens):
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

'''unique() returns a list in which the original order is preserved
and no item repeats. The set() function does not preserve the original
ordering, so it is not used directly here.'''
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# create a list of stopwords. You can import stopwords from nltk too
stopwords = ["to", "was", "a"]
# list of special characters. You can use regular expressions too
special_char = [",", ":", " ", ";", ".", "?"]

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# convert strings to lower case
string1 = string1.lower()
string2 = string2.lower()
string3 = string3.lower()

# split the sentences into tokens
tokens1 = string1.split()
tokens2 = string2.split()
tokens3 = string3.split()
print(tokens1)
print(tokens2)
print(tokens3)

# create a vocabulary list
vocab = unique(tokens1 + tokens2 + tokens3)
print(vocab)

# filter the vocabulary list
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)
print("Final filtered vocabulary: ", filtered_vocab)

# convert sentences into vectors
vector1 = vectorize(tokens1)
print("Sentence 1 vector :", vector1)
vector2 = vectorize(tokens2)
print("Sentence 2 vector :", vector2)
vector3 = vectorize(tokens3)
print("Sentence 3 vector :", vector3)
Creating Bag of Words using sklearn library
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"
doc = string1 + " " + string2 + " " + string3

CountVec = CountVectorizer(ngram_range=(1, 1))
# fit and transform the three sentences
Count_data = CountVec.fit_transform([string1, string2, string3])
# create a dataframe with one column per vocabulary word
cv_dataframe = pd.DataFrame(Count_data.toarray(),
                            columns=CountVec.get_feature_names_out())
print(cv_dataframe)
Note that CountVectorizer lowercases the text and sorts the vocabulary alphabetically before generating the vectors.
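For the three sentences above, the resulting dataframe should therefore look like this (columns in
alphabetical order):

   and  data  fun  interesting  is  science
0    1     1    1            1   1        1
1    0     1    1            0   1        1
2    0     0    0            1   1        1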
Count Occurrence
count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
# pair each vocabulary word with its count in the combined document
count_occur_df = pd.DataFrame(
    zip(count_vec.get_feature_names_out(), count_occurs.toarray()[0]),
    columns=['Word', 'Count'])
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()
Normalized Count Occurrence
from sklearn.feature_extraction.text import TfidfVectorizer

norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    zip(norm_count_vec.get_feature_names_out(), norm_count_occurs.toarray()[0]),
    columns=['Word', 'Count'])
norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
norm_count_occur_df.head()
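With use_idf=False and norm='l2', TfidfVectorizer simply divides each raw count by the Euclidean
(L2) norm of the document's count vector, i.e. count / sqrt(sum of squared counts), so the resulting
values reflect relative rather than absolute frequency and longer documents are not favoured purely
because of their length.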
import pandas as pd
import numpy as np

corpus = ['Natural language processing is fun and interesting',
          'Natural language processing is fun',
          'Hindi language is interesting']

# creating a word set for the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)
Computing Term Frequency
# create a dataframe with one row per document and one column per word,
# then use it to compute the term frequency (TF)
n_docs = len(corpus)          # number of documents in the corpus
n_words_set = len(words_set)  # number of unique words in the corpus
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # words in the document
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))
df_tf
Computing Inverse Document Frequency
print("IDF of: ")
idf = {}
for w in words_set:
k = 0 # number of documents in the corpus that contain this word
for i in range(n_docs):
if w in corpus[i].split():
k += 1
idf[w] = np.log10(n_docs / k)
print(f'{w:>15}: {idf[w]:>10}' )
df_tf_idf = df_tf.copy()
for w in words_set:
for i in range(n_docs):
df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf
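Note that in this small corpus the words that occur in all three documents ("language" and "is") get
IDF = log10(3/3) = 0, so their TF-IDF is 0 in every document, while words that occur in only one
document ("and", "Hindi") get the largest IDF, log10(3) which is approximately 0.477.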
TF-IDF using sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)
tf_idf_array = tf_idf_vector.toarray()
print(tf_idf_array)
words_set = tf_idf_model.get_feature_names_out()
print(words_set)
df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)
df_tf_idf
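Sample pipeline for the assignment (BBC Sports)
Putting the pieces together, below is a minimal sketch of the full assignment pipeline: text cleaning,
lemmatization with NLTK's WordNetLemmatizer, stop word removal, label encoding, TF-IDF
representation, and saving the outputs. The dataset path (a bbcsport/ folder with one sub-folder per
class) and the output file names are assumptions; adjust them to however your copy of the BBC
Sports dataset is stored.

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import load_files
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

# assumed layout: bbcsport/<class_name>/*.txt (e.g. athletics, cricket, football, rugby, tennis)
data = load_files('bbcsport', encoding='utf-8', decode_error='replace')
texts = data.data
class_names = [data.target_names[t] for t in data.target]

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # text cleaning: lowercase and keep letters only
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    # stop word removal and lemmatization
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

cleaned = [clean_text(t) for t in texts]

# label encoding of the class names
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(class_names)

# TF-IDF representation of the cleaned documents
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)

# save outputs (file names are assumptions)
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df['label'] = labels
tfidf_df.to_csv('bbcsport_tfidf.csv', index=False)
pd.DataFrame({'text': cleaned, 'label': labels}).to_csv('bbcsport_cleaned.csv', index=False)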
Conclusion:
Bag of Words, TF-IDF, and Word2Vec are text representation (embedding) techniques that
transform textual data into numerical form.