Assignment 3
Title: Perform text cleaning, lemmatization (any method), stop word removal (any method), and
label encoding. Create representations using TF-IDF. Save the outputs. Dataset: BBC Sports
Learning Objectives:
To learn data preprocessing and embedding techniques
To work with real-world data in natural language processing
Learning Outcome:
Apply preprocessing techniques to real-world data
Prepare data for Machine Learning algorithms
Theory:
Bag of Words (BoW)
Bag of Words is a Natural Language Processing technique for text modeling.
A problem with modeling text is that it is messy, and techniques like machine learning algorithms
prefer well-defined, fixed-length inputs and outputs. Machine learning algorithms cannot work with
raw text directly; the text must be converted into numbers, specifically vectors of numbers. This is
called feature extraction or feature encoding.
The bag-of-words model is a way of representing text data when modeling text with machine
learning algorithms. It is a popular and simple method of feature extraction from text data.
A bag-of-words is a representation of text that describes the occurrence of words within a document.
It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the
document is discarded. The model is only concerned with whether known words occur in the
document, not where in the document.
The most common kind of feature calculated from the bag-of-words model is term frequency,
which is essentially the number of times a term appears in the text. Term frequency is not
necessarily the best representation of the text, but it still finds successful applications in areas
like email filtering. It is not ideal because common words such as "the", "a", and "to" are almost
always the terms with the highest frequency in the text, which shows that a high raw count does
not necessarily indicate that the corresponding word is more important.
Advantages of the BoW Approach
The most significant advantage of the bag-of-words model is its simplicity and ease of use. It can be
used to create an initial draft model before proceeding to more sophisticated word embeddings.
Disadvantages of the BoW Approach
Vocabulary: The vocabulary requires careful design, specifically to manage its size, which impacts
the sparsity of the document representations.
Sparsity: Sparse representations are harder to model, both for computational reasons (space and
time complexity) and for information reasons, since the model must harness so little information
in such a large representational space.
Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the
document (semantics). Context and meaning can offer a lot to the model; if modeled, they could
tell the difference between the same words arranged differently ("this is interesting" vs
"is this interesting"), between synonyms ("old bike" vs "used bike"), and much more. A short
sketch after this list illustrates the word-order point.
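As a quick illustration of the word-order point, here is a minimal sketch (assuming scikit-learn is
installed) showing that a bag-of-words representation assigns identical vectors to two sentences
that contain the same words in a different order:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["this is interesting", "is this interesting"]
vec = CountVectorizer()
bow = vec.fit_transform(sentences).toarray()

print(vec.get_feature_names_out())  # ['interesting' 'is' 'this']
print(bow[0])  # [1 1 1]
print(bow[1])  # [1 1 1]  -> same vector, the word order is lost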
Bag-of-words example
Let's assume we have three sentences in our corpus.
Sentence 1: Data science is fun and interesting
Sentence 2: Data science is fun
Sentence 3: science is interesting
The unique words in the sentences are : [data, science, is, fun, and, interesting]. Hence, the bag of
words vectors for the above sentences will be
Sentence 1: [1, 1, 1, 1, 1, 1]
Sentence 2: [1, 1, 1, 1, 0, 0]
Sentence 3: [0, 1, 1, 0, 0, 1]
TF-IDF
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in
natural language processing and information retrieval. It measures how important a term is within a
document relative to a collection of documents (i.e., relative to a corpus). Words within a text
document are transformed into importance numbers by a text vectorization process. There are many
different text vectorization scoring schemes, with TF-IDF being one of the most common.
Term Frequency: The TF of a term is the number of times the term appears in a document divided
by the total number of terms in the document:
TF(t, d) = (number of times t appears in d) / (total number of terms in d)
Inverse Document Frequency: The IDF of a term reflects how rare the term is across the corpus; it
is based on the inverse of the proportion of documents that contain the term:
IDF(t) = log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number
of documents containing t.
Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher
importance values than words common across all documents (e.g., a, the, and).
The TF-IDF of a term is calculated by multiplying its TF and IDF scores:
TF-IDF(t, d) = TF(t, d) * IDF(t)
Importance of a term is high when it occurs a lot in a given document and rarely in others. In short,
commonality within a document measured by TF is balanced by rarity between documents
measured by IDF. The resulting TF-IDF score reflects the importance of a term for a document in the
corpus.
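As a small worked example, here is a sketch that computes TF-IDF by hand for two terms from the
three sentences used in the bag-of-words example above (using log base 10 for IDF, as in the
implementation later in this assignment):

import numpy as np

docs = ["data science is fun and interesting",
        "data science is fun",
        "science is interesting"]
N = len(docs)

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)

def idf(term):
    df = sum(1 for d in docs if term in d.split())
    return np.log10(N / df)

# "and" occurs only in the first document: high IDF, non-zero TF-IDF there
print(tf("and", docs[0]) * idf("and"))   # (1/6) * log10(3/1) ~ 0.0795
# "is" occurs in every document: IDF = log10(3/3) = 0, so TF-IDF = 0
print(tf("is", docs[0]) * idf("is"))     # 0.0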
TF-IDF is useful in many natural language processing applications. For example, Search Engines
use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text
classification, text summarization, and topic modeling.
Bag of Words Algorithm Implementation
''' vectorize() takes the list of tokens in a sentence as input and
returns a vector the size of filtered_vocab. Each entry is the count
of the corresponding vocabulary word in tokens (0 if absent).'''
def vectorize(tokens):
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector

'''unique() returns a list in which the original order is preserved
and no item repeats. The set() function does not preserve the original
ordering, so it is not used directly here.'''
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

# create a list of stopwords. You can import stopwords from nltk too
stopwords = ["to", "was", "a"]
# list of special characters. You can use regular expressions too
special_char = [",", ":", " ", ";", ".", "?"]

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"

# convert strings to lower case
string1 = string1.lower()
string2 = string2.lower()
string3 = string3.lower()

# split the sentences into tokens
tokens1 = string1.split()
tokens2 = string2.split()
tokens3 = string3.split()
print(tokens1)
print(tokens2)
print(tokens3)

# create a vocabulary list
vocab = unique(tokens1 + tokens2 + tokens3)
print(vocab)

# filter the vocabulary list
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_char:
        filtered_vocab.append(w)
print("Final filtered vocabulary: ", filtered_vocab)

# convert sentences into vectors
vector1 = vectorize(tokens1)
print("Sentence 1 vector :", vector1)
vector2 = vectorize(tokens2)
print("Sentence 2 vector :", vector2)
vector3 = vectorize(tokens3)
print("Sentence 3 vector :", vector3)
Creating Bag of Words using sklearn library
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

string1 = "Data science is fun and interesting"
string2 = "Data science is fun"
string3 = "science is interesting"
doc = string1 + " " + string2 + " " + string3

CountVec = CountVectorizer(ngram_range=(1, 1))
# fit and transform the three sentences
Count_data = CountVec.fit_transform([string1, string2, string3])
# create a dataframe with one column per vocabulary word
cv_dataframe = pd.DataFrame(Count_data.toarray(),
                            columns=CountVec.get_feature_names_out())
print(cv_dataframe)
Note that CountVectorizer lowercases the text and sorts the vocabulary alphabetically before generating the vectors.
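For the three sentences above, the resulting dataframe should therefore look like this (columns in
alphabetical order):

   and  data  fun  interesting  is  science
0    1     1    1            1   1        1
1    0     1    1            0   1        1
2    0     0    0            1   1        1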
Count Occurrence
count_vec = CountVectorizer()
count_occurs = count_vec.fit_transform([doc])
# pair each vocabulary word with its count in the combined document
count_occur_df = pd.DataFrame(
    zip(count_vec.get_feature_names_out(), count_occurs.toarray()[0]),
    columns=['Word', 'Count'])
count_occur_df.sort_values('Count', ascending=False, inplace=True)
count_occur_df.head()
Normalized Count Occurrence
from sklearn.feature_extraction.text import TfidfVectorizer

norm_count_vec = TfidfVectorizer(use_idf=False, norm='l2')
norm_count_occurs = norm_count_vec.fit_transform([doc])
norm_count_occur_df = pd.DataFrame(
    zip(norm_count_vec.get_feature_names_out(), norm_count_occurs.toarray()[0]),
    columns=['Word', 'Count'])
norm_count_occur_df.sort_values('Count', ascending=False, inplace=True)
norm_count_occur_df.head()
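With use_idf=False and norm='l2', TfidfVectorizer simply divides each raw count by the Euclidean
(L2) norm of the document's count vector, i.e. count / sqrt(sum of squared counts), so the resulting
values reflect relative rather than absolute frequency and longer documents are not favoured purely
because of their length.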
import pandas as pd
import numpy as np

corpus = ['Natural language processing is fun and interesting',
          'Natural language processing is fun',
          'Hindi language is interesting']

# creating a word set for the corpus
words_set = set()
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))

print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)
Computing Term Frequency
# create a dataframe with one row per document and one column per word,
# then use it to compute the term frequency (TF)
n_docs = len(corpus)          # number of documents in the corpus
n_words_set = len(words_set)  # number of unique words in the corpus
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))

# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ')  # words in the document
    for w in words:
        df_tf.loc[i, w] = df_tf.loc[i, w] + (1 / len(words))
df_tf
Computing Inverse Document Frequency
print("IDF of: ")
idf = {}
for w in words_set:
k = 0 # number of documents in the corpus that contain this word
for i in range(n_docs):
if w in corpus[i].split():
k += 1
idf[w] = np.log10(n_docs / k)
print(f'{w:>15}: {idf[w]:>10}' )
df_tf_idf = df_tf.copy()
for w in words_set:
for i in range(n_docs):
df_tf_idf[w][i] = df_tf[w][i] * idf[w]
df_tf_idf
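Note that in this small corpus the words that occur in all three documents ("language" and "is") get
IDF = log10(3/3) = 0, so their TF-IDF is 0 in every document, while words that occur in only one
document ("and", "Hindi") get the largest IDF, log10(3) which is approximately 0.477.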
TF-IDF using sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_model = TfidfVectorizer()
tf_idf_vector = tf_idf_model.fit_transform(corpus)
tf_idf_array = tf_idf_vector.toarray()
print(tf_idf_array)
words_set = tf_idf_model.get_feature_names_out()
print(words_set)
df_tf_idf = pd.DataFrame(tf_idf_array, columns=words_set)
df_tf_idf
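Sample pipeline for the assignment (BBC Sports)
Putting the pieces together, below is a minimal sketch of the full assignment pipeline: text cleaning,
lemmatization with NLTK's WordNetLemmatizer, stop word removal, label encoding, TF-IDF
representation, and saving the outputs. The dataset path (a bbcsport/ folder with one sub-folder per
class) and the output file names are assumptions; adjust them to however your copy of the BBC
Sports dataset is stored.

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import load_files
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

# assumed layout: bbcsport/<class_name>/*.txt (e.g. athletics, cricket, football, rugby, tennis)
data = load_files('bbcsport', encoding='utf-8', decode_error='replace')
texts = data.data
class_names = [data.target_names[t] for t in data.target]

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # text cleaning: lowercase and keep letters only
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    # stop word removal and lemmatization
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

cleaned = [clean_text(t) for t in texts]

# label encoding of the class names
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(class_names)

# TF-IDF representation of the cleaned documents
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)

# save outputs (file names are assumptions)
tfidf_df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
tfidf_df['label'] = labels
tfidf_df.to_csv('bbcsport_tfidf.csv', index=False)
pd.DataFrame({'text': cleaned, 'label': labels}).to_csv('bbcsport_cleaned.csv', index=False)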
Conclusion:
Bag of Words, TF-IDF, and Word2Vec are text representation (embedding) techniques that
transform textual data into numerical form.