Module 1 Part1

The document provides an overview of text data mining, highlighting its importance in extracting valuable insights from unstructured text data generated through various sources. It discusses key concepts such as information extraction, natural language processing, and data mining techniques, along with the challenges posed by unstructured data. Additionally, it covers applications of text mining, including named entity recognition and relation extraction, emphasizing the growing significance of these technologies in business and research.

Uploaded by

vkarmayogesh2514

Module 1: Text Data Mining

Data Science Honours: SEM VIII

Dr. Jalpa Mehta
Contents
1.1 Introduction to Text Mining
1.2 Text Representation
Text Data Mining

• Text data mining is the process of extracting essential information from natural-language text. All the data that we generate through text messages, documents, emails, and files is written in natural language. Text mining is primarily used to draw useful insights or patterns from such data.

• The text mining market has grown rapidly over the last few years and is expected to see significant further growth and adoption in the coming years.
• One of the primary reasons behind the adoption of text mining is stronger competition in the business market: many organizations are seeking value-added solutions to compete with other organizations.

• With increasing competition in business and changing customer perspectives, organizations are making large investments in solutions capable of analyzing customer and competitor data to improve competitiveness.
• The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more.

• The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyze manually.

• This challenge, combined with the exponential growth in data generation, has led to the growth of analytical tools. These tools not only handle large volumes of text data but also support decision-making.

• Text mining software empowers a user to draw useful information from a huge set of available data sources.
Information Extraction:
The automatic extraction of structured data, such as entities, relationships between entities, and attributes describing entities, from an unstructured source is called information extraction.
Natural Language Processing:
NLP stands for natural language processing: the ability of computer software to understand human language as it is spoken and written. NLP is primarily a component of artificial intelligence (AI).
• Developing NLP applications is difficult because computers generally expect humans to "speak" to them in a programming language that is accurate, clear, and exceptionally structured.
• Human speech, however, is often imprecise and depends on many complex variables, including slang, social context, and regional dialects.

Data Mining:
• Data mining refers to the extraction of useful information and hidden patterns from large data sets. Data mining tools can predict behaviors and future trends, which allows businesses to make better data-driven decisions.
• Data mining tools can be used to resolve many business problems that have traditionally been too time-consuming to resolve.
Information Retrieval:

• Information retrieval deals with retrieving useful data from the data stored in our systems. As an analogy, the search engines on e-commerce sites or any other websites can be viewed as part of information retrieval.
Text Mining Process
Text Pre-processing
• It involves a series of steps, as shown below:
Text Cleanup
• Text cleanup means removing any unnecessary or unwanted information, such as removing ads from web pages or normalizing text converted from binary formats.

Tokenization
• In its simplest form, tokenization is achieved by splitting the text on whitespace.
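The whitespace splitting described above can be sketched in a few lines of Python. The sample sentence and the regular expression are illustrative assumptions, not part of the original slides; `str.split()` with no arguments splits on any run of whitespace.

```python
import re

text = "Text mining draws useful insights from unstructured text."

# Whitespace tokenization: split on any run of spaces, tabs, or newlines.
tokens = text.split()
print(tokens)

# Punctuation stays attached to words ("text."), so a common refinement
# is to pull out runs of word characters with a regular expression.
word_tokens = re.findall(r"\w+", text.lower())
print(word_tokens)
```

Plain whitespace splitting is fast but leaves punctuation glued to words, which is why practical tokenizers usually add rules on top of it.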

Part of Speech Tagging

• Part-of-speech (POS) tagging assigns a word class to each token. Its input is the tokenized text. Taggers have to cope with unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings.
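To make the two difficulties concrete, here is a toy lookup tagger. It is only a sketch: the lexicon, the tag names, and the default-tag rule are all hypothetical choices made for illustration, and real taggers resolve ambiguity from context rather than always taking the first candidate.

```python
# Hypothetical hand-made lexicon: word -> possible tags, most likely first.
LEXICON = {
    "the": ["DET"],
    "dog": ["NOUN"],
    "can": ["AUX", "NOUN", "VERB"],  # ambiguous word-tag mapping
    "run": ["VERB", "NOUN"],         # ambiguous word-tag mapping
    "fast": ["ADV", "ADJ"],
}

def tag(tokens, default_tag="NOUN"):
    """Give each token its first listed tag; unknown (OOV) words get default_tag."""
    tagged = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        tagged.append((token, candidates[0] if candidates else default_tag))
    return tagged

tagged = tag("The dog can run fast".split())
oov = tag(["zorblax"])  # made-up word that is not in the lexicon
print(tagged)
print(oov)
```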
Text Transformation (Attribute Generation)

• A text document is represented by the words it contains and their occurrences. Two main approaches to document representation are:
i. Bag of words
ii. Vector Space
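A minimal sketch of both representations, using only the Python standard library. The two sample documents are made up for illustration, and raw counts are used as vector weights (an assumption; other weighting schemes exist).

```python
from collections import Counter

docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
]

# Bag of words: each document becomes a multiset of its words (order is lost).
bags = [Counter(doc.split()) for doc in docs]

# Vector space: fix one shared, sorted vocabulary and map each document
# to a vector of counts over that vocabulary.
vocabulary = sorted(set(word for bag in bags for word in bag))
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Because every document is mapped onto the same vocabulary, the resulting vectors have equal length and can be compared or fed to a model directly.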

Feature Selection (Attribute Selection)

• Feature selection is also known as variable selection. It is the process of selecting a subset of important features for use in model creation.
• Redundant features are those that provide no extra information beyond the features already selected.
• Irrelevant features provide no useful or relevant information in any context.
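The idea can be sketched with two simple heuristics on a small document-term matrix. This is an illustrative assumption rather than a standard algorithm: a column that never varies is treated as a crude stand-in for an irrelevant feature, and a column identical to one already kept is treated as redundant. The function name and the data are hypothetical.

```python
def select_features(names, rows):
    """Keep columns that vary across rows and do not duplicate a kept column."""
    kept_names, kept_columns = [], []
    for index, name in enumerate(names):
        column = tuple(row[index] for row in rows)
        if len(set(column)) == 1:
            continue  # constant column: cannot distinguish any documents
        if column in kept_columns:
            continue  # redundant: identical to a feature already kept
        kept_names.append(name)
        kept_columns.append(column)
    return kept_names

names = ["data", "mining", "text", "textual"]
rows = [
    [0, 1, 2, 2],
    [2, 1, 0, 0],
    [1, 1, 1, 1],
]
selected = select_features(names, rows)
print(selected)
```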
Data Mining
• At this point, the text mining process merges with the traditional data mining process. Classic data mining techniques are applied to the structured database that resulted from the previous stages.

Evaluate
• Evaluate the result. After evaluation, the result may be discarded.
Applications
Algorithms for Text Mining
Information Extraction from Text Data

Information extraction is the process of extracting information from unstructured textual sources to enable finding entities as well as classifying and storing them in a database.

• Information extraction is the process of extracting specific (pre-specified) information from textual sources.

• One of the most trivial examples is when your email client extracts only the relevant data from a message for you to add to your calendar.
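The calendar example can be sketched with two regular expressions. The message text, the date and time formats, and the `event` dictionary are assumptions made for illustration; a real email client would need to handle many more formats.

```python
import re

message = "Hi team, the project review is on 12 March 2025 at 3:30 PM in Room 4."

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Dates such as "12 March 2025" and times such as "3:30 PM" (assumed formats).
date_pattern = re.compile(rf"\b(\d{{1,2}}) ({MONTHS}) (\d{{4}})\b")
time_pattern = re.compile(r"\b(\d{1,2}):(\d{2}) ?(AM|PM)\b")

date_match = date_pattern.search(message)
time_match = time_pattern.search(message)

# Only the pre-specified fields are kept; the rest of the message is ignored.
event = {
    "date": date_match.group(0) if date_match else None,
    "time": time_match.group(0) if time_match else None,
}
print(event)
```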
How Does Information Extraction Work?
There are many subtleties and complex techniques involved in the process of information extraction. Typically, for structured information to be extracted from unstructured texts, the following main subtasks are involved:

• Pre-processing of the text – this is where the text is prepared for processing with the help of computational linguistics tools such as tokenization, sentence splitting, morphological analysis, etc.

• Finding and classifying concepts – this is where mentions of people, things, locations, events and other pre-specified types of concepts are detected and classified.

• Connecting the concepts – this is the task of identifying relationships between the extracted concepts.

• Unifying – this subtask is about presenting the extracted data in a standard form.

• Getting rid of the noise – this subtask involves eliminating duplicate data.

• Enriching your knowledge base – this is where the extracted knowledge is ingested into your database for further use.

Information extraction can be entirely automated or performed with the help of human input.
Unsupervised Learning Methods from Text Data:

Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyse and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.

Machine learning techniques have become ubiquitous for enhancing user experiences and ensuring the quality of systems.

Unsupervised learning, in particular, offers an exploratory approach to gleaning insights from data, identifying patterns in vast datasets far more quickly than manual analysis.
Text Summarization
• In this approach we build algorithms or programs that reduce the size of the text and create a summary of our text data. In machine learning this is called automatic text summarization.
• Text summarization is the process of creating a shorter text without losing the semantic structure of the original.
Extractive approaches

• Using an extractive approach, we summarize our text on the basis of simple and traditional algorithms.

• For example, when we want to summarize our text using the frequency method, we store all the important words and the frequency of each of those words in a dictionary.

• On the basis of the high-frequency words, we store the sentences containing those words in our final summary.

• This guarantees that every word in our summary also appears in the given text.
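The frequency method above can be sketched as follows. This is one possible reading of the bullets, not a definitive implementation: the stopword list, the sentence-splitting rule, the scoring by summed word frequency, and the sample text are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "was", "and", "on", "from"}

def words_of(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())

def summarize(text, num_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Dictionary of important (non-stopword) words and their frequencies.
    frequency = Counter(w for w in words_of(text) if w not in STOPWORDS)

    # Score each sentence by the total frequency of the words it contains.
    def score(sentence):
        return sum(frequency[w] for w in words_of(sentence))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]

text = ("Text mining extracts patterns from text. "
        "The weather was pleasant on Monday. "
        "Text mining tools help analysts.")
summary = summarize(text, num_sentences=2)
print(summary)
```

Every sentence in the summary is copied verbatim from the input, which is what makes the approach extractive.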
Abstractive approaches

• An abstractive approach is more advanced.

• Depending on the time requirements, we replace some sentences with shorter sentences that carry the same meaning as our text data.
Text Mining in Social Media
Other Text Mining Algorithms
• Unsupervised Learning Methods from Text Data
• LSI and Dimensionality Reduction for Text Mining
• Supervised Learning Methods for Text Data
• Transfer Learning with Text Data
• Probabilistic Techniques for Text Mining
• Mining Text Streams
• Cross-Lingual Mining of Text Data
• Text Mining in Multimedia Networks
• Opinion Mining from Text Data
• Text Mining from Biomedical Data
1.2 INFORMATION EXTRACTION FROM TEXT
• Information extraction is the task of finding structured information from unstructured or semi-structured text.

• It is an important task in text mining and has been extensively studied in various research communities including natural language processing, information retrieval and Web mining.

• It has a wide range of applications in domains such as biomedical literature mining and business intelligence.
By gathering detailed structured data from texts, information extraction enables:

• The automation of tasks such as smart content classification, integrated search, management and delivery

• Data-driven activities such as mining for patterns and trends, uncovering hidden relationships, etc.
Applications of information extraction
• Business intelligence: For enabling analysts to gather structured information from multiple sources
• Financial investigation: For analysis and discovery of hidden relationships
• Scientific research: For automated discovery of references or suggestion of relevant papers
• Media monitoring: For mentions of companies, brands, people
• Healthcare records management: For structuring and summarizing patient records
• Pharma research: For drug discovery, adverse effects discovery, and automated analysis of clinical trials
Named Entity Recognition

Named entity recognition (NER), also called entity identification or entity extraction, is a natural language processing (NLP) technique that automatically identifies named entities in a text and classifies them into predefined categories.

Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more.
With named entity recognition, you can extract key information to understand what a
text is about, or merely use it to collect important information to store in a database.
Named Entity Recognition Example
• Suppose a model is required to extract the name, organization, and job profile from candidate bios in order to find relevant candidates on a job portal.

• The task at hand is to categorize this data from the bio of each candidate.

• Take as an example the bio “John has been working at Amazon as a Full Stack Developer”.

• For a human, such as an HR professional, it is easy to pick out the details about John, i.e. name, organization, and job profile.

• But what if there are thousands of bios to check for a relevant candidate? This is where NER comes in.
• An NER model would process the single sentence from John's bio like this:

• John[name] has been working at Amazon[organization] as a Full Stack Developer[job profile].

• Named entity recognition automatically categorizes data from text based on its previous training.

• It has many applications because it is machine-based and, unlike a human, is not subject to cognitive overload from a large workload.
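The labeled output above can be reproduced with a simple lookup, shown here as a sketch. Note the swap: this is a rule-based gazetteer (dictionary) matcher, not a trained NER model, and the `GAZETTEER` lists and `tag_entities` function are hypothetical names introduced for illustration.

```python
import re

# Hypothetical lookup lists; a trained NER model would instead learn
# these categories from labeled examples.
GAZETTEER = {
    "name": ["John"],
    "organization": ["Amazon"],
    "job profile": ["Full Stack Developer"],
}

def tag_entities(text):
    """Return (phrase, label) pairs in the order they appear in the text."""
    found = []
    for label, phrases in GAZETTEER.items():
        for phrase in phrases:
            for match in re.finditer(re.escape(phrase), text):
                found.append((match.start(), match.group(0), label))
    return [(phrase, label) for _, phrase, label in sorted(found)]

bio = "John has been working at Amazon as a Full Stack Developer"
entities = tag_entities(bio)
print(entities)
```

A lookup like this only finds entities it already knows, which is the main reason trained models are preferred when bios mention unseen names and companies.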
Relation Extraction
• Relation extraction is a subtask of information extraction that aims to identify relations between entities and assign them a label or class.

• For example, given the sentence “Barack Obama was the president of the USA.”, Barack Obama is related to the USA as 'president'. A relation extractor aims at predicting the 'president' relationship between these two entities.

• Relations are one of the most important building blocks of knowledge graphs, which are important to various NLP applications like summarization, Q&A, search, sentiment analysis, etc.
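As a sketch of the idea, a single hand-written pattern can pull the relation out of the example sentence. This is a pattern-based stand-in, not a trained relation extractor; the regular expression and the `extract_relation` function are assumptions that only cover sentences of the form "X was the R of Y".

```python
import re

# One hand-written pattern: "<subject> was the <relation> of (the) <object>."
PATTERN = re.compile(
    r"^(?P<subject>.+?) was the (?P<relation>\w+) of (?:the )?(?P<object>.+?)\.?$"
)

def extract_relation(sentence):
    """Return a (subject, relation, object) triple, or None if no match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match.group("subject"), match.group("relation"), match.group("object"))

triple = extract_relation("Barack Obama was the president of the USA.")
no_match = extract_relation("Text mining is useful.")
print(triple)
print(no_match)
```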
Unsupervised Information Extraction
The goal of unsupervised relation extraction is to extract relations from the text when we have no labeled training data and not even a list of relations.

This task is often called Open Information Extraction or Open IE.

Let's take an example:
United has a hub in Chicago, which is the headquarters of United Continental Holdings.

ReVerb finds United to the left and Chicago to the right of "has a hub in", and skips over "which" to find Chicago to the left of "is the headquarters of".

The final output is the two relation triples described above:
(United, has a hub in, Chicago)
(Chicago, is the headquarters of, United Continental Holdings)

Text Representation
• Conversion of raw text to a suitable numerical form is called text representation.
OR

• The raw text corpus is preprocessed and transformed into a number of text representations that are input to the machine learning model.

• The data is pushed through a series of preprocessing tasks such as tokenization, stopword removal, punctuation removal, stemming, and many more. We clean the data of any noise present.

• This cleaned data is represented in various forms according to the purpose of the application and the input requirements of the machine learning model.
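A few of the preprocessing tasks named above (punctuation removal, tokenization, stopword removal) can be sketched with the standard library. The stopword list and the sample sentence are illustrative assumptions; stemming is left out of this sketch.

```python
import string

STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOPWORDS]

tokens = preprocess("The raw text is cleaned, tokenized, and filtered!")
print(tokens)
```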
Feature representation is a common step in any ML project, whether the data is text, images, videos, or speech.
Common Terms Used While Representing Text

Corpus (C): All the text data or records of the dataset together are known as a corpus.

Vocabulary (V): This consists of all the unique words present in the corpus.

Document (D): One single text record of the dataset is a document.

Word (W): The words present in the vocabulary.

Tokenization: One of the first steps in text preprocessing is tokenization. It is the process of creating tokens. Tokens can be thought of as the building units of the text sequence (the data).
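The terms above map directly onto simple Python objects. The two sample documents are made up for illustration, and whitespace splitting is assumed as the tokenizer.

```python
# Corpus C: all documents together. Each string is one document D.
corpus = [
    "text mining finds patterns",
    "mining text finds insights",
]

# Tokenization: split each document into word tokens.
tokenized = [document.split() for document in corpus]

# Vocabulary V: the unique words W across the whole corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

print(len(corpus), "documents")
print(vocabulary)
```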
These tokens can be:
• Characters
• Words (individual words or sets of multiple words together)
• Parts of words
• Punctuation marks
• Sentences
• Regular expressions
Stemming
Stemming is the process of removing affixes from a word so that we are left with the stem of that word.

For example, the words ‘run’, ‘running’, and ‘runs’ all reduce to the root word ‘run’ after stemming is applied to them.

One crucial point about stems is that they need not be meaningful words. For example, a stemmer may reduce ‘traditional’ to a stem such as ‘tradi’, which has no meaning on its own.
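The ‘run’ example can be reproduced with a deliberately naive suffix stripper. This is a toy sketch and not any of the published stemming algorithms; the suffix list, the minimum stem length, and the consonant-undoubling rule are all assumptions chosen to handle this one example.

```python
SUFFIXES = ("ing", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip one common suffix; a toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least 3 remaining letters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling after "-ing": "running" -> "runn" -> "run".
            if (suffix == "ing" and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word

stems = [naive_stem(word) for word in ["run", "running", "runs"]]
print(stems)
```

Rules this simple break on many words, which is why established stemmers use longer, carefully ordered rule sets.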
Why Use Stemming
The benefits of using a stemming algorithm in an NLP project can be summarized as follows:

1. It reduces the number of distinct words that serve as input to the machine learning/deep learning model.

2. It minimizes the confusion around words that have similar meanings.

3. It lowers the complexity of the input space.

4. When creating applications that search for specific text in a document, using stemming for indexing assists in retrieving relevant documents.

5. It helps mitigate the out-of-vocabulary (OOV) problem. For example, if the vocabulary does not contain the word ‘oranges’, one can use the stem word ‘orange’ as a proxy.

6. It can improve the accuracy of the ML/DL model, as the model does not have to deal with inflected word forms.
Types of Stemming

Three popular stemming algorithms are Porter, Snowball, and Lancaster.

N-gram Modelling
An N-gram is a sequence of N consecutive words, used in NLP language modeling.

• Consider the example statement “I love reading history books and watching documentaries”.

• In a one-gram or unigram, there is a one-word sequence. For the above statement, the unigrams are “I”, “love”, “reading”, “history”, “books”, “and”, “watching”, “documentaries”.

• In a two-gram or bigram, there is a two-word sequence, e.g. “I love”, “love reading”, or “history books”.

• In a three-gram or trigram, there is a three-word sequence, e.g. “I love reading”, “reading history books”, or “and watching documentaries”.

• An illustration of N-gram modeling for N = 1, 2, 3 is given in the figure below.
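The unigrams, bigrams, and trigrams of the example statement can be generated with a sliding window. This is a minimal sketch; the helper name `ngrams` and the use of whitespace tokenization are assumptions.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading history books and watching documentaries"
tokens = sentence.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of 8 words yields 8 unigrams, 7 bigrams, and 6 trigrams, since each window of size N can start at only 8 - N + 1 positions.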
• Given the previous N-1 words, an N-gram model predicts the words most likely to follow the sequence.

• The model is a probabilistic language model that is trained on a collection of text.

• This model is useful in applications such as speech recognition and machine translation, where it helps produce more natural sentences in the target language.
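For N = 2, the prediction step can be sketched with simple counts: the probability of a word given the previous word is estimated as count(previous, word) / count(previous followed by anything). The three-sentence training corpus and the function names are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

corpus = [
    "i love reading history books",
    "i love watching documentaries",
    "i love reading novels",
]

# follow_counts[previous][current] = how often `current` follows `previous`.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for previous, current in zip(tokens, tokens[1:]):
        follow_counts[previous][current] += 1

def predict_next(previous):
    """Most frequent follower of `previous`, or None if it was never followed."""
    counts = follow_counts.get(previous)
    return counts.most_common(1)[0][0] if counts else None

def probability(word, previous):
    """Estimated P(word | previous) from raw bigram counts (no smoothing)."""
    counts = follow_counts.get(previous)
    return counts[word] / sum(counts.values()) if counts else 0.0

print(predict_next("love"))
print(probability("reading", "love"))
```

Without smoothing, any word pair that never occurred in training gets probability zero, which practical language models avoid by redistributing some probability mass to unseen pairs.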
Overview of Text and Web Mining process
