Module 1: Text Data Mining
Data Science Honours : SEM VIII
Dr. Jalpa Mehta
Contents
1.1 Introduction to Text Mining
1.2 Text Representation
Text Data Mining
• Text data mining is the process of extracting essential
information from natural-language text. The data we generate
through text messages, documents, emails, and files is written
in natural language. Text mining is primarily used to draw
useful insights or patterns from such data.
• The text mining market has grown rapidly over the last few
years and is expected to keep growing in the coming years.
• One of the primary reasons behind the adoption of text
mining is stronger competition in the business market: many
organizations are seeking value-added solutions to compete
with other organizations.
• With increasing competition in business and changing customer
perspectives, organizations are making huge investments to
find solutions capable of analyzing customer and competitor
data to improve competitiveness.
• The primary sources of data are e-commerce websites, social
media platforms, published articles, surveys, and many more.
• Most of the generated data is unstructured, which makes it
challenging and expensive for organizations to analyze
manually.
• This challenge, combined with the exponential growth in data
generation, has led to the growth of analytical tools. These
tools not only handle large volumes of text data but also
support decision-making.
• Text mining software enables a user to draw useful
information from a huge set of available data sources.
Information Extraction:
The automatic extraction of structured data, such as entities, relationships
between entities, and attributes describing entities, from an unstructured source
is called information extraction.
Natural Language Processing:
NLP stands for natural language processing. It enables computer software to
understand human language as it is spoken or written. NLP is a component of
artificial intelligence (AI).
• Developing NLP applications is difficult because computers generally expect
humans to "speak" to them in a programming language that is accurate, clear,
and highly structured.
• Human speech, however, is often imprecise and ambiguous; its meaning can
depend on many complex variables, including slang, social context, and regional
dialects.
Data Mining:
• Data mining refers to the extraction of useful data and hidden patterns from
large data sets. Data mining tools can predict behaviors and future trends,
allowing businesses to make better data-driven decisions.
• Data mining tools can be used to resolve many business problems that have
traditionally been too time-consuming to solve.
Information Retrieval:
• Information retrieval deals with retrieving useful data from the data stored
in our systems. As an analogy, the search engines found on e-commerce sites
and other websites are a form of information retrieval.
Text Mining Process
Text Pre-processing
• Text pre-processing involves the series of steps described below:
Text Cleanup
• Text cleanup means removing any unnecessary or unwanted
information, for example removing ads from web pages or
normalizing text converted from binary formats.
Tokenization
• In its simplest form, tokenization is achieved by splitting the
text on white space.
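A minimal sketch of the whitespace splitting described above, next to a slightly finer regular-expression tokenizer. The function names and the sample sentence are illustrative assumptions, not part of the course material.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of white space; punctuation stays attached to words."""
    return text.split()

def word_tokenize(text):
    """Separate words from punctuation marks with a simple regular expression."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Text mining finds patterns, quickly."
print(whitespace_tokenize(sentence))
print(word_tokenize(sentence))
```

Whitespace splitting leaves tokens such as "patterns," with the comma attached, which is why practical tokenizers usually handle punctuation separately.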
Part of Speech Tagging
• Part-of-Speech (POS) tagging assigns a word class to each
token. Its input is the tokenized text. Taggers have to cope with
unknown words (the out-of-vocabulary, or OOV, problem) and
ambiguous word-tag mappings.
Text Transformation (Attribute Generation)
• A text document is represented by the words it contains and their
occurrences. Two main approaches to document representation are:
i. Bag of words
ii. Vector Space
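As a rough illustration of the bag-of-words idea, the sketch below represents each document as a vector of word counts over a shared vocabulary. The two sample documents and the helper names are assumptions made for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the unique lower-cased words of all documents, sorted."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so word order is lost and only occurrences are kept.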
Feature Selection (Attribute Selection)
• Feature selection is also known as variable selection. It is the
process of selecting a subset of important features for use in model
creation.
• Redundant features are those that provide no extra information
beyond what other features already provide.
• Irrelevant features provide no useful or relevant information in
any context.
Data Mining
• At this point, the text mining process merges with the
traditional data mining process. Classic data mining techniques
are applied to the structured database that resulted from the
previous stages.
Evaluate
• Evaluate the result. After evaluation, the result can be discarded.
Applications
Algorithms for Text Mining
Information Extraction from Text Data
Information extraction is the process of extracting information from unstructured textual
sources to enable finding entities as well as classifying and storing them in a database.
• Information extraction targets specific (pre-specified) information in
textual sources.
• A simple example is an email client that extracts only the relevant data from a
message for you to add to your calendar.
How Does Information Extraction Work?
There are many subtleties and complex techniques involved in the process of information
extraction.
Typically, for structured information to be extracted from unstructured texts, the
following main subtasks are involved:
• Pre-processing of the text – the text is prepared for processing with the
help of computational linguistics tools such as tokenization, sentence splitting,
and morphological analysis.
• Finding and classifying concepts – mentions of people, things,
locations, events, and other pre-specified types of concepts are detected and classified.
• Connecting the concepts – relationships between the extracted concepts
are identified.
• Unifying – the extracted data is presented in a standard form.
• Getting rid of the noise – duplicate data is eliminated.
• Enriching your knowledge base – the extracted knowledge is ingested
into your database for further use.
Information extraction can be entirely automated or performed with the help of
human input.
Algorithms for Text Mining: Information Extraction from Text Data:
Unsupervised Learning Methods from Text Data:
Unsupervised learning, also known as unsupervised machine learning,
uses machine learning algorithms to analyse and cluster unlabeled
datasets. These algorithms discover hidden patterns or data groupings
without the need for human intervention.
Machine learning techniques have become ubiquitous for enhancing
user experiences and ensuring the quality of systems.
Unsupervised learning, in particular, offers an exploratory approach
to gleaning insights from data, identifying patterns in vast datasets
far faster than manual analysis can.
Text Summarization
• In this approach we build algorithms or programs that reduce
the size of the text and create a summary of our text data. This is
called automatic text summarization in machine learning.
• Text summarization is the process of creating a shorter text without
losing the meaning of the original text.
Extractive approaches
• An extractive approach summarizes the text using simple, traditional
algorithms that select parts of the original text.
• For example, to summarize a text with the frequency method, we store all the
important words and the frequency of each of those words in a dictionary.
• We then keep the sentences that contain the high-frequency words in our
final summary.
• This guarantees that every word in the summary also appears in the given
text.
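The frequency method above can be sketched as follows. This is one simple way to do it, not a definitive implementation: the stop-word list, the sentence-splitting rule, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "from"}

def summarize(text, num_sentences=1):
    """Keep the sentences whose words have the highest total frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)  # dictionary of important words and their frequencies

    def score(sentence):
        # Stop words are absent from freq, so they contribute 0.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

text = ("Text mining extracts patterns from text. The weather is nice. "
        "Mining text helps text analysis.")
print(summarize(text, num_sentences=2))
```

Because whole sentences are copied unchanged, the summary can only contain wording that already exists in the input.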
An abstractive approach is more advanced.
Depending on the requirements, it replaces some sentences with
shorter sentences that preserve the same meaning, so the summary
may contain wording that does not appear in the original text.
Text Mining in Social Media
Other Text Mining Algorithms
• Unsupervised Learning Methods from Text Data
• LSI and Dimensionality Reduction for Text Mining
• Supervised Learning Methods for Text Data
• Transfer Learning with Text Data
• Probabilistic Techniques for Text Mining
• Mining Text Streams
• Cross-Lingual Mining of Text Data
• Text Mining in Multimedia Networks
• Opinion Mining from Text Data
• Text Mining from Biomedical Data
1.2 INFORMATION EXTRACTION FROM TEXT
• Information extraction is the task of finding structured
information from unstructured or semi-structured text.
• It is an important task in text mining and has been
extensively studied in various research communities
including natural language processing, information
retrieval and Web mining.
• It has a wide range of applications in domains such as
biomedical literature mining and business intelligence.
By gathering detailed structured data from texts, information extraction
enables:
• The automation of tasks such as smart content classification,
integrated search, management, and delivery
• Data-driven activities such as mining for patterns and trends and
uncovering hidden relationships
Applications of Information Extraction
• Business intelligence: For enabling analysts to gather
structured information from multiple sources
• Financial investigation: For analysis and discovery of
hidden relationships
• Scientific research: For automated discovery of
references or suggestion of relevant papers
• Media monitoring: For mentions of companies,
brands, people
• Healthcare records management: For structuring and
summarizing patient records
• Pharma research: For drug discovery, discovery of adverse
effects, and automated analysis of clinical trials
Named Entity Recognition
Named entity recognition (NER) ‒ also called entity identification or
entity extraction ‒ is a natural language processing (NLP) technique
that automatically identifies named entities in a text and classifies
them into predefined categories.
Entities can be names of people, organizations, locations, times,
quantities, monetary values, percentages, and more.
With named entity recognition, you can extract key information to understand what a
text is about, or merely use it to collect important information to store in a database.
Named Entity Recognition Example
• Suppose a model is needed to identify the name, organization, and job
profile of candidates in order to find relevant candidates on a job portal.
• The task at hand is to extract these categories from the bios of the
candidates.
• Consider an example bio: “John has been working at Amazon as a Full Stack
Developer”.
• For a human, such as an HR professional, it is easy to pick out the details
about John, i.e. name, organization, and job profile.
• But what if there are thousands of bios to check for a relevant candidate? This is
where NER comes in.
• The single sentence from John's bio will be processed by a
NER model like this:
• John[name] has been working at Amazon[organization] as a Full
Stack Developer[job profile].
• Named entity recognition automatically categorizes data from text
based on its previous training.
• It has many applications because, being machine-based, it is not
affected by the cognitive overload that large workloads cause in humans.
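A real NER model learns its categories from annotated training data. The sketch below is only a stand-in that reproduces the tagged output for the example bio: it uses a hand-written lookup table (the hypothetical GAZETTEER) instead of a trained model, and all names in it are assumptions.

```python
import re

# Hypothetical lookup table standing in for what a trained model would learn.
GAZETTEER = {
    "name": ["John"],
    "organization": ["Amazon"],
    "job profile": ["Full Stack Developer"],
}

def tag_entities(text):
    """Return (phrase, label) pairs in the order they appear in the text."""
    found = []
    for label, phrases in GAZETTEER.items():
        for phrase in phrases:
            for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
                found.append((match.start(), phrase, label))
    return [(phrase, label) for _, phrase, label in sorted(found)]

bio = "John has been working at Amazon as a Full Stack Developer"
for phrase, label in tag_entities(bio):
    print(f"{phrase}[{label}]")
```

A lookup table cannot recognize names it has never seen, which is the gap that trained NER models are meant to close.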
Relation Extraction
• Relation extraction is a subtask of information extraction that
aims to identify relations between entities and assign them a
label or class.
• For example, given the sentence “Barack Obama was the president
of the USA.”, Barack Obama is related to the USA as 'president'. A
relation extractor aims to predict the 'president' relationship
between these two entities.
• Relations are among the most important building blocks of
knowledge graphs, which matter to various NLP applications
such as summarization, Q&A, search, and sentiment analysis.
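To make the 'president' example concrete, here is a minimal pattern-based sketch. It assumes sentences of the form "X was/is the R of Y" and is far simpler than the learned extractors used in practice; the pattern and function name are illustrative assumptions.

```python
import re

# Matches "<subject> was|is the <relation> of [the] <object>."
PATTERN = re.compile(
    r"^(?P<subject>.+?) (?:was|is) the (?P<relation>\w+) of (?:the )?(?P<object>.+?)\.?$"
)

def extract_relation(sentence):
    """Return a (subject, relation, object) triple, or None if no match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match.group("subject"), match.group("relation"), match.group("object"))

print(extract_relation("Barack Obama was the president of the USA."))
```

Triples of this shape are what a knowledge graph stores: two entity nodes connected by a labeled edge.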
Unsupervised Information Extraction
The goal of unsupervised relation extraction is to extract relations from text when
we have no labeled training data, and not even a list of relations.
This task is often called Open Information Extraction or Open IE.
Let's take an example:
United has a hub in Chicago, which is the headquarters of United Continental
Holdings.
The ReVerb system finds United to the left and Chicago to the right of "has a hub in",
and skips over "which" to find Chicago to the left of "is the headquarters of".
The final output is:
r1: <United, has a hub in, Chicago>
r2: <Chicago, is the headquarters of, United Continental Holdings>
Text Representation
•Conversion of raw text to a suitable numerical form is called text
representation.
OR
•The raw text corpus is preprocessed and transformed into a number of
text representations that are input to the machine learning model.
•The data is pushed through a series of preprocessing tasks such as
tokenization, stopword removal, punctuation removal, stemming, and
many more. We clean the data of any noise present.
•This cleaned data is represented in various forms according to the
purpose of the application and the input requirements of the machine
learning model.
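The preprocessing tasks listed above can be chained into a small pipeline. The sketch below covers lower-casing, punctuation removal, tokenization, and stopword removal; the stop-word set and the sample sentence are assumptions chosen for illustration.

```python
import string

# Small illustrative stop-word set (an assumption, not a standard list).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Lower-case, strip punctuation, tokenize, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOPWORDS]

print(preprocess("The quality of the data is important, and noise hurts!"))
```

The cleaned token list is what later representation steps take as input.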
Feature representation is a common step in any ML project, whether the data is
text, images, videos, or speech.
Common Terms Used While Representing Text
Corpus (C): All the text data or records of the dataset together are known as a
corpus.
Vocabulary (V): This consists of all the unique words present in the corpus.
Document (D): One single text record of the dataset is a document.
Word (W): The words present in the vocabulary.
Tokenization: One of the first steps in text preprocessing is tokenization. It is the
process of creating tokens. Tokens can be thought of as the building units of the text
sequence (the data).
As explained above, these tokens can be:
•Characters
•Words (individual words or sets of multiple words together)
•Parts of words (subwords)
•Punctuation marks
•Sentences
•Strings matched by regular expressions
Stemming
The process of removing affixes from a word so that we are left with the stem of that
word is called stemming.
For example, the words ‘run’, ‘running’, and ‘runs’ all reduce to the root
word ‘run’ after stemming is applied to them.
One crucial point about stems is that they need not be meaningful words. For
example, a stemmer may reduce ‘traditional’ to ‘tradi’, which has no meaning.
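The idea of affix removal can be shown with a deliberately naive suffix stripper. This is not the Porter, Snowball, or Lancaster algorithm; the suffix list and the rule for undoubling a final consonant are simplifying assumptions made for this sketch.

```python
# Suffixes tried in order; a tiny assumed list, far smaller than real stemmers use.
SUFFIXES = ("ing", "es", "s")

def naive_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant, e.g. 'runn' -> 'run'.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for word in ["run", "running", "runs"]:
    print(word, "->", naive_stem(word))
```

Library implementations, such as the PorterStemmer, SnowballStemmer, and LancasterStemmer classes in NLTK, apply far more elaborate rule sets than this.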
Why use Stemming
The benefits of using a stemming algorithm in an NLP project can be summarized as
follows:
1. It reduces the number of words that serve as input to the Machine
Learning/Deep Learning model.
2. It reduces the confusion around words that have similar meanings.
3. It lowers the complexity of the input space.
4. When creating applications that search for specific text in a document, using
stemming for indexing assists in retrieving relevant documents.
5. It helps mitigate the out-of-vocabulary (OOV) problem. For example, if
the vocabulary does not contain the word ‘oranges’, one can use the stem word
‘orange’ as a proxy.
6. It can improve the accuracy of the ML/DL model, as the model does not have to
deal with inflected word forms.
Types of Stemming
Three popular stemming algorithms are Porter, Snowball, and Lancaster.
N-gram Modelling
An N-gram is a sequence of N consecutive words in a text.
• Consider the following example statement: “I love reading
history books and watching documentaries”.
• A one-gram, or unigram, is a one-word sequence. For the above
statement, the unigrams are “I”, “love”, “reading”, “history”, “books”,
“and”, “watching”, “documentaries”.
• A two-gram, or bigram, is a two-word sequence, e.g. “I love”,
“love reading”, or “history books”.
• A three-gram, or trigram, is a three-word sequence, e.g.
“I love reading”, “reading history books”, or “and watching documentaries”.
• The illustration of N-gram modeling for N = 1, 2, 3 is given below
in Figure
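The N-grams of the example statement can be generated with a short sliding-window function. The function name is an assumption; the sentence is the one used above.

```python
def ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love reading history books and watching documentaries"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of 8 words yields 8 unigrams, 7 bigrams, and 6 trigrams, since each window starts one word further along.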
• Given the previous N-1 words, an N-gram model predicts the words most
likely to follow that sequence.
• The model is a probabilistic language model trained on a
collection of text.
• This model is useful in applications such as speech recognition and machine
translation, where it helps produce more natural statements in the target
language.
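For N = 2, the prediction step above can be sketched as a bigram model that estimates P(next word | previous word) from counts. The three-sentence training corpus and the function names are assumptions for illustration; a practical model would also need smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Return the most frequent follower of the word, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def probability(model, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, *)."""
    followers = model.get(prev.lower(), Counter())
    total = sum(followers.values())
    return followers[nxt.lower()] / total if total else 0.0

corpus = [
    "I love reading history books",
    "I love watching documentaries",
    "I love reading novels",
]
model = train_bigram_model(corpus)
print(predict_next(model, "love"))
print(probability(model, "love", "reading"))
```

Here "reading" follows "love" in two of the three training sentences, so it is the predicted next word.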
Overview of Text and Web Mining process