Chapter 6
INTRODUCTION TO DATA MINING
Learning objectives:
After this lesson, you will be able to:
Define data mining
Describe the various techniques in the data mining process
Understand the KDD process model
Describe the various phases of CRISP-DM
Identify applications of data mining
Definition of Data Mining
Data mining is the process of discovering interesting knowledge, such as
previously unknown patterns, associations or significant structures, from large
amounts of data stored in databases, data warehouses or other information
repositories.
Another definition: data mining is an iterative process of creating predictive
and descriptive models by uncovering previously unknown trends and patterns in
vast amounts of data in order to support decision making.
Data mining is a subset of Business Analytics
There is a need to turn data into useful information and knowledge for broad
applications including
Market analysis
Business management
Decision support
Customer segmentation and behavior
Etc.
How does data mining work?
Data mining builds models to discover patterns
among the attributes present in the data set.
Models are:
Mathematical representations (from simple linear
relationships to highly non-linear
relationships) that identify patterns among
the attributes of things, such as customers and
the products they buy
Some of these patterns are explanatory and
others are predictive (foretelling future values
of certain attributes)
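As a minimal sketch of the simplest kind of model mentioned above, a linear relationship between two attributes can be fitted by ordinary least squares and then used to foretell a future value. The attribute names and the numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical data: store visits per month (x) and items purchased (y).
visits = [1, 2, 3, 4, 5]
purchases = [4, 7, 10, 13, 16]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Explanatory use: the fitted coefficients describe the pattern.
slope, intercept = fit_line(visits, purchases)
print(f"purchases = {slope:.2f} * visits + {intercept:.2f}")

# Predictive use: foretell the value for an unseen number of visits.
predicted = slope * 6 + intercept
print(f"predicted purchases for 6 visits: {predicted:.2f}")
```

The same fitted line serves both roles: its coefficients explain the relationship, and evaluating it at a new point gives a prediction.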
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
What is (not) Data Mining?
What is not Data Mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”
What is Data Mining?
– Finding that certain names are more prevalent in certain US
locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Grouping together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Examples of data mining applications
Regarding temporal data, for instance, banking data can be mined
for changing trends, which may aid in the scheduling of bank tellers
according to the volume of customer traffic.
Stock exchange data can be mined to uncover trends that could help
in planning investment strategies.
Computer network data streams can be mined to detect intrusions
based on the anomaly of message flows, which may be discovered
by clustering, dynamic construction of stream models or by
comparing the current frequent patterns with those at a previous
time.
With spatial data, we may look for patterns that describe changes in
metropolitan poverty rates based on city distances from major
highways. By examining the relationships among a set of spatial
objects, we can discover which subsets of objects are spatially
autocorrelated or associated.
Industry examples of DM applications
Sales/ Marketing
Identify buying patterns from customers
Find the association among customer demographic characteristics
Banking
Credit card fraud detection
Identify ‘loyal’ customers
Insurance and Health Care
Claims analysis i.e., which medical procedures are claimed together
Predict the customers who will buy new policies
Transportation
Determine the distribution schedules for the outlets
Analyze loading patterns
Medicine
Characterize patient behavior in order to predict office visits
Identify successful medical therapies for different diseases / illnesses
Take a break….
Watch a video
Source of data mining
https://www.youtube.com/watch?v=Y_JlkzzhAgw
Data Mining Tasks, Methods and Algorithms
Prediction
Prediction refers to the act of telling about the
future by taking into account experience,
opinions and other relevant information.
Depending on the nature of what is being
predicted, prediction can be named more specifically:
Classification (the predicted thing, such as
tomorrow’s forecast, is a class label such as
“rainy” or “sunny”)
Regression (the predicted thing, such as tomorrow’s
temperature, is a real number such as 65 °F)
Time-series forecasting (the data consists of values of the
same variable captured and stored over
time at regular intervals, such as a stock price)
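To make the time-series case concrete, here is a small sketch that forecasts the next value of a regularly sampled variable as the average of its most recent values. The moving average is only one simple technique, chosen here for illustration rather than taken from the slides, and the prices are made up.

```python
# Hypothetical closing prices of one stock, one value per day.
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]

def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)

forecast = moving_average_forecast(prices)
print(f"forecast for the next interval: {forecast:.2f}")
```

A larger window smooths out short-term noise but reacts more slowly to a real change in trend.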
Prediction techniques
Classification: assign a new data record to one of several
predefined categories or classes. It is a form of supervised
learning.
Classification approaches normally use a training set where
all objects are already associated with known class labels.
The classification algorithm learns from the training set
and builds a model. The model is used to classify new
objects.
This method has been used in customer segmentation,
business modeling, and credit analysis.
For example, after starting a credit policy, the
OurVideoStore managers could analyze the customers’
behaviour with respect to their credit, and label the
customers who received credit with one of three possible labels:
“safe”, “risky” and “very risky”. The classification analysis
would generate a model that could be used to either
accept or reject credit requests in the future.
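The train-then-classify workflow described above can be sketched with a k-nearest-neighbour classifier. This is one of many possible classification algorithms, used here only as an illustration. The customer attributes (income in thousands, number of late payments) and all records are hypothetical, not actual OurVideoStore data.

```python
import math
from collections import Counter

# Training set: every object already carries a known class label.
# Features (assumed): (income in thousands, number of late payments).
training_set = [
    ((60, 0), "safe"), ((55, 1), "safe"), ((70, 0), "safe"),
    ((40, 3), "risky"), ((35, 4), "risky"), ((45, 3), "risky"),
    ((20, 8), "very risky"), ((15, 9), "very risky"), ((25, 7), "very risky"),
]

def classify(customer, k=3):
    """Label a new customer by majority vote of the k closest training records."""
    nearest = sorted(training_set,
                     key=lambda record: math.dist(record[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Classify new objects whose labels are not yet known.
for applicant in [(58, 1), (38, 4), (18, 8)]:
    print(applicant, "->", classify(applicant))
```

In practice the features would be rescaled first, since an attribute with a large numeric range (income) otherwise dominates the distance.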
Associations
Association rule learning is a popular and
well-researched data mining technique for
discovering interesting relationships among
variables in large databases.
Thanks to bar-code scanners, which capture
purchase data automatically, association rules
can be used to discover regularities among
products.
Types of associations:
Link analysis: the linkage among many objects
of interest is discovered automatically, such as
the links between web pages and the referential
relationships among groups of academic
publication authors
Association techniques
Market-basket analysis: detect sets of attributes/items that
frequently occur together or are correlated,
e.g. 90% of the people who buy cookies
also buy milk (60% of all grocery shoppers buy both)
In data mining, association rules are useful for
analyzing and predicting customer behavior. They
play an important part in shopping basket data
analysis, product clustering, catalog design and store
layout.
Sequence mining (categorical): discover sequences of
events that commonly occur together, e.g. in a set of
DNA sequences, ACGTC is followed by GTCA after a gap
of 9, with 30% probability
One thing comes after another, for example: when
a flu outbreak happens, gloves will be in short supply
Association rules
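An association rule such as "cookies → milk" is usually described by two measures. Support is the share of all transactions that contain both items. Confidence is the share of transactions containing cookies that also contain milk. The sketch below computes both on a tiny made-up set of baskets. The function names are illustrative.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"cookies", "milk", "bread"},
    {"cookies", "milk"},
    {"cookies", "milk", "eggs"},
    {"cookies", "juice"},
    {"bread", "eggs"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"cookies", "milk"})
rule_confidence = confidence({"cookies"}, {"milk"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Here 3 of the 5 baskets hold both items (support 0.60), and 3 of the 4 cookie baskets also hold milk (confidence 0.75).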
Clustering
Clustering: a method of automatically assigning a set of objects to groups
or segments based on their similarities.
Unlike classification, in clustering the class labels are
unknown.
As the selected algorithm goes through the data set,
identifying the commonalities of things based on their
characteristics, the clusters are established.
Clustering can be treated as an optimization problem:
the goal is to create groups so that the members
within each group have maximum similarity and the
members across groups have minimum similarity.
Clustering techniques
Cluster analysis is a means of identifying
classes of items so that items in a cluster have
more in common with each other than with
items in other clusters.
Example: create customer segmentation based
on income, age, race, location, etc.
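A customer segmentation of this kind can be sketched with k-means, one common clustering algorithm, picked here only as an example. The customers are hypothetical (age, income in thousands) pairs, and the first k points are used as starting centroids so that the run is repeatable. Real implementations usually pick starting points at random.

```python
import math

# Hypothetical customers: (age, income in thousands). No class labels are given.
customers = [(25, 30), (60, 90), (27, 32), (23, 28), (62, 95), (58, 88)]

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # deterministic start (assumption)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

labels, centroids = kmeans(customers, k=2)
print("segment per customer:", labels)
print("segment centers:", centroids)
```

Each pass reduces the total distance between customers and their segment center, which is the optimization view of clustering.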
Data Mining Techniques
Outlier Analysis: find the record(s) that is (are)
the most different from the other records, i.e.,
find all outliers. Outliers are data elements that
cannot be grouped in a given class or cluster.
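One simple way to find records that differ most from the rest is to measure how many standard deviations each value lies from the mean (its z-score) and flag values beyond a threshold. This is a sketch of that idea only. The threshold of 2 and the transaction amounts are assumptions for illustration.

```python
import statistics

# Hypothetical transaction amounts; one value is clearly unlike the others.
amounts = [10, 12, 11, 13, 12, 11, 10, 13, 12, 95]

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / spread > threshold]

outliers = find_outliers(amounts)
print("outliers:", outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, this check works best when outliers are few.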
Example of using Data Mining
Data Mining versus Statistics
Data Mining:
Starts with a loosely defined discovery statement and uses all existing
data (i.e. observational and secondary data) to discover novel
patterns and relationships
Data sets in data mining are as “big” as possible
Statistics:
Starts with a well-defined proposition and collects sample data
(i.e. primary data) to test the hypothesis
Statistics looks for the right size of data (if the data is larger than
required for the statistical analysis, usually a sample of the data
is used)
Data Visualization
Take a break…
watch a video
How Facebook Data Mining, And Your Info, Is Influencing
The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Databases
(KDD)
Knowledge Discovery from Data (KDD) refers to the broad
process of finding knowledge in data, and emphasizes the
"high-level" application of particular data mining methods.
The unifying goal of the KDD process is to extract knowledge from
data in the context of large databases, by using data
mining methods.
KDD refers to the entire process of discovering useful
knowledge from data.
This process involves deciding what qualifies as
knowledge by evaluating and possibly interpreting the
patterns. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior
to the data mining step.
KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: we never see the whole data set, so we
put it in the memory of computers
Then run data mining algorithms
What is the knowledge? How to represent and use it?
Knowledge Discovery Process
Steps in KDD process
The Knowledge Discovery in Databases process comprises a few steps
leading from raw data collections to some form of new knowledge.
The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, a phase in which noisy and
irrelevant data, and possibly records with missing data, are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined into a common source.
Data selection: at this step, the data relevant to the analysis is decided on and
retrieved from the data collection.
Data transformation: also known as data consolidation, a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
Data mining: the crucial step in which clever techniques are applied to extract
potentially useful patterns. It searches for patterns of interest in a particular
representational form or a set of such representations, including classification rules
or trees, regression, and clustering.
Pattern evaluation: in this step, strictly interesting patterns representing
knowledge are identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to
help users understand and interpret the data mining results.
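The steps above can be sketched as a chain of small functions, one per step, applied in the order listed. This is a toy illustration, not a real KDD system: the two data sources, the field names, the "mining" step (ranking items by total sales) and the interestingness threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical raw data from two heterogeneous sources.
store_sales = [
    {"customer": "a", "item": "milk", "amount": 3.0},
    {"customer": "b", "item": "cookies", "amount": None},  # missing value
    {"customer": "c", "item": "milk", "amount": 2.5},
]
web_sales = [
    {"customer": "d", "item": "milk", "amount": 4.0},
    {"customer": "e", "item": "bread", "amount": -1.0},    # noisy value
    {"customer": "f", "item": "cookies", "amount": 5.0},
]

def clean(records):
    """Data cleaning: drop records with missing or impossible amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] > 0]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [(r["item"], r["amount"]) for r in records]

def transform(pairs):
    """Data transformation: consolidate amounts into a total per item."""
    totals = Counter()
    for item, amount in pairs:
        totals[item] += amount
    return totals

def mine(totals):
    """Data mining (toy version): rank items by total sales."""
    return totals.most_common()

def evaluate(patterns, min_total):
    """Pattern evaluation: keep patterns that pass a given measure."""
    return [(item, total) for item, total in patterns if total >= min_total]

def present(patterns):
    """Knowledge representation: show the results to the user."""
    for item, total in patterns:
        print(f"{item}: {total:.2f}")

integrated = integrate(clean(store_sales), clean(web_sales))
patterns = mine(transform(select(integrated)))
interesting = evaluate(patterns, min_total=6.0)
present(interesting)
```

Since the process is iterative, an unsatisfying result at the evaluation step would normally send the analyst back to an earlier step, for example to select different attributes.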
Three methodologies of the KDD model
Fayyad et al. (computer science), e.g. WEKA
SEMMA (SAS) (statistics), e.g. SAS Enterprise Miner
CRISP-DM (SPSS, OHRA) (business), e.g. SPSS
Methodology of KDD: CRISP-DM
CRISP-DM stands for Cross-Industry Standard Process for
Data Mining
A non-proprietary, documented, and freely
available data mining model.
It was developed by industry leaders with input
from more than 200 data mining users and data
mining tool and service providers.
It is an industry-, tool- and application-neutral
model.
This model encourages best practices and offers
organizations the structure needed to realize
better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (elaborated view)
Six phases of CRISP-DM
1. Business Understanding
This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this
knowledge into a data mining problem definition, and a preliminary
plan designed to achieve the objectives.
For example: “What are the common characteristics of the customers we
have lost to our competitors recently?”
2. Data Understanding
The data understanding phase starts with an initial data collection. It
proceeds with activities to:
▪ Get familiar with the data,
▪ Identify data quality problems,
▪ Discover first insights into the data, or
▪ Detect interesting subsets to form hypotheses about hidden information.
3. Data Preparation
The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table,
record, and attribute selection as well as transformation
and cleaning of data for modeling tools.
4. Modeling
In this phase, various modeling techniques are selected and
applied, and their parameters are calibrated to optimal values.
Typically, several techniques can be applied to the same
data mining problem type.
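Calibrating a parameter can be sketched as trying several candidate values and keeping the one that performs best on data held back from training. The example below tunes k for a nearest-neighbour classifier. The technique, the one-dimensional data and the candidate values are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labelled data: (feature value, class label).
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (5.5, "low"),
         (3.5, "high"), (6.0, "high"), (7.0, "high"), (8.0, "high")]
validation = [(1.5, "low"), (2.5, "low"), (4.0, "low"),
              (5.0, "high"), (6.5, "high"), (7.5, "high")]

def predict(x, k):
    """Majority vote among the k training records closest to x."""
    neighbours = sorted(train, key=lambda record: abs(record[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(k):
    """Share of validation records that the model with this k labels correctly."""
    hits = sum(predict(x, k) == label for x, label in validation)
    return hits / len(validation)

# Try each candidate parameter value and keep the best-scoring one.
candidates = [1, 3, 5]
scores = {k: accuracy(k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])
print("validation accuracy per k:", scores)
print("selected k:", best_k)
```

The same loop could compare different techniques rather than different parameter values, which matches the point that several techniques usually fit one problem type.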
5. Evaluate Results
The accuracy and generality of the model were dealt with in
the previous evaluation steps. In this step, the degree to which the
model meets the business objectives is assessed.
This step also seeks to determine whether there is some valid
business reason why the model is deficient. If time and
budget permit, another evaluation option is to test the
model(s) on test applications in the real
application.
6. Deployment
The end of the project is not just the creation of the model.
Though the purpose of the model is to increase knowledge
of the data, the knowledge gained needs to be organized
and presented in such a way that the client can use it.
KDD vs. DM
DM is a component of the KDD process that is
mainly concerned with the means by which patterns
and models are extracted and enumerated from
the data
DM is quite technical
Knowledge discovery involves evaluation and
interpretation of the patterns and models to
make the decision of what constitutes
knowledge and what does not
KDD requires a lot of domain understanding
The terms DM and KDD are often used interchangeably
Perhaps DM is the more common term in the business
world, and KDD in the academic world
The end.
Video: Data Mining and Business Intelligent
https://www.youtube.com/watch?v=peSNJ5bfjX0
How data mining works?
https://www.youtube.com/watch?v=W44q6qszdqY