
Chapter 6 Data Mining

This document provides an introduction to data mining. It defines data mining as the process of discovering patterns and relationships in large datasets. The document outlines several data mining techniques including prediction, associations, and clustering. Prediction techniques include classification and regression. Association rule learning is used to discover relationships between variables. Clustering assigns objects to groups based on similarities. Examples of data mining applications in various industries are also provided.

Uploaded by

Jiawei Tan


Chapter 6

INTRODUCTION TO DATA MINING

Learning objectives:

 After this lesson, you will be able to:
 Define data mining
 Describe the various techniques used in the data mining process
 Understand the KDD process model
 Describe the various phases of CRISP-DM
 Identify applications of data mining
Definition of data mining
 Data mining is the process of discovering interesting knowledge, such as previously unknown patterns, associations or significant structures, from large amounts of data stored in databases, data warehouses or other information repositories.
 Another definition: data mining is an iterative process of creating predictive and descriptive models by uncovering previously unknown trends and patterns in vast amounts of data in order to support decision making.
 Data mining is a subset of business analytics.
 There is a need to turn data into useful information and knowledge for broad applications, including:
 Market analysis
 Business management
 Decision support
 Customer segmentation and behavior
 Etc.
How does data mining work?

 Data mining builds models to discover patterns among the attributes present in the data set.
 Models are:
 Mathematical representations (from simple linear relationships to highly non-linear relationships) that identify patterns among the attributes of the things being studied, such as customers and products
 Some of these patterns are explanatory and others are predictive (foretelling future values of certain attributes)
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions

 Computers have become cheaper and more powerful

 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
What is (not) Data Mining?

What is not Data Mining?
 Looking up a phone number in a phone directory
 Querying a Web search engine for information about “Amazon”

What is Data Mining?
 Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
 Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Examples of data mining applications

 Temporal data: banking data can be mined for changing trends, which may aid in scheduling bank tellers according to the volume of customer traffic.
 Stock exchange data can be mined to uncover trends that could help in planning investment strategies.
 Computer network data streams can be mined to detect intrusions based on anomalies in message flows, which may be discovered by clustering, by dynamic construction of stream models, or by comparing the current frequent patterns with those at a previous time.
 Spatial data: look for patterns that describe changes in metropolitan poverty rates based on city distances from major highways. Examining the relationships among a set of spatial objects can reveal which subsets of objects are spatially autocorrelated or associated.
Industry examples of DM applications
 Sales/Marketing
 Identify buying patterns of customers
 Find associations among customer demographic characteristics
 Banking
 Credit card fraud detection
 Identify ‘loyal’ customers
 Insurance and Health Care
 Claims analysis, i.e. which medical procedures are claimed together
 Predict which customers will buy new policies
 Transportation
 Determine the distribution schedules for the outlets
 Analyze loading patterns
 Medicine
 Characterize patient behavior in order to predict office visits
 Identify successful medical therapies for different diseases/illnesses
Take a break….
Watch a video

 Source of data mining

 https://www.youtube.com/watch?v=Y_JlkzzhAgw
Data Mining Tasks, Methods and Algorithms
Prediction
 Prediction is the act of telling about the future by taking into account experiences, opinions and other relevant information.
 Depending on the nature of what is being predicted, prediction can be named more specifically:
 Classification: the predicted thing, such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”
 Regression: the predicted thing, such as tomorrow’s temperature, is a real number such as 65 °F
 Time-series: the data consists of values of the same variable captured and stored over time at regular intervals, such as a stock price
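The regression case can be illustrated with a short sketch. It fits a straight line by ordinary least squares and uses it to predict a real number. The day indices and temperature readings are hypothetical illustration data, and the function names `fit_line` and `predict` are assumptions made for this example only.

```python
# Minimal sketch: least-squares fit of y = intercept + slope * x in pure Python.
# The temperature readings below are hypothetical illustration data.

def fit_line(xs, ys):
    """Return (intercept, slope) that minimize the squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply the fitted line to a new input value."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical data: day index and noon temperature in °F.
days = [1, 2, 3, 4, 5]
temps_f = [61.0, 62.0, 63.0, 64.0, 65.0]

model = fit_line(days, temps_f)
tomorrow = predict(model, 6)  # a real number, i.e. a regression output
print(round(tomorrow, 1))
```

A classifier would return a label such as “rainy” instead of a number; the model-building idea (learn from past records, apply to a new one) is the same.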
Prediction techniques
 Classification: assign a new data record to one of several predefined categories or classes. It is a form of supervised learning.
 Classification approaches normally use a training set in which all objects are already associated with known class labels.
 The classification algorithm learns from the training set and builds a model. The model is then used to classify new objects.
 This method has been used in customer segmentation, business modeling, and credit analysis.
 For example, after starting a credit policy, the OurVideoStore managers could analyze customers’ behaviour via their credit history and label the customers who received credit with one of three possible labels: “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
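The train-then-classify idea above can be sketched as follows. This is only an illustration: it uses a nearest-centroid classifier (one simple technique among many, not necessarily the one a real credit system would use), and the two features, the six training records and the function names are all hypothetical.

```python
# Minimal sketch of supervised classification with a nearest-centroid model.
# Hypothetical features per customer: (late payments per year, share of credit limit used).

from collections import defaultdict
import math


def train(training_set):
    """Learn one centroid (mean feature vector) per known class label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for features, label in training_set:
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(model, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Training set: every record already carries a known class label.
training_set = [
    ((0, 0.10), "safe"), ((1, 0.20), "safe"),
    ((3, 0.50), "risky"), ((4, 0.60), "risky"),
    ((8, 0.90), "very risky"), ((9, 0.95), "very risky"),
]

model = train(training_set)
new_customer = (7, 0.85)  # a new, unlabeled credit request
label = classify(model, new_customer)
print(label)
```

Note that the features are not rescaled here, so the late-payment count dominates the distance; a real analysis would normally normalize the attributes first.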
Associations
 Association, or association rule learning, is a popular and well-researched data mining technique for discovering interesting relationships among variables in large databases.
 Bar-code scanners let retail systems capture purchase data, and association rules can then be used to discover regularities among products.
 Types of associations:
 Link analysis: the linkage among many objects of interest is discovered automatically, such as the links between web pages and the referential relationships among groups of academic publication authors
Association techniques
 Market-basket analysis: detect sets of attributes/items that frequently occur together or are correlated, e.g. 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both)
 In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design and store layout.
 Sequence mining (categorical): discover sequences of events that commonly occur one after another, e.g. in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability
 One event follows another: for example, when a flu outbreak occurs, gloves go into short supply
Association rules
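The two percentages in the cookies-and-milk example correspond to the standard rule measures confidence (90%) and support (60%). The sketch below computes both for the rule {cookies} -> {milk}. The fifteen baskets are hypothetical data constructed so that the numbers match the example, and the function names are assumptions for illustration.

```python
# Minimal sketch: support and confidence of the rule {cookies} -> {milk}.
# The baskets are hypothetical and were chosen to reproduce the 60% / 90% example.

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


baskets = (
    [{"cookies", "milk"}] * 9   # nine shoppers buy both
    + [{"cookies"}]             # one shopper buys cookies without milk
    + [{"bread"}] * 5           # five shoppers buy neither
)

rule_support = support(baskets, {"cookies", "milk"})          # 9 of 15 baskets
rule_confidence = confidence(baskets, {"cookies"}, {"milk"})  # 9 of the 10 cookie baskets
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

Support tells how common the item set is overall; confidence tells how reliable the rule is once the antecedent is present.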
Clustering
 Clustering: a method of automatically assigning a set of objects to groups or segments based on their similarities.
 Unlike classification, in clustering the class labels are unknown.
 As the selected algorithm goes through the data set and identifies what things have in common based on their characteristics, the clusters are established.
 Clustering techniques include optimization.
 The goal of clustering is to create groups so that the members within each group have maximum similarity and the members across groups have minimum similarity.
Clustering techniques
 Cluster analysis is a means of identifying
classes of items so that items in a cluster have
more in common with each other than with
items in other clusters.
 Example: create customer segmentation based
on income, age, race, location, etc.
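As an illustration of customer segmentation, the sketch below uses k-means, one common clustering algorithm (chosen here as an example; the slides do not name a specific algorithm). The customer records (age, income in thousands), the fixed starting centroids and the function name `k_means` are hypothetical.

```python
# Minimal sketch of k-means clustering in pure Python.
# Hypothetical customers: (age in years, annual income in thousands).

import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids


customers = [(22, 25), (25, 30), (27, 28), (52, 90), (55, 95), (58, 88)]

# No class labels are supplied; only the number of groups (two starting centroids).
clusters, centroids = k_means(customers, centroids=[customers[0], customers[3]])
for segment in clusters:
    print(segment)
```

The starting centroids are fixed to keep the example deterministic; in practice they are usually chosen at random and the attributes are rescaled before distances are computed.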
Data Mining Techniques
 Outlier Analysis: find the record(s) that is (are)
the most different from the other records, i.e.,
find all outliers. Outliers are data elements that
cannot be grouped in a given class or cluster.
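One simple way to flag records that differ most from the others is a z-score test, sketched below. This is an assumed illustration technique, not the only definition of an outlier; the transaction amounts, the threshold of 2.0 and the function name are hypothetical.

```python
# Minimal sketch: flag values that lie far from the mean, measured in standard deviations.
# The amounts and the threshold are hypothetical illustration choices.

from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = z_score_outliers(amounts)
print(outliers)
```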
Example of using Data Mining
Data Mining versus Statistics

Data Mining:
 Starts with a loosely defined discovery statement and uses all existing data (i.e. observational and secondary data) to discover novel patterns and relationships
 Data sets in data mining are as “big” as possible

Statistics:
 Starts with a well-defined proposition and collects sample data (i.e. primary data) to test the hypothesis
 Looks for the right size of data (if the data is larger than what the statistical analysis requires, a sample of the data is usually used)
Data Visualization
Take a break…
watch a video
 How Facebook Data Mining, And Your Info, Is Influencing
The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Databases (KDD)
 Knowledge Discovery in Databases (KDD), also called Knowledge Discovery from Data, refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.
 The unifying goal of the KDD process is to extract knowledge from data in the context of large databases, which is done by using data mining methods.
 KDD refers to the entire process of discovering useful knowledge from data.
 This process involves deciding what qualifies as knowledge by evaluating and possibly interpreting the patterns. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
KDD: A Definition

 KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
 Data: 10^6 to 10^12 bytes. We never see the whole data set, so it is put in the memory of computers.
 Then run data mining algorithms.
 Knowledge: What is the knowledge? How to represent and use it?
Knowledge Discovery Process
Steps in KDD process
Knowledge Discovery Process
 The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.
 The iterative process consists of the following steps:
 Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection and missing data are handled.
 Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
 Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
 Data transformation: also known as data consolidation, a phase in which the selected data is transformed into forms appropriate for the mining procedure.
 Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns. It searches for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
 Pattern evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given measures.
 Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
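The seven steps above can be sketched as a small pipeline in which each step is one function. Everything here is a hypothetical illustration: the two sales sources, the field names, the 50% support threshold and the function names are assumptions, and a real KDD project would iterate over these steps rather than run them once.

```python
# Minimal sketch of the KDD steps as a pipeline, using only the standard library.
# All records, field names and thresholds are hypothetical.

from collections import Counter

store_sales = [
    {"customer": "c1", "item": "milk", "amount": 2.5, "cashier": "A"},
    {"customer": "c2", "item": "cookies", "amount": None, "cashier": "B"},  # missing amount
    {"customer": "c1", "item": "cookies", "amount": 3.0, "cashier": "A"},
]
web_sales = [
    {"customer": "c3", "item": "milk", "amount": 2.5},
    {"customer": "c3", "item": "cookies", "amount": 3.0},
]


def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r.get("amount") is not None]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]


def select(records, fields=("customer", "item")):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: reshape rows into one basket (set of items) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets


def mine(baskets):
    """Data mining: count how many baskets contain each pair of items."""
    pair_counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pair_counts[(a, b)] += 1
    return pair_counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep only pairs whose support meets the threshold."""
    return {pair: count / n_baskets for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


def present(patterns):
    """Knowledge representation: render the patterns as readable lines."""
    return [f"{a} + {b}: support {s:.0%}" for (a, b), s in sorted(patterns.items())]


baskets = transform(select(integrate(clean(store_sales), clean(web_sales))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
print("\n".join(report))
```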
Three methodologies of the KDD model
 Fayyad et al. (Computer science)
 E.g., WEKA
 SEMMA (SAS) (Statistics)
 SAS Enterprise Miner
 CRISP-DM (SPSS, OHRA) (Business)
 SPSS
Methodology of KDD: CRISP-DM
 CRISP-DM
 Stands for Cross Industry Standard Process for
Data Mining
 A non-proprietary, documented, and freely
available data mining model.
 It was developed by industry leaders with input
from more than 200 data mining users and data
mining tool and service providers.
 It is an industry-, tool- and application-neutral
model.
 This model encourages best practices and offers
organizations the structure needed to realize
better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (Elaborate view)
Six phases of CRISP-DM
1. Business Understanding
 This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this
knowledge into a data mining problem definition, and a preliminary
plan designed to achieve the objectives.
 Such as “What are the common characteristics of the customers we
have lost to our competitors recently?”
2. Data Understanding
 The data understanding phase starts with an initial data collection. It proceeds with activities to:
 Get familiar with the data
 Identify data quality problems
 Discover first insights into the data
 Detect interesting subsets to form hypotheses about hidden information
3. Data Preparation
 The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data.
 Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table,
record, and attribute selection as well as transformation
and cleaning of data for modeling tools.
4. Modeling
 In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques can be applied to the same type of data mining problem.
5. Evaluate Results
 The accuracy and generality of the model were dealt with in the previous evaluation steps. This step assesses the degree to which the model meets the business objectives.
 This step also seeks to determine whether there is some business reason why the model is deficient. If time and budget permit, the model(s) can also be evaluated by testing them in the real application.
6. Deployment
 The creation of the model is not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the client can use.
KDD vs. DM
 DM is a component of the KDD process that is mainly concerned with the means by which patterns and models are extracted and enumerated from the data
 DM is quite technical
 Knowledge discovery involves evaluating and interpreting the patterns and models to decide what constitutes knowledge and what does not
 KDD requires a lot of domain understanding
 The terms DM and KDD are often used interchangeably
 Perhaps DM is the more common term in the business world, and KDD in the academic world
The end.

Video: Data Mining and Business Intelligent

https://www.youtube.com/watch?v=peSNJ5bfjX0

How data mining works?

https://www.youtube.com/watch?v=W44q6qszdqY
