Data Mining:
Concepts and Techniques
(3rd ed.)
Unit-II (a): Data Mining
Unit-II (a): Data Mining
1. Why Data Mining?
2. What Is Data Mining?
3. What Kinds of Data Can Be Mined?
4. What Kinds of Patterns Can Be Mined?
5. Major Issues in Data Mining
1. Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
1. Why Data Mining?
i. Moving toward the Information Age
We are living in the information age.
The world is data rich but information poor.
Example: the number of people who search for flu-related information
is closely related to the number of people who actually have flu symptoms
ii. Data Mining as the Evolution of Information Technology
Data Collection and Database Creation
DBMS
Advanced Database Systems
Advanced Data Analysis
Data mining turns data tombs into “golden nuggets” of knowledge.
2. What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or
knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing
communities
Data mining plays an essential role in the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse →
Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
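The flow of the KDD process can be sketched on a toy data set. This is only an illustration under stated assumptions: the records, the field names, and the helper functions (clean, integrate, select, mine, evaluate) are hypothetical, and the "mining" step is a trivial frequency count standing in for a real algorithm.

```python
# Toy walk-through of the KDD steps; all records and names are hypothetical.
from collections import Counter

store_a = [
    {"customer": "c1", "item": "milk", "qty": 2},
    {"customer": "c2", "item": "bread", "qty": None},  # missing value
    {"customer": "c3", "item": "milk", "qty": 1},
]
store_b = [
    {"customer": "c4", "item": "milk", "qty": 3},
    {"customer": "c5", "item": "eggs", "qty": 12},
]

def clean(records):
    """Data cleaning: drop records with a missing quantity."""
    return [r for r in records if r["qty"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records):
    """Data mining (trivial stand-in): count how often each item occurs."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

warehouse = integrate(clean(store_a), clean(store_b))
task_data = select(warehouse, ["item"])
patterns = evaluate(mine(task_data), min_count=2)
print(patterns)
```

Each function corresponds to one box in the process; in practice every step is far more involved, and the steps are often repeated iteratively.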
Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into
knowledge-base
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions
increases toward the top. Layers from top to bottom, with the associated
role in parentheses where the figure shows one:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
3. Data Mining: On What Kinds of Data?
i. Database-oriented data sets and applications
Relational database
Data Warehouse
Transactional database
June 21, 2019 Data Mining: Concepts and Techniques 13
ii. Other Kinds of Data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
4. Data Mining Techniques
i. Class/Concept Description: Characterization and
Discrimination
ii. Mining Frequent Patterns, Associations, and
Correlations
iii. Classification and Regression for Predictive Analysis
iv. Cluster Analysis
v. Outlier Analysis
vi. Are All Patterns Interesting?
Data Mining Tasks
Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks.
Descriptive mining tasks characterize properties of
the data in a target data set.
Predictive mining tasks perform induction on the
current data in order to make predictions.
i. Data characterization & discrimination
Data characterization is a summarization of the general characteristics
or features of a target class of data.
Output:
pie charts, bar charts, curves, multidimensional data cubes,
and multidimensional tables, including crosstabs.
Data discrimination is a comparison of the general features of
the target class data objects against the general features of
objects from one or multiple contrasting classes.
The target and contrasting classes can be specified by a user,
and the corresponding data objects can be retrieved through
database queries. (Output: discriminant rules)
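Characterization and discrimination can be illustrated with a small sketch. It assumes a hypothetical customer table with made-up attribute names (segment, age, spend) and uses per-attribute means as the summary; real systems use richer summaries such as data cubes and discriminant rules.

```python
# Hypothetical example: summarize a target class and compare it with a
# contrasting class. Data and attribute names are invented for illustration.
from statistics import mean

customers = [
    {"segment": "frequent", "age": 30, "spend": 800},
    {"segment": "frequent", "age": 40, "spend": 1000},
    {"segment": "occasional", "age": 50, "spend": 200},
    {"segment": "occasional", "age": 60, "spend": 300},
]

def characterize(records, attributes):
    """Characterization: summarize a class by the mean of each attribute."""
    return {a: mean(r[a] for r in records) for a in attributes}

# The target and contrasting classes are chosen by the user.
target = [c for c in customers if c["segment"] == "frequent"]
contrast = [c for c in customers if c["segment"] == "occasional"]

target_profile = characterize(target, ["age", "spend"])
contrast_profile = characterize(contrast, ["age", "spend"])

# Discrimination: compare the general features of the two classes.
difference = {a: target_profile[a] - contrast_profile[a] for a in target_profile}
print(target_profile, contrast_profile, difference)
```

The list comprehensions that build target and contrast play the role of the database queries that retrieve each class.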
(ii) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your
Walmart?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large
datasets?
How to use such patterns for classification, clustering,
and other applications?
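The support and confidence of a rule such as Diaper → Beer can be computed directly from transaction counts. The sketch below uses a made-up list of five transactions and hypothetical helper names. It also computes lift, one common correlation measure that is not named on the slide, to hint at why strong association does not by itself imply strong correlation.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    n_ante = count(transactions, antecedent)
    n_both = count(transactions, set(antecedent) | set(consequent))
    return n_both / n_ante if n_ante else 0.0

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; above 1 suggests positive correlation."""
    base = support(transactions, consequent)
    return confidence(transactions, antecedent, consequent) / base if base else 0.0

# 3 of 5 transactions contain both items; 3 of the 4 diaper transactions contain beer.
print(support(transactions, {"diaper", "beer"}))       # 0.6
print(confidence(transactions, {"diaper"}, {"beer"}))  # 0.75
print(lift(transactions, {"diaper"}, {"beer"}))
```

Counting every candidate itemset this way does not scale to large data sets, which is why efficient frequent-pattern mining algorithms are needed.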
(iii) Classification and Regression for Predictive Analysis
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
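As a minimal illustration of building a model from training examples, the sketch below learns a decision stump (a one-level decision tree) that classifies cars by gas mileage. The training data, the labels, and the function name train_stump are hypothetical; practical classifiers use many attributes and far more robust learning procedures.

```python
def train_stump(examples):
    """Learn a one-level decision tree on a single numeric feature.

    examples is a list of (value, label) pairs. The stump picks the
    threshold and label assignment with the fewest training errors.
    """
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    (high_label if x >= threshold else low_label) != y
                    for x, y in examples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: high_label if x >= threshold else low_label

# Hypothetical training examples: (miles per gallon, class label).
training = [
    (15, "inefficient"), (18, "inefficient"), (22, "inefficient"),
    (30, "efficient"), (35, "efficient"), (40, "efficient"),
]

classify = train_stump(training)
print(classify(28), classify(20))
```

The learned function is the model; applying it to unseen mileage values is the label-prediction step.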
(iv) Cluster Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters),
e.g., cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity &
minimizing interclass similarity
Many methods and applications
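One of the many clustering methods is k-means, which the slide does not name; it is used here only as a familiar example. The sketch assumes hypothetical two-dimensional points (for instance, house locations) and a naive initialization that takes the first k points as starting centroids.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means sketch: returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive, deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical house coordinates forming two visible groups.
houses = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]

centroids, clusters = kmeans(houses, k=2)
print(clusters)
```

No class labels are supplied; the groups emerge only from the distances between points, which is what makes the task unsupervised.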
(v) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with
the general behavior of the data
Noise or exception? ― One person’s garbage
could be another person’s treasure
Methods: by-product of clustering or regression
analysis, …
Useful in fraud detection, rare events analysis
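A very simple way to flag objects that do not follow the general behavior of the data is a z-score rule. This is not one of the methods listed above (clustering or regression by-products); it is a swapped-in baseline chosen for brevity. The transaction amounts and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusual charge.
amounts = [10, 12, 11, 13, 12, 11, 10, 95]
print(zscore_outliers(amounts))
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.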
(vi) Is All Mined Knowledge Interesting?
A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
A pattern is also interesting if it validates a hypothesis that
the user sought to confirm.
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Support and Confidence
Coverage
Accuracy
Timeliness
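Coverage and accuracy can be made concrete for a single IF-THEN rule: coverage is the fraction of all records the rule applies to, and accuracy is the fraction of those covered records whose label the rule predicts correctly. The records, attribute names, and the rule itself are hypothetical.

```python
def rule_quality(records, condition, predicted_label, label_key="label"):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted_label."""
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if r[label_key] == predicted_label]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical records: does a customer buy a computer?
records = [
    {"age": "youth", "student": "yes", "label": "buys"},
    {"age": "youth", "student": "yes", "label": "buys"},
    {"age": "youth", "student": "yes", "label": "buys"},
    {"age": "youth", "student": "yes", "label": "does_not_buy"},
    {"age": "youth", "student": "no", "label": "does_not_buy"},
    {"age": "senior", "student": "no", "label": "buys"},
    {"age": "senior", "student": "no", "label": "does_not_buy"},
    {"age": "middle", "student": "no", "label": "buys"},
]

# Rule: IF age = youth AND student = yes THEN buys.
coverage, accuracy = rule_quality(
    records,
    lambda r: r["age"] == "youth" and r["student"] == "yes",
    "buys",
)
print(coverage, accuracy)  # 0.5 0.75
```

A rule with high accuracy but very low coverage may still be uninteresting, which is why such measures are usually considered together.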
5. Major Issues in Data Mining
i. Mining Methodology
Mining various and new kinds of knowledge (Integrated Clustering)
Mining knowledge in multi-dimensional space (CHG & Data Cube)
Data mining: An interdisciplinary effort (Text & Bug Mining)
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, & incompleteness of data (Cleaning)
Pattern evaluation & pattern- or constraint-guided mining (Beliefs)
ii. User Interaction
Interactive mining (UI, different mining requests, search, OLAP operations)
Incorporation of background knowledge (Domain knowledge)
Ad hoc data mining and data mining query languages (SQL/DMQL)
Presentation and visualization of data mining results (Understandable)
5. Major Issues in Data Mining (contd.)
iii. Efficiency and Scalability
Efficiency & scalability of DM algorithms (Time & Performance)
Parallel, distributed, stream, and incremental mining methods
iv. Diversity of data types
Handling complex types of data (Stream, Text, Web data)
Mining dynamic, networked, & global data repositories (Unstructured)
v. Data Mining and Society
Social impacts of data mining (Benefits, misuse, protection rights)
Privacy-preserving data mining (Sensitivity and Privacy)
Invisible data mining (Future recommendations: purchases, interests)
Summary
Data mining: Discovering interesting patterns and knowledge
from massive amounts of data.
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation.
Mining can be performed on a variety of data.
Data mining functionalities: characterization, discrimination,
association, classification, clustering, trend and outlier analysis.
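Major issues in data mining: mining methodology, user interaction,
efficiency and scalability, diversity of data types, and data mining and society.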