DATA MINING
Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Motivation
Data explosion problem
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities, patterns, constraints)
from data in large databases
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What is data mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and
potentially useful) patterns or knowledge from huge amounts of data.
Process of semi-automatically analyzing large databases to find patterns
that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: it should be possible to act on the pattern
understandable: humans should be able to interpret the pattern
Goals of Data Mining
The typical goals of data mining projects are:
Identification of groups, clusters, strata, or dimensions in data that
display no obvious structure
Identification of factors that are related to a particular outcome of
interest (root-cause analysis)
Accurate prediction of outcome variable(s) of interest (in the future, or
in new customers, clients, applicants, etc.; this application is usually
referred to as predictive data mining)
Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: Databases → Cleaning & Integration → Data Warehouse → Selection and Transformation → Task-relevant Data → Data Mining → Patterns → Evaluation and Presentation]
Steps of a KDD Process
1. Data cleaning: removal of noise and inconsistent data.
2. Data integration: combination of multiple data sources.
3. Data selection: retrieval of data relevant to the analysis task from the
database.
4. Data transformation: data consolidation into forms appropriate for mining
through summarization or aggregation.
5. Data mining: application of intelligent methods to extract patterns of
interest.
6. Pattern evaluation: identification of truly interesting patterns representing
knowledge based on some interestingness measures.
7. Knowledge presentation: use of visualization and presentation techniques
to present mined knowledge to the user.
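The seven steps above can be sketched as a chain of small functions. This is only a toy illustration, not a prescribed implementation: the two store record lists, the function names, and the 50% support threshold are all hypothetical, and the "mining" step is reduced to counting item pairs that occur together in a basket.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw sources; None and "" stand in for noisy or missing values.
store_a = [
    {"basket": 1, "item": "diaper"}, {"basket": 1, "item": "beer"},
    {"basket": 2, "item": "diaper"}, {"basket": 2, "item": None},
]
store_b = [
    {"basket": 3, "item": "Beer"}, {"basket": 3, "item": "diaper"},
    {"basket": 4, "item": "milk"}, {"basket": 4, "item": ""},
]

def clean(records):
    # Step 1: drop records with a missing item and normalize letter case.
    return [{"basket": r["basket"], "item": r["item"].lower()}
            for r in records if r["item"]]

def integrate(*sources):
    # Step 2: combine multiple data sources into one record list.
    return [r for source in sources for r in source]

def select(records, items_of_interest):
    # Step 3: keep only the records relevant to the analysis task.
    return [r for r in records if r["item"] in items_of_interest]

def transform(records):
    # Step 4: consolidate rows into one item set per basket.
    baskets = {}
    for r in records:
        baskets.setdefault(r["basket"], set()).add(r["item"])
    return baskets

def mine(baskets):
    # Step 5: count how often each pair of items shares a basket.
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support):
    # Step 6: keep pairs whose support reaches the chosen threshold.
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

def present(patterns):
    # Step 7: render the surviving patterns as readable text.
    return [f"{a} & {b}: support {s:.0%}" for (a, b), s in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
relevant = select(records, {"diaper", "beer", "milk"})
baskets = transform(relevant)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = present(patterns)
print(report)
```

In a real project each step would be far more involved, and the steps are usually iterated rather than run once in sequence.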
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
Layers, top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Confluence of Multiple Disciplines
Data Mining System Architecture
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data warehouse server
Data cleaning, integration and selection
Data sources: Database, Data Warehouse, World Wide Web, Other Info Repositories
Necessity for Data Mining
Tremendous amount of data
Algorithms must be highly scalable to handle such data
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Desired analyses
Support for planning, Yield management, System performance, Mature
database analysis
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series, text,
multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, Web mining, etc.
Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
• Data view: Kinds of data to be mined
• Knowledge view: Kinds of knowledge to be discovered
• Method view: Kinds of techniques utilized
• Application view: Kinds of applications adapted
Data Mining Models and Tasks
Life Cycle of Data Mining Projects
Business understanding - project objectives from business perspective, data mining problem definition
Data understanding - initial data collection, get familiar with data
Data preparation - construct final dataset from raw data
Modeling - select and apply modeling techniques
Evaluation - evaluate model, decide on further deployment
Deployment - create report, carry out actions based on new insights
[Figure: Standardized Data Mining Process [CRISP]]
Data Mining Algorithms
[Figure: taxonomy of data mining algorithms]
Online Analytical Processing: SQL, Query Tools
Discovery Driven Methods
Description: Visualization, Clustering, Association, Sequential Analysis
Prediction: Classification, Regressions, Decision Trees, Neural Networks
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
Frequent patterns, association, correlation vs. causality
Diaper → Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(fuel consumption)
Predict some unknown or missing numerical values
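The bracketed pair attached to the diaper and beer rule is conventionally read as [support, confidence]. The sketch below computes both for a rule over a small, invented transaction list; the data and function names are hypothetical, chosen so that the confidence comes out at 75%. Lift is added as one common correlation measure. It is not mentioned on the slide, and a lift above 1 still says nothing about causality.

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # P(consequent | antecedent) = support(A and B) / support(A).
    # Assumes the antecedent occurs at least once (no division by zero).
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    # Confidence relative to the baseline frequency of the consequent.
    # 1.0 means independence; above 1.0 means positive correlation.
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"beer"},
]

rule_support = support(transactions, {"diaper", "beer"})          # 3 of 8 baskets
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
rule_lift = lift(transactions, {"diaper"}, {"beer"})
print(f"Diaper -> Beer [{rule_support:.1%}, {rule_confidence:.0%}], lift {rule_lift}")
```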
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of
the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
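As a minimal sketch of outlier analysis, the following flags values that lie far from the mean in units of standard deviation (a z-score test). The slides do not prescribe a particular method, so the z-score approach, the threshold of 2.0, and the transaction amounts are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds `threshold`
    # population standard deviations.
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious value.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
print(zscore_outliers(amounts))
```

Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided by the analyst. On small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.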
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them are
interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Major Issues in Data Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global
information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
Protection of data security, integrity, and privacy
Primitives that Define a Data Mining Task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
Primitive 3: Background Knowledge
A typical kind of background knowledge: Concept hierarchies
Schema hierarchy
E.g., street < city < county < country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address:
[email protected] login-name < department < university < country
Rule-based hierarchy
low_profit_margin (X) <= price(X, P1) and cost (X, P2) and (P1 - P2) < $50
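Three of the hierarchy types above can be sketched as plain functions. The function names, the "other" age group, and the sample email address are hypothetical; the age ranges and the $50 margin rule come from the slide.

```python
def age_group(age):
    # Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged.
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return "other"  # assumption: the slide names no group outside 20-59

def email_hierarchy(address):
    # Operation-derived hierarchy: splitting the address yields the levels
    # login-name < department < university < country.
    login, _, domain = address.partition("@")
    return [login] + domain.split(".")

def low_profit_margin(price, cost):
    # Rule-based hierarchy: an item has a low profit margin
    # when (price - cost) is under $50.
    return (price - cost) < 50

print(age_group(25), age_group(45))
print(email_hierarchy("jsmith@cs.someuniv.ca"))  # hypothetical address
print(low_profit_margin(price=120, cost=90))
```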
Primitive 4: Pattern Interestingness Measure
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility
potential usefulness, e.g. support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant rules)
Primitive 5: Presentation of Discovered Patterns
Different backgrounds/usages may require different forms of representation
E.g., rules, tables, crosstabs, pie/bar chart, etc.
Concept hierarchy is also important
Discovered knowledge might be more understandable when represented
at high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing provide different
perspectives to data
Different kinds of knowledge require different representation: association,
classification, clustering, etc.
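Drilling up or down and slicing can be illustrated by aggregating the same records at different levels of a concept hierarchy. The sales records, the city-to-country mapping, and the helper name below are invented for this sketch and do not reflect any particular OLAP tool's API.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city < country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

# Hypothetical sales facts.
sales = [
    {"city": "Vancouver", "product": "camera", "amount": 300},
    {"city": "Toronto", "product": "camera", "amount": 200},
    {"city": "Chicago", "product": "camera", "amount": 400},
    {"city": "Chicago", "product": "memory card", "amount": 50},
]

def total_by(records, key):
    # Sum the amounts of all records that share the same key value.
    totals = defaultdict(int)
    for r in records:
        totals[key(r)] += r["amount"]
    return dict(totals)

# Detailed view (drill down): totals per city.
by_city = total_by(sales, lambda r: r["city"])

# Higher abstraction level (drill up): totals per country.
by_country = total_by(sales, lambda r: CITY_TO_COUNTRY[r["city"]])

# Slice: restrict to one product, then view per country.
camera_only = [r for r in sales if r["product"] == "camera"]
camera_by_country = total_by(camera_only, lambda r: CITY_TO_COUNTRY[r["city"]])

print(by_city, by_country, camera_by_country)
```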
Data Mining Potential Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events identify fraudulent events
Data Mining Potential Applications (2)
Manufacturing and production:
automatically adjust knobs when process parameters change
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for sub clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
Ex. 1: Market Analysis and Management
Source of data: credit card transactions, customer complaint calls, (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest, income level, spending
habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product sales, and predict based on
such associations
Customer profiling: what types of customers buy what products (clustering or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Ex. 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and set up a class-based pricing procedure
set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees