Data Mining & Warehousing
Chapter 1. Introduction
Data Warehousing/Mining
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
• 1 zettabyte = 1 trillion gigabytes.
• 5,200 GB of data for every person on Earth.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, social media,
mobile devices, …
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets
• Mine the knowledge from data
Example of Data Volumes
Data Sets are growing.
How Much Data is that?
Amount   Size in bytes                 Roughly equivalent to
1 MB     2^20 or 10^6 bytes            A small novel; a 3½-inch disk
1 GB     2^30 or 10^9 bytes            Paper reams that could fill the back of a pickup van
1 TB     2^40 or 10^12 bytes           50,000 trees chopped, converted into paper, and printed
2 PB     1 PB = 2^50 or 10^15 bytes    Academic research libraries across the U.S.
5 EB     1 EB = 2^60 or 10^18 bytes    All words ever spoken by human beings
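The "2^n or 10^m" pairs in the table are close but not equal. The short Python sketch below (the names `UNITS` and `unit_sizes` are illustrative, not from the slides) computes both values for each unit and shows how the gap between the binary and decimal readings widens as the units grow.

```python
# Units from the table, in increasing order of size.
UNITS = ["MB", "GB", "TB", "PB", "EB"]

def unit_sizes(name):
    """Return (binary_bytes, decimal_bytes) for a unit name such as 'GB'."""
    step = UNITS.index(name) + 2      # MB is two steps above bytes (KB is one)
    return 2 ** (10 * step), 10 ** (3 * step)

for name in UNITS:
    binary, decimal = unit_sizes(name)
    gap_percent = (binary - decimal) / decimal * 100
    print(f"1 {name}: binary={binary}, decimal={decimal}, gap={gap_percent:.1f}%")
```

The gap is under 5% for a megabyte and above 15% for an exabyte, which is why the table treats the two forms as interchangeable only for rough estimates.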
Evolution of Database Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
1990s—2000s:
– Data mining and data warehousing, multimedia databases, and
Web databases
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
What Is Data Mining?
Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
What is not data mining?
– (Deductive) query processing.
– Expert systems or small machine learning/statistical programs
Data Mining: A Knowledge Discovery in Databases (KDD) Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
7 Data Mining Steps
1. Data cleaning – remove noise and
inconsistent data
2. Data integration – combine multiple
sources
3. Data selection – retrieve from the
database data relevant to the analysis task
4. Data transformation – data are
transformed or consolidated into forms
appropriate for mining (e.g. performing
summary or aggregation operations)
7 Data Mining Steps (continued)
5. Data mining – intelligent methods are
applied to extract data patterns
6. Pattern evaluation – identify truly
interesting patterns representing knowledge
based on some interestingness measures
7. Knowledge presentation – present mined
knowledge to the user
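The seven steps can be read as a pipeline in which each stage feeds the next. The Python sketch below is only an illustration under stated assumptions: the two source tables, the function names, and the "pattern" (items ranked by total quantity) are all hypothetical stand-ins, and a real mining step would use a far more capable method.

```python
from collections import Counter

# Hypothetical source data: two stores with slightly different quality problems.
store_a = [
    {"customer": "c1", "item": "bread", "qty": 2},
    {"customer": "c1", "item": "milk", "qty": 1},
    {"customer": "c2", "item": "bread", "qty": None},   # incomplete record
]
store_b = [
    {"customer": "c3", "item": "Milk ", "qty": 3},      # inconsistent spelling
    {"customer": "c3", "item": "bread", "qty": 1},
    {"customer": "c4", "item": "pen", "qty": -5},       # noise: negative quantity
]

def clean(rows):
    """Step 1: drop noisy or incomplete rows and normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in rows
        if r["qty"] is not None and r["qty"] > 0
    ]

def integrate(*sources):
    """Step 2: combine multiple sources into one collection."""
    return [row for source in sources for row in source]

def select(rows, items):
    """Step 3: keep only the rows relevant to the analysis task."""
    return [r for r in rows if r["item"] in items]

def transform(rows):
    """Step 4: aggregate quantities per item."""
    totals = Counter()
    for r in rows:
        totals[r["item"]] += r["qty"]
    return totals

def mine(totals):
    """Step 5: placeholder 'pattern': items ranked by total quantity."""
    return totals.most_common()

def evaluate(patterns, min_qty):
    """Step 6: keep patterns that pass a simple interestingness threshold."""
    return [(item, qty) for item, qty in patterns if qty >= min_qty]

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: {qty} units" for item, qty in patterns]

rows = integrate(clean(store_a), clean(store_b))
totals = transform(select(rows, {"bread", "milk"}))
report = present(evaluate(mine(totals), min_qty=4))
print(report)
```

Keeping each step as its own function mirrors the slide's ordering and makes it easy to swap in a real mining algorithm at step 5 without touching the preparation steps.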
Data Mining: Classification Schemes
Broader category of data:
General functionality
– Descriptive data mining – describes general properties of the data
– Predictive data mining – analyzes the data in order to make predictions
Different views, different classifications
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Patterns to be Mined
Predictive data mining
– Classification
– Regression
– Time series analysis
– Prediction
Descriptive data mining
– Clustering
– Association
– Summarization
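To make the predictive/descriptive split concrete, the sketch below applies one task of each kind to the same small data set: summarization (descriptive) and regression (predictive). The data, the variable names, and the choice of an ordinary least-squares line are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical data: advertising spend (x) and resulting sales (y).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

def summarize(values):
    """Descriptive: report general properties of the data already observed."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}

def fit_line(xs, ys):
    """Predictive: ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

summary = summarize(sales)                    # describes what was seen
slope, intercept = fit_line(spend, sales)     # model learned from what was seen
prediction = slope * 5.0 + intercept          # estimate for an unseen spend of 5.0
print(summary, slope, intercept, prediction)
```

The summary only restates the existing data, while the fitted line produces a value for an input that does not appear in the data, which is the distinction the two lists draw.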
Why Data Mining? Potential Applications
Database analysis and decision support
– Market analysis and management
Target marketing, customer relationship management, market basket
analysis, cross-selling, market segmentation
– Risk analysis and management
Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
Other Applications
– Text mining (newsgroups, email, documents) and Web analysis.
– Intelligent query answering
Market Analysis and Management (1)
Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
– Conversion of a single bank account to a joint one: marriage, etc.
Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
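Associations between product sales are commonly quantified with support and confidence. The sketch below computes both over a handful of hypothetical shopping baskets; the basket contents and the function names are assumptions for illustration, and real market basket analysis would use a scalable algorithm rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the set of products in each basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def pair_support(baskets):
    """Fraction of all baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support)
print(confidence(baskets, "butter", "bread"))
```

A rule such as "butter implies bread" with high confidence is the kind of association that can then drive a prediction, for example suggesting bread to a customer whose basket contains butter.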
Market Analysis and Management (2)
Customer profiling
– data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
Providing summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and
variation)
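A multidimensional summary report groups the same records along different dimensions and reports central tendency and variation for each group. The sketch below is a minimal illustration, assuming hypothetical (region, product, amount) sales records and a helper named `summarize_by` that is not part of any library.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sales records: (region, product, amount).
sales = [
    ("north", "bread", 10.0),
    ("north", "bread", 14.0),
    ("north", "milk", 6.0),
    ("south", "bread", 8.0),
    ("south", "milk", 4.0),
    ("south", "milk", 8.0),
]

def summarize_by(rows, key):
    """Group amounts by `key(row)` and report total, mean, and std deviation."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {
        k: {"total": sum(v), "mean": mean(v), "std_dev": pstdev(v)}
        for k, v in groups.items()
    }

# Two views of the same data: a detailed one and a rolled-up one.
by_region_product = summarize_by(sales, key=lambda r: (r[0], r[1]))
by_region = summarize_by(sales, key=lambda r: r[0])
print(by_region_product)
print(by_region)
```

Changing only the `key` function moves between levels of detail, which is the essence of rolling up or drilling down in a summary report.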
Corporate Analysis and Risk Management
Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning:
– summarize and compare the resources and spending
Competition:
– monitor competitors and market directions
– group customers into classes and set up a class-based pricing
procedure
– set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
– use historical data to build models of fraudulent behavior and
use data mining to help identify similar instances
Examples
– auto insurance: detect a group of people who stage accidents to
collect on insurance
– money laundering: detect suspicious money transactions (US
Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors, and rings of references
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
– The Australian Health Insurance Commission found that blanket
screening tests were requested in many cases (saving A$1m/yr).
Detecting telephone fraud
– Telephone call model: destination of the call, duration, time of
day or week. Analyze patterns that deviate from an expected
norm.
– British Telecom identified discrete groups of callers with
frequent intra-group calls, especially on mobile phones, and broke
up a multimillion-dollar fraud.
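Flagging calls that deviate from an expected norm can be sketched with a simple statistical rule. The example below is an assumption-laden illustration: it models only call duration (not destination or time of day), builds the norm from hypothetical historical durations, and uses an arbitrary threshold of three standard deviations.

```python
from statistics import mean, pstdev

# Hypothetical historical call durations (minutes) for one customer.
history_minutes = [3.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0, 4.0]

def build_norm(durations):
    """Model the expected norm as the mean and standard deviation of past calls."""
    return mean(durations), pstdev(durations)

def deviates(duration, norm, threshold=3.0):
    """True if the call lies more than `threshold` std deviations from the mean."""
    mu, sigma = norm
    if sigma == 0:
        return duration != mu
    return abs(duration - mu) / sigma > threshold

norm = build_norm(history_minutes)
new_calls = [4.0, 5.0, 60.0]
flagged = [d for d in new_calls if deviates(d, norm)]
print(flagged)
```

Building the norm from history and scoring new calls separately keeps an extreme new call from distorting the baseline it is compared against.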
Retail
– Analysts estimate that 38% of retail shrink is due to dishonest
employees.
Other Applications
Sports
– IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
New York Knicks and Miami Heat
Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer preferences
and behavior, analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Major Issues in Data Mining (1)
Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)
Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
Issues related to applications and social impacts
– Application of discovered knowledge: domain-specific data mining
tools, intelligent query answering, process control and decision making
– Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem
– Protection of data security, integrity, and privacy
Summary
Data mining: discovering interesting patterns from large amounts of
data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining