Chapter 1.
Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
Major Issues in Data Mining
Google Flu Trends: US Flu Activity
Blue: Google Flu Trends estimate; Orange: US data
http://www.google.org/flutrends/
March 30, 2014
Data Mining: Concepts and Techniques
Influenza-like Illness (ILI) percentages estimated by model (black) and provided by the CDC (red) in the mid-Atlantic region, showing data available at four points in the 2007-2008 influenza season.
J Ginsberg et al. Nature 000, 1-3 (2008) doi:10.1038/nature07634
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, etc.
Science: Remote sensing, bioinformatics, scientific simulation, etc.
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining, the automated analysis of massive data sets
Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"?
Not by themselves: simple search and query processing, (deductive) expert systems
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data Mining in Business Intelligence
Increasing potential to support business decisions
[Figure: pyramid of layers, listed here from top to bottom]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
KDD Process: A Typical View from ML and Statistics
[Figure: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing]
Data Pre-Processing: Data integration, normalization, feature selection, dimension reduction
Data Mining: Pattern discovery, association & correlation, classification, clustering, outlier analysis
Post-Processing: Pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
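The three stages in this view can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names are hypothetical, min-max scaling stands in for the normalization step, and a simple threshold stands in for a real mining algorithm.

```python
from typing import List, Sequence


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Pre-processing: rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def find_high_values(normalized: Sequence[float], threshold: float = 0.8) -> List[int]:
    """Mining stand-in: indices of records at or above a threshold."""
    return [i for i, v in enumerate(normalized) if v >= threshold]


def describe(indices: Sequence[int], raw: Sequence[float]) -> List[str]:
    """Post-processing: turn the findings into readable lines."""
    return [f"record {i}: raw value {raw[i]}" for i in indices]


def run_pipeline(raw: Sequence[float]) -> List[str]:
    """Chain pre-processing, mining, and post-processing."""
    normalized = min_max_normalize(raw)
    patterns = find_high_values(normalized)
    return describe(patterns, raw)


report = run_pipeline([10, 20, 30, 100])
```

Keeping each stage as its own function mirrors the figure: any stage can be swapped for a more realistic method without touching the others.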
Example: Medical Data Mining
Health care and medical data mining often adopt this view from statistics and machine learning
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
Multi-Dimensional View of Data Mining
Data to be mined: Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
Knowledge to be mined (or: data mining functions): Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; descriptive vs. predictive data mining; multiple/integrated functions and mining at multiple levels
Techniques utilized: Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted: Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
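To make "materializing multidimensional aggregates" concrete, the sketch below computes a sum for every combination of grouping dimensions (every cuboid) over a handful of rows. The rows, the column names, and the function `materialize_cube` are made up for illustration; a real data cube engine would avoid this brute-force pass per cuboid.

```python
from collections import defaultdict
from itertools import combinations


def materialize_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions`.

    Returns {group-by dimensions: {dimension values: total}}.
    The empty tuple () is the apex cuboid (grand total).
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Illustrative (invented) rainfall records contrasting dry and wet regions.
rows = [
    {"region": "dry", "season": "summer", "rainfall_mm": 5.0},
    {"region": "dry", "season": "winter", "rainfall_mm": 20.0},
    {"region": "wet", "season": "summer", "rainfall_mm": 80.0},
    {"region": "wet", "season": "winter", "rainfall_mm": 120.0},
]

cube = materialize_cube(rows, ("region", "season"), "rainfall_mm")
dry_total = cube[("region",)][("dry",)]
wet_total = cube[("region",)][("wet",)]
```

Comparing `dry_total` with `wet_total` is a small instance of discrimination: the same summary computed for two contrasting classes.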
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your Walmart?
Association, correlation vs. causality
A typical association rule
Diaper -> Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
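The support and confidence of a rule such as Diaper -> Beer can be computed directly from a list of transactions. The sketch below uses invented baskets and hypothetical helper names; it also computes lift, one common measure (chosen here as an assumption, not taken from the slide) for asking whether associated items are also correlated.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # The rule never fires; report zero confidence.
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support.

    Values above 1 suggest positive correlation, below 1 negative.
    """
    base = support(transactions, consequent)
    if base == 0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / base


# Invented market baskets for illustration only.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
rule_lift = lift(transactions, {"diaper"}, {"beer"})
```

A rule can have high confidence simply because its consequent is popular, which is why a correlation measure is usually checked alongside support and confidence.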
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown class labels
Typical methods
Decision trees, naive Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, etc.
Typical applications
Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, etc.
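As a minimal sketch of "construct a model from training examples, then predict unknown labels", the code below trains a one-level decision tree (a decision stump) that classifies cars by gas mileage. The mileage figures, the label names, and the functions `train_stump` and `predict` are assumptions made for illustration.

```python
def train_stump(examples):
    """Learn a single threshold on one numeric feature.

    `examples` is a list of (value, label) pairs with exactly two labels.
    Returns (threshold, label_at_or_below, label_above).
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({label for _, label in examples})
    if len(labels) != 2 or len(values) < 2:
        raise ValueError("need two labels and at least two distinct values")

    best = None
    # Candidate thresholds are midpoints between neighboring values.
    for threshold in [(a + b) / 2 for a, b in zip(values, values[1:])]:
        for low_label, high_label in (labels, labels[::-1]):
            errors = sum(
                1
                for v, label in examples
                if (low_label if v <= threshold else high_label) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    return best[1], best[2], best[3]


def predict(stump, value):
    """Apply a trained stump to an unseen value."""
    threshold, low_label, high_label = stump
    return low_label if value <= threshold else high_label


# Invented training data: (miles per gallon, class label).
cars = [
    (12, "low_mileage"), (15, "low_mileage"), (18, "low_mileage"),
    (28, "high_mileage"), (33, "high_mileage"), (40, "high_mileage"),
]

stump = train_stump(cars)
label_for_20 = predict(stump, 20)
label_for_35 = predict(stump, 35)
```

The methods listed on the slide differ mainly in the form of the learned model; the train-then-predict workflow stays the same.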
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing inter-class similarity
Many methods and applications
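One widely used way to act on this principle is k-means, sketched below for 2-D points. The house coordinates and starting centroids are invented, and k-means is only one of the many methods the slide alludes to; it approaches the principle by keeping points close to their own cluster's centroid.

```python
from math import dist


def k_means(points, initial_centroids, iterations=10):
    """Basic k-means (Lloyd's algorithm) on tuples of coordinates.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroids[i].
    """
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Assignments are stable.
        centroids = new_centroids
    return centroids, clusters


# Invented house locations forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(houses, [(0.0, 0.0), (10.0, 10.0)])
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups emerge from distances alone.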
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the general behavior of the data
Noise or exception? One person's garbage could be another person's treasure
Methods: by-product of clustering or regression analysis, etc.
Useful in fraud detection, rare events analysis
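The idea of finding outliers as a by-product of regression can be sketched as follows: fit a line, then flag points whose residual is far larger than the typical residual. The data, the `factor` cutoff, and the function names are assumptions for illustration; note that an ordinary least-squares fit is itself pulled toward outliers, so robust methods are often preferred in practice.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_outliers(xs, ys, factor=3.0):
    """Indices whose absolute residual exceeds `factor` times the median one."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    typical = median(residuals)
    return [i for i, r in enumerate(residuals) if r > factor * typical]


# Invented readings that roughly follow y = 2x, with one deviating point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 30.0, 14.1, 16.0]

flagged = residual_outliers(xs, ys)
```

Whether the flagged point is noise to discard or the exception worth investigating depends on the application, as the slide points out.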
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining: e.g., first buy digital camera, then buy large SD memory cards
Periodicity analysis
Motifs and biological sequence analysis: approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite data streams
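The "camera first, memory card later" example amounts to counting how many customer histories contain the items in that order. The sketch below does only this counting step for a given pattern; the purchase histories and function names are invented, and a full sequential pattern miner would also have to generate the candidate patterns.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in order.

    Other items may occur in between.
    """
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so each later item must be found after the earlier ones.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


# Invented purchase histories, one list per customer, oldest first.
purchases = [
    ["digital camera", "tripod", "SD memory card"],
    ["SD memory card", "digital camera"],
    ["digital camera", "SD memory card"],
    ["laptop", "mouse"],
]

camera_then_card = sequence_support(purchases, ["digital camera", "SD memory card"])
card_then_camera = sequence_support(purchases, ["SD memory card", "digital camera"])
```

Unlike an itemset, the pattern is order-sensitive: the two supports differ even though both patterns involve the same pair of items.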
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous amount of patterns and knowledge
Some may fit only a certain dimension space (time, location, etc.)
Some may not be representative, may be transient, etc.
Evaluation of mined knowledge: can we directly mine only interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle potentially terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining