Data Warehousing and Data Mining
INTRODUCTION
Lecture 1
Dr. Sultan Yahya Al-Sultan
Assistant Professor
Books
▶ Required Text
▶ Data mining: Concepts and Techniques, by Jiawei Han and
Micheline Kamber, Morgan Kaufmann, ISBN 1-55860-489-8
▶ Pang-Ning Tan, Michael Steinbach, Vipin Kumar:
Introduction to Data Mining. 2nd Edition. Pearson / Addison
Wesley.
▶ Attached books
▶ Data mining: Concepts and Techniques
Introduction to Data Mining:
▶ The book provides a comprehensive introduction to the concept of data
mining and its importance in extracting knowledge from large datasets.
1. Basic Concepts: It explains fundamental concepts in data mining such as exploratory
analysis, classification, clustering, and association analysis.
2. Key Techniques: The book covers important techniques such as neural networks, decision
trees, factor analysis, and classification using various algorithms.
3. Big Data Analysis: The book also addresses how to deal with big data and use data mining
techniques to extract knowledge from it.
4. Predictive Analysis: It presents methods for predictive analysis and how to use data mining
techniques to predict trends and future events.
5. Practical Applications: The book discusses case studies and practical applications of data mining in
various fields such as marketing, healthcare, and finance.
Course structure
▶ The course has two parts:
▶ Lectures - Introduction to the main topics
▶ Projects (done in groups)
▶ 1 project OR,
▶ 1 research project…but we will follow the project.
Grading
Attendance … %
Interaction …%
Assignments … % …
Midterm + others : …%
Projects: …%
Final Exam: ….%
Course Topics
▶ Introduction
▶ Data pre-processing
▶ Data warehousing
▶ Association rules and sequential patterns
▶ Classification (supervised learning)
▶ Clustering (unsupervised learning)
▶ Post-processing of data mining results
▶ Measures of Interestingness
▶ Objective Measures
▶ Subjective Measures
Rules
• Keep your phone silent
• Be in class on time
We are a…….
Lecture Outline
Motivation: Why Data Mining?
What is Data Mining?
History of Data Mining
Data Mining Tasks
Data Mining Techniques
Data Mining Applications
Are all the Patterns Interesting?
Issues in Data Mining
Data Mining Motivation
“The key in business is to know something that
nobody else knows.”
— Aristotle Onassis
“To understand is to perceive-recognize- patterns.”
— Sir Isaiah Berlin
Necessity is the Mother of Invention
Data explosion
Huge amount of data stored in databases, data warehouses and other
information repositories
This data needs data collection tools and advanced database technology, and
must be mined to discover new patterns and trends
We are drowning in a sea of data, but starving for
knowledge!
Solution
Data Mining
Extraction of interesting
knowledge (rules, regularities,
patterns, constraints) from data
in large databases
Data Warehousing and Storage → Analysis and extraction
Why Data mining?
▶ Huge amounts of data
▶ Electronic records of our decisions
▶ Choices in the supermarket
▶ Financial records
▶ Our comings and goings
▶ Data rich – but information poor
▶ We want to dig
Why Data Mining?
From a managerial perspective:
Analyzing trends
Wealth generation
Security
Strategic decision making
Why Data Mining?
▶ Huge amounts of data, for example:
▶ Google has petabytes of web data
▶ Facebook has billions of active users
▶ Amazon handles millions of visits/day
Data vs. Information
Society produces massive amounts of data
▶ business, science, medicine, economics, sports, …
▶ Raw data is useless
▶ need techniques to automatically extract information
▶ Data: raw facts
▶ Information: patterns that come from processed data
What is DATA MINING?
▶ Extracting or “mining” knowledge from large amounts of
data
▶ Discovery and modeling of hidden patterns (not previously
known) in large volumes of data
▶ Extraction of implicit, previously unknown and
unexpected, novel, potentially extremely useful
information from data
………..
Needs of different levels of Management
▶ Operational Level – Control oriented (Data)
▶ Tactical (Middle) Level – Planning and control oriented (Information)
▶ Strategic (Top) Level – Fundamentally planning oriented (Knowledge)
Knowledge pyramid
Wisdom
Intelligence = Knowledge + experience
Knowledge = Information + rules
Information = Data + context
Data
Data Mining
▶ Look for hidden patterns and trends in data that are not
immediately apparent from summarizing the data
▶ No Query…
▶ …But an “Interestingness criteria”
Data Mining
Data + Interestingness criteria = Hidden patterns
▶ Type of data
▶ Type of interestingness criteria
▶ Type of patterns
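The idea "data + interestingness criteria = hidden patterns" can be illustrated with a minimal sketch. It assumes a toy list of shopping baskets and uses minimum support (the fraction of baskets containing an itemset) as the interestingness criterion. The data, the function name `frequent_itemsets`, and the threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    It plays the role of the interestingness criterion in this sketch.
    """
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


patterns = frequent_itemsets(transactions, min_support=0.6)
for itemset, support in sorted(
    patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0]))
):
    print(sorted(itemset), support)
```

No query names the items in advance; changing `min_support` changes which patterns count as interesting.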
Data Mining is NOT
▶ Data Warehousing
▶ (Deductive) query processing
▶ SQL/ Reporting
▶ Software Agents
▶ Expert Systems
▶ Online Analytical Processing (OLAP)
▶ Statistical Analysis Tool
▶ Data visualization
Great Opportunities to Solve Society’s Major Problems
▶ Improving health care and reducing costs
▶ Predicting the impact of climate change
▶ Finding alternative/green energy sources
▶ Reducing hunger and poverty by increasing agriculture production
Data Mining Challenges
▶ Problem 1: most patterns are not interesting
▶ Problem 2: patterns may be inexact or entirely spurious
when the data are noisy
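Problem 2 can be demonstrated with a small sketch, assuming purely random data. With few samples and many candidate features, some feature will correlate strongly with the target by chance alone. The sample sizes, the seed, and the helper name `pearson` are assumptions made for this illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


random.seed(0)  # fixed seed so the run is repeatable
n_samples, n_features = 10, 500

# Target and features are independent noise: no real pattern exists.
target = [random.gauss(0, 1) for _ in range(n_samples)]
features = [
    [random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)
]

correlations = [pearson(f, target) for f in features]
best = max(correlations, key=abs)
print(f"strongest correlation found in pure noise: {best:.2f}")
```

The "pattern" reported here is fake by construction, which is why discovered patterns need evaluation before they are trusted.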
Multidisciplinary Field
Data Mining draws on:
▶ Database Technology
▶ Statistics
▶ Machine Learning
▶ Visualization
▶ Artificial Intelligence
▶ Other Disciplines
Data mining History
▶ Emerged in the late 1980s
▶ Grew during the 1990s
▶ Roots traced back along three family lines
▶ Classical Statistics
▶ Artificial Intelligence
▶ Machine Learning….Deep learning
Statistics
▶ Foundation of most DM technologies
▶ Regression analysis
▶ Standard distribution
▶ Deviation/Variance
▶ Cluster analysis
▶ Confidence intervals
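A few of the statistical building blocks listed above can be sketched with Python's standard library. The data values are hypothetical. The confidence interval uses a normal approximation, which is crude for a sample this small; a t-distribution would be more appropriate in practice.

```python
import statistics

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

# Deviation / variance of y (sample versions, dividing by n - 1).
mean_y = statistics.mean(y)
var_y = statistics.variance(y)
std_y = statistics.stdev(y)

# Regression analysis: least-squares line y = intercept + slope * x.
mean_x = statistics.mean(x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Confidence interval: approximate 95% interval for the mean of y,
# using the normal approximation (an assumption, see the note above).
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * std_y / len(y) ** 0.5
ci = (mean_y - half_width, mean_y + half_width)

print(f"variance={var_y:.3f}, std={std_y:.3f}")
print(f"fit: y = {intercept:.2f} + {slope:.2f} * x")
print(f"approx. 95% CI for mean of y: ({ci[0]:.2f}, {ci[1]:.2f})")
```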
Data Mining vs Statistical Inference
Statistics:
Conceptual Model (Hypothesis) → Statistical Reasoning → "Proof" (Validation of Hypothesis)
Data Mining vs Statistical Inference
Data mining:
Data → Mining Algorithm (based on Interestingness) → Pattern (model, rule, hypothesis) discovery
Artificial Intelligence
▶ Artificial intelligence is the branch of computer science that aims to
create computers that think like humans.
▶ Heuristics vs. Statistics
▶ Human-thought-like processing
▶ Requires vast computer processing power
▶ Supercomputers
Machine Learning
▶ Union of statistics and AI
▶ Combinations of AI heuristics with advanced statistical
analysis
▶ Machine learning lets computer programs
▶ learn about the data they study
▶ make different decisions based on the qualities of the studied data
▶ use statistics for fundamental concepts, adding more advanced AI
heuristics and algorithms
Data Mining: A KDD Process
▶ Data mining: the core of the knowledge discovery process.
KDD Process (in order):
Databases → Data Cleaning and Data Integration (data preprocessing) → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Steps of a KDD Process
1. First, ask:
1. What is the application domain?
2. What are the relevant prior knowledge and goals of the application?
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take 60% of the effort!)
4. Data reduction and transformation:
▶ Find useful features
▶ Variable reduction
▶ Invariant (constant) representation
▶ Choosing the functions of data mining
▶ Summarization, classification, regression, association, clustering
▶ Choosing the mining algorithm(s)
Steps of a KDD Process
5. Data mining: search for patterns of interest
6. Pattern evaluation and knowledge presentation
▶ Visualization
▶ Transformation
▶ Removing redundant patterns
▶ Then make use of the discovered knowledge
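The steps of the KDD process can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not a reference implementation. The records, the function names (`select`, `clean`, `transform`, `mine`, `evaluate`) and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer_id, age, basket).
raw_records = [
    (1, 34, ["bread", "milk"]),
    (2, None, ["bread", "butter"]),      # missing age (not needed here)
    (3, 51, ["Milk ", "bread"]),         # inconsistent spelling
    (3, 51, ["Milk ", "bread"]),         # duplicate record
    (4, 27, []),                         # empty basket
    (5, 45, ["bread", "milk", "eggs"]),
]


def select(records):
    """Step 2: keep only the fields the task needs (id and basket)."""
    return [(cid, basket) for cid, _age, basket in records]


def clean(rows):
    """Step 3: normalise item names, drop empty and duplicate rows."""
    seen, cleaned = set(), []
    for cid, basket in rows:
        items = frozenset(item.strip().lower() for item in basket)
        if not items or (cid, items) in seen:
            continue
        seen.add((cid, items))
        cleaned.append((cid, items))
    return cleaned


def transform(rows):
    """Step 4: reduce each row to the representation the miner needs."""
    return [items for _cid, items in rows]


def mine(baskets):
    """Step 5: count how often each pair of items occurs together."""
    pair_counts = Counter()
    for items in baskets:
        pair_counts.update(combinations(sorted(items), 2))
    return pair_counts


def evaluate(pair_counts, n_baskets, min_support):
    """Step 6: keep pairs that meet the support threshold, best first."""
    kept = {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))


baskets = transform(clean(select(raw_records)))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
print(patterns)
```

Most of the code deals with selection and cleaning rather than mining, which matches the remark that preprocessing takes a large share of the effort.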
Data Mining and Business Intelligence
Increasing potential to support business decisions (top to bottom):
▶ Making Decisions (End User)
▶ Data Presentation: Visualization Techniques (Business Analyst)
▶ Data Mining: Information Discovery (Data Analyst)
▶ Data Exploration: Statistical Analysis, Querying and Reporting
▶ Data Warehouses / Data Marts, OLAP… (DBA)
▶ Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Learning Algorithms
▶ Fundamental idea:
learn rules/patterns/relationships
automatically from the data
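A minimal sketch of this idea: instead of a person writing the rule, the program searches for the threshold that best separates labelled examples. The data and the function name `learn_threshold` are hypothetical. Real learning algorithms search far richer rule spaces.

```python
# Hypothetical labelled data: (hours_studied, passed_exam).
examples = [
    (1.0, False),
    (2.0, False),
    (3.5, False),
    (4.0, True),
    (5.5, True),
    (7.0, True),
]


def learn_threshold(examples):
    """Learn a rule of the form 'value >= threshold -> True'.

    Returns (threshold, accuracy) for the candidate threshold that
    classifies the most training examples correctly.
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_correct = None, -1
    for t in candidates:
        correct = sum((value >= t) == label for value, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold, best_correct / len(examples)


threshold, accuracy = learn_threshold(examples)
print(f"learned rule: pass if hours >= {threshold} (accuracy {accuracy:.0%})")
```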
Data Mining Tasks
1. Prediction Tasks
▶ Use some variables to predict unknown or future values of other variables
2. Description Tasks
▶ Find human-interpretable patterns that describe the data.
3. Common data mining tasks
▶ Classification [Predictive]
▶ Clustering [Descriptive]
▶ Association Rule Discovery [Descriptive]
▶ Sequential Pattern Discovery [Descriptive]
▶ Regression [Predictive]
▶ Deviation Detection [Predictive]
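The difference between a predictive and a descriptive task can be sketched on one-dimensional toy data. The example assumes a nearest-neighbour classifier for prediction and a two-centroid clustering loop for description. Both functions and all values are hypothetical, and `two_means` assumes the input contains at least two distinct values.

```python
def nearest_neighbour(train, point):
    """Classification (predictive): label of the closest labelled example."""
    _, label = min(train, key=lambda example: abs(example[0] - point))
    return label


def two_means(values, iterations=10):
    """Clustering (descriptive): split 1-D values around two centroids.

    Assumes at least two distinct values, so neither group is empty.
    """
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_a) / len(group_a)
        high = sum(group_b) / len(group_b)
    return sorted(group_a), sorted(group_b)


# Predictive: labels are known for past cases and predicted for a new one.
train = [(20, "low"), (25, "low"), (60, "high"), (70, "high")]
predicted = nearest_neighbour(train, 30)
print("predicted class for 30:", predicted)

# Descriptive: no labels, the groups are found in the data itself.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
cluster_a, cluster_b = two_means(values)
print("clusters:", cluster_a, cluster_b)
```

Classification needs labelled examples (supervised learning), while clustering works on unlabelled values (unsupervised learning).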
The End
My best wishes for success to all of you….
Dr. Sultan Yahya Al-Sultan
Assistant Professor