Data Mining
1. Database Data
Stored in: Relational databases (tables with rows and
columns).
Examples: Customer records, employee data, item listings.
Structure: Uses a schema (e.g., customer(custID, name, age,
income...)).
Tools: SQL for querying and aggregating data.
Data Mining Use: Identify trends, patterns, and deviations
(e.g., predicting credit risk, analyzing sales).
2. Data Warehouses
Stored in: Centralized repositories integrating data from
multiple sources.
Structure: Multidimensional (data cubes).
Purpose: Historical analysis and business decision support.
Features:
o Organized by subject (customer, item, time, etc.)
o Uses OLAP operations like roll-up and drill-down.
Data Mining Use: Discover patterns at various levels of
granularity, enable exploratory analysis.
3. Transactional Data
Stored in: Flat files or tables representing individual
transactions.
Structure: Each record contains a transaction ID and items
involved (e.g., T100: I1, I3, I8).
Examples: Retail sales, flight bookings, clickstream data.
Data Mining Use:
o Market basket analysis
o Frequent itemset mining to discover what products are
often bought together.
Other Forms of Data (Mentioned briefly)
Data Streams
Sequence Data
Graph or Network Data
Spatial, Text, Multimedia Data
Web Data (WWW)
| Feature | Database Data | Data Warehouse | Transactional Data |
|---|---|---|---|
| Definition | Structured data stored in relational databases (RDBMS) using tables. | Integrated, historical data from multiple sources stored for analytical purposes. | Data representing real-world transactions or events, often sequential and time-stamped. |
| Purpose | Real-time operations and day-to-day transactions (OLTP). | Strategic decision support and historical analysis (OLAP). | Capturing and analyzing individual user or business actions (e.g., purchases). |
| Structure | Tables (relations) with rows (tuples) and columns (attributes). | Multidimensional data cubes with summarized information. | Flat files or nested tables with transaction IDs and lists of items or events. |
| Storage System | Relational Database Management System (RDBMS) such as MySQL, Oracle, SQL Server. | Centralized data warehouse (e.g., Amazon Redshift, Snowflake, Google BigQuery). | Flat files, NoSQL DBs, or specialized transactional systems. |
| Schema Type | Normalized schemas (3NF or ER models). | Star or snowflake schemas for fast aggregation and querying. | Often denormalized or semi-structured (list of items per transaction). |
| Data Granularity | Fine-grained (detailed individual records). | Aggregated (summaries over time or groups). | Fine-grained (detailed per transaction). |
| Data Sources | Single operational system (e.g., POS system). | Multiple heterogeneous sources (e.g., regional DBs, logs). | Point-of-sale systems, sensors, web logs, etc. |
| Update Frequency | Frequently updated (daily, hourly). | Periodically updated (daily, weekly, monthly). | Continuously updated or appended. |
| Examples | customer(custID, name, age, income); item(itemID, price, category) | Sales by region and time; quarterly product performance | T100: [I1, I3, I8]; clickstream log: UserID: [Page1, Page3, Page7] |
| Query Type | SQL-based queries: SELECT, JOIN, GROUP BY, etc. | OLAP queries: drill-down, roll-up, slice, dice | Pattern mining: association rule mining, sequence pattern mining |
| Mining Techniques | Classification, clustering, outlier detection, regression | Multidimensional pattern mining, trend analysis, anomaly detection | Association rule mining, sequential pattern mining, market basket analysis |
| Use Cases | Predicting customer churn, identifying fraud, customer segmentation. | Strategic decisions such as which regions are underperforming, product trend analysis. | Recommending products, detecting buying behavior, promotion bundling. |
| Tools/Technologies | SQL, DBMS (MySQL, PostgreSQL), Python (pandas) | OLAP tools (Tableau, Power BI), ETL pipelines, cube computation | Apache Hadoop, Spark, NoSQL, association rule mining (Apriori, FP-Growth) |
Functionalities
1. Characterization and Discrimination (Class/Concept
Description)
➤ Characterization:
Describes the general features of data belonging to a target class.
Provides a concise summary, usually through descriptive
statistics, OLAP operations, or attribute-oriented induction.
Example: A retail manager wants to know the profile of customers
who spend over $5000/year. The result might show that they are
typically middle-aged, employed, and have good credit ratings.
Output can be presented in the form of:
o Charts (bar, pie)
o Generalized relations
o Characteristic rules (e.g., "If income > 50K → likely to spend >
$5000")
It's used for summarizing and understanding data patterns in a
group.
➤ Discrimination:
Compares the features of a target class against one or more
contrasting classes.
Example: Comparing customers who shop frequently for computer
products vs. those who shop rarely. Differences may include age,
education, etc.
Helps identify features that distinguish between groups (e.g., age,
occupation).
Often results in discriminant rules, e.g., “If age between 20-40
and education = university → frequent buyer.”
Key Difference: Characterization is about describing one group;
discrimination is about comparing multiple groups.
2. Mining Frequent Patterns, Associations, and Correlations
Aims to find repetitive patterns, associations, or correlations
in large datasets.
This includes:
o Frequent Itemsets: Sets of items that often appear together
in transactions (e.g., bread and butter).
o Sequential Patterns: Items purchased in a sequence (e.g.,
laptop → camera → memory card).
o Substructures: Patterns in structural forms like graphs or
trees.
➤ Association Rule Mining:
Example: “buys(X, 'computer') → buys(X, 'software') [support: 1%,
confidence: 50%]”
o Means 1% of transactions include both, and 50% of computer
buyers also buy software.
Can be single-dimensional (same predicate) or multi-
dimensional (age, income, buys).
Used in market basket analysis, cross-selling, product
recommendations.
➤ Correlation Analysis:
Goes beyond co-occurrence to measure statistical significance
between items (e.g., chi-square test).
Importance: Helps in identifying what tends to happen together in
data, enabling targeted marketing, inventory planning, and more.
3. Classification and Regression (Predictive Analysis)
➤ Classification:
Builds a model (classifier) that assigns data to predefined
categories or classes.
Requires labeled training data.
Output can be:
o IF-THEN rules
o Decision trees
o Neural networks
o SVMs, k-NN, Bayesian classifiers
Example: Classifying items based on sales response (good, mild,
none).
Used in spam detection, credit scoring, disease diagnosis, etc.
➤ Regression:
Predicts continuous numeric values, not categories.
Example: Predicting the expected revenue from a product.
Methods include linear regression, polynomial regression, and
advanced ML techniques.
Used in forecasting, pricing models, stock prediction.
Key Difference: Classification predicts discrete labels, regression
predicts continuous values.
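As an illustrative sketch (not from the source), the snippet below contrasts the two tasks with scikit-learn: a decision tree assigns discrete class labels, while linear regression predicts a continuous value. The synthetic data and parameter choices are assumptions made purely for demonstration.

```python
# Sketch: classification vs. regression on synthetic data (assumes scikit-learn).
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a discrete label (e.g., sales response "good"/"mild"/"none")
Xc, yc = make_classification(n_samples=200, n_features=5, n_classes=3,
                             n_informative=3, random_state=0)
clf = DecisionTreeClassifier(max_depth=3).fit(Xc, yc)
print("predicted class:", clf.predict(Xc[:1]))

# Regression: predict a continuous value (e.g., expected revenue)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("predicted value:", reg.predict(Xr[:1]))
```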
4. Cluster Analysis
Groups a set of objects into clusters so that:
o Intra-cluster similarity is high
o Inter-cluster similarity is low
No prior labels are required (unsupervised learning).
Each cluster can later be treated as a class for further analysis.
Example: Segmenting customers based on purchasing behavior or
geographic location.
Common algorithms include k-means and DBSCAN; results are often visualized using 2-D/3-D plots.
Applications include customer segmentation, image recognition,
bioinformatics.
Purpose: To discover natural groupings within data without predefined
categories.
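A minimal sketch of customer segmentation with k-means follows (assumes scikit-learn; the two features, annual spend and visit frequency, are hypothetical):

```python
# Sketch: grouping customers into clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = np.vstack([
    rng.normal([200, 2], 20, size=(50, 2)),     # low spend, rare visits
    rng.normal([5000, 25], 300, size=(50, 2)),  # high spend, frequent visits
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_[:5])         # cluster assignment per customer
print(km.cluster_centers_)    # centroid of each segment
```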
5. Outlier Analysis
Identifies data objects that deviate significantly from the
general pattern.
Such data points are called outliers or anomalies.
Useful in applications where rare events are more important than
common ones:
o Fraud detection (e.g., unusual credit card activity)
o Intrusion detection
o Medical anomalies
Techniques:
o Statistical methods (assuming distribution models)
o Distance-based methods (objects far from others)
o Density-based methods (like LOF – Local Outlier Factor)
Not all outliers are noise—many are insightful and can drive
important decisions.
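As a small sketch of a density-based method, the snippet below applies Local Outlier Factor (LOF) to synthetic points; the data and parameters are illustrative assumptions, not from the source:

```python
# Sketch: density-based outlier detection with LOF (assumes scikit-learn).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(100, 2))          # bulk of the data
anomalies = np.array([[8.0, 8.0], [-7.0, 9.0]])   # points far from the general pattern
X = np.vstack([normal, anomalies])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)    # -1 marks outliers, 1 marks inliers
print("flagged as outliers:", np.where(labels == -1)[0])
```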
Technologies Used in Data Mining
Data mining is an application-driven field that integrates various
techniques from multiple disciplines to extract valuable insights from data.
These include:
1. Statistics
o Role in Data Mining: Statistics is used to model data and
target classes. A statistical model describes the behavior of
data using mathematical functions and probability
distributions.
o Applications in Data Mining:
Data Characterization and Classification: Statistical
models can be used to classify and characterize data.
Handling Noise and Missing Data: Statistics helps in
modeling and handling noisy or missing data during the
data mining process.
Prediction and Forecasting: Statistical models are
key for prediction tasks, providing a framework for
making inferences about the data.
Verifying Data Mining Results: After building
classification or prediction models, statistical hypothesis
testing helps verify their accuracy and significance.
o Challenges: Scaling statistical methods for large datasets is
complex due to computational costs. This issue is exacerbated
for online applications requiring real-time processing.
2. Machine Learning
o Role in Data Mining: Machine learning focuses on enabling
computers to learn patterns and make decisions based on
data. Machine learning is used in data mining for tasks like
classification and clustering.
o Types of Learning Methods:
Supervised Learning (Classification): Involves
training a model with labeled data to recognize patterns,
such as recognizing postal codes from handwritten
images.
Unsupervised Learning (Clustering): The model
learns from data without labels, finding hidden patterns
or groups (e.g., recognizing different digits in
handwritten data without predefined labels).
Semi-supervised Learning: Combines both labeled
and unlabeled data. Labeled data helps build models,
while unlabeled data helps refine the model's
boundaries, improving accuracy.
Active Learning: The model actively queries humans
(domain experts) to label uncertain data points, thus
improving the model with minimal human input.
o Challenges: While machine learning focuses on accuracy,
data mining also emphasizes efficiency, scalability, and
handling diverse types of data.
3. Database Systems and Data Warehouses
o Role in Data Mining: Database systems handle the storage,
management, and retrieval of data, and they play a crucial
role in ensuring that data mining can scale to large datasets.
o Data Warehousing: A data warehouse integrates data from
various sources and timeframes into a unified structure. It
enables advanced data analysis by consolidating data into
multidimensional space, known as data cubes.
o Data Mining Integration: Modern database systems often
incorporate data mining capabilities to extend their analytic
power. Data mining tools can operate directly on data stored
in databases to identify patterns.
o Challenges: Data mining often involves working with real-
time streaming data, which requires efficient database
technologies to process large volumes of data quickly.
4. Information Retrieval (IR)
o Role in Data Mining: Information retrieval involves
searching and retrieving relevant documents or information
from a large database or the web. Unlike database systems, IR
deals with unstructured data (e.g., text or multimedia).
o Probabilistic Models: IR uses probabilistic models to
measure the similarity between documents. Text documents
are often represented as a bag of words, where the presence
and frequency of words are important, but word order is not.
o Topic Modeling: IR systems use models to identify
underlying topics in collections of documents. These topics are
represented as probability distributions over a vocabulary, and
documents may belong to multiple topics.
o Integration with Data Mining: Combining IR with data
mining techniques enables deeper analysis of text and
multimedia data, facilitating better search and analysis of
large, unstructured datasets (e.g., web data, digital libraries,
healthcare records).
Applications of Data Mining
Data mining plays a vital role in various fields where large amounts of
data need to be analyzed. Here are two major applications:
1. Business Intelligence (BI)
o Purpose: To understand business contexts such as
customers, market trends, and competitors.
o Key Techniques:
Classification and Prediction for sales, market
analysis, and customer feedback.
Clustering for Customer Relationship Management
(CRM), grouping customers by similarities.
Characterization mining for understanding customer
groups and developing tailored programs.
o Importance: BI allows businesses to make smart decisions,
retain valuable customers, and gain insights into competitors.
Without data mining, effective market analysis would be
difficult.
2. Web Search Engines
o Purpose: To retrieve information from the web in response to
user queries.
o Techniques Used:
Crawling: Deciding which web pages to crawl and how
frequently.
Indexing: Choosing which pages to index and how to
structure the index.
Ranking: Determining how to rank pages based on
relevance and quality.
o Challenges:
Data Volume: Search engines deal with massive
amounts of data, requiring cloud computing for
processing.
Real-Time Processing: Search engines need to
respond to user queries instantly, often requiring
continuous updates and real-time data mining.
Small Data Issues: Many queries are asked rarely,
posing a challenge for mining methods designed for
large datasets.
Major Issues in Data Mining
Data mining, being a rapidly evolving field, faces several challenges and
open research areas. These challenges can be categorized into five main
groups:
1. Mining Methodology
o New Knowledge Types: Data mining covers a broad range
of tasks (e.g., classification, regression, clustering), and as
new applications emerge, new mining techniques are
developed.
o Multidimensional Data Mining: Mining knowledge across
different dimensions, such as combining various attributes in
data cubes.
o Interdisciplinary Approaches: Integrating methods from
natural language processing, software engineering, and other
fields enhances data mining.
o Handling Uncertainty: Dealing with noisy or incomplete
data is a significant challenge in data mining, requiring
techniques like data cleaning and outlier detection.
2. User Interaction
o Interactive Mining: The mining process should be flexible
and dynamic, allowing users to refine searches and explore
data interactively.
o Incorporation of Background Knowledge: Including
domain-specific knowledge, constraints, or rules can guide the
mining process towards more useful results.
o Data Mining Query Languages: High-level languages or
interfaces allow users to define and optimize ad hoc queries,
making the process more user-friendly.
o Visualization of Results: Presenting mining results in an
understandable, visually intuitive way is crucial for the
usability of data mining systems.
3. Efficiency and Scalability
o Algorithm Efficiency: Data mining algorithms need to
handle large datasets quickly and efficiently, especially as
data volumes grow.
o Parallel and Distributed Mining: Large datasets often
require parallel processing across distributed systems. Cloud
and cluster computing are common methods to scale data
mining processes.
o Incremental Mining: Incremental algorithms that can update
models as new data arrives without reprocessing all existing
data are a key area of research.
4. Diversity of Data Types
o Complex Data Types: Data mining must handle a variety of
data types, from structured databases to unstructured data
like text and images.
o Dynamic Data: Some data, like online data streams or real-
time sensor data, change constantly, presenting challenges
for traditional mining methods.
o Interconnected Data: Many datasets are linked (e.g., social
networks, web data), requiring mining techniques that can
handle and exploit these connections.
5. Data Mining and Society
o Social Impact: Data mining affects privacy, security, and
social dynamics. How can we use data mining for societal
benefit while preventing misuse?
o Privacy-Preserving Mining: Safeguarding individuals'
privacy while conducting data mining is crucial. Ongoing
research focuses on privacy-preserving data mining methods.
o Invisible Data Mining: Many systems perform data mining
behind the scenes without users' awareness. For instance, e-
commerce sites track user behavior to recommend products.
What is a Data Warehouse?
A data warehouse refers to a central repository where data from
different sources is stored and organized for analysis and decision-making.
It allows businesses to store historical data that supports strategic
decisions. Data warehouses are essential in today’s competitive world as
organizations use them to gain insights into various aspects of their
operations and make informed decisions.
Key Features of a Data Warehouse
1. Subject-Oriented:
o Data warehouses are designed around major subjects of
interest such as customers, products, suppliers, and sales.
o Unlike operational databases, which focus on day-to-day
transactions, data warehouses are structured to provide a
more analytical view that helps decision-makers.
o Data is organized to reflect decision support processes, not
operational activities.
2. Integrated:
o A data warehouse integrates data from multiple
heterogeneous sources like relational databases, flat files, and
transaction logs.
o It ensures consistency in naming conventions, data formats,
and attribute measures.
o The data is cleaned and standardized before being loaded into
the warehouse.
3. Time-Variant:
o Data in a warehouse is typically historical, covering several
years (e.g., 5-10 years) to help analyze trends over time.
o Each data set within a warehouse includes a time element,
either implicitly or explicitly, to track changes and trends over
time.
4. Nonvolatile:
o Once data is stored in a warehouse, it is not changed. New
data is only appended to the system.
o A data warehouse does not require the mechanisms for
transaction processing, recovery, or concurrency control that
are needed in operational databases.
o The primary operations in a data warehouse are data loading
and querying.
Functions of a Data Warehouse
A data warehouse consolidates large amounts of data for analysis
and decision-making purposes. It is not primarily designed for
transactional operations but rather for answering complex queries
and providing insights into various business activities.
A data warehouse is often constructed by integrating data from
multiple sources, using processes like data cleaning, data
integration, and data consolidation.
Decision support technologies are used to query the data,
generate reports, and make strategic decisions based on the
insights derived from the data.
Difference between Database (OLTP) and Data Warehouse (OLAP)
| Feature | OLTP (Operational Database System) | OLAP (Data Warehouse System) |
|---|---|---|
| Purpose | Handles day-to-day transactions and query processing | Supports data analysis and decision making |
| Users | Clerks, clients, IT professionals (customer-oriented) | Managers, executives, analysts (market-oriented) |
| Data Content | Current, detailed, real-time transactional data | Historical, aggregated, summarized data |
| Data Volume | Typically smaller; focused on current data | Very large; includes years of historical data |
| Database Design | Entity-Relationship (ER) model; application-oriented | Star or snowflake schema; subject-oriented |
| View of Data | Narrow view, specific to department/enterprise | Broad view, integrates data from multiple sources |
| Query Characteristics | Simple, short, atomic queries and transactions | Complex queries, often involving aggregation |
| Operations | Frequent inserts, updates, deletes (write-heavy) | Primarily read-only operations (read-heavy) |
| Concurrency Control | Requires concurrency control and recovery mechanisms | Less need for concurrency, due to read-only nature |
| Access Patterns | High transaction throughput, quick response time | Low latency not critical; focus on complex query performance |
| Performance Metrics | Measured by number of transactions per second | Measured by query response time and analytical capability |
| Data Granularity | Very detailed, fine-grained data | Data stored at multiple levels of granularity (from detailed to summarized) |
| Frequency of Access | Constant, very frequent | Periodic, depending on analysis/reporting needs |
| Data Integration | Minimal; single source systems | High; integrates data from multiple heterogeneous sources |
| Storage Medium | Typically stored on a single system or server | Distributed storage systems due to large volume |
Data Warehousing: A Multitiered Architecture
Data warehouses are designed using a three-tier architecture, which
helps in separating data storage, data processing, and data presentation.
This architecture ensures scalability, flexibility, and efficient data
management. Here's a breakdown of each tier:
1. Bottom Tier: Data Warehouse Server
Role: This tier is responsible for storing the actual data.
Technology Used: Usually a relational database management
system (RDBMS).
Functions:
o Data Extraction: Pulls data from various operational and
external sources.
o Data Cleaning: Removes errors, inconsistencies, and
duplicates.
o Data Transformation: Converts data into a common, unified
format.
o Data Loading: Transfers the processed data into the data
warehouse.
o Data Refreshing: Periodically updates the data warehouse to
reflect recent changes.
Data Sources:
o Operational Databases: e.g., banking systems, sales
systems.
o External Sources: e.g., market research reports, customer
profiles from third parties.
Tools Used:
o Gateways (APIs) to connect and query the source systems:
ODBC (Open Database Connectivity)
OLEDB (Object Linking and Embedding Database)
JDBC (Java Database Connectivity)
Metadata Repository:
o Stores information about data (like source, format,
transformations applied).
o Acts as a directory for warehouse management and query
optimization.
2. Middle Tier: OLAP Server
Role: Acts as the processing layer, converting data into a form
suitable for analysis.
Two Main OLAP Models:
1. ROLAP (Relational OLAP):
Works on top of relational databases.
Converts multidimensional operations into relational
queries.
Suitable for handling large volumes of data.
2. MOLAP (Multidimensional OLAP):
Uses specialized multidimensional data structures
(cubes).
Faster for complex analytical queries, but may have
storage limitations.
Functionality:
o Supports advanced analytical processing, including
summarization, aggregation, and complex computations.
o Optimized for read-heavy operations.
3. Top Tier: Front-End Tools
Role: This is the user interface layer, where users interact with
the system.
Components:
o Query and Reporting Tools: For generating standard or
custom reports.
o Data Analysis Tools: For ad-hoc querying, slicing, dicing, and
drill-down analysis.
o Data Mining Tools: For predictive modeling, clustering, trend
analysis, etc.
Users:
o Business Analysts
o Executives and Managers
o Decision Makers
Functionality:
o Provides a visual and interactive environment for
exploring and analyzing data.
o Supports dashboards, charts, graphs, and other visualizations.
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
1. Enterprise Data Warehouse (EDW)
Definition: A centralized data warehouse that stores information
from across the entire organization.
Scope: Corporate-wide, cross-functional.
Data:
o Includes both detailed and summarized data.
o Integrated from multiple operational systems or external
sources.
Implementation:
o Requires extensive business modeling.
o Typically built on mainframes, superservers, or parallel
systems.
o May take months or years to design and deploy.
Advantages:
o Single source of truth.
o High consistency and integration.
Disadvantages:
o Time-consuming and expensive to build.
o Inflexible in dynamic environments.
2. Data Mart
Definition: A smaller, focused version of a data warehouse that
stores data for a specific business line or department (e.g.,
marketing, sales).
Scope: Departmental or subject-specific.
Data:
o Typically summarized and related to specific business needs.
Types:
o Independent Data Mart: Sourced directly from operational
systems or external providers.
o Dependent Data Mart: Sourced from an existing enterprise
data warehouse.
Implementation:
o Uses low-cost servers (e.g., Linux, Windows).
o Takes weeks to build (faster ROI).
Advantages:
o Quick to implement.
o Cost-effective.
o Flexible and adaptable to specific needs.
Disadvantages:
o Risk of data silos.
o Complex integration later if not aligned with enterprise
strategy.
3. Virtual Warehouse
Definition: A set of virtual views over operational databases.
Implementation:
o Does not store data physically.
o Queries are processed in real time using views.
Advantages:
o Easy and fast to build.
o Cost-efficient (no extra storage).
Disadvantages:
o Performance depends on operational systems.
o Requires high processing capacity for complex queries.
o Limited historical data analysis.
Top-Down vs. Bottom-Up Approaches to Data Warehouse
Development
| Aspect | Top-Down Approach | Bottom-Up Approach |
|---|---|---|
| Start Point | Begin with enterprise data warehouse | Start with departmental data marts |
| Time & Cost | High cost and long duration | Low cost, faster implementation |
| Flexibility | Less flexible | More adaptable |
| Integration | Minimizes integration issues later | May lead to integration challenges |
| Suitability | Best for long-term strategic planning | Best for tactical and quick solutions |
| Risk | High initial investment with late returns | Quick wins but may cause silo issues |
Recommended Approach: Incremental & Evolutionary
A hybrid approach is often best — combining top-down planning with
bottom-up implementation. The steps are:
1. Define High-Level Corporate Data Model
o Done within 1–2 months.
o Ensures consistent view of data across the organization.
2. Implement Independent Data Marts
o Developed in parallel using the high-level model.
o Quick deployment, department-level use.
3. Construct Distributed Data Marts
o Integrate various marts via hub servers.
o Enables data sharing across business units.
4. Build Multitier Data Warehouse
o Centralized Enterprise Data Warehouse becomes the
primary data store.
o Distributes data to dependent data marts as needed.
📊 Data Warehouse Modeling: Data Cube and OLAP
✅ Overview
Data Warehouses and OLAP (Online Analytical Processing)
tools are built on the multidimensional data model.
This model visualizes data as a data cube, which allows for
interactive analysis of multidimensional data.
The modeling supports advanced operations like roll-up, drill-
down, and slicing/dicing to enable deep business insights.
Data Cube: A Multidimensional Data Model
🔹 What is a Data Cube?
A data cube allows data to be modeled in n-dimensions (not just
3D).
It is defined by:
o Dimensions: The perspectives for analysis (e.g., time, item,
location).
o Facts/Measures: Quantitative data (e.g., dollars sold, units
sold).
🔹 Key Concepts:
Dimensions:
o Examples: time, item, branch, location.
o Each has a dimension table (e.g., for item: item name,
brand, type).
Fact Table:
o Contains numeric measures like dollars sold, units sold, etc.
o Links to each dimension via foreign keys.
🧊 Representation:
2-D Cube: Like a spreadsheet/table (e.g., time × item, for location
= Vancouver).
3-D Cube: time × item × location.
4-D Cube: time × item × location × supplier — hard to visualize but
conceptually a series of 3D cubes.
📐 Cuboids and Lattice
🔸 What is a Cuboid?
A cuboid is a cube at a certain level of summarization (group-
by).
Base Cuboid: The lowest level (e.g., time, item, location, supplier).
Apex Cuboid (0-D): The highest level — summarized over all
dimensions.
🔸 Data Cube Lattice
Given n dimensions, 2^n possible cuboids exist.
Forms a lattice structure, representing all possible levels of
summarization.
🔹 Example from the notes:
For dimensions: time, item, location, supplier, the lattice includes:
0-D Cuboid: total sales (summarized across all dimensions)
1-D Cuboids: {time}, {item}, {location}, {supplier}
2-D Cuboids: {time, item}, {item, supplier}, etc.
3-D Cuboids: {time, item, location}, etc.
4-D Cuboid: {time, item, location, supplier} (base cuboid)
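A quick sketch (illustrative, using Python's itertools) enumerates the 2^n cuboids of this lattice for the four dimensions above:

```python
# Sketch: enumerating all 2^n cuboids of the data cube lattice.
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]
cuboids = [combo for k in range(len(dimensions) + 1)
           for combo in combinations(dimensions, k)]

print(len(cuboids))   # 2^4 = 16 cuboids
print(cuboids[0])     # () -> apex (0-D) cuboid: total sales over everything
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid
```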
🔷 Multidimensional Schema Models
⭐ 1. Star Schema
Structure:
o Central fact table (large, non-redundant).
o Connected dimension tables (flat, possibly redundant).
Pros:
o Simple, fast query performance.
o Easy to understand.
Cons:
o Some redundancy in dimension tables.
Use case: Most common in data marts.
❄️2. Snowflake Schema
Structure:
o Like a star schema but dimension tables are normalized into
sub-tables.
Pros:
o Reduces redundancy.
o Easier maintenance.
Cons:
o More complex queries due to joins.
o Slight performance trade-off.
Use case: Less common; used when storage efficiency is more
critical.
🌌 3. Fact Constellation (Galaxy Schema)
Structure:
o Multiple fact tables sharing dimension tables.
Pros:
o Models multiple interrelated subjects.
o Captures enterprise-wide data.
Cons:
o Complex structure.
Use case: Suitable for enterprise data warehouses.
🌐 Concept Hierarchies for Dimensions
What is a Concept Hierarchy?
Maps low-level values (e.g., city) to higher-level concepts (e.g.,
country).
Helps summarize or roll-up data in OLAP operations.
Types:
Schema Hierarchy: Total/partial order (e.g., street < city <
province < country).
Lattice: Partial order where attributes don’t follow a single path
(e.g., week < year; day < month).
Set-grouping Hierarchy: Value ranges grouped (e.g., price
ranges: ($0–$200], ($200–$400], ...).
Can be manually defined or automatically generated.
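A minimal sketch of a set-grouping hierarchy built with pandas.cut follows; the bin edges and labels (cheap/moderate/expensive) are illustrative choices, not from the source:

```python
# Sketch: grouping raw prices into a set-grouping hierarchy with pandas.cut.
import pandas as pd

prices = pd.Series([45, 150, 320, 780, 1200])
levels = pd.cut(prices,
                bins=[0, 200, 400, float("inf")],
                labels=["cheap", "moderate", "expensive"])
print(levels.tolist())   # ['cheap', 'cheap', 'moderate', 'expensive', 'expensive']
```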
🧮 Measures in a Data Cube
Measures = Numeric values aggregated over dimension values
(e.g., total sales).
| Type | Definition | Examples | Efficiency |
|---|---|---|---|
| Distributive | Can be computed from subaggregates and combined. | sum(), count(), min(), max() | Very efficient |
| Algebraic | Computed using a fixed number of distributive aggregates. | avg() = sum()/count(), stddev() | Efficient |
| Holistic | Requires a full data scan; cannot be broken into subaggregates. | median(), mode(), rank() | Inefficient |
⚠️Most OLAP tools focus on distributive and algebraic measures for
performance.
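A tiny sketch (illustrative Python, made-up numbers) shows why avg() is algebraic: each partition keeps only the distributive pair (sum, count), and the global average is recombined from a fixed number of subaggregates without rescanning rows.

```python
# Sketch: combining distributive subaggregates into an algebraic measure.
partitions = [[10, 20, 30], [40, 50], [60]]

sub = [(sum(p), len(p)) for p in partitions]   # (sum, count) per partition
total_sum = sum(s for s, _ in sub)             # sum() is distributive
total_count = sum(c for _, c in sub)           # count() is distributive
print(total_sum / total_count)                 # algebraic avg() = 35.0

# A holistic measure such as median() cannot be combined this way;
# it needs access to the full set of values.
```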
OLAP Operations
🔁 1. Roll-Up
Definition: Aggregates data by climbing up a concept hierarchy or
by reducing dimensions.
Example: Aggregating sales data from city to country (Toronto →
Canada).
Also called: Drill-Up (by some vendors).
🔽 2. Drill-Down
Definition: The reverse of roll-up; navigates from summary data to
more detailed data.
Example: Moving from quarterly sales data to monthly sales
data.
Also includes: Adding a new dimension (e.g., customer group) for
more detail.
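As an illustrative sketch (hypothetical toy data, pandas group-bys rather than a real OLAP engine), roll-up and drill-down correspond to aggregating at coarser or finer levels of the concept hierarchy:

```python
# Sketch: roll-up (city -> country) and drill-down (quarter -> month) with pandas.
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "month":   ["Jan", "Apr", "Feb", "May"],
    "dollars_sold": [1000, 1200, 800, 950],
})

# Roll-up: climb the location hierarchy from city to country
print(sales.groupby("country")["dollars_sold"].sum())

# Drill-down: move from quarterly summaries to monthly detail
print(sales.groupby(["quarter", "month"])["dollars_sold"].sum())
```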
🧊 3. Slice
Definition: Selects a single dimension value, resulting in a subcube.
Example: Selecting data where time = Q1 only.
🧊🧊 4. Dice
Definition: Selects a range of values on two or more dimensions,
resulting in a subcube.
Example: Data for location = Toronto or Vancouver, time = Q1
or Q2, and item = home entertainment or computer.
🔄 5. Pivot (Rotate)
Definition: Rotates the cube to view data from different
perspectives.
Example: Swapping the axes item and location for alternate visual
layout.
🔍 6. Drill-Across
Definition: Executes queries across multiple fact tables.
🧱 7. Drill-Through
Definition: Accesses the bottom-level data in the data cube using
SQL, typically reaching into backend relational tables.
📊 8. Other Advanced Operations
Examples:
o Top-N/Bottom-N ranking.
o Moving averages, growth rates, depreciation.
o Currency conversion, internal return rates.
o Forecasting, trend/statistical analysis, variance
calculations.
💡 Role of Concept Hierarchies in OLAP
Enable aggregation and drilling at various levels of detail.
Facilitate multilevel data exploration across dimensions.
Concept Hierarchies
1. Definition
Concept hierarchy is a sequence of mappings from low-level
(specific) concepts to high-level (general) concepts.
Purpose: Allows multilevel data abstraction.
2. Types
Schema Hierarchies (based on database attributes):
o Example for location: Street < City < Province/State <
Country
o Example for time: Day < Month < Quarter < Year
Lattice Structure:
o Supports partial orders, e.g., Day < Month < Quarter, and
Week < Year.
Set-Grouping Hierarchies:
o Created by grouping values into ranges or categories.
o Example for price: $0–$100, $100–$200, etc.
o User-defined groups: cheap, moderate, expensive
Multiple Hierarchies:
o A single attribute can have multiple concept hierarchies
depending on the analysis (e.g., price can be by range or by
category).
3. Sources of Concept Hierarchies
Manual: Provided by users, domain experts.
Automatic: Generated using statistical analysis (e.g., clustering).
📏 Measures: Categorization & Computation
Measure: A numerical value computed for each multidimensional
point (e.g., sales).
Categories of Measures:
| Type | Definition | Examples | Notes |
|---|---|---|---|
| Distributive | Can be computed in parts and then aggregated. | sum(), count(), min(), max() | Easy and efficient to compute. |
| Algebraic | Computed from a fixed number (M) of distributive measures. | avg() = sum()/count(), stddev() | Depends on multiple distributive functions. |
| Holistic | Cannot be expressed with a bounded number of distributive results. | median(), mode(), rank() | Complex; may require approximation. |
🧠 OLAP Engine Capabilities
Enables complex analytical computations.
Supports:
o Aggregations, hierarchies, ratios
o Forecasting, trend and statistical analysis
Provides a user-friendly, interactive environment for querying
multidimensional data.
Detailed Definition of Mining Frequent Patterns
Frequent Pattern Mining is a fundamental task in data mining that
involves discovering patterns (like itemsets, sequences, or
structures) that occur frequently in a dataset. These patterns
reveal relationships and associations between data items that can
be useful in decision-making, prediction, recommendation, and
classification.
Key Concepts in Frequent Pattern Mining
1. Frequent Pattern
A frequent pattern is a set of items, subsequences, or structures that
appear together frequently in a dataset.
Examples:
Frequent Itemset: {milk, bread} appears together in many
transactions.
Frequent Sequential Pattern: <PC → Digital Camera → Memory
Card> appears in many customer purchase histories.
Frequent Structured Pattern: A frequently recurring subgraph in
a chemical compound dataset.
2. Itemset
A collection of one or more items. For instance, in a supermarket:
{milk}, {milk, bread}, {bread, butter, eggs} are itemsets.
A k-itemset contains k items.
3. Support
Support of an itemset is the proportion (or count) of transactions
that contain the itemset.
It measures how frequently an itemset appears in the dataset.
Support(A) = (Number of transactions containing A) / (Total number of
transactions)
4. Confidence
Confidence of a rule A → B is the probability that transactions
containing A also contain B.
Confidence(A → B) = Support(A ∪ B) / Support(A)
It shows how reliable the rule is.
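A minimal sketch (toy transactions chosen for illustration) computes both measures directly from their definitions:

```python
# Sketch: support and confidence over a toy transaction list.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"milk", "bread"}))        # 2/4 = 0.5
print(confidence({"milk"}, {"bread"}))   # 0.5 / 0.75 ≈ 0.67
```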
5. Association Rules
An association rule is an implication of the form:
A → B [support = s%, confidence = c%]
It means that if A occurs, B is likely to occur with support s and confidence
c.
6. Closed and Maximal Frequent Itemsets
Closed Frequent Itemset: A frequent itemset that has no
superset with the same support.
Maximal Frequent Itemset: A frequent itemset that has no
frequent supersets.
Frequent Itemsets, Closed Itemsets, and
Association Rules
🔹 1. Basic Definitions
➤ Itemset:
A group of items.
A k-itemset contains k items (e.g., {bread, milk} is a 2-itemset).
➤ Transaction (T):
A set of items bought together.
Identified by a unique TID.
🔹 2. Support and Confidence
➤ Support:
Fraction of transactions that contain an itemset.
Support(A ∪ B) = (Transactions containing both A and B) / (Total transactions)
➤ Confidence:
Likelihood of item B occurring given item A.
Confidence(A → B) = Support(A ∪ B) / Support(A)
🔹 3. Association Rules
Form: A → B
Indicates a strong relationship: "If A occurs, B is likely to occur."
✔️Strong Rules:
Satisfy both:
o Minimum Support (minsup)
o Minimum Confidence (minconf)
🔹 4. Frequent Itemsets
An itemset is frequent if its support ≥ minsup.
Support count = number of transactions containing the itemset.
🔹 5. Closed Frequent Itemsets
An itemset is closed if no proper superset has the same
support count.
Captures complete support info.
Used to eliminate redundancy.
🔹 6. Maximal Frequent Itemsets
An itemset is maximal frequent if it is frequent and none of its
supersets are frequent.
Represents the outer boundary of frequent itemsets.
More compact, but may lose support details of subsets.
🔹 7. Why Use Closed or Maximal?
Mining frequent itemsets may result in an exponential number of
patterns.
Closed and maximal reduce computation and storage.
🔹 8. Example
Dataset:
T1: {a1, a2, ..., a100}
T2: {a1, a2, ..., a50}
minsup = 1
➤ Frequent itemsets: All subsets of T1 and T2
Total = 2^100 - 1 → Too large!
➤ Closed frequent itemsets:
{a1, ..., a50} → support: 2
{a1, ..., a100} → support: 1
➤ Maximal frequent itemset:
{a1, ..., a100} only (it has no frequent superset)
🔹 9. Association Rule Mining Steps
1. Find all frequent itemsets (support ≥ minsup).
2. Generate strong association rules from those itemsets
(confidence ≥ minconf).
Apriori Algorithm – Overview
Purpose: To mine frequent itemsets for Boolean association rules.
Proposed by: R. Agrawal and R. Srikant (1994).
Name Origin: Uses prior knowledge of itemset properties.
🔁 Working Principle
Level-wise iterative approach:
o Finds frequent 1-itemsets (L1) from the database.
o Uses Lk-1 to generate Lk (frequent k-itemsets).
o Iterates until no more frequent itemsets can be found.
✅ Apriori Property (Antimonotonicity)
Definition: All non-empty subsets of a frequent itemset must also
be frequent.
Implication:
o If itemset I is infrequent, then any superset I ∪ A is also
infrequent.
o Helps in pruning the candidate space (reducing
computations).
🧩 Two-Step Process (Join & Prune)
1. Join Step:
o Generate candidate itemsets Ck by self-joining Lk-1.
o Join l1 and l2 in Lk-1 if their first k-2 items are the same.
o Ensures no duplicates using lexicographic order.
2. Prune Step:
o Remove candidate c ∈ Ck if any of its (k-1)-subsets is not in
Lk-1.
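The two steps above can be sketched as a small candidate-generation routine (illustrative Python; itemsets are kept as sorted tuples so the "first k-2 items equal" join test works; the L2 used below follows the standard AllElectronics example):

```python
# Sketch: Apriori join + prune steps for candidate generation.
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_prev."""
    L_prev = sorted(L_prev)
    L_set = set(L_prev)
    candidates = []
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            a, b = L_prev[i], L_prev[j]
            # Join step: merge two (k-1)-itemsets whose first k-2 items agree
            if a[:k - 2] == b[:k - 2]:
                c = tuple(sorted(set(a) | set(b)))
                # Prune step: every (k-1)-subset of c must itself be frequent
                if all(s in L_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))
# -> [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```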
📊 Example (Using AllElectronics DB)
Database D: 9 transactions (T100–T900).
min_sup = 2 (support count).
Iterations:
o C1 → L1: All 1-itemsets satisfying min_sup.
o C2 → L2: 2-itemsets from L1 × L1; all subsets are frequent →
no pruning.
o C3 → L3: Prune itemsets with infrequent subsets using the
Apriori property.
o C4: Generated but pruned entirely due to infrequent subset →
termination.
Generating Association Rules
✅ Definitions
Frequent Itemset: An itemset whose support ≥ minimum support
threshold.
Association Rule: An implication of the form A → B, where A and B
are itemsets.
Support Count: Number of transactions containing a given
itemset.
Confidence: Measures how often items in B appear in transactions
that contain A:
Confidence(A → B) = Support(A ∪ B) / Support(A)
⚙️Steps to Generate Association Rules
1. Find all frequent itemsets using algorithms like Apriori or FP-
Growth.
2. For each frequent itemset l:
o Generate all non-empty subsets s of l.
o For each s, form the rule:
s → (l − s)
o Compute the confidence of each rule.
3. Filter strong rules:
o Only keep rules with confidence ≥ min_conf.
o All rules automatically satisfy min_support because they are
derived from frequent itemsets.
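The steps above can be sketched for a single frequent itemset as follows (illustrative Python; the support counts match the worked example below):

```python
# Sketch: generating strong rules from one frequent itemset.
from itertools import combinations

support_count = {
    frozenset(["I1", "I2", "I5"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2, frozenset(["I2", "I5"]): 2,
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
}

def rules_from_itemset(l, min_conf):
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):                     # all non-empty proper subsets s
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]   # confidence of s -> (l - s)
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

for antecedent, consequent, conf in rules_from_itemset({"I1", "I2", "I5"}, min_conf=0.7):
    print(antecedent, "->", consequent, f"confidence = {conf:.0%}")
# Keeps {I5}->{I1,I2}, {I1,I5}->{I2}, {I2,I5}->{I1}
```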
📘 Example
Let X = {I1, I2, I5} be a frequent itemset.
Non-empty subsets of X:
{I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}
Possible rules and confidences (assuming support counts):
{I1, I2} → {I5} → 50%
{I1, I5} → {I2} → 100%
{I2, I5} → {I1} → 100%
{I1} → {I2, I5} → 33%
{I2} → {I1, I5} → 29%
{I5} → {I1, I2} → 100%
With min_conf = 70%, strong rules:
{I1, I5} → {I2}
{I2, I5} → {I1}
{I5} → {I1, I2}
FP-Growth Algorithm
✅ Motivation
Apriori algorithm, though effective, suffers from:
Huge candidate generation (e.g., 10⁴ 1-itemsets → 10⁷ 2-
itemsets).
Multiple full database scans and expensive pattern matching.
💡 FP-Growth Solution
Avoids candidate generation by using a divide-and-conquer
strategy.
Builds a compressed data structure called the FP-tree
(Frequent Pattern Tree).
Recursively mines conditional FP-trees for frequent patterns.
How FP-Growth Works
1. First Database Scan
Count support of all items → generate frequent 1-itemsets.
Sort items in descending order of support → List L.
2. Build FP-Tree
Start with a null root.
For each transaction:
o Sort items according to L.
o Insert path into tree, sharing common prefixes.
o Increment node counts for existing prefixes.
Maintain node-links for quick access via a header table.
3. Mine FP-Tree
For each item (starting from the least frequent in L):
Construct Conditional Pattern Base (CPB):
o Paths in FP-tree ending with the item.
Build Conditional FP-Tree from CPB.
Recursively mine the conditional FP-tree.
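The first two steps (item counting and FP-tree construction) can be sketched as below; this is an illustrative Python implementation, not the full algorithm, and the transactions reproduce the AllElectronics example used in these notes:

```python
# Sketch: first database scan + FP-tree construction (recursive mining omitted).
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: count support of each item
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))      # list L, descending support
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None, None)
    header = defaultdict(list)                             # item -> node-links
    # Second scan: insert each transaction, sharing common prefixes
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1             # shared prefix: bump count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header, order

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header, order = build_fp_tree(transactions, min_sup=2)
print(order)   # ['I2', 'I1', 'I3', 'I4', 'I5']
```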
📘 Example Summary
Frequent Items (sorted by support):
L = {I2:7, I1:6, I3:6, I4:2, I5:2}
Mining I5:
CPB: {I2, I1:1}, {I2, I1, I3:1}
Conditional FP-tree: I2:2, I1:2
Patterns: {I2, I5}, {I1, I5}, {I2, I1, I5}
Mining I4:
CPB: {I2, I1:1}, {I2:1}
Conditional FP-tree: I2:2
Pattern: {I2, I4}
Mining I3:
CPB: {I2, I1:2}, {I2:2}, {I1:2}
Conditional FP-tree: Two branches
Patterns: {I2, I3}, {I1, I3}, {I2, I1, I3}
Mining I1:
CPB: {I2:4}
Conditional FP-tree: I2:4
Pattern: {I2, I1}
🌲 FP-Tree Benefits
Compact representation of the database.
Reduces:
o Database scans
o Candidate generation
o Search space
🔄 Recursive Mining
Focuses on smaller projected databases.
Combines prefix path with suffix pattern during recursion.
🔍 Why Pattern Evaluation Matters
Even strong rules (high support & confidence) can be misleading
or uninteresting.
This problem is especially common with:
o Low support thresholds
o Long patterns
📌 Strong Rules May Be Misleading
Example: The rule "buys computer games → buys videos" has:
o Support: 40%
o Confidence: 66%
Seems strong, but actually misleading, because:
o The base probability of buying videos is 75%
o So buying games lowers the chance of buying videos (→
negative correlation)
Takeaway: Confidence does not imply true correlation
⚙️Correlation-Based Evaluation
To improve pattern evaluation, use correlation analysis.
✅ Lift:
Formula:
lift(A → B) = P(A ∩ B) / (P(A) * P(B))
Interpretation:
o Lift > 1: Positive correlation
o Lift < 1: Negative correlation
o Lift = 1: No correlation
In the example, lift = 0.89 → negative correlation
✅ Chi-Square (χ²) Measure:
Based on a contingency table of observed vs expected values
Formula:
χ² = Σ (observed - expected)² / expected
High χ² → Strong evidence of dependence
In the example, χ² = 555.6, confirming negative correlation
✅ Conclusion: Which Measures Are Interesting?
Support & Confidence: Good for filtering, but insufficient
Lift & Chi-square (χ²): Better for identifying true
interestingness
o Capture actual dependencies (positive or negative)
Pattern Evaluation Methods in Association Rule Mining
🔹 Overview
Most association rule mining algorithms use the support–
confidence framework.
Issue: Even with thresholds, many generated rules may still be
uninteresting or misleading.
Especially problematic with:
o Low support thresholds
o Long patterns
To tackle this, additional measures are used to assess the
interestingness of patterns more effectively.
🧩 Strong Rules Are Not Necessarily Interesting
🔸 Subjective vs Objective Interestingness:
Subjective: Depends on user preferences, domain knowledge, etc.
Objective: Based on statistical measures derived from the data.
❗ Objective measures help filter out misleading rules before presenting them to
users.
🔸 Example 6.7 – Misleading “Strong” Rule
Scenario:
Data from AllElectronics about purchase behavior:
o Total transactions: 10,000
o Customers who bought:
Computer games: 6000
Videos: 7500
Both: 4000
Discovered Rule:
buys(X, "computer games") → buys(X, "videos")
Support = 40%, Confidence = 66%
Analysis:
Rule meets minimum support (30%) and confidence (60%) →
Considered strong.
BUT, the actual probability of buying videos is 75%.
o Hence, confidence (66%) < base probability (75%).
o Indicates negative correlation.
Conclusion: Rule is misleading.
o Buying games actually reduces the likelihood of buying
videos.
Key Insight:
Confidence alone doesn't imply meaningful association.
Need better metrics to reveal true correlations.
🧪 From Association Analysis to Correlation Analysis
🔸 Why Use Correlation Measures?
Support & confidence cannot detect:
o Independence
o Negative correlation
Correlation measures evaluate statistical dependency between
items.
🔹 Lift Measure
Formula:
Lift(A → B) = P(A ∩ B) / (P(A) * P(B))
Interpretation:
Lift > 1: Positive correlation (A implies B more often than by
chance)
Lift = 1: No correlation (independence)
Lift < 1: Negative correlation (A implies B less than by chance)
Example 6.8 – Applying Lift
From previous data:
o P(game) = 0.60
o P(video) = 0.75
o P(game ∩ video) = 0.40
Lift = 0.40 / (0.60 × 0.75) = 0.40 / 0.45 = 0.89
Result: Since lift < 1 → Negative correlation
✅ Lift detects negative correlation that confidence failed to reveal.
🔹 Chi-Square (χ²) Measure
Purpose:
Tests independence between itemsets
Based on observed vs. expected values in a contingency table
Formula:
χ² = Σ [(Observed − Expected)² / Expected]
Example 6.9 – Applying χ²
Contingency Table: Observed Values (Table 6.6)
|           | Game | ¬Game | Row Total |
|---|---|---|---|
| Video     | 4000 | 3500 | 7500  |
| ¬Video    | 2000 | 500  | 2500  |
| Col Total | 6000 | 4000 | 10000 |

Expected Values (Table 6.7)

|        | Game (Exp) | ¬Game (Exp) |
|---|---|---|
| Video  | 4500 | 3000 |
| ¬Video | 1500 | 1000 |
χ² Calculation:
χ² = (4000−4500)² / 4500
+ (3500−3000)² / 3000
+ (2000−1500)² / 1500
+ (500−1000)² / 1000
χ² = 500²/4500 + 500²/3000 + 500²/1500 + 500²/1000
= 55.56 + 83.33 + 166.67 + 250.0
= 555.6
Interpretation:
A large χ² value indicates that observed counts deviate significantly from what independence would predict
In this case:
o Observed joint occurrence (4000) < Expected (4500)
o Confirms negative correlation
✅ Like Lift, Chi-square also detects the negative correlation missed by the
confidence metric.
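A short sketch recomputes both measures for the game/video example above (assumes NumPy and SciPy; correction=False reproduces the plain χ² formula used in the calculation):

```python
# Sketch: lift and chi-square for the game/video contingency table.
import numpy as np
from scipy.stats import chi2_contingency

n = 10_000
observed = np.array([[4000, 3500],    # rows: video, ¬video
                     [2000,  500]])   # cols: game, ¬game

p_game, p_video = 6000 / n, 7500 / n
p_both = 4000 / n
lift = p_both / (p_game * p_video)
print(f"lift = {lift:.2f}")                      # 0.89 -> negative correlation

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)                                   # [[4500. 3000.] [1500. 1000.]]
print(f"chi-square = {chi2:.1f}")                 # ≈ 555.6
```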