
Data Mining

1. Database Data

 Stored in: Relational databases (tables with rows and columns).

 Examples: Customer records, employee data, item listings.

 Structure: Uses a schema (e.g., customer(custID, name, age, income, ...)).

 Tools: SQL for querying and aggregating data.

 Data Mining Use: Identify trends, patterns, and deviations (e.g., predicting credit risk, analyzing sales).

2. Data Warehouses

 Stored in: Centralized repositories integrating data from multiple sources.

 Structure: Multidimensional (data cubes).

 Purpose: Historical analysis and business decision support.

 Features:

o Organized by subject (customer, item, time, etc.)

o Uses OLAP operations like roll-up and drill-down.

 Data Mining Use: Discover patterns at various levels of granularity; enable exploratory analysis.

3. Transactional Data

 Stored in: Flat files or tables representing individual transactions.

 Structure: Each record contains a transaction ID and the items involved (e.g., T100: I1, I3, I8).

 Examples: Retail sales, flight bookings, clickstream data.

 Data Mining Use:

o Market basket analysis.

o Frequent itemset mining to discover which products are often bought together.
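As a minimal sketch (with made-up transactions), the co-occurrence counting behind market basket analysis can be expressed in a few lines of Python:

from collections import Counter
from itertools import combinations

# Hypothetical transactions: each record is a transaction ID plus the items involved.
transactions = {
    "T100": {"I1", "I3", "I8"},
    "T200": {"I1", "I3"},
    "T300": {"I2", "I3", "I8"},
}

# Count how often each pair of items appears together -- the raw input
# to frequent itemset mining.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))  # e.g., [(('I1', 'I3'), 2), (('I3', 'I8'), 2)]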

Other Forms of Data (Mentioned briefly)

 Data Streams
 Sequence Data

 Graph or Network Data

 Spatial, Text, Multimedia Data

 Web Data (WWW)

Feature-by-feature comparison: Database Data vs. Data Warehouse vs. Transactional Data

 Definition:

o Database data: Structured data stored in relational databases (RDBMS) using tables.

o Data warehouse: Integrated, historical data from multiple sources, stored for analytical purposes.

o Transactional data: Data representing real-world transactions or events, often sequential and time-stamped.

 Purpose:

o Database data: Real-time operations and day-to-day transactions (OLTP).

o Data warehouse: Strategic decision support and historical analysis (OLAP).

o Transactional data: Capturing and analyzing individual user or business actions (e.g., purchases).

 Structure:

o Database data: Tables (relations) with rows (tuples) and columns (attributes).

o Data warehouse: Multidimensional data cubes with summarized information.

o Transactional data: Flat files or nested tables with transaction IDs and lists of items or events.

 Storage System:

o Database data: Relational Database Management System (RDBMS) such as MySQL, Oracle, SQL Server.

o Data warehouse: Centralized data warehouse (e.g., Amazon Redshift, Snowflake, Google BigQuery).

o Transactional data: Flat files, NoSQL databases, or special transactional DB systems.

 Schema Type:

o Database data: Normalized schemas (3NF or ER models).

o Data warehouse: Star or snowflake schemas for fast aggregation and querying.

o Transactional data: Often denormalized or semi-structured (a list of items per transaction).

 Data Granularity:

o Database data: Fine-grained (detailed individual records).

o Data warehouse: Aggregated (summaries over time or groups).

o Transactional data: Fine-grained (detailed per transaction).

 Data Sources:

o Database data: A single operational system (e.g., a POS system).

o Data warehouse: Multiple heterogeneous sources (e.g., regional DBs, logs).

o Transactional data: Point-of-sale systems, sensors, web logs, etc.

 Update Frequency:

o Database data: Frequently updated (daily, hourly).

o Data warehouse: Periodically updated (daily, weekly, monthly).

o Transactional data: Continuously updated or appended.

 Examples:

o Database data: customer(custID, name, age, income); item(itemID, price, category).

o Data warehouse: Sales by region and time; quarterly product performance.

o Transactional data: T100: [I1, I3, I8]; clickstream log: UserID: [Page1, Page3, Page7].

 Query Type:

o Database data: SQL-based queries: SELECT, JOIN, GROUP BY, etc.

o Data warehouse: OLAP queries: drill-down, roll-up, slice, dice.

o Transactional data: Pattern mining: association rule mining, sequential pattern mining.

 Mining Techniques:

o Database data: Classification, clustering, outlier detection, regression.

o Data warehouse: Multidimensional pattern mining, trend analysis, anomaly detection.

o Transactional data: Association rule mining, sequential pattern mining, market basket analysis.

 Use Cases:

o Database data: Predicting customer churn, identifying fraud, customer segmentation.

o Data warehouse: Strategic decisions, such as identifying underperforming regions or analyzing product trends.

o Transactional data: Recommending products, detecting buying behavior, bundling promotions.

 Tools/Technologies:

o Database data: SQL, DBMSs (MySQL, PostgreSQL), Python (pandas).

o Data warehouse: OLAP tools (Tableau, Power BI), ETL pipelines, cube computation.

o Transactional data: Apache Hadoop, Spark, NoSQL, association rule mining (Apriori, FP-Growth).
Data Mining Functionalities

1. Characterization and Discrimination (Class/Concept Description)

➤ Characterization:

Describes the general features of data belonging to a target class.

 Provides a concise summary, usually through descriptive statistics, OLAP operations, or attribute-oriented induction.

 Example: A retail manager wants to know the profile of customers who spend over $5000/year. The result might show that they are typically middle-aged, employed, and have good credit ratings.

 Output can be presented in the form of:

o Charts (bar, pie)

o Generalized relations

o Characteristic rules (e.g., "If income > 50K → likely to spend > $5000")

 It's used for summarizing and understanding data patterns within a group.

➤ Discrimination:

 Compares the features of a target class against one or more contrasting classes.

 Example: Comparing customers who shop frequently for computer products vs. those who shop rarely. Differences may include age, education, etc.

 Helps identify features that distinguish between groups (e.g., age, occupation).

 Often results in discriminant rules, e.g., “If age between 20-40 and education = university → frequent buyer.”

Key Difference: Characterization describes one group; discrimination compares multiple groups.

2. Mining Frequent Patterns, Associations, and Correlations

 Aims to find repetitive patterns, associations, or correlations in large datasets.

 This includes:

o Frequent Itemsets: Sets of items that often appear together in transactions (e.g., bread and butter).

o Sequential Patterns: Items purchased in a sequence (e.g., laptop → camera → memory card).

o Substructures: Patterns in structural forms such as graphs or trees.

➤ Association Rule Mining:

 Example: “buys(X, 'computer') → buys(X, 'software') [support: 1%, confidence: 50%]”

o Means that 1% of transactions include both items, and 50% of computer buyers also buy software.

 Rules can be single-dimensional (same predicate) or multidimensional (age, income, buys).

 Used in market basket analysis, cross-selling, and product recommendations.

➤ Correlation Analysis:

 Goes beyond co-occurrence to measure the statistical significance of the relationship between items (e.g., via a chi-square test).

Importance: Helps identify what tends to happen together in data, enabling targeted marketing, inventory planning, and more.

3. Classification and Regression (Predictive Analysis)

➤ Classification:

 Builds a model (classifier) that assigns data to predefined categories or classes.

 Requires labeled training data.

 Output can be:

o IF-THEN rules

o Decision trees

o Neural networks

o SVMs, k-NN, Bayesian classifiers

 Example: Classifying items based on sales response (good, mild, none).

 Used in spam detection, credit scoring, disease diagnosis, etc.

➤ Regression:

 Predicts continuous numeric values, not categories.

 Example: Predicting the expected revenue from a product.

 Methods include linear regression, polynomial regression, and advanced ML techniques.

 Used in forecasting, pricing models, stock prediction.

Key Difference: Classification predicts discrete labels; regression predicts continuous values.
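To make the contrast concrete, here is a minimal sketch using scikit-learn; the toy ages, incomes, labels, and target values are invented for illustration:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[25, 30000], [40, 80000], [35, 50000], [50, 120000]]  # [age, income] (made up)

# Classification: predict a discrete label (e.g., a credit-risk class).
clf = DecisionTreeClassifier().fit(X, ["high", "low", "high", "low"])
print(clf.predict([[30, 40000]]))   # -> a class label such as ['high']

# Regression: predict a continuous value (e.g., expected yearly revenue).
reg = LinearRegression().fit(X, [1200.0, 5300.0, 2500.0, 7800.0])
print(reg.predict([[30, 40000]]))   # -> a numeric estimate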

4. Cluster Analysis

 Groups a set of objects into clusters so that:

o Intra-cluster similarity is high

o Inter-cluster similarity is low

 No prior labels are required (unsupervised learning).

 Each cluster can later be treated as a class for further analysis.

 Example: Segmenting customers based on purchasing behavior or geographic location.

 Results are often visualized using 2D/3D plots; common algorithms include k-means and DBSCAN.

 Applications include customer segmentation, image recognition, and bioinformatics.

Purpose: To discover natural groupings within data without predefined categories.
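A minimal k-means sketch (scikit-learn); the two-feature customer data below is invented:

from sklearn.cluster import KMeans

# Each customer: [store visits per year, spend per year] -- made-up values.
customers = [[5, 100], [6, 120], [50, 900], [55, 1000], [8, 150]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster assignment per customer, e.g., [0 0 1 1 0]
print(km.cluster_centers_)  # centroid of each discovered segment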

5. Outlier Analysis

 Identifies data objects that deviate significantly from the general pattern.

 Such data points are called outliers or anomalies.

 Useful in applications where rare events are more important than common ones:

o Fraud detection (e.g., unusual credit card activity)

o Intrusion detection

o Medical anomalies

 Techniques:

o Statistical methods (assuming distribution models)

o Distance-based methods (objects far from others)

o Density-based methods (like LOF – Local Outlier Factor)

 Not all outliers are noise—many are insightful and can drive important decisions.
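As a hedged illustration of the density-based approach, scikit-learn's LocalOutlierFactor can flag the unusual point in a made-up list of transaction amounts:

from sklearn.neighbors import LocalOutlierFactor

amounts = [[20], [22], [19], [21], [23], [950]]  # one obviously unusual purchase
lof = LocalOutlierFactor(n_neighbors=3)
print(lof.fit_predict(amounts))  # 1 = inlier, -1 = outlier -> [ 1  1  1  1  1 -1]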

Technologies Used in Data Mining


Data mining is an application-driven field that integrates various
techniques from multiple disciplines to extract valuable insights from data.
These include:

1. Statistics

o Role in Data Mining: Statistics is used to model data and target classes. A statistical model describes the behavior of data using mathematical functions and probability distributions.

o Applications in Data Mining:

 Data Characterization and Classification: Statistical models can be used to classify and characterize data.

 Handling Noise and Missing Data: Statistics helps in modeling and handling noisy or missing data during the data mining process.

 Prediction and Forecasting: Statistical models are key for prediction tasks, providing a framework for making inferences about the data.

 Verifying Data Mining Results: After building classification or prediction models, statistical hypothesis testing helps verify their accuracy and significance.

o Challenges: Scaling statistical methods to large datasets is complex due to computational costs. This issue is exacerbated for online applications requiring real-time processing.

2. Machine Learning

o Role in Data Mining: Machine learning focuses on enabling computers to learn patterns and make decisions based on data. It is used in data mining for tasks such as classification and clustering.

o Types of Learning Methods:

 Supervised Learning (Classification): Involves training a model with labeled data to recognize patterns, such as recognizing postal codes from handwritten images.

 Unsupervised Learning (Clustering): The model learns from data without labels, finding hidden patterns or groups (e.g., recognizing different digits in handwritten data without predefined labels).

 Semi-supervised Learning: Combines both labeled and unlabeled data. Labeled data helps build models, while unlabeled data helps refine the model's boundaries, improving accuracy.

 Active Learning: The model actively queries humans (domain experts) to label uncertain data points, thus improving the model with minimal human input.

o Challenges: While machine learning focuses on accuracy, data mining also emphasizes efficiency, scalability, and handling diverse types of data.

3. Database Systems and Data Warehouses

o Role in Data Mining: Database systems handle the storage, management, and retrieval of data, and they play a crucial role in ensuring that data mining can scale to large datasets.

o Data Warehousing: A data warehouse integrates data from various sources and timeframes into a unified structure. It enables advanced data analysis by consolidating data into multidimensional space, known as data cubes.

o Data Mining Integration: Modern database systems often incorporate data mining capabilities to extend their analytic power. Data mining tools can operate directly on data stored in databases to identify patterns.

o Challenges: Data mining often involves working with real-time streaming data, which requires efficient database technologies to process large volumes of data quickly.

4. Information Retrieval (IR)

o Role in Data Mining: Information retrieval involves searching for and retrieving relevant documents or information from a large database or the web. Unlike database systems, IR deals with unstructured data (e.g., text or multimedia).

o Probabilistic Models: IR uses probabilistic models to measure the similarity between documents. Text documents are often represented as a bag of words, where the presence and frequency of words matter but word order does not.

o Topic Modeling: IR systems use models to identify underlying topics in collections of documents. These topics are represented as probability distributions over a vocabulary, and documents may belong to multiple topics.

o Integration with Data Mining: Combining IR with data mining techniques enables deeper analysis of text and multimedia data, facilitating better search and analysis of large, unstructured datasets (e.g., web data, digital libraries, healthcare records).

Applications of Data Mining


Data mining plays a vital role in various fields where large amounts of
data need to be analyzed. Here are two major applications:

1. Business Intelligence (BI)

o Purpose: To understand business contexts such as customers, market trends, and competitors.

o Key Techniques:

 Classification and prediction for sales, market analysis, and customer feedback.

 Clustering for Customer Relationship Management (CRM), grouping customers by similarities.

 Characterization mining for understanding customer groups and developing tailored programs.

o Importance: BI allows businesses to make smart decisions, retain valuable customers, and gain insights into competitors. Without data mining, effective market analysis would be difficult.

2. Web Search Engines

o Purpose: To retrieve information from the web in response to user queries.

o Techniques Used:

 Crawling: Deciding which web pages to crawl and how frequently.

 Indexing: Choosing which pages to index and how to structure the index.

 Ranking: Determining how to rank pages based on relevance and quality.

o Challenges:

 Data Volume: Search engines deal with massive amounts of data, requiring cloud computing for processing.

 Real-Time Processing: Search engines need to respond to user queries instantly, often requiring continuous updates and real-time data mining.

 Small Data Issues: Many queries are asked rarely, posing a challenge for mining methods designed for large datasets.

Major Issues in Data Mining


Data mining, being a rapidly evolving field, faces several challenges and
open research areas. These challenges can be categorized into five main
groups:

1. Mining Methodology

o New Knowledge Types: Data mining covers a broad range of tasks (e.g., classification, regression, clustering), and as new applications emerge, new mining techniques are developed.

o Multidimensional Data Mining: Mining knowledge across different dimensions, such as combining various attributes in data cubes.

o Interdisciplinary Approaches: Integrating methods from natural language processing, software engineering, and other fields enhances data mining.

o Handling Uncertainty: Dealing with noisy or incomplete data is a significant challenge, requiring techniques such as data cleaning and outlier detection.

2. User Interaction

o Interactive Mining: The mining process should be flexible and dynamic, allowing users to refine searches and explore data interactively.

o Incorporation of Background Knowledge: Including domain-specific knowledge, constraints, or rules can guide the mining process towards more useful results.

o Data Mining Query Languages: High-level languages or interfaces allow users to define and optimize ad hoc queries, making the process more user-friendly.

o Visualization of Results: Presenting mining results in an understandable, visually intuitive way is crucial for the usability of data mining systems.

3. Efficiency and Scalability

o Algorithm Efficiency: Data mining algorithms need to handle large datasets quickly and efficiently, especially as data volumes grow.

o Parallel and Distributed Mining: Large datasets often require parallel processing across distributed systems. Cloud and cluster computing are common ways to scale data mining.

o Incremental Mining: Incremental algorithms that can update models as new data arrives, without reprocessing all existing data, are a key area of research.

4. Diversity of Data Types

o Complex Data Types: Data mining must handle a variety of data types, from structured databases to unstructured data such as text and images.

o Dynamic Data: Some data, such as online data streams or real-time sensor data, change constantly, presenting challenges for traditional mining methods.

o Interconnected Data: Many datasets are linked (e.g., social networks, web data), requiring mining techniques that can handle and exploit these connections.

5. Data Mining and Society

o Social Impact: Data mining affects privacy, security, and social dynamics. How can data mining be used for societal benefit while preventing misuse?

o Privacy-Preserving Mining: Safeguarding individuals' privacy while conducting data mining is crucial. Ongoing research focuses on privacy-preserving data mining methods.

o Invisible Data Mining: Many systems perform data mining behind the scenes without users' awareness. For instance, e-commerce sites track user behavior to recommend products.

What is a Data Warehouse?


A data warehouse refers to a central repository where data from
different sources is stored and organized for analysis and decision-making.
It allows businesses to store historical data that supports strategic
decisions. Data warehouses are essential in today’s competitive world as
organizations use them to gain insights into various aspects of their
operations and make informed decisions.

Key Features of a Data Warehouse

1. Subject-Oriented:

o Data warehouses are designed around major subjects of interest such as customers, products, suppliers, and sales.

o Unlike operational databases, which focus on day-to-day transactions, data warehouses are structured to provide a more analytical view that helps decision-makers.

o Data is organized to reflect decision-support processes, not operational activities.

2. Integrated:

o A data warehouse integrates data from multiple heterogeneous sources such as relational databases, flat files, and transaction logs.

o It ensures consistency in naming conventions, data formats, and attribute measures.

o The data is cleaned and standardized before being loaded into the warehouse.

3. Time-Variant:

o Data in a warehouse is typically historical, covering several years (e.g., 5-10 years) to help analyze trends over time.

o Each data set within a warehouse includes a time element, either implicit or explicit, to track changes and trends over time.

4. Nonvolatile:

o Once data is stored in a warehouse, it is not changed. New data is only appended to the system.

o A data warehouse does not require the mechanisms for transaction processing, recovery, or concurrency control that are needed in operational databases.

o The primary operations in a data warehouse are data loading and querying.

Functions of a Data Warehouse

 A data warehouse consolidates large amounts of data for analysis and decision-making purposes. It is not primarily designed for transactional operations but rather for answering complex queries and providing insights into various business activities.

 A data warehouse is often constructed by integrating data from multiple sources, using processes such as data cleaning, data integration, and data consolidation.

 Decision-support technologies are used to query the data, generate reports, and make strategic decisions based on the insights derived from the data.

Difference between a Database (OLTP) and a Data Warehouse (OLAP)


 Purpose: OLTP handles day-to-day transactions and query processing; OLAP supports data analysis and decision making.

 Users: OLTP serves clerks, clients, and IT professionals (customer-oriented); OLAP serves managers, executives, and analysts (market-oriented).

 Data Content: OLTP holds current, detailed, real-time transactional data; OLAP holds historical, aggregated, summarized data.

 Data Volume: OLTP is typically smaller, focused on current data; OLAP is very large, including years of historical data.

 Database Design: OLTP uses an Entity-Relationship (ER), application-oriented model; OLAP uses a subject-oriented star or snowflake schema.

 View of Data: OLTP has a narrow view, specific to a department or enterprise; OLAP has a broad view, integrating data from multiple sources.

 Query Characteristics: OLTP runs simple, short, atomic queries and transactions; OLAP runs complex queries, often involving aggregation.

 Operations: OLTP performs frequent inserts, updates, and deletes (write-heavy); OLAP performs primarily read-only operations (read-heavy).

 Concurrency Control: OLTP requires concurrency control and recovery mechanisms; OLAP needs less concurrency control due to its read-only nature.

 Access Patterns: OLTP needs high transaction throughput and quick response times; OLAP focuses on complex query performance, where low latency is not critical.

 Performance Metrics: OLTP is measured by transactions per second; OLAP is measured by query response time and analytical capability.

 Data Granularity: OLTP stores very detailed, fine-grained data; OLAP stores data at multiple levels of granularity (from detailed to summarized).

 Frequency of Access: OLTP access is constant and very frequent; OLAP access is periodic, depending on analysis/reporting needs.

 Data Integration: OLTP integration is minimal (single-source systems); OLAP integration is high (multiple heterogeneous sources).

 Storage Medium: OLTP data is typically stored on a single system or server; OLAP uses distributed storage systems due to the large volume.

Data Warehousing: A Multitiered Architecture


Data warehouses are designed using a three-tier architecture, which
helps in separating data storage, data processing, and data presentation.
This architecture ensures scalability, flexibility, and efficient data
management. Here's a breakdown of each tier:

1. Bottom Tier: Data Warehouse Server

 Role: This tier is responsible for storing the actual data.

 Technology Used: Usually a relational database management system (RDBMS).

 Functions:

o Data Extraction: Pulls data from various operational and external sources.

o Data Cleaning: Removes errors, inconsistencies, and duplicates.

o Data Transformation: Converts data into a common, unified format.

o Data Loading: Transfers the processed data into the data warehouse.

o Data Refreshing: Periodically updates the data warehouse to reflect recent changes.

 Data Sources:

o Operational databases: e.g., banking systems, sales systems.

o External sources: e.g., market research reports, customer profiles from third parties.

 Tools Used:

o Gateways (APIs) to connect to and query the source systems:

 ODBC (Open Database Connectivity)

 OLEDB (Object Linking and Embedding Database)

 JDBC (Java Database Connectivity)

 Metadata Repository:

o Stores information about the data (such as source, format, and transformations applied).

o Acts as a directory for warehouse management and query optimization.
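As a hedged sketch of the extract-clean-transform-load cycle this tier performs (pandas and SQLAlchemy; the column names, sample values, and cleaning rules are assumptions):

import pandas as pd
from sqlalchemy import create_engine

warehouse = create_engine("sqlite:///warehouse.db")      # stand-in warehouse store

# Extract: in practice this comes from operational sources; a tiny inline batch here.
raw = pd.DataFrame({"custID": [1, 1, None, 2],
                    "amount": ["10.5", "10.5", "3.0", "7.25"],
                    "date":   ["2024-01-03"] * 4})

clean = raw.drop_duplicates().dropna(subset=["custID"])  # Clean: duplicates, missing keys
clean = clean.assign(amount=clean["amount"].astype(float),
                     date=pd.to_datetime(clean["date"])) # Transform to a unified format
clean.to_sql("fact_sales", warehouse, if_exists="append", index=False)  # Load / refresh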

2. Middle Tier: OLAP Server

 Role: Acts as the processing layer, converting data into a form suitable for analysis.

 Two Main OLAP Models:

1. ROLAP (Relational OLAP):

 Works on top of relational databases.

 Converts multidimensional operations into relational queries.

 Suitable for handling large volumes of data.

2. MOLAP (Multidimensional OLAP):

 Uses specialized multidimensional data structures (cubes).

 Faster for complex analytical queries, but may have storage limitations.

 Functionality:

o Supports advanced analytical processing, including summarization, aggregation, and complex computations.

o Optimized for read-heavy operations.

3. Top Tier: Front-End Tools

 Role: This is the user interface layer, where users interact with the system.

 Components:

o Query and Reporting Tools: For generating standard or custom reports.

o Data Analysis Tools: For ad hoc querying, slicing, dicing, and drill-down analysis.

o Data Mining Tools: For predictive modeling, clustering, trend analysis, etc.

 Users:

o Business analysts

o Executives and managers

o Decision makers

 Functionality:

o Provides a visual and interactive environment for exploring and analyzing data.

o Supports dashboards, charts, graphs, and other visualizations.


Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
1. Enterprise Data Warehouse (EDW)

 Definition: A centralized data warehouse that stores information from across the entire organization.

 Scope: Corporate-wide, cross-functional.

 Data:

o Includes both detailed and summarized data.

o Integrated from multiple operational systems and external sources.

 Implementation:

o Requires extensive business modeling.

o Typically built on mainframes, superservers, or parallel systems.

o May take months or years to design and deploy.

 Advantages:

o Single source of truth.

o High consistency and integration.

 Disadvantages:

o Time-consuming and expensive to build.

o Inflexible in dynamic environments.

2. Data Mart

 Definition: A smaller, focused version of a data warehouse that stores data for a specific business line or department (e.g., marketing, sales).

 Scope: Departmental or subject-specific.

 Data:

o Typically summarized and related to specific business needs.

 Types:

o Independent Data Mart: Sourced directly from operational systems or external providers.

o Dependent Data Mart: Sourced from an existing enterprise data warehouse.

 Implementation:

o Uses low-cost servers (e.g., Linux, Windows).

o Takes weeks to build (faster ROI).

 Advantages:

o Quick to implement.

o Cost-effective.

o Flexible and adaptable to specific needs.

 Disadvantages:

o Risk of data silos.

o Complex integration later if not aligned with the enterprise strategy.

3. Virtual Warehouse

 Definition: A set of virtual views over operational databases.

 Implementation:

o Does not store data physically.

o Queries are processed in real time using views.

 Advantages:

o Easy and fast to build.

o Cost-efficient (no extra storage).

 Disadvantages:

o Performance depends on operational systems.

o Requires high processing capacity for complex queries.

o Limited historical data analysis.


Top-Down vs. Bottom-Up Approaches to Data Warehouse Development

 Start Point: Top-down begins with the enterprise data warehouse; bottom-up starts with departmental data marts.

 Time & Cost: Top-down has high cost and long duration; bottom-up has low cost and faster implementation.

 Flexibility: Top-down is less flexible; bottom-up is more adaptable.

 Integration: Top-down minimizes integration issues later; bottom-up may lead to integration challenges.

 Suitability: Top-down is best for long-term strategic planning; bottom-up is best for tactical, quick solutions.

 Risk: Top-down involves a high initial investment with late returns; bottom-up delivers quick wins but may cause silo issues.

Recommended Approach: Incremental & Evolutionary

A hybrid approach is often best, combining top-down planning with bottom-up implementation. The steps are:

1. Define a High-Level Corporate Data Model

o Done within 1–2 months.

o Ensures a consistent view of data across the organization.

2. Implement Independent Data Marts

o Developed in parallel using the high-level model.

o Quick deployment for department-level use.

3. Construct Distributed Data Marts

o Integrate the various marts via hub servers.

o Enables data sharing across business units.

4. Build a Multitier Data Warehouse

o The centralized enterprise data warehouse becomes the primary data store.

o Distributes data to dependent data marts as needed.

📊 Data Warehouse Modeling: Data Cube and OLAP

✅ Overview

 Data warehouses and OLAP (Online Analytical Processing) tools are built on the multidimensional data model.

 This model visualizes data as a data cube, which allows for interactive analysis of multidimensional data.

 The model supports operations such as roll-up, drill-down, and slicing/dicing to enable deep business insights.

Data Cube: A Multidimensional Data Model

🔹 What is a Data Cube?

 A data cube allows data to be modeled in n dimensions (not just 3D).

 It is defined by:

o Dimensions: The perspectives for analysis (e.g., time, item, location).

o Facts/Measures: Quantitative data (e.g., dollars sold, units sold).

🔹 Key Concepts:

 Dimensions:

o Examples: time, item, branch, location.

o Each has a dimension table (e.g., for item: item name, brand, type).

 Fact Table:

o Contains numeric measures such as dollars sold, units sold, etc.

o Links to each dimension via foreign keys.

🧊 Representation:

 2-D Cube: Like a spreadsheet/table (e.g., time × item, for location = Vancouver).

 3-D Cube: time × item × location.

 4-D Cube: time × item × location × supplier; hard to visualize, but conceptually a series of 3-D cubes.

📐 Cuboids and Lattice

🔸 What is a Cuboid?

 A cuboid is a cube at a certain level of summarization (group-by).

 Base Cuboid: The lowest level (e.g., time, item, location, supplier).

 Apex Cuboid (0-D): The highest level, summarized over all dimensions.

🔸 Data Cube Lattice

 Given n dimensions, 2^n possible cuboids exist.

 They form a lattice structure representing all possible levels of summarization.
🔹 Example from the notes:

For dimensions: time, item, location, supplier, the lattice includes:

 0-D Cuboid: total sales (summarized across all dimensions)

 1-D Cuboids: {time}, {item}, {location}, {supplier}

 2-D Cuboids: {time, item}, {item, supplier}, etc.

 3-D Cuboids: {time, item, location}, etc.

 4-D Cuboid: {time, item, location, supplier} (base cuboid)
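A tiny sketch that enumerates this lattice (2^n cuboids for n dimensions):

from itertools import combinations

dims = ["time", "item", "location", "supplier"]
cuboids = [c for k in range(len(dims) + 1) for c in combinations(dims, k)]
print(len(cuboids))  # 2^4 = 16 cuboids
print(cuboids[0])    # () -> the apex (0-D) cuboid
print(cuboids[-1])   # ('time', 'item', 'location', 'supplier') -> the base cuboid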

🔷 Multidimensional Schema Models


⭐ 1. Star Schema

 Structure:

o Central fact table (large, non-redundant).

o Connected dimension tables (flat, possibly redundant).

 Pros:

o Simple, fast query performance.

o Easy to understand.

 Cons:

o Some redundancy in dimension tables.

 Use case: Most common in data marts.

❄️ 2. Snowflake Schema

 Structure:
o Like a star schema but dimension tables are normalized into
sub-tables.

 Pros:

o Reduces redundancy.

o Easier maintenance.

 Cons:

o More complex queries due to joins.

o Slight performance trade-off.

 Use case: Less common; used when storage efficiency is more critical.

🌌 3. Fact Constellation (Galaxy Schema)

 Structure:

o Multiple fact tables sharing dimension tables.

 Pros:

o Models multiple interrelated subjects.

o Captures enterprise-wide data.

 Cons:

o Complex structure.

 Use case: Suitable for enterprise data warehouses.


🌐 Concept Hierarchies for Dimensions

What is a Concept Hierarchy?

 Maps low-level values (e.g., city) to higher-level concepts (e.g., country).

 Helps summarize or roll up data in OLAP operations.

Types:

 Schema Hierarchy: Total/partial order (e.g., street < city < province < country).

 Lattice: Partial order where attributes don’t follow a single path (e.g., week < year; day < month).

 Set-grouping Hierarchy: Value ranges grouped (e.g., price ranges: ($0–$200], ($200–$400], ...).

 Hierarchies can be manually defined or automatically generated.
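A small sketch of a set-grouping hierarchy using pandas; the prices, bin edges, and labels are illustrative:

import pandas as pd

prices = pd.Series([35, 150, 280, 390, 520])
ranges = pd.cut(prices, bins=[0, 200, 400, 600],
                labels=["($0-$200]", "($200-$400]", "($400-$600]"])
print(ranges.tolist())  # ['($0-$200]', '($0-$200]', '($200-$400]', '($200-$400]', '($400-$600]']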

🧮 Measures in a Data Cube

Measures = numeric values aggregated over dimension values (e.g., total sales).

 Distributive: Can be computed from subaggregates and combined. Examples: sum(), count(), min(), max(). Very efficient.

 Algebraic: Computed using a fixed number of distributive aggregates. Examples: avg() = sum()/count(), stddev(). Efficient.

 Holistic: Requires a full data scan; cannot be broken into subaggregates. Examples: median(), mode(), rank(). Inefficient.

⚠️ Most OLAP tools focus on distributive and algebraic measures for performance.
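A minimal sketch of why distributive and algebraic measures are cheap: per-partition subaggregates combine directly (the partitioned data below is made up):

partitions = [[4, 7, 1], [9, 2], [5, 5, 5]]  # data split across three chunks

sub = [(sum(p), len(p), min(p), max(p)) for p in partitions]  # per-partition aggregates

total_sum = sum(s for s, _, _, _ in sub)    # distributive: sum of sums
total_cnt = sum(c for _, c, _, _ in sub)    # distributive: sum of counts
overall_min = min(m for _, _, m, _ in sub)  # distributive: min of mins
overall_avg = total_sum / total_cnt         # algebraic: derived from sum() and count()
print(total_sum, total_cnt, overall_min, overall_avg)  # 38 8 1 4.75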

OLAP Operations

🔁 1. Roll-Up

 Definition: Aggregates data by climbing up a concept hierarchy or by reducing dimensions.

 Example: Aggregating sales data from city to country (Toronto → Canada).

 Also called: Drill-up (by some vendors).

🔽 2. Drill-Down

 Definition: The reverse of roll-up; navigates from summary data to more detailed data.

 Example: Moving from quarterly sales data to monthly sales data.

 Also includes: Adding a new dimension (e.g., customer group) for more detail.

🧊 3. Slice

 Definition: Selects a single dimension value, resulting in a subcube.

 Example: Selecting data where time = Q1 only.

🧊🧊 4. Dice

 Definition: Selects a range of values on two or more dimensions, resulting in a subcube.

 Example: Data for location = Toronto or Vancouver, time = Q1 or Q2, and item = home entertainment or computer.

🔄 5. Pivot (Rotate)

 Definition: Rotates the cube to view data from different perspectives.

 Example: Swapping the item and location axes for an alternate visual layout.
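A hedged pandas analogy for these five operations on a tiny, made-up sales table:

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["computer", "phone", "computer", "phone"],
    "dollars": [100, 150, 80, 120],
})

rollup = sales.groupby("country")["dollars"].sum()                       # roll-up: city -> country
drill  = sales.groupby(["country", "city", "quarter"])["dollars"].sum()  # drill-down
slice_ = sales[sales["quarter"] == "Q1"]                                 # slice: one dimension value
dice   = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Toronto"])]  # dice
pivot  = sales.pivot_table(values="dollars", index="item", columns="city", aggfunc="sum")  # pivot
print(rollup, pivot, sep="\n")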

🔍 6. Drill-Across

 Definition: Executes queries across multiple fact tables.

🧱 7. Drill-Through

 Definition: Accesses the bottom-level data in the data cube using SQL, typically reaching into backend relational tables.

📊 8. Other Advanced Operations

 Examples:

o Top-N/bottom-N ranking.

o Moving averages, growth rates, depreciation.

o Currency conversion, internal rates of return.

o Forecasting, trend/statistical analysis, variance calculations.

💡 Role of Concept Hierarchies in OLAP

 Enable aggregation and drilling at various levels of detail.

 Facilitate multilevel data exploration across dimensions.


Concept Hierarchies

1. Definition

 A concept hierarchy is a sequence of mappings from low-level (specific) concepts to high-level (general) concepts.

 Purpose: Allows multilevel data abstraction.

2. Types

 Schema Hierarchies (based on database attributes):

o Example for location: street < city < province/state < country

o Example for time: day < month < quarter < year

 Lattice Structure:

o Supports partial orders, e.g., day < month < quarter, and week < year.

 Set-Grouping Hierarchies:

o Created by grouping values into ranges or categories.

o Example for price: $0–$100, $100–$200, etc.

o User-defined groups: cheap, moderate, expensive.

 Multiple Hierarchies:

o A single attribute can have multiple concept hierarchies depending on the analysis (e.g., price can be grouped by range or by category).

3. Sources of Concept Hierarchies

 Manual: Provided by users or domain experts.

 Automatic: Generated using statistical analysis (e.g., clustering).

📏 Measures: Categorization & Computation

Measure: A numerical value computed for each multidimensional point (e.g., sales).
Categories of Measures:

 Distributive: Can be computed in parts and then aggregated. Examples: sum(), count(), min(), max(). Easy and efficient to compute.

 Algebraic: Computed from a fixed number (M) of distributive measures. Examples: avg() = sum()/count(), stddev(). Depends on multiple distributive functions.

 Holistic: Cannot be expressed with a bounded number of distributive results. Examples: median(), mode(), rank(). Complex; may require approximation.

🧠 OLAP Engine Capabilities

 Enables complex analytical computations.

 Supports:

o Aggregations, hierarchies, ratios

o Forecasting, trend and statistical analysis

 Provides a user-friendly, interactive environment for querying multidimensional data.

Detailed Definition of Mining Frequent Patterns

Frequent pattern mining is a fundamental task in data mining that involves discovering patterns (such as itemsets, sequences, or structures) that occur frequently in a dataset. These patterns reveal relationships and associations between data items that can be useful in decision-making, prediction, recommendation, and classification.
Key Concepts in Frequent Pattern Mining

1. Frequent Pattern

A frequent pattern is a set of items, subsequences, or structures that appear together frequently in a dataset.

Examples:

 Frequent Itemset: {milk, bread} appears together in many transactions.

 Frequent Sequential Pattern: <PC → Digital Camera → Memory Card> appears in many customer purchase histories.

 Frequent Structured Pattern: A frequently recurring subgraph in a chemical compound dataset.
2. Itemset

A collection of one or more items. For instance, in a supermarket:

 {milk}, {milk, bread}, {bread, butter, eggs} are itemsets.

A k-itemset contains k items.

3. Support

 The support of an itemset is the proportion (or count) of transactions that contain the itemset.

 It measures how frequently an itemset appears in the dataset.

Support(A) = (Number of transactions containing A) / (Total number of transactions)

4. Confidence

 The confidence of a rule A → B is the probability that transactions containing A also contain B.

Confidence(A → B) = Support(A ∪ B) / Support(A)

It shows how reliable the rule is.
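A minimal sketch computing these two quantities directly from their definitions (the five transactions are made up):

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk"}, {"milk", "bread", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"milk"}, {"bread"}
print(support(A | B))               # Support(A ∪ B) = 3/5 = 0.6
print(support(A | B) / support(A))  # Confidence(A -> B) = 0.6 / 0.8 = 0.75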

5. Association Rules

An association rule is an implication of the form:

A → B [support = s%, confidence = c%]

It means that if A occurs, B is likely to occur, with support s and confidence c.

6. Closed and Maximal Frequent Itemsets

 Closed Frequent Itemset: A frequent itemset that has no proper superset with the same support.

 Maximal Frequent Itemset: A frequent itemset that has no frequent supersets.
Frequent Itemsets, Closed Itemsets, and Association Rules

🔹 1. Basic Definitions

➤ Itemset:

 A group of items.

 A k-itemset contains k items (e.g., {bread, milk} is a 2-itemset).

➤ Transaction (T):

 A set of items bought together.

 Identified by a unique TID.

🔹 2. Support and Confidence

➤ Support:

 Fraction of transactions that contain an itemset.

Support(A ∪ B) = (Transactions containing both A and B) / (Total transactions)

➤ Confidence:

 Likelihood of item B occurring given item A.

Confidence(A → B) = Support(A ∪ B) / Support(A)

🔹 3. Association Rules

 Form: A → B

 Indicates a strong relationship: "If A occurs, B is likely to occur."


✔️ Strong Rules:

 Satisfy both:

o Minimum Support (minsup)

o Minimum Confidence (minconf)

🔹 4. Frequent Itemsets

 An itemset is frequent if its support ≥ minsup.

 Support count = number of transactions containing the itemset.

🔹 5. Closed Frequent Itemsets

 An itemset is closed if no proper superset has the same support count.

 Captures complete support information.

 Used to eliminate redundancy.

🔹 6. Maximal Frequent Itemsets

 An itemset is maximal frequent if it is frequent and none of its supersets are frequent.

 Represents the outer boundary of the frequent itemsets.

 More compact, but may lose the support details of subsets.

🔹 7. Why Use Closed or Maximal?

 Mining all frequent itemsets may produce an exponential number of patterns.

 Closed and maximal itemsets reduce computation and storage.

🔹 8. Example

Dataset:

 T1: {a1, a2, ..., a100}

 T2: {a1, a2, ..., a50}

 minsup = 1

➤ Frequent itemsets: All non-empty subsets of T1 (every subset of T2 is also a subset of T1).

Total = 2^100 − 1 → far too many to enumerate!

➤ Closed frequent itemsets:

 {a1, ..., a50} → support: 2

 {a1, ..., a100} → support: 1

➤ Maximal frequent itemset:

 {a1, ..., a100} only (it has no frequent superset)

🔹 9. Association Rule Mining Steps

1. Find all frequent itemsets (support ≥ minsup).

2. Generate strong association rules from those itemsets (confidence ≥ minconf).

Apriori Algorithm – Overview

 Purpose: To mine frequent itemsets for Boolean association rules.

 Proposed by: R. Agrawal and R. Srikant (1994).

 Name Origin: Uses prior knowledge of itemset properties.

🔁 Working Principle

 Level-wise iterative approach:

o Finds the frequent 1-itemsets (L1) from the database.

o Uses Lk-1 to generate Lk (the frequent k-itemsets).

o Iterates until no more frequent itemsets can be found.

✅ Apriori Property (Antimonotonicity)

 Definition: All non-empty subsets of a frequent itemset must also be frequent.

 Implication:

o If an itemset I is infrequent, then any superset I ∪ A is also infrequent.

o This helps prune the candidate space (reducing computation).

🧩 Two-Step Process (Join & Prune)

1. Join Step:

o Generate candidate itemsets Ck by self-joining Lk-1.

o Join l1 and l2 in Lk-1 if their first k-2 items are the same.

o Lexicographic ordering ensures no duplicates are generated.

2. Prune Step:

o Remove candidate c ∈ Ck if any of its (k-1)-subsets is not in Lk-1.
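A compact sketch of this candidate-generation step (join on a shared (k-2)-prefix, then prune by the Apriori property); the L2 used in the demo is made up:

from itertools import combinations

def apriori_gen(Lk_1, k):
    """Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets Lk_1."""
    prev = sorted(tuple(sorted(s)) for s in Lk_1)   # lexicographic order, no duplicates
    Ck = set()
    for a, b in combinations(prev, 2):
        if a[:k - 2] == b[:k - 2]:                  # join: first k-2 items agree
            cand = frozenset(a) | frozenset(b)
            if len(cand) == k and all(              # prune: every (k-1)-subset frequent
                frozenset(sub) in Lk_1 for sub in combinations(cand, k - 1)
            ):
                Ck.add(cand)
    return Ck

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]}
print(apriori_gen(L2, 3))  # {frozenset({'I1','I2','I3'})}; {I2,I3,I4} is pruned ({I3,I4} infrequent)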

📊 Example (Using AllElectronics DB)

 Database D: 9 transactions (T100–T900).

 min_sup = 2 (support count).

 Iterations:

o C1 → L1: All 1-itemsets satisfying min_sup.

o C2 → L2: 2-itemsets from L1 × L1; all subsets are frequent, so no pruning occurs.

o C3 → L3: Prune itemsets with infrequent subsets using the Apriori property.

o C4: Generated but pruned entirely due to an infrequent subset → termination.

Generating Association Rules

✅ Definitions

 Frequent Itemset: An itemset whose support ≥ the minimum support threshold.

 Association Rule: An implication of the form A → B, where A and B are itemsets.

 Support Count: The number of transactions containing a given itemset.

 Confidence: Measures how often the items in B appear in transactions that contain A:

Confidence(A → B) = Support(A ∪ B) / Support(A)

⚙️ Steps to Generate Association Rules

1. Find all frequent itemsets using an algorithm such as Apriori or FP-Growth.

2. For each frequent itemset l:

o Generate all non-empty subsets s of l.

o For each s, form the rule s → (l − s).

o Compute the confidence of each rule.

3. Filter strong rules:

o Keep only the rules with confidence ≥ min_conf.

o All such rules automatically satisfy min_support because they are derived from frequent itemsets.
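A short sketch of step 2, generating strong rules s → (l − s) from one frequent itemset, given precomputed support counts (the counts below match the worked example that follows):

from itertools import combinations

support_count = {                     # assumed counts from a 9-transaction database
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
}

l = frozenset({"I1", "I2", "I5"})
min_conf = 0.70
for r in range(1, len(l)):                        # every non-empty proper subset s of l
    for s in map(frozenset, combinations(l, r)):
        conf = support_count[l] / support_count[s]
        if conf >= min_conf:
            print(f"{set(s)} -> {set(l - s)}  (confidence {conf:.0%})")
# Prints the three strong rules: {I1,I5}->{I2}, {I2,I5}->{I1}, {I5}->{I1,I2}.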

📘 Example

Let X = {I1, I2, I5} be a frequent itemset.

Non-empty proper subsets of X:

 {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}

Possible rules and their confidences (given the support counts):

 {I1, I2} → {I5}: 50%

 {I1, I5} → {I2}: 100%

 {I2, I5} → {I1}: 100%

 {I1} → {I2, I5}: 33%

 {I2} → {I1, I5}: 29%

 {I5} → {I1, I2}: 100%

With min_conf = 70%, the strong rules are:

 {I1, I5} → {I2}

 {I2, I5} → {I1}

 {I5} → {I1, I2}

FP-Growth Algorithm

✅ Motivation

The Apriori algorithm, though effective, suffers from:

 Huge candidate generation (e.g., 10⁴ frequent 1-itemsets can yield roughly 10⁷ candidate 2-itemsets).

 Multiple full database scans and expensive pattern matching.

💡 FP-Growth Solution

 Avoids candidate generation by using a divide-and-conquer strategy.

 Builds a compressed data structure called the FP-tree (Frequent Pattern Tree).

 Recursively mines conditional FP-trees for frequent patterns.

How FP-Growth Works

1. First Database Scan

 Count the support of all items → generate the frequent 1-itemsets.

 Sort items in descending order of support → list L.

2. Build FP-Tree

 Start with a null root.

 For each transaction:

o Sort its items according to L.

o Insert the path into the tree, sharing common prefixes.

o Increment node counts along existing prefixes.

 Maintain node-links for quick access via a header table.

3. Mine FP-Tree

For each item (starting from the least frequent in L):

 Construct its Conditional Pattern Base (CPB): the paths in the FP-tree ending with the item.

 Build a Conditional FP-Tree from the CPB.

 Recursively mine the conditional FP-tree.
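A hedged sketch of the two construction steps (scan, then insert transactions with shared prefixes); the toy database mirrors the nine-transaction example summarized below, and the recursive mining step is omitted for brevity:

from collections import Counter, defaultdict

transactions = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
                ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
                ["I1","I2","I3"]]
min_sup = 2

counts = Counter(i for t in transactions for i in t)       # first scan: item supports
L = [i for i, c in counts.most_common() if c >= min_sup]   # items sorted by support

class Node:
    def __init__(self, item): self.item, self.count, self.children = item, 0, {}

root, header = Node(None), defaultdict(list)               # header table: item -> node-links
for t in transactions:                                     # second scan: build the tree
    node = root
    for item in sorted((i for i in t if i in L), key=L.index):  # order items by L
        if item not in node.children:
            node.children[item] = Node(item)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1                                    # shared prefixes reuse nodes

print({i: sum(n.count for n in header[i]) for i in L})     # {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}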

📘 Example Summary

Frequent Items (sorted by support):

L = {I2:7, I1:6, I3:6, I4:2, I5:2}

Mining I5:

 CPB: {I2, I1:1}, {I2, I1, I3:1}

 Conditional FP-tree: I2:2, I1:2

 Patterns: {I2, I5}, {I1, I5}, {I2, I1, I5}

Mining I4:

 CPB: {I2, I1:1}, {I2:1}

 Conditional FP-tree: I2:2

 Pattern: {I2, I4}

Mining I3:

 CPB: {I2, I1:2}, {I2:2}, {I1:2}

 Conditional FP-tree: Two branches

 Patterns: {I2, I3}, {I1, I3}, {I2, I1, I3}

Mining I1:

 CPB: {I2:4}

 Conditional FP-tree: I2:4

 Pattern: {I2, I1}


🌲 FP-Tree Benefits

 Compact representation of the database.

 Reduces:

o Database scans

o Candidate generation

o Search space

🔄 Recursive Mining

 Focuses on smaller projected databases.

 Combines prefix path with suffix pattern during recursion.

🔍 Why Pattern Evaluation Matters

 Even strong rules (high support and confidence) can be misleading or uninteresting.

 This problem is especially common with:

o Low support thresholds

o Long patterns
📌 Strong Rules May Be Misleading

 Example: The rule "buys computer games → buys videos" has:

o Support: 40%

o Confidence: 66%

 It seems strong but is actually misleading, because:

o The base probability of buying videos is 75%.

o So buying games lowers the chance of buying videos (a negative correlation).

 Takeaway: Confidence does not imply true correlation.

⚙️ Correlation-Based Evaluation

To improve pattern evaluation, use correlation analysis.

✅ Lift:

 Formula:
lift(A → B) = P(A ∩ B) / (P(A) * P(B))

 Interpretation:

o Lift > 1: Positive correlation

o Lift < 1: Negative correlation

o Lift = 1: No correlation

 In the example, lift = 0.89 → negative correlation

✅ Chi-Square (χ²) Measure:

 Based on a contingency table of observed vs expected values

 Formula:
χ² = Σ (observed - expected)² / expected

 High χ² → Strong evidence of dependence

 In the example, χ² = 555.6, confirming negative correlation

✅ Conclusion: Which Measures Are Interesting?

 Support & confidence: Good for filtering, but insufficient on their own.

 Lift & chi-square (χ²): Better for identifying true interestingness.

o They capture actual dependencies (positive or negative).


Pattern Evaluation Methods in Association Rule Mining

🔹 Overview

 Most association rule mining algorithms use the support–confidence framework.

 Issue: Even with thresholds, many generated rules may still be uninteresting or misleading.

 This is especially problematic with:

o Low support thresholds

o Long patterns

 To tackle this, additional measures are used to assess the interestingness of patterns more effectively.

🧩 Strong Rules Are Not Necessarily Interesting

🔸 Subjective vs. Objective Interestingness:

 Subjective: Depends on user preferences, domain knowledge, etc.

 Objective: Based on statistical measures derived from the data.

❗ Objective measures help filter out misleading rules before presenting them to users.

🔸 Example 6.7 – A Misleading “Strong” Rule

Scenario:

 Data from AllElectronics about purchase behavior:

o Total transactions: 10,000

o Customers who bought:

 Computer games: 6000

 Videos: 7500

 Both: 4000

Discovered Rule:

buys(X, "computer games") → buys(X, "videos")

Support = 40%, Confidence = 66%

Analysis:

 The rule meets the minimum support (30%) and confidence (60%) thresholds → considered strong.

 BUT the actual probability of buying videos is 75%.

o Hence, the confidence (66%) < the base probability (75%).

o This indicates a negative correlation.

 Conclusion: The rule is misleading.

o Buying games actually reduces the likelihood of buying videos.

Key Insight:

 Confidence alone doesn't imply a meaningful association.

 Better metrics are needed to reveal true correlations.

🧪 From Association Analysis to Correlation Analysis

🔸 Why Use Correlation Measures?

 Support and confidence cannot detect:

o Independence

o Negative correlation

 Correlation measures evaluate the statistical dependency between items.

🔹 Lift Measure

Formula:

Lift(A → B) = P(A ∩ B) / (P(A) * P(B))

Interpretation:

 Lift > 1: Positive correlation (A implies B more often than by chance)

 Lift = 1: No correlation (independence)

 Lift < 1: Negative correlation (A implies B less often than by chance)

Example 6.8 – Applying Lift

 From the previous data:

o P(game) = 0.60

o P(video) = 0.75

o P(game ∩ video) = 0.40

Lift = 0.40 / (0.60 × 0.75) = 0.40 / 0.45 = 0.89

 Result: Since lift < 1 → negative correlation

✅ Lift detects negative correlation that confidence failed to reveal.

🔹 Chi-Square (χ²) Measure

Purpose:

 Tests independence between itemsets

 Based on observed vs. expected values in a contingency table

Formula:

χ² = Σ [(Observed − Expected)² / Expected]


Example 6.9 – Applying χ²

Contingency Table: Observed Values (Table 6.6)

            Game    ¬Game   Row Total
Video       4000    3500    7500
¬Video      2000    500     2500
Col Total   6000    4000    10000

Expected Values (Table 6.7)

            Game (Exp)   ¬Game (Exp)
Video       4500         3000
¬Video      1500         1000

χ² Calculation:

χ² = (4000−4500)²/4500 + (3500−3000)²/3000 + (2000−1500)²/1500 + (500−1000)²/1000

   = 500²/4500 + 500²/3000 + 500²/1500 + 500²/1000

   = 55.56 + 83.33 + 166.67 + 250.0

   = 555.6

Interpretation:

 A χ² value far above the critical value (3.84 at the 0.05 significance level for one degree of freedom) indicates a statistically significant deviation from independence.

 In this case:

o Observed joint occurrence (4000) < Expected (4500)

o Confirms negative correlation

✅ Like Lift, Chi-square also detects the negative correlation missed by the
confidence metric.
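As a cross-check, here is a small sketch that reproduces the lift and χ² values above (scipy's chi2_contingency applies the same observed-vs-expected formula):

from scipy.stats import chi2_contingency

N, game, video, both = 10000, 6000, 7500, 4000
lift = (both / N) / ((game / N) * (video / N))
print(round(lift, 2))  # 0.89 -> lift < 1: negative correlation

observed = [[4000, 3500],   # video:   game, ¬game
            [2000,  500]]   # ¬video
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1))  # 555.6
print(expected)        # [[4500. 3000.] [1500. 1000.]]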
