What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”— W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses
Data Warehouse - Subject-Oriented
■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process
Data Warehouse - Integrated
■ Constructed by integrating multiple, heterogeneous data
sources
■ Relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
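As a toy illustration of this kind of conversion (all names, fields, and exchange rates below are invented for the example), the sketch unifies hotel-price records from two sources into one convention: price in USD with tax included and the breakfast flag made explicit.

    # Hypothetical integration of hotel-price records; rates are invented.
    RATES_TO_USD = {"EUR": 1.1, "USD": 1.0}

    def normalize(record):
        """Convert a source record to: price in USD, tax included."""
        price = record["price"] * RATES_TO_USD[record["currency"]]
        if not record.get("tax_included", False):
            price *= 1.0 + record.get("tax_rate", 0.0)
        return {"hotel": record["hotel"],
                "price_usd": round(price, 2),
                "breakfast": bool(record.get("breakfast", False))}

    src_a = {"hotel": "H1", "price": 100.0, "currency": "EUR",
             "tax_included": False, "tax_rate": 0.19, "breakfast": True}
    src_b = {"hotel": "H1", "price": 125.0, "currency": "USD",
             "tax_included": True}
    print(normalize(src_a))   # now directly comparable with:
    print(normalize(src_b))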
Data Warehouse - Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
Data Warehouse - Time Variant
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly (e.g., warehouse audit columns such as dwh_create_time and dwh_update_time)
■ But the key of operational data may or may not contain a “time element”
The major distinguishing features of OLTP and OLAP are
summarized as follows:
Users and system orientation:
• An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and
information technology professionals.
• An OLAP system is market-oriented and is used for data
analysis by knowledge workers, including managers,
executives, and analysts.
OLTP – Online Transaction
Processing
OLAP – Online Analytical Processing
■ Data contents:
An OLTP system manages current data that, typically,
are too detailed to be easily used for decision making.
An OLAP system manages large amounts of historic data,
provides facilities for summarization and aggregation, and
stores and manages information at different levels of
granularity.
These features make the data easier to use for informed
decision making.
Database Design:
An OLTP system usually adopts an entity-relationship (ER)
data model and an application-oriented database design.
An OLAP system typically adopts either a star or a
snowflake model and a subject-oriented database
design.
■ View: An OLTP system focuses mainly on the current
data within an enterprise or department, without referring
to historic data or data in different organizations.
■ In contrast, an OLAP system often spans multiple versions
of a database schema, due to the evolutionary process of
an organization.
■ OLAP systems also deal with information that originates
from different organizations, integrating information from
many data stores.
■ Because of their huge volume, OLAP data are stored on
multiple storage media.
■ Access patterns: The access patterns of an OLTP system
consist mainly of short, atomic transactions. Such a system
requires concurrency control and recovery mechanisms.
■ However, accesses to OLAP systems are mostly read-only
operations (because most data warehouses store historic
rather than up-to-date information), although many could be
complex queries.
OLTP vs. OLAP
Parameter           | OLTP                                   | OLAP
users               | clerk, IT professional                 | knowledge worker
function            | day-to-day operations                  | decision support
DB design           | application-oriented                   | subject-oriented
data                | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage               | repetitive                             | ad hoc
access              | read/write; index/hash on primary key  | lots of scans
unit of work        | short, simple transaction              | complex query
# records accessed  | tens                                   | millions
# users             | thousands                              | hundreds
DB size             | 100 MB to GB                           | 100 GB to TB
metric              | transaction throughput                 | query throughput, response time
How are organizations using the information from
data warehouses?
Many organizations use this information to support business decision-making
activities, including:
1. Increasing customer focus, which includes the analysis of customer
buying patterns (such as buying preference, buying time, budget cycles,
and appetites for spending);
2. Repositioning products and managing product portfolios by comparing
the performance of sales by quarter, by year, and by geographic regions
in order to fine-tune production strategies;
3. Analyzing operations and looking for sources of profit; and
4. Managing customer relationships, making environmental corrections, and
managing the cost of corporate assets
■ Because operational databases store huge amounts of
data, you may wonder, “Why not perform online
analytical processing directly on such databases
instead of spending additional time and resources
to construct a separate data warehouse?”
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS - tuned for OLTP: access methods, indexing, concurrency
control, recovery
■ Warehouse - tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
■ Different functions and different data:
■ missing data: Decision support (DS) requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP analysis
directly on relational databases
Data Warehouse: A Multi-Tiered
Architecture
Figure: multi-tiered architecture. Operational databases and other external sources are extracted, transformed, loaded, and refreshed (with monitoring and integration against a metadata repository) into the data warehouse and data marts (the data storage tier); an OLAP server tier sits above the storage tier and serves the front-end tools (analysis, query/reports, data mining).
■ The bottom tier is a warehouse database server that is
almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (e.g.,
customer profile information provided by external
consultants)
■ These tools and utilities perform data extraction, cleaning,
and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load and
refresh functions to update the data warehouse
The middle tier is an OLAP server that is typically
implemented using either
a) A relational OLAP (ROLAP) model
(i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or
b) A multidimensional OLAP (MOLAP) model
(special-purpose server that directly implements
multidimensional data and operations)
The top tier is a front-end client layer, which
contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis,
prediction, and so on).
Three Data Warehouse Models
■ Enterprise warehouse
■ collects all of the information about subjects spanning the
entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
groups, such as a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases
■ Only some of the possible summary views may be
materialized
■ A virtual warehouse is easy to build but requires
excess capacity on operational database servers
“What are the pros and cons of the top-down and bottom-up
approaches to data warehouse development?”
■ The top-down development of an enterprise warehouse
serves as a systematic solution and minimizes integration
problems.
■ However, it is expensive, takes a long time to develop,
and lacks flexibility due to the difficulty in achieving
consistency and consensus for a common data model for
the entire organization.
■ The bottom-up approach to the design,
development, and deployment of independent
data marts provides flexibility, low cost, and rapid
return on investment.
■ It, however, can lead to problems when
integrating various disparate data marts into a
consistent enterprise data warehouse.
■ Depending on the source of data, data marts can
be categorized as independent or dependent.
■ Independent data marts are sourced from data
captured from one or more operational systems
or external information providers, or from data
generated locally within a particular department
or geographic area.
■ Dependent data marts are sourced directly from
enterprise data warehouses.
Extraction, Transformation, and Loading (ETL)
■ Data warehouse systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the following functions (a sketch follows the list):
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
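A minimal sketch of these stages over in-memory rows (all names and values are invented for illustration; real ETL tools operate on databases and files):

    # Minimal ETL sketch; all names and values are illustrative.
    raw = [{"city": "Vancouver", "amount": "120.5"},
           {"city": "vancouver", "amount": "80"},
           {"city": "Toronto", "amount": None}]          # dirty row

    def extract():                       # extraction: pull from the source
        return list(raw)

    def clean(rows):                     # cleaning: drop rows with errors
        return [r for r in rows if r["amount"] is not None]

    def transform(rows):                 # transformation: unified format
        return [{"city": r["city"].title(), "amount": float(r["amount"])}
                for r in rows]

    def load(rows):                      # load: summarize and consolidate
        totals = {}
        for r in rows:
            totals[r["city"]] = totals.get(r["city"], 0.0) + r["amount"]
        return totals

    warehouse = load(transform(clean(extract())))
    print(warehouse)                     # {'Vancouver': 200.5}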
Metadata Repository
Metadata is the data defining warehouse objects. The metadata repository stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents.
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error reports,
audit trails)
■ The algorithms used for summarization
■ which include measure and dimension definition algorithms, data on
granularity, partitions, subject areas, aggregation, summarization,
and predefined queries and reports.
■ The mapping from operational environment to the data warehouse
■ which includes source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
■ Data related to system performance
■ which include indices and profiles that improve data access
and retrieval performance, in addition to rules for the timing
and scheduling of refresh, update, and replication cycles.
■ Business data
■ which include business terms and definitions, data ownership
information, and charging policies.
Data Warehousing and On-line Analytical Processing
■ Data Warehouse: Basic Concepts
■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary
From Tables and Spreadsheets to Data Cubes
■ “What is a data cube?”
“A data cube allows data to be modeled and viewed in
multiple dimensions”.
■ It is defined by dimensions and facts.
Facts are numerical measures.
A dimension is a structure that categorizes data in order to
enable users to answer business questions
Dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
■ Eg: AllElectronics may create a sales data warehouse in order
to keep records of the store’s sales with respect to the
dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of
items and the branches and locations at which the items were
sold.
Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
• For example, a dimension table for item may contain the
attributes item name, brand, and type.
• Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data
distributions.
■ Facts are numeric measures. Think of them as the
quantities by which we want to analyze relationships
between dimensions.
■ Examples of facts for a sales data warehouse include
dollars sold (sales amount in dollars), units sold
(number of units sold), and amount budgeted.
■ The fact table contains the names of the facts, or
measures, as well as keys to each of the related dimension
tables.
■ In a 2-D representation, the sales for Vancouver
are shown with respect to the time dimension
(organized in quarters) and the item dimension
(organized according to the types of items sold).
■ The fact measure displayed is dollars sold
(in thousands).
■ suppose that we would like to view the sales data
with a third dimension.
■ For instance, suppose we would like to view the
data according to time and item, as well as
location, for the cities Chicago, New York,
Toronto, and Vancouver. These 3-D data are
shown in Table 4.3.
Figure: a 3-D data cube (cuboid) representation of the data in Table 4.3, according to
time, item, and location.
The measure displayed is dollars sold (in thousands).
Multidimensional Data
Sales volume as a function of product, month,
and region.
■ Suppose that we would now like to view our sales
data with an additional fourth dimension such as
supplier.
A 4-D data cube representation of sales data, according to time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
For improved readability, only some of the cube values are shown.
Cube: A Lattice of Cuboids
Given a set of dimensions, we can generate a cuboid for each
possible subset of the given dimensions.
The result forms a lattice of cuboids, each showing the
data at a different level of summarization, or group-by.
This lattice of cuboids is referred to as a data cube.
The previous slide shows the lattice of cuboids forming a data
cube for the dimensions time, item, location, and supplier.
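As a quick check of the lattice size, a sketch (pure Python, dimension names taken from the example above) that enumerates every cuboid, i.e., every subset of the four dimensions:

    from itertools import combinations

    dims = ("time", "item", "location", "supplier")
    cuboids = [c for k in range(len(dims), -1, -1)
               for c in combinations(dims, k)]
    print(len(cuboids))            # 16 = 2**4 cuboids
    for c in cuboids:
        print(c if c else "all")   # () is the apex cuboid, denoted 'all'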
■ The cuboid that holds the lowest level of summarization is called the
base cuboid.
■ For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the
given time, item, location, and supplier dimensions.
■ The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.
■ In our example, this is the total sales, or dollars sold, summarized
over all four dimensions. The apex cuboid is typically denoted by all.
Cube: A Lattice of Cuboids
Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each
cuboid represents a different degree of summarization.
Stars, Snowflakes, and Fact
Constellations: Schemas for
Multidimensional Data Models
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a set
of dimension tables
■ Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
Star Schema:
■ The most common modeling paradigm is the star
schema, in which the data warehouse contains
1) A large central table (fact table) containing the
bulk of the data, with no redundancy, and
2) A set of smaller attendant tables (dimension
tables), one for each dimension.
■ The schema graph resembles a starburst, with the
dimension tables displayed in a radial pattern
around the central fact table.
■ Example 4.1 Star schema. A star schema for
AllElectronics sales is shown in Figure 4.6. Sales
are considered along four dimensions: time, item,
branch, and location. The schema contains a
central fact table for sales that contains keys to
each of the four dimensions, along with two
measures: dollars sold and units sold.
■ To minimize the size of the fact table, dimension
identifiers (e.g., time key and item key) are
system-generated identifiers.
Figure 4.6 Star schema of sales data warehouse.
Example of Star Schema
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Snowflake Schema:
■ The snowflake schema is a variant of the star
schema model, where some dimension tables
are normalized, thereby further splitting the data
into additional tables.
■ The resulting schema graph forms a shape
similar to a snowflake.
■ The major difference between the snowflake and star schema models
is that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.
■ Such a table is easy to maintain and saves storage space. However,
this space savings is negligible in comparison to the typical
magnitude of the fact table.
■ Furthermore, the snowflake structure can reduce the effectiveness of
browsing, since more joins will be needed to execute a query.
Consequently, the system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not
as popular as the star schema in data warehouse design.
Example of Snowflake Schema
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, state_or_province, country
Sales fact table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Fact Constellation
Sophisticated applications may require multiple fact tables
to share dimension tables. This kind of schema can be viewed as a collection
of stars, and hence is called a galaxy schema or a fact constellation.
Example (see the schema below): this schema specifies two fact tables, sales and shipping.
The sales table definition is identical to that of the star schema.
The shipping table has five dimensions, or keys: item key, time key, shipper
key, from location, and to location, and two measures: cost and units shipped.
A fact constellation schema allows dimension tables to be shared between
fact tables.
For example, the dimension tables for time, item, and location are shared
between both the sales and shipping fact tables.
Example of Fact Constellation
time dimension table: time_key, day, day_of_the_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Dimensions: The Role of Concept Hierarchies
■ A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level, more
general concepts.
■ Consider a concept hierarchy for the dimension location.
City values for location include Vancouver, Toronto, New
York, and Chicago.
■ Many concept hierarchies are implicit within the database
schema. For example, suppose that the dimension location is
described by the attributes number, street, city, province or
state, zip code, and country. These attributes are related by a
total order, forming a concept hierarchy such as “street < city
< province or state < country.”
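A concept hierarchy is, in effect, a set of mappings; a minimal sketch (the street-to-city and city-to-province tables below are invented samples) that rolls a value up the hierarchy street < city < province_or_state < country:

    # Invented sample hierarchy tables for the location dimension.
    city_of_street = {"1 Main St": "Vancouver", "5 Bay St": "Toronto"}
    province_of_city = {"Vancouver": "British Columbia", "Toronto": "Ontario"}
    country_of_province = {"British Columbia": "Canada", "Ontario": "Canada"}

    def roll_up(street):
        """Map a low-level value to successively more general concepts."""
        city = city_of_street[street]
        province = province_of_city[city]
        return city, province, country_of_province[province]

    print(roll_up("1 Main St"))   # ('Vancouver', 'British Columbia', 'Canada')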
■ Hierarchical and lattice structures of
attributes in warehouse dimensions:
■ (a) a hierarchy for location and
■ (b) a lattice for time.
Lattice:
a partial order in which some levels are incomparable.
For time, for instance, day < month and day < week, but month and week
are not comparable (a week may cross month boundaries), so the levels
form a lattice rather than a single total order.
A Concept Hierarchy: Dimension (location)
all:     all
region:  Europe, ..., North_America
country: Germany, Spain, ... (under Europe); Canada, Mexico, ... (under North_America)
city:    Frankfurt, ... (under Germany); Vancouver, Toronto, ... (under Canada)
office:  L. Chan, ..., M. Wind (e.g., under Vancouver)
View of Warehouses and Hierarchies
Specification of hierarchies:
■ Schema hierarchy: day < {month < quarter ; week} < year
■ Set-grouping hierarchy: {1..10} < inexpensive
URL: https://www2.cs.sfu.ca/CourseCentral/459/han/tutorial/tutorial.html
Measures: Their Categorization and Computation
■ A data cube measure is a numeric function that can be
evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension value pairs defining the given point
■ Measures can be organized into three categories
■ Distributive,
■ Algebraic, and
■ Holistic
■ Based on the kind of aggregate functions used.
Data Cube Measures: Three Categories
■ Distributive: if the result derived by applying the function to
n aggregate values is the same as that derived by applying
the function to all of the data without partitioning
(see the sketch after this list)
■ Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer), each of which
is obtained by applying a distributive aggregate function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
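The distinction matters for cube computation: distributive and algebraic measures can be assembled from per-partition summaries, while holistic measures cannot. A minimal sketch:

    import statistics

    part1, part2 = [2, 4, 6], [8, 10]              # two data partitions

    # Distributive: sum of partition sums == sum over all the data.
    assert sum([sum(part1), sum(part2)]) == sum(part1 + part2)

    # Algebraic: avg() from two distributive summaries, sum() and count().
    s = sum(part1) + sum(part2)
    n = len(part1) + len(part2)
    assert s / n == statistics.mean(part1 + part2)

    # Holistic: the median of partition medians is NOT the global median.
    print(statistics.median([statistics.median(part1),
                             statistics.median(part2)]))   # 6.5
    print(statistics.median(part1 + part2))                # 6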
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ Drill Across: involving (across) more than one fact table
■ Drill Through: through the bottom level of the cube to its back-end
relational tables (using SQL)
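As a rough illustration of roll-up, slice, and dice (toy fact cells with invented values; real OLAP servers execute these against materialized cuboids):

    # Toy fact cells: (time, item, city) -> dollars_sold; values invented.
    facts = {("Q1", "phone", "Toronto"): 100, ("Q1", "phone", "Vancouver"): 150,
             ("Q2", "phone", "Toronto"): 120, ("Q1", "laptop", "Toronto"): 200}

    # Roll-up on location: climb the hierarchy city -> country.
    rollup = {}
    for (t, i, _city), v in facts.items():
        rollup[(t, i, "Canada")] = rollup.get((t, i, "Canada"), 0) + v

    # Slice: select a single value on one dimension (time = 'Q1').
    slice_q1 = {k: v for k, v in facts.items() if k[0] == "Q1"}

    # Dice: select on two or more dimensions.
    dice = {k: v for k, v in facts.items()
            if k[0] in ("Q1", "Q2") and k[1] == "phone"}
    print(rollup, slice_q1, dice, sep="\n")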
Figure: typical OLAP operations (roll-up, drill-down, slice, dice, pivot) illustrated on a sales data cube.
ADDITIONAL INFORMATION
Design of Data Warehouse: A Business Analysis
Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the data
warehouse
■ Data source view
■ exposes the information being captured, stored, and managed by
operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, with
short turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
Data Warehouse Development: A Recommended
Approach
Figure: define a high-level corporate data model first; develop the enterprise
data warehouse and data marts in parallel, with iterative model refinement on
both sides; integrate distributed data marts into a multi-tier data warehouse.
Data Warehouse Usage
■ Three kinds of data warehouse applications
■ Information processing
■ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
■ Analytical processing
■ multidimensional analysis of data warehouse data
■ supports basic OLAP operations, slice-dice, drilling, pivoting
■ Data mining
■ knowledge discovery from hidden patterns
■ supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools
From On-Line Analytical Processing (OLAP) to
On-Line Analytical Mining (OLAM)
■ Why Online Analytical Mining?
■ High quality of data in data warehouses
■ DW contains integrated, consistent, cleaned data
■ Available information processing structure surrounding
data warehouses
■ ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools
■ OLAP-based exploratory data analysis
■ Mining with drilling, dicing, pivoting, etc.
■ On-line selection of data mining functions
■ Integration and swapping of multiple mining
functions, algorithms, and tasks
Reflections about Today's Session
Google Form – Quiz
https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform
Conclusion
We have studied the following concepts in today's class:
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections
Contact Details:
Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] /
[email protected]
Welcome
To
DATA MINING AND DATA
WAREHOUSING -
21CS732
MODULE - 2
Module 2: Data Warehouse Implementation & Data Mining: Efficient data cube
computation: an overview; Indexing OLAP data: bitmap index and join index;
Efficient processing of OLAP queries; OLAP server architectures: ROLAP versus
MOLAP versus HOLAP. Introduction: What is data mining, Challenges, Data mining
tasks; Data: Types of data, Data quality, Data preprocessing, Measures of
similarity and dissimilarity.
Data Warehouse Implementation
■ Data warehouses contain huge volumes of data.
OLAP servers demand that decision support queries
be answered in the order of seconds.
■ Therefore, it is crucial for data warehouse systems
to support highly efficient cube computation
techniques, access methods, and query
processing techniques.
Efficient Data Cube Computation: An Overview
■ At the core of multidimensional data analysis is the
efficient computation of aggregations across many
sets of dimensions.
■ In SQL terms, these aggregations are referred to as
group-by’s. Each group-by can be represented by a
cuboid, where the set of group-by’s forms a lattice of
cuboids defining a data cube.
The compute cube Operator and the Curse of
Dimensionality
■ One approach to cube computation extends SQL
so as to include a compute cube operator.
■ The compute cube operator computes
aggregates over all subsets of the dimensions
specified in the operation.
■ This can require excessive storage space,
especially for large numbers of dimensions.
For Example:
■ A data cube is a lattice of cuboids. Suppose that you
want to create a data cube for AllElectronics sales that
contains the following:
city, item, year, and sales in dollars.
You want to be able to analyze the data, with queries such as
the following:
■ “Compute the sum of sales, grouping by city and item.”
■ “Compute the sum of sales, grouping by city.”
■ “Compute the sum of sales, grouping by item.”
Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ The bottom-most cuboid is the base cuboid
■ The top-most cuboid (apex) contains only one cell
■ How many cuboids are there in an n-dimensional cube with L_i levels per dimension?
  T = ∏_{i=1}^{n} (L_i + 1)
■ Materialization of data cube
■ Materialize every (cuboid) (full materialization), none
(no materialization), or some (partial materialization)
■ Selection of which cuboids to materialize
■ Based on the size of the data warehouse, access frequency, etc.
■ An SQL query containing no group-by (e.g., “compute the sum of total sales”) is a zero-dimensional operation.
■ An SQL query containing one group-by (e.g., “compute the
sum of sales, group-by city”) is a one-dimensional
operation.
■ A cube operator on n dimensions is equivalent to a
collection of group-by statements, one for each subset of
the n dimensions.
■ Therefore, the cube operator is the
n-dimensional generalization of the group-by operator.
■ Similar to the SQL syntax, the data cube could be defined as
  define cube sales_cube [city, item, year]: sum(sales_in_dollars)
■ For a cube with n dimensions, there are a total of 2^n cuboids,
including the base cuboid.
■ A statement such as
  compute cube sales_cube
would explicitly instruct the system to compute the sales
aggregate cuboids for all eight subsets of the set {city, item,
year}, including the empty subset.
■ Online analytical processing may need to access
different cuboids for different queries. Therefore, it
may seem like a good idea to compute in advance
all or at least some of the cuboids in a data cube.
■ Precomputation leads to fast response time and
avoids some redundant computation.
■ Most, if not all, OLAP products resort to some
degree of precomputation of multidimensional
aggregates.
■ A major challenge related to this precomputation,
however, is that the required storage space may
explode if all the cuboids in a data cube are
precomputed, especially when the cube has
many dimensions.
■ The storage requirements are even more
excessive when many of the dimensions have
associated concept hierarchies, each with
multiple levels. This problem is referred to as the
curse of dimensionality.
■ “How many cuboids are there in an n-dimensional data
cube?” If there were no hierarchies associated with each
dimension, then the total number of cuboids for an
n-dimensional data cube, as we have seen, is 2^n.
■ However, in practice, many dimensions do have
hierarchies. For example, time is usually explored not at
only one conceptual level (e.g., year), but rather at
multiple conceptual levels such as in the hierarchy “day <
month < quarter < year.”
■ For an n-dimensional data cube, the total number
of cuboids that can be generated (including the
cuboids generated by climbing up the hierarchies
along each dimension) is
  T = ∏_{i=1}^{n} (L_i + 1),
where L_i is the number of levels associated with
dimension i. One is added to L_i to include the
virtual top level, all. For example, with 10 dimensions
and 4 levels each (excluding all), T = 5^10 ≈ 9.8 × 10^6 cuboids.
(Note that generalizing to all is equivalent to removing the dimension.)
The “Compute Cube” Operator
■ Cube definition and computation in DMQL:
  define cube sales [item, city, year]: sum(sales_in_dollars)
  compute cube sales
■ Transformed into an SQL-like language (with a new operator, cube by,
introduced by Gray et al., 1996):
  SELECT item, city, year, SUM(amount)
  FROM SALES
  CUBE BY item, city, year
■ This requires computing the following group-bys:
  (city, item, year), (city, item), (city, year), (item, year),
  (city), (item), (year), ()
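A compute-cube sketch consistent with this definition (pure Python on toy rows; the data values are invented): it evaluates sum(sales) for every subset of {city, item, year}, i.e., all eight group-bys.

    from itertools import combinations

    # Toy base data: (city, item, year, sales_in_dollars); values invented.
    rows = [("Vancouver", "phone", 2004, 100),
            ("Vancouver", "laptop", 2004, 200),
            ("Toronto", "phone", 2005, 150)]
    dims = ("city", "item", "year")

    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):   # one cuboid per subset
            agg = {}
            for *keys, sales in rows:
                cell = tuple(keys[i] for i in group)      # project onto group-by
                agg[cell] = agg.get(cell, 0) + sales
            cube[tuple(dims[i] for i in group)] = agg

    print(cube[()])          # apex cuboid: {(): 450}
    print(cube[("city",)])   # {('Vancouver',): 300, ('Toronto',): 150}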
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a
base cuboid:
1. No materialization: Do not precompute any of the
“nonbase” cuboids. This leads to computing expensive
multidimensional aggregates on-the-fly, which can be
extremely slow.
2. Full materialization: Precompute all of the cuboids. The
resulting lattice of computed cuboids is referred to as the
full cube. This choice typically requires huge amounts of
memory space in order to store all of the precomputed
cuboids.
3. Partial materialization: Selectively compute a proper subset of the
whole set of possible cuboids. Alternatively, we may compute a
subset of the cube, which contains only those cells that satisfy
some user-specified criterion, such as where the tuple count of
each cell is above some threshold.
We will use the term subcube to refer to the latter case, where only
some of the cells may be precomputed for various cuboids. Partial
materialization represents an interesting trade-off between storage
space and response time.
■ The partial materialization of cuboids or subcubes should
consider three factors:
(1)identify the subset of cuboids or subcubes to
materialize;
(2)exploit the materialized cuboids or subcubes during
query processing; and
(3)efficiently update the materialized cuboids or
subcubes during load and refresh.
■ We can compute an iceberg cube, which is a data cube
that stores only those cube cells with an aggregate value
(e.g., count) that is above some minimum support
threshold.
■ Another common strategy is to materialize a shell cube.
This involves precomputing the cuboids for only a small
number of dimensions (e.g., three to five) of a data cube.
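A sketch of the iceberg idea on invented tuples (in SQL this is typically expressed as a GROUP BY with a HAVING count(*) >= min_sup clause): only cells whose count clears the threshold are kept.

    from collections import Counter
    from itertools import combinations

    # Toy (city, item) tuples; values invented for illustration.
    rows = [("Vancouver", "phone"), ("Vancouver", "phone"),
            ("Toronto", "phone"), ("Toronto", "laptop")]
    min_sup = 2

    iceberg = {}
    for k in range(3):                                 # subsets of 2 dimensions
        for group in combinations(range(2), k):
            counts = Counter(tuple(r[i] for i in group) for r in rows)
            for cell, c in counts.items():
                if c >= min_sup:                       # prune sparse cells
                    iceberg[(group, cell)] = c
    print(iceberg)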
■ Once the selected cuboids have been
materialized, it is important to take advantage of
them during query processing.
■ This involves several issues, such as how to
determine the relevant cuboid(s) from among the
candidate materialized cuboids,
■ how to use available index structures on the
materialized cuboids, and
■ how to transform the OLAP operations on to the
selected cuboid(s)
■ Finally, during load and refresh, the materialized
cuboids should be updated efficiently.
■ Parallelism and incremental update techniques
for this operation should be explored.
Indexing OLAP Data: Bitmap Index
and Join Index
■ To facilitate efficient data accessing, most data
warehouse systems support index structures
and materialized views (using cuboids).
■ OLAP data can be indexed using bitmap indexing and
join indexing.
Bitmap Indexing
■ The bitmap indexing method is popular in OLAP
products because it allows quick searching in
data cubes.
■ The bitmap index is an alternative representation of
the record ID (RID) list.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the
attribute’s domain.
■ If a given attribute’s domain consists of n values,
then n bits are needed for each entry in the
bitmap index (i.e., there are n bit vectors).
■ If the attribute has the value v for a given row in
the data table, then the bit representing that value
is set to 1 in the corresponding row of the bitmap
index. All other bits for that row are set to 0.
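A minimal bitmap-index sketch (toy base table; production systems compress these vectors, e.g., with WAH): one bit vector per distinct value of the indexed attribute, with queries answered by bitwise operations.

    # Toy base table; RIDs are the list positions 0..4.
    city = ["Vancouver", "Toronto", "Vancouver", "Montreal", "Toronto"]

    # One bit vector per distinct value of the indexed attribute.
    bitmap = {v: [1 if c == v else 0 for c in city] for v in set(city)}
    # e.g., bitmap["Toronto"] == [0, 1, 0, 0, 1]

    # A query such as city = Toronto OR Montreal becomes a fast bit-op:
    hits = [a | b for a, b in zip(bitmap["Toronto"], bitmap["Montreal"])]
    print([rid for rid, bit in enumerate(hits) if bit])   # [1, 3, 4]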
Indexing OLAP Data: Bitmap Index
■ Index on a particular column
■ Each value in the column has a bit vector: bit-op is fast
■ The length of the bit vector: # of records in the base table
■ The i-th bit is set if the i-th row of the base table has the value
for the indexed column
■ not suitable for high cardinality domains
■ A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al. TODS’06]
Indexing OLAP Data: Join Indices
■ Join index: JI(R-id, S-id) where R (R-id, ...) ⊳⊲ S (S-id, ...)
■ Traditional indices map values to a list of record IDs; a join
index materializes a relational join in the JI file and speeds up
relational joins
■ In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table
■ E.g., fact table Sales and two dimensions, city and product
  ■ A join index on city maintains, for each distinct city, a list
  of R-IDs of the tuples recording the sales in that city
■ Join indices can span multiple dimensions
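A join-index sketch in the same toy style (all names and values invented): for each distinct city in the dimension, keep the list of fact-table RIDs containing that city.

    # Toy fact table: RID -> (city, product, sales); values invented.
    fact = {0: ("Toronto", "phone", 10), 1: ("Vancouver", "phone", 20),
            2: ("Toronto", "laptop", 30)}

    # Join index on city: dimension value -> list of fact-table RIDs.
    join_index = {}
    for rid, (city, _prod, _sales) in fact.items():
        join_index.setdefault(city, []).append(rid)
    print(join_index)          # {'Toronto': [0, 2], 'Vancouver': [1]}

    # Answering "total sales in Toronto" touches only the listed rows:
    print(sum(fact[rid][2] for rid in join_index["Toronto"]))   # 40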
Efficient Processing OLAP Queries
■ The purpose of materializing cuboids and
constructing OLAP index structures is to speed
up query processing in data cubes.
Efficient Processing of OLAP Queries
■ Determine which operations should be performed on the available cuboids
■ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
■ Determine which materialized cuboid(s) should be selected for OLAP op.
■ Let the query to be processed be on {brand, province_or_state} with the
condition “year = 2004”, and there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query? (Cuboid 2 cannot be used, since country is more general than province_or_state; cuboids 1, 3, and 4 can each answer it, and cost estimation over these remaining candidates typically favors cuboid 3 or 4.)
■ Explore indexing structures and compressed vs. dense array structures in MOLAP
Given materialized views, query processing should
proceed as follows:
1. Determine which operations should be performed on the
available cuboids:
This involves transforming any selection, projection, roll-up (group-by), and
drill-down operations specified in the query into corresponding SQL and/or
OLAP operations.
2. Determine to which materialized cuboid(s) the
relevant operations should be applied:
This involves identifying all of the materialized cuboids that may potentially
be used to answer the query, pruning the set using knowledge of
“dominance” relationships among the cuboids, estimating the costs of using
the remaining materialized cuboids, and selecting the cuboid with the least
cost.
OLAP Server Architectures
■ Relational OLAP (ROLAP)
■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ Greater scalability
■ Multidimensional OLAP (MOLAP)
■ Sparse array-based multidimensional storage engine
■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
■ Flexibility, e.g., low level: relational, high-level: array
■ Specialized SQL servers (e.g., Redbricks)
■ Specialized support for SQL queries over star/snowflake
schemas
Relational OLAP (ROLAP) servers:
■ These are the intermediate servers that stand in between a
relational back-end server and client front-end tools.
■ They use a relational or extended-relational DBMS to store
and manage warehouse data, and OLAP middleware to
support missing pieces.
■ ROLAP servers include optimization for each DBMS back
end, implementation of aggregation navigation logic, and
additional tools and services.
■ ROLAP technology tends to have greater scalability than
MOLAP technology.
Figure: ROLAP server architecture.
Multidimensional OLAP (MOLAP) servers:
■ These servers support multidimensional data views
through array-based multidimensional storage engines.
■ They map multidimensional views directly to data cube
array structures.
■ The advantage of using a data cube is that it allows fast
indexing to precomputed summarized data.
■ Notice that with multidimensional data stores, the storage
utilization may be low if the data set is sparse.
■ In such cases, sparse matrix compression
techniques should be explored.
■ Many MOLAP servers adopt a two-level storage
representation to handle dense and sparse data sets:
■ Denser subcubes are identified and stored as array
structures, whereas sparse subcubes employ
compression technology for efficient storage
utilization.
Figure: MOLAP server architecture, and MOLAP vs. ROLAP.
Hybrid OLAP (HOLAP) servers:
■ The hybrid OLAP approach combines ROLAP and
MOLAP technology, benefiting from the greater
scalability of ROLAP and the faster computation of
MOLAP.
■ For example, a HOLAP server may allow large volumes
of detailed data to be stored in a relational database,
while aggregations are kept in a separate MOLAP store.
■ Microsoft SQL Server 2000, for example, supports a hybrid
OLAP server.
Figure: ROLAP vs. MOLAP vs. HOLAP comparison.
Specialized SQL servers:
■ To meet the growing demand of OLAP
processing in relational databases, some
database system vendors implement specialized
SQL servers that
■ provide advanced query language and
■ query processing support for SQL queries
over star and snowflake schemas in a read-
only environment.
Summary
■ Data warehousing: A multi-dimensional model of a data warehouse
■ A data cube consists of dimensions & measures
■ Star schema, snowflake schema, fact constellations
■ OLAP operations: drilling, rolling, slicing, dicing and pivoting
■ Data Warehouse Architecture, Design, and Usage
■ Multi-tiered architecture
■ Business analysis design framework
■ Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
■ Implementation: Efficient computation of data cubes
■ Partial vs. full vs. no materialization
■ Indexing OLAP data: Bitmap index and join index
■ OLAP query processing
■ OLAP servers: ROLAP, MOLAP, HOLAP
■ Data generalization: Attribute-oriented induction
Additional Information
Attribute-Oriented Induction
■ Proposed in 1989 (KDD ‘89 workshop)
■ Not confined to categorical data nor particular measures
■ How it is done?
■ Collect the task-relevant data (initial relation) using a
relational database query
■ Perform generalization by attribute removal or
attribute generalization
■ Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
■ Interaction with users for knowledge presentation
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of
graduate students in the University database
■ Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
■ Step 2. Perform attribute-oriented induction
■ Step 3. Present results in generalized relation, cross-tab,
or rule forms
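A rough sketch of Step 2 on invented tuples (attribute generalization via small concept-hierarchy tables, then merging identical generalized tuples while accumulating counts):

    # Invented student tuples: (gender, major, birth_place, gpa).
    students = [("M", "CS", "Vancouver, Canada", 3.7),
                ("M", "CS", "Montreal, Canada", 3.7),
                ("F", "Physics", "Seattle, USA", 3.9)]

    major_group = {"CS": "Science", "Physics": "Science"}   # concept hierarchy

    def generalize(t):
        gender, major, place, gpa = t
        country = place.split(", ")[-1]                     # city -> country
        grade = "Excellent" if gpa >= 3.75 else "Very good"
        return (gender, major_group[major], country, grade)

    # Merge identical generalized tuples and accumulate their counts.
    prime = {}
    for t in students:
        g = generalize(t)
        prime[g] = prime.get(g, 0) + 1
    for row, count in prime.items():
        print(*row, "count =", count)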
Class Characterization: An Example
Initial relation:

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
...             ...     ...      ...                    ...         ...                       ...       ...

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to Country; Birth_date to Age range; Residence to City; Phone # removed; GPA to {Excl, VG, ...}.

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
...     ...      ...           ...        ...        ...        ...

Cross tabulation (count by Gender and Birth_region):

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
Basic Principles of Attribute-Oriented Induction
■ Data focusing: task-relevant data, including dimensions,
and the result is the initial relation
■ Attribute-removal: remove attribute A if there is a large set of
distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are expressed
in terms of other attributes
■ Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
■ Attribute-threshold control: typical 2-8, specified/default
■ Generalized relation threshold control: control the final
relation/rule size
Attribute-Oriented Induction: Basic Algorithm
■ InitialRel: Query processing of task-relevant data, deriving
the initial relation.
■ PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
■ PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
■ Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
Presentation of Generalized Results
■ Generalized relation:
■ Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
■ Cross tabulation:
■ Mapping results into cross tabulation form (similar to contingency
tables).
■ Visualization techniques:
■ Pie charts, bar charts, curves, cubes, and other visual forms.
■ Quantitative characteristic rules:
■ Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%].
Mining Class Comparisons
■ Comparison: Comparing two or more classes
■ Method:
■ Partition the set of relevant data into the target class and the
contrasting class(es)
■ Generalize both classes to the same high level concepts
■ Compare tuples with the same high level descriptions
■ Present for every tuple its description and two measures
■ support - distribution within single class
■ comparison - distribution between classes
■ Highlight the tuples with strong discriminant features
■ Relevance Analysis:
■ Find attributes (features) which best distinguish different classes
Concept Description vs. Cube-Based OLAP
■ Similarity:
■ Data generalization
■ Presentation of data summarization at multiple levels of
abstraction
■ Interactive drilling, pivoting, slicing and dicing
■ Differences:
■ OLAP has systematic preprocessing, is query independent,
and can drill down to rather low levels
■ AOI has automated desired level allocation, and
may perform dimension relevance analysis/ranking
when there are many relevant dimensions
■ AOI works on data which are not in relational form
Reflections about Today's Session
Google Form – Quiz
https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform
Conclusion
We have studied the following concepts in today's class:
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections
Contact Details:
Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] /
[email protected]
Welcome
To
DATA MINING AND DATA
WAREHOUSING -
21CS732
Modules and High Level Topics
Module – 1: Data Warehousing & Modeling
Module – 2: Data warehouse implementation& Data
Mining
Module – 3: Association Analysis
Module – 4: Classification
Module – 5: Clustering Analysis
Module 3: Association Analysis: Problem definition, Frequent itemset
generation, Rule generation, Alternative methods for generating frequent
itemsets, FP-Growth algorithm, Evaluation of association patterns.
Association Rule Mining
● Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
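The market-basket transaction table itself was a figure in the slide and is not reproduced; the table below is the standard five-transaction example, reconstructed here because it matches the support counts quoted on the following slides (e.g., σ({Milk, Bread, Diaper}) = 2, s = 2/5):

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke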
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence,
not causality!
Definition: Frequent Itemset
● Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
● Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
● Support (s)
– Fraction of transactions that contain an
itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
– An itemset whose support is greater than
or equal to a minsup threshold
Definition: Association Rule
● Association Rule
– An implication expression of the form X → Y,
where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}
● Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y) / N
– Confidence (c)
◆ Measures how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
Example: for {Milk, Diaper} → {Beer}, s = 2/5 = 0.4 and c = 2/3 ≈ 0.67
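Both metrics can be computed directly from the transaction table; a short sketch using the five transactions above, for {Milk, Diaper} → {Beer}:

    transactions = [{"Bread", "Milk"},
                    {"Bread", "Diaper", "Beer", "Eggs"},
                    {"Milk", "Diaper", "Beer", "Coke"},
                    {"Bread", "Milk", "Diaper", "Beer"},
                    {"Bread", "Milk", "Diaper", "Coke"}]

    def sigma(itemset):                            # support count
        return sum(itemset <= t for t in transactions)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = sigma(X | Y) / len(transactions)           # sigma(X u Y) / N
    c = sigma(X | Y) / sigma(X)                    # sigma(X u Y) / sigma(X)
    print(s, round(c, 2))                          # 0.4 0.67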
Association Rule Mining Task
● Given a set of transactions T, the goal of association rule mining is
to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
– In terms of performance, a higher minsup yields fewer patterns and a faster run; a higher
minconf also yields fewer patterns, but not necessarily a faster run, because many algorithms
do not use minconf to prune the search space.
● Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each
rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
Computational Complexity
● Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules: R = 3^d − 2^(d+1) + 1
– If d = 6, R = 602 rules
Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
● All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
● Rules originating from the same itemset have identical support but
can have different confidence
● Thus, we may decouple the support and confidence requirements
Mining Association Rules
● Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
– The goal is to find interesting patterns and trends in transaction databases.
● Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets [itemset lattice figure not reproduced]
Frequent Itemset Generation
● Brute-force approach:
– Each itemset in the lattice is a candidate frequent
itemset
– Count the support of each candidate by scanning the
database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!! (N transactions, M candidates, w = maximum transaction width)
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
● Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates
● Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
● Apriori principle holds due to the following property of the
support measure:
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
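In symbols (a standard formal statement of this property, not printed on the slide): ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)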
[Lattice figure not reproduced: once an itemset is found to be infrequent, all of its supersets are pruned]
Illustrating Apriori Principle
Minimum Support = 3
(We assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3.)
Items (1-itemsets), then Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs), then Triplets (3-itemsets). [Itemset count tables not reproduced.]
If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16
6 + 6 + 1 = 13 (only one candidate triplet survives when triplets are generated from the frequent pairs alone)
Apriori Algorithm
(If an itemset is frequent, then all of its subsets must also be frequent.)
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
● Algorithm
– Let k=1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
◆ Candidate Generation: Generate Lk+1 from Fk
◆ Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k
that are infrequent
◆ Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
◆ Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only
those that are frequent => Fk+1
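The loop above as a minimal runnable sketch in Python. It reuses the transactions list from the earlier sketch, keeps itemsets as sorted tuples, and is an illustration rather than the textbook's full apriori-gen:

from itertools import combinations

def apriori(db, minsup_count):
    db = [set(t) for t in db]
    counts = {}
    for t in db:                       # Generate F1
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    Fk = sorted(s for s, c in counts.items() if c >= minsup_count)
    frequent = {s: counts[s] for s in Fk}
    k = 1
    while Fk:                          # Repeat until Fk is empty
        # Candidate Generation: merge pairs sharing their first k-1 items
        Lk1 = [Fk[i] + (Fk[j][-1],)
               for i in range(len(Fk)) for j in range(i + 1, len(Fk))
               if Fk[i][:-1] == Fk[j][:-1]]
        # Candidate Pruning: drop candidates with an infrequent k-subset
        fk = set(Fk)
        Lk1 = [c for c in Lk1 if all(s in fk for s in combinations(c, k))]
        # Support Counting: one scan of the DB per level
        counts = {c: sum(1 for t in db if set(c) <= t) for c in Lk1}
        # Candidate Elimination: keep the frequent ones => F(k+1)
        Fk = sorted(c for c, n in counts.items() if n >= minsup_count)
        frequent.update({c: counts[c] for c in Fk})
        k += 1
    return frequent

print(apriori(transactions, 3))   # frequent itemsets with support counts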
Apriori Algorithm [pseudocode figure not reproduced]
Candidate Generation: Brute-force method [figure not reproduced]
Candidate Generation: Merge Fk-1 and F1 itemsets [figure not reproduced]
Candidate Generation: Fk-1 x Fk-1 Method
● Merge two frequent (k-1)-itemsets if their first (k-2) items
are identical
● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
– Do not merge (ABD, ACD), because they share only a prefix of length 1 instead of length 2
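As a sketch in Python (itemsets as sorted tuples, with letters standing in for items):

def merge_candidates(F_prev, k):
    # F(k-1) x F(k-1): merge two frequent (k-1)-itemsets whose
    # first k-2 items are identical
    F_prev = sorted(F_prev)
    return [F_prev[i] + (F_prev[j][-1],)
            for i in range(len(F_prev))
            for j in range(i + 1, len(F_prev))
            if F_prev[i][:k - 2] == F_prev[j][:k - 2]]

F3 = [tuple("ABC"), tuple("ABD"), tuple("ABE"), tuple("ACD"),
      tuple("BCD"), tuple("BDE"), tuple("CDE")]
print(merge_candidates(F3, 4))
# [('A','B','C','D'), ('A','B','C','E'), ('A','B','D','E')]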
Candidate Pruning
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of
frequent 3-itemsets
● L4 = {ABCD,ABCE,ABDE} is the set of candidate 4-itemsets
generated (from previous slide)
● Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
● After candidate pruning: L4 = {ABCD}
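And the pruning step, reusing merge_candidates and F3 from the sketch above:

from itertools import combinations

def prune_candidates(candidates, F_prev):
    # keep only candidates whose every (k-1)-subset is frequent
    F_prev = set(F_prev)
    return [c for c in candidates
            if all(sub in F_prev for sub in combinations(c, len(c) - 1))]

L4 = merge_candidates(F3, 4)
print(prune_candidates(L4, F3))   # [('A', 'B', 'C', 'D')]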
Alternate Fk-1 x Fk-1 Method
● Merge two frequent (k-1)-itemsets if the last (k-2) items of the first one are identical to the first (k-2) items of the second.
● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Candidate Pruning for the Alternate Fk-1 x Fk-1 Method
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of
frequent 3-itemsets
● L4 = {ABCD,ABDE,ACDE,BCDE} is the set of candidate 4-itemsets
generated (from previous slide)
● Candidate pruning
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
● After candidate pruning: L4 = {ABCD}
Support Counting of Candidate Itemsets
● Support counting is the process of determining the frequency of
occurrence for every candidate itemset that survives the candidate
pruning step of the apriori-gen function.
● One approach for doing this is to compare each transaction against
every candidate itemset and to update the support counts of
candidates contained in the transaction
● This approach is computationally expensive, especially when the
numbers of transactions and candidate itemsets are large.
Support Counting of Candidate Itemsets
● To reduce number of comparisons, store the candidate
itemsets in a hash structure
– Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
{5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
How many of these itemsets are supported by transaction (1,2,3,5,6)?
Given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with item 5 or 6, because there are only two items in t whose labels are greater than or equal to 5.
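Checking by brute force in Python (the count, 3, is computed here rather than stated on the slide):

candidates = [frozenset(c) for c in (
    {1, 4, 5}, {1, 2, 4}, {4, 5, 7}, {1, 2, 5}, {4, 5, 8},
    {1, 5, 9}, {1, 3, 6}, {2, 3, 4}, {5, 6, 7}, {3, 4, 5},
    {3, 5, 6}, {3, 5, 7}, {6, 8, 9}, {3, 6, 7}, {3, 6, 8},
)]
t = {1, 2, 3, 5, 6}
supported = [set(c) for c in candidates if c <= t]
print(supported)        # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]
print(len(supported))   # 3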
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
{5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
● A hash function
● Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)
Hash function: branch on the hashed item, with 1, 4, 7 going left, 2, 5, 8 going middle, and 3, 6, 9 going right (i.e., item mod 3).
[Hash tree figure not reproduced: the 15 candidates distributed over the leaf nodes.]
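A deliberately simplified, one-level version of the hash-tree idea in Python (the real structure hashes again at each tree level; candidates is reused from the previous sketch):

from collections import defaultdict

buckets = defaultdict(list)
for c in candidates:
    buckets[min(c) % 3].append(c)   # bucket by h(first item) = item mod 3

t = {1, 2, 3, 5, 6}
hits = []
for item in sorted(t):
    # a candidate whose first item is `item` lives in bucket item % 3
    for c in buckets[item % 3]:
        if min(c) == item and c <= t:
            hits.append(set(c))
print(hits)   # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]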
Rule Generation
● The Apriori algorithm uses a level-wise approach for
generating association rules, where each level
corresponds to the number of items that belong to the rule
consequent.
● Initially, all the high-confidence rules that have only one
item in the rule consequent are extracted. These rules are
then used to generate new candidate rules.
● If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
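A quick count in Python: a frequent k-itemset yields 2^k − 2 candidate rules (here 14, matching the list above):

from itertools import combinations

items = set("ABCD")
rules = [(set(lhs), items - set(lhs))
         for r in range(1, len(items))
         for lhs in combinations(sorted(items), r)]
print(len(rules))   # 14 = 2**4 - 2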
Rule Generation
● In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– (Support of an itemset never exceeds the support of its subsets)
● But confidence of rules generated from the same itemset has an anti-monotone property
– E.g., suppose {A,B,C,D} is a frequent 4-itemset:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
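A sketch of this pruning in Python. Here support is assumed to be a dict mapping sorted item tuples to support counts (for example, the dict returned by the apriori sketch earlier):

def generate_rules(itemset, support, minconf):
    # grow consequents level by level; once a rule fails minconf, no
    # superset of its consequent is tried (anti-monotone confidence)
    itemset = tuple(sorted(itemset))
    rules, consequents = [], [(x,) for x in itemset]
    while consequents:
        survivors = []
        for rhs in consequents:
            lhs = tuple(x for x in itemset if x not in rhs)
            if not lhs:
                continue
            conf = support[itemset] / support[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                survivors.append(rhs)
        # merge surviving consequents that share their first items
        consequents = sorted({a + (b[-1],)
                              for a in survivors for b in survivors
                              if a[:-1] == b[:-1] and a[-1] < b[-1]})
    return rules

# e.g. generate_rules(("Bread", "Diaper", "Milk"), apriori(transactions, 2), 0.6)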
Rule Generation for Apriori Algorithm
[Lattice-of-rules figure not reproduced: once a rule has low confidence, all rules below it in the lattice (same itemset, larger consequent) are pruned]
Association Analysis: Basic Concepts
and Algorithms
Algorithms and Complexity
Impact of Support-Based Pruning
Items (1-itemsets) [itemset table not reproduced]
Minimum Support = 3:
If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning, 6 + 6 + 4 = 16
Minimum Support = 2:
If every subset is considered, 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56
Factors Affecting Complexity of Apriori
● Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database
– run time of algorithm increases with number of transactions
● Average transaction width
– a larger transaction width increases the max length of frequent itemsets
– the number of subsets in a transaction increases with its width, increasing the computation time for support counting
Example Question
● Given the following transaction data sets (dark cells indicate the presence of an item in a transaction) and a support threshold of 20%, answer the following questions.
Data Set A, Data Set B, Data Set C [transaction matrices not reproduced]
a. What is the number of frequent itemsets for each dataset? Which dataset will produce the largest number of frequent itemsets?
b. Which dataset will produce the longest frequent itemset?
c. Which dataset will produce frequent itemsets with the highest maximum support?
d. Which dataset will produce frequent itemsets containing items with widely varying support levels (i.e., itemsets containing items with mixed support, ranging from 20% to more than 70%)?
e. What is the number of maximal frequent itemsets for each dataset? Which dataset will produce the largest number of maximal frequent itemsets?
f. What is the number of closed frequent itemsets for each dataset? Which dataset will produce the largest number of closed frequent itemsets?
Pattern Evaluation
● Association rule algorithms can produce a large number of rules
● Interestingness measures can be used to prune/rank the patterns
– In the original formulation, support & confidence are
the only measures used
Contingency table
      Y     Ȳ
X    f11   f10   f1+
X̄    f01   f00   f0+
     f+1   f+0    N

f11: support of X and Y
f10: support of X and Ȳ
f01: support of X̄ and Y
f00: support of X̄ and Ȳ
Used to define various measures
● support, confidence, Gini,
entropy, etc.
Customers   Tea   Coffee   …
C1           0      1      …
C2           1      0      …
C3           1      1      …
C4           1      0      …
…
Association Rule: Tea → Coffee
Confidence ≅ P(Coffee|Tea) = 150/200 = 0.75
Since confidence > 50%, people who drink tea appear more likely to drink coffee than not to drink it, so the rule seems reasonable.
        Coffee   ¬Coffee
Tea       150       50      200
¬Tea      650      150      800
          800      200     1000
Association Rule: Tea → Coffee
Confidence= P(Coffee|Tea) = 150/200 = 0.75
but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the probability that the person drinks coffee!
⇒ Note that P(Coffee | ¬Tea) = 650/800 = 0.8125
Customers   Tea   Honey   …
C1           0      1     …
C2           1      0     …
C3           1      1     …
C4           1      0     …
…
Association Rule: Tea → Honey
Confidence ≅ P(Honey|Tea) = 100/200 = 0.50
Confidence = 50%, which may suggest that drinking tea has little influence on whether honey is used, so the rule seems uninteresting.
But P(Honey) = 120/1000 = 0.12, hence tea drinkers are actually far more likely to have honey.
Lift and interest:
lift(X → Y) = c(X → Y) / s(Y), and interest(X, Y) = s(X, Y) / (s(X) × s(Y));
lift is used for rules while interest is used for itemsets (for binary variables the two coincide).
        Coffee   ¬Coffee
Tea       150       50      200
¬Tea      650      150      800
          800      200     1000
Association Rule: Tea → Coffee
Confidence= P(Coffee|Tea) = 0.75
but P(Coffee) = 0.8
⇒ Interest = 0.15 / (0.2 × 0.8) = 0.9375 (< 1, therefore Tea and Coffee are negatively associated)
So, is it enough to use confidence/Interest for pruning?
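The computation above, as a tiny Python helper over the 2×2 table:

def interest(f11, f10, f01, f00):
    # s(X,Y) / (s(X) * s(Y)) from a 2x2 contingency table
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

print(interest(150, 50, 650, 150))   # 0.9375 -> < 1, negative association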
There are lots of measures proposed in the literature. [Table of interestingness measures not reproduced.]
10 examples of contingency tables, and rankings of the contingency tables using various measures. [Tables not reproduced.]
Property under Inversion Operation
[Figure not reproduced: binary transaction vectors (Transaction 1 … Transaction N) and their inversions.]
Correlation: −0.1667 and −0.1667 (unchanged under inversion)
IS/cosine: 0.0 vs 0.825 (changes under inversion)
Property under Null Addition
Invariant measures:
● cosine, Jaccard, All-confidence, confidence
Non-invariant measures:
● correlation, Interest/Lift, odds ratio, etc.
Property under Row/Column Scaling
Grade-Gender Example (Mosteller, 1968):

Original sample:
        Male   Female
High     30      20     50
Low      40      10     50
         70      30    100

After scaling Male ×2 and Female ×3:
        Male   Female
High     60      60    120
Low      80      30    110
        140      90    230

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
The odds ratio, (f11 × f00) / (f10 × f01), has this property: it equals 0.375 in both tables.
Property under Row/Column Scaling
Relationship between mask use and susceptibility to Covid:

Original sample:
          Covid-Positive   Covid-Free
Mask            20             30        50
No-Mask         40             10        50
                60             40       100

After scaling Covid-Positive ×2 and Covid-Free ×10:
          Covid-Positive   Covid-Free
Mask            40            300       340
No-Mask         80            100       180
               120            400       520

Mosteller: the underlying association should be independent of the relative number of Covid-positive and Covid-free subjects.
The odds ratio, (f11 × f00) / (f10 × f01), has this property: it equals 1/6 in both tables.
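A quick invariance check in Python on the two tables above:

def odds_ratio(f11, f10, f01, f00):
    # (f11 * f00) / (f10 * f01)
    return (f11 * f00) / (f10 * f01)

print(odds_ratio(20, 30, 40, 10))     # 0.1667 (original table)
print(odds_ratio(40, 300, 80, 100))   # 0.1667 (after column scaling)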
Different Measures have Different Properties [summary table of measure properties not reproduced]
Simpson’s Paradox
● Observed relationship in data may be influenced by the presence
of other confounding factors (hidden variables)
– Hidden variables may cause the observed
relationship to disappear or reverse its
direction!
● Proper stratification is needed to avoid generating spurious
patterns
Simpson’s Paradox
● Recovery rate from Covid
– Hospital A: 80%
– Hospital B: 90%
● Which hospital is better?
● Covid recovery rate on older population
– Hospital A: 50%
– Hospital B: 30%
● Covid recovery rate on younger population
– Hospital A: 99%
– Hospital B: 98%
(Hospital A is better in both age groups yet worse overall; this can happen when it treats a much larger share of older, higher-risk patients.)
Simpson’s Paradox
● Covid-19 death: (per 100,000 of population)
– County A: 15
– County B: 10
● Which county is managing the pandemic better?
● Covid death rate on older population
– County A: 20
– County B: 40
● Covid death rate on younger population
– County A: 2
– County B: 5
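The reversal is plain arithmetic. Backing out the age mix implied by the numbers above (a sketch in Python; the slide gives no population splits):

def older_share(overall, old, young):
    # overall = old * x + young * (1 - x)  =>  x = (overall - young) / (old - young)
    return (overall - young) / (old - young)

print(older_share(15, 20, 2))   # County A: ~0.72 -> mostly older residents
print(older_share(10, 40, 5))   # County B: ~0.14 -> mostly younger residents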
Association Rule Mining Algorithms?
• Apriori Algorithm
• FP-Growth Algorithm
• FP-Tree: Frequent Pattern Tree
• FP-Growth: mining frequent patterns with the FP-Tree
Terminology
• Item set
• A set of items: I = {a1, a2, ……, am}
• Transaction database
• DB = <T1, T2, ……, Tn>
• Pattern
• A set of items: A
• Support
• The number of transactions containing A in DB
• Frequent pattern
• A’s support ≥ minimum support threshold ξ
• Frequent Pattern Mining Problem
• The problem of finding the complete set of frequent
patterns
FP-Tree Definition
An efficient and scalable method to find frequent
patterns. It allows frequent itemset discovery
without candidate itemset generation.
FP-Tree Definition (cont.)
• Each node in the item prefix subtree consists of three fields:
• item-name
• node-link
• count
• Each entry in the frequent-item header table consists of two
fields:
• item-name
• head of node-link
FP-Tree and FP-Growth
Algorithm
• FP-Tree: Frequent Pattern Tree
• Compact representation of the DB without information loss.
• Easy to traverse, can quickly find out patterns associated with a
certain item.
• Well-ordered by item frequency.
• FP-Growth Algorithm
• Start mining from length-1 patterns
• Recursively do the following:
• Construct the pattern's conditional FP-tree
• Concatenate patterns from the conditional FP-tree with the suffix
• A divide-and-conquer mining technique
Algorithm Steps
• Scan DB once, find frequent 1-itemset (single item pattern)
• Sort frequent items in frequency descending order, f-list
• Scan DB again, construct FP-tree
• Construct the conditional FP-trees in the reverse order of the f-list, and generate the frequent itemsets
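These steps as a runnable sketch in Python. The slide's transaction table did not survive extraction, so the five transactions below are an assumed stand-in using the same item letters:

from collections import Counter

class Node:
    # FP-tree node: item-name, count, parent and children;
    # node-links are kept per item in a separate header table
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(db, minsup_count):
    # Steps 1-2: one scan for frequent items, sorted into the f-list
    freq = {i: c for i, c in Counter(i for t in db for i in t).items()
            if c >= minsup_count}
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(flist)}
    # Step 3: second scan, insert each transaction in f-list order
    root, header = Node(None, None), {item: [] for item in flist}
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])   # node-link
            node = node.children[item]
    return root, header, flist

db = [{"B", "J", "P"}, {"B", "P"}, {"B", "M", "E"}, {"P", "E"}, {"B", "P", "M"}]
root, header, flist = build_fp_tree(db, 2)
print(flist)   # descending support order; J (support 1) is dropped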
Example 1
Consider the transactions below, in which B = Bread, J = Jelly, P = Peanut Butter, M = Milk and E = Eggs. [Transaction table not reproduced.]
Step-1: Scan the DB once and find the frequent 1-itemsets (single-item patterns).
Example 1 (cont.)
Step-2: Since the minimum support threshold = 40%, remove all items bought in fewer than 40% of transactions, i.e., with a support count less than 2.
Step-3: Create an f-list in which the frequent items are sorted in descending order of support.
Example 1 (cont.)
Step-4: Sort frequent items in transactions based on F-list.
Example 1 (cont.) [FP-tree construction figures not reproduced]
FP-Tree Properties
• Completeness
• Each transaction that contains frequent pattern is mapped to a
path.
• Prefix sharing does not cause path ambiguity, as only paths starting from the root represent transactions.
• Compactness
• Number of nodes bounded by overall occurrence of frequent
items.
• Height of tree bounded by maximal number of frequent items in
any transaction.
FP-Tree Properties (cont.)
• Traversal Friendly (for mining task)
• For any frequent item ai, all the possible frequent patterns
that contain ai can be obtained by following ai’s node-links.
• This property is important for divide-and-conquer. It
assures the soundness and completeness of problem
reduction.
FP-Growth Algorithm
• Functionality:
• Mining frequent patterns using FP-Tree generated before
• Input:
• FP-tree constructed earlier
• minimum support threshold ξ
• Output:
• The complete set of frequent patterns
• Main algorithm:
• Call FP-growth(FP-tree, null)
FP-growth(Tree, α)
Procedure FP-growth(Tree, α)
{
  if (Tree contains only a single path P)
  {
    for each combination β of the nodes in P
    {
      generate pattern β ∪ α;
      support = min(support of all nodes in β);
    }
  }
  else   // Tree contains more than one path
  {
    for each ai in the header of Tree
    {
      generate pattern β = ai ∪ α;
      β.support = ai.support;
      construct β's conditional pattern base;
      construct β's conditional FP-tree Treeβ;
      if (Treeβ ≠ ∅)
        FP-growth(Treeβ, β);
    }
  }
}
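A runnable counterpart in Python, reusing build_fp_tree, header and flist from the earlier sketch. It prints every frequent pattern with its support (an illustration, not Han's full algorithm):

def fp_growth(header, flist, minsup_count, suffix=()):
    for item in reversed(flist):          # least frequent first
        support = sum(n.count for n in header[item])
        print(set((item,) + suffix), "support:", support)
        # conditional pattern base: prefix path of each node-link,
        # replicated by that node's count
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([set(path)] * node.count)
        # build the conditional FP-tree; recurse while it is non-empty
        _, cheader, cflist = build_fp_tree(cond_db, minsup_count)
        if cflist:
            fp_growth(cheader, cflist, minsup_count, (item,) + suffix)

fp_growth(header, flist, 2)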
Discussions
• When the database is extremely large:
• Use FP-Trees on projected databases.
• Or, make the FP-Tree disk-resident.
• Materialization of an FP-Tree:
• Construct it independently of queries, with a minimum support threshold low enough to cover the majority of expected queries.
• Incremental updates of an FP-Tree:
• Record a frequency count for every item.
• Control by watermark.
Reflections about Today's Session
Google Form – Quiz
https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform
Conclusion
We have studied the below concepts in today's class
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10. Reflections
Contact Details:
Dr.Manjunath T N
Professor and Dean – CG
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] / [email protected]