
What is a Data Warehouse?

■ Defined in many different ways, but not rigorously.


■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-
making process.”— W. H. Inmon

■ Data warehousing:
■ The process of constructing and using data warehouses

Data Warehouse - Subject-Oriented


■ Organized around major subjects, such as customer,
product, sales
■ Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
■ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process

Data Warehouse - Integrated
■ Constructed by integrating multiple, heterogeneous data sources
■ Relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are applied.
■ Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among the different data sources
■ E.g., hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is converted.

Data Warehouse - Nonvolatile

■ A physically separate store of data transformed from the


operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data

Data Warehouse - Time Variant
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly
(e.g., dwh_create_time, dwh_update_time)
■ But the key of operational data may or may not
contain a "time element"

The major distinguishing features of OLTP and OLAP are


summarized as follows:
Users and system orientation:

• An OLTP system is customer-oriented and is used for


transaction and query processing by clerks, clients, and
information technology professionals.

• An OLAP system is market-oriented and is used for data


analysis by knowledge workers, including managers,
executives, and analysts.

OLTP – Online Transaction Processing
OLAP – Online Analytical Processing
■ Data contents:

An OLTP system manages current data that, typically,


are too detailed to be easily used for decision making.

An OLAP system manages large amounts of historic data,


provides facilities for summarization and aggregation, and
stores and manages information at different levels of
granularity.

These features make the data easier to use for informed


decision making.


Database Design:

An OLTP system usually adopts an entity-relationship (ER)


data model and an application-oriented database design.

An OLAP system typically adopts either a star or a


snowflake model and a subject-oriented database
design.

■ View: An OLTP system focuses mainly on the current
data within an enterprise or department, without referring
to historic data or data in different organizations.

■ In contrast, an OLAP system often spans multiple versions


of a database schema, due to the evolutionary process of
an organization.

■ OLAP systems also deal with information that originates


from different organizations, integrating information from
many data stores.
■ Because of their huge volume, OLAP data are stored on
multiple storage media.

■ Access patterns: The access patterns of an OLTP system


consist mainly of short, atomic transactions. Such a system
requires concurrency control and recovery mechanisms.

■ However, accesses to OLAP systems are mostly read-only


operations (because most data warehouses store historic
rather than up-to-date information), although many could be
complex queries.

OLTP vs. OLAP

Parameter           OLTP                         OLAP
users               clerk, IT professional       knowledge worker
function            day to day operations        decision support
DB design           application-oriented         subject-oriented
data                current, up-to-date,         historical, summarized,
                    detailed, flat relational,   multidimensional,
                    isolated                     integrated, consolidated
usage               repetitive                   ad-hoc
access              read/write,                  lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction    complex query
# records accessed  tens                         millions
# users             thousands                    hundreds
DB size             100MB-GB                     100GB-TB
metric              transaction throughput       query throughput, response time

How are organizations using the information from data warehouses?

Many organizations use this information to support business decision-making
activities, including
1. Increasing customer focus, which includes the analysis of customer
buying patterns (such as buying preference, buying time, budget cycles,
and appetites for spending);
2. Repositioning products and managing product portfolios by comparing
the performance of sales by quarter, by year, and by geographic regions
in order to fine-tune production strategies;
3. Analyzing operations and looking for sources of profit; and
4. Managing customer relationships, making environmental corrections, and
managing the cost of corporate assets

■ Because operational databases store huge amounts of
data, you may wonder, “Why not perform online
analytical processing directly on such databases
instead of spending additional time and resources
to construct a separate data warehouse?”


Why a Separate Data Warehouse?


■ High performance for both systems
■ DBMS - tuned for OLTP: access methods, indexing, concurrency

control, recovery
■ Warehouse - tuned for OLAP: complex OLAP queries, multidimensional
view, consolidation
■ Different functions and different data:
■ missing data: Decision support (DS) requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP analysis
directly on relational databases
Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered data warehouse architecture. Operational DBs and other
sources feed the warehouse and data marts through extract, transform, load,
and refresh utilities, with a monitor/integrator and metadata repository;
an OLAP server forms the middle tier; front-end tools provide analysis,
query, reporting, and data mining. Tiers: Data Sources -> Data Storage ->
OLAP Engine -> Front-End Tools.]

■ The bottom tier is a warehouse database server that is
almost always a relational database system. Back-end
tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources (e.g.,
customer profile information provided by external
consultants).

■ These tools and utilities perform data extraction, cleaning,


and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load and
refresh functions to update the data warehouse


The middle tier is an OLAP server that is typically
implemented using either

a) A relational OLAP (ROLAP) model
(i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or

b) A multidimensional OLAP (MOLAP) model
(a special-purpose server that directly implements
multidimensional data and operations).

The top tier is a front-end client layer, which
contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis,
prediction, and so on).


Three Data Warehouse Models


■ Enterprise warehouse
■ collects all of the information about subjects spanning the

entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific,
selected groups, such as a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases

■ Only some of the possible summary views may be

materialized
■ A virtual warehouse is easy to build but requires
excess capacity on operational database servers

“What are the pros and cons of the top-down and bottom-up
approaches to data warehouse development?”
■ The top-down development of an enterprise warehouse
serves as a systematic solution and minimizes integration
problems.
■ However, it is expensive, takes a long time to develop,
and lacks flexibility due to the difficulty in achieving
consistency and consensus for a common data model for
the entire organization.


■ The bottom-up approach to the design,


development, and deployment of independent
data marts provides flexibility, low cost, and rapid
return on investment.

■ It, however, can lead to problems when


integrating various disparate data marts into a
consistent enterprise data warehouse.

■ Depending on the source of data, data marts can
be categorized as independent or dependent.
■ Independent data marts are sourced from data
captured from one or more operational systems
or external information providers, or from data
generated locally within a particular department
or geographic area.
■ Dependent data marts are sourced directly from
enterprise data warehouses.


Extraction, Transformation, and Loading (ETL)


■ Data warehouse systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the following functions:
■ Data extraction
■ get data from multiple, heterogeneous, and external

sources
■ Data cleaning
■ detect errors in the data and rectify them when possible

■ Data transformation
■ convert data from legacy or host format to warehouse

format
■ Load
■ sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
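A minimal Python sketch of these ETL steps, assuming a hypothetical CSV export
(the file name, field names, and SQLite target table are illustrative only,
not from the slides):

import csv
import sqlite3

def extract(path):
    # Data extraction: get raw records from a source (here, a CSV flat file)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_and_transform(rows):
    # Data cleaning: detect and drop bad records; Data transformation:
    # convert source formats into the warehouse format
    out = []
    for r in rows:
        if not r.get("item_key"):                   # reject records with a missing key
            continue
        out.append((r["item_key"],
                    r["city"].strip().title(),      # consistent naming convention
                    float(r["dollars_sold"])))      # host format -> numeric measure
    return out

def load(rows, db="warehouse.db"):
    # Load: write consolidated records into the warehouse fact table
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales(item_key TEXT, city TEXT, dollars_sold REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# Refresh would periodically re-run these steps to propagate source updates
load(clean_and_transform(extract("sales_export.csv")))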
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions,
data mart locations and contents.
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error reports,
audit trails)
■ The algorithms used for summarization
■ which include measure and dimension definition algorithms, data on

granularity, partitions, subject areas, aggregation, summarization,


and predefined queries and reports.


■ The mapping from operational environment to the data warehouse


■ which includes source databases and their contents, gateway

descriptions, data partitions, data extraction, cleaning,


transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
■ Data related to system performance
■ which include indices and profiles that improve data access

and retrieval performance, in addition to rules for the timing


and scheduling of refresh, update, and replication cycles.
■ Business data
■ which include business terms and definitions, data ownership

information, and charging policies.

Data Warehousing and On-line Analytical Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction

■ Summary


From Tables and Spreadsheets to Data Cubes

■ “What is a data cube?”


“A data cube allows data to be modeled and viewed in
multiple dimensions”.

■ It is defined by dimensions and facts.

Facts are numerical measures.

A dimension is a structure that categorizes data in order to


enable users to answer business questions

Dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
■ Eg: AllElectronics may create a sales data warehouse in order

to keep records of the store’s sales with respect to the


dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of
items and the branches and locations at which the items were
sold.

Each dimension may have a table associated with it, called a


dimension table, which further describes the dimension.
• For example, a dimension table for item may contain the
attributes item name, brand, and type.
• Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data
distributions.

■ Facts are numeric measures. Think of them as the


quantities by which we want to analyze relationships
between dimensions.

■ Examples of facts for a sales data warehouse include


dollars sold (sales amount in dollars), units sold
(number of units sold), and amount budgeted.

■ The fact table contains the names of the facts, or


measures, as well as keys to each of the related dimension
tables.

■ 2-D representation, the sales for Vancouver
are shown with respect to the time dimension
(organized in quarters) and the item
dimension(organized according to the types of
items sold).

■ The fact or measure displayed is dollars sold
(in thousands).


■ suppose that we would like to view the sales data
with a third dimension.

■ For instance, suppose we would like to view the


data according to time and item, as well as
location, for the cities Chicago, New York,
Toronto, and Vancouver. These 3-D data are
shown in Table 4.3.


■ A 3-D data cube (cuboid) representation of the data in Table 4.3, according to
time, item, and location.
■ The measure displayed is dollars sold (in thousands).


Multidimensional Data
Sales volume as a function of product, month,
and region.

■ Suppose that we would now like to view our sales
data with an additional fourth dimension such as
supplier.


A 4-D data cube representation of sales data, according to time, item,


location, and supplier. The measure displayed is dollars sold (in thousands).
For improved readability, only some of the cube values are shown.

Cube: A Lattice of Cuboids

 Given a set of dimensions, we can generate a cuboid for each


of the possible subsets of the given dimensions.
 The result would form a lattice of cuboids, each showing the
data at a different level of summarization, or group-by.

 The lattice of cuboids is then referred to as a data cube.

 In previous slide it shows a lattice of cuboids forming a data


cube for the dimensions time, item, location, and supplier.

 The lattice of cuboid forms a data cube


■ The cuboid that holds the lowest level of summarization is called the
base cuboid.

■ For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the
given time, item, location, and supplier dimensions.

■ The 0-D cuboid, which holds the highest level of summarization, is


called the apex cuboid.

■ In our example, this is the total sales, or dollars sold, summarized


over all four dimensions. The apex cuboid is typically denoted by all.

Cube: A Lattice of Cuboids

Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each
cuboid represents a different degree of summarization.

Stars, Snowflakes, and Fact Constellations:
Schemas for Multidimensional Data Models

Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a set
of dimension tables
■ Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

Star Schema:
■ The most common modeling paradigm is the star
schema, in which the data warehouse contains

1) A large central table (fact table) containing the


bulk of the data, with no redundancy, and
2) A set of smaller attendant tables (dimension
tables), one for each dimension.

■ The schema graph resembles a starburst, with the


dimension tables displayed in a radial pattern
around the central fact table.
■ Example 4.1 Star schema. A star schema for
AllElectronics sales is shown in Figure 4.6. Sales
are considered along four dimensions: time, item,
branch, and location. The schema contains a
central fact table for sales that contains keys to
each of the four dimensions, along with two
measures: dollars sold and units sold.
■ To minimize the size of the fact table, dimension
identifiers (e.g., time key and item key) are
system-generated identifiers.


Figure 4.6 Star schema of sales data warehouse.

Example of Star Schema

Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, state_or_province, country)

Sales Fact Table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
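A minimal sketch of how the star schema above could be declared, assuming
Python with an in-memory SQLite database (table and column names follow the
slide; the environment is an assumption):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, day_of_the_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
-- the central fact table holds one key per dimension plus the numeric measures
CREATE TABLE sales    (time_key INTEGER REFERENCES time(time_key),
                       item_key INTEGER REFERENCES item(item_key),
                       branch_key INTEGER REFERENCES branch(branch_key),
                       location_key INTEGER REFERENCES location(location_key),
                       units_sold INTEGER, dollars_sold REAL, avg_sales REAL);
""")

A typical OLAP query then joins the fact table with the needed dimension
tables and groups by the chosen dimension attributes.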

Snowflake Schema:
■ The snowflake schema is a variant of the star
schema model, where some dimension tables
are normalized, thereby further splitting the data
into additional tables.

■ The resulting schema graph forms a shape


similar to a snowflake.

■ The major difference between the snowflake and star schema models
is that the dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.

■ Such a table is easy to maintain and saves storage space. However,


this space savings is negligible in comparison to the typical
magnitude of the fact table.

■ Furthermore, the snowflake structure can reduce the effectiveness of


browsing, since more joins will be needed to execute a query.
Consequently, the system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not
as popular as the star schema in data warehouse design.


Example of Snowflake Schema

Dimension tables (partially normalized):
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
    supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
    city (city_key, city, state_or_province, country)

Sales Fact Table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Fact Constellation
Sophisticated applications may require multiple fact tables
to share dimension tables. This kind of schema can be viewed as a collection
of stars, and hence is called a galaxy schema or a fact constellation.

Example (see figure): This schema specifies two fact tables, sales and shipping.
The sales table definition is identical to that of the star schema.
The shipping table has five dimensions, or keys: item key, time key, shipper
key, from location, and to location, and two measures: cost and units shipped.
A fact constellation schema allows dimension tables to be shared between
fact tables.
For example, the dimensions tables for time, item, and location are shared
between both the sales and shipping fact tables.

Example of Fact Constellation

Shared dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)

Sales Fact Table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table:
  keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped
Dimensions: The Role of Concept Hierarchies

■ A concept hierarchy defines a sequence of mappings


from a set of low-level concepts to higher-level, more
general concepts.
■ Consider a concept hierarchy for the dimension location.
City values for location include Vancouver, Toronto, New
York, and Chicago.
■ Many concept hierarchies are implicit within the database
schema. For example, suppose that the dimension location is
described by the attributes number, street, city, province or
state, zip code, and country. These attributes are related by a
total order, forming a concept hierarchy such as “street < city
< province or state < country.”

■ Hierarchical and lattice structures of
attributes in warehouse dimensions:
■ (a) a hierarchy for location and
■ (b) a lattice for time.

Lattice :
A regular geometrical arrangement
of points or objects over an area or
in space.


A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for location]
  all:     all
  region:  Europe, ..., North_America
  country: Germany, Spain, ..., Canada, Mexico, ...
  city:    Frankfurt, ..., Vancouver, Toronto, ...
  office:  L. Chan, ..., M. Wind, ...
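A concept hierarchy is simply a mapping from lower-level to higher-level
concepts. A small Python sketch, covering only the example values shown above:

# lower-level value -> next higher-level value (only the example values above)
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}
country_to_region = {"Canada": "North_America", "Germany": "Europe"}

def roll_up_city(city, level):
    # Map a city to the requested level of the location hierarchy
    if level == "city":
        return city
    country = city_to_country[city]
    if level == "country":
        return country
    if level == "region":
        return country_to_region[country]
    return "all"                      # the virtual top level

print(roll_up_city("Vancouver", "region"))   # -> North_America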
View of Warehouses and Hierarchies

Specification of hierarchies
■ Schema hierarchy
day < {month < quarter; week} < year
■ Set_grouping hierarchy
{1..10} < inexpensive
URL: https://www2.cs.sfu.ca/CourseCentral/459/han/tutorial/tutorial.html

Measures: Their Categorization and Computation

■ A data cube measure is a numeric function that can be


evaluated at each point in the data cube space.
■ A measure value is computed for a given point by
aggregating the data corresponding to the respective
dimension value pairs defining the given point
■ Measures can be organized into three categories
■ Distributive,

■ Algebraic, and

■ Holistic

■ Based on the kind of aggregate functions used.

Data Cube Measures: Three Categories
■ Distributive: if the result derived by applying the function to
n aggregate values, is the same as that derived by
applying the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a bounded integer), each of which
is obtained by applying a distributive aggregate function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank()

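To make the three categories concrete, a small Python sketch (the data and
the two-way partition are made up for illustration): sum() is distributive,
avg() is algebraic because it can be rebuilt from the pair (sum, count), and
median() is holistic.

import statistics

data = [1, 2, 3, 10, 20, 90]
partitions = [data[:3], data[3:]]     # e.g., the data split across two cuboid partitions

# Distributive: combining per-partition sums equals the sum over all the data
assert sum(sum(p) for p in partitions) == sum(data)

# Algebraic: avg() is rebuilt from a bounded number of distributive aggregates
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
assert total / count == sum(data) / len(data)

# Holistic: the median of partition medians is not, in general, the overall median
medians = [statistics.median(p) for p in partitions]
print(statistics.median(medians), statistics.median(data))   # 11.0 vs 6.5 here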

Typical OLAP Operations


■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction

■ Drill down (roll down): reverse of roll-up


■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes

■ Other operations
■ Drill Across: involving (across) more than one fact table

■ Drill Through: through the bottom level of the cube to its back-end
relational tables (using SQL)

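As a rough illustration (not the slides' own example), these operations map
onto group-by and pivot operations; the toy pandas DataFrame below is an
assumption:

import pandas as pd

sales = pd.DataFrame({
    "year":         [2003, 2003, 2004, 2004],
    "quarter":      ["Q1", "Q2", "Q1", "Q2"],
    "item":         ["TV", "PC", "TV", "PC"],
    "city":         ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [605, 825, 680, 952],
})

# Roll-up: climb the time hierarchy from quarter to year
rollup = sales.groupby(["year", "item"])["dollars_sold"].sum()

# Drill-down: go back to (year, quarter, item) detail
drilldown = sales.groupby(["year", "quarter", "item"])["dollars_sold"].sum()

# Slice: select on one dimension (city = 'Vancouver')
vancouver_slice = sales[sales["city"] == "Vancouver"]

# Pivot: reorient the data into an item x year 2-D view
pivoted = sales.pivot_table(values="dollars_sold", index="item",
                            columns="year", aggfunc="sum")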

ADDITIONAL INFORMATION

Design of Data Warehouse: A Business Analysis
Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the data
warehouse
■ Data source view
■ exposes the information being captured, stored, and managed by
operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view of end-
user

Data Warehouse Design Process


■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record
Data Warehouse Development: A Recommended Approach

[Figure: incremental development. Define a high-level corporate data model;
build data marts and an enterprise data warehouse with repeated model
refinement; integrate the distributed data marts into a multi-tier data
warehouse.]

Data Warehouse Usage


■ Three kinds of data warehouse applications
■ Information processing
■ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
■ Analytical processing
■ multidimensional analysis of data warehouse data
■ supports basic OLAP operations, slice-dice, drilling, pivoting
■ Data mining
■ knowledge discovery from hidden patterns
■ supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results
using visualization tools
From On-Line Analytical Processing (OLAP) to On
Line Analytical Mining (OLAM)
■ Why Online Analytical Mining?
■ High quality of data in data warehouses

■ DW contains integrated, consistent, cleaned data

■ Available information processing structure surrounding


data warehouses
■ ODBC, OLEDB, Web accessing, service facilities,

reporting and OLAP tools


■ OLAP-based exploratory data analysis

■ Mining with drilling, dicing, pivoting, etc.

■ On-line selection of data mining functions

■ Integration and swapping of multiple mining

functions, algorithms, and tasks



Reflections about today's Session

Google Form – Quiz


https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform

Conclusion
We have studied the below concepts in today's class
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections


Contact Details:

Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] /
[email protected]

Welcome
To
DATA MINING AND DATA WAREHOUSING - 21CS732


MODULE - 2
Module-2: Data warehouse implementation & Data mining: Efficient Data
Cube computation: An overview, Indexing OLAP Data: Bitmap index and
join index, Efficient processing of OLAP Queries, OLAP server
architectures: ROLAP versus MOLAP versus HOLAP. Introduction: What
is data mining, Challenges, Data Mining Tasks, Data: Types of Data, Data
Quality, Data Preprocessing, Measures of Similarity and Dissimilarity

Data Warehouse Implementation
■ Data warehouses contain huge volumes of data.
OLAP servers demand that decision support queries
be answered in the order of seconds.

■ Therefore, it is crucial for data warehouse systems


to support highly efficient cube computation
techniques, access methods, and query
processing techniques.


Efficient Data Cube Computation: An Overview

■ At the core of multidimensional data analysis is the


efficient computation of aggregations across many
sets of dimensions.

■ In SQL terms, these aggregations are referred to as


group-by’s. Each group-by can be represented by a
cuboid, where the set of group-by’s forms a lattice of
cuboids defining a data cube.

The compute cube Operator and the Curse of
Dimensionality
■ One approach to cube computation extends SQL
so as to include a compute cube operator.

■ The compute cube operator computes


aggregates over all subsets of the dimensions
specified in the operation.

■ This can require excessive storage space,


especially for large numbers of dimensions.

For Example:

■ A data cube is a lattice of cuboids. Suppose that you


want to create a data cube for AllElectronics sales that
contains the following:
city, item, year, and sales in dollars.
You want to be able to analyze the data, with queries such as
the following:
■ “Compute the sum of sales, grouping by city and item.”

■ “Compute the sum of sales, grouping by city.”

■ “Compute the sum of sales, grouping by item.”

Efficient Data Cube Computation
■ Data cube can be viewed as a lattice of cuboids
■ The bottom-most cuboid is the base cuboid
■ The top-most cuboid (apex) contains only one cell
■ How many cuboids are there in an n-dimensional cube with L_i levels
for dimension i?
T = (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1)
■ Materialization of data cube
■ Materialize every (cuboid) (full materialization), none

(no materialization), or some (partial materialization)


■ Selection of which cuboids to materialize
■ Based on size of the data warehouse, access frequency, etc.

■ An SQL query containing no group-by
(e.g., "compute the sum of total sales") is a zero-
dimensional operation.
■ An SQL query containing one group-by (e.g., “compute the
sum of sales, group-by city”) is a one-dimensional
operation.
■ A cube operator on n dimensions is equivalent to a
collection of group-by statements, one for each subset of
the n dimensions.
■ Therefore, the cube operator is the
n-dimensional generalization of the group-by operator.

■ Similar to the SQL syntax, the data cube in our example could be defined
as
define cube sales_cube [city, item, year]: sum(sales_in_dollars)
■ For a cube with n dimensions, there are a total of 2^n cuboids,
including the base cuboid.
■ A statement such as
compute cube sales_cube
would explicitly instruct the system to compute the sales
aggregate cuboids for all eight subsets of the set {city, item,
year}, including the empty subset.


■ Online analytical processing may need to access


different cuboids for different queries. Therefore, it
may seem like a good idea to compute in advance
all or at least some of the cuboids in a data cube.
■ Precomputation leads to fast response time and
avoids some redundant computation.
■ Most, if not all, OLAP products resort to some
degree of precomputation of multidimensional
aggregates.

■ A major challenge related to this precomputation,
however, is that the required storage space may
explode if all the cuboids in a data cube are
precomputed, especially when the cube has
many dimensions.
■ The storage requirements are even more
excessive when many of the dimensions have
associated concept hierarchies, each with
multiple levels. This problem is referred to as the
curse of dimensionality.


■ "How many cuboids are there in an n-dimensional data
cube?" If there were no hierarchies associated with each
dimension, then the total number of cuboids for an n-
dimensional data cube, as we have seen, is 2^n.
■ However, in practice, many dimensions do have
hierarchies. For example, time is usually explored not at
only one conceptual level (e.g., year), but rather at
multiple conceptual levels such as in the hierarchy “day <
month < quarter < year.”

■ For an n-dimensional data cube, the total number
of cuboids that can be generated (including the
cuboids generated by climbing up the hierarchies
along each dimension) is

T = (L_1 + 1) × (L_2 + 1) × ... × (L_n + 1),

where L_i is the number of levels associated with
dimension i. One is added to L_i in the equation to include the
virtual top level, all.
(Note that generalizing to all is equivalent to the removal of the dimension.)
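A quick numerical check of the formula, with illustrative dimension and level
counts (not from the slides):

from math import prod

levels = [4] * 10                      # say, 10 dimensions with 4 levels each (plus "all")
total_cuboids = prod(L + 1 for L in levels)
print(total_cuboids)                   # 5**10 = 9,765,625 cuboids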

The "Compute Cube" Operator

■ Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
■ Transform it into a SQL-like language (with a new operator cube
by, introduced by Gray et al.'96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
■ Need to compute the following group-bys (the lattice of cuboids):
(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()
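The cube operator amounts to computing one group-by per subset of the
dimensions. A small Python/pandas sketch of that equivalence (the DataFrame
stands in for the SALES table and its rows are made up):

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item":   ["TV", "TV", "PC", "PC"],
    "city":   ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "year":   [2003, 2004, 2003, 2004],
    "amount": [605, 680, 825, 952],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):          # all 2^3 = 8 subsets of the dimensions
        if subset:
            cuboids[subset] = sales.groupby(list(subset))["amount"].sum()
        else:
            cuboids[()] = sales["amount"].sum()   # apex cuboid: the grand total
print(len(cuboids))                               # 8 group-bys, from () to (item, city, year)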
Partial Materialization: Selected Computation of Cuboids

There are three choices for data cube materialization given a
base cuboid:

1. No materialization: Do not precompute any of the


“nonbase” cuboids. This leads to computing expensive
multidimensional aggregates on-the-fly, which can be
extremely slow.

2. Full materialization: Precompute all of the cuboids. The


resulting lattice of computed cuboids is referred to as the
full cube. This choice typically requires huge amounts of
memory space in order to store all of the precomputed
cuboids.


3. Partial materialization: Selectively compute a proper subset of the


whole set of possible cuboids. Alternatively, we may compute a
subset of the cube, which contains only those cells that satisfy
some user-specified criterion, such as where the tuple count of
each cell is above some threshold.

We will use the term subcube to refer to the latter case, where only
some of the cells may be precomputed for various cuboids. Partial
materialization represents an interesting trade-off between storage
space and response time.

■ The partial materialization of cuboids or subcubes should
consider three factors:

(1)identify the subset of cuboids or subcubes to


materialize;
(2)exploit the materialized cuboids or subcubes during
query processing; and
(3)efficiently update the materialized cuboids or
subcubes during load and refresh.


■ We can compute an iceberg cube, which is a data cube
that stores only those cube cells with an aggregate value
(e.g., count) that is above some minimum support
threshold.

■ Another common strategy is to materialize a shell cube.


This involves precomputing the cuboids for only a small
number of dimensions (e.g., three to five) of a data cube.

■ Once the selected cuboids have been
materialized, it is important to take advantage of
them during query processing.
■ This involves several issues, such as how to
determine the relevant cuboid(s) from among the
candidate materialized cuboids,
■ how to use available index structures on the
materialized cuboids, and
■ how to transform the OLAP operations on to the
selected cuboid(s)

■ Finally, during load and refresh, the materialized


cuboids should be updated efficiently.
■ Parallelism and incremental update techniques
for this operation should be explored.

Indexing OLAP Data: Bitmap Index
and Join Index

■ To facilitate efficient data accessing, most data


warehouse systems support index structures
and materialized views (using cuboids).
■ OLAP data can be indexed by bitmap indexing and
join indexing.


Bitmap Indexing
■ The bitmap indexing method is popular in OLAP
products because it allows quick searching in
data cubes.
■ The bitmap index is an alternative representation of
the record ID (RID) list.
■ In the bitmap index for a given attribute, there is a
distinct bit vector, Bv, for each value v in the
attribute’s domain.

■ If a given attribute’s domain consists of n values,
then n bits are needed for each entry in the
bitmap index (i.e., there are n bit vectors).
■ If the attribute has the value v for a given row in
the data table, then the bit representing that value
is set to 1 in the corresponding row of the bitmap
index. All other bits for that row are set to 0.

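A minimal Python sketch of building the bit vectors for one attribute (the
column values are illustrative):

def bitmap_index(column):
    # One bit vector per distinct value: bit i is 1 iff row i holds that value
    index = {v: [0] * len(column) for v in set(column)}
    for i, v in enumerate(column):
        index[v][i] = 1
    return index

gender = ["M", "F", "F", "M", "F"]
idx = bitmap_index(gender)
print(idx["F"])                 # [0, 1, 1, 0, 1] -> rows 1, 2 and 4 have gender F

# Selections on several attributes reduce to fast bit-wise AND/OR of bit vectors.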

Indexing OLAP Data: Bitmap Index


■ Index on a particular column
■ Each value in the column has a bit vector: bit-op is fast
■ The length of the bit vector: # of records in the base table
■ The i-th bit is set if the i-th row of the base table has the value
for the indexed column
■ not suitable for high cardinality domains
■ A recent bit compression technique, Word-Aligned Hybrid (WAH),
makes it work for high cardinality domain as well [Wu, et al. TODS’06]


Indexing OLAP Data: Join Indices
■ Join index: JI(R-id, S-id) where R (R-id, …) ⊳⊲ S (S-id, …)
■ Traditional indices map the values to a list of record ids
■ It materializes the relational join in the JI file and
speeds up the relational join
■ In data warehouses, a join index relates the values
of the dimensions of a star schema to rows in
the fact table.
■ E.g., fact table: Sales, and two dimensions: city and product
■ A join index on city maintains, for each
distinct city, a list of R-IDs of the tuples
recording the Sales in that city
■ Join indices can span multiple dimensions
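A rough Python sketch of a join index on the city dimension (the fact rows
are made up):

from collections import defaultdict

# fact table rows: (row_id, city, product, dollars_sold)
sales = [
    (0, "Vancouver", "TV", 605.0),
    (1, "Toronto",   "PC", 825.0),
    (2, "Vancouver", "PC", 680.0),
]

join_index = defaultdict(list)          # dimension value -> list of fact-table row IDs
for row_id, city, product, dollars in sales:
    join_index[city].append(row_id)

print(join_index["Vancouver"])          # [0, 2]: the Sales rows that join with Vancouver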

Efficient Processing OLAP Queries

■ The purpose of materializing cuboids and


constructing OLAP index structures is to speed
up query processing in data cubes.

Efficient Processing of OLAP Queries
■ Determine which operations should be performed on the available cuboids
■ Transform drill, roll, etc. into corresponding SQL and/or OLAP operations,
e.g., dice = selection + projection
■ Determine which materialized cuboid(s) should be selected for the OLAP op.
■ Let the query to be processed be on {brand, province_or_state} with the
condition "year = 2004", and there are 4 materialized cuboids available:

1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
■ Explore indexing structures and compressed vs. dense array structures in
MOLAP

Given materialized views, query processing should


proceed as follows:
1. Determine which operations should be performed on the
available cuboids:
This involves transforming any selection, projection, roll-up (group-by), and
drill-down operations specified in the query into corresponding SQL and/or
OLAP operations.

2. Determine to which materialized cuboid(s) the


relevant operations should be applied:
This involves identifying all of the materialized cuboids that may potentially
be used to answer the query, pruning the set using knowledge of
“dominance” relationships among the cuboids, estimating the costs of using
the remaining materialized cuboids, and selecting the cuboid with the least
cost.

OLAP Server Architectures
■ Relational OLAP (ROLAP)
■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ Greater scalability
■ Multidimensional OLAP (MOLAP)
■ Sparse array-based multidimensional storage engine
■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
■ Flexibility, e.g., low level: relational, high-level: array
■ Specialized SQL servers (e.g., Redbricks)
■ Specialized support for SQL queries over star/snowflake
schemas

Relational OLAP(ROLAP) servers:


■ These are the intermediate servers that stand in between a
relational back-end server and client front-end tools.

■ They use a relational or extended-relational DBMS to store


and manage warehouse data, and OLAP middleware to
support missing pieces.

■ ROLAP servers include optimization for each DBMS back


end, implementation of aggregation navigation logic, and
additional tools and services.

■ ROLAP technology tends to have greater scalability than


MOLAP technology.
[Figure: ROLAP server architecture]

Multidimensional OLAP(MOLAP) servers:

■ These servers support multidimensional data views


through array-based multidimensional storage engines.

■ They map multidimensional views directly to data cube


array structures.

■ The advantage of using a data cube is that it allows fast


indexing to precomputed summarized data.

■ Notice that with multidimensional data stores, the storage


utilization may be low if the data set is sparse.
■ In such cases, sparse matrix compression
techniques should be explored.

■ Many MOLAP servers adopt a two-level storage


representation to handle dense and sparse data sets:

■ Denser subcubes are identified and stored as array


structures, whereas sparse subcubes employ
compression technology for efficient storage
utilization.


[Figures: MOLAP server architecture; MOLAP vs. ROLAP comparison]
Hybrid OLAP(HOLAP)servers:
■ The hybrid OLAP approach combines ROLAP and
MOLAP technology, benefiting from the greater
scalability of ROLAP and the faster computation of
MOLAP.

■ For example, a HOLAP server may allow large volumes


of detailed data to be stored in a relational database,
while aggregations are kept in a separate MOLAP store.

■ The Microsoft SQL Server 2000 supports a hybrid OLAP


server.

[Figure/table: ROLAP vs. MOLAP vs. HOLAP comparison]
Specialized SQL servers:
■ To meet the growing demand of OLAP
processing in relational databases, some
database system vendors implement specialized
SQL servers that
■ provide advanced query language and

■ query processing support for SQL queries

over star and snowflake schemas in a read-


only environment.


Summary
■ Data warehousing: A multi-dimensional model of a data warehouse
■ A data cube consists of dimensions & measures
■ Star schema, snowflake schema, fact constellations
■ OLAP operations: drilling, rolling, slicing, dicing and pivoting
■ Data Warehouse Architecture, Design, and Usage
■ Multi-tiered architecture
■ Business analysis design framework
■ Information processing, analytical processing, data mining, OLAM
(Online Analytical Mining)
■ Implementation: Efficient computation of data cubes
■ Partial vs. full vs. no materialization
■ Indexing OLAP data: Bitmap index and join index
■ OLAP query processing
■ OLAP servers: ROLAP, MOLAP, HOLAP
■ Data generalization: Attribute-oriented induction
Additional Information


Attribute-Oriented Induction
■ Proposed in 1989 (KDD ‘89 workshop)
■ Not confined to categorical data nor particular measures
■ How it is done?
■ Collect the task-relevant data (initial relation) using a

relational database query


■ Perform generalization by attribute removal or
attribute generalization
■ Apply aggregation by merging identical,
generalized tuples and accumulating their
respective counts
■ Interaction with users for knowledge presentation
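The core AOI steps (generalize each attribute, then merge identical
generalized tuples and accumulate their counts) can be sketched in a few
lines of Python; the tuples and generalization rules below are illustrative
only:

from collections import Counter

# task-relevant tuples: (major, birth_place, gpa)
students = [
    ("CS",      "Vancouver, Canada", 3.67),
    ("CS",      "Montreal, Canada",  3.70),
    ("Physics", "Seattle, USA",      3.83),
]

def generalize(major, birth_place, gpa):
    # Attribute generalization: climb each attribute's concept hierarchy
    major_gen = "Science" if major in ("CS", "Physics") else "Other"
    region = "Canada" if birth_place.endswith("Canada") else "Foreign"
    gpa_gen = "Excellent" if gpa >= 3.75 else "Very-good"
    return (major_gen, region, gpa_gen)

# merge identical generalized tuples and accumulate their counts
prime_relation = Counter(generalize(*t) for t in students)
for tup, count in prime_relation.items():
    print(tup, count)
# ('Science', 'Canada', 'Very-good') 2
# ('Science', 'Foreign', 'Excellent') 1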
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of
graduate students in the University database
■ Step 1. Fetch relevant set of data using an SQL
statement, e.g.,
Select * (i.e., name, gender, major, birth_place,
birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
■ Step 2. Perform attribute-oriented induction
■ Step 3. Present results in generalized relation, cross-tab,
or rule forms

Class Characterization: An Example

Initial relation (sample tuples):
  Name            Gender  Major    Birth_Place            Birth_date  Residence                 Phone #   GPA
  Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
  Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
  Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
  ...             ...     ...      ...                    ...         ...                       ...       ...

Generalization plan: Name removed; Gender retained; Major generalized to
{Sci, Eng, Bus}; Birth_Place generalized to country; Birth_date to age range;
Residence to city; Phone # removed; GPA to grades {Excl, VG, ...}.

Prime generalized relation:
  Gender  Major    Birth_region  Age_range  Residence  GPA        Count
  M       Science  Canada        20-25      Richmond   Very-good  16
  F       Science  Foreign       25-30      Burnaby    Excellent  22
  ...     ...      ...           ...        ...        ...        ...

Cross-tab (Gender x Birth_Region):
          Canada  Foreign  Total
  M       16      14       30
  F       10      22       32
  Total   26      36       62
Basic Principles of Attribute-Oriented Induction
■ Data focusing: task-relevant data, including dimensions,
and the result is the initial relation
■ Attribute-removal: remove attribute A if there is a large set of
distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher level concepts are expressed
in terms of other attributes
■ Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A
■ Attribute-threshold control: typical 2-8, specified/default
■ Generalized relation threshold control: control the final
relation/rule size

Attribute-Oriented Induction: Basic Algorithm

■ InitialRel: Query processing of task-relevant data, deriving
the initial relation.
■ PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
■ PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
■ Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
Presentation of Generalized Results
■ Generalized relation:
■ Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
■ Cross tabulation:
■ Mapping results into cross tabulation form (similar to contingency
tables).
■ Visualization techniques:
■ Pie charts, bar charts, curves, cubes, and other visual forms.
■ Quantitative characteristic rules:
■ Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad(x) ∧ male(x) ⇒
birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%].

Mining Class Comparisons


■ Comparison: Comparing two or more classes
■ Method:
■ Partition the set of relevant data into the target class and the
contrasting class(es)
■ Generalize both classes to the same high level concepts
■ Compare tuples with the same high level descriptions
■ Present for every tuple its description and two measures
■ support - distribution within single class
■ comparison - distribution between classes
■ Highlight the tuples with strong discriminant features
■ Relevance Analysis:
■ Find attributes (features) which best distinguish different classes

Concept Description vs. Cube-Based OLAP
■ Similarity:
■ Data generalization
■ Presentation of data summarization at multiple levels of
abstraction
■ Interactive drilling, pivoting, slicing and dicing
■ Differences:
■ OLAP has systematic preprocessing, is query independent,
and can drill down to rather low levels
■ AOI has automated desired level allocation, and
may perform dimension relevance analysis/ranking
when there are many relevant dimensions
■ AOI works on data which are not in relational forms

Reflections about today's Session

Google Form – Quiz


https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform

Conclusion
We have studied the below concepts in today's class
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections


Contact Details:

Dr.Manjunath T N
Professor and Dean – ER
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] /
[email protected]

Welcome
To
DATA MINING AND DATA WAREHOUSING - 21CS732


Modules and High Level Topics

Module – 1: Data Warehousing & Modeling


Module – 2: Data warehouse implementation & Data
Mining
Module – 3: Association Analysis
Module – 4: Classification
Module – 5: Clustering Analysis

Module-3: Association Analysis: Problem Definition,
Frequent Itemset Generation, Rule Generation, Alternative Methods for
Generating Frequent Itemsets, FP-Growth Algorithm, Evaluation of
Association Patterns.


Association Rule Mining


● Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction

Market-Basket transactions
Example of Association Rules

{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},

Implication means co-occurrence,


not causality!

Definition: Frequent Itemset
● Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items

● Support count (σ)


– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
● Support (s)
– Fraction of transactions that contain an
itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
– An itemset whose support is greater than
or equal to a minsup threshold
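
As a concrete illustration of these definitions, here is a minimal Python sketch of support counting. The transaction table itself appears only as a figure on the slide, so the five market-basket transactions below are an assumption (the classic Bread/Milk/Diaper/Beer/Eggs/Coke example that the quoted counts suggest), and the helper names are illustrative.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # s(X): fraction of transactions that contain X
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))   # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))         # 0.4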


Definition: Association Rule


● Association Rule
– An implication expression of the form X → Y,
where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}

● Rule Evaluation Metrics


– Support (s)
  ◆ Fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y) / N
– Confidence (c)
  ◆ Measures how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
Example: for {Milk, Diaper} → {Beer}, s = 0.4 and c = 0.67 on the market-basket data above.
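
Continuing the sketch above (support_count and transactions as defined earlier), both rule metrics can be computed directly from support counts; the values for {Milk, Diaper} → {Beer} agree with the example rules listed a few slides later. This is illustrative code, not the textbook's implementation.

def rule_metrics(X, Y, transactions):
    # s(X -> Y) = sigma(X u Y) / N   and   c(X -> Y) = sigma(X u Y) / sigma(X)
    sigma_xy = support_count(X | Y, transactions)
    return sigma_xy / len(transactions), sigma_xy / support_count(X, transactions)

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))   # 0.4 0.67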

Association Rule Mining Task
● Given a set of transactions T, the goal of association rule mining is
to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
– In terms of performance, a higher minsup yields fewer patterns and a faster algorithm; a higher minconf also yields fewer patterns, but not necessarily a faster run, because many algorithms do not use minconf to prune the search space.

● Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each
rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!

Computational Complexity
● Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules: R = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules
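
A quick sanity check of the closed-form count above (illustrative code, not from the slides): enumerating every itemset of size at least two and every non-empty proper split of it into antecedent and consequent reproduces R = 602 for d = 6.

from itertools import combinations

def total_rules(d):
    # Closed form: R = 3^d - 2^(d+1) + 1
    return 3**d - 2**(d + 1) + 1

def total_rules_brute(d):
    # Count rules X -> Y with X and Y non-empty and disjoint
    count = 0
    for k in range(2, d + 1):                  # size of X u Y
        for itemset in combinations(range(d), k):
            count += 2**k - 2                  # non-empty proper subsets usable as antecedent
    return count

print(total_rules(6), total_rules_brute(6))    # 602 602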

Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
● All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
● Rules originating from the same itemset have identical support but
can have different confidence
● Thus, we may decouple the support and confidence requirements


Mining Association Rules


● Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
– The goal is to find interesting patterns and trends in transaction databases.

● Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets


Frequent Itemset Generation


● Brute-force approach:
– Each itemset in the lattice is a candidate frequent
itemset
– Count the support of each candidate by scanning the
database

– Match each transaction against every candidate


– Complexity ~ O(NMw) => Expensive, since M = 2^d !!! (N transactions, M candidate itemsets, w the maximum transaction width)
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

● Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms

● Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every transaction


Reducing Number of Candidates


● Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

● Apriori principle holds due to the following property of the support measure:
  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

[Figure: itemset lattice in which an itemset found to be infrequent has all of its supersets pruned]


Illustrating Apriori Principle
Items (1-itemsets), Pairs (2-itemsets), Triplets (3-itemsets) [support-count tables shown as figures on the slide]
(No need to generate candidates involving Coke or Eggs)
Minimum Support = 3 (a support threshold of 60% is assumed, which is equivalent to a minimum support count of 3)
If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 4 = 16
6 + 6 + 1 = 13
Apriori Algorithm
(Recall: if an itemset is frequent, then all of its subsets must also be frequent.)
– Fk: frequent k-itemsets
– Lk: candidate k-itemsets
● Algorithm
– Let k = 1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty
  ◆ Candidate Generation: Generate Lk+1 from Fk
  ◆ Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
  ◆ Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
  ◆ Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent => Fk+1
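
The loop above can be sketched compactly in Python using the Fk-1 x Fk-1 candidate-generation scheme described on the following slides. Function and variable names are illustrative assumptions, not the textbook's code, and transactions is assumed to be a list of Python sets.

from itertools import combinations

def apriori(transactions, minsup_count):
    # F1: frequent 1-itemsets
    item_counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            item_counts[key] = item_counts.get(key, 0) + 1
    F = [{iset for iset, n in item_counts.items() if n >= minsup_count}]
    k = 1
    while F[-1]:
        # Candidate generation: merge frequent k-itemsets that share their first k-1 items
        prev = sorted(sorted(iset) for iset in F[-1])
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:k - 1] == b[:k - 1]:
                cand = frozenset(a) | frozenset(b)
                # Candidate pruning: every k-subset must already be frequent
                if all(frozenset(sub) in F[-1] for sub in combinations(cand, k)):
                    candidates.add(cand)
        # Support counting and candidate elimination
        cand_counts = {cand: 0 for cand in candidates}
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    cand_counts[cand] += 1
        F.append({cand for cand, n in cand_counts.items() if n >= minsup_count})
        k += 1
    return [iset for level in F for iset in level]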


Apriori Algorithm [the formal pseudocode is shown as a figure on the slide]
Candidate Generation: Brute-force method


Candidate Generation: Merge Fk-1 and F1 itemsets

Candidate Generation: Fk-1 x Fk-1 Method
● Merge two frequent (k-1)-itemsets if their first (k-2) items
are identical

● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

– Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of length 2
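
A small sketch of this merge step (illustrative code; itemsets are assumed to be kept as sorted tuples):

def merge_candidates(freq_k_minus_1):
    # Merge two (k-1)-itemsets when their first k-2 items are identical
    freq = sorted(freq_k_minus_1)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1]:
                candidates.append(a[:-1] + tuple(sorted((a[-1], b[-1]))))
    return candidates

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
print(merge_candidates(F3))   # [('A','B','C','D'), ('A','B','C','E'), ('A','B','D','E')]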


Candidate Pruning
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of
frequent 3-itemsets

● L4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide)

● Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

● After candidate pruning: L4 = {ABCD}
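
The pruning step can be sketched as a subset check against the frequent (k-1)-itemsets, reusing F3 and merge_candidates from the sketch above (illustrative code):

from itertools import combinations

def prune(candidates_k, freq_k_minus_1):
    # Keep a candidate only if every (k-1)-subset is a known frequent itemset
    freq = {tuple(sorted(f)) for f in freq_k_minus_1}
    return [c for c in candidates_k
            if all(tuple(sorted(sub)) in freq for sub in combinations(c, len(c) - 1))]

print(prune(merge_candidates(F3), F3))
# [('A', 'B', 'C', 'D')] : ABCE is dropped (ACE, BCE infrequent), ABDE is dropped (ADE infrequent)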


Alternate Fk-1 x Fk-1 Method


● Merge two frequent (k-1)-itemsets if the last (k-2) items of
the first one is identical to the first (k-2) items of the
second.

● F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE

Candidate Pruning for Alternate F k-1 x Fk-1 Method
● Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of
frequent 3-itemsets

● L4 = {ABCD,ABDE,ACDE,BCDE} is the set of candidate 4-itemsets


generated (from previous slide)
● Candidate pruning
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE and ADE are infrequent
– Prune BCDE because BCE is infrequent
● After candidate pruning: L4 = {ABCD}


Support Counting of Candidate Itemsets


● Support counting is the process of determining the frequency of
occurrence for every candidate itemset that survives the candidate
pruning step of the apriori-gen function.
● One approach for doing this is to compare each transaction against
every candidate itemset and to update the support counts of
candidates contained in the transaction
● This approach is computationally expensive, especially when the
numbers of transactions and candidate itemsets are large.
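
A sketch of this straightforward (and expensive) counting strategy, before any hash-tree optimization; the function and parameter names are illustrative:

from itertools import combinations

def count_supports(transactions, candidate_itemsets, k):
    # Naive support counting: enumerate the k-subsets of each transaction
    # and match them against the candidates (stored as sorted tuples)
    counts = {tuple(sorted(c)): 0 for c in candidate_itemsets}
    for t in transactions:
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts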

Support Counting of Candidate Itemsets
● To reduce number of comparisons, store the candidate
itemsets in a hash structure
– Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets


Suppose you have 15 candidate itemsets of length 3:


{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3
5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction (1,2,3,5,6)?

Given t = {1, 2, 3, 5, 6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3. It is not possible to construct a 3-itemset that begins with item 5 or 6, because there are only two items in t whose labels are greater than or equal to 5.
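
The count can be verified directly (illustrative code): exactly three of the fifteen candidates, {1 2 5}, {1 3 6} and {3 5 6}, are contained in t = {1, 2, 3, 5, 6}.

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
t = {1, 2, 3, 5, 6}
contained = [c for c in candidates if set(c) <= t]
print(contained, len(contained))   # [(1, 2, 5), (1, 3, 6), (3, 5, 6)] 3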

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3
5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
● Hash function
● Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)

Hash function: items 1, 4, 7 hash to the first branch; 2, 5, 8 to the second; 3, 6, 9 to the third.
[Hash tree figure on the slide: the 15 candidate 3-itemsets, e.g. {1 4 5}, {1 2 4}, {2 3 4}, {3 4 5}, {3 5 6}, {5 6 7}, {6 8 9}, distributed over the leaf nodes]

Rule Generation
● The Apriori algorithm uses a level-wise approach for
generating association rules, where each level
corresponds to the number of items that belong to the rule
consequent.
● Initially, all the high-confidence rules that have only one
item in the rule consequent are extracted. These rules are
then used to generate new candidate rules.
● If {A,B,C,D} is a frequent itemset, candidate rules:
ABC →D, ABD →C, ACD →B, BCD →A,
A →BCD, B →ACD, C →ABD, D →ABC
AB →CD, AC → BD, AD → BC, BC →AD,
BD →AC, CD →AB,

Rule Generation
● In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
– Support of an itemset never exceeds the support of its subsets
● But confidence of rules generated from the same itemset has an anti-monotone property
– E.g., suppose {A,B,C,D} is a frequent 4-itemset:
  c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. number of
items on the RHS of the rule
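
A sketch of level-wise rule generation from a single frequent itemset that exploits this anti-monotone property: a consequent is grown only from rules that already meet minconf. This is illustrative code (support_count and transactions as in the earlier sketches), not the textbook's ap-genrules procedure.

from itertools import combinations

def rules_from_itemset(freq_itemset, transactions, minconf):
    itemset = frozenset(freq_itemset)
    sigma_itemset = support_count(itemset, transactions)
    rules = []
    consequents = [frozenset([i]) for i in itemset]      # start with 1-item consequents
    while consequents:
        survivors = []
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = sigma_itemset / support_count(X, transactions)
            if conf >= minconf:
                rules.append((set(X), set(Y), round(conf, 2)))
                survivors.append(Y)                       # only high-confidence consequents grow
        # merge surviving consequents that differ by one item to build the next level
        consequents = {a | b for a, b in combinations(survivors, 2) if len(a | b) == len(a) + 1}
    return rules

# With minconf = 0.6 on the example transactions above, this returns the four rules
# with c >= 0.6 from the "Mining Association Rules" slide for {Milk, Diaper, Beer}.
print(rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, 0.6))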

Rule Generation for Apriori Algorithm


[Figure: lattice of rules generated from one frequent itemset; once a rule is found to have low confidence, the rules below it in the lattice are pruned]
Association Analysis: Basic Concepts
and Algorithms

Algorithms and Complexity


Factors Affecting Complexity of Apriori


● Choice of minimum support threshold

● Dimensionality (number of items) of the data set

● Size of database

● Average transaction width


Impact of Support Based Pruning


Items (1-itemsets) [support-count table shown as a figure on the slide]
Minimum Support = 3:
  If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
  With support-based pruning, 6 + 6 + 4 = 16
Minimum Support = 2:
  If every subset is considered, 6C1 + 6C2 + 6C3 + 6C4 = 6 + 15 + 20 + 15 = 56

Factors Affecting Complexity of Apriori
● Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
● Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
● Size of database
– run time of algorithm increases with number of transactions
● Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting


Example Question
● Given the following transaction data sets (dark cells indicate presence of an
item in a transaction) and a support threshold of 20%, answer the following
questions

[Data sets A, B, and C shown as figures on the slide]

a. What is the number of frequent itemsets for each dataset? Which dataset will produce the most
number of frequent itemsets?
b. Which dataset will produce the longest frequent itemset?
c. Which dataset will produce frequent itemsets with highest maximum support?
d. Which dataset will produce frequent itemsets containing items with widely varying support levels (i.e.,
itemsets containing items with mixed support, ranging from 20% to more than 70%)?
e. What is the number of maximal frequent itemsets for each dataset? Which dataset will produce the
most number of maximal frequent itemsets?
f. What is the number of closed frequent itemsets for each dataset? Which dataset will produce the
most number of closed frequent itemsets?

Pattern Evaluation
● Association rule algorithms can produce large number of rules

● Interestingness measures can be used to prune/rank the patterns


– In the original formulation, support & confidence are
the only measures used

Contingency table
           Y      ¬Y
X          f11    f10    f1+
¬X         f01    f00    f0+
           f+1    f+0    N

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures:
● support, confidence, Gini, entropy, etc.


Customers   Tea   Coffee   …
C1          0     1        …
C2          1     0        …
C3          1     1        …
C4          1     0        …

Association Rule: Tea → Coffee

Confidence ≅ P(Coffee|Tea) = 150/200 = 0.75


Confidence > 50%, meaning people who drink tea are more
likely to drink coffee than not drink coffee
So rule seems reasonable
          Coffee   ¬Coffee
Tea       150      50        200
¬Tea      650      150       800
          800      200       1000

Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 150/200 = 0.75

but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the probability that the person drinks coffee!
⇒ Note that P(Coffee | ¬Tea) = 650/800 = 0.8125


Customers   Tea   Honey   …
C1          0     1       …
C2          1     0       …
C3          1     1       …
C4          1     0       …

Association Rule: Tea → Honey
Confidence ≅ P(Honey|Tea) = 100/200 = 0.50
Confidence = 50%, which may mean that drinking tea has little
influence whether honey is used or not
So rule seems uninteresting
But P(Honey) = 120/1000 = 0.12 (hence tea drinkers are far more likely to have honey)
Lift is used for rules while interest factor is used for itemsets: Lift(X → Y) = c(X → Y) / s(Y), and Interest(X, Y) = s(X, Y) / (s(X) × s(Y)); the two are numerically equivalent.


          Coffee   ¬Coffee
Tea       150      50        200
¬Tea      650      150       800
          800      200       1000

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.8
⇒ Interest = 0.15 / (0.2 × 0.8) = 0.9375 (< 1, therefore Tea and Coffee are negatively associated)
So, is it enough to use confidence/Interest for pruning?
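
A small sketch of computing the interest factor (lift) straight from the 2x2 contingency counts above (illustrative code):

def lift_from_counts(f11, f10, f01, f00):
    # lift = s(X,Y) / (s(X) * s(Y)), all estimated from the contingency table
    n = f11 + f10 + f01 + f00
    s_xy, s_x, s_y = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    return s_xy / (s_x * s_y)

print(lift_from_counts(150, 50, 650, 150))   # 0.9375 -> Tea and Coffee are negatively associated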
There are lots of measures proposed in the literature


10 examples of contingency tables, and rankings of the contingency tables using various measures [both shown as figures on the slide]


Property under Inversion Operation

[Figure: two pairs of binary transaction vectors (Transaction 1 … Transaction N), shown before and after the inversion operation]

Correlation:  -0.1667   -0.1667
IS/cosine:     0.0       0.825

Property under Null Addition

Invariant measures:
● cosine, Jaccard, All-confidence, confidence
Non-invariant measures:
● correlation, Interest/Lift, odds ratio, etc


Property under Row/Column Scaling


Grade-Gender Example (Mosteller, 1968):

Original sample:                      After scaling columns by 2x (Male) and 3x (Female):
        Male   Female                         Male   Female
High    30     20      50             High    60     60      120
Low     40     10      50             Low     80     30      110
        70     30      100                    140    90      230
Mosteller:
Underlying association should be independent of
the relative number of male and female students
in the samples

Odds-Ratio ( (f11 × f00) / (f10 × f01) ) has this property


Property under Row/Column Scaling
Relationship between Mask use and susceptibility to Covid:

Original sample:                                 After scaling columns by 2x (Covid-Positive) and 10x (Covid-Free):
           Covid-Positive   Covid-Free                     Covid-Positive   Covid-Free
Mask       20               30           50      Mask      40               300          340
No-Mask    40               10           50      No-Mask   80               100          180
           60               40           100               120              400          520
Mosteller:
Underlying association should be independent of
the relative number of Covid-positive and Covid-free subjects

Odds-Ratio ( (f11 × f00) / (f10 × f01) ) has this property



Different Measures have Different Properties

Simpson’s Paradox
● Observed relationship in data may be influenced by the presence
of other confounding factors (hidden variables)
– Hidden variables may cause the observed
relationship to disappear or reverse its
direction!

● Proper stratification is needed to avoid generating spurious patterns


Simpson’s Paradox
● Recovery rate from Covid
– Hospital A: 80%
– Hospital B: 90%
● Which hospital is better?
● Covid recovery rate on older population
– Hospital A: 50%
– Hospital B: 30%
● Covid recovery rate on younger population
– Hospital A: 99%
– Hospital B: 98%
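
A small numerical illustration of how such a reversal can arise. The patient counts below are hypothetical assumptions chosen for illustration (they do not reproduce the slide's 80%/90% figures exactly): Hospital A treats mostly older, harder-to-treat patients, so its overall rate looks worse even though it is better within each age group.

# (recoveries, patients) per hospital and age group; hypothetical counts
data = {
    "A": {"older": (300, 600), "younger": (396, 400)},   # 50% and 99%
    "B": {"older": (30, 100),  "younger": (882, 900)},   # 30% and 98%
}
for hospital, groups in data.items():
    recovered = sum(r for r, n in groups.values())
    total = sum(n for r, n in groups.values())
    by_group = {g: round(r / n, 2) for g, (r, n) in groups.items()}
    print(hospital, by_group, "overall:", round(recovered / total, 2))
# A is better in both age groups, yet B has the higher overall recovery rate (0.70 vs 0.91)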


Simpson’s Paradox
● Covid-19 death: (per 100,000 of population)
– County A: 15
– County B: 10
● Which county is managing the pandemic better?
● Covid death rate on older population
– County A: 20
– County B: 40
● Covid death rate on younger population
– County A: 2
– County B: 5


Association Rule Mining Algorithms ?


• Apriori Algorithm.
• FP Growth Algorithm.

• FP-Tree and FP-Growth Algorithm


• FP-Tree: Frequent Pattern Tree
• FP-Growth: Mining frequent patterns with FP-Tree

Terminology

• Item set
• A set of items: I = {a1, a2, ……, am}
• Transaction database
• DB = <T1, T2, ……, Tn>
• Pattern
• A set of items: A
• Support
• The number of transactions containing A in DB
• Frequent pattern
• A’s support ≥ minimum support threshold ξ
• Frequent Pattern Mining Problem
• The problem of finding the complete set of frequent
patterns


FP-Tree Definition

An efficient and scalable method to find frequent patterns. It allows frequent itemset discovery without candidate itemset generation.

FP-Tree Definition (cont.)
• Each node in the item prefix subtree consists of three fields:
• item-name
• node-link
• count
• Each entry in the frequent-item header table consists of two
fields:
• item-name
• head of node-link


FP-Tree and FP-Growth


Algorithm
• FP-Tree: Frequent Pattern Tree
• Compact representation of the DB without information loss.
• Easy to traverse, can quickly find out patterns associated with a
certain item.
• Well-ordered by item frequency.

• FP-Growth Algorithm
• Start mining from length-1 patterns
• Recursively do the following
• Constructs its conditional FP-tree
• Concatenate patterns from conditional FP-tree with suffix
• Divide-and-Conquer mining technique
Algorithm Steps
• Scan DB once, find frequent 1-itemset (single item pattern)
• Sort frequent items in frequency descending order, f-list
• Scan DB again, construct FP-tree
• Construct the conditional FP-trees in the reverse order of the f-list, and generate the frequent itemsets
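
A sketch of steps 1-3 in Python; the class and function names are illustrative assumptions, not the textbook's code.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item frequencies and build the f-list (descending support)
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    f_list = [item for item, c in sorted(counts.items(), key=lambda kv: -kv[1])
              if c >= min_support_count]
    rank = {item: r for r, item in enumerate(f_list)}

    # Pass 2: insert each transaction, with its frequent items ordered by the f-list
    root = FPNode(None, None)
    header = {}                                   # item -> list of node-links
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=lambda i: rank[i])
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            else:
                node.children[item].count += 1
            node = node.children[item]
    return root, header, f_list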


Example 1
Consider the below transaction in which B = Bread, J = Jelly, P =
Peanut Butter, M = Milk and E = Eggs.

Step-1: Scan the DB once and find the frequent 1-itemsets (single items). [The transaction table and item counts are shown as figures on the slide.]

Example 1 (cont.)
Step-2: As minimum threshold support = 40%, So in this step we will
remove all the items that are bought less than 40% of support or support
less than 2.

Step-3: Create an F-list in which the frequent items are sorted in descending order of support.


Example 1 (cont.)
Step-4: Sort frequent items in transactions based on F-list.

Example 1 (cont.)
[The step-by-step FP-tree construction for this example is shown as figures on the slides]
FP-Tree Properties
• Completeness
• Each transaction that contains frequent pattern is mapped to a
path.
• Prefix sharing does not cause path ambiguity, as only path starts
from root represents a transaction.

• Compactness
• Number of nodes bounded by overall occurrence of frequent
items.
• Height of tree bounded by maximal number of frequent items in
any transaction.


FP-Tree Properties (cont.)

• Traversal Friendly (for mining task)


• For any frequent item ai, all the possible frequent patterns
that contain ai can be obtained by following ai’s node-links.
• This property is important for divide-and-conquer. It
assures the soundness and completeness of problem
reduction.

FP-Growth Algorithm
• Functionality:
• Mining frequent patterns using FP-Tree generated before
• Input:
• FP-tree constructed earlier

• minimum support threshold ξ

• Output:
• The complete set of frequent patterns

• Main algorithm:
• Call FP-growth(FP-tree, null)


FP-growth(Tree, α)
Procedure FP-growth(Tree, α)
{
  if (Tree contains only a single path P)
  {
    for each combination β of the nodes in P
    {
      generate pattern β ∪ α;
      support = min(support of all nodes in β);
    }
  }
  else // Tree contains more than one path
  {
    for each ai in the header of Tree
    {
      generate pattern β = ai ∪ α;
      β.support = ai.support;
      construct β's conditional pattern base;
      construct β's conditional FP-tree Treeβ;
      if (Treeβ ≠ Φ)
        FP-growth(Treeβ, β);
    }
  }
}

Discussions

• When database is extremely large.


• Use FP-Tree on projected databases.
• Or, make FP-Tree disk-resident.
• Materialization of an FP-Tree
• Construct it independently of queries, with a minimum support threshold low enough to fit the majority of mining queries.

• Incremental updates of an FP-Tree.


• Record frequency count for every item.
• Control by watermark.

Reflections about today's Session

Google Form – Quiz


https://docs.google.com/forms/d/e/1FAIpQLSfdiEz7A6Z3iNU2
6F6XAxLO0AU6P06AuOl7mGeGKzMOeUhiKw/viewform


Conclusion
We have studied the below concepts in today's class
1. Topics of Module-1
2. Learning Objectives
3. Basic Definitions of database approaches
4. Database system environment
5. Main Characteristics of the Database Approach
6. Advantages of using the DBMS Approach
7. Historical Development of Database Technology
8. Database Languages and Architectures
9. Schemas versus Instances
10.Reflections

Contact Details:

Dr.Manjunath T N
Professor and Dean – CG
Department of Information Science and Engg
BMS Institute of Technology and Management
Mobile: +91-9900130748
E-Mail: [email protected] / [email protected]

