
Introduction to Data Mining

TBD 502
Syllabus
UNIT 1: Introduction to Data Mining: Fundamentals and functionalities of data
mining. Classification of data mining systems and major issues.
Data Warehouse Fundamentals: architecture, components, and models. OLAP
operations, OLTP vs OLAP, data cubes, schemas (star, snowflake).
Overview of Data Mining Tasks: Data Types, Data Mining Functionalities, Pattern
Interestingness, Integration with Data Warehouse, Task Primitives and Issues.
UNIT 2: Data Preprocessing: Need for preprocessing, Descriptive data
summarization, Data Cleaning: Missing Values, Noisy Data (Binning, Clustering,
Regression, Computer and Human inspection), Inconsistent Data, Data Integration
and Transformation.

UNIT 3: Association Rule Mining: Market Basket Analysis, Association rules:
Association rules from transaction database & relational database, Apriori
algorithm and correlation analysis.
UNIT 4: Classification and Clustering: Classification: decision trees, Bayesian
methods, K-NN, prediction techniques and related issues. Clustering: data types,
partitioning, and hierarchical methods.

UNIT 5: Advanced Topics in Data Mining: Mining complex data: spatial,
multimedia, time-series, sequence data. Text mining and web mining.

• Data mining (knowledge discovery from data)


Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns
or knowledge from huge amounts of data

• Key Characteristics of Data Mining Patterns:


• Non-trivial – Not obvious or easily seen.
• Implicit – Hidden within the data, not explicitly stated.
• Previously unknown – Not known before the analysis.
• Potentially useful – Can aid in decision making,
predictions, or understanding phenomena.
• Interesting – Patterns should be novel, valid, and
actionable.
Alternative names

Knowledge discovery (mining) in


databases (KDD), knowledge
extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
KDD Process (Knowledge Discovery in Databases)

Many people treat data mining as a synonym
for another popularly used term, Knowledge
Discovery from Data (KDD).
Steps in KDD
• Data cleaning (to remove noise and inconsistent
data).
• Data integration (where multiple data sources
may be combined)
• Data selection (where data relevant to the
analysis task are retrieved from the database)
• Data transformation (where data are
transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations, for instance).
• Data mining (an essential process where
intelligent methods are applied in order to
extract data patterns).

• Pattern evaluation (to identify the truly
interesting patterns representing knowledge
based on some interestingness measures).

• Knowledge presentation (where visualization


and knowledge representation techniques are
used to present the mined knowledge to the
user).
Architecture of a Data Mining System
The architecture of a typical data mining system may have the
following major components
Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets,
or other kinds of information repositories. Data cleaning and data
integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse


server is responsible for fetching the relevant data, based on the user’s
data mining request.

Knowledge base: This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.

Knowledge such as user beliefs, which can be used to assess a pattern’s


interestingness based on its unexpectedness, may also be included.
Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier
analysis.

Pattern evaluation module: This component typically


employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to
filter out discovered patterns.

User interface: This module communicates between users
and the data mining system, allowing the user to interact with
the system by specifying a data mining query or task.
Data Mining—On What Kind of Data?

In principle, data mining is not specific to one type of


media or data. Data mining should be applicable to any
kind of information repository.

Flat files: Flat files are actually the most common data
source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or
binary format with a structure known by the data
mining algorithm to be applied.

The data in these files can be transactions, time-series


data, scientific measurements, etc.
A text database is a database that contains text documents
or other word descriptions in the form of long sentences or
paragraphs, such as product specifications, error or bug
reports, warning messages, summary reports, notes, or
other documents.
Relational Databases: Briefly, a relational
database consists of a set of tables containing
either values of entity attributes, or values of
attributes from entity relationships.

Tables have columns and rows, where columns


represent attributes and rows represent tuples.

A tuple in a relational table corresponds to


either an object or a relationship between
objects and is identified by a set of attribute
values representing a unique key.
 Data Warehouses: A data warehouse is a repository of data
collected from multiple (often heterogeneous) data sources and
is intended to be used as a whole under the same unified
schema. A data warehouse gives the option to analyze data
from different sources under the same roof.

 Transaction Databases: A transaction database is a set of


records representing transactions, each with a time stamp, an
identifier and a set of items.

 Multimedia Databases: Multimedia databases include video,


images, audio and text media. Multimedia is characterized by
its high dimensionality, which makes data mining even more
challenging.
Spatial Databases: Spatial databases are databases that, in
addition to usual data, store geographical information like
maps, and global or regional positioning.
Data Mining Functionality - What Kinds of
Patterns Can Be Mined?

Data mining functionalities are used to specify


the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be
classified into two categories:

Descriptive and Predictive.


Descriptive mining tasks characterize the general
properties of the data in the database.

Predictive mining tasks perform inference on the


current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can
discover :

1. Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts.

Characterization (general description of data): Data characterization is a
summarization of the general characteristics or features of a target class of data
(the class under study, often called the target class). The data corresponding to
the user-specified class are typically collected by a database query.

Data characterization provides a general, high-level summary of the main
features of a target class of data. It gives a concise description of the data set,
often through descriptive statistics, visualizations, or aggregated rules.

For example, to study the characteristics of software products whose sales
increased by 10% in the last year, or to characterize high-performing students in
a university dataset.

Discrimination (comparison between classes): Data discrimination is a
comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and
contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries. It helps discover what makes one
class different from another.

For example, the user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period, or compare high-performing
and low-performing students.
Association is the discovery of association rules showing
attribute-value conditions that occur frequently together
in a given set of data. A frequent itemset typically refers to
a set of items that frequently appear together in a
transactional data set, such as milk and bread. A
frequently occurring subsequence, such as the pattern
that customers tend to purchase first a PC, followed by a
digital camera, and then a memory card, is a (frequent)
sequential pattern.

where X is a variable representing a customer. A


confidence, or certainty, of 50% means that if a customer
buys a computer, there is a 50% chance that she will buy
software as well. A 1% support means that 1% of all of
the transactions under analysis showed that computer
and software were purchased together.
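As an illustration (not from the original notes), a minimal Python sketch of how support and confidence could be computed over a toy transaction list; the transactions and function names are made up for this example.

# Minimal sketch of computing support and confidence for "computer => software".
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
]

def support(itemset, transactions):
    # Fraction of all transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(antecedent U consequent) / support(antecedent)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"computer", "software"}, transactions))      # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # ~0.67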
Classification and Prediction

Classification is the process of finding a model (or


function) that describes and distinguishes data classes
or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label
is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects
whose class label is known).

Whereas classification predicts categorical (discrete,


unordered) labels, prediction models continuous-
valued functions

Clustering analyzes data objects without consulting a


known class label. The objects are clustered or
grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed
as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group
similar events together.
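As a small illustration (not part of the notes), a minimal sketch of clustering with k-means, assuming scikit-learn and NumPy are available; the points and parameters are invented.

# Minimal clustering sketch using k-means.
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups.
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [8, 8], [8.3, 7.9], [7.7, 8.2]])

# Ask for two clusters; k-means increases intra-cluster similarity
# by minimizing within-cluster squared distances.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two cluster centroids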
Major Issues in Data Mining
Major issues in data mining regarding mining methodology,
user interaction, performance, and diverse data types.

Mining methodology and user interaction issues:


These reflect the kinds of knowledge mined, the ability to
mine knowledge at multiple granularities, the use of domain
knowledge, ad hoc mining, and knowledge visualization.

•Mining different kinds of knowledge in databases:


Because different users can be interested in different kinds of
knowledge, data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks,
including data characterization, discrimination, association and
correlation analysis, classification, prediction, clustering,
outlier analysis, and evolution analysis (which includes trend
and similarity analysis). These tasks may use the same
database in different ways and require the development of
numerous data mining techniques.

•Interactive mining of knowledge at multiple levels of abstraction:
The data mining process should be interactive, allowing the user to
drill down, roll up, and pivot through the data space and knowledge space
interactively, similar to what OLAP can do on data cubes. In this way, the
user can interact with the data mining system to view data and discovered
patterns at multiple granularities and from different angles.

•Incorporation of background knowledge: Background


knowledge, or information regarding the domain under
study, may be used to guide the discovery process and allow
discovered patterns to be expressed in concise terms and at
different levels of abstraction. Domain knowledge related to
databases, such as integrity constraints and deduction rules, can
help focus and speed up a data mining process, or judge the
interestingness of discovered patterns.
•Data mining query languages and ad hoc data mining:
Relational query languages (such as SQL) allow users to pose ad hoc
queries for data retrieval. In a similar way, high-level data mining query
languages need to be developed to allow users to describe ad hoc data
mining tasks.
•Handling noisy or incomplete data: The data stored in a
database may reflect noise,
exceptional cases, or incomplete data objects. When mining
data regularities, these objects may confuse the process,
causing the knowledge model constructed to overfit the data.
As a result, the accuracy of the discovered patterns can be
poor. Data cleaning methods and data analysis methods that
can handle noise are required, as well as Outlier mining
methods for the discovery and analysis of exceptional cases.
•Pattern evaluation—the interestingness problem: A data
mining system can uncover thousands of patterns. Many of the
patterns discovered may be uninteresting to the given user,
either because they represent common knowledge or lack
novelty. Several challenges remain regarding the development
of techniques to assess the interestingness of discovered
patterns, particularly with regard to subjective measures that
estimate the value of patterns with respect to a given user.
Performance issues: These include efficiency, scalability,
and parallelization of data
mining algorithms.

•Efficiency and scalability of data mining algorithms:


To effectively extract information
from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. In other words,
the running time of a data mining algorithm must be
predictable and acceptable in large databases. From a
database perspective on knowledge discovery, efficiency and
scalability are key issues in the implementation
of data mining systems.
•Parallel, distributed, and incremental mining
algorithms: The huge size of many databases, the wide
distribution of data, and the computational complexity of some
data mining methods motivate the development of parallel and
distributed data mining algorithms.
What Is a Data Warehouse?

A single, complete and consistent store of data
obtained from a variety of different sources,
made available to end users in a way that they
can understand and use in a business context.
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.”—W. H. Inmon

• Data warehousing:
– The process of constructing and using data warehouses

Data Warehouse—
Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing

• Provide a simple and concise view around particular subject


issues by excluding data that are not useful in the decision
support process

Data Warehouse—
Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.

– Ensure consistency in naming conventions, encoding


structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is converted.



Data Warehouse—Time
Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”

Data Warehouse—
Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
Features of Data Warehousing

• Subject Oriented: The data warehouse is
subject oriented because it provides
information around a subject rather than the
organization's ongoing operations.
• These subjects can be product, customers,
suppliers, sales, revenue etc.
• The data warehouse does not focus on the
ongoing operations Rather it focuses on
modelling and analysis of data for decision
making.
Claims under automobile insurance policies are
processed in the auto insurance application; similarly,
claims for workers’ compensation are organized in the
workers’ compensation application.
• Integrated: The data in a DWH is integrated in the sense that
data from disparate systems as well as data from different
applications is accumulated and stored in a single database
called the DWH (this process is known as ETL) for analysis
and decision-making purposes.
Time variant:
• In OLTP systems, data is stored as current values,
unlike a data warehouse, where data is stored over a
time horizon that may range from 1 to 10 years.
• This aspect of data warehouse is quite significant for
both the design and implementation phase.

• It helps in analyzing the past, correlating it with the
present scenario, and predicting the future.
Non-Volatile:
• We add, change, or delete data in an operational
system as each transaction happens.
• We do not delete data from the data warehouse
in real time.
• Once data is captured in the data warehouse,
we do not run individual transactions to change it.
• Data in the warehouse can only be viewed (read-only).
• Since the DWH is used for analyzing data, it is
periodically refreshed by picking up data
from the various OLTP systems.
Comparison between OLTP and OLAP systems

Data warehouse (OLAP)                         Operational system (OLTP)
Subject oriented                              Transaction oriented
Large (hundreds of GB up to several TB)       Small (MB up to several GB)
Historic data                                 Current data
De-normalized table structure                 Normalized table structure
(few tables, many columns per table)          (many tables, few columns per table)
Batch updates                                 Continuous updates
Usually very complex queries                  Simple to complex queries
Data Warehouse and Data Mart
• A data mart contains a subset of corporate wide data that is of
value to a specific group of users.
• The scope is confined to specific selected subjects.

• The data contained in data marts tend to be summarized.

• Data marts are usually implemented on low-cost departmental


servers that are Unix/Linux- or Windows-based. The
implementation cycle of a data mart is more likely to be measured
in weeks rather than months or years.

• However, it may involve complex integration in the long run if its


design and planning were not enterprise-wide
Depending on the source of data, data marts can be
categorized as independent or dependent.

Independent Data mart:-Independent data marts are


sourced from data captured from one or more
operational systems or external information providers,
or from data generated locally within a particular
department or geographic area.

Dependent data mart:- Dependent data marts are


sourced directly from enterprise data warehouses.
Data Warehouse and Data Mart

S.No  Data warehouse                           Data mart
1.    Corporate/enterprise-wide                Departmental
2.    It is a union of all data marts          It is a single business process
3.    Structure for a corporate view of data   Structure to suit the departmental view of data
4.    Queries on presentation resource         Technology optimal for data access and analysis
5.    Data received from staging area,         Star-join (facts & dimensions)
      organized on the E-R model
Multidimensional Data Model
Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube
Stars, Snowflakes, and Fact Constellations: Schemas for
Multidimensional Databases

The entity-relationship data model is commonly used in the


design of relational databases, where a database schema
consists of a set of entities and the relationships between
them.

A data warehouse, however, requires a concise, subject-


oriented schema that facilitates on-line data analysis.

The most popular data model for a data warehouse is a


multidimensional model. Such a model can exist in the form of
a star schema, a snowflake schema, or a fact constellation
schema.
Star schema:
The most common modeling paradigm is the star schema, in which the data
warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.

The schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.

Snowflake schema:
The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into
additional tables.

The resulting schema graph forms a shape similar to a snowflake.


The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to
reduce redundancies.

Such a table is easy to maintain and saves storage space.

Fact constellation: Sophisticated applications may require multiple fact tables


to share dimension tables. This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact constellation.

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures


– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation

Example of Star Schema

[Figure: star schema for sales. A central Sales fact table (time_key, item_key,
branch_key, location_key, units_sold, dollars_sold, avg_sales — the measures)
is connected to four dimension tables: time (time_key, day, day_of_the_week,
month, quarter, year), item (item_key, item_name, brand, type, supplier_type),
branch (branch_key, branch_name, branch_type), and location (location_key,
street, city, state_or_province, country).]
Measures are the metrics or facts that businesses want to analyze, such as:

Sales amount
Quantity sold
Revenue
Profit
Number of transactions

These are usually aggregatable values (e.g., SUM, AVG, COUNT) and are the
central focus of analysis in reports.

Suppose we are analyzing sales data:

Fact Table: Sales_Fact
• Sales_Amount (measure)
• Quantity_Sold (measure)
• Product_ID (foreign key to Product dimension)
• Customer_ID (foreign key to Customer dimension)
• Date_ID (foreign key to Time dimension)

Dimension Tables:
• Product_Dimension (product details)
• Customer_Dimension (customer details)
• Time_Dimension (dates, months, years)
Fact Table: Sales

Date_ID    Product_ID  Store_ID  Sales_Amount  Units_Sold
20250730   101         501       6000          50

Dimension Tables:
Date (Date_ID, Day, Month, Year)
Product (Product_ID, Name, Category)
Store (Store_ID, Location, Manager)
-- Date Dimension
CREATE TABLE date_dim (
  date_id INT PRIMARY KEY,
  full_date DATE,
  day INT,
  month INT,
  quarter INT,
  year INT
);

-- Product Dimension
CREATE TABLE product_dim (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  category VARCHAR(50),
  brand VARCHAR(50)
);

-- Customer Dimension
CREATE TABLE customer_dim (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(100),
  gender VARCHAR(10),
  age INT,
  city VARCHAR(50),
  state VARCHAR(50)
);

-- Store Dimension
CREATE TABLE store_dim (
  store_id INT PRIMARY KEY,
  store_name VARCHAR(100),
  region VARCHAR(50)
);
CREATE TABLE sales_fact (
sale_id INT PRIMARY KEY,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
units_sold INT,
total_revenue DECIMAL(10, 2),

-- Foreign Keys
FOREIGN KEY (date_id) REFERENCES date_dim(date_id),
FOREIGN KEY (product_id) REFERENCES product_dim(product_id),
FOREIGN KEY (customer_id) REFERENCES customer_dim(customer_id),
FOREIGN KEY (store_id) REFERENCES store_dim(store_id)
);

Syntax for a foreign key constraint:

FOREIGN KEY (column_name) REFERENCES referenced_table(referenced_column)

Example of Snowflake Schema

[Figure: snowflake schema for sales. The Sales fact table (time_key, item_key,
branch_key, location_key, units_sold, dollars_sold, avg_sales) connects to the
time, item, branch, and location dimensions; here the item dimension is
normalized into item (item_key, item_name, brand, type, supplier_key) and
supplier (supplier_key, supplier_type), and the location dimension is normalized
into location (location_key, street, city_key) and city (city_key, city,
state_or_province, country).]

Example of Fact Constellation

[Figure: fact constellation (galaxy) schema. Two fact tables share dimension
tables: the Sales fact table (time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales) and the Shipping fact table (time_key,
item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped).
Shared dimensions include time, item, and location; the Shipping fact table also
uses a shipper dimension (shipper_key, shipper_name, location_key,
shipper_type).]

A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the location dimension.
all → region (Europe, ..., North_America) → country (Germany, ..., Spain;
Canada, ..., Mexico) → city (Frankfurt, ...; Vancouver, ..., Toronto) →
office (L. Chan, ..., M. Wind).]


A data cube measure
is a numerical function that can be evaluated at each point in the data
cube space. A measure value is computed for a given point by
aggregating the data corresponding to the respective dimension-value
pairs defining the given point

Measures can be organized into three categories (i.e., distributive,


algebraic, holistic), based on the kind of aggregate functions used.

Distributive: An aggregate function is distributive if it can be computed in


a distributed manner as follows. Suppose the data are partitioned into
n sets. We apply the function to each partition, resulting in n aggregate
values. If the result derived by applying the function to the n aggregate
values is the same as that derived by applying the function to the entire
data set (without partitioning), the function can be computed in a
distributed manner. For example, sum() can be computed for a data cube
by first partitioning the cube into a set of subcubes, computing sum() for
each subcube, and then summing up the partial sums obtained for each
subcube. Hence, sum() is a distributive aggregate function.
Algebraic
An aggregate function is algebraic if it can be computed by an
algebraic function with M arguments (where M is a bounded
positive integer), each of which is obtained by applying a
distributive aggregate function.

For example, avg() (average) can be computed by


sum()/count(), where both sum() and count() are distributive
aggregate functions.

Similarly, it can be shown that min_N() and max_N() (which find
the N minimum and N maximum values, respectively, in a given
set) and standard_deviation() are algebraic aggregate functions.
A measure is algebraic if it is obtained by applying an algebraic
aggregate function

Data Cube Measures: Three Categories

• Distributive: if the result derived by applying the function to n


aggregate values is the same as that derived by applying the
function on all the data without partitioning
• E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function
• E.g., avg(), min_N(), standard_deviation()
• Holistic: if there is no constant bound on the storage size needed
to describe a subaggregate.
• E.g., median(), mode(), rank()
In a data warehouse, a measure is a property on which
calculations (e.g., sum, count, average, minimum,
maximum) can be made
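A minimal Python sketch (not from the notes) illustrating why sum() is distributive and avg() is algebraic; the data and the two-way partition are made up.

# sum() is distributive; avg() is algebraic (built from sum() and count()).
data = [4, 8, 15, 16, 23, 42]

# Partition the data into two subsets (think: two subcubes).
parts = [data[:3], data[3:]]

# Distributive: summing the per-partition sums equals the overall sum.
partial_sums = [sum(p) for p in parts]
assert sum(partial_sums) == sum(data)

# Algebraic: avg() is obtained from two distributive aggregates, sum() and count().
partial_counts = [len(p) for p in parts]
avg = sum(partial_sums) / sum(partial_counts)
assert avg == sum(data) / len(data)
print(avg)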

Multidimensional Data

• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  Product:  Industry → Category → Product
  Location: Region → Country → City → Office
  Time:     Year → Quarter → Month / Week → Day

A Sample Data Cube

[Figure: a 3-D data cube for sales, with dimensions Date (1Qtr–4Qtr), Product
(TV, PC, VCR) and Country (U.S.A., Canada, Mexico), plus sum cells along each
dimension. One highlighted cell shows the total annual sales of TVs in the U.S.A.]
OLAP OPERATIONS

Typical OLAP Operations
• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from higher level summary to lower level summary or detailed data,
  or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  – reorient the cube, visualization, 3D to series of 2D planes
Roll-up: The roll-up operation
(also called the drill-up
operation by some vendors)
performs aggregation on a
data cube, by climbing up a
concept hierarchy for a
dimension
Drill-down: Drill-down is
the reverse of roll-up. It
navigates from less
detailed data to more
detailed data.

Drill-down can be
realized by stepping
down a concept
hierarchy for a
dimension
Dice:

The dice operation defines a


subcube by performing a selection
on two or more dimensions.
Slice :

The slice operation performs


a selection on one
dimension of the given cube,
resulting in a subcube.
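As an illustration (not in the original slides), a small pandas sketch of roll-up, slice, and dice on a toy sales table; the column names and values are invented.

# Minimal pandas sketch of roll-up, slice, and dice on a toy sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2025, 2025, 2025],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "country": ["USA", "USA", "Canada", "USA", "Canada", "Mexico"],
    "product": ["TV", "PC", "TV", "PC", "TV", "PC"],
    "dollars_sold": [100, 150, 80, 120, 90, 60],
})

# Roll-up: climb the time hierarchy from quarter to year.
rollup = sales.groupby(["year", "country"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (country = "USA").
slice_usa = sales[sales["country"] == "USA"]

# Dice: select on two or more dimensions (year 2024 and product in {TV, PC}).
dice = sales[(sales["year"] == 2024) & (sales["product"].isin(["TV", "PC"]))]

print(rollup, slice_usa, dice, sep="\n\n")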
Types of OLAP Servers: ROLAP versus MOLAP
versus HOLAP
Relational OLAP (ROLAP) servers: These are the
intermediate servers that stand in between a relational back-
end server and client front-end tools. They use a relational or
extended- relational DBMS to store and manage warehouse
data, and OLAP middleware to support missing pieces.

Multidimensional OLAP(MOLAP)servers: These servers


support multidimensional views of data through array-based
multi dimensional storage engines. They map multi
dimensional views directly to data cube array structures

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach


combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of
MOLAP.

Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data sources (operational DBs and other
sources) are extracted, transformed, loaded, and refreshed into the data
warehouse and data marts (the data storage tier, with metadata and a
monitor/integrator component). An OLAP server sits on top of the storage tier
and serves front-end tools for analysis, query/reports, and data mining.
Tiers: Data Sources → Data Storage → OLAP Engine → Front-End Tools.]


A Three-Tier Data Warehouse Architecture

Data warehouses often adopt a three-tier architecture

1. The bottom tier is a warehouse database server that is


almost always a relational database system. Back-end tools
and utilities are used to feed data into the bottom tier from
operational databases or other external sources.

These tools and utilities perform data extraction, cleaning, and


transformation as well as load and refresh functions to update
the data warehouse.
2. The middle tier is an OLAP server that is
typically implemented using either

(1)a relational OLAP (ROLAP) model, that is, an


extended relational DBMS that maps
operations on multidimensional data to
standard relational operations; or

(2) a multidimensional OLAP (MOLAP) model,


that is, a special-purpose server that directly
implements multidimensional data and
operations.

The top tier is a front-end client layer, which contains


query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).

From the architecture point of view, there are three


data warehouse models:

the enterprise warehouse,

the data mart,

and the virtual warehouse.



Three Data Warehouse Models
• Enterprise warehouse
  – collects all of the information about subjects spanning the
  entire organization
• Data Mart
  – a subset of corporate-wide data that is of value to a
  specific group of users. Its scope is confined to specific,
  selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse)
data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
Multidimensional versus Multirelational OLAP

In the MOLAP model, online analytical processing is best implemented by storing
the data in a multidimensional form that is easily viewed in a multidimensional way.
Here the data structure is fixed, so that the logic to process multidimensional analysis
can be based on well-defined methods of establishing data storage coordinates.
Usually, multidimensional databases (MDDBs) are vendors' proprietary systems.

On the other hand, the ROLAP model relies on the existing relational DBMS of the
data warehouse. OLAP features are provided against the relational database.

Multidimensional vs. Multirelational OLAP:

MOLAP (Multidimensional OLAP): Stores data in pre-aggregated "cubes"


optimized for fast query response, ideal for complex analysis with pre-defined
dimensions and hierarchies.

ROLAP (Relational OLAP): Utilizes standard relational databases and SQL


queries, offering greater flexibility for ad-hoc analysis but potentially slower
performance for complex queries.
ROLAP                                          MOLAP

ROLAP stands for Relational Online             MOLAP stands for Multidimensional Online
Analytical Processing.                         Analytical Processing.

ROLAP is used for large data volumes.          MOLAP is used for limited data volumes.

The access of ROLAP is slow.                   The access of MOLAP is fast.

In ROLAP, data is stored in relational         In MOLAP, data is stored in a
tables.                                        multidimensional array.

In ROLAP, data is fetched from the             In MOLAP, data is fetched from an MDDB
data warehouse.                                (multidimensional database).

In ROLAP, complicated SQL queries              In MOLAP, a sparse matrix is used.
are used.

In ROLAP, a static multidimensional            In MOLAP, a dynamic multidimensional
view of data is created.                       view of data is created.
OLAP Servers
OLAP (Online Analytical Processing) servers are specialized tools or platforms used to
perform complex analytical queries on multidimensional data efficiently. These servers
are part of data warehouse systems and are designed to support decision support
systems (DSS), business intelligence (BI), and analytical applications.

MOLAP (Multidimensional OLAP) : Stores data in a multidimensional cube format.


Pre-aggregated and indexed data for fast access.

Architecture: Data is loaded from the data warehouse, transformed into a cube, and
stored in a proprietary multidimensional database.

Merits :
Very fast query performance.
Efficient storage due to data compression.
Good for frequently queried data.

Demerits :
Not ideal for real-time data.
Cube size limitations.
Requires preprocessing (build time).

Examples: Microsoft Analysis Services (SSAS in MOLAP mode), IBM Cognos TM1
Oracle Essbase
ROLAP (Relational OLAP)
Overview: Stores data in relational databases.
Uses SQL queries to dynamically calculate aggregations.

Architecture: No need to create cubes.


Metadata layer maps multidimensional views to relational data.

Advantages:
Can handle large volumes of data.
Real-time data access.
Works with existing relational databases (like MySQL, SQL Server).

Disadvantages:
Slower performance compared to MOLAP.
Depends on the database query optimizer.
Complex queries may lead to high latency.

Examples:
MicroStrategy
SAP BusinessObjects
IBM Cognos (in ROLAP mode)
HOLAP (Hybrid OLAP)
Combines the features of MOLAP and ROLAP.
Stores summary data in MOLAP and detailed data in ROLAP.

➤ Architecture:
High-level aggregates are stored in cubes for fast access.
Drill-down to relational tables for detailed data.

➤ Advantages:
Balanced performance and scalability.
Can handle large datasets and frequent queries.

➤ Disadvantages:
Complex architecture.
Management overhead due to dual storage.

➤ Examples:
Microsoft SSAS (in HOLAP mode)
SAP BW

UNIT-2

Data Preprocessing: Need for preprocessing Descriptive data


summarization, Data Cleaning: Missing Values, Noisy Data, (Binning,
Clustering, Regression, Computer and Human inspection), Inconsistent
Data, Data Integration and Transformation.
Data Preprocessing

Data preprocessing is the process of preparing raw data for analysis by


cleaning and transforming it into a usable format. In data mining it refers
to preparing raw data for mining by performing tasks like cleaning,
transforming, and organizing it into a format suitable for mining algorithms.
• Goal is to improve the quality of the data.

• Helps in handling missing values, removing duplicates, and normalizing


data.

• Ensures the accuracy and consistency of the dataset.


Today’s real-world databases are highly susceptible
to noisy, missing, and inconsistent data due to their
typically huge size (often several gigabytes or more) and
their likely origin from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
There are many possible reasons for noisy data (having
incorrect attribute values).

•The data collection instruments used may be faulty.

• There may have been human or computer errors


occurring at data entry.

•Errors in data transmission can also occur.

•There may be technology limitations, such as limited


buffer size for coordinating synchronized data transfer
and consumption.

Incorrect data may also result from inconsistencies in


naming conventions or data codes used, or inconsistent
formats for input fields, such as date. Duplicate tuples
also require data cleaning

There are a number of data preprocessing techniques.

• Data cleaning can be applied to remove noise and correct


inconsistencies in the data.

•Data integration merges data from multiple sources into a


coherent data store, such as a data warehouse.

•Data transformations, such as normalization, may be


applied. For example, normalization may improve the
accuracy and efficiency of mining algorithms involving
distance measurements.

•Data reduction can reduce the data size by aggregating,


eliminating redundant features, or clustering, for instance.

In other words, the data you wish to analyze by data


mining techniques are incomplete (lacking attribute
values or certain attributes of interest, or containing only
aggregate data), noisy (containing errors, or outlier
values that deviate from the expected), and inconsistent
(e.g., containing discrepancies in the department codes
used to categorize items).
Data preprocessing

Data preprocessing is the process of cleaning, transforming, and


organizing raw data before feeding it into a data mining or
machine learning system.
Raw data is often incomplete, inconsistent, noisy, or in different
formats, making preprocessing essential.
Reasons why preprocessing is needed:

• Handling Missing Values – Fill in, remove, or estimate missing data.
• Noise Removal – Eliminate errors or outliers (e.g., sensor glitches).

• Data Integration – Merge data from multiple sources into a consistent


dataset.

• Data Transformation – Normalize, aggregate, or discretize for better


analysis.

• Improved Accuracy – Preprocessed data leads to better model


performance.

• Efficiency – Reduces complexity, storage, and computation time.


Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression

• Data transformation and data discretization
  – Normalization
  – Concept hierarchy generation
Data Cleaning

• Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or
data cleansing) routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.

• Missing data

1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.


• 3. Use a global constant to fill in the missing


value: Replace all missing attribute values by
the same constant, such as a label like
“Unknown”. If missing values are replaced by,
say, “Unknown,” then the mining program may
mistakenly think that they form an interesting
concept, since they all have a value in
common—that of “Unknown.” Hence,
although this method is simple, it is not
foolproof.
• 4. Use the attribute mean to fill in the missing
value: Replace the missing value with the mean
(average) of that attribute computed over all tuples.

5. Use the attribute mean for all samples belonging


to the same class as the given tuple

6. Use the most probable value to fill in the missing


value: This may be determined with regression, inference-
based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer
attributes in your data set, you
may construct a decision tree to predict the missing values
for income.
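A minimal pandas sketch (illustrative only, not from the notes) of methods 4 and 5 above: filling a missing income with the overall attribute mean and with the class-wise mean; the column names and values are made up.

# Fill missing values with the attribute mean, and with the class-wise mean.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [30000, np.nan, 52000, np.nan, 61000, 45000],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# Method 4: replace missing income by the overall mean income.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: replace missing income by the mean income of tuples in the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)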
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data (e.g., Occupation=“ ” (missing data))
– noisy: containing noise, errors, or outliers (e.g., Salary=“−10”, an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data) (Jan. 1 as everyone’s birthday?)

Noise: Noise is a random error or variance in a measured variable.

• Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.

• Clustering: Outliers may be detected by clustering,
where similar values are organized into groups,
or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers.
Methods to Handle Noisy Data
1. Binning : Group values into “bins” and smooth them by replacing with
representative values (mean, median, or boundary).

2. Clustering : Group similar data points together; noisy data points will
often fall outside main clusters and can be treated as outliers

Example: In a 2D plot of customer age vs. income, most points form
dense groups, but a few scattered far away may be noise.

Benefit: Automatically detects outliers without predefined rules.

3. Regression : Fit a function (linear, polynomial, etc.) to the data and
replace values that deviate strongly from the fit (see the sketch after this list).

Example:
A simple linear regression predicts y = 2x + 5. If the actual point (x=10,
y=200) is far from this line, it’s likely noise.
4. Computer and Human Inspection

Computer inspection: Use automated scripts to detect anomalies (e.g.,


sudden jumps, impossible values like negative age).

Human inspection: Experts manually review flagged records to confirm or


correct them.

Example: Automated system flags a temperature reading of −200°C in a


human body dataset

A human confirms this is impossible and corrects the entry
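A minimal sketch (not from the notes) of regression-based noise handling as described in point 3 above: fit a least-squares line with NumPy and flag points that deviate strongly from it; the data are invented.

# Regression-based noise detection: fit a line, flag points far from the fit.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = 2 * x + 5                      # "clean" values following y = 2x + 5
y[9] = 200                         # inject one noisy value at x = 10

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
residuals = y - (slope * x + intercept)

# Flag points whose residual is unusually large (here: > 2 standard deviations).
threshold = 2 * residuals.std()
noisy = np.abs(residuals) > threshold
print(list(zip(x[noisy], y[noisy])))         # -> [(10.0, 200.0)]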


BINNING
Binning: Binning methods smooth a sorted data
value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed
into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of
values, they perform local smoothing.

•In smoothing by bin means, each value in a bin is


replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this bin
is replaced by the value 9.
In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the
greater the effect of the smoothing.

Bins may be equal-width : where the interval range of


values in each bin is constant. Binning is also used as a
discretization technique

Example: Medical sensor readings (heart rate): 78, 80, 79, 82, 800, 81, 77.
The value 800 is impossible (a faulty reading).
Binning: placing the plausible readings in a bin covering the range 75–85 and
replacing all bin members with the bin mean (about 80) smooths small
fluctuations, while the 800 spike falls outside the bin and is flagged as an
outlier to be removed or corrected.
Binning Methods for Data Smoothing

 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
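A minimal Python sketch (illustrative, not from the notes) that reproduces the equi-depth binning example above, smoothing by bin means and by bin boundaries.

# Equi-depth binning with smoothing by bin means and by bin boundaries.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency (equi-depth) bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]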
Simple Discretization: Binning

• Equal-width (distance) partitioning


– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– if the observed values are all between 0-100,we could create 5 bins as follows
width = (100 – 0)/5 = 20
– bins: [0-20], (20-40], (40-60], (60-80], (80-100]. Here [ or ] means the endpoint
is included, and ( or ) means the endpoint is not included.
– Typically, the first and last bins are extended to allow for values outside the
range of observed values: (-infinity, 20], (20-40], (40-60], (60-80], (80, infinity).

• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Discretization by
Binning

Discretization is the process of converting continuous


numerical data into discrete categories (intervals or
“bins”).

Steps in Discretization by Binning

1.Sort the data in ascending order.


2. Divide the range into a fixed number of bins (equal-width or equal-
frequency).
3. Replace the values in each bin with:
Bin mean (smoothing by bin means)
Bin median (smoothing by bin medians)
Bin boundaries (nearest boundary value)
22, 25, 27, 28, 32, 35, 37, 40, 44, 48

Steps sort the data first then choose binning method for example use
equal-width binning with 3 bins

W = (B – A)/N
W = (48 – 22)/3 ≈ 8.7

Bins:
Bin 1: [22 – 30] : 22,25,27,28
Bin 2: [31 – 39] : 32, 35, 37
Bin 3:[40 – 48] : 40, 44, 48

22–30 → “Young Adult”
31–39 → “Adult”
40–48 → “Middle-aged”
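A small pandas sketch (not from the notes) of the same equal-width discretization using pandas.cut; the labels follow the example above.

# Equal-width discretization of the ages above into three labelled bins.
import pandas as pd

ages = [22, 25, 27, 28, 32, 35, 37, 40, 44, 48]

labels = ["Young Adult", "Adult", "Middle-aged"]
binned = pd.cut(ages, bins=3, labels=labels, include_lowest=True)

print(list(zip(ages, binned)))
# Bin edges are roughly 22-30.7, 30.7-39.3, 39.3-48, matching the example.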
Inconsistent Data
When the same data is recorded in different formats, units, spellings, or
conventions, leading to mismatches. This often happens when data comes from
multiple sources or is entered manually.

Examples :

Date format mismatch: 2025-08-11 vs 11/08/2025 vs Aug 11, 2025.

Units mismatch: Weight recorded as 70 kg in one dataset and 154 lbs in another.

Spelling/abbreviation differences: "NY" vs "New York", "Male" vs "M".

Apply data cleaning (standardization, conversion of units, correcting


spellings).
Data Integration and Transformation
The process of combining data from different sources into a unified view,
resolving differences in format, structure, and semantics.
Common in data warehouses, analytics, and ETL processes.

Data Transformation : The process of converting data into a suitable format or


structure for analysis or storage.

Often a step in ETL after integration and before loading into a warehouse.

Types of Transformations:
1. Smoothing – removing noise (e.g., binning, moving average).
2. Aggregation – summarizing (e.g., total sales per month).
3. Normalization – scaling values to a range (e.g., 0–1).
4. Attribute construction – creating new features from existing data (e.g., Full Name
from First + Last Name).
5. Encoding – converting categories into numbers (e.g., "Male" = 0, "Female" = 1).

Data Integration and Transformation

Data mining often requires data integration—the merging


of data from multiple data stores. The data may also need
to be transformed into forms appropriate for mining

Redundancy is an important issue. An attribute (such as


annual revenue, for instance) may be redundant if it can
be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also
cause redundancies in the resulting data set.
Some redundancies can be detected by correlation
analysis

Data Transformation
In data transformation, the data are transformed or
consolidated into forms appropriate
for mining. Data transformation can involve the following:

Smoothing, which works to remove noise from the data.


Such techniques include binning, and clustering.

Aggregation, where summary or aggregation operations


are applied to the data. For example, the daily sales data
may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple
granularities.

Generalization of the data, where low-level or “primitive”


(raw) data are replaced by higher-level concepts through
the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level
concepts, like city or country. Similarly, values for numerical
attributes, like age, may be mapped to higher-level
concepts, like youth, middle-aged, and senior.

Normalization, where the attribute data are scaled so as
to fall within a small specified
range, such as −1.0 to 1.0, or 0.0 to 1.0.

Normalization
Normalization is particularly useful for classification
algorithms involving neural networks, or distance
measurements such as nearest-neighbor classification
and clustering.

If using the neural network backpropagation algorithm
for classification mining, normalizing the input values for
each attribute measured in the training tuples will help
speed up the learning phase.

For distance-based methods, normalization helps


prevent attributes with initially large ranges (e.g.,
income) from outweighing attributes with initially
smaller ranges

Normalization
• Min-max normalization: maps a value v of attribute A to v' in the range
  [new_minA, new_maxA]:

    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
  – Then $73,600 is mapped to
    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Exercise: normalize the data 8, 10, 15, 20 using the min-max
normalization technique, with new min = 10 and new max = 20.

Here minA = 8 and maxA = 20, so each value v maps to
v' = ((v − 8) / (20 − 8)) × (20 − 10) + 10:
8 → 10, 10 → 11.67, 15 → 15.83, 20 → 20.
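A minimal Python sketch (illustrative, not from the notes) of min-max normalization, reusing the income example and the exercise above.

# Min-max normalization of an attribute to a new range.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12_000, 73_600, 98_000]
print(min_max(incomes))                                  # 73,600 -> ~0.716 on [0, 1]
print(min_max([8, 10, 15, 20], new_min=10, new_max=20))  # [10.0, 11.67, 15.83, 20.0]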
Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v − μA) / σA

Ex. Let μ = 54,000 and σ = 16,000 for attribute income. Then

    (73,600 − 54,000) / 16,000 = 1.225
The grades on a history midterm at
University have a mean of μ=85, and
a standard deviation of σ=2. John
scored 86 on the exam.

Find the z-score for John’s exam


grade.
The grades on a geometry midterm at
University have a mean of μ = 82 and a
standard deviation of σ = 4. John
scored 74 on the exam. Find the z-score.

The grades on a geometry midterm at
University have a mean of μ = 74 and a
standard deviation of σ = 4.0. Anil
scored 70 on the exam.
Find the z-score for Anil’s exam grade.
Round to two decimal places.
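A minimal Python sketch (not from the notes) of z-score normalization, checked against the examples above.

# Z-score normalization: (value - mean) / standard deviation.
def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))  # 1.225  (income example above)
print(z_score(86, 85, 2))               # 0.5    (John's history midterm)
print(round(z_score(70, 74, 4.0), 2))   # -1.0   (Anil's geometry midterm)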
Normalization by decimal scaling

Transform the data by moving the decimal point of values of feature F.
The number of decimal places moved depends on the maximum absolute
value of F. A value v of F is normalized to v' by computing

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Suppose that the recorded values of 𝐹 range from
− 986 to 917. The maximum absolute value of 𝐹 is 986. To
normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., 𝑗 = 3) so that −986 normalizes to −0.986 and 917
normalizes to 0.917.
Example: suppose the range of
attribute X is −500 to 45. The
maximum absolute value of X is
500. To normalize by decimal
scaling we will divide each value
by 1,000 (j = 3).

In this case, −500 becomes −0.5


while 45 will become 0.045
Salary bonus   Formula     Scaling
400            400/1000    0.4
310            310/1000    0.31
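A minimal Python sketch (illustrative, not from the notes) of normalization by decimal scaling, checked against the examples above.

# Decimal scaling: divide by 10^j, with j chosen from the maximum absolute value.
def decimal_scaling(values):
    # Smallest j such that every |v| / 10**j is below 1.
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
print(decimal_scaling([-500, 45]))    # [-0.5, 0.045]
print(decimal_scaling([400, 310]))    # [0.4, 0.31]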
Data Discretization and Concept Hierarchy
Generation

Data Discretization: – Dividing the range of a continuous


attribute into intervals

• Interval labels can then be used to replace actual data


values.

•Replacing numerous values of a continuous attribute by a


small number of interval labels thereby reduces and
simplifies the original data.

•Some classification algorithms only accept categorical data.

•This leads to a concise, easy-to-use, knowledge-level


representation of mining results

Data discretization example: we have an attribute
age with the following values.

Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

After discretization:
  10, 11, 13, 14, 17, 19  → Young
  30, 31, 32, 38, 40, 42  → Mature
  70, 72, 73, 75          → Old

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts with higher-level
concepts.

In the multidimensional model, data are organized into


multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data
from different perspectives.

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts (such as
numerical values for the attribute age) with higher-level
concepts (such as youth, middle-aged, or senior).
Encoding Methods
Encoding is the process of converting categorical (non-numeric) data into
a numerical format so that machine learning algorithms can process it.

Encoding Methods

1. Label Encoding : Assigns a unique integer to each category.


Color: Red, Blue, Green
Encoded: 0, 1, 2
2. One-Hot Encoding : Creates a binary column for each category (1 if
present, 0 otherwise).

Color: Red, Blue, Green


Red Blue Green
Red 1 0 0
Blue 0 1 0
Green 0 0 1
3. Ordinal Encoding : Similar to label encoding but uses logical order for
categories.

Size: Small=1, Medium=2, Large=3

4. Binary Encoding : Converts categories to binary digits, then stores them in


separate columns.

Category IDs: 1, 2, 3, 4
Binary form: 01, 10, 11, 100
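A small pandas sketch (not from the notes) of label, one-hot, and ordinal encoding; the column names and category order are made up.

# Label, one-hot, and ordinal encoding of categorical attributes.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label encoding: map each category to an integer code.
df["Color_label"] = df["Color"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["Color"], prefix="Color")

# Ordinal encoding: use an explicit, meaningful order.
sizes = pd.Series(["Small", "Large", "Medium"])
size_order = {"Small": 1, "Medium": 2, "Large": 3}

print(df.join(one_hot), sizes.map(size_order), sep="\n\n")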
Descriptive data summarization,

Descriptive data summarization is the process of condensing large datasets into


compact, easy-to-understand summaries using statistical measures and visual
representations.

It helps in understanding the distribution, central tendency, variability, and


overall characteristics of the data before applying deeper analysis or modeling.

Types of Summarization

1. Numerical Measures
Measures of Central Tendency – Show the “center” of data:
Mean – Arithmetic average
Median – Middle value when sorted
Mode – Most frequent value
2.Measures of Dispersion – Show how spread out the data is:
• Range – Max − Min
• Variance & Standard Deviation – Spread from the mean
• Quartiles & IQR (Interquartile Range) – Middle 50% spread
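A minimal pandas sketch (illustrative, not from the notes) computing the central tendency and dispersion measures listed above on a toy series.

# Descriptive data summarization: central tendency and dispersion measures.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print("mean:",   values.mean())                 # central tendency
print("median:", values.median())
print("mode:",   values.mode().tolist())
print("range:",  values.max() - values.min())   # dispersion
print("std:",    values.std())
q1, q3 = values.quantile(0.25), values.quantile(0.75)
print("IQR:",    q3 - q1)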
