
Introduction to Data Mining

TBD 502
Syllabus
UNIT 1: Introduction to Data Mining: Fundamentals and functionalities of data
mining. Classification of data mining systems and major issues.
Data Warehouse Fundamentals: architecture, components, and models. OLAP
operations, OLTP vs OLAP, data cubes, schemas (star, snowflake).
Overview of Data Mining Tasks: Data Types, Data Mining Functionalities, Pattern
Interestingness, Integration with Data Warehouse, Task Primitives and Issues.
UNIT 2: Data Preprocessing: Need for preprocessing, Descriptive data
summarization, Data Cleaning: Missing Values, Noisy Data (Binning, Clustering,
Regression, Computer and Human inspection), Inconsistent Data, Data Integration
and Transformation.

UNIT 3: Association Rule Mining: Market Basket Analysis, Association rules:
Association rules from transaction database & relational database, Apriori
algorithm and correlation analysis.
UNIT 4: Classification and Clustering: Classification: decision trees, Bayesian
methods, K-NN, prediction techniques and related issues. Clustering: data types,
partitioning, and hierarchical methods.

UNIT 5: Advanced Topics in Data Mining: Mining complex data: spatial,
multimedia, time-series, sequence data. Text mining and web mining.

• Data mining (knowledge discovery from data)


Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns
or knowledge from huge amounts of data

• Key Characteristics of Data Mining Patterns:


• Non-trivial – Not obvious or easily seen.
• Implicit – Hidden within the data, not explicitly stated.
• Previously unknown – Not known before the analysis.
• Potentially useful – Can aid in decision making,
predictions, or understanding phenomena.
• Interesting – Patterns should be novel, valid, and
actionable.
Alternative names

Knowledge discovery (mining) in


databases (KDD), knowledge
extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
KDD Process (Knowledge Discovery in Databases)

Many people treat data mining as a synonym
for another popularly used term, Knowledge
Discovery from Data (KDD).
Steps in KDD
• Data cleaning (to remove noise and inconsistent
data).
• Data integration (where multiple data sources
may be combined)
• Data selection (where data relevant to the
analysis task are retrieved from the database)
• Data transformation (where data are
transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations, for instance).
• Data mining (an essential process where
intelligent methods are applied in order to
extract data patterns).

• Pattern evaluation (to identify the truly
interesting patterns representing knowledge
based on some interestingness measures).

• Knowledge presentation (where visualization


and knowledge representation techniques are
used to present the mined knowledge to the
user).
Architecture of a Data Mining System
The architecture of a typical data mining system may have the
following major components
Database, data warehouse, World Wide Web, or other information
repository: This is one or a set of databases, data warehouses, spreadsheets,
or other kinds of information repositories. Data cleaning and data
integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse


server is responsible for fetching the relevant data, based on the user’s
data mining request.

Knowledge base: This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.

Knowledge such as user beliefs, which can be used to assess a pattern’s


interestingness based on its unexpectedness, may also be included.
Data mining engine: This is essential to the data mining
system and ideally consists of a set of functional modules for
tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier
analysis.

Pattern evaluation module: This component typically


employs interestingness measures and interacts with the
data mining modules so as to focus the search toward
interesting patterns. It may use interestingness thresholds to
filter out discovered patterns.

User interface: This module communicates between users
and the data mining system, allowing the user to interact with
the system by specifying a data mining query or task.
Data Mining—On What Kind of Data?

In principle, data mining is not specific to one type of


media or data. Data mining should be applicable to any
kind of information repository.

Flat files: Flat files are actually the most common data
source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or
binary format with a structure known by the data
mining algorithm to be applied.

The data in these files can be transactions, time-series


data, scientific measurements, etc.
A text database is a database that contains text documents
or other word descriptions in the form of long sentences or
paragraphs, such as product specifications, error or bug
reports, warning messages, summary reports, notes, or
other documents.
Relational Databases: Briefly, a relational
database consists of a set of tables containing
either values of entity attributes, or values of
attributes from entity relationships.

Tables have columns and rows, where columns


represent attributes and rows represent tuples.

A tuple in a relational table corresponds to


either an object or a relationship between
objects and is identified by a set of attribute
values representing a unique key.
 Data Warehouses: A data warehouse is a repository of data
collected from multiple (often heterogeneous) data sources and
is intended to be used as a whole under the same unified
schema. A data warehouse gives the option to analyze data
from different sources under the same roof.

 Transaction Databases: A transaction database is a set of


records representing transactions, each with a time stamp, an
identifier and a set of items.

 Multimedia Databases: Multimedia databases include video,


images, audio and text media. Multimedia is characterized by
its high dimensionality, which makes data mining even more
challenging.
Spatial Databases: Spatial databases are databases that, in
addition to usual data, store geographical information like
maps, and global or regional positioning.
Data Mining Functionality - What Kinds of
Patterns Can Be Mined?

Data mining functionalities are used to specify


the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be
classified into two categories:

Descriptive and Predictive.


Descriptive mining tasks characterize the general
properties of the data in the database.

Predictive mining tasks perform inference on the


current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can
discover :

1. Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts.

Characterization (general description of data): Data characterization is a
summarization of the general characteristics or features of a target class of data
(the class under study, often called the target class). The data corresponding to
the user-specified class are typically collected by a database query.

Data characterization provides a general, high-level summary of the main
features of a target class of data. It gives a concise description of the data set,
often through descriptive statistics, visualizations, or aggregated rules.

For example, to study the characteristics of software products whose sales
increased by 10% in the last year, or to characterize high-performing students in
a university dataset.

Discrimination (comparison between classes): Data discrimination is a
comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and
contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries. It helps discover what makes one
class different from another.

For example, the user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period, or compare high-performing
and low-performing students.
Association is the discovery of association rules showing
attribute-value conditions that occur frequently together
in a given set of data. A frequent itemset typically refers to
a set of items that frequently appear together in a
transactional data set, such as milk and bread. A
frequently occurring subsequence, such as the pattern
that customers tend to purchase first a PC, followed by a
digital camera, and then a memory card, is a (frequent)
sequential pattern.

where X is a variable representing a customer. A


confidence, or certainty, of 50% means that if a customer
buys a computer, there is a 50% chance that she will buy
software as well. A 1% support means that 1% of all of
the transactions under analysis showed that computer
and software were purchased together.
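As an illustration (not from the original notes), a minimal Python sketch of how support and confidence could be computed over a toy transaction list; the transactions and function names are made up for this example.

# Minimal sketch of computing support and confidence for "computer => software".
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "software", "printer"},
]

def support(itemset, transactions):
    # Fraction of all transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(antecedent U consequent) / support(antecedent)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"computer", "software"}, transactions))      # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # ~0.67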
Classification and Prediction

Classification is the process of finding a model (or


function) that describes and distinguishes data classes
or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label
is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects
whose class label is known).

Whereas classification predicts categorical (discrete,


unordered) labels, prediction models continuous-
valued functions

Clustering analyzes data objects without consulting a


known class label. The objects are clustered or
grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed
as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group
similar events together.
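As a small illustration (not part of the notes), a minimal sketch of clustering with k-means, assuming scikit-learn and NumPy are available; the points and parameters are invented.

# Minimal clustering sketch using k-means.
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups.
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [8, 8], [8.3, 7.9], [7.7, 8.2]])

# Ask for two clusters; k-means increases intra-cluster similarity
# by minimizing within-cluster squared distances.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two cluster centroids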
Major Issues in Data Mining
Major issues in data mining regarding mining methodology,
user interaction, performance, and diverse data types.

Mining methodology and user interaction issues:


These reflect the kinds of knowledge mined, the ability to
mine knowledge at multiple granularities, the use of domain
knowledge, ad hoc mining, and knowledge visualization.

•Mining different kinds of knowledge in databases:


Because different users can be interested in different kinds of
knowledge, data mining should cover a wide
spectrum of data analysis and knowledge discovery tasks,
including data characterization, discrimination, association and
correlation analysis, classification, prediction, clustering,
outlier analysis, and evolution analysis (which includes trend
and similarity analysis). These tasks may use the same
database in different ways and require the development of
numerous data mining techniques.

•Interactive mining of knowledge at multiple levels of abstraction:
The data mining process should be interactive, allowing the user to
drill down, roll up, and pivot through the data space and knowledge space
interactively, similar to what OLAP can do on data cubes. In this way, the
user can interact with the data mining system to view data and discovered
patterns at multiple granularities and from different angles.

•Incorporation of background knowledge: Background


knowledge, or information regarding the domain under
study, may be used to guide the discovery process and allow
discovered patterns to be expressed in concise terms and at
different levels of abstraction. Domain knowledge related to
databases, such as integrity constraints and deduction rules, can
help focus and speed up a data mining process, or judge the
interestingness of discovered patterns.
•Data mining query languages and ad hoc data mining:
Relational query languages (such as SQL) allow users to pose ad hoc
queries for data retrieval. In a similar way, high-level data mining query
languages need to be developed to allow users to describe ad hoc data
mining tasks.
•Handling noisy or incomplete data: The data stored in a
database may reflect noise,
exceptional cases, or incomplete data objects. When mining
data regularities, these objects may confuse the process,
causing the knowledge model constructed to overfit the data.
As a result, the accuracy of the discovered patterns can be
poor. Data cleaning methods and data analysis methods that
can handle noise are required, as well as Outlier mining
methods for the discovery and analysis of exceptional cases.
•Pattern evaluation—the interestingness problem: A data
mining system can uncover thousands of patterns. Many of the
patterns discovered may be uninteresting to the given user,
either because they represent common knowledge or lack
novelty. Several challenges remain regarding the development
of techniques to assess the interestingness of discovered
patterns, particularly with regard to subjective measures that
estimate the value of patterns with respect to a given user.
Performance issues: These include efficiency, scalability,
and parallelization of data
mining algorithms.

•Efficiency and scalability of data mining algorithms:


To effectively extract information
from a huge amount of data in databases, data mining
algorithms must be efficient and scalable. In other words,
the running time of a data mining algorithm must be
predictable and acceptable in large databases. From a
database perspective on knowledge discovery, efficiency and
scalability are key issues in the implementation
of data mining systems.
•Parallel, distributed, and incremental mining
algorithms: The huge size of many databases, the wide
distribution of data, and the computational complexity of some
data mining methods motivate the development of parallel and
distributed data mining algorithms.
What Is a Data Warehouse?

A single, complete and consistent store of data
obtained from a variety of different sources,
made available to end users in a way that they
can understand and use in a business context.
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.”—W. H. Inmon

• Data warehousing:
– The process of constructing and using data warehouses

Data Warehouse—
Subject-Oriented
• Organized around major subjects, such as customer, product,
sales
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing

• Provide a simple and concise view around particular subject


issues by excluding data that are not useful in the decision
support process

Data Warehouse—
Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.

– Ensure consistency in naming conventions, encoding


structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is converted.



Data Warehouse—Time
Variant
• The time horizon for the data warehouse is significantly longer
than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”

Data Warehouse—
Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
Features of Data Warehousing

• Subject Oriented: The data warehouse is
subject oriented because it provides
information around a subject rather than the
organization's ongoing operations.
• These subjects can be product, customers,
suppliers, sales, revenue etc.
• The data warehouse does not focus on the
ongoing operations Rather it focuses on
modelling and analysis of data for decision
making.
Claims under automobile insurance policies are
processed in the auto insurance application; similarly,
claims for workers’ compensation are organized in the
workers’ compensation application.
• Integrated: The data in a DWH is integrated in the sense that
data from disparate systems as well as data from different
applications is accumulated and stored in a single database
called the DWH (this process is known as ETL) for analysis
and decision-making purposes.
Time variant:
• In OLTP systems, data is stored as current values,
unlike a data warehouse, where data is stored over a
time horizon that may range from 1 to 10 years.
• This aspect of data warehouse is quite significant for
both the design and implementation phase.

• It helps in analyzing the past, correlating it with the
present scenario, and predicting the future.
Non-Volatile:
• We add, change, or delete data in an operational
system as each transaction happens.
• We do not delete data from the data warehouse
in real time.
• Once data is captured in the data warehouse,
we do not run individual transactions to change it.
• Data in the warehouse can only be viewed (read-only).
• Since the DWH is used for analyzing data, it is
periodically refreshed by picking up data
from the various OLTP systems.
Comparison between OLTP and OLAP systems

Data warehouse (OLAP)                         Operational system (OLTP)
Subject oriented                              Transaction oriented
Large (hundreds of GB up to several TB)       Small (MB up to several GB)
Historic data                                 Current data
De-normalized table structure                 Normalized table structure
(few tables, many columns per table)          (many tables, few columns per table)
Batch updates                                 Continuous updates
Usually very complex queries                  Simple to complex queries
Data Warehouse and Data Mart
• A data mart contains a subset of corporate wide data that is of
value to a specific group of users.
• The scope is confined to specific selected subjects.

• The data contained in data marts tend to be summarized.

• Data marts are usually implemented on low-cost departmental


servers that are Unix/Linux- or Windows-based. The
implementation cycle of a data mart is more likely to be measured
in weeks rather than months or years.

• However, it may involve complex integration in the long run if its


design and planning were not enterprise-wide
Depending on the source of data, data marts can be
categorized as independent or dependent.

Independent Data mart:-Independent data marts are


sourced from data captured from one or more
operational systems or external information providers,
or from data generated locally within a particular
department or geographic area.

Dependent data mart:- Dependent data marts are


sourced directly from enterprise data warehouses.
Data Warehouse and Data Mart

S.No  Data warehouse                           Data mart
1.    Corporate/enterprise-wide                Departmental
2.    It is a union of all data marts          It is a single business process
3.    Structure for a corporate view of data   Structure to suit the departmental view of data
4.    Queries on presentation resource         Technology optimal for data access and analysis
5.    Data received from staging area,         Star-join (facts & dimensions)
      organized on the E-R model
Multidimensional Data Model
Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube
Stars, Snowflakes, and Fact Constellations: Schemas for
Multidimensional Databases

The entity-relationship data model is commonly used in the


design of relational databases, where a database schema
consists of a set of entities and the relationships between
them.

A data warehouse, however, requires a concise, subject-


oriented schema that facilitates on-line data analysis.

The most popular data model for a data warehouse is a


multidimensional model. Such a model can exist in the form of
a star schema, a snowflake schema, or a fact constellation
schema.
Star schema:
The most common modeling paradigm is the star schema, in which the data
warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each
dimension.

The schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.

Snowflake schema:
The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into
additional tables.

The resulting schema graph forms a shape similar to a snowflake.


The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to
reduce redundancies.

Such a table is easy to maintain and saves storage space.

Fact constellation: Sophisticated applications may require multiple fact tables


to share dimension tables. This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact constellation.

Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures


– Star schema: A fact table in the middle connected to a set of
dimension tables
– Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation

Example of Star Schema

[Figure: star schema for sales. A central Sales fact table (time_key, item_key,
branch_key, location_key, units_sold, dollars_sold, avg_sales — the measures)
is connected to four dimension tables: time (time_key, day, day_of_the_week,
month, quarter, year), item (item_key, item_name, brand, type, supplier_type),
branch (branch_key, branch_name, branch_type), and location (location_key,
street, city, state_or_province, country).]
Measures are the metrics or facts that businesses want to analyze, such as:

Sales amount
Quantity sold
Revenue
Profit
Number of transactions

These are usually aggregatable values (e.g., SUM, AVG, COUNT) and are the
central focus of analysis in reports.

Suppose we are analyzing sales data:

Fact Table: Sales_Fact
• Sales_Amount (measure)
• Quantity_Sold (measure)
• Product_ID (foreign key to Product dimension)
• Customer_ID (foreign key to Customer dimension)
• Date_ID (foreign key to Time dimension)

Dimension Tables:
• Product_Dimension (product details)
• Customer_Dimension (customer details)
• Time_Dimension (dates, months, years)
Fact Table: Sales

Date_ID    Product_ID  Store_ID  Sales_Amount  Units_Sold
20250730   101         501       6000          50

Dimension Tables:
Date (Date_ID, Day, Month, Year)
Product (Product_ID, Name, Category)
Store (Store_ID, Location, Manager)
-- Date Dimension
CREATE TABLE date_dim (
  date_id INT PRIMARY KEY,
  full_date DATE,
  day INT,
  month INT,
  quarter INT,
  year INT
);

-- Product Dimension
CREATE TABLE product_dim (
  product_id INT PRIMARY KEY,
  product_name VARCHAR(100),
  category VARCHAR(50),
  brand VARCHAR(50)
);

-- Customer Dimension
CREATE TABLE customer_dim (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(100),
  gender VARCHAR(10),
  age INT,
  city VARCHAR(50),
  state VARCHAR(50)
);

-- Store Dimension
CREATE TABLE store_dim (
  store_id INT PRIMARY KEY,
  store_name VARCHAR(100),
  region VARCHAR(50)
);
CREATE TABLE sales_fact (
sale_id INT PRIMARY KEY,
date_id INT,
product_id INT,
customer_id INT,
store_id INT,
units_sold INT,
total_revenue DECIMAL(10, 2),

-- Foreign Keys
FOREIGN KEY (date_id) REFERENCES date_dim(date_id),
FOREIGN KEY (product_id) REFERENCES product_dim(product_id),
FOREIGN KEY (customer_id) REFERENCES customer_dim(customer_id),
FOREIGN KEY (store_id) REFERENCES store_dim(store_id)
);

Syntax for a foreign key constraint:

FOREIGN KEY (column_name) REFERENCES referenced_table(referenced_column)

Example of Snowflake Schema

[Figure: snowflake schema for sales. The Sales fact table (time_key, item_key,
branch_key, location_key, units_sold, dollars_sold, avg_sales) connects to the
time, item, branch, and location dimensions; here the item dimension is
normalized into item (item_key, item_name, brand, type, supplier_key) and
supplier (supplier_key, supplier_type), and the location dimension is normalized
into location (location_key, street, city_key) and city (city_key, city,
state_or_province, country).]

Example of Fact Constellation

[Figure: fact constellation (galaxy) schema. Two fact tables share dimension
tables: the Sales fact table (time_key, item_key, branch_key, location_key,
units_sold, dollars_sold, avg_sales) and the Shipping fact table (time_key,
item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped).
Shared dimensions include time, item, and location; the Shipping fact table also
uses a shipper dimension (shipper_key, shipper_name, location_key,
shipper_type).]

A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy for the location dimension.
all → region (Europe, ..., North_America) → country (Germany, ..., Spain;
Canada, ..., Mexico) → city (Frankfurt, ...; Vancouver, ..., Toronto) →
office (L. Chan, ..., M. Wind).]


A data cube measure
is a numerical function that can be evaluated at each point in the data
cube space. A measure value is computed for a given point by
aggregating the data corresponding to the respective dimension-value
pairs defining the given point

Measures can be organized into three categories (i.e., distributive,


algebraic, holistic), based on the kind of aggregate functions used.

Distributive: An aggregate function is distributive if it can be computed in


a distributed manner as follows. Suppose the data are partitioned into
n sets. We apply the function to each partition, resulting in n aggregate
values. If the result derived by applying the function to the n aggregate
values is the same as that derived by applying the function to the entire
data set (without partitioning), the function can be computed in a
distributed manner. For example, sum() can be computed for a data cube
by first partitioning the cube into a set of subcubes, computing sum() for
each subcube, and then summing up the partial sums obtained for each
subcube. Hence, sum() is a distributive aggregate function.
Algebraic
An aggregate function is algebraic if it can be computed by an
algebraic function with M arguments (where M is a bounded
positive integer), each of which is obtained by applying a
distributive aggregate function.

For example, avg() (average) can be computed by


sum()/count(), where both sum() and count() are distributive
aggregate functions.

Similarly, it can be shown that min_N() and max_N() (which find
the N minimum and N maximum values, respectively, in a given
set) and standard_deviation() are algebraic aggregate functions.
A measure is algebraic if it is obtained by applying an algebraic
aggregate function

Data Cube Measures: Three Categories

• Distributive: if the result derived by applying the function to n


aggregate values is the same as that derived by applying the
function on all the data without partitioning
• E.g., count(), sum(), min(), max()
• Algebraic: if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is
obtained by applying a distributive aggregate function
• E.g., avg(), min_N(), standard_deviation()
• Holistic: if there is no constant bound on the storage size needed
to describe a subaggregate.
• E.g., median(), mode(), rank()
In a data warehouse, a measure is a property on which
calculations (e.g., sum, count, average, minimum,
maximum) can be made
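A minimal Python sketch (not from the notes) illustrating why sum() is distributive and avg() is algebraic; the data and the two-way partition are made up.

# sum() is distributive; avg() is algebraic (built from sum() and count()).
data = [4, 8, 15, 16, 23, 42]

# Partition the data into two subsets (think: two subcubes).
parts = [data[:3], data[3:]]

# Distributive: summing the per-partition sums equals the overall sum.
partial_sums = [sum(p) for p in parts]
assert sum(partial_sums) == sum(data)

# Algebraic: avg() is obtained from two distributive aggregates, sum() and count().
partial_counts = [len(p) for p in parts]
avg = sum(partial_sums) / sum(partial_counts)
assert avg == sum(data) / len(data)
print(avg)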

Multidimensional Data

• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  Product:  Industry → Category → Product
  Location: Region → Country → City → Office
  Time:     Year → Quarter → Month / Week → Day

A Sample Data Cube

[Figure: a 3-D data cube for sales, with dimensions Date (1Qtr–4Qtr), Product
(TV, PC, VCR) and Country (U.S.A., Canada, Mexico), plus sum cells along each
dimension. One highlighted cell shows the total annual sales of TVs in the U.S.A.]
OLAP OPERATIONS

Typical OLAP Operations
• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  – from higher level summary to lower level summary or detailed data,
  or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  – reorient the cube, visualization, 3D to series of 2D planes
Roll-up: The roll-up operation
(also called the drill-up
operation by some vendors)
performs aggregation on a
data cube, by climbing up a
concept hierarchy for a
dimension
Drill-down: Drill-down is
the reverse of roll-up. It
navigates from less
detailed data to more
detailed data.

Drill-down can be
realized by stepping
down a concept
hierarchy for a
dimension
Dice:

The dice operation defines a


subcube by performing a selection
on two or more dimensions.
Slice :

The slice operation performs


a selection on one
dimension of the given cube,
resulting in a subcube.
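As an illustration (not in the original slides), a small pandas sketch of roll-up, slice, and dice on a toy sales table; the column names and values are invented.

# Minimal pandas sketch of roll-up, slice, and dice on a toy sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2025, 2025, 2025],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "country": ["USA", "USA", "Canada", "USA", "Canada", "Mexico"],
    "product": ["TV", "PC", "TV", "PC", "TV", "PC"],
    "dollars_sold": [100, 150, 80, 120, 90, 60],
})

# Roll-up: climb the time hierarchy from quarter to year.
rollup = sales.groupby(["year", "country"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (country = "USA").
slice_usa = sales[sales["country"] == "USA"]

# Dice: select on two or more dimensions (year 2024 and product in {TV, PC}).
dice = sales[(sales["year"] == 2024) & (sales["product"].isin(["TV", "PC"]))]

print(rollup, slice_usa, dice, sep="\n\n")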
Types of OLAP Servers: ROLAP versus MOLAP
versus HOLAP
Relational OLAP (ROLAP) servers: These are the
intermediate servers that stand in between a relational back-
end server and client front-end tools. They use a relational or
extended- relational DBMS to store and manage warehouse
data, and OLAP middleware to support missing pieces.

Multidimensional OLAP(MOLAP)servers: These servers


support multidimensional views of data through array-based
multi dimensional storage engines. They map multi
dimensional views directly to data cube array structures

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach


combines ROLAP and MOLAP technology, benefiting from the
greater scalability of ROLAP and the faster computation of
MOLAP.

Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tiered architecture. Data sources (operational DBs and other
sources) are extracted, transformed, loaded, and refreshed into the data
warehouse and data marts (the data storage tier, with metadata and a
monitor/integrator component). An OLAP server sits on top of the storage tier
and serves front-end tools for analysis, query/reports, and data mining.
Tiers: Data Sources → Data Storage → OLAP Engine → Front-End Tools.]


A Three-Tier Data Warehouse Architecture

Data warehouses often adopt a three-tier architecture

1. The bottom tier is a warehouse database server that is


almost always a relational database system. Back-end tools
and utilities are used to feed data into the bottom tier from
operational databases or other external sources.

These tools and utilities perform data extraction, cleaning, and


transformation as well as load and refresh functions to update
the data warehouse.
2. The middle tier is an OLAP server that is
typically implemented using either

(1)a relational OLAP (ROLAP) model, that is, an


extended relational DBMS that maps
operations on multidimensional data to
standard relational operations; or

(2) a multidimensional OLAP (MOLAP) model,


that is, a special-purpose server that directly
implements multidimensional data and
operations.

The top tier is a front-end client layer, which contains


query and reporting tools, analysis tools, and/or data
mining tools (e.g., trend analysis, prediction, and so on).

From the architecture point of view, there are three


data warehouse models:

the enterprise warehouse,

the data mart,

and the virtual warehouse.



Three Data Warehouse Models
• Enterprise warehouse
  – collects all of the information about subjects spanning the
  entire organization
• Data Mart
  – a subset of corporate-wide data that is of value to a
  specific group of users. Its scope is confined to specific,
  selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse)
data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
Multidimensional versus Multirelational OLAP

In the MOLAP model, online analytical processing is best implemented by storing
the data in a multidimensional form that is easily viewed in a multidimensional way.
Here the data structure is fixed, so that the logic to process multidimensional analysis
can be based on well-defined methods of establishing data storage coordinates.
Usually, multidimensional databases (MDDBs) are vendors' proprietary systems.

On the other hand, the ROLAP model relies on the existing relational DBMS of the
data warehouse. OLAP features are provided against the relational database.

Multidimensional vs. Multirelational OLAP:

MOLAP (Multidimensional OLAP): Stores data in pre-aggregated "cubes"


optimized for fast query response, ideal for complex analysis with pre-defined
dimensions and hierarchies.

ROLAP (Relational OLAP): Utilizes standard relational databases and SQL


queries, offering greater flexibility for ad-hoc analysis but potentially slower
performance for complex queries.
ROLAP                                          MOLAP

ROLAP stands for Relational Online             MOLAP stands for Multidimensional Online
Analytical Processing.                         Analytical Processing.

ROLAP is used for large data volumes.          MOLAP is used for limited data volumes.

The access of ROLAP is slow.                   The access of MOLAP is fast.

In ROLAP, data is stored in relational         In MOLAP, data is stored in a
tables.                                        multidimensional array.

In ROLAP, data is fetched from the             In MOLAP, data is fetched from an MDDB
data warehouse.                                (multidimensional database).

In ROLAP, complicated SQL queries              In MOLAP, a sparse matrix is used.
are used.

In ROLAP, a static multidimensional            In MOLAP, a dynamic multidimensional
view of data is created.                       view of data is created.
OLAP Servers
OLAP (Online Analytical Processing) servers are specialized tools or platforms used to
perform complex analytical queries on multidimensional data efficiently. These servers
are part of data warehouse systems and are designed to support decision support
systems (DSS), business intelligence (BI), and analytical applications.

MOLAP (Multidimensional OLAP) : Stores data in a multidimensional cube format.


Pre-aggregated and indexed data for fast access.

Architecture: Data is loaded from the data warehouse, transformed into a cube, and
stored in a proprietary multidimensional database.

Merits :
Very fast query performance.
Efficient storage due to data compression.
Good for frequently queried data.

Demerits :
Not ideal for real-time data.
Cube size limitations.
Requires preprocessing (build time).

Examples: Microsoft Analysis Services (SSAS in MOLAP mode), IBM Cognos TM1
Oracle Essbase
ROLAP (Relational OLAP)
Overview: Stores data in relational databases.
Uses SQL queries to dynamically calculate aggregations.

Architecture: No need to create cubes.


Metadata layer maps multidimensional views to relational data.

Advantages:
Can handle large volumes of data.
Real-time data access.
Works with existing relational databases (like MySQL, SQL Server).

Disadvantages:
Slower performance compared to MOLAP.
Depends on the database query optimizer.
Complex queries may lead to high latency.

Examples:
MicroStrategy
SAP BusinessObjects
IBM Cognos (in ROLAP mode)
HOLAP (Hybrid OLAP)
Combines the features of MOLAP and ROLAP.
Stores summary data in MOLAP and detailed data in ROLAP.

➤ Architecture:
High-level aggregates are stored in cubes for fast access.
Drill-down to relational tables for detailed data.

➤ Advantages:
Balanced performance and scalability.
Can handle large datasets and frequent queries.

➤ Disadvantages:
Complex architecture.
Management overhead due to dual storage.

➤ Examples:
Microsoft SSAS (in HOLAP mode)
SAP BW

UNIT-2

Data Preprocessing: Need for preprocessing Descriptive data


summarization, Data Cleaning: Missing Values, Noisy Data, (Binning,
Clustering, Regression, Computer and Human inspection), Inconsistent
Data, Data Integration and Transformation.
Data Preprocessing

Data preprocessing is the process of preparing raw data for analysis by


cleaning and transforming it into a usable format. In data mining it refers
to preparing raw data for mining by performing tasks like cleaning,
transforming, and organizing it into a format suitable for mining algorithms.
• Goal is to improve the quality of the data.

• Helps in handling missing values, removing duplicates, and normalizing


data.

• Ensures the accuracy and consistency of the dataset.


Today’s real-world databases are highly susceptible
to noisy, missing, and inconsistent data due to their
typically huge size (often several gigabytes or more) and
their likely origin from multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
There are many possible reasons for noisy data (having
incorrect attribute values).

•The data collection instruments used may be faulty.

• There may have been human or computer errors


occurring at data entry.

•Errors in data transmission can also occur.

•There may be technology limitations, such as limited


buffer size for coordinating synchronized data transfer
and consumption.

Incorrect data may also result from inconsistencies in


naming conventions or data codes used, or inconsistent
formats for input fields, such as date. Duplicate tuples
also require data cleaning

There are a number of data preprocessing techniques.

• Data cleaning can be applied to remove noise and correct


inconsistencies in the data.

•Data integration merges data from multiple sources into a


coherent data store, such as a data warehouse.

•Data transformations, such as normalization, may be


applied. For example, normalization may improve the
accuracy and efficiency of mining algorithms involving
distance measurements.

•Data reduction can reduce the data size by aggregating,


eliminating redundant features, or clustering, for instance.

In other words, the data you wish to analyze by data


mining techniques are incomplete (lacking attribute
values or certain attributes of interest, or containing only
aggregate data), noisy (containing errors, or outlier
values that deviate from the expected), and inconsistent
(e.g., containing discrepancies in the department codes
used to categorize items).
Data preprocessing

Data preprocessing is the process of cleaning, transforming, and


organizing raw data before feeding it into a data mining or
machine learning system.
Raw data is often incomplete, inconsistent, noisy, or in different
formats, making preprocessing essential.
Reasons why preprocessing is needed:

• Handling Missing Values – Fill in, remove, or estimate missing data.
• Noise Removal – Eliminate errors or outliers (e.g., sensor glitches).

• Data Integration – Merge data from multiple sources into a consistent


dataset.

• Data Transformation – Normalize, aggregate, or discretize for better


analysis.

• Improved Accuracy – Preprocessed data leads to better model


performance.

• Efficiency – Reduces complexity, storage, and computation time.


Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression

• Data transformation and data discretization
  – Normalization
  – Concept hierarchy generation
Data Cleaning

• Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or
data cleansing) routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.

• Missing data

1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.


• 3. Use a global constant to fill in the missing


value: Replace all missing attribute values by
the same constant, such as a label like
“Unknown”. If missing values are replaced by,
say, “Unknown,” then the mining program may
mistakenly think that they form an interesting
concept, since they all have a value in
common—that of “Unknown.” Hence,
although this method is simple, it is not
foolproof.
• 4. Use the attribute mean to fill in the missing
value: Replace the missing value with the mean
(average) of that attribute computed over all tuples.

5. Use the attribute mean for all samples belonging


to the same class as the given tuple

6. Use the most probable value to fill in the missing


value: This may be determined with regression, inference-
based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer
attributes in your data set, you
may construct a decision tree to predict the missing values
for income.
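A minimal pandas sketch (illustrative only, not from the notes) of methods 4 and 5 above: filling a missing income with the overall attribute mean and with the class-wise mean; the column names and values are made up.

# Fill missing values with the attribute mean, and with the class-wise mean.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [30000, np.nan, 52000, np.nan, 61000, 45000],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# Method 4: replace missing income by the overall mean income.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: replace missing income by the mean income of tuples in the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)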
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data (e.g., Occupation=“ ” (missing data))
– noisy: containing noise, errors, or outliers (e.g., Salary=“−10”, an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data) (Jan. 1 as everyone’s birthday?)

Noise: Noise is a random error or variance in a measured variable.

• Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.

• Clustering: Outliers may be detected by clustering,
where similar values are organized into groups,
or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers.
Methods to Handle Noisy Data
1. Binning : Group values into “bins” and smooth them by replacing with
representative values (mean, median, or boundary).

2. Clustering : Group similar data points together; noisy data points will
often fall outside main clusters and can be treated as outliers

Example: In a 2D plot of customer age vs. income, most points form
dense groups, but a few scattered far away may be noise.

Benefit: Automatically detects outliers without predefined rules.

3. Regression : Fit a function (linear, polynomial, etc.) to the data and
replace values that deviate strongly from the fit (see the sketch after this list).

Example:
A simple linear regression predicts y = 2x + 5. If the actual point (x=10,
y=200) is far from this line, it’s likely noise.
4. Computer and Human Inspection

Computer inspection: Use automated scripts to detect anomalies (e.g.,


sudden jumps, impossible values like negative age).

Human inspection: Experts manually review flagged records to confirm or


correct them.

Example: Automated system flags a temperature reading of −200°C in a


human body dataset

A human confirms this is impossible and corrects the entry
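A minimal sketch (not from the notes) of regression-based noise handling as described in point 3 above: fit a least-squares line with NumPy and flag points that deviate strongly from it; the data are invented.

# Regression-based noise detection: fit a line, flag points far from the fit.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = 2 * x + 5                      # "clean" values following y = 2x + 5
y[9] = 200                         # inject one noisy value at x = 10

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
residuals = y - (slope * x + intercept)

# Flag points whose residual is unusually large (here: > 2 standard deviations).
threshold = 2 * residuals.std()
noisy = np.abs(residuals) > threshold
print(list(zip(x[noisy], y[noisy])))         # -> [(10.0, 200.0)]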


BINNING
Binning: Binning methods smooth a sorted data
value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed
into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of
values, they perform local smoothing.

•In smoothing by bin means, each value in a bin is


replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this bin
is replaced by the value 9.
In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the
greater the effect of the smoothing.

Bins may be equal-width : where the interval range of


values in each bin is constant. Binning is also used as a
discretization technique

Example: Medical sensor readings (heart rate): 78, 80, 79, 82, 800, 81, 77.
The value 800 is impossible (a faulty reading).
Binning: placing the plausible readings in a bin covering the range 75–85 and
replacing all bin members with the bin mean (about 80) smooths small
fluctuations, while the 800 spike falls outside the bin and is flagged as an
outlier to be removed or corrected.
Binning Methods for Data Smoothing

 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
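A minimal Python sketch (illustrative, not from the notes) that reproduces the equi-depth binning example above, smoothing by bin means and by bin boundaries.

# Equi-depth binning with smoothing by bin means and by bin boundaries.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency (equi-depth) bins of 4 values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]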
Simple Discretization: Binning

• Equal-width (distance) partitioning


– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– if the observed values are all between 0-100,we could create 5 bins as follows
width = (100 – 0)/5 = 20
– bins: [0-20], (20-40], (40-60], (60-80], (80-100]. Here [ or ] means the endpoint
is included, and ( or ) means the endpoint is not included.
– Typically, the first and last bins are extended to allow for values outside the
range of observed values: (-infinity, 20], (20-40], (40-60], (60-80], (80, infinity).

• Equal-depth (frequency) partitioning
– Divides the range into N intervals, each containing approximately same
number of samples
– Good data scaling
– Managing categorical attributes can be tricky
Discretization by
Binning

Discretization is the process of converting continuous


numerical data into discrete categories (intervals or
“bins”).

Steps in Discretization by Binning

1.Sort the data in ascending order.


2. Divide the range into a fixed number of bins (equal-width or equal-
frequency).
3. Replace the values in each bin with:
Bin mean (smoothing by bin means)
Bin median (smoothing by bin medians)
Bin boundaries (nearest boundary value)
22, 25, 27, 28, 32, 35, 37, 40, 44, 48

Steps sort the data first then choose binning method for example use
equal-width binning with 3 bins

W = (B – A)/N
W = (48 – 22)/3 ≈ 8.7

Bins:
Bin 1: [22 – 30] : 22,25,27,28
Bin 2: [31 – 39] : 32, 35, 37
Bin 3:[40 – 48] : 40, 44, 48

22–30 → “Young Adult”
31–39 → “Adult”
40–48 → “Middle-aged”
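A small pandas sketch (not from the notes) of the same equal-width discretization using pandas.cut; the labels follow the example above.

# Equal-width discretization of the ages above into three labelled bins.
import pandas as pd

ages = [22, 25, 27, 28, 32, 35, 37, 40, 44, 48]

labels = ["Young Adult", "Adult", "Middle-aged"]
binned = pd.cut(ages, bins=3, labels=labels, include_lowest=True)

print(list(zip(ages, binned)))
# Bin edges are roughly 22-30.7, 30.7-39.3, 39.3-48, matching the example.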
Inconsistent Data
When the same data is recorded in different formats, units, spellings, or
conventions, leading to mismatches. This often happens when data comes from
multiple sources or is entered manually.

Examples :

Date format mismatch: 2025-08-11 vs 11/08/2025 vs Aug 11, 2025.

Units mismatch: Weight recorded as 70 kg in one dataset and 154 lbs in another.

Spelling/abbreviation differences: "NY" vs "New York", "Male" vs "M".

Apply data cleaning (standardization, conversion of units, correcting


spellings).
Data Integration and Transformation
The process of combining data from different sources into a unified view,
resolving differences in format, structure, and semantics.
Common in data warehouses, analytics, and ETL processes.

Data Transformation : The process of converting data into a suitable format or


structure for analysis or storage.

Often a step in ETL after integration and before loading into a warehouse.

Types of Transformations:
1. Smoothing – removing noise (e.g., binning, moving average).
2. Aggregation – summarizing (e.g., total sales per month).
3. Normalization – scaling values to a range (e.g., 0–1).
4. Attribute construction – creating new features from existing data (e.g., Full Name
from First + Last Name).
5. Encoding – converting categories into numbers (e.g., "Male" = 0, "Female" = 1).

Data Integration and Transformation

Data mining often requires data integration—the merging


of data from multiple data stores. The data may also need
to be transformed into forms appropriate for mining

Redundancy is an important issue. An attribute (such as


annual revenue, for instance) may be redundant if it can
be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also
cause redundancies in the resulting data set.
Some redundancies can be detected by correlation
analysis

Data Transformation
In data transformation, the data are transformed or
consolidated into forms appropriate
for mining. Data transformation can involve the following:

Smoothing, which works to remove noise from the data.


Such techniques include binning, and clustering.

Aggregation, where summary or aggregation operations


are applied to the data. For example, the daily sales data
may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a
data cube for analysis of the data at multiple
granularities.

Generalization of the data, where low-level or “primitive”


(raw) data are replaced by higher-level concepts through
the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level
concepts, like city or country. Similarly, values for numerical
attributes, like age, may be mapped to higher-level
concepts, like youth, middle-aged, and senior.

Normalization, where the attribute data are scaled so as
to fall within a small specified
range, such as −1.0 to 1.0, or 0.0 to 1.0.

Normalization
Normalization is particularly useful for classification
algorithms involving neural networks, or distance
measurements such as nearest-neighbor classification
and clustering.

If using the neural network backpropagation algorithm
for classification mining, normalizing the input values for
each attribute measured in the training tuples will help
speed up the learning phase.

For distance-based methods, normalization helps


prevent attributes with initially large ranges (e.g.,
income) from outweighing attributes with initially
smaller ranges

Normalization
• Min-max normalization: maps a value v of attribute A to v' in the range
  [new_minA, new_maxA]:

    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
  – Then $73,600 is mapped to
    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Exercise: normalize the data 8, 10, 15, 20 using the min-max
normalization technique, with new min = 10 and new max = 20.

Here minA = 8 and maxA = 20, so each value v maps to
v' = ((v − 8) / (20 − 8)) × (20 − 10) + 10:
8 → 10, 10 → 11.67, 15 → 15.83, 20 → 20.
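A minimal Python sketch (illustrative, not from the notes) of min-max normalization, reusing the income example and the exercise above.

# Min-max normalization of an attribute to a new range.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12_000, 73_600, 98_000]
print(min_max(incomes))                                  # 73,600 -> ~0.716 on [0, 1]
print(min_max([8, 10, 15, 20], new_min=10, new_max=20))  # [10.0, 11.67, 15.83, 20.0]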
Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v − μA) / σA

Ex. Let μ = 54,000 and σ = 16,000 for attribute income. Then

    (73,600 − 54,000) / 16,000 = 1.225
The grades on a history midterm at
University have a mean of μ=85, and
a standard deviation of σ=2. John
scored 86 on the exam.

Find the z-score for John’s exam


grade.
The grades on a geometry midterm at
University have a mean of μ = 82 and a
standard deviation of σ = 4. John
scored 74 on the exam. Find the z-score.

The grades on a geometry midterm at
University have a mean of μ = 74 and a
standard deviation of σ = 4.0. Anil
scored 70 on the exam.
Find the z-score for Anil’s exam grade.
Round to two decimal places.
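A minimal Python sketch (not from the notes) of z-score normalization, checked against the examples above.

# Z-score normalization: (value - mean) / standard deviation.
def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))  # 1.225  (income example above)
print(z_score(86, 85, 2))               # 0.5    (John's history midterm)
print(round(z_score(70, 74, 4.0), 2))   # -1.0   (Anil's geometry midterm)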
Normalization by decimal scaling

Transform the data by moving the decimal point of values of feature F.
The number of decimal places moved depends on the maximum absolute
value of F. A value v of F is normalized to v' by computing

    v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Suppose that the recorded values of 𝐹 range from
− 986 to 917. The maximum absolute value of 𝐹 is 986. To
normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., 𝑗 = 3) so that −986 normalizes to −0.986 and 917
normalizes to 0.917.
Example: suppose the range of
attribute X is −500 to 45. The
maximum absolute value of X is
500. To normalize by decimal
scaling we will divide each value
by 1,000 (j = 3).

In this case, −500 becomes −0.5


while 45 will become 0.045
Salary bonus   Formula     Scaling
400            400/1000    0.4
310            310/1000    0.31
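A minimal Python sketch (illustrative, not from the notes) of normalization by decimal scaling, checked against the examples above.

# Decimal scaling: divide by 10^j, with j chosen from the maximum absolute value.
def decimal_scaling(values):
    # Smallest j such that every |v| / 10**j is below 1.
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
print(decimal_scaling([-500, 45]))    # [-0.5, 0.045]
print(decimal_scaling([400, 310]))    # [0.4, 0.31]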
Data Discretization and Concept Hierarchy
Generation

Data Discretization: – Dividing the range of a continuous


attribute into intervals

• Interval labels can then be used to replace actual data


values.

•Replacing numerous values of a continuous attribute by a


small number of interval labels thereby reduces and
simplifies the original data.

•Some classification algorithms only accept categorical data.

•This leads to a concise, easy-to-use, knowledge-level


representation of mining results

Data discretization example: we have an attribute
age with the following values.

Age: 10, 11, 13, 14, 17, 19, 30, 31, 32, 38, 40, 42, 70, 72, 73, 75

After discretization:
  10, 11, 13, 14, 17, 19  → Young
  30, 31, 32, 38, 40, 42  → Mature
  70, 72, 73, 75          → Old

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts with higher-level
concepts.

In the multidimensional model, data are organized into


multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data
from different perspectives.

Concept hierarchies can be used to reduce the data by


collecting and replacing low-level concepts (such as
numerical values for the attribute age) with higher-level
concepts (such as youth, middle-aged, or senior).
Encoding Methods
Encoding is the process of converting categorical (non-numeric) data into
a numerical format so that machine learning algorithms can process it.

Encoding Methods

1. Label Encoding : Assigns a unique integer to each category.


Color: Red, Blue, Green
Encoded: 0, 1, 2
2. One-Hot Encoding : Creates a binary column for each category (1 if
present, 0 otherwise).

Color: Red, Blue, Green


Red Blue Green
Red 1 0 0
Blue 0 1 0
Green 0 0 1
3. Ordinal Encoding : Similar to label encoding but uses logical order for
categories.

Size: Small=1, Medium=2, Large=3

4. Binary Encoding : Converts categories to binary digits, then stores them in


separate columns.

Category IDs: 1, 2, 3, 4
Binary form: 01, 10, 11, 100
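A small pandas sketch (not from the notes) of label, one-hot, and ordinal encoding; the column names and category order are made up.

# Label, one-hot, and ordinal encoding of categorical attributes.
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label encoding: map each category to an integer code.
df["Color_label"] = df["Color"].astype("category").cat.codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["Color"], prefix="Color")

# Ordinal encoding: use an explicit, meaningful order.
sizes = pd.Series(["Small", "Large", "Medium"])
size_order = {"Small": 1, "Medium": 2, "Large": 3}

print(df.join(one_hot), sizes.map(size_order), sep="\n\n")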
Descriptive data summarization,

Descriptive data summarization is the process of condensing large datasets into


compact, easy-to-understand summaries using statistical measures and visual
representations.

It helps in understanding the distribution, central tendency, variability, and


overall characteristics of the data before applying deeper analysis or modeling.

Types of Summarization

1. Numerical Measures
Measures of Central Tendency – Show the “center” of data:
Mean – Arithmetic average
Median – Middle value when sorted
Mode – Most frequent value
2.Measures of Dispersion – Show how spread out the data is:
• Range – Max − Min
• Variance & Standard Deviation – Spread from the mean
• Quartiles & IQR (Interquartile Range) – Middle 50% spread
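A minimal pandas sketch (illustrative, not from the notes) computing the central tendency and dispersion measures listed above on a toy series.

# Descriptive data summarization: central tendency and dispersion measures.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

print("mean:",   values.mean())                 # central tendency
print("median:", values.median())
print("mode:",   values.mode().tolist())
print("range:",  values.max() - values.min())   # dispersion
print("std:",    values.std())
q1, q3 = values.quantile(0.25), values.quantile(0.75)
print("IQR:",    q3 - q1)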
