Dr. Anil Chhangani
Associate Professor
CSC - 504
Data Warehousing & Mining
Data, Information & Knowledge
Data - a stream of raw facts representing things or events that have happened
Information - data that has been processed to make it meaningful & useful
Data + Meaning = Information
Knowledge - knowing what to do with data & information requires knowledge
Unit
Kilobytes (KB) 1,000 bytes
Megabytes (MB) 1,000 Kilobytes
Gigabytes (GB) 1,000 Megabytes
Terabytes (TB) 1,000 Gigabytes
Petabytes (PB) 1,000 Terabytes
Exabytes (EB) 1,000 Petabytes
Zettabytes (ZB) 1,000 Exabytes
Yottabytes (YB) 1,000 Zettabytes
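The factor-of-1,000 ladder in the table above can be sketched in a few lines; the list and function name are my own choices for illustration:

```python
# Decimal (SI) storage units from the table above: each step is a factor of 1,000.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one unit (KB = 1000^1, ..., YB = 1000^8)."""
    return 1000 ** (UNITS.index(unit) + 1)

print(bytes_in("TB"))  # 1 terabyte = 1000^4 = 1,000,000,000,000 bytes
```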
Why Data Mining?
• The Explosive Growth of Data: from
terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)
• Data rich but information poor!
• Data mining — Analysis of massive data sets
What is Data Mining?
- searching for knowledge (interesting patterns) in data
- analysing large amounts of data stored in a data
warehouse for useful information, making use of
statistical tools to find patterns which are
otherwise hidden
- an essential step in the process of knowledge
discovery
Data Mining Task Primitives
- The user has a data mining task in mind (a form of data analysis)
- A DM task can be specified in the form of a data mining query
- A DM query is defined in terms of data mining task primitives
- These primitives allow the user to communicate with the
DM system during discovery in order to
- direct the mining process
- examine the findings from different angles
or depths
The data mining primitives specify the following:
The set of task-relevant data to be mined:
- specifies the portions of the database or the set of
data in which the user is interested
- includes database attributes or data
warehouse dimensions of interest
The kind of knowledge to be mined:
Specifies the DM functions to be performed, such as
Characterization
Discrimination
Association or correlation analysis,
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis.
The background knowledge to be used in the
discovery process:
Useful for
- guiding the knowledge discovery process
- evaluating the patterns found
Concept hierarchies allow data to be mined
at multiple levels of abstraction
E.g. a concept hierarchy may be defined for attribute age;
the root represents the most general abstraction level,
denoted as all
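As a rough sketch, such an age hierarchy can be modelled in code; the category names and cut points below are invented for illustration, not taken from the slides:

```python
# A minimal sketch of a concept hierarchy for the attribute age.
# Level 0 is the root ("all", most general); level 1 holds assumed
# mid-level categories; level 2 is the raw value itself.
def age_concept(age: int, level: int) -> str:
    """Map a raw age to a concept at the requested abstraction level."""
    if level == 0:
        return "all"                  # root: most general abstraction
    if level == 1:
        if age < 25:                  # illustrative cut points only
            return "youth"
        elif age < 60:
            return "middle_aged"
        return "senior"
    return str(age)                   # leaf: the raw value

print(age_concept(42, 1))  # middle_aged
```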
The interestingness measures and thresholds for
pattern evaluation:
Used
- to guide the mining process, or
- to evaluate the discovered patterns
Different kinds of knowledge may have different
interestingness measures
E.g. interestingness measures for association
rules include support and confidence
Rules whose support and confidence values are below user-
specified thresholds are considered uninteresting
For example, suppose we want to know which products are
frequently purchased together (within the same transaction)
An example of such a rule, mined from an electronics store's
transactional database, is
buys(X,“computer”) -> buys(X, “software”) [support = 1%, confidence = 50%]
- X is a variable representing a customer
- confidence of 50% means that if a customer buys a
computer, there is a 50% chance that the customer will buy
software as well
- support of 1% means that 1% of all the transactions under analysis
show that computer and software are purchased together
This association rule involves a single attribute or predicate (i.e., buys),
hence it is referred to as a single-dimensional association rule
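Support and confidence for such a rule can be computed directly from a transaction list; the toy transactions below are made up for illustration:

```python
# Support and confidence for the rule buys(X, "computer") -> buys(X, "software"),
# over a small invented transaction list (each transaction is a set of items).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

def support(itemset, txns):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(lhs, rhs, txns):
    """Estimated P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"computer", "software"}, transactions))       # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # 2/3
```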
The expected representation for visualizing the
discovered patterns:
The kind of knowledge to be mined:
Specifies DM functions to be performed,
Characterization
Discrimination
Association or correlation analysis,
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis
Characterization/ Descriptions
Data can be associated with classes or
concepts
E.g.
classes of items – computers, printers
concepts of customers – bigSpenders, budgetSpenders
Characterization / Descriptions
Descriptions can be derived via
Data characterization – summarizing the
general characteristics of a target class
E.g. summarizing the characteristics of customers
who spend more than Rs10,000 a year
Result can be a general profile of the customers,
such as 40 – 50 years old, employed, have
excellent credit ratings
Data discrimination – comparing the target class with one
or a set of comparative classes
E.g. compare the general features of software
products whose sales increased by 10% in the
last year with those whose sales
decreased by 30% during the same period
Or both of the above
Mining Frequent Patterns, Associations & Correlations
Frequent itemset: a set of items that frequently appear
together e.g. milk and bread
Frequent subsequence: a pattern in which customers tend to
purchase product A, followed by a purchase of product B
Mining Frequent Patterns, Associations & Correlations
Association Analysis: find frequent patterns
E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys( X, “software”)
[support= 1%, confidence = 50%]
if a customer buys a computer, there is a 50% chance that
he/she will buy software
1% of all of the transactions under analysis showed that
computer and software are purchased together
Association rules are discarded as uninteresting if they do
not satisfy both minimum threshold values (minimum
support and minimum confidence)
Mining Frequent Patterns, Associations & Correlations
Correlation Analysis: additional analysis to find statistical
correlations between associated pairs
Classification and Prediction
Classification:
The process of finding a model that describes & distinguishes
the data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class
label is unknown
The derived model is based on the analysis of a set of training
data (data objects whose class label is known)
The model can be represented as classification (IF-THEN)
rules, decision trees, neural networks, etc.
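A derived model in IF-THEN rule form can be as simple as the sketch below; the attributes, thresholds, and class names (borrowed from the bigSpenders/budgetSpenders example earlier) are illustrative assumptions only:

```python
# A hand-written IF-THEN rule model of the kind a classifier might derive.
# The rule conditions are invented for illustration, not mined from real data.
def classify_customer(age: int, income: int) -> str:
    # IF age is 40-50 AND income is high THEN class = bigSpender
    if 40 <= age <= 50 and income > 100_000:
        return "bigSpender"
    # ELSE (default rule) class = budgetSpender
    return "budgetSpender"

print(classify_customer(45, 120_000))  # bigSpender
print(classify_customer(30, 40_000))   # budgetSpender
```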
Classification and Prediction
Prediction
Predict missing or unavailable numerical data values or
a class label for some data
Cluster Analysis
Class label is unknown ( unsupervised) : group data to
form new classes
Clusters of objects are formed based on the principle
of maximizing intraclass similarity and
minimizing interclass similarity
E.g. Identify homogeneous subpopulations of
customers. These clusters may represent individual
target groups for marketing.
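One common way to form such clusters is k-means; below is a minimal 1-D sketch (the spend values and starting centers are invented for illustration):

```python
# A minimal 1-D k-means sketch: values are grouped so that intraclass
# similarity is high (each value is near its cluster center).
def kmeans_1d(values, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: each value joins its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [100, 120, 110, 900, 950, 1000]       # invented annual spend per customer
centers, clusters = kmeans_1d(spend, [100, 1000])
print(clusters)   # two homogeneous subpopulations of customers
```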
Evolution Analysis
Describes and models regularities or trends for
objects whose behavior changes over time
E.g. Identify stock evolution regularities for overall
stocks and for the stocks of particular companies
Major Issues in Data Mining
Issues in data mining research, are grouped into
five groups:
Mining methodology
User interaction
Efficiency and scalability
Diversity of data types
Data mining and society
Major Issues in Data Mining
Mining methodology: Various aspects of mining
methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining
User Interaction: Various aspects of it are
Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results
Major Issues in Data Mining
Efficiency and Scalability: Various aspects of it are
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Major Issues in Data Mining
Diversity of Database Types: Various aspects of it are
Handling complex types of data
Mining dynamic, networked, and global data repositories
Major Issues in Data Mining
Data Mining and Society: Various aspects of it are
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Outlier Analysis
Data that do not comply with the general behavior or
model
Outliers are usually discarded as noise or exceptions
Useful for fraud detection
E.g. Detect purchases of extremely large amounts
Knowledge Discovery from Data (KDD) Process
The KDD process is shown in Figure as an iterative
sequence of the following steps:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
1. Data cleaning - to remove noise and inconsistent data
2. Data integration - where multiple data sources may be combined
3. Data selection - where data relevant to the analysis task are
retrieved from the database
4. Data transformation - where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining - an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation - to identify interesting patterns representing
knowledge based on interestingness measures
7. Knowledge presentation - where visualization and knowledge
representation techniques are used to present mined knowledge
to users
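The seven steps can be sketched as a pipeline of functions; the bodies below are trivial stand-ins (only the sequencing mirrors the process), with a frequency count playing the role of the mining step:

```python
# The KDD process as a toy pipeline; every function body is a placeholder.
def clean(data):          # 1. remove noise / inconsistent records
    return [d for d in data if d is not None]

def integrate(*sources):  # 2. combine multiple data sources
    return [row for src in sources for row in src]

def select(data, pred):   # 3. retrieve task-relevant data
    return [d for d in data if pred(d)]

def transform(data):      # 4. consolidate into a form suitable for mining
    return sorted(data)

def mine(data):           # 5. apply an "intelligent method" (here: counting)
    return {v: data.count(v) for v in set(data)}

def evaluate(patterns, min_count):  # 6. keep only interesting patterns
    return {k: v for k, v in patterns.items() if v >= min_count}

def present(patterns):    # 7. knowledge presentation
    return ", ".join(f"{k} x{v}" for k, v in sorted(patterns.items()))

raw = integrate([3, 1, None, 3], [2, 3, 1])
result = present(evaluate(mine(transform(select(clean(raw), lambda x: x > 0))), 2))
print(result)  # 1 x2, 3 x3
```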
What Is an Attribute
An attribute is a data field, representing a characteristic or
feature of a data object
The nouns attribute, dimension, feature, and variable are
often used interchangeably
Dimension is commonly used in data warehousing
Machine learning literature tends to use the term feature
Statisticians prefer the term variable
Data mining and database professionals commonly use the
term attribute
Attribute
Attributes describing a customer object can include, for
example, customer ID, name, and address
Observed values for a given attribute are known as
observations
A set of attributes used to describe a given object is called an
attribute vector (or feature vector)
The distribution of data involving one attribute (or variable) is
called univariate
A bivariate distribution involves two attributes, and so on
Types of Attribute
The type of an attribute is determined by the set of possible
values the attribute can have
Following are types of attributes
Nominal
Binary
Ordinal
Numeric
Types of Attribute : Nominal
Nominal means “relating to names”
Values of nominal attribute are symbols or names of things
Each value represents some kind of category, code, or state,
Nominal attributes are also referred to as categorical
Values do not have any meaningful order
Values are also known as enumerations
Types of Attribute : Nominal
Example: hair color is an attribute describing person
objects
Possible values for hair color are black, brown, blond, red,
pink, gray, and white
Another example of a nominal attribute is occupation, with
the values professor, dentist, programmer, farmer, and so on
It is possible to represent symbols or “names” with numbers.
With hair color, for instance, we can assign a code of 0 for
black, 1 for brown, and so on
Types of Attribute : Nominal
Because nominal attribute values do not have any meaningful
order and are not quantitative, it makes no sense to
find the mean (average) value or median (middle) value for
such an attribute, given a set of objects
One thing that is of interest, however, is the attribute’s most
commonly occurring value
This value, known as the mode, is one of the measures of
central tendency
Types of Attribute : Binary
A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where
0 typically means that the attribute is absent, and
1 means that it is present
Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Types of Attribute : Binary
Example : Suppose a patient undergoes a medical test that
has two possible outcomes
The attribute medical test is binary, where a value of 1 means
the result of the test for the patient is positive, while 0 means
the result is negative
A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight
It means, there is no preference on which outcome should be
coded as 0 or 1
Example could be the attribute gender having the states male
and female
Types of Attribute : Binary
A binary attribute is asymmetric if the outcomes of the states
are not equally important, e.g. the positive and negative
outcomes of a medical test
By convention, we code the most important outcome, which
is usually the rarest one, by 1 (positive) and the other by 0
Types of Attribute : Ordinal
Ordinal Attributes
An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known
Ordinal attributes may be obtained from the discretization of
numeric quantities by splitting the value range into a finite
number of ordered categories
Central tendency of an ordinal attribute can be represented by
its mode and its median, but the mean cannot be defined.
Types of Attribute : Ordinal
Ordinal Attributes
Nominal, binary, and ordinal attributes are qualitative
That is, they describe a feature of an object without giving an
actual size or quantity
The values of such qualitative attributes are typically words
representing categories
Types of Attribute : Ordinal
Ordinal Attributes
Example : Suppose that drink size corresponds to the size of
drinks available at a fast-food restaurant
Attribute has 3 possible values: small, medium, and large
The values have a meaningful sequence (which corresponds
to increasing drink size); however, we cannot tell from the
values how much bigger, say, a large is than a medium
Other examples of ordinal attributes include grade (e.g., A+,
A, A-, B+,and so on) and professional rank
Types of Attribute : Ordinal
Example :Professional ranks can be enumerated in a
sequential order for example, assistant, associate, and full for
professors
Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively
Ordinal attributes are often used in surveys for ratings
E.g. customer satisfaction may have the following ordinal categories: 0:
very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3:
satisfied, and 4: very satisfied
Types of Attribute : Numeric
Numeric Attribute : is quantitative
It is a measurable quantity, represented in integer or real values
Numeric attributes can be interval-scaled or ratio-scaled
Interval-scaled attributes are measured on a scale of equal-
size units
The values of interval-scaled attributes have order and can be
positive, 0, or negative
In addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between
values
Types of Attribute : Numeric- Interval-scaled
Example: A temperature attribute is interval-scaled
The outdoor temperature value for a number of different days,
where each day is an object
By ordering the values, we obtain a ranking of the objects
with respect to temperature
We can also quantify the difference between values, a temp of
20 is five degrees higher than a temperature of 15
Calendar dates are another example. For instance, the years
2002 and 2022 are twenty years apart
Because such attributes are numeric, we can compute their
mean, median, and mode measures of central tendency
Types of Attribute : Numeric- Ratio-scaled
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an
inherent zero-point
If a measurement is ratio-scaled, we can speak of a value as
being a multiple (or ratio) of another value
In addition, the values are ordered, and we can also compute
the difference between values, as well as the mean, median,
and mode.
Types of Attribute : Numeric- Ratio-scaled
Examples: count attributes such as years of experience
(e.g., the objects are employees),
number of words (e.g., the objects are documents), and
monetary quantities (e.g., you are 100 times richer with
Rs1000 than with Rs10)
Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have
an overall picture of your data
Basic statistical descriptions can be used to identify properties
of the data and highlight which data values should be treated
as noise or outliers
There are three areas of basic statistical descriptions
Measures of central tendency which measure the location of
the middle or center of a data distribution
We discuss the mean, median, mode, and midrange
Statistical Descriptions of Data
The dispersion of the data : That is, how are the data spread
out? The most common data dispersion measures are the
Range
Quartiles
Interquartile range
Five-number summary and boxplots
Variance and standard deviation
These measures are useful for identifying outliers
Statistical Descriptions of Data
Graphic displays of basic statistical descriptions to visually
inspect our data
Most statistical or graphical data presentation software
packages include bar charts, pie charts, and line graphs
Other popular displays of data summaries and distributions
include quantile plots, quantile–quantile plots, histograms,
and scatter plots
Measures of Central Tendency
Mean, Median, Mode, and Midrange
Measures of central tendency include the mean, median,
mode, and midrange
The most common and effective numeric measure of the
“center” of a set of data is the (arithmetic) mean
For a set of N values x1, x2, ..., xN, the mean is
(x1 + x2 + ... + xN) / N
Example (Mean): Suppose we have the following values for
salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
     = 696 / 12 = 58
Thus, the mean salary is $58,000
Although the mean is the single most useful quantity for
describing a data set, it is not always the best way of
measuring the center of the data
A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values
Even a small number of extreme values can corrupt the mean
For ex, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers
To offset the effect caused by a small number of extreme
values, we can instead use the trimmed mean
It is the mean obtained after chopping off values at the high
and low extremes.
For example, we can sort the values observed for salary and
remove the top and bottom 2% before computing the mean
Avoid trimming too large a portion (such as 20%) at both
ends, as this can result in the loss of valuable information
For skewed (asymmetric) data, a better measure of the center
of data is the median
The median is the middle value in a set of ordered data values;
it separates the higher half of a data set from the lower half
Suppose a data set of N values for an attribute X is sorted in
increasing order
If N is odd, then the median is the middle value of the ordered
set
If N is even, then the median is taken as the average of the
two middlemost values
Example (Median): Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
In theory the median can be any value within the two middlemost
values of 52 and 56; by convention we take their average as the
median; that is, (52 + 56) / 2 = 54
Thus, the median is $54,000
The mode is another measure of central tendency
Mode for a set of data is the value that occurs most frequently
in the set
Data sets with one, two, or three modes are respectively
called unimodal, bimodal, and trimodal
In general, a data set with two or more modes is multimodal
If each data value occurs only once, then there is no mode
Example (Mode): Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The two modes are $52,000 and $70,000
For unimodal numeric data that are moderately skewed
(asymmetrical), we have the following empirical relation:
Mean - Mode = 3 (Mean - Median)
The midrange can also be used to assess the central tendency
of a numeric data set
It is the average of the largest and smallest values in the set
Midrange : Suppose we have the following values for salary
(in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Midrange = ( 30,000 + 110,000 ) / 2 = $70,000
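The four measures for the salary example can be checked with Python's statistics module:

```python
# Verifying the slide's salary example (values in thousands of dollars).
import statistics

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salary)              # 696 / 12 = 58  -> $58,000
median = statistics.median(salary)          # (52 + 56) / 2 = 54 -> $54,000
modes = statistics.multimode(salary)        # [52, 70] -> bimodal
midrange = (min(salary) + max(salary)) / 2  # (30 + 110) / 2 = 70 -> $70,000

print(mean, median, modes, midrange)
```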
In a unimodal frequency curve with perfect symmetric data
distribution, the mean, median, and mode are all at the same
center value, as shown in Figure (a)
Data in most real applications are not symmetric. They may
instead be either positively skewed, where the mode occurs at
a value that is smaller than the median(Figure b), or
Negatively skewed, where the mode occurs at a value greater
than the median (Figure c)
•If mean = median = mode, the shape of the distribution is
symmetric
•If mode < median < mean, the shape of the distribution trails
to the right, is positively skewed
•If mean < median < mode, the shape of the distribution trails
to the left, is negatively skewed
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
To assess the dispersion or spread of numeric data, the
measures include range, quantiles, quartiles, percentiles,
and the interquartile range
The five-number summary, which can be displayed as a
boxplot, is useful in identifying outliers
Variance and standard deviation also indicate the spread of
a data distribution
Let x1,x2,…… ,xN be a set of observations for some numeric
attribute, X
The range of the set is the difference between the largest
(max()) and smallest (min()) values
Suppose the data for attribute X are sorted in increasing
numeric order
We can pick certain data points so as to split the data
distribution into equal-size consecutive sets, as in the figure;
these data points are called quantiles
Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size
consecutive sets.
4-quantiles are the 3 data points that split the data
distribution into 4 equal parts; each part represents one-fourth
of the data distribution, referred to as quartiles
100-quantiles are referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets
The median, quartiles, and percentiles are the most widely
used forms of quantiles
The quartiles give an indication of a
distribution’s center, spread, and shape.
The first quartile, denoted by Q1, is
the 25th percentile. It cuts off the
lowest 25% of the data.
Third quartile, Q3, is the 75th percentile -it cuts off the
lowest 75% (or highest 25%) of the data
Second quartile is the 50th percentile, as the median, it
gives the center of the data distribution
The distance between the first and third
quartiles is a simple measure of spread
that gives the range covered by the
middle half of the data
This distance is called the Interquartile Range (IQR) and
is defined as
IQR = Q3 - Q1
To find the first, second, and third quartiles:
1. Arrange the N data values into an array
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
* Use N instead of N + 1, if N is even.
Example (Interquartile Range): Suppose we have the following
values for salary (in thousands of dollars), in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Quartiles for this data are the 3rd, 6th, and 9th values:
Q1 = $47,000 and Q3 = $63,000
Interquartile range IQR = 63 - 47 = $16,000
The sixth value, $52,000, is taken as the median (Q2) under this rule
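The position rule above (use N instead of N + 1 when N is even) can be sketched as:

```python
# Quartiles by the slide's 1-indexed position rule:
# positions (N + 1)/4, 2(N + 1)/4, 3(N + 1)/4; use N instead of N + 1 when N is even.
def quartiles(sorted_vals):
    n = len(sorted_vals)
    base = n if n % 2 == 0 else n + 1
    # k * base // 4 gives the 1-indexed position; subtract 1 for 0-indexing
    return [sorted_vals[k * base // 4 - 1] for k in (1, 2, 3)]

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salary)
print(q1, q2, q3, "IQR =", q3 - q1)  # 47 52 63 IQR = 16
```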
Measuring Dispersion of Data
Five-Number Summary, Boxplots, and Outliers:
The five-number summary of a distribution consists of the
median (Q2), the quartiles Q1 and Q3, and the smallest and
largest individual observations
Five-Number Summary written in the order of
Minimum, Q1, Median(Q2), Q3, Maximum
Boxplots are a popular way of visualizing a distribution
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles so that the
box length is the interquartile range
The median is marked by a line within the box
Two lines (called whiskers) outside the box extend to the
smallest (Minimum) and largest (Maximum) observations
Outliers : A common rule of thumb for identifying suspected
outliers is to single out values falling at least 1.5*IQR above
the third quartile or below the first quartile
Boxplot: Figure shows boxplots for unit price data for items
sold at four branches of AllElectronics during a given period
For branch 1, the median price of items sold is $80,
Q1 is $60, and Q3 is $100
Notice that two outlying observations for this branch were
plotted individually, as their values of 175 and 202 are more
than 1.5 times the IQR here of 40
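The 1.5 × IQR rule can be checked against the branch-1 figures quoted above (Q1 = $60, Q3 = $100); the price list below is a made-up sample containing the two outlying observations:

```python
# The 1.5 x IQR rule of thumb: flag values beyond the "fences"
# Q1 - 1.5 * IQR and Q3 + 1.5 * IQR as suspected outliers.
def outliers(values, q1, q3):
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# invented unit prices including the branch-1 outlying observations
prices = [60, 80, 100, 175, 202]
print(outliers(prices, q1=60, q3=100))  # [175, 202], both beyond 100 + 1.5 * 40 = 160
```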
Example: Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal,trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
Solution:
N = 27
Sum = 809
Mean = 809 / 27 ≈ 29.96
Median (14th value) = 25
Mode = 25, 35 (bimodal)
Midrange = (13 + 70) / 2 = 41.5
Q1 (7th value) { (N + 1) / 4 } = 20
Q2 (14th value) { 2 (N + 1) / 4 } = 25
Q3 (21st value) { 3 (N + 1) / 4 } = 35
Five-number summary: 13, 20, 25, 35, 70
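The solution can be verified in a few lines with the statistics module:

```python
# Verifying the age exercise: mean, median, modes, quartiles, five-number summary.
import statistics

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
       33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(len(age), sum(age))             # 27 809
print(round(sum(age) / len(age), 2))  # mean = 29.96
print(statistics.median(age))         # 25 (the 14th value)
print(statistics.multimode(age))      # [25, 35] -> bimodal

n = len(age)
q1 = age[(n + 1) // 4 - 1]            # 7th value
q2 = age[2 * (n + 1) // 4 - 1]        # 14th value
q3 = age[3 * (n + 1) // 4 - 1]        # 21st value
print([min(age), q1, q2, q3, max(age)])  # five-number summary: 13 20 25 35 70
```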
Binning: Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values around it
The sorted values are distributed into a number of “buckets,”
or bins
Because binning methods consult the neighborhood of values,
they perform local smoothing
Figure below illustrates some binning techniques
In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3
Binning: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9
Therefore, each original value in this bin is replaced by the
value 9
Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median
In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries
Measuring Dispersion of Data
Binning:
Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the
smoothing
Alternatively, bins may be equal width, where the interval
range of values in each bin is constant
Binning is also used as a discretization technique and is
discussed further later
Binning: MU Question
Suppose a group of sales price records has been sorted as follows
6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232
Partition them into 3 bins by the equal-frequency partitioning method.
Perform data smoothing by bin means.
Sol: Partition into 3 bins, each of size 4
Bin 1: 6, 9, 12, 13      Bin 1 mean = (6 + 9 + 12 + 13) / 4 = 10
Bin 2: 15, 25, 50, 70    Bin 2 mean = 160 / 4 = 40
Bin 3: 72, 92, 204, 232  Bin 3 mean = 600 / 4 = 150
Smoothing by Bin Means
Bin 1: 10, 10, 10, 10
Bin 2: 40, 40, 40, 40
Bin 3: 150, 150, 150, 150
Smoothing by Bin Boundaries
Bin 1: 6, 6, 13, 13
Bin 2: 15, 15, 70, 70
Bin 3: 72, 72, 232, 232
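The equal-frequency partitioning and both smoothing methods from the worked example can be sketched in a few lines of Python (this is a minimal illustration, not library code; ties between boundaries are broken toward the lower one):

```python
# Equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the worked example above
prices = [6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232]

k = 3                                   # number of bins
size = len(prices) // k                 # equal frequency: 4 values per bin
bins = [prices[i * size:(i + 1) * size] for i in range(k)]

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer
# of the bin's min and max (data are sorted, so those are the ends)
def to_boundary(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [to_boundary(b) for b in bins]
```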
Data Preprocessing
Why Preprocess the Data?
Data have quality if they satisfy the requirements of the
intended use
There are many factors that define data quality, including:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Inaccurate, incomplete, and inconsistent data are
commonplace properties of large real-world databases and
data warehouses
Data Preprocessing
Major Tasks in Data Preprocessing
Major steps involved in data preprocessing, namely,
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Preprocessing
Major Tasks in Data Preprocessing: figure from Han & Kamber,
Data Mining: Concepts and Techniques
Data Preprocessing
Major Tasks in Data Preprocessing
Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data
Data cleaning is usually performed as an iterative two-step
process consisting of
Discrepancy detection and
Data transformation
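The two-step cleaning idea can be sketched in a few lines of Python. This is an illustrative sketch of my own (the function name and the mean-fill / 1.5 × IQR choices are assumptions, not from the slides): missing values are filled with the attribute mean, and values outside 1.5 × IQR of the quartiles are flagged as possible discrepancies:

```python
# Minimal data-cleaning sketch (illustrative): fill missing values with
# the attribute mean, then flag outliers outside 1.5 * IQR of Q1/Q3
def clean(values):
    known = [v for v in values if v is not None]
    avg = sum(known) / len(known)
    filled = [avg if v is None else v for v in values]

    s = sorted(known)
    q1, q3 = s[len(s) // 4], s[3 * len(s) // 4]   # rough quartiles
    iqr = q3 - q1
    outliers = [v for v in filled
                if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return filled, outliers

filled, outliers = clean([13, 15, None, 19, 20, 21, 22, 25, 30, 70])
# 70 lies far above Q3 + 1.5 * IQR, so it is flagged as an outlier
```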
Data Preprocessing
Major Tasks in Data Preprocessing
Data integration combines data from multiple sources to
form a coherent data store
The resolution of semantic heterogeneity, metadata,
correlation analysis, tuple duplication detection, and data
conflict detection contribute to smooth data integration
Data Preprocessing
Major Tasks in Data Preprocessing
Data reduction techniques obtain a reduced representation
of the data while minimizing the loss of information content
These include methods of
Dimensionality reduction
Numerosity reduction
Data compression
Data Preprocessing
Major Tasks in Data Preprocessing: Data reduction
Dimensionality reduction reduces the number of random variables
or attributes under consideration
Numerosity reduction methods use parametric or nonparametric
models to obtain smaller representations of the original data
Parametric models store only the model parameters instead
of the actual data
Nonparametric methods include histograms, clustering,
sampling, and data cube aggregation
Data compression methods apply transformations to obtain a
reduced or “compressed” representation of the original data
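Two of the nonparametric numerosity-reduction methods named above, sampling and histograms, can be sketched briefly (the data and bucket count are placeholders of my own, for illustration only):

```python
# Nonparametric numerosity reduction: simple random sampling without
# replacement, and an equal-width histogram that stores bucket counts
# instead of the raw values
import random

data = list(range(1, 101))              # stand-in for a large attribute

# Sampling: keep a random 10% of the tuples
random.seed(42)                         # fixed seed for reproducibility
sample = random.sample(data, k=len(data) // 10)

# Equal-width histogram: 5 buckets over [min, max]
lo, hi, buckets = min(data), max(data), 5
width = (hi - lo) / buckets
counts = [0] * buckets
for v in data:
    i = min(int((v - lo) / width), buckets - 1)   # clamp max into last bucket
    counts[i] += 1
```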
Data Preprocessing
Major Tasks in Data Preprocessing:
Data transformation routines convert the data into appropriate
forms for mining
For example, in normalization, attribute data are scaled so as to
fall within a small range such as 0.0 to 1.0
Other examples are
Data discretization
Concept hierarchy generation
Data discretization transforms numeric data by mapping values to
interval or concept labels
Discretization techniques include binning, histogram analysis,
cluster analysis, decision tree analysis, and correlation analysis
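Min-max normalization to [0.0, 1.0] and a simple equal-width discretization into concept labels can be sketched together (the sample values and the "low/medium/high" labels are illustrative assumptions of my own):

```python
# Min-max normalization to [0.0, 1.0], followed by equal-width
# discretization of the normalized values into concept labels
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
# -> [0.0, 0.125, 0.25, 0.5, 1.0]

labels = ["low", "medium", "high"]
def label(x):
    # map a normalized value to one of the equal-width label intervals
    return labels[min(int(x * len(labels)), len(labels) - 1)]

discretized = [label(x) for x in normalized]
```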