Dr. Anil Chhangani
Associate Professor
CSC - 504
Data Warehousing & Mining
Data, Information & Knowledge
Data - a stream of raw facts representing things or events that have happened
Information - data that has been processed to make it meaningful & useful
Data + Meaning = Information
Knowledge - knowing what to do with data & information requires knowledge
Unit
Kilobytes (KB) 1,000 bytes
Megabytes (MB) 1,000 Kilobytes
Gigabytes (GB) 1,000 Megabytes
Terabytes (TB) 1,000 Gigabytes
Petabytes (PB) 1,000 Terabytes
Exabytes (EB) 1,000 Petabytes
Zettabytes (ZB) 1,000 Exabytes
Yottabytes (YB) 1,000 Zettabytes
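The factor-of-1,000 ladder in the table above can be sketched in a few lines; the list and function name are my own choices for illustration:

```python
# Decimal (SI) storage units from the table above: each step is a factor of 1,000.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one unit (KB = 1000^1, ..., YB = 1000^8)."""
    return 1000 ** (UNITS.index(unit) + 1)

print(bytes_in("TB"))  # 1 terabyte = 1000^4 = 1,000,000,000,000 bytes
```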
Why Data Mining?
• The Explosive Growth of Data: from
terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)
• Data rich but information poor!
• Data mining — Analysis of massive data sets
What is Data Mining?
- searching for knowledge (interesting patterns) in data
- analysing large amounts of data stored in a data
warehouse for useful information, making use of
statistical tools to find patterns which are
otherwise hidden
- an essential step in the process of knowledge
discovery
Data Mining Task Primitives
- The user has a data mining task in mind (a form of data analysis)
- A DM task can be specified in the form of a data mining query
- A DM query is defined in terms of data mining task primitives
- These primitives allow the user to communicate with the
DM system during discovery in order to
- direct the mining process
- examine the findings from different angles
or depths
The data mining primitives specify the following:
The set of task-relevant data to be mined:
- specifies the portions of the database or the set of
data in which the user is interested
- includes database attributes or data
warehouse dimensions of interest
The kind of knowledge to be mined:
Specifies the DM functions to be performed, such as
Characterization
Discrimination
Association or correlation analysis,
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis.
The background knowledge to be used in the
discovery process:
Useful for
- guiding the knowledge discovery process
- evaluating the patterns found
Concept hierarchies allow data to be mined
at multiple levels of abstraction
E.g. a concept hierarchy may be defined for attribute age;
the root represents the most general abstraction level,
denoted as all
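As a rough sketch, such an age hierarchy can be modelled in code; the category names and cut points below are invented for illustration, not taken from the slides:

```python
# A minimal sketch of a concept hierarchy for the attribute age.
# Level 0 is the root ("all", most general); level 1 holds assumed
# mid-level categories; level 2 is the raw value itself.
def age_concept(age: int, level: int) -> str:
    """Map a raw age to a concept at the requested abstraction level."""
    if level == 0:
        return "all"                  # root: most general abstraction
    if level == 1:
        if age < 25:                  # illustrative cut points only
            return "youth"
        elif age < 60:
            return "middle_aged"
        return "senior"
    return str(age)                   # leaf: the raw value

print(age_concept(42, 1))  # middle_aged
```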
The interestingness measures and thresholds for
pattern evaluation:
Used
- to guide the mining process, or
- to evaluate the discovered patterns
Different kinds of knowledge may have different
interestingness measures
E.g. interestingness measures for association
rules include support and confidence
Rules whose support and confidence values are below user-
specified thresholds are considered uninteresting
For example, suppose we want to know which products are
frequently purchased together (within the same transaction)
An example of such a rule, mined from an electronics store's
transactional database, is
buys(X,“computer”) -> buys(X, “software”) [support = 1%, confidence = 50%]
- X is a variable representing a customer
- confidence of 50% means that if a customer buys a
computer, there is a 50% chance that the customer will buy
software as well
- support of 1% means that 1% of all the transactions under analysis
show that computer and software are purchased together
This association rule involves a single attribute or predicate (i.e., buys),
hence it is referred to as a single-dimensional association rule
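Support and confidence for such a rule can be computed directly from a transaction list; the toy transactions below are made up for illustration:

```python
# Support and confidence for the rule buys(X, "computer") -> buys(X, "software"),
# over a small invented transaction list (each transaction is a set of items).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

def support(itemset, txns):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(lhs, rhs, txns):
    """Estimated P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs, txns) / support(lhs, txns)

print(support({"computer", "software"}, transactions))       # 0.5
print(confidence({"computer"}, {"software"}, transactions))  # 2/3
```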
The expected representation for visualizing the
discovered patterns:
The kind of knowledge to be mined:
Specifies DM functions to be performed,
Characterization
Discrimination
Association or correlation analysis,
Classification
Prediction
Clustering
Outlier analysis
Evolution analysis
Characterization/ Descriptions
Data can be associated with classes or
concepts
E.g.
classes of items – computers, printers
concepts of customers – bigSpenders, budgetSpenders
Characterization / Descriptions
Descriptions can be derived via
Data characterization – summarizing the
general characteristics of a target class
E.g. summarizing the characteristics of customers
who spend more than Rs10,000 a year
Result can be a general profile of the customers,
such as 40 – 50 years old, employed, have
excellent credit ratings
Data discrimination – comparing the target class with one
or a set of comparative classes
E.g. compare the general features of software
products whose sales increased by 10% in the
last year with those whose sales
decreased by 30% during the same period
Or both of the above
Mining Frequent Patterns, Associations & Correlations
Frequent itemset: a set of items that frequently appear
together e.g. milk and bread
Frequent subsequence: a pattern in which customers tend to
purchase product A, followed by a purchase of product B
Mining Frequent Patterns, Associations & Correlations
Association Analysis: find frequent patterns
E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys( X, “software”)
[support= 1%, confidence = 50%]
if a customer buys a computer, there is a 50% chance that
he/she will buy software
1% of all of the transactions under analysis showed that
computer and software are purchased together
Association rules are discarded as uninteresting if they do
not satisfy both minimum threshold values (minimum
support and minimum confidence)
Mining Frequent Patterns, Associations & Correlations
Correlation Analysis: additional analysis to find statistical
correlations between associated pairs
Classification and Prediction
Classification:
The process of finding a model that describes & distinguishes
the data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class
label is unknown
The derived model is based on the analysis of a set of training
data (data objects whose class label is known)
The model can be represented as classification (IF-THEN)
rules, decision trees, neural networks, etc.
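A derived model in IF-THEN rule form can be as simple as the sketch below; the attributes, thresholds, and class names (borrowed from the bigSpenders/budgetSpenders example earlier) are illustrative assumptions only:

```python
# A hand-written IF-THEN rule model of the kind a classifier might derive.
# The rule conditions are invented for illustration, not mined from real data.
def classify_customer(age: int, income: int) -> str:
    # IF age is 40-50 AND income is high THEN class = bigSpender
    if 40 <= age <= 50 and income > 100_000:
        return "bigSpender"
    # ELSE (default rule) class = budgetSpender
    return "budgetSpender"

print(classify_customer(45, 120_000))  # bigSpender
print(classify_customer(30, 40_000))   # budgetSpender
```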
Classification and Prediction
Prediction
Predict missing or unavailable numerical data values or
a class label for some data
Cluster Analysis
Class label is unknown ( unsupervised) : group data to
form new classes
Clusters of objects are formed based on the principle
of maximizing intraclass similarity and
minimizing interclass similarity
E.g. Identify homogeneous subpopulations of
customers. These clusters may represent individual
target groups for marketing.
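One common way to form such clusters is k-means; below is a minimal 1-D sketch (the spend values and starting centers are invented for illustration):

```python
# A minimal 1-D k-means sketch: values are grouped so that intraclass
# similarity is high (each value is near its cluster center).
def kmeans_1d(values, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: each value joins its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [100, 120, 110, 900, 950, 1000]       # invented annual spend per customer
centers, clusters = kmeans_1d(spend, [100, 1000])
print(clusters)   # two homogeneous subpopulations of customers
```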
Evolution Analysis
Describes and models regularities or trends for
objects whose behavior changes over time
E.g. Identify stock evolution regularities for overall
stocks and for the stocks of particular companies
Major Issues in Data Mining
Issues in data mining research, are grouped into
five groups:
Mining methodology
User interaction
Efficiency and scalability
Diversity of data types
Data mining and society
Major Issues in Data Mining
Mining methodology: Various aspects of mining
methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining
User Interaction: Various aspects of it are
Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results
Major Issues in Data Mining
Efficiency and Scalability: Various aspects of it are
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Major Issues in Data Mining
Diversity of Database Types: Various aspects of it are
Handling complex types of data
Mining dynamic, networked, and global data repositories
Major Issues in Data Mining
Data Mining and Society: Various aspects of it are
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
Outlier Analysis
Data that do not comply with the general behavior or
model
Outliers are usually discarded as noise or exceptions
Useful for fraud detection
E.g. Detect purchases of extremely large amounts
Knowledge Discovery from Data (KDD) Process
The KDD process is shown in Figure as an iterative
sequence of the following steps:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
1. Data cleaning - to remove noise and inconsistent data
2. Data integration - where multiple data sources may be combined
3. Data selection - where data relevant to the analysis task are
retrieved from the database
4. Data transformation - where data are transformed and consolidated
into forms appropriate for mining by performing summary or
aggregation operations
5. Data mining - an essential process where intelligent methods are
applied to extract data patterns
6. Pattern evaluation - to identify interesting patterns representing
knowledge based on interestingness measures
7. Knowledge presentation - where visualization and knowledge
representation techniques are used to present mined knowledge
to users
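The seven steps can be sketched as a pipeline of functions; the bodies below are trivial stand-ins (only the sequencing mirrors the process), with a frequency count playing the role of the mining step:

```python
# The KDD process as a toy pipeline; every function body is a placeholder.
def clean(data):          # 1. remove noise / inconsistent records
    return [d for d in data if d is not None]

def integrate(*sources):  # 2. combine multiple data sources
    return [row for src in sources for row in src]

def select(data, pred):   # 3. retrieve task-relevant data
    return [d for d in data if pred(d)]

def transform(data):      # 4. consolidate into a form suitable for mining
    return sorted(data)

def mine(data):           # 5. apply an "intelligent method" (here: counting)
    return {v: data.count(v) for v in set(data)}

def evaluate(patterns, min_count):  # 6. keep only interesting patterns
    return {k: v for k, v in patterns.items() if v >= min_count}

def present(patterns):    # 7. knowledge presentation
    return ", ".join(f"{k} x{v}" for k, v in sorted(patterns.items()))

raw = integrate([3, 1, None, 3], [2, 3, 1])
result = present(evaluate(mine(transform(select(clean(raw), lambda x: x > 0))), 2))
print(result)  # 1 x2, 3 x3
```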
What Is an Attribute
An attribute is a data field, representing a characteristic or
feature of a data object
The nouns attribute, dimension, feature, and variable are
often used interchangeably
Dimension is commonly used in data warehousing
Machine learning literature tends to use the term feature
Statisticians prefer the term variable
Data mining and database professionals commonly use the
term attribute
Attribute
Attributes describing a customer object can include, for
example, customer ID, name, and address
Observed values for a given attribute are known as
observations
A set of attributes used to describe a given object is called an
attribute vector (or feature vector)
The distribution of data involving one attribute (or variable) is
called univariate
A bivariate distribution involves two attributes, and so on
Types of Attribute
The type of an attribute is determined by the set of possible
values the attribute can have
Following are types of attributes
Nominal
Binary
Ordinal
Numeric
Types of Attribute : Nominal
Nominal means “relating to names”
Values of nominal attribute are symbols or names of things
Each value represents some kind of category, code, or state,
Nominal attributes are also referred to as categorical
Values do not have any meaningful order
Values are also known as enumerations
Types of Attribute : Nominal
Example: hair color is an attribute describing person
objects
Possible values for hair color are black, brown, blond, red,
pink, gray, and white
Another example of a nominal attribute is occupation, with
the values professor, dentist, programmer, farmer, and so on
It is possible to represent symbols or “names” with numbers.
With hair color, for instance, we can assign a code of 0 for
black, 1 for brown, and so on
Types of Attribute : Nominal
Because nominal attribute values do not have any meaningful
order and are not quantitative, it makes no sense to
find the mean (average) value or median (middle) value for
such an attribute, given a set of objects
One thing that is of interest, however, is the attribute’s most
commonly occurring value
This value, known as the mode, is one of the measures of
central tendency
Types of Attribute : Binary
A binary attribute is a nominal attribute with only two
categories or states: 0 or 1, where
0 typically means that the attribute is absent, and
1 means that it is present
Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Types of Attribute : Binary
Example : Suppose a patient undergoes a medical test that
has two possible outcomes
The attribute medical test is binary, where a value of 1 means
the result of the test for the patient is positive, while 0 means
the result is negative
A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight
It means, there is no preference on which outcome should be
coded as 0 or 1
Example could be the attribute gender having the states male
and female
Types of Attribute : Binary
A binary attribute is asymmetric if the outcomes of the states
are not equally important, e.g. the positive and negative
outcomes of a medical test
By convention, we code the most important outcome, which
is usually the rarest one, by 1 (positive) and the other by 0
Types of Attribute : Ordinal
Ordinal Attributes
An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known
Ordinal attributes may be obtained from the discretization of
numeric quantities by splitting the value range into a finite
number of ordered categories
Central tendency of an ordinal attribute can be represented by
its mode and its median, but the mean cannot be defined.
Types of Attribute : Ordinal
Ordinal Attributes
Nominal, binary, and ordinal attributes are qualitative
That is, they describe a feature of an object without giving an
actual size or quantity
The values of such qualitative attributes are typically words
representing categories
Types of Attribute : Ordinal
Ordinal Attributes
Example : Suppose that drink size corresponds to the size of
drinks available at a fast-food restaurant
Attribute has 3 possible values: small, medium, and large
The values have a meaningful sequence (which corresponds
to increasing drink size); however, we cannot tell from the
values how much bigger, say, a large is than a medium
Other examples of ordinal attributes include grade (e.g., A+,
A, A-, B+,and so on) and professional rank
Types of Attribute : Ordinal
Example :Professional ranks can be enumerated in a
sequential order for example, assistant, associate, and full for
professors
Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively
Ordinal attributes are often used in surveys for ratings
E.g. customer satisfaction may have the following ordinal categories: 0:
very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3:
satisfied, and 4: very satisfied
Types of Attribute : Numeric
Numeric Attribute : is quantitative
It is a measurable quantity, represented in integer or real values
Numeric attributes can be interval-scaled or ratio-scaled
Interval-scaled attributes are measured on a scale of equal-
size units
The values of interval-scaled attributes have order and can be
positive, 0, or negative
In addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between
values
Types of Attribute : Numeric- Interval-scaled
Example: A temperature attribute is interval-scaled
The outdoor temperature value for a number of different days,
where each day is an object
By ordering the values, we obtain a ranking of the objects
with respect to temperature
We can also quantify the difference between values, a temp of
20 is five degrees higher than a temperature of 15
Calendar dates are another example. For instance, the years
2002 and 2022 are twenty years apart
Because such attributes are numeric, we can compute their
mean, median, and mode measures of central tendency
Types of Attribute : Numeric- Ratio-scaled
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an
inherent zero-point
If a measurement is ratio-scaled, we can speak of a value as
being a multiple (or ratio) of another value
In addition, the values are ordered, and we can also compute
the difference between values, as well as the mean, median,
and mode.
Types of Attribute : Numeric- Ratio-scaled
Examples: count attributes such as years of experience
(e.g., the objects are employees),
number of words (e.g., the objects are documents), and
monetary quantities (e.g., you are 100 times richer with
Rs1000 than with Rs10)
Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have
an overall picture of your data
Basic statistical descriptions can be used to identify properties
of the data and highlight which data values should be treated
as noise or outliers
There are three areas of basic statistical descriptions
Measures of central tendency which measure the location of
the middle or center of a data distribution
We discuss the mean, median, mode, and midrange
Statistical Descriptions of Data
The dispersion of the data : That is, how are the data spread
out? The most common data dispersion measures are the
Range
Quartiles
Interquartile range
Five-number summary and boxplots
Variance and standard deviation
These measures are useful for identifying outliers
Statistical Descriptions of Data
Graphic displays of basic statistical descriptions to visually
inspect our data
Most statistical or graphical data presentation software
packages include bar charts, pie charts, and line graphs
Other popular displays of data summaries and distributions
include quantile plots, quantile–quantile plots, histograms,
and scatter plots
Measures of Central Tendency
Mean, Median, Mode, and Midrange
Measures of central tendency include the mean, median,
mode, and midrange
The most common and effective numeric measure of the
“center” of a set of data is the (arithmetic) mean
For a set of N values x1, x2, ..., xN, the mean is
(x1 + x2 + ... + xN) / N
Example (Mean): Suppose we have the following values for
salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12
     = 696 / 12 = 58
Thus, the mean salary is $58,000
Although the mean is the single most useful quantity for
describing a data set, it is not always the best way of
measuring the center of the data
A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values
Even a small number of extreme values can corrupt the mean
For ex, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers
To offset the effect caused by a small number of extreme
values, we can instead use the trimmed mean
It is the mean obtained after chopping off values at the high
and low extremes.
For example, we can sort the values observed for salary and
remove the top and bottom 2% before computing the mean
Avoid trimming too large a portion (such as 20%) at both
ends, as this can result in the loss of valuable information
For skewed (asymmetric) data, a better measure of the center
of data is the median
The median is the middle value in a set of ordered data values;
it separates the higher half of a data set from the lower half
Suppose a data set of N values for an attribute X is sorted in
increasing order
If N is odd, then the median is the middle value of the ordered
set
If N is even, then the median is taken as the average of the
two middlemost values
Example (Median): Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
In theory the median can be any value within the two middlemost
values of 52 and 56; by convention we take their average as the
median; that is, (52 + 56) / 2 = 54
Thus, the median is $54,000
The mode is another measure of central tendency
Mode for a set of data is the value that occurs most frequently
in the set
Data sets with one, two, or three modes are respectively
called unimodal, bimodal, and trimodal
In general, a data set with two or more modes is multimodal
If each data value occurs only once, then there is no mode
Example (Mode): Suppose we have the following values for salary (in
thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The two modes are $52,000 and $70,000
For unimodal numeric data that are moderately skewed
(asymmetrical), we have the following empirical relation:
Mean - Mode = 3 (Mean - Median)
The midrange can also be used to assess the central tendency
of a numeric data set
It is the average of the largest and smallest values in the set
Midrange : Suppose we have the following values for salary
(in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Midrange = ( 30,000 + 110,000 ) / 2 = $70,000
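The four measures for the salary example can be checked with Python's statistics module:

```python
# Verifying the slide's salary example (values in thousands of dollars).
import statistics

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salary)              # 696 / 12 = 58  -> $58,000
median = statistics.median(salary)          # (52 + 56) / 2 = 54 -> $54,000
modes = statistics.multimode(salary)        # [52, 70] -> bimodal
midrange = (min(salary) + max(salary)) / 2  # (30 + 110) / 2 = 70 -> $70,000

print(mean, median, modes, midrange)
```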
In a unimodal frequency curve with perfect symmetric data
distribution, the mean, median, and mode are all at the same
center value, as shown in Figure (a)
Data in most real applications are not symmetric. They may
instead be either positively skewed, where the mode occurs at
a value that is smaller than the median(Figure b), or
Negatively skewed, where the mode occurs at a value greater
than the median (Figure c)
•If mean = median = mode, the shape of the distribution is
symmetric
•If mode < median < mean, the shape of the distribution trails
to the right, is positively skewed
•If mean < median < mode, the shape of the distribution trails
to the left, is negatively skewed
Measuring Dispersion of Data
Range, Quartiles, and Interquartile Range
To assess the dispersion or spread of numeric data, the
measures include range, quantiles, quartiles, percentiles,
and the interquartile range
The five-number summary, which can be displayed as a
boxplot, is useful in identifying outliers
Variance and standard deviation also indicate the spread of
a data distribution
Let x1,x2,…… ,xN be a set of observations for some numeric
attribute, X
The range of the set is the difference between the largest
(max()) and smallest (min()) values
Suppose the data for attribute X are sorted in increasing
numeric order
We can pick certain data points so as to split the data
distribution into equal-size consecutive sets, as in the figure;
these data points are called quantiles
Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size
consecutive sets.
4-quantiles are the 3 data points that split the data
distribution into 4 equal parts; each part represents one-fourth
of the data distribution, referred to as quartiles
100-quantiles are referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets
The median, quartiles, and percentiles are the most widely
used forms of quantiles
The quartiles give an indication of a
distribution’s center, spread, and shape.
The first quartile, denoted by Q1, is
the 25th percentile. It cuts off the
lowest 25% of the data.
Third quartile, Q3, is the 75th percentile -it cuts off the
lowest 75% (or highest 25%) of the data
Second quartile is the 50th percentile, as the median, it
gives the center of the data distribution
The distance between the first and third
quartiles is a simple measure of spread
that gives the range covered by the
middle half of the data
This distance is called the Interquartile Range (IQR) and
is defined as
IQR = Q3 - Q1
To find the first, second, and third quartiles:
1. Arrange the N data values into an array
2. First quartile, Q1 = data value at position (N + 1)/4
3. Second quartile, Q2 = data value at position 2(N + 1)/4
4. Third quartile, Q3 = data value at position 3(N + 1)/4
* Use N instead of N + 1, if N is even.
Example (Interquartile Range): Suppose we have the following
values for salary (in thousands of dollars), in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Quartiles for this data are the 3rd, 6th, and 9th values:
Q1 = $47,000 and Q3 = $63,000
Interquartile range IQR = 63 - 47 = $16,000
The sixth value, $52,000, is taken as the median (Q2) under this rule
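The position rule above (use N instead of N + 1 when N is even) can be sketched as:

```python
# Quartiles by the slide's 1-indexed position rule:
# positions (N + 1)/4, 2(N + 1)/4, 3(N + 1)/4; use N instead of N + 1 when N is even.
def quartiles(sorted_vals):
    n = len(sorted_vals)
    base = n if n % 2 == 0 else n + 1
    # k * base // 4 gives the 1-indexed position; subtract 1 for 0-indexing
    return [sorted_vals[k * base // 4 - 1] for k in (1, 2, 3)]

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = quartiles(salary)
print(q1, q2, q3, "IQR =", q3 - q1)  # 47 52 63 IQR = 16
```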
Measuring Dispersion of Data
Five-Number Summary, Boxplots, and Outliers:
The five-number summary of a distribution consists of the
median (Q2), the quartiles Q1 and Q3, and the smallest and
largest individual observations
Five-Number Summary written in the order of
Minimum, Q1, Median(Q2), Q3, Maximum
Boxplots are a popular way of visualizing a distribution
A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles so that the
box length is the interquartile range
The median is marked by a line within the box
Two lines (called whiskers) outside the box extend to the
smallest (Minimum) and largest (Maximum) observations
Outliers : A common rule of thumb for identifying suspected
outliers is to single out values falling at least 1.5*IQR above
the third quartile or below the first quartile
Boxplot: Figure shows boxplots for unit price data for items
sold at four branches of AllElectronics during a given period
For branch 1, the median price of items sold is $80,
Q1 is $60, and Q3 is $100
Notice that two outlying observations for this branch were
plotted individually, as their values of 175 and 202 are more
than 1.5 times the IQR here of 40
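The 1.5 × IQR rule can be checked against the branch-1 figures quoted above (Q1 = $60, Q3 = $100); the price list below is a made-up sample containing the two outlying observations:

```python
# The 1.5 x IQR rule of thumb: flag values beyond the "fences"
# Q1 - 1.5 * IQR and Q3 + 1.5 * IQR as suspected outliers.
def outliers(values, q1, q3):
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# invented unit prices including the branch-1 outlying observations
prices = [60, 80, 100, 175, 202]
print(outliers(prices, q1=60, q3=100))  # [175, 202], both beyond 100 + 1.5 * 40 = 160
```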
Example: Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal,trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
Solution:
N = 27
Sum = 809
Mean = 809 / 27 ≈ 29.96
Median (14th value) = 25
Mode = 25, 35 (bimodal)
Midrange = (13 + 70) / 2 = 41.5
Q1 (7th value) { (N + 1) / 4 } = 20
Q2 (14th value) { 2 (N + 1) / 4 } = 25
Q3 (21st value) { 3 (N + 1) / 4 } = 35
Five-number summary: 13, 20, 25, 35, 70
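The solution can be verified in a few lines with the statistics module:

```python
# Verifying the age exercise: mean, median, modes, quartiles, five-number summary.
import statistics

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
       33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(len(age), sum(age))             # 27 809
print(round(sum(age) / len(age), 2))  # mean = 29.96
print(statistics.median(age))         # 25 (the 14th value)
print(statistics.multimode(age))      # [25, 35] -> bimodal

n = len(age)
q1 = age[(n + 1) // 4 - 1]            # 7th value
q2 = age[2 * (n + 1) // 4 - 1]        # 14th value
q3 = age[3 * (n + 1) // 4 - 1]        # 21st value
print([min(age), q1, q2, q3, max(age)])  # five-number summary: 13 20 25 35 70
```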
Binning: Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values around it
The sorted values are distributed into a number of “buckets,”
or bins
Because binning methods consult the neighborhood of values,
they perform local smoothing
Figure below illustrates some binning techniques
In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3
Binning: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9
Therefore, each original value in this bin is replaced by the
value 9
Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median
In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin
boundaries
Measuring Dispersion of Data
Binning:
Each bin value is then replaced by the closest boundary value.
In general, the larger the width, the greater the effect of the
smoothing
Alternatively, bins may be equal width, where the interval
range of values in each bin is constant
Binning is also used as a discretization technique and is
discussed further later
Binning: MU Question
Suppose a group of sales price records has been sorted as follows
6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232
Partition them into 3 bins by the equal-frequency partitioning method.
Perform data smoothing by bin means.
Sol: Partition into 3 bins, each of size 4
Bin 1: 6, 9, 12, 13      Bin 1 mean = (6 + 9 + 12 + 13) / 4 = 10
Bin 2: 15, 25, 50, 70    Bin 2 mean = 160 / 4 = 40
Bin 3: 72, 92, 204, 232  Bin 3 mean = 600 / 4 = 150
Smoothing by Bin Means
Bin 1: 10, 10, 10, 10
Bin 2: 40, 40, 40, 40
Bin 3: 150, 150, 150, 150
Smoothing by Bin Boundaries
Bin 1: 6, 6, 13, 13
Bin 2: 15, 15, 70, 70
Bin 3: 72, 72, 232, 232
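The equal-frequency partitioning and both smoothing methods from the worked example can be sketched in a few lines of Python (this is a minimal illustration, not library code; ties between boundaries are broken toward the lower one):

```python
# Equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the worked example above
prices = [6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204, 232]

k = 3                                   # number of bins
size = len(prices) // k                 # equal frequency: 4 values per bin
bins = [prices[i * size:(i + 1) * size] for i in range(k)]

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer
# of the bin's min and max (data are sorted, so those are the ends)
def to_boundary(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [to_boundary(b) for b in bins]
```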
Data Preprocessing
Why Preprocess the Data?
Data have quality if they satisfy the requirements of the
intended use
There are many factors that define data quality, including:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Inaccurate, incomplete, and inconsistent data are
commonplace properties of large real-world databases and
data warehouses
Data Preprocessing
Major Tasks in Data Preprocessing
Major steps involved in data preprocessing, namely,
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Preprocessing
Major Tasks in Data Preprocessing: figure from Han & Kamber,
Data Mining: Concepts and Techniques
Data Preprocessing
Major Tasks in Data Preprocessing
Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct
inconsistencies in the data
Data cleaning is usually performed as an iterative two-step
process consisting of
Discrepancy detection and
Data transformation
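The two-step cleaning idea can be sketched in a few lines of Python. This is an illustrative sketch of my own (the function name and the mean-fill / 1.5 × IQR choices are assumptions, not from the slides): missing values are filled with the attribute mean, and values outside 1.5 × IQR of the quartiles are flagged as possible discrepancies:

```python
# Minimal data-cleaning sketch (illustrative): fill missing values with
# the attribute mean, then flag outliers outside 1.5 * IQR of Q1/Q3
def clean(values):
    known = [v for v in values if v is not None]
    avg = sum(known) / len(known)
    filled = [avg if v is None else v for v in values]

    s = sorted(known)
    q1, q3 = s[len(s) // 4], s[3 * len(s) // 4]   # rough quartiles
    iqr = q3 - q1
    outliers = [v for v in filled
                if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return filled, outliers

filled, outliers = clean([13, 15, None, 19, 20, 21, 22, 25, 30, 70])
# 70 lies far above Q3 + 1.5 * IQR, so it is flagged as an outlier
```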
Data Preprocessing
Major Tasks in Data Preprocessing
Data integration combines data from multiple sources to
form a coherent data store
The resolution of semantic heterogeneity, metadata,
correlation analysis, tuple duplication detection, and data
conflict detection contribute to smooth data integration
Data Preprocessing
Major Tasks in Data Preprocessing
Data reduction techniques obtain a reduced representation
of the data while minimizing the loss of information content
These include methods of
Dimensionality reduction
Numerosity reduction
Data compression
Data Preprocessing
Major Tasks in Data Preprocessing: Data reduction
Dimensionality reduction reduces the number of random variables
or attributes under consideration
Numerosity reduction methods use parametric or nonparametric
models to obtain smaller representations of the original data
Parametric models store only the model parameters instead
of the actual data
Nonparametric methods include histograms, clustering,
sampling, and data cube aggregation
Data compression methods apply transformations to obtain a
reduced or “compressed” representation of the original data
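Two of the nonparametric numerosity-reduction methods named above, sampling and histograms, can be sketched briefly (the data and bucket count are placeholders of my own, for illustration only):

```python
# Nonparametric numerosity reduction: simple random sampling without
# replacement, and an equal-width histogram that stores bucket counts
# instead of the raw values
import random

data = list(range(1, 101))              # stand-in for a large attribute

# Sampling: keep a random 10% of the tuples
random.seed(42)                         # fixed seed for reproducibility
sample = random.sample(data, k=len(data) // 10)

# Equal-width histogram: 5 buckets over [min, max]
lo, hi, buckets = min(data), max(data), 5
width = (hi - lo) / buckets
counts = [0] * buckets
for v in data:
    i = min(int((v - lo) / width), buckets - 1)   # clamp max into last bucket
    counts[i] += 1
```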
Data Preprocessing
Major Tasks in Data Preprocessing:
Data transformation routines convert the data into appropriate
forms for mining
For example, in normalization, attribute data are scaled so as to
fall within a small range such as 0.0 to 1.0
Other examples are
Data discretization
Concept hierarchy generation
Data discretization transforms numeric data by mapping values to
interval or concept labels
Discretization techniques include binning, histogram analysis,
cluster analysis, decision tree analysis, and correlation analysis
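Min-max normalization to [0.0, 1.0] and a simple equal-width discretization into concept labels can be sketched together (the sample values and the "low/medium/high" labels are illustrative assumptions of my own):

```python
# Min-max normalization to [0.0, 1.0], followed by equal-width
# discretization of the normalized values into concept labels
values = [200, 300, 400, 600, 1000]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
# -> [0.0, 0.125, 0.25, 0.5, 1.0]

labels = ["low", "medium", "high"]
def label(x):
    # map a normalized value to one of the equal-width label intervals
    return labels[min(int(x * len(labels)), len(labels) - 1)]

discretized = [label(x) for x in normalized]
```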