Chapter 2 Data Mining
n Record
n Relational records
n Data matrix, e.g., numerical matrix, crosstabs
n Document data: text documents: term-frequency vector
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
n Transaction data
n Graph and network
n Binary
n Numeric: quantitative
n Interval-scaled
n Ratio-scaled
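As a rough illustration of the term-frequency vector mentioned for document data, the sketch below counts how often each vocabulary term occurs in a text. The function name `term_frequency_vector`, the five-term vocabulary, and the sample sentence are assumptions made for this example, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "the coach said the team will play and play until the team can score"

print(term_frequency_vector(doc, vocabulary))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, which is why document data can be treated as record data.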
Attribute Types
n Nominal: “relating to names”
n The values of a nominal attribute are symbols or names of
things.
n Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical.
n In computer science, the values are also known as
enumerations.
n Hair_color = {auburn, black, blond, brown, grey, red, white}
n marital status, occupation, ID numbers, zip codes
n Although we said that the values of a nominal attribute are
symbols or “names of things,” it is possible to represent
such symbols or “names” with numbers.
n With hair color, for instance, we can assign a code of 0 for black, 1 for
brown, and so on.
n However, in such cases, the numbers are not intended to be used
quantitatively.
Attribute Types
n Binary
n Nominal attribute with only 2 categories or states (0 and 1),
where 0 typically means that the attribute is absent, and 1 means
that it is present.
n Binary attributes are referred to as Boolean if the two states
correspond to true and false.
n e.g., smoker: 1 if the patient smokes, 0 if the patient does not.
n Symmetric binary: if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which
outcome should be coded as 0 or 1.
n e.g., gender (male and female)
n Asymmetric binary: if the outcomes of the states are not equally
important.
n e.g., medical test (positive vs. negative)
n Convention: assign 1 to most important outcome (e.g., HIV
positive)
Attribute Types
n Ordinal
n An ordinal attribute has possible values with a meaningful
order or ranking among them, but the magnitude between
successive values is not known.
n drink size = {small, medium, large}, grades, army rankings
n Other examples of ordinal attributes include grade (e.g., A+,
A, A-, B+ and so on) and professional rank.
n Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively;
thus ordinal attributes are often used in surveys for ratings.
n For example, customer satisfaction may be rated on the following ordinal scale:
n 0: very dissatisfied,
n 1: somewhat dissatisfied,
n 2: neutral,
n 3: satisfied, and
n 4: very satisfied
Attribute Types
n Note that nominal, binary, and ordinal attributes
are qualitative. That is, they describe a feature
of an object without giving an actual size or
quantity.
n The values of such qualitative attributes are
typically words representing categories.
n If integers are used, they represent computer
codes for the categories, as opposed to
measurable quantities (e.g., 0 for small drink size,
1 for medium, and 2 for large).
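The point that integer codes for qualitative attributes are labels rather than quantities can be sketched in Python. The class `DrinkSize`, the sample data, and the code 2 for red hair are assumptions introduced for this example; only the drink-size codes and the codes 0 for black and 1 for brown come from the slides.

```python
from collections import Counter
from enum import IntEnum


class DrinkSize(IntEnum):
    """Ordinal attribute: the integer codes only record the ranking."""
    SMALL = 0
    MEDIUM = 1
    LARGE = 2


# Hypothetical sample of orders.
orders = [DrinkSize.LARGE, DrinkSize.SMALL, DrinkSize.MEDIUM,
          DrinkSize.SMALL, DrinkSize.LARGE]

# Order comparisons are meaningful for ordinal values, so a middle value exists.
ranked = sorted(orders)
middle_size = ranked[len(ranked) // 2]

# Nominal attribute: codes are labels, so only equality and counting make sense.
hair_codes = {"black": 0, "brown": 1, "red": 2}  # 2 for red is an assumed code
hair_names = {code: name for name, code in hair_codes.items()}
coded_sample = [1, 0, 1, 2]  # hypothetical coded observations
most_common_code, _ = Counter(coded_sample).most_common(1)[0]

print(middle_size.name)
print(hair_names[most_common_code])
```

Averaging the hair-color codes would run without error but would not describe anything, which is the sense in which the numbers are not used quantitatively.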
Numeric Attribute Types
n A numeric attribute is quantitative; that is, it is a
measurable quantity, represented in integer or real values.
n Numeric attributes can be interval-scaled or ratio-
scaled.
n Interval-Scaled Attributes
n are measured on a scale of equal-size units.
Discrete vs. Continuous Attributes
n Discrete Attribute
n A discrete attribute has a finite or countably infinite set
of values, which may or may not be represented as
integers.
n E.g., zip codes, profession, or the set of words in a
collection of documents
n Note: Binary attributes are a special case of discrete
attributes
n Continuous Attribute / Numeric Attribute
n Continuous attributes are typically represented as
floating-point variables.
n E.g., temperature, height, or weight
Basic Statistical Descriptions of Data
n Motivation
n To better understand the data: central tendency,
variation and spread
n Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Numerical dimensions correspond to sorted intervals
n Data dispersion: analyzed with multiple granularities
of precision
n Boxplot or quantile analysis on sorted intervals
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
n Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
n Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
n Trimmed mean: chopping extreme values
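The three means on this slide can be sketched in a few lines of Python. The function names, the `fraction` parameter of `trimmed_mean`, and the sample values are assumptions for illustration; they are not the data set used in the lecture.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample (e.g., salaries in thousands of dollars).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(values))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(trimmed_mean(values, 0.1))
```

With these values the single large observation (110) pulls the plain mean upward, while the trimmed mean discards it along with the smallest value.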
Measuring the Central Tendency
n Median:
n Middle value if odd number of values, or average of
the middle two values otherwise
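n Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$$
where $L_1$ is the lower boundary of the median interval, $n$ is the number of values, $(\sum \text{freq})_l$ is the sum of the frequencies of all intervals below the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.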
n What is our median?
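The interpolation formula for grouped data can be sketched as follows. The function name `grouped_median`, the tuple layout of the intervals, and the sample frequencies are assumptions for this example; the median interval is taken to be the first interval at which the cumulative frequency reaches n/2.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    intervals: list of (lower_bound, upper_bound, frequency), sorted by
    lower_bound and covering the data without gaps (an assumed layout).
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # L1 + ((n/2 - (sum freq)_l) / freq_median) * width
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq


# Hypothetical grouped data: three intervals of width 10.
intervals = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_median(intervals))
```

The result is an estimate: it assumes the values inside the median interval are spread evenly across it.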
Measuring the Central Tendency
n Mode
n Value that occurs most frequently in the data
n Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
n In general, a data set with two or more modes is multimodal.
n Empirical formula (for moderately skewed unimodal data): mean - mode ≈ 3 × (mean - median)
n What is our mode?
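A minimal sketch of finding the mode(s) and naming the modality is shown below. The helper names and the sample values are assumptions for this example, as is the choice to report no mode when every value occurs exactly once.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # every value occurs once: treated here as having no mode
    return [value for value, count in counts.items() if count == top]


def modality(values):
    """Name the data set by its number of modes."""
    names = {0: "no mode", 1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(len(modes(values)), "multimodal")


# Hypothetical sample.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(values), modality(values))
```

The standard library's `statistics.multimode` returns the same list of modes, except that it reports every value when all frequencies are equal.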
Measuring the Central Tendency
n Midrange
n The midrange can also be used to assess the central tendency of a
numeric data set.
n It is the average of the largest and smallest values in the set. This measure
is easy to compute using the SQL aggregate functions, max() and min().
n What is our midrange?
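The midrange can be sketched both directly in Python and through the SQL aggregates mentioned above. The table name `salary`, the column name `amount`, and the sample values are assumptions; an in-memory SQLite database stands in for whatever database the data actually lives in.

```python
import sqlite3

# Hypothetical sample.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Direct computation: average of the largest and smallest values.
midrange = (max(values) + min(values)) / 2

# The same measure using the SQL aggregate functions MAX() and MIN().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salary (amount REAL)")
conn.executemany("INSERT INTO salary VALUES (?)", [(v,) for v in values])
(sql_midrange,) = conn.execute(
    "SELECT (MAX(amount) + MIN(amount)) / 2.0 FROM salary"
).fetchone()
conn.close()

print(midrange, sql_midrange)
```

Because it depends only on the two extreme values, the midrange is very sensitive to outliers.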
Symmetric vs. Skewed Data
n Median, mean and mode of symmetric, positively and
negatively skewed data
Measuring the Dispersion of Data
n The range of the set is the difference between the largest (max()) and
smallest (min()) values.
n Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.
n The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
n The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles (the 25th, 50th, and 75th percentiles).
n The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets.
n Interquartile range (IQR): the distance between the first and third
quartiles, IQR = Q3 - Q1. It is a simple measure of spread that gives the
range covered by the middle half of the data.
n Example:
n Q1 = $47,000
n Q3 = $63,000
n IQR = $63,000 - $47,000
n IQR = $16,000
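Range, quartiles, and IQR can be computed with Python's standard library, as in the sketch below. The sample values are assumed for illustration. `statistics.quantiles` uses one particular interpolation convention by default, so other tools may report slightly different quartiles for the same data.

```python
from statistics import median, quantiles

# Hypothetical sample.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: difference between the largest and smallest values.
value_range = max(values) - min(values)

# The 4-quantiles (quartiles) are three cut points: Q1, Q2 (the median), Q3.
q1, q2, q3 = quantiles(values, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

# The slide's own example, using the quartiles it states.
example_iqr = 63_000 - 47_000

print(value_range, q1, q2, q3, iqr, example_iqr)
```

Passing `n=100` to the same function would return the 99 percentile cut points.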
Measuring the Dispersion of Data
n Five number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order of Minimum, Q1, Median, Q3,
Maximum.
n Boxplots are a popular way of visualizing a distribution
n The ends of the box are the quartiles; the median is marked; add
whiskers, and plot outliers individually
n Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
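The five-number summary and the 1.5 × IQR rule can be sketched as below. The function names, the parameter `k`, and the sample values are assumptions for this example, and the quartiles follow the default convention of `statistics.quantiles`.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


def outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical sample.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(values))
print(outliers(values))
```

In a boxplot, the whiskers would stop at the most extreme observations inside the fences, and the values returned by `outliers` would be drawn as individual points.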
Boxplot Analysis
Measuring the Dispersion of Data
Visualization of Data Dispersion: 3-D Boxplots
Histogram Analysis
n Histogram: Graph display of tabulated frequencies, shown as bars
n It shows what proportion of cases fall into each of several categories
n Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
n The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
[Figure: example histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
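The tabulation behind a histogram can be sketched as counting values into adjacent, non-overlapping, equal-width intervals. The function name, the bin boundaries, and the sample values are assumptions for this example and do not reproduce the figure on the slide.

```python
def histogram_counts(values, start, width, bins):
    """Count values in `bins` adjacent intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < bins:  # values outside the covered range are ignored
            counts[index] += 1
    return counts


# Hypothetical sample.
values = [12000, 18000, 35000, 41000, 44000, 58000, 61000, 75000, 99000]
start, width, bins = 10000, 20000, 5

counts = histogram_counts(values, start, width, bins)

# Text rendering: with equal-width bins, bar length is proportional to frequency.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:>6}-{low + width:<6} {'#' * count}")
```

With equal-width bins, bar height and bar area carry the same information; the area interpretation only matters once the bins differ in width.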
Quantile Plot
n Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
n Plots quantile information
n For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
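The points of a quantile plot can be sketched as pairs (f_i, x_i) over the sorted data. The formula f_i = (i - 0.5) / N used below is one common convention and is an assumption here, since the slide text does not state it; the function name and sample values are likewise illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with its fraction f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical sample.
points = quantile_plot_points([40, 10, 30, 20])
for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting x against f shows every observation, so both the overall shape and any unusual values remain visible.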
Scatter plot
n Provides a first look at bivariate data to see clusters of
points, outliers, etc.
n Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Similarity and Dissimilarity
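n Similarity
n Numerical measure of how alike two data objects are
n Higher when objects are more alike
n Dissimilarity (e.g., distance)
n Numerical measure of how different two data objects are
n Lower when objects are more alike
n Minimum dissimilarity is often 0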
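As one possible illustration for numeric objects, the sketch below uses Euclidean distance as the dissimilarity and converts it to a similarity with 1 / (1 + d). Both choices are assumptions made for this example; the slides do not prescribe a particular measure or conversion.

```python
import math


def dissimilarity(a, b):
    """Euclidean distance: 0 for identical objects, larger when more different."""
    return math.dist(a, b)


def similarity(a, b):
    """One common conversion: 1 for identical objects, approaching 0 as distance grows."""
    return 1 / (1 + dissimilarity(a, b))


# Hypothetical two-attribute objects.
p, q = (0, 0), (3, 4)
print(dissimilarity(p, q), similarity(p, q))
print(dissimilarity(p, p), similarity(p, p))
```

The two measures move in opposite directions: identical objects have the minimum dissimilarity of 0 and the maximum similarity of 1.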
Ordinal Variables
Attributes of Mixed Type