Chapter 2 Data Mining


Types of Data Sets

n Record
  n Relational records
  n Data matrix, e.g., numerical matrix, crosstabs
  n Document data: text documents: term-frequency vector
  n Transaction data
n Graph and network
  n World Wide Web
  n Social or information networks
  n Molecular structures
n Ordered
  n Video data: sequence of images
  n Temporal data: time-series
  n Sequential data: transaction sequences
  n Genetic sequence data
n Spatial, image and multimedia:
  n Spatial data: maps
  n Image data
  n Video data

Term-frequency vectors (document data):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
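A minimal sketch of how a term-frequency vector for document data could be built. The vocabulary, the sample sentence, and the whitespace tokenization are all assumptions made for illustration; real text pipelines usually tokenize more carefully.

```python
from collections import Counter

# Hypothetical vocabulary; each vector position corresponds to one term.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]


def term_frequency_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocab]


# Hypothetical document.
document = ("the coach told the team to play the ball and score "
            "before the game ends the team agreed")
vector = term_frequency_vector(document, vocabulary)
print(vector)
```

Each document becomes one row of a data matrix, with one column per term.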
Data Objects

n Data sets are made up of data objects.


n A data object represents an entity.
n Examples:
n sales database: customers, store items, sales
n medical database: patients, treatments
n university database: students, professors, courses
n Also called samples, examples, instances, data points, objects, or
tuples.
n If the data objects are stored in a database, they are data
tuples.
n Data objects are described by attributes.
n Database rows -> data objects; columns -> attributes.
Attributes
n Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
n A set of attributes used to describe a given object is called an attribute
vector (or feature vector).
n The distribution of data involving one attribute (or variable) is called
univariate.
n A bivariate distribution involves two attributes, and so on.
n E.g., customer_ID, name, address
n Types:
n Nominal
n Binary
n Ordinal
n Numeric: quantitative
n Interval-scaled
n Ratio-scaled
Attribute Types
n Nominal: “relating to names”
n The values of a nominal attribute are symbols or names of
things.
n Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical.
n In computer science, the values are also known as
enumerations.
n Hair_color = {auburn, black, blond, brown, grey, red, white}
n marital status, occupation, ID numbers, zip codes
n Although we said that the values of a nominal attribute are
symbols or “names of things,” it is possible to represent
such symbols or “names” with numbers.
n With hair color, for instance, we can assign a code of 0 for black, 1 for
brown, and so on.
n However, in such cases, the numbers are not intended to be used
quantitatively.
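
A minimal sketch of a nominal attribute stored as an enumeration. The codes for black and brown follow the example above; the code for blond and the observed values are hypothetical. Equality tests and counting are meaningful for such codes, whereas arithmetic on them (e.g., averaging) is not.

```python
from collections import Counter
from enum import Enum


class HairColor(Enum):
    # Integer codes are arbitrary labels, not quantities.
    BLACK = 0
    BROWN = 1
    BLOND = 2  # hypothetical code, chosen only for illustration


# Hypothetical observations of the attribute.
observed = [HairColor.BROWN, HairColor.BLACK, HairColor.BROWN]

# Counting categories is a valid operation on nominal data.
counts = Counter(observed)
most_common_color = counts.most_common(1)[0][0]
print(most_common_color.name)
```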

Attribute Types
n Binary
n Nominal attribute with only 2 categories or states (0 and 1),
where 0 typically means that the attribute is absent, and 1 means
that it is present.
n Binary attributes are referred to as Boolean if the two states
correspond to true and false.
n E.g., attribute smoker: 1 means the patient smokes; 0 means the patient does not.
n Symmetric binary: if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which
outcome should be coded as 0 or 1.
n e.g., gender (male and female)
n Asymmetric binary: if the outcomes of the states are not equally
important.
n e.g., medical test (positive vs. negative)
n Convention: assign 1 to most important outcome (e.g., HIV
positive)

Attribute Types
n Ordinal
n An ordinal attribute has possible values with a meaningful
order or ranking among them, but the magnitude between
successive values is not known.
n drink size = {small, medium, large}, grades, army rankings
n Other examples of ordinal attributes include grade (e.g., A+,
A, A-, B+ and so on) and professional rank.
n Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively;
thus ordinal attributes are often used in surveys for ratings.
n Customer satisfaction had the following ordinal categories:
n 0: very dissatisfied,
n 1: somewhat dissatisfied,
n 2: neutral,
n 3: satisfied, and
n 4: very satisfied

Attribute Types
n Note that nominal, binary, and ordinal attributes
are qualitative. That is, they describe a feature
of an object without giving an actual size or
quantity.
n The values of such qualitative attributes are
typically words representing categories.
n If integers are used, they represent computer
codes for the categories, as opposed to
measurable quantities (e.g., 0 for small drink size,
1 for medium, and 2 for large).

Numeric Attribute Types
n A numeric attribute is quantitative; that is, it is a
measurable quantity, represented in integer or real values.
n Numeric attributes can be interval-scaled or ratio-scaled.
n Interval-Scaled Attributes
n Interval-scaled attributes are measured on a scale of equal-size units.
n The values of interval-scaled attributes have order and
can be positive, 0, or negative.
n Thus, in addition to providing a ranking of values, such
attributes allow us to compare and quantify the
difference between values.
n E.g., temperature in °C or °F, calendar dates
n No true zero-point
Numeric Attribute Types
n Ratio-Scaled Attributes
n A ratio-scaled attribute is a numeric attribute with an inherent
zero-point. That is, if a measurement is ratio-scaled, we can speak
of a value as being a multiple (or ratio) of another value.
n In addition, the values are ordered, and we can also
compute the difference between values, as well as the
mean, median, and mode.
n E.g., count attributes such as years of experience
(e.g., the objects are employees) and number of
words (e.g., the objects are documents), weight,
height, latitude and longitude coordinates (e.g., when
clustering houses), and monetary quantities (e.g., you
are 100 times richer with $100 than with $1).
Discrete vs. Continuous Attributes
n Discrete Attribute
n A discrete attribute has a finite or countably infinite set
of values, which may or may not be represented as
integers.
n E.g., zip codes, profession, or the set of words in a
collection of documents
n Note: Binary attributes are a special case of discrete
attributes
n Continuous Attribute / Numeric Attribute
n Continuous attributes are typically represented as
floating-point variables.
n E.g., temperature, height, or weight
n Practically, real values can only be measured and
represented using a finite number of digits
Chapter 2: Getting to Know Your Data

n Data Objects and Attribute Types

n Basic Statistical Descriptions of Data

n Data Visualization

n Measuring Data Similarity and Dissimilarity

n Summary

Basic Statistical Descriptions of Data
n Motivation
n To better understand the data: central tendency,
variation and spread
n Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Numerical dimensions correspond to sorted intervals
n Data dispersion: analyzed with multiple granularities
of precision
n Boxplot or quantile analysis on sorted intervals
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
n Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Note: n is sample size and N is population size.
n Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
n Trimmed mean: chopping extreme values
n Problem: Suppose we have the following values for
salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
n What is our mean?
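
The question can be checked with a short sketch on the salary values above. The equal weights and the choice to trim exactly one value from each end are assumptions made for illustration; the slides do not fix either.

```python
# Salary values from the problem above (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by their count.
mean = sum(salaries) / len(salaries)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# With equal weights (an assumption), the weighted mean equals the plain mean.
equal_weights = [1] * len(salaries)
weighted = weighted_mean(salaries, equal_weights)

# Trimmed mean: drop one value at each extreme (assumed trim size), then average.
trimmed = sorted(salaries)[1:-1]
trimmed_mean = sum(trimmed) / len(trimmed)

print(mean, weighted, trimmed_mean)
```

The mean is 58, i.e., $58,000. Dropping the extremes 30 and 110 lowers it to 55.6, which shows how a single large value pulls the mean upward.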

Measuring the Central Tendency
n Median:
n Middle value if odd number of values, or average of
the middle two values otherwise
n Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$$
n What is our median?

Thus, the median is $54,000.
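
A minimal sketch of both median calculations. The exact median uses the salary values from the earlier problem. The grouped-data example uses hypothetical intervals and frequencies, and it reads L1 as the lower boundary of the median interval and (Σ freq)_l as the total frequency of all intervals below it, which is the usual interpretation of the formula.

```python
from statistics import median

# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of values: average of the two middle values (52 and 56).
exact_median = median(salaries)


def grouped_median(lower_bound, n, cum_freq_below, freq_median, width):
    """Interpolated median for grouped data, following the formula above."""
    return lower_bound + (n / 2 - cum_freq_below) / freq_median * width


# Hypothetical grouped data: intervals [0,10), [10,20), [20,30), [30,40)
# with frequencies 20, 25, 40, 15 (n = 100). The median interval is [20,30),
# because the cumulative frequency first reaches n/2 = 50 inside it.
approx_median = grouped_median(lower_bound=20, n=100, cum_freq_below=45,
                               freq_median=40, width=10)

print(exact_median, approx_median)
```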


Measuring the Central Tendency
n Mode
n Value that occurs most frequently in the data
n Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
n In general, a data set with two or more modes is multimodal.
n Empirical formula: mean - mode = 3 × (mean - median)
n What is our mode?

n The two modes are $52,000 and $70,000.
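
A short sketch that finds the modes of the salary values from the earlier problem. It assumes Python 3.8 or later, where `statistics.multimode` is available.

```python
from statistics import multimode

# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every value that shares the highest frequency.
modes = multimode(salaries)
is_bimodal = len(modes) == 2

print(modes, is_bimodal)
```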

Measuring the Central Tendency
n Midrange
n The midrange can also be used to assess the central tendency of a
numeric data set.
n It is the average of the largest and smallest values in the set. This measure
is easy to compute using the SQL aggregate functions, max() and min().
n What is our midrange?
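
A short sketch for the midrange of the salary values from the earlier problem. It uses Python's built-in `min` and `max` rather than the SQL aggregates mentioned above.

```python
# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2

print(midrange)
```

The result is 70, i.e., $70,000.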

Symmetric vs. Skewed Data
n Median, mean and mode of symmetric, positively skewed, and
negatively skewed data
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

February 19, 2024 Data Mining: Concepts and Techniques 21


Measuring the Dispersion of Data
n The measures include range, quantiles, quartiles, percentiles, and the
interquartile range.
n The five-number summary, which can be displayed as a boxplot, is useful in
identifying outliers.
n Variance and standard deviation also indicate the spread of a data distribution.

n The range of the set is the difference between the largest (max()) and
smallest (min()) values.
n Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.

n The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
n The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles (the 25th, 50th, and 75th percentiles).
n The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets.

n Interquartile range (IQR): the distance between the first and third
quartiles. It is a simple measure of spread that gives the range covered by the
middle half of the data.

n Example:

n Q1 = $47,000
n Q3 = $63,000
n IQR = $63,000 - $47,000
n IQR = $16,000
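
A minimal sketch that reproduces these values from the salary data of the earlier problem. It assumes a simple positional convention in which the k-th quartile is the value at position k·n/4 of the sorted data; the helper name is hypothetical. Library routines that interpolate between neighbouring values can give slightly different quartiles.

```python
# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def quartile_by_position(values, k):
    """k-th quartile as the value at 1-indexed position k*n/4 (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[(k * n) // 4 - 1]


q1 = quartile_by_position(salaries, 1)  # third of 12 sorted values
q3 = quartile_by_position(salaries, 3)  # ninth of 12 sorted values
iqr = q3 - q1

print(q1, q3, iqr)
```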

Measuring the Dispersion of Data
n Five number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order of Minimum, Q1, Median, Q3,
Maximum.
n Boxplots are a popular way of visualizing a distribution
n The ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
n Outlier: usually, a value falling at least 1.5 × IQR above Q3 or below Q1
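
A minimal sketch of the five-number summary and the 1.5 × IQR rule on the salary data from the earlier problem. The quartiles and median are taken as given from the worked examples (Q1 = 47, median = 54, Q3 = 63) rather than recomputed.

```python
# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles and median as stated in the worked examples.
q1, med, q3 = 47, 54, 63
iqr = q3 - q1

five_number_summary = (min(salaries), q1, med, q3, max(salaries))

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(five_number_summary, lower_fence, upper_fence, outliers)
```

Only the salary of 110 lies beyond the upper fence of 87, so a boxplot would draw it as an individual point.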

Boxplot Analysis

n Five-number summary of a distribution


n Minimum, Q1, Median, Q3, Maximum
n Boxplot
n Data is represented with a box
n The ends of the box are at the first
and third quartiles, i.e., the height of
the box is IQR
n The median is marked by a line within
the box
n Whiskers: two lines outside the box
extended to Minimum and Maximum
n Outliers: points beyond a specified
outlier threshold, plotted individually
Measuring the Dispersion of Data
n Variance and standard deviation
n They are measures of data dispersion. They indicate how spread out a data
distribution is.
n A low standard deviation means that the data observations tend to be
very close to the mean, while a high standard deviation indicates that
the data are spread out over a large range of values.

Measuring the Dispersion of Data

What is the variance?


What is the standard deviation?
Measuring the Dispersion of Data
n Variance and standard deviation (sample: s, population: σ)
n Variance: (algebraic, scalable computation)
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$$
n Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
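
A minimal sketch that evaluates these formulas on the salary data from the earlier problem. Whether the twelve salaries count as a sample or as the whole population is not stated, so both versions are computed.

```python
from math import sqrt

# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(salaries)
mean = sum(salaries) / n
sum_of_squares = sum(x * x for x in salaries)

# Sample variance, definition form: squared deviations divided by n - 1.
sample_variance = sum((x - mean) ** 2 for x in salaries) / (n - 1)

# Sample variance, computational form: needs only the sum and sum of squares.
sample_variance_shortcut = (sum_of_squares - sum(salaries) ** 2 / n) / (n - 1)

# Population variance: mean of the squares minus the squared mean.
population_variance = sum_of_squares / n - mean ** 2

sample_std = sqrt(sample_variance)
population_std = sqrt(population_variance)

print(sample_variance, population_variance, sample_std, population_std)
```

The population variance is about 379.17 (standard deviation about 19.47), and the sample variance is about 413.64 (standard deviation about 20.34). The computational form is what makes the measure scalable: the sums can be accumulated in a single pass over the data.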

Visualization of Data Dispersion: 3-D Boxplots



Graphic Displays of Basic Statistical Descriptions

n Boxplot: graphic display of five-number summary


n Histogram: x-axis shows values, y-axis represents frequencies
n Quantile plot: each value x_i is paired with f_i indicating
that approximately 100 f_i% of data are ≤ x_i
n Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
n Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane

Histogram Analysis
n Histogram: Graph display of tabulated frequencies, shown as bars
n It shows what proportion of cases fall into each of several categories
n Differs from a bar chart in that it is the area of the bar that denotes the
value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
n The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
Quantile Plot
n Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
n Plots quantile information
n For data x_i sorted in increasing order, f_i indicates that
approximately 100 f_i% of the data are below or equal to the value x_i
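
A minimal sketch of the (f_i, x_i) pairs behind a quantile plot, using the salary data from the earlier problem. The slides do not say how f_i is computed; the formula f_i = (i - 0.5) / N used here is one common convention and should be read as an assumption.

```python
# Salary values from the earlier problem (in thousands of dollars).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

ordered = sorted(salaries)
n = len(ordered)

# Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention).
quantile_pairs = [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_pairs:
    print(f"{f:.3f}  {x}")
```

Plotting x_i against f_i gives the quantile plot; every observation appears as one point.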



Quantile-Quantile (Q-Q) Plot
n Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
n Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.

Scatter plot
n Provides a first look at bivariate data to see clusters of
points, outliers, etc
n Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

n The left half fragment is positively correlated
n The right half is negatively correlated
Uncorrelated Data

Similarity and Dissimilarity
n Similarity
n Numerical measure of how alike two data objects are

n Value is higher when objects are more alike

n Often falls in the range [0,1]

n Dissimilarity (e.g., distance)
n Numerical measure of how different two data objects are
n Lower when objects are more alike
n Minimum dissimilarity is often 0

n Upper limit varies

n Proximity refers to a similarity or dissimilarity

Ordinal Variables

n An ordinal variable can be discrete or continuous


n Order is important, e.g., rank
n Can be treated like interval-scaled
n replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
n map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
n compute the dissimilarity using methods for interval-
scaled variables
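
The steps above can be sketched as follows, using the five customer-satisfaction levels from the ordinal attribute example. The two objects being compared are hypothetical, and the absolute difference stands in for "methods for interval-scaled variables" as one simple choice.

```python
# Ordered levels of the ordinal attribute, lowest first (M_f = 5).
levels = ["very dissatisfied", "somewhat dissatisfied", "neutral",
          "satisfied", "very satisfied"]

# Step 1: replace each value by its rank r_if in {1, ..., M_f}.
rank = {level: position for position, level in enumerate(levels, start=1)}
num_levels = len(levels)


def normalized_rank(level):
    """Step 2: map the rank onto [0, 1] via z = (r - 1) / (M - 1)."""
    return (rank[level] - 1) / (num_levels - 1)


# Step 3: treat z as interval-scaled; here, absolute difference (assumed choice).
z_a = normalized_rank("neutral")         # hypothetical object a
z_b = normalized_rank("very satisfied")  # hypothetical object b
dissimilarity = abs(z_a - z_b)

print(z_a, z_b, dissimilarity)
```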

Attributes of Mixed Type

n A database may contain all attribute types


n Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
n One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
n f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
n f is numeric: use the normalized distance
n f is ordinal
n Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
n Treat $z_{if}$ as interval-scaled
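
A minimal sketch of the combined formula. Several details are assumptions, since the slide does not spell them out: the indicator δ is set to 0 when either value is missing and to 1 otherwise, the normalized numeric distance is |x_if - x_jf| / (max_f - min_f), and the function names, schema layout, and sample objects are hypothetical.

```python
def attribute_dissimilarity(x, y, kind, info):
    """Per-attribute dissimilarity d_ij^(f), always in [0, 1]."""
    if kind == "nominal":  # also covers binary attributes
        return 0.0 if x == y else 1.0
    if kind == "numeric":  # info = (min, max) of the attribute (assumed normalization)
        low, high = info
        return abs(x - y) / (high - low)
    if kind == "ordinal":  # info = ordered list of levels, lowest first
        m = len(info)
        # index / (M - 1) equals (r - 1) / (M - 1) for rank r = index + 1
        return abs(info.index(x) - info.index(y)) / (m - 1)
    raise ValueError(f"unknown attribute kind: {kind}")


def mixed_dissimilarity(obj_i, obj_j, schema):
    """d(i, j) = sum(delta * d) / sum(delta) over all attributes f."""
    numerator = denominator = 0.0
    for name, (kind, info) in schema.items():
        x, y = obj_i.get(name), obj_j.get(name)
        if x is None or y is None:
            continue  # delta = 0: a missing value contributes nothing (assumption)
        numerator += attribute_dissimilarity(x, y, kind, info)
        denominator += 1.0  # delta = 1
    return numerator / denominator if denominator else 0.0


# Hypothetical schema and objects.
schema = {
    "hair": ("nominal", None),
    "salary": ("numeric", (30, 110)),
    "size": ("ordinal", ["small", "medium", "large"]),
}
a = {"hair": "black", "salary": 50, "size": "small"}
b = {"hair": "brown", "salary": 70, "size": "large"}
c = {"hair": "black", "salary": None, "size": "medium"}

d_ab = mixed_dissimilarity(a, b, schema)  # (1 + 0.25 + 1) / 3
d_ac = mixed_dissimilarity(a, c, schema)  # (0 + 0.5) / 2, salary skipped

print(d_ab, d_ac)
```

Because every per-attribute term lies in [0, 1], the combined value does too, so attributes of different types contribute on a comparable scale.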
