Chapter 2 Data Mining

Uploaded by

shynxx 02

Types of Data Sets

n Record
n Relational records
n Data matrix, e.g., numerical matrix, crosstabs
n Document data: text documents: term-frequency vector
n Transaction data
n Graph and network
n World Wide Web
n Social or information networks
n Molecular Structures
n Ordered
n Video data: sequence of images
n Temporal data: time-series
n Sequential Data: transaction sequences
n Genetic sequence data
n Spatial, image and multimedia:
n Spatial data: maps
n Image data:
n Video data:

Example of document data (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects

n Data sets are made up of data objects.

n A data object represents an entity.
n Examples:
n sales database: customers, store items, sales
n medical database: patients, treatments
n university database: students, professors, courses
n Also called samples, examples, instances, data points, objects,
tuples.
n If the data objects are stored in a database, they are data
tuples.
n Data objects are described by attributes.
n Database rows -> data objects; columns -> attributes.
Attributes
n Attribute (or dimensions, features, variables): a data field,
representing a characteristic or feature of a data object.
n A set of attributes used to describe a given object is called an attribute
vector (or feature vector).
n The distribution of data involving one attribute (or variable) is called
univariate.
n A bivariate distribution involves two attributes, and so on.
n E.g., customer_ID, name, address
n Types:
n Nominal
n Binary
n Ordinal
n Numeric: quantitative
n Interval-scaled
n Ratio-scaled
Attribute Types
n Nominal: “relating to names”
n The values of a nominal attribute are symbols or names of
things.
n Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical.
n In computer science, the values are also known as
enumerations.
n Hair_color = {auburn, black, blond, brown, grey, red, white}
n marital status, occupation, ID numbers, zip codes
n Although we said that the values of a nominal attribute are
symbols or “names of things,” it is possible to represent
such symbols or “names” with numbers.
n With hair color, for instance, we can assign a code of 0 for black, 1 for
brown, and so on.
n However, in such cases, the numbers are not intended to be used
quantitatively.
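
As a minimal sketch of this point in Python (the code table and the four-person sample are made up for illustration), integer codes for hair color behave as labels only: finding the most frequent code is meaningful, whereas averaging the codes produces a number that names no category.

```python
from collections import Counter

# Hypothetical code table: the integers are labels, not quantities.
HAIR_CODES = {"black": 0, "brown": 1, "blond": 2, "red": 3}
CODE_TO_NAME = {code: name for name, code in HAIR_CODES.items()}

# Illustrative sample of four people, stored as codes.
sample = [HAIR_CODES[color] for color in ["black", "brown", "brown", "red"]]

# Meaningful for a nominal attribute: the most common category.
most_common_code, count = Counter(sample).most_common(1)[0]
most_common_name = CODE_TO_NAME[most_common_code]

# Not meaningful: the arithmetic mean of the codes.
mean_code = sum(sample) / len(sample)

print(most_common_name, count)  # brown 2
print(mean_code)                # 1.25, which is not a hair color
```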

Attribute Types
n Binary
n Nominal attribute with only 2 categories or states (0 and 1),
where 0 typically means that the attribute is absent, and 1 means
that it is present.
n Binary attributes are referred to as Boolean if the two states
correspond to true and false.
n E.g., attribute smoker: 1 means the patient smokes; 0 means the patient does not.
n Symmetric binary: if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which
outcome should be coded as 0 or 1.
n e.g., gender (male and female)
n Asymmetric binary: if the outcomes of the states are not equally
important.
n e.g., medical test (positive vs. negative)
n Convention: assign 1 to the most important outcome (e.g., HIV
positive)
Attribute Types
n Ordinal
n An ordinal attribute has possible values with a meaningful
order or ranking among them, but the magnitude between
successive values is not known.
n drink size = {small, medium, large}, grades, army rankings
n Other examples of ordinal attributes include grade (e.g., A+,
A, A-, B+ and so on) and professional rank.
n Ordinal attributes are useful for registering subjective
assessments of qualities that cannot be measured objectively;
thus ordinal attributes are often used in surveys for ratings.
n Customer satisfaction had the following ordinal categories:
n 0: very dissatisfied,
n 1: somewhat dissatisfied,
n 2: neutral,
n 3: satisfied, and
n 4: very satisfied

Attribute Types
n Note that nominal, binary, and ordinal attributes
are qualitative. That is, they describe a feature
of an object without giving an actual size or
quantity.
n The values of such qualitative attributes are
typically words representing categories.
n If integers are used, they represent computer
codes for the categories, as opposed to
measurable quantities (e.g., 0 for small drink size,
1 for medium, and 2 for large).

Numeric Attribute Types
n A numeric attribute is quantitative; that is, it is a
measurable quantity, represented in integer or real values.
n Numeric attributes can be interval-scaled or ratio-
scaled.
n Interval-Scaled Attributes
n Interval-scaled attributes are measured on a scale of equal-size units.
n The values of interval-scaled attributes have order and
can be positive, 0, or negative.
n Thus, in addition to providing a ranking of values, such
attributes allow us to compare and quantify the
difference between values.
n E.g., temperature in °C or °F, calendar dates
n No true zero-point, so one value cannot be expressed as a multiple (ratio) of another
Numeric Attribute Types
n Ratio-Scaled Attributes
n A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That
is, if a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another
value.
n In addition, the values are ordered, and we can also
compute the difference between values, as well as the
mean, median, and mode.
n E.g., count attributes such as years of experience
(e.g., the objects are employees) and number of
words (e.g., the objects are documents), weight,
height, latitude and longitude coordinates (e.g., when
clustering houses), and monetary quantities (e.g., you
are 100 times richer with $100 than with $1).
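
The difference between the two numeric scales can be checked with a short Python sketch. The temperatures (10 and 20 °C) are illustrative values, and the helper name is an assumption. Differences between temperatures describe the same gap in either unit, but the ratio changes with the unit because neither scale has a true zero; the ratio of two monetary amounts stays the same whatever the unit.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32

# Interval-scaled: temperature in Celsius or Fahrenheit (illustrative values).
cool_c, warm_c = 10.0, 20.0
cool_f = celsius_to_fahrenheit(cool_c)
warm_f = celsius_to_fahrenheit(warm_c)

diff_c = warm_c - cool_c   # 10 Celsius degrees
diff_f = warm_f - cool_f   # 18 Fahrenheit degrees, the same physical gap
ratio_c = warm_c / cool_c  # 2.0
ratio_f = warm_f / cool_f  # 1.36, so "twice as warm" is not meaningful

# Ratio-scaled: money has a true zero, so the ratio survives a unit change.
ratio_dollars = 100.0 / 1.0
ratio_cents = (100.0 * 100) / (1.0 * 100)

print(diff_c, diff_f)
print(ratio_c, ratio_f)
print(ratio_dollars, ratio_cents)
```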

Discrete vs. Continuous Attributes
n Discrete Attribute
n A discrete attribute has a finite or countably infinite set
of values, which may or may not be represented as
integers.
n E.g., zip codes, profession, or the set of words in a
collection of documents
n Note: Binary attributes are a special case of discrete
attributes
n Continuous Attribute / Numeric Attribute
n Continuous attributes are typically represented as
floating-point variables.
n E.g., temperature, height, or weight
n Practically, real values can only be measured and
represented using a finite number of digits
Chapter 2: Getting to Know Your Data

n Data Objects and Attribute Types

n Basic Statistical Descriptions of Data

n Data Visualization

n Measuring Data Similarity and Dissimilarity

n Summary

Basic Statistical Descriptions of Data
n Motivation
n To better understand the data: central tendency,
variation and spread
n Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Numerical dimensions correspond to sorted intervals
n Data dispersion: analyzed with multiple granularities
of precision
n Boxplot or quantile analysis on sorted intervals
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
n Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
Note: n is sample size and N is population size.
n Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
n Trimmed mean: chopping extreme values
n Problem: Suppose we have the following values for
salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
n What is our mean?
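
The salary values sum to 696, so the mean is 696 / 12 = 58, that is, $58,000. A minimal Python sketch of the three means follows; the function names, the equal weights, and the choice of trimming one value from each end are illustrative assumptions, not part of the problem statement.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

plain = mean(salaries)
equal_weights = weighted_mean(salaries, [1] * len(salaries))
trimmed = trimmed_mean(salaries, 1)  # drops 30 and 110

print(plain)          # 58.0
print(equal_weights)  # 58.0, equal weights reduce to the plain mean
print(trimmed)        # 55.6
```

Dropping the single high salary of 110 (and the low 30) pulls the mean from 58 down to 55.6, which illustrates how sensitive the mean is to extreme values.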

Measuring the Central Tendency
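n Median:
n Middle value if odd number of values, or average of
the middle two values otherwise
n Estimated by interpolation (for grouped data):
\[ \mathrm{median} = L_1 + \left( \frac{n/2 - \left(\sum \mathrm{freq}\right)_l}{\mathrm{freq}_{\mathrm{median}}} \right) \mathrm{width} \]
where L_1 is the lower boundary of the median interval, n is the number of values, (\sum freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the interval width.
n What is our median?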

Thus, the median is $54,000.
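
A minimal Python sketch of both median computations. The salary list is the one from the problem; the grouped-data intervals in the second part are made-up values chosen only to exercise the interpolation formula, and the function and parameter names are illustrative.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

# Even number of values: average of the two middle values, 52 and 56.
median_salary = statistics.median(salaries)
print(median_salary)  # 54.0

def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped data.

    lower_bound: lower boundary of the median interval (L1)
    n:           total number of values
    freq_below:  summed frequency of all intervals below the median interval
    freq_median: frequency of the median interval
    width:       width of the median interval
    """
    return lower_bound + (n / 2 - freq_below) / freq_median * width

# Hypothetical grouped data: [0, 10) holds 5 values, [10, 20) holds 10,
# [20, 30) holds 5. With n = 20, the 10th value falls in [10, 20).
estimate = grouped_median(lower_bound=10, n=20, freq_below=5,
                          freq_median=10, width=10)
print(estimate)  # 15.0
```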

Measuring the Central Tendency
n Mode
n Value that occurs most frequently in the data
n Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
n In general, a data set with two or more modes is multimodal.
n Empirical formula (for unimodal, moderately skewed data): mean - mode ≈ 3 × (mean - median)
n What is our mode?

n The two modes are $52,000 and $70,000.
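
One way to find every mode, sketched in Python with a frequency count over the salary data (variable names are illustrative):

```python
from collections import Counter

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

# Count how often each value occurs, then keep all values tied for the top.
counts = Counter(salaries)
highest = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes)  # [52, 70], so the data set is bimodal
```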

Measuring the Central Tendency
n Midrange
n The midrange can also be used to assess the central tendency of a
numeric data set.
n It is the average of the largest and smallest values in the set. This measure
is easy to compute using the SQL aggregate functions, max() and min().
n What is our midrange?
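
For the salary data the smallest value is 30 and the largest is 110, so the midrange is (30 + 110) / 2 = 70, that is, $70,000. The same computation as a short Python sketch, using the built-in min() and max() in place of the SQL aggregates:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2

print(midrange)  # 70.0
```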

Symmetric vs. Skewed Data
n Median, mean and mode of symmetric, positively and
negatively skewed data
[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

February 19, 2024 Data Mining: Concepts and Techniques 21

Measuring the Dispersion of Data
n The measures include range, quantiles, quartiles, percentiles, and the
interquartile range.
n The five-number summary, which can be displayed as a boxplot, is useful in
identifying outliers.
n Variance and standard deviation also indicate the spread of a data distribution.

n The range of the set is the difference between the largest (max()) and
smallest (min()) values.
n Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.

Measuring the Dispersion of Data
n The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
n The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles (Q1, Q2, and Q3: the 25th, 50th, and 75th percentiles).
n The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets.

n The interquartile range (IQR), the distance between the first and third
quartiles (IQR = Q3 - Q1), is a simple measure of spread that gives the range covered by the
middle half of the data.

n Example:
n Q1 = $47,000
n Q3 = $63,000
n IQR = $63,000 - $47,000
n IQR = $16,000
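
These figures can be reproduced with a small Python sketch. It assumes a simple rank-based convention, taking the k-th quartile as the value at 1-based rank ceil(k * n / 4) in the sorted data; for the 12 salary values this picks the 3rd and 9th values. The helper name is hypothetical, and library routines such as statistics.quantiles interpolate between values and may return slightly different numbers.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

def quartile_by_rank(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) as the value at 1-based rank ceil(k*n/4).

    Assumed convention for this sketch; other quantile definitions exist.
    """
    n = len(sorted_values)
    rank = -(-k * n // 4)  # integer ceiling of k * n / 4
    return sorted_values[rank - 1]

ordered = sorted(salaries)
q1 = quartile_by_rank(ordered, 1)
q3 = quartile_by_rank(ordered, 3)
iqr = q3 - q1

print(q1, q3, iqr)  # 47 63 16
```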

Measuring the Dispersion of Data
n Five number summary of a distribution consists of the median (Q2),
the quartiles Q1 and Q3, and the smallest and largest individual
observations, written in the order of Minimum, Q1, Median, Q3,
Maximum.
n Boxplots are a popular way of visualizing a distribution
n The ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
n Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
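
A sketch of the five-number summary and the 1.5 × IQR outlier rule in Python, applied to the salary data (30, 36, ..., 110, in thousands of dollars). The rank-based quartile helper is an assumed convention (value at 1-based rank ceil(k * n / 4)); other quartile definitions would shift the fences slightly.

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

def quartile_by_rank(sorted_values, k):
    """Value at 1-based rank ceil(k*n/4); an assumed quartile convention."""
    n = len(sorted_values)
    rank = -(-k * n // 4)  # integer ceiling of k * n / 4
    return sorted_values[rank - 1]

ordered = sorted(salaries)
q1 = quartile_by_rank(ordered, 1)
q3 = quartile_by_rank(ordered, 3)
iqr = q3 - q1

# Minimum, Q1, Median, Q3, Maximum
five_number = (ordered[0], q1, statistics.median(ordered), q3, ordered[-1])

# Values beyond these fences are flagged as suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(five_number)               # (30, 47, 54.0, 63, 110)
print(lower_fence, upper_fence)  # 23.0 87.0
print(outliers)                  # [110]
```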

Boxplot Analysis

n Five-number summary of a distribution

n Minimum, Q1, Median, Q3, Maximum
n Boxplot
n Data is represented with a box
n The ends of the box are at the first
and third quartiles, i.e., the height of
the box is IQR
n The median is marked by a line within
the box
n Whiskers: two lines outside the box
extended to Minimum and Maximum
n Outliers: points beyond a specified
outlier threshold, plotted individually
Measuring the Dispersion of Data
n Variance and standard deviation
n They are measures of data dispersion. They indicate how spread out a data
distribution is.
n A low standard deviation means that the data observations tend to be
very close to the mean, while a high standard deviation indicates that
the data are spread out over a large range of values.

Measuring the Dispersion of Data

What is the variance?

What is the standard deviation?
Measuring the Dispersion of Data
n Variance and standard deviation (sample: s, population: σ)
n Variance: (algebraic, scalable computation)
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]

n Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)
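
As a worked check on the salary data (30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, in thousands of dollars), the Python sketch below evaluates both variances from the running sums of x and x squared, which is what makes the computation scalable. Whether the 12 salaries are treated as a population or as a sample is an assumption, so both results are shown.

```python
import math

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

n = len(salaries)
total = sum(salaries)                    # sum of x_i   = 696
total_sq = sum(x * x for x in salaries)  # sum of x_i^2 = 44918
mean = total / n                         # 58.0

# Population variance: (1/N) * sum(x_i^2) - mu^2
pop_var = total_sq / n - mean ** 2
pop_std = math.sqrt(pop_var)

# Sample variance: (1/(n-1)) * [sum(x_i^2) - (1/n) * (sum x_i)^2]
sample_var = (total_sq - total ** 2 / n) / (n - 1)
sample_std = math.sqrt(sample_var)

print(round(pop_var, 2), round(pop_std, 2))        # 379.17 19.47
print(round(sample_var, 2), round(sample_std, 2))  # 413.64 20.34
```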

Visualization of Data Dispersion: 3-D Boxplots


Graphic Displays of Basic Statistical Descriptions

n Boxplot: graphic display of five-number summary

n Histogram: x-axis shows values, y-axis represents frequencies
n Quantile plot: each value x_i is paired with f_i indicating
that approximately 100 f_i % of data are ≤ x_i
n Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
n Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane

Histogram Analysis
n Histogram: Graph display of
tabulated frequencies, shown as
bars
n It shows what proportion of cases
fall into each of several categories
n Differs from a bar chart in that it is
the area of the bar that denotes the
value, not the height as in bar
charts, a crucial distinction when the
categories are not of uniform width
n The categories are usually specified
as non-overlapping intervals of
some variable. The categories (bars)
must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]

Quantile Plot
n Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
n Plots quantile information
n For data x_i sorted in increasing order, f_i
indicates that approximately 100 f_i % of the data are
below or equal to the value x_i
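
The slide does not say how f_i is computed. One common convention, assumed here, is f_i = (i - 0.5) / N for the i-th of N sorted values, which runs from 1/(2N) up to 1 - 1/(2N) in equal steps. A minimal Python sketch on illustrative salary data (in thousands of dollars) builds the (f_i, x_i) pairs that a quantile plot would draw:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

ordered = sorted(salaries)
n = len(ordered)

# Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention).
quantile_pairs = [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_pairs:
    print(f"{f:.3f}  {x}")
```

Plotting x_i against f_i gives the quantile plot; the jump from 70 to 110 at the top end makes the unusually high salary easy to spot.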


Quantile-Quantile (Q-Q) Plot
n Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
n Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.

Scatter plot
n Provides a first look at bivariate data to see clusters of
points, outliers, etc
n Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

n The left half fragment is positively
correlated
n The right half is negatively correlated

Uncorrelated Data

Similarity and Dissimilarity
n Similarity
n Numerical measure of how alike two data objects are
n Value is higher when objects are more alike
n Often falls in the range [0,1]
n Dissimilarity (e.g., distance)
n Numerical measure of how different two data objects
are
n Lower when objects are more alike
n Minimum dissimilarity is often 0
n Upper limit varies
n Proximity refers to a similarity or dissimilarity

Ordinal Variables

n An ordinal variable can be discrete or continuous
n Order is important, e.g., rank
n Can be treated like interval-scaled
n replace x_{if} by their rank r_{if} \in \{1, ..., M_f\}
n map the range of each variable onto [0, 1] by replacing
i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
n compute the dissimilarity using methods for interval-scaled
variables
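
The rank mapping can be sketched in Python using the drink-size attribute (small, medium, large), so M_f = 3. The function and variable names are illustrative, and the absolute difference of the normalized ranks stands in for whichever interval-scaled distance is actually chosen.

```python
# Ordered states of the ordinal attribute; rank 1 is the lowest.
LEVELS = ["small", "medium", "large"]
RANK = {level: rank for rank, level in enumerate(LEVELS, start=1)}

def normalize_rank(rank, num_states):
    """Map a rank in 1..M_f onto [0, 1]: z = (r - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)

z = {level: normalize_rank(RANK[level], len(LEVELS)) for level in LEVELS}
print(z)  # {'small': 0.0, 'medium': 0.5, 'large': 1.0}

# Dissimilarity between two objects on this attribute, treating z as
# interval-scaled (absolute difference is an assumed choice of distance).
d_small_large = abs(z["small"] - z["large"])
d_small_medium = abs(z["small"] - z["medium"])
print(d_small_large, d_small_medium)  # 1.0 0.5
```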

Attributes of Mixed Type

n A database may contain all attribute types
n Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
n One may use a weighted formula to combine their effects
\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
n f is binary or nominal:
d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
n f is numeric: use the normalized distance
n f is ordinal
n Compute ranks r_{if} and
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
n Treat z_{if} as interval-scaled
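
A minimal Python sketch of the combined formula, under assumptions the slide leaves open: the indicator for attribute f is taken to be 0 when either value is missing or when both objects have 0 on an asymmetric binary attribute, and 1 otherwise; the numeric distance is normalized by the attribute's range (max - min over the data set). The function name, the attribute descriptors, and the two example objects are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, attributes):
    """Weighted mixed-type dissimilarity d(i, j).

    attributes holds one (kind, param) pair per attribute f:
      'nominal', 'symmetric_binary', 'asymmetric_binary': param unused
      'numeric': param is the attribute's range (max - min)
      'ordinal': param is M_f; values are ranks in 1..M_f
    """
    numerator = 0.0
    denominator = 0.0
    for x_i, x_j, (kind, param) in zip(obj_i, obj_j, attributes):
        # Assumed indicator rule: a missing value contributes nothing.
        if x_i is None or x_j is None:
            continue
        # Assumed indicator rule: a 0-0 match on an asymmetric binary
        # attribute carries no information.
        if kind == "asymmetric_binary" and x_i == 0 and x_j == 0:
            continue
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if x_i == x_j else 1.0
        elif kind == "numeric":
            d = abs(x_i - x_j) / param
        elif kind == "ordinal":
            z_i = (x_i - 1) / (param - 1)
            z_j = (x_j - 1) / (param - 1)
            d = abs(z_i - z_j)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1.0
    if denominator == 0.0:
        raise ValueError("no comparable attributes")
    return numerator / denominator

# Hypothetical objects: (hair color, drink-size rank out of 3, salary in $1000s)
attributes = [("nominal", None), ("ordinal", 3), ("numeric", 80.0)]
person_a = ("black", 1, 30.0)
person_b = ("brown", 3, 70.0)

d_ab = mixed_dissimilarity(person_a, person_b, attributes)
print(round(d_ab, 4))  # 0.8333, i.e. (1 + 1 + 0.5) / 3
```

Because every per-attribute distance lies in [0, 1], the combined value also lies in [0, 1], regardless of how many attributes of each type are present.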
