Data Mining
Data
CS 584 :: Fall 2024
Ziwei Zhu
Department of Computer Science
George Mason University
Parts of these slides are from Drs. Tan, Steinbach, and Kumar.
Parts of these slides are from Dr. James Caverlee.
Outline
• Attributes and objects
• Types of data
• Data Preprocessing
Outline
➢ Attributes and objects
• Types of data
• Data Preprocessing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
- An object is also known as a record, point, case, sample, entity, or instance
Types of Attributes
• There are different types of attributes
• Categorical
- Nominal
‣ Examples: ID numbers, eye color, zip codes
- Ordinal
‣ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades
• Numeric
- Interval
‣ Examples: temperatures in Celsius or Fahrenheit
- Ratio
‣ Examples: length, counts, temperatures in kelvin (0 means no heat)
• Attributes can also be discrete or continuous
Difference Between Ratio and Interval
• The ratio of two values of an interval attribute has no meaningful interpretation:
o Is it physically meaningful to say that a temperature of 10 °F is twice that of 5 °F?
• A ratio attribute has a true zero, but an interval attribute does not:
o 0 kelvin means no heat
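One way to check the Fahrenheit question is to convert both readings to kelvin, which has a true zero, and compare the two ratios. This is a minimal sketch; the helper name `fahrenheit_to_kelvin` is only for illustration.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to kelvin."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


# Ratio of the raw Fahrenheit readings (interval scale).
ratio_f = 10.0 / 5.0

# Ratio of the same two temperatures on the kelvin scale (ratio scale).
ratio_k = fahrenheit_to_kelvin(10.0) / fahrenheit_to_kelvin(5.0)

print(f"Ratio of Fahrenheit readings: {ratio_f:.3f}")
print(f"Ratio of the same temperatures in kelvin: {ratio_k:.3f}")
```

The Fahrenheit readings differ by a factor of 2, while the same temperatures in kelvin differ by only about 1%, so "twice as hot" is an artifact of where the Fahrenheit scale places its zero.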
Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
o Distinctness: = ≠
o Order: < >
o Differences: + -
o Ratios: * /
o Nominal attribute: distinctness
o Ordinal attribute: distinctness & order
o Interval attribute: distinctness, order & differences
o Ratio attribute: all 4 operations
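The mapping from attribute type to meaningful operations can be written down as a small lookup table. This is only an illustrative sketch; the names `ALLOWED_OPERATIONS` and `supports` are made up for this example.

```python
# Operations that are meaningful for each attribute type, as listed on the slide.
ALLOWED_OPERATIONS = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "differences"},
    "ratio": {"distinctness", "order", "differences", "ratios"},
}


def supports(attribute_type: str, operation: str) -> bool:
    """Return True if the operation is meaningful for the attribute type."""
    return operation in ALLOWED_OPERATIONS[attribute_type]


# Each type keeps every operation of the type before it and adds one more.
levels = ["nominal", "ordinal", "interval", "ratio"]
is_nested = all(
    ALLOWED_OPERATIONS[a] <= ALLOWED_OPERATIONS[b]
    for a, b in zip(levels, levels[1:])
)

print(supports("ordinal", "order"))    # ordering grades is meaningful
print(supports("interval", "ratios"))  # ratios of Celsius values are not
print(is_nested)
```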
Outline
• Attributes and objects
➢ Types of data
• Data Preprocessing
Types of data sets
• Record
o Document Data
o Transaction Data
• Graph
o World Wide Web
o Molecular Structures
• Ordered
o Spatial Data
o Temporal Data
o Sequential Data
o Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which has a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Record Data – Document Data
• Each document becomes a ‘term’ vector
o Each term is a component (attribute) of the vector
o The value of each component is the number of times the corresponding term occurs in the document
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
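A term vector can be built by counting how often each vocabulary term occurs in a document. The sketch below assumes whitespace tokenization and lowercasing; the vocabulary and the sample sentence are hypothetical, not taken from the table.

```python
from collections import Counter


def term_vector(text: str, vocabulary: list[str]) -> list[int]:
    """Count how many times each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
document = "The team will play the game and the coach wants the team to score"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Words outside the vocabulary (here "the", "will", and so on) are ignored, and terms that never occur get a count of 0, which is why document vectors are usually sparse.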
Record Data – Transaction Data
• A special type of record data, where
o Each transaction involves a set of items.
o Transaction data can be represented as record data (one binary attribute per item)
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
    Bread  Beer  Coke  Diaper  Milk
T1  1      0     1     0       1
T2  1      1     0     0       0
T3  0      1     1     1       1
T4  1      1     0     1       1
T5  0      0     1     1       1
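The conversion from item sets to binary records can be sketched in a few lines. The data is the five transactions from the slide; the variable names are chosen for this example.

```python
# The five transactions from the slide, keyed by transaction ID.
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
    4: ["Beer", "Bread", "Diaper", "Milk"],
    5: ["Coke", "Diaper", "Milk"],
}

# Fixed column order for the binary record representation.
items = ["Bread", "Beer", "Coke", "Diaper", "Milk"]

# One row per transaction: 1 if the item is in the basket, otherwise 0.
binary_records = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

for tid, row in binary_records.items():
    print(f"T{tid}", row)
```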
Graph Data
• Nodes and links between nodes (links can be directed or undirected, and of different types)
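One common way to store such data is an adjacency list, which maps each node to the set of nodes it links to. The sketch below uses a hypothetical three-node graph and an illustrative helper name, `build_adjacency`.

```python
def build_adjacency(edges, directed=False):
    """Map each node to the set of nodes it links to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        # Register the target so nodes without outgoing links still appear.
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return adjacency


# Hypothetical links between three nodes.
edges = [("A", "B"), ("B", "C"), ("A", "C")]

undirected = build_adjacency(edges)
directed = build_adjacency(edges, directed=True)
print(undirected)
print(directed)
```

In the directed version a link from A to B does not imply a link from B to A, so node C ends up with no outgoing links.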
Graph Data
• Examples: the Web, social networks, molecular structures
Ordered Data
o Sequential Data
o Spatial Data
o Temporal Data
o Genetic Sequence Data
Ordered Data
Sequences of transactions/behaviors/words
Ordered Data
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Real World Data is a Mess
• Noise:
o Errors and outliers
o e.g., Salary = -100 (error)
• Missing values:
o Missing some attribute values, or lacking the attributes you care about
o e.g., Occupation = Null (missing)
• Duplicate data:
o e.g., Same person with multiple emails
Outline
• Attributes and objects
• Types of data
➢ Data Preprocessing
Typical Data Cleaning Tasks
• Task 1: Missing Values
• Task 2: Duplicates
• Task 3: Data Reduction
• Task 4: One-hot Encoding
• Task 5: Normalizing
Task 1: Missing Values
Most real data collected from sensors, surveys, or agents contains a high percentage of N/A values, nulls, or special placeholder values (e.g., 99999).
What do we do? Which strategies can we use?
Task 1: Missing Values
• Ignore the record:
o usually done when the class label is missing (when doing classification); not effective
• Fill in the missing value manually:
o tedious and often infeasible
• Fill it in automatically with:
o the attribute mean
o smarter: the attribute mean for all samples belonging to the same class
o the most probable value: inference-based, such as a machine learning model that predicts the missing value given the other known attributes
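The two mean-based strategies can be sketched with the standard library. The records below are hypothetical (class label, income in thousands), with `None` marking a missing value; the function names are chosen for this example, and the class-mean version assumes every class has at least one known value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, income in thousands); None is a missing value.
records = [
    ("No", 125.0), ("No", 100.0), ("No", None),
    ("Yes", 95.0), ("Yes", None), ("Yes", 85.0),
]


def impute_with_mean(rows):
    """Replace missing values with the mean of all known values."""
    overall = mean(value for _, value in rows if value is not None)
    return [(label, overall if value is None else value) for label, value in rows]


def impute_with_class_mean(rows):
    """Replace missing values with the mean of known values in the same class."""
    by_class = defaultdict(list)
    for label, value in rows:
        if value is not None:
            by_class[label].append(value)
    class_means = {label: mean(values) for label, values in by_class.items()}
    return [
        (label, class_means[label] if value is None else value)
        for label, value in rows
    ]


print(impute_with_mean(records))
print(impute_with_class_mean(records))
```

With the overall mean, both missing incomes get the same value; with the class mean, each missing income is filled from records that share its label, which is why the slide calls it the smarter option.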
Task 2: Duplicates
• In many scenarios, we may have duplicate entries
• E.g., a collection of users including
Ziwei Zhu
Z Zhu
Zhu, Ziwei
…
• Solution?
Task 2: Duplicates
• Similarity measures: how to define? Exact match or soft match? (e.g., Ziwei Zhu vs Zhiwei Chu)
• Machine learning for classifying pairs as duplicates or not
• Clustering and merging records
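A soft match can be sketched with `difflib.SequenceMatcher` from the Python standard library. The normalization step (lowercasing, dropping commas, sorting tokens) is an assumption made for this example, not a method prescribed by the slides.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop commas, and sort tokens so word order does not matter."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


names = ["Ziwei Zhu", "Z Zhu", "Zhu, Ziwei", "Zhiwei Chu"]
for other in names[1:]:
    print(f"{names[0]!r} vs {other!r}: {similarity(names[0], other):.2f}")
```

After normalization, "Ziwei Zhu" and "Zhu, Ziwei" match exactly. The other pairs get scores between 0 and 1, so deciding whether they are duplicates needs a threshold, and a soft match can also score two different people as similar.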
Task 3: Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Reduce Objects
• Reduce Attributes
Task 3: Data Reduction – Reduce Objects
o Sampling:
• A sample is representative if it has approximately the same properties (of interest) as the original set of data (progressively increase the sample size until it does)
• Sampling can be done with replacement or without replacement
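The two sampling modes map directly onto two functions in Python's `random` module. This is a small sketch with a hypothetical population of ten objects and a fixed seed for repeatability.

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
population = list(range(10))  # hypothetical data objects 0..9

# With replacement: the same object can be drawn more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each object can be drawn at most once.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```

Sampling without replacement can never return more objects than the population holds, while sampling with replacement can draw any number of objects.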
Task 3: Data Reduction - Reduce Attributes
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Distances between objects become more uniform and less meaningful, which critically affects the performance of clustering and classification tasks.
Curse of Dimensionality
• Randomly generate 500 points
• Compute the difference between the max and min distance between any pair of points, relative to the min distance:
$\dfrac{\text{max\_distance} - \text{min\_distance}}{\text{min\_distance}}$
• The notions of distance between samples, which are critical for clustering and classification, become less meaningful.
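The experiment can be reproduced with a short script. The slide does not say how the points are generated or which distance is used, so this sketch assumes points drawn uniformly from the unit hypercube and Euclidean distance; the function name `relative_contrast` is chosen for this example.

```python
import math
import random
from itertools import combinations


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max_distance - min_distance) / min_distance over all point pairs."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit hypercube [0, 1)^n_dims.
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    # Assumption: Euclidean distance between every pair of points.
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    return (max(distances) - min(distances)) / min(distances)


for n_dims in (2, 10, 50):
    print(n_dims, round(relative_contrast(500, n_dims), 2))
```

Under these assumptions the printed ratio should drop sharply as the number of dimensions grows, because the nearest and farthest pairs end up at comparable distances.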
Task 3: Data Reduction - Reduce Attributes
• Principal Component Analysis (PCA) (we will learn it later!)
• Feature Selection
• Others: supervised, unsupervised, non-linear methods
Task 3: Data Reduction - Reduce Attributes
• Feature Selection
• Redundant features
o Duplicate much or all of the information contained in one or more other attributes
o Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
o Contain no useful information for the data mining task
o Example: a student's ID is often irrelevant to the task of predicting the student's GPA
• Many techniques have been developed, especially for classification
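One simple heuristic for spotting redundant features is to drop any feature that is almost perfectly correlated with one already kept. This is an illustrative filter, not a method named on the slides; the data, the 0.95 threshold, and the function names are assumptions for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: sales tax is a fixed 5% of the price, so it is redundant.
features = {
    "price": [10.0, 20.0, 30.0, 40.0],
    "sales_tax": [0.5, 1.0, 1.5, 2.0],
    "weight": [3.0, 1.0, 4.0, 2.0],
}


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


print(drop_redundant(features))
```

Correlation only catches linear redundancy between pairs of features, and it says nothing about irrelevant features such as a student ID, which have to be judged against the prediction target.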
Task 3: Data Reduction - Reduce Attributes
• Feature Selection
https://scikit-learn.org/stable/modules/feature_selection.html
Task 4: One-hot Encoding
• Suppose we’ve got four majors: [English, History, Math, CS]
• Many of our downstream analyses will only understand data as numbers (integer, float, etc.)
• We can map [English, History, Math, CS] -> [0, 1, 2, 3]
• But that implies an order, e.g., CS > English
• One alternative: One-hot Encoding
Task 4: One-hot Encoding
[English, History, Math, CS]
becomes
English: [1, 0, 0, 0]
History: [0, 1, 0, 0]
Math: [0, 0, 1, 0]
CS: [0, 0, 0, 1]
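The encoding can be sketched as a small function that puts a 1 in the position of the matching category and 0 everywhere else. The function name `one_hot` is chosen for this example.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of the value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


majors = ["English", "History", "Math", "CS"]
encoded = {major: one_hot(major, majors) for major in majors}

for major, vector in encoded.items():
    print(f"{major}: {vector}")
```

Every pair of encoded majors is equally far apart, so no major is treated as larger or smaller than another. The cost is one new attribute per category.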
Task 5: Normalizing
• Features have different scales
o GPA vs. Age vs. Height
• Map to a common scale
o Z-score
o Min-max
Task 5: Normalizing
z-score normalization (standardization in statistics)
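The standard z-score subtracts the mean and divides by the standard deviation, $z = (x - \mu) / \sigma$. The sketch below assumes the population standard deviation (`statistics.pstdev`); the sample standard deviation is also commonly used. The ages are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]


ages = [18.0, 20.0, 22.0, 24.0]  # hypothetical ages
standardized_ages = z_score(ages)
print([round(z, 3) for z in standardized_ages])
```

The function divides by zero if all values are identical, so a constant feature needs separate handling.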
Task 5: Normalizing
Min-max Scaling
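The usual min-max formula maps the smallest value to 0 and the largest to 1: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. A minimal sketch with hypothetical GPA values:

```python
def min_max(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]


gpas = [2.0, 3.0, 3.5, 4.0]  # hypothetical GPAs
scaled_gpas = min_max(gpas)
print(scaled_gpas)
```

Because the result depends only on the minimum and maximum, a single outlier can squeeze all other values into a narrow band. As with the z-score, a constant feature would cause a division by zero.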
What we learned so far…
• Attributes and objects
• Types of data
• Data Preprocessing