Data Mining

Data

CS 584 :: Fall 2024

Ziwei Zhu
Department of Computer Science
George Mason University

Parts of these slides are from Drs. Tan, Steinbach, and Kumar.

Parts of these slides are from Dr. James Caverlee.

Outline

• Attributes and objects

• Types of data
• Data Preprocessing

Outline

➢ Attributes and objects

• Types of data
• Data Preprocessing

What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
- An object is also known as a record, point, case, sample, entity, or instance

Types of Attributes
• There are different types of attributes
• Categorical (or discrete)
- Nominal
‣ Examples: ID numbers, eye color, zip codes
- Ordinal
‣ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades
• Numeric (or continuous)
- Interval
‣ Examples: temperatures in Celsius or Fahrenheit
- Ratio
‣ Examples: length, counts, temperatures in kelvin (0 means no heat)

Difference Between Ratio and Interval

• The ratio of two values of an interval attribute has no meaningful interpretation:
o Is it physically meaningful to say that a temperature of 10 °F is twice that of 5 °F?

• A ratio attribute has a true zero, but an interval attribute does not:
o 0 kelvin means no heat

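The 10 °F versus 5 °F question can be checked numerically. The sketch below is only an illustration (the helper names are made up for this example): it converts both temperatures to Celsius and kelvin and shows that the "ratio" changes with the unit on the interval scales, while differences convert consistently.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvin (a ratio scale with a true zero)."""
    return fahrenheit_to_celsius(f) + 273.15


a_f, b_f = 10.0, 5.0  # the two temperatures from the slide, in °F

# The same pair of temperatures gives a different "ratio" in every unit,
# so "twice as hot" is not a meaningful statement for °F or °C.
ratio_f = a_f / b_f
ratio_c = fahrenheit_to_celsius(a_f) / fahrenheit_to_celsius(b_f)
ratio_k = fahrenheit_to_kelvin(a_f) / fahrenheit_to_kelvin(b_f)

# Differences, in contrast, are meaningful on an interval scale:
# a 5 °F gap is always a 5 * 5/9 °C gap.
diff_f = a_f - b_f
diff_c = fahrenheit_to_celsius(a_f) - fahrenheit_to_celsius(b_f)

print(f"ratio in F: {ratio_f:.3f}, in C: {ratio_c:.3f}, in K: {ratio_k:.3f}")
print(f"difference in F: {diff_f:.3f}, in C: {diff_c:.3f}")
```

Only the kelvin ratio (about 1.01) has a physical interpretation, because kelvin has a true zero.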
Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
o Distinctness: = ≠
o Order: < >
o Differences: + -
o Ratios: * /

o Nominal attribute: distinctness
o Ordinal attribute: distinctness & order
o Interval attribute: distinctness, order & differences
o Ratio attribute: all 4 operations

Outline

• Attributes and objects

➢ Types of data
• Data Preprocessing

Types of data sets

• Record
o Document Data
o Transaction Data
• Graph
o World Wide Web
o Molecular Structures
• Ordered
o Spatial Data
o Temporal Data
o Sequential Data
o Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which has a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Record Data – Document Data
• Each document becomes a ‘term’ vector
o Each term is a component (attribute) of the vector
o The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

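A term vector of this kind can be built by counting words. The following is a minimal sketch, assuming whitespace tokenization and a fixed vocabulary; the sample text and the `term_vector` helper are invented for illustration and are not taken from the slides.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for the example.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "team play ball team score play team play play play"

vector = term_vector(doc, vocabulary)
print(vector)  # [3, 0, 5, 1, 1]
```

Terms outside the vocabulary are simply ignored, and terms that never occur get a count of 0, which is why document vectors are usually sparse.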
Record Data – Transaction Data
• A special type of data, where
o Each transaction involves a set of items.
o Can represent transaction data as record data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

    Bread  Beer  Coke  Diaper  Milk
T1      1     0     1       0     1
T2      1     1     0       0     0
T3      0     1     1       1     1
T4      1     1     0       1     1
T5      0     0     1       1     1

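The conversion from item sets to record data shown above can be sketched in a few lines. The variable names are illustrative; the transactions and the column order are the ones from the slide.

```python
# Transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Fixed column order, matching the slide's binary table.
items = ["Bread", "Beer", "Coke", "Diaper", "Milk"]

# One binary attribute per item: 1 if the item is in the basket, else 0.
matrix = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

for tid, row in matrix.items():
    print(f"T{tid}", row)
```

Each transaction becomes a record with a fixed set of attributes, so any method that expects record data can now be applied.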
Graph Data
• Nodes and links between nodes (directed or undirected, different types)
• Examples: the Web, social networks, molecular structures

Ordered Data

o Sequential Data
o Spatial Data
o Temporal Data
o Genetic Sequence Data

Ordered Data
Sequences of transactions/behaviors/words

Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data
Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]

Real World Data is a Mess
• Noise:
o Errors and outliers
o e.g., Salary = -100 (error)

• Missing values:
o Some attribute values are missing, or the attributes you care about are lacking altogether
o e.g., Occupation = Null (missing)

• Duplicate data:
o e.g., Same person with multiple emails

Outline

• Attributes and objects

• Types of data
➢ Data Preprocessing

Typical Data Cleaning Tasks

• Task 1: Missing Values

• Task 2: Duplicates
• Task 3: Data Reduction
• Task 4: One-hot Encoding
• Task 5: Normalizing

Task 1: Missing Values
Most real data collected from sensors, surveys, or agents contains a high percentage of N/A values, nulls, or special placeholder values (e.g., 99999).

What do we do? What strategies?

Task 1: Missing Values
• Ignore the record:
o usually done when the class label is missing (when doing classification); often ineffective because it discards data
• Fill in the missing value manually:
o tedious + infeasible?
• Fill it in automatically with:
o the attribute mean
o smarter: the attribute mean for all samples belonging to the same class
o the most probable value: inference-based, such as a machine learning model that predicts the missing value given the other known attributes

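The two mean-based strategies can be sketched with the standard library alone. This is an illustrative sketch, not the course's reference code: the records, the attribute names (`cls`, `income`), and both helper functions are hypothetical, and the class-based version assumes every class has at least one known value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"cls": "A", "income": 100.0},
    {"cls": "A", "income": None},
    {"cls": "A", "income": 120.0},
    {"cls": "B", "income": 60.0},
    {"cls": "B", "income": None},
]


def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean over all known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the record's own class."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(values) for c, values in known.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


global_filled = fill_with_mean(records, "income")
class_filled = fill_with_class_mean(records, "income", "cls")

print([r["income"] for r in global_filled])
print([r["income"] for r in class_filled])
```

With these records the global mean fills both gaps with the same value (about 93.3), while the class-based version fills 110.0 for class A and 60.0 for class B, which is why the slide calls it the smarter option.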
Task 2: Duplicates

• In many scenarios, we may have duplicate entries
• E.g., a collection of users including
Ziwei Zhu
Z Zhu
Zhu, Ziwei
…
• Solution?

Task 2: Duplicates

• Similarity measures: how to define? Exact match or soft match? (e.g., Ziwei Zhu vs Zhiwei Chu)
• Machine learning for classifying pairs as duplicates or not
• Clustering and merging records

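One possible soft-match measure is a character-level similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the `normalize` rule (lowercasing and flipping "Last, First") and the 0.85 threshold are assumptions made for this example, not part of the slides.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, flip 'Last, First' to 'First Last', and collapse whitespace."""
    name = name.strip().lower()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(a, b, threshold=0.85):
    """Soft match: treat names as duplicates above a (hypothetical) threshold."""
    return similarity(a, b) >= threshold


for other in ["Zhu, Ziwei", "Z Zhu", "Zhiwei Chu"]:
    print(f"Ziwei Zhu vs {other}: {similarity('Ziwei Zhu', other):.2f}")
```

An exact match after normalization already catches "Zhu, Ziwei". The threshold controls how aggressively near misses such as "Zhiwei Chu" are merged, which is the trade-off the slide's question points at.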
Task 3: Data Reduction

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

• Reduce Objects
• Reduce Attributes

Task 3: Data Reduction – Reduce Objects
o Sampling:
• A sample is representative if it has approximately the same properties (of interest) as the original set of data (the sample size can be increased progressively until this holds)
• Sampling with replacement, sampling without replacement

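The difference between the two sampling schemes can be seen with Python's standard `random` module. This is a small sketch; the population of 100 object IDs, the sample size, and the seed are arbitrary choices for the example.

```python
import random

rng = random.Random(0)  # seeded so the example is reproducible
population = list(range(1, 101))  # hypothetical object IDs 1..100

# Without replacement: each object can be picked at most once.
without_replacement = rng.sample(population, k=10)

# With replacement: the same object may be picked more than once.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```

`random.sample` raises a `ValueError` if `k` exceeds the population size, whereas `random.choices` accepts any `k`.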
Task 3: Data Reduction - Reduce Attributes

Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Distances between objects become nearly uniform and less meaningful, which critically affects the performance of clustering and classification tasks.

Curse of Dimensionality

• Randomly generate 500 points
• Compute the relative difference between the max and min distance between any pair of points:

$$\frac{\text{max\_distance} - \text{min\_distance}}{\text{min\_distance}}$$

The notions of distance between samples, which are critical for clustering and classification, become less meaningful.

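The experiment described above can be reproduced with a short script. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the dimensions 2, 10, and 50 are arbitrary choices; the slide does not specify any of these.

```python
import itertools
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """(max_distance - min_distance) / min_distance over all pairs of random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # seeded for reproducibility
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 50)}

for d, contrast in contrast_by_dim.items():
    print(f"{d:>3} dimensions: relative contrast = {contrast:.2f}")
```

The printed contrast shrinks as the number of dimensions grows: the nearest and the farthest pair end up at similar distances, which is what makes distance-based methods less reliable in high dimensions.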
Task 3: Data Reduction - Reduce Attributes

• Principal Component Analysis (PCA) (we will learn it later!)
• Feature Selection
• Others: supervised, unsupervised, non-linear methods

Task 3: Data Reduction - Reduce Attributes
• Feature Selection
• Redundant features
o Duplicate much or all of the information contained in one or more other attributes
o Example: purchase price of a product and the amount of sales tax paid

• Irrelevant features
o Contain no useful information for the data mining task
o Example: a student's ID is often irrelevant to the task of predicting the student's GPA

• Many techniques have been developed, especially for classification

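One simple way to spot redundant features is to look for pairs with very high correlation. The sketch below is only one of many possible techniques and is not claimed to be the one the course uses; the feature values, the 6% tax rate, and the 0.95 threshold are all made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def redundant_pairs(features, threshold=0.95):
    """Return feature pairs whose absolute correlation reaches the threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical product data: sales tax is a fixed 6% of the price.
price = [10.0, 25.0, 40.0, 80.0, 120.0]
features = {
    "price": price,
    "sales_tax": [0.06 * p for p in price],
    "weight_kg": [2.0, 0.5, 3.0, 1.0, 2.5],
}

flagged = redundant_pairs(features)
print(flagged)  # [('price', 'sales_tax')]
```

Since sales tax is a linear function of price, keeping both adds no information, so one of the two can be dropped. Correlation only detects linear redundancy, so it is a first check rather than a complete test.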
Task 3: Data Reduction - Reduce Attributes
• Feature Selection

https://scikit-learn.org/stable/modules/feature_selection.html

Task 4: One-hot Encoding

• Suppose we’ve got four majors: [English, History, Math, CS]
• Many of our downstream analyses will only understand data as a number (integer, float, etc.)
• We can map [English, History, Math, CS] → [0, 1, 2, 3]
• But that implies an order, i.e., CS > English
• One alternative: One-hot Encoding

Task 4: One-hot Encoding

[English, History, Math, CS]

becomes
English: [1, 0, 0, 0]
History: [0, 1, 0, 0]
Math: [0, 0, 1, 0]
CS: [0, 0, 0, 1]

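The mapping above can be written as a small function. This is a minimal sketch; the `one_hot` helper is illustrative, and libraries such as pandas or scikit-learn provide their own encoders.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


majors = ["English", "History", "Math", "CS"]
encoded = {major: one_hot(major, majors) for major in majors}

for major, vector in encoded.items():
    print(f"{major}: {vector}")
```

Every pair of distinct encoded majors is the same distance apart, so no artificial order such as CS > English is introduced.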
Task 5: Normalizing

• Features have different scales
o GPA vs. Age vs. Height

• Map to a common scale
o Z-score
o Min-max

Task 5: Normalizing
z-score normalization (standardization in statistics)

Task 5: Normalizing
Min-max Scaling

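Both normalizations follow the standard textbook definitions: the z-score is (x - mean) / std, and min-max scaling is (x - min) / (max - min). The sketch below implements them with the standard library, assuming the population standard deviation and a feature whose values are not all identical; the `ages` list is made up for the example.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 20, 22, 24, 26]  # hypothetical feature values

print([round(z, 3) for z in z_score(ages)])
print(min_max(ages))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

After z-score normalization the feature has mean 0 and standard deviation 1, while min-max scaling maps it into the range [0, 1]. Some tools use the sample standard deviation instead, which changes the z-scores slightly.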
What we learned so far…

• Attributes and objects

• Types of data
• Data Preprocessing
