Data Mining

Data

CS 584 :: Fall 2024


Ziwei Zhu
Department of Computer Science
George Mason University

Parts of these slides are from Drs. Tan, Steinbach, and Kumar.
Parts of these slides are from Dr. James Caverlee.

Outline

• Attributes and objects


• Types of data
• Data Preprocessing

What is Data?
• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
- Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
- An object is also known as a record, point, case, sample, entity, or instance

Types of Attributes
• There are different types of attributes
• Categorical (or discrete):
- Nominal
‣ Examples: ID numbers, eye color, zip codes
- Ordinal
‣ Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades
• Numeric (or continuous):
- Interval
‣ Examples: temperatures in Celsius or Fahrenheit
- Ratio
‣ Examples: length, counts, temperatures in kelvin (0 means no heat)

Difference Between Ratio and Interval
• The ratio of two values of an interval attribute has no meaningful interpretation:
o Is it physically meaningful to say that a temperature of 10 °F is twice that of 5 °F?
• A ratio attribute has a true zero, but an interval attribute does not:
o 0 kelvin means no heat

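A small sketch of the 10 °F vs 5 °F question: converting both readings to kelvin (a ratio scale) shows that the "twice as hot" reading of the Fahrenheit numbers does not hold. The function and variable names are illustrative and not part of the slides.

```python
# Sketch: ratios computed on an interval scale (Fahrenheit) are not meaningful.
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert degrees Fahrenheit to kelvin (a ratio scale with a true zero)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


low_f, high_f = 5.0, 10.0

# Dividing the raw Fahrenheit readings suggests "twice as hot" ...
naive_ratio = high_f / low_f

# ... but the same two temperatures on the kelvin scale differ by only about 1%.
kelvin_ratio = fahrenheit_to_kelvin(high_f) / fahrenheit_to_kelvin(low_f)

print(f"Ratio of the Fahrenheit readings: {naive_ratio:.2f}")
print(f"Ratio of the same temperatures in kelvin: {kelvin_ratio:.2f}")
```

Differences, by contrast, stay meaningful on an interval scale: a 5-degree gap in Fahrenheit is the same physical gap wherever it occurs on the scale.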
Properties of Attribute Values
• The type of an attribute depends on which of the following properties/operations it possesses:
o Distinctness: = ≠
o Order: < >
o Differences: + -
o Ratios: * /

o Nominal attribute: distinctness
o Ordinal attribute: distinctness & order
o Interval attribute: distinctness, order & differences
o Ratio attribute: all 4 operations

Types of data sets
• Record
o Document Data
o Transaction Data
• Graph
o World Wide Web
o Molecular Structures
• Ordered
o Spatial Data
o Temporal Data
o Sequential Data
o Genetic Sequence Data

Record Data
• Data that consists of a collection of records, each of which has a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Record Data – Document Data
• Each document becomes a ‘term’ vector
o Each term is a component (attribute) of the vector
o The value of each component is the number of times the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

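A minimal sketch of how a document can be turned into a term vector. The vocabulary, the example sentence, and the whitespace tokenizer are assumptions chosen for illustration. Real pipelines typically add steps such as punctuation handling or stemming.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in `text`."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, not taken from the slides.
vocabulary = ["team", "play", "win", "lost"]
document = "the team will play and the team will win"

vector = term_vector(document, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each position of the vector corresponds to one term, so all documents share the same fixed set of attributes and can be stored as record data.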
Record Data – Transaction Data
• A special type of data, where
o Each transaction involves a set of items.
o Transaction data can be represented as record data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

     Bread  Beer  Coke  Diaper  Milk
T1       1     0     1       0     1
T2       1     1     0       0     0
T3       0     1     1       1     1
T4       1     1     0       1     1
T5       0     0     1       1     1

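The conversion from item sets to binary records can be sketched as follows, using the five transactions from the slide. The variable names are illustrative, and the column order is fixed by hand to match the table.

```python
# Transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Column order chosen to match the slide's table.
items = ["Bread", "Beer", "Coke", "Diaper", "Milk"]

# One binary attribute per item: 1 if the item is in the basket, else 0.
records = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

print("   ", *items)
for tid, row in records.items():
    print(f"T{tid}", *row)
```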
Graph Data
• Nodes and links between nodes (directed or undirected, different types)
• Examples:
o The Web
o Social Networks
o Molecular Structures

Ordered Data
o Sequential Data: sequences of transactions/behaviors/words
o Spatial Data
o Temporal Data
o Genetic Sequence Data

Ordered Data
Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data
Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]

Real World Data is a Mess
• Noise:
o Errors and outliers
o e.g., Salary = -100 (error)
• Missing values:
o Some attribute values are missing, or the attributes you care about are absent altogether
o e.g., Occupation = Null (missing)
• Duplicate data:
o e.g., Same person with multiple emails

Typical Data Cleaning Tasks
• Task 1: Missing Values
• Task 2: Duplicates
• Task 3: Data Reduction
• Task 4: One-hot Encoding
• Task 5: Normalizing

Task 1: Missing Values
Much real-world data collected from sensors, surveys, or agents contains a high percentage of N/A or null entries, or special placeholder values (e.g., 99999).

What can we do? Which strategies are available?

Task 1: Missing Values
• Ignore the record:
o usually done when the class label is missing (when doing classification); not effective
• Fill in the missing value manually:
o tedious and often infeasible
• Fill it in automatically with:
o the attribute mean
o smarter: the attribute mean for all samples belonging to the same class
o the most probable value: inference-based, e.g., a machine learning model that predicts the missing value given the other known attributes

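The two mean-based strategies can be sketched in a few lines. The records, the attribute name `income`, and the class attribute `label` are hypothetical. Missing values are assumed to be encoded as `None`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; None marks a missing income.
records = [
    {"income": 125.0, "label": "No"},
    {"income": None, "label": "No"},
    {"income": 95.0, "label": "Yes"},
    {"income": None, "label": "Yes"},
    {"income": 85.0, "label": "Yes"},
    {"income": 75.0, "label": "No"},
]


def impute_with_mean(rows, attr):
    """Fill missing values of `attr` with the mean of the observed values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def impute_with_class_mean(rows, attr, label):
    """Fill missing values of `attr` with the mean within the same class."""
    observed = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            observed[r[label]].append(r[attr])
    # Note: a class with no observed value at all would raise KeyError below.
    class_means = {cls: mean(vals) for cls, vals in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


print(impute_with_mean(records, "income"))
print(impute_with_class_mean(records, "income", "label"))
```

Both functions return new records and leave the input untouched, which makes it easy to compare strategies on the same data.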
Task 2: Duplicates
• In many scenarios, we may have duplicate entries
• E.g., a collection of users including
Ziwei Zhu
Z Zhu
Zhu, Ziwei
• Solution?

Task 2: Duplicates
• Similarity measures: how to define them? Exact match or soft match? (e.g., Ziwei Zhu vs Zhiwei Chu)
• Machine learning for classifying pairs as duplicates or not
• Clustering and merging records

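One possible soft-match measure, sketched with Python's standard `difflib`. The normalization step (lowercasing, dropping punctuation, sorting tokens) and the threshold value are assumptions for illustration, not a method prescribed by the slides.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    return " ".join(sorted(re.findall(r"[a-z]+", name.lower())))


def similarity(a, b):
    """Soft-match score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


THRESHOLD = 0.8  # hypothetical cut-off; in practice, tune it on labeled pairs

names = ["Ziwei Zhu", "Z Zhu", "Zhu, Ziwei", "Zhiwei Chu"]
for other in names[1:]:
    score = similarity(names[0], other)
    verdict = "candidate duplicate" if score >= THRESHOLD else "probably distinct"
    print(f"{names[0]!r} vs {other!r}: {score:.2f} ({verdict})")
```

A purely string-based score cannot tell an abbreviation of the same person from a different person with a similar name, which is one reason the slides also mention learned pair classifiers and clustering.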
Task 3: Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Reduce Objects
• Reduce Attributes

Task 3: Data Reduction – Reduce Objects
o Sampling:
• A sample is representative if it has approximately the same properties (of interest) as the original set of data (progressively increase the sample size)
• Sampling with replacement, sampling without replacement

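The two sampling modes map directly onto Python's standard `random` module. The population and sample size below are made up for illustration, and the seed only makes the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 11))  # hypothetical object IDs 1..10

# Without replacement: each object can be picked at most once.
without_replacement = rng.sample(population, k=5)

# With replacement: the same object may be picked several times.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```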
Task 3: Data Reduction - Reduce Attributes
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Distances between objects become nearly uniform and less meaningful, which critically affects the performance of clustering and classification tasks.

Curse of Dimensionality
The notions of distance between samples, which are critical for clustering and classification, become less meaningful.
• Randomly generate 500 points
• Compute the relative difference between the max and min distance between any pair of points:
$$\frac{\text{max\_distance} - \text{min\_distance}}{\text{min\_distance}}$$

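The experiment can be reproduced roughly as follows. The slides do not say how the 500 points are drawn, so this sketch assumes coordinates sampled uniformly from the unit hypercube and Euclidean distance. The function name `relative_contrast` is made up for illustration.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """Return (max_distance - min_distance) / min_distance over all point pairs."""
    rng = random.Random(seed)
    # Assumption: points are uniform in the unit hypercube [0, 1]^dim.
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]
    distances = [
        math.dist(points[i], points[j])  # Euclidean distance (Python 3.8+)
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    d_min, d_max = min(distances), max(distances)
    # d_min is zero only if two random points coincide exactly.
    return (d_max - d_min) / d_min


for dim in (2, 10, 50):
    print(f"dim={dim:3d}  relative contrast={relative_contrast(500, dim):.2f}")
```

As the dimension grows, the printed value should shrink: the nearest and the farthest pair end up at comparable distances, which is the effect the slide describes.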
Task 3: Data Reduction - Reduce Attributes
• Principal Component Analysis (PCA) (we will learn it later!)
• Feature Selection
• Others: supervised, unsupervised, non-linear methods

Task 3: Data Reduction - Reduce Attributes
• Feature Selection
• Redundant features
o Duplicate much or all of the information contained in one or more other attributes
o Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
o Contain no useful information for the data mining task
o Example: students' ID is often irrelevant to the task of predicting students' GPA
• Many techniques have been developed, especially for classification

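One simple way to spot a redundant feature is to check its correlation with another feature. The sketch below uses the slide's price and sales-tax example with made-up prices and a hypothetical flat tax rate. Pearson correlation is only one possible redundancy check and is not named in the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


TAX_RATE = 0.06  # hypothetical flat sales-tax rate
prices = [10.0, 25.0, 40.0, 80.0]  # hypothetical purchase prices
sales_tax = [price * TAX_RATE for price in prices]

r = pearson(prices, sales_tax)
print(f"correlation(price, sales tax) = {r:.3f}")
```

A correlation at or near 1 means the tax column carries no information beyond the price column, so one of the two can be dropped.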
Task 3: Data Reduction - Reduce Attributes
• Feature Selection
https://scikit-learn.org/stable/modules/feature_selection.html

Task 4: One-hot Encoding
• Suppose we’ve got four majors: [English, History, Math, CS]
• Many of our downstream analyses will only understand data as a number (integer, float, etc.)
• We can map [English, History, Math, CS] -> [0, 1, 2, 3]
• But that implies an order, e.g., CS > English
• One alternative: One-hot Encoding

Task 4: One-hot Encoding
[English, History, Math, CS] becomes
English: [1, 0, 0, 0]
History: [0, 1, 0, 0]
Math: [0, 0, 1, 0]
CS: [0, 0, 0, 1]

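The mapping above can be written as a small helper. The function name `one_hot` is illustrative. Libraries such as pandas and scikit-learn offer their own encoders.

```python
majors = ["English", "History", "Math", "CS"]


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


for major in majors:
    print(f"{major}: {one_hot(major, majors)}")
```

Every pair of distinct majors ends up equally far apart, so no artificial order such as CS > English is introduced.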
Task 5: Normalizing
• Features have different scales
o GPA vs. Age vs. Height
• Map to a common range
o Z-score
o Min-max

Task 5: Normalizing
z-score normalization (standardization in statistics)

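The slide's formula did not survive extraction. The standard definition is z = (x - mean) / standard deviation, sketched below. Using the population standard deviation (`pstdev`) rather than the sample version is an assumption, and the input values are made up.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values with mean 5 and population std 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = z_score(feature)
print(scaled)
```

After scaling, a value of 1.0 means "one standard deviation above the mean", regardless of the feature's original unit.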
Task 5: Normalizing
Min-max Scaling

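The slide's formula is also missing here. The usual form is x' = (x - min) / (max - min), which maps the feature to [0, 1]. The sketch below assumes that form. The optional target range and the example heights are illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("min-max scaling is undefined for a constant feature")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


heights_cm = [150, 160, 170, 190]  # hypothetical heights
scaled_heights = min_max_scale(heights_cm)
print(scaled_heights)
```

Unlike the z-score, min-max scaling depends entirely on the two extreme values, so a single outlier can compress all other values into a narrow band.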
What we learned so far…

• Attributes and objects


• Types of data
• Data Preprocessing
