CH 03-01 Data Preprocessing

This document provides an overview of data preprocessing. It discusses data quality issues like accuracy, completeness, consistency and timeliness that require preprocessing. The major tasks in preprocessing are data cleaning, integration, reduction, and transformation. Data cleaning involves handling missing data, noisy data, and inconsistencies. Data integration combines data from multiple sources. Data reduction reduces dimensionality and numerosity through techniques like compression. Data transformation includes normalization and discretization.

Uploaded by

akash kahsyap

— Chapter 3 —

Data Preprocessing

Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview
  ◦ Data Quality
  ◦ Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization

Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view
  ◦ Accuracy: correct or wrong, accurate or not
  ◦ Completeness: not recorded, unavailable, …
  ◦ Consistency: some modified but some not, …
  ◦ Timeliness: timely update?
  ◦ Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing

• Data cleaning
  ◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  ◦ Integration of multiple databases, data cubes, or files
• Data reduction
  ◦ Dimensionality reduction
  ◦ Numerosity reduction
  ◦ Data compression
• Data transformation and data discretization
  ◦ Normalization
  ◦ Concept hierarchy generation

Forms of data preprocessing

Data Cleaning

• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  ◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ▪ e.g., Occupation=“ ” (missing data)
  ◦ Noisy: containing noise, errors, or outliers
    ▪ e.g., Salary=“−10” (an error)
  ◦ Inconsistent: containing discrepancies in codes or names, e.g.,
    ▪ Age=“42”, Birthday=“03/07/2010”
    ▪ Was rating “1, 2, 3”, now rating “A, B, C”
    ▪ Discrepancy between duplicate records
  ◦ Intentional (e.g., disguised missing data)

Incomplete (Missing) Data

• Data is not always available
  ◦ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  ◦ Equipment malfunction
  ◦ Data that was inconsistent with other recorded data and thus deleted
  ◦ Data not entered due to misunderstanding
  ◦ Certain data not being considered important at the time of entry


How to Handle Missing Data?

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  ◦ faulty data collection instruments
  ◦ data entry problems
  ◦ data transmission problems
  ◦ technology limitation
  ◦ inconsistency in naming convention
• Other data problems which require data cleaning
  ◦ duplicate records
  ◦ incomplete data
  ◦ inconsistent data

How to Handle Noisy Data?

• Binning
  ◦ First sort the data and partition it into (equal-frequency) bins
  ◦ Then smooth by bin means, by bin medians, by bin boundaries, etc.

Example: sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34

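The binning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the choice of three bins is an assumption (the slide does not state the bin count), and ties in boundary smoothing are resolved toward the lower boundary.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)   # assumed: 3 bins of 3 values each
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `mean`.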
How to Handle Noisy Data?

• Regression
  ◦ Smooth by fitting the data to regression functions
  ◦ Find the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other
• Clustering
  ◦ Detect and remove outliers
• Combined computer and human inspection
  ◦ Detect suspicious values and have a human check them (e.g., deal with possible outliers)

• Three data clusters: outliers may be detected as values that fall outside of the set of clusters.

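As a rough illustration of smoothing by regression, the sketch below fits a least-squares line to two attributes and replaces each observed value with the value predicted by the line. The data and the function names are hypothetical, and ordinary least squares is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy data that roughly follows y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```

The fitted line can also be used for prediction: given a new value of the first attribute, `slope * x + intercept` estimates the second.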
Data Cleaning as a Process

• Data discrepancy detection
  ◦ Use metadata (e.g., domain, range, dependency, distribution)
  ◦ Check field overloading
  ◦ Check uniqueness rule, consecutive rule and null rule
  ◦ Use commercial tools:
    ▪ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    ▪ Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)

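The rule checks listed above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record layout, the field names, the null markers, and the salary range are hypothetical, and real scrubbing or auditing tools do considerably more.

```python
def check_unique(records, field):
    """Uniqueness rule: return the values of `field` that occur more than once."""
    seen, dupes = set(), set()
    for r in records:
        v = r[field]
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)


def check_consecutive(records, field):
    """Consecutive rule: return integers missing between the min and max of `field`."""
    present = {r[field] for r in records}
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def check_null(records, field, null_markers=(None, "", "?")):
    """Null rule: return indexes of records whose `field` holds a null marker."""
    return [i for i, r in enumerate(records) if r[field] in null_markers]


def check_range(records, field, low, high):
    """Range metadata: return indexes of records whose value is outside [low, high]."""
    return [i for i, r in enumerate(records) if not (low <= r[field] <= high)]


# Hypothetical records with one duplicate id, a gap in ids,
# a blank occupation, and a negative salary.
records = [
    {"id": 1, "occupation": "clerk", "salary": 3200},
    {"id": 2, "occupation": "", "salary": 4100},
    {"id": 2, "occupation": "engineer", "salary": -10},
    {"id": 5, "occupation": "teacher", "salary": 2900},
]

print(check_unique(records, "id"))                   # [2]
print(check_consecutive(records, "id"))              # [3, 4]
print(check_null(records, "occupation"))             # [1]
print(check_range(records, "salary", 0, 1_000_000))  # [2]
```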
Data Cleaning as a Process

• Data migration and integration
  ◦ Data migration tools: allow transformations to be specified. Example: moving data from one location to another, one format to another, or one application to another.
  ◦ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface (GUI).
• Integration of the two processes
  ◦ Discrepancy detection and data transformation (to correct discrepancies) iterate.

Data Integration

• Data integration:
  ◦ Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  ◦ Integrate metadata from different sources
• Entity identification problem:
  ◦ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  ◦ For the same real-world entity, attribute values from different sources are different
  ◦ Possible reasons: different representations, different scales, e.g., …

Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated
  ◦ Object identification: the same attribute or object may have different names in different databases
  ◦ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detectable by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
• For nominal data, we use the χ² (chi-square) test.
• For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another.

Correlation Analysis (Nominal Data)

• χ² (chi-square) test

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

Chi-Square Calculation: An Example

• A group of 1500 people was surveyed. The gender of each person was noted. Each person was polled as to whether his or her preferred type of reading material was fiction or nonfiction. The observed frequency of each possible joint event is summarized in the contingency table:

              male        female       Total
Fiction       250 (90)    200 (360)     450
Nonfiction     50 (210)  1000 (840)    1050
Total         300        1200          1500

Chi-Square Calculation: An Example

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$$

• For this 2×2 table, the degrees of freedom are (2−1)(2−1) = 1
• The value needed to reject the hypothesis at the 0.001 significance level is 10.828
• Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are (strongly) correlated for the given group of people.

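The chi-square calculation above can be checked with a short sketch that derives the expected counts from the row and column totals and then sums the per-cell contributions. The function name is hypothetical; the observed counts are those of the contingency table in the example.

```python
def chi_square(observed):
    """Return (chi2, expected, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count of a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, expected, dof


# Rows: fiction, nonfiction. Columns: male, female.
observed = [[250, 200], [50, 1000]]
chi2, expected, dof = chi_square(observed)

print(expected)  # [[90.0, 360.0], [210.0, 840.0]]
print(dof)       # 1
print(chi2)      # about 507.9, far above the 10.828 threshold
```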
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

  where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.

• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation (uncorrelated, though not necessarily independent); $r_{A,B} < 0$: negatively correlated

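A minimal sketch of the correlation coefficient follows, computing both forms of the formula above so that they can be compared. It assumes that σ_A and σ_B are sample standard deviations (denominator n − 1), which is what makes the (n − 1) factor in the formula consistent. The data and the function name are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Return the correlation coefficient computed from both forms of the formula."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Assumption: sample standard deviations, i.e. denominator n - 1.
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    denom = (n - 1) * std_a * std_b
    # First form: sum of products of deviations from the means.
    centered = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Second form: sum of cross-products minus n * mean(A) * mean(B).
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return centered / denom, cross / denom


# Hypothetical data.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 5]
r_centered, r_cross = pearson_r(a, b)
print(r_centered, r_cross)  # both about 0.77: positively correlated

# Reversing B flips the sign of the coefficient.
r_reversed, _ = pearson_r(a, b[::-1])
print(r_reversed)           # negative
```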
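Correlation (viewed as linear relationship)

• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot product

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}, \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = A' \cdot B'$$

• When std is the sample standard deviation, this dot product equals $(n-1)\,r_{A,B}$, so it is divided by $n-1$ to obtain the correlation coefficient.
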
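The standardize-then-dot-product view can be sketched as follows. One assumption is made explicit here: `statistics.stdev` is the sample standard deviation, so the raw dot product of the standardized vectors is divided by n − 1 to land in the range [−1, 1]. The data and helper names are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical data.
a = [1, 2, 3, 4, 5]
b = [2, 4, 5, 4, 5]

a_std, b_std = standardize(a), standardize(b)
raw_dot = dot(a_std, b_std)
r = raw_dot / (len(a) - 1)  # assumed normalization for sample standard deviations

print(raw_dot)  # about 3.10
print(r)        # about 0.77
```

Standardizing a vector against itself gives a coefficient of 1, which is a quick sanity check on the normalization.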
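Covariance (Numeric Data)

• Covariance is similar to correlation

$$\mathrm{Cov}(A,B) = E\big((A-\bar{A})(B-\bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$

where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
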
• Positive covariance: if Cov(A, B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A, B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A, B) = 0, but the converse is not true:
  ◦ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
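A small counterexample makes the last point concrete. In the sketch below, B is completely determined by A (B = A²), yet the covariance is exactly 0 because the relationship is symmetric rather than linear. The data are hypothetical, and the covariance divides by n.

```python
def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


a = [-2, -1, 0, 1, 2]
b = [x * x for x in a]  # B depends entirely on A

cov_ab = covariance(a, b)
print(b)       # [4, 1, 0, 1, 4]
print(cov_ab)  # 0.0
```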

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 5), (4, 14), (5, 10), (6, 20).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?

E(A) = (2 + 3 + 4 + 5 + 6) / 5 = 20/5 = $4

E(B) = (5 + 5 + 14 + 10 + 20) / 5 = 54/5 = $10.80

Using the identity Cov(A, B) = E(A·B) − E(A)·E(B), the covariance between A and B is:

Cov(A, B) = (2×5 + 3×5 + 4×14 + 5×10 + 6×20) / 5 − 4 × 10.80
          = 50.2 − 43.2 = 7

• Thus, A and B rise together since Cov(A, B) > 0.
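The stock calculation can be reproduced with a short sketch. It uses the values that appear in the calculation above (B = 5, 5, 14, 10, 20 paired with A = 2, 3, 4, 5, 6) and the shortcut Cov(A, B) = E(A·B) − E(A)·E(B), dividing by n. The function and variable names are hypothetical.

```python
def expectation_and_covariance(a, b):
    """Return (E(A), E(B), Cov(A, B)) using Cov = E(A*B) - E(A)*E(B)."""
    n = len(a)
    e_a = sum(a) / n
    e_b = sum(b) / n
    e_ab = sum(x * y for x, y in zip(a, b)) / n
    return e_a, e_b, e_ab - e_a * e_b


stock_a = [2, 3, 4, 5, 6]
stock_b = [5, 5, 14, 10, 20]

e_a, e_b, cov = expectation_and_covariance(stock_a, stock_b)
print(e_a, e_b)  # 4.0 10.8
print(cov)       # about 7 (subject to floating-point rounding), so Cov(A, B) > 0
```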
