Data Mining:
Concepts and Techniques
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Reduction
Data Transformation and Data Discretization
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: is the data correct, or noisy (containing errors, or
values that deviate from the expected)?
Completeness: is anything not recorded (missing attribute values,
or certain attributes of interest …)?
Consistency: are there discrepancies, e.g., in the department codes
used to categorize items?
Timeliness: is the data updated in a timely fashion?
Believability: how much do users trust the data?
Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction (e.g. sampling)
Data transformation and data discretization
Normalization
…
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., from faulty instruments, human or computer error, or transmission errors
incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010” (a consistency check is sketched below)
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
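A minimal pandas sketch of that kind of cross-field consistency check; the column names, reference date, and one-year tolerance are illustrative assumptions, not a standard recipe:
```python
import pandas as pd

# Toy records; row 0 reproduces the slide's Age="42" vs. Birthday="03/07/2010".
df = pd.DataFrame({
    "age":      [42, 30],
    "birthday": pd.to_datetime(["2010-03-07", "1993-05-20"]),
})

# Re-derive age from the birthday as of a fixed (hypothetical) reference date
# and flag rows where it contradicts the recorded age.
ref = pd.Timestamp("2023-01-01")
derived = (ref - df["birthday"]).dt.days // 365
inconsistent = df[(df["age"] - derived).abs() > 1]
print(inconsistent)
```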
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
features, such as customer income in sales data
Missing data may be due to
equipment malfunction
data deleted because it was inconsistent with other recorded data
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data were not recorded
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing
(in classification); not effective when the percentage of
missing values per feature varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the feature mean
the feature mean for all samples belonging to the same
class: smarter
the most probable value: inference-based such as
Bayesian formula or decision tree
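As a concrete sketch of the automatic fill-in strategies above, assuming a pandas DataFrame with made-up column names and values:
```python
import pandas as pd

# Toy customer table with missing incomes (hypothetical data).
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Global constant: flag missing values with a sentinel
# (a numeric stand-in for "unknown").
df["income_flagged"] = df["income"].fillna(-1)

# Feature mean over all samples.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: feature mean within each class.
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```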
Noisy Data
Noise: random error or variance in a measured variable
Incorrect feature values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
First sort the data and partition it into (equal-frequency) bins
Then smooth by bin means, by bin medians, by bin
boundaries, etc.
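A small NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the data values are invented for illustration):
```python
import numpy as np

# Sorted toy prices, partitioned into 3 equal-frequency bins.
data = np.sort([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(data, 3)

# Smooth by bin means: every value in a bin is replaced by the bin's mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smooth by bin boundaries: each value moves to the nearer bin boundary.
by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(by_means)
print(by_bounds)
```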
How to Handle Noisy Data (cont.)
Regression
smooth by fitting the data to regression functions
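For example, noisy measurements of a roughly linear variable can be replaced by their fitted values; a sketch on synthetic data:
```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # noisy linear data

# Fit y ~ a*x + b, then smooth by replacing y with the fitted values.
a, b = np.polyfit(x, y, deg=1)
y_smoothed = a * x + b
```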
How to Handle Noisy Data (cont.)
Clustering
detect and remove outliers
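A sketch using scikit-learn's KMeans (assumed available); the 3x-median-distance cutoff is an arbitrary illustration, not a standard rule:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2)),
               [[20.0, 20.0]]])          # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Treat points far from every centroid as outliers and drop them.
X_clean = X[dist < 3 * np.median(dist)]
```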
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness, consecutive, and null rules
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter’s Wheel)
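A pandas sketch of checking a uniqueness rule, a null rule, and a crude domain rule; the column names and the postal-code pattern are hypothetical placeholders for real metadata:
```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "postal":  ["V5A1S6", "90210", None, "ABC"],
})

# Uniqueness rule: each cust_id must occur at most once.
dup_violations = df[df["cust_id"].duplicated(keep=False)]

# Null rule: the postal code must not be missing.
null_violations = df[df["postal"].isna()]

# Crude domain rule (illustrative only): postal codes are 5-6 alphanumerics.
mask = df["postal"].notna() & ~df["postal"].str.match(r"^[A-Za-z0-9]{5,6}$", na=False)
bad_format = df[mask]
```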
Feature Engineering
Feature Extraction / Construction aims to reduce the number
of features in a dataset by creating new features from the existing
ones (and then discarding the original features).
e.g. PCA
Feature Selection: Instead of creating new features, Feature
Selection focuses on choosing a subset of the existing features
that contribute most significantly to the problem.
This process eliminates irrelevant or redundant features while
preserving the important ones.
e.g. Feature Subset Selection
Feature Creation / Generation: Create new features that can
capture the important information in a data set more effectively
than the original ones.
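A tiny pandas example of feature creation; the columns and the derived ratio are invented for illustration:
```python
import pandas as pd

df = pd.DataFrame({
    "purchase_total": [120.0, 80.0, 200.0],
    "n_visits":       [4, 2, 10],
})

# Feature creation: a derived "spend per visit" ratio can capture buying
# behaviour more directly than either raw column alone.
df["spend_per_visit"] = df["purchase_total"] / df["n_visits"]
```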
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant features
Principal Component Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it data reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and the distance between points, which are critical to
clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
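A scikit-learn sketch of PCA, the first technique listed above, keeping enough components to explain 90% of the variance (synthetic data with a 3-dimensional latent structure):
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))                  # hidden low-dimensional structure
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Keep the fewest principal components explaining >= 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)               # e.g., (100, 10) -> (100, 3)
```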
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
Duplicate much or all of the information contained in
one or more other features
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant features
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
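A sketch of simple redundancy and irrelevance filters using the slide's own examples; the column names and correlation threshold are illustrative:
```python
import pandas as pd

df = pd.DataFrame({
    "price":      [10.0, 20.0, 30.0, 40.0],
    "sales_tax":  [0.7, 1.4, 2.1, 2.8],    # redundant: a fixed multiple of price
    "student_id": [1, 2, 3, 4],            # irrelevant to the target
    "gpa":        [3.1, 3.5, 2.9, 3.8],
})

# Redundancy filter: drop one of any pair of near-perfectly correlated features.
corr = df[["price", "sales_tax"]].corr().iloc[0, 1]
if abs(corr) > 0.95:
    df = df.drop(columns=["sales_tax"])

# Irrelevance filter: drop identifiers that carry no predictive information.
df = df.drop(columns=["student_id"])
```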
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
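A sketch of storing only a centroid and diameter per cluster instead of the raw points (scikit-learn assumed; the diameter here is a simple radius-based estimate, twice the maximum distance from the centroid):
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (1000, 2)) for c in (0.0, 3.0, 6.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Keep only each cluster's centroid and diameter instead of 3000 raw points.
for k in range(3):
    members = X[km.labels_ == k]
    centroid = members.mean(axis=0)
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(k, centroid, diameter)
```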
Sampling
Sampling: obtaining a small sample s to represent the
whole data set N
Allows a mining algorithm to run with complexity that is
potentially sub-linear in the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Adaptive sampling methods can help, e.g., stratified
sampling
Note: Sampling may not reduce database I/Os (page at a
time)
Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population, so it
may be drawn more than once
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
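A pandas sketch of the three sampling types above; the "stratum" column and sampling fractions are invented for illustration:
```python
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["rare"] * 50 + ["common"] * 950})

# Simple random sampling without replacement (SRSWOR).
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR): rows can repeat.
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% of each stratum, so the rare group keeps
# roughly its proportional representation despite the skew.
strat = df.groupby("stratum").sample(frac=0.10, random_state=0)
```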
Sampling: With or without Replacement
[Figure: the same raw data sampled with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data (left) vs. a cluster/stratified sample (right)]
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
min-max normalization: to [new_min_A, new_max_A]
    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73600 − 12000) / (98000 − 12000)) × (1.0 − 0) + 0 = 0.716
z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
    v' = (v − μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73600 − 54000) / 16000 = 1.225
Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
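A NumPy sketch of all three normalizations, reproducing the worked examples above on a toy income array:
```python
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0].
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's parameters (mu=54000, sigma=16000).
zscore = (income - 54000) / 16000

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1
# (the +1 guards against values that are exact powers of ten).
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10**j

print(minmax[2])   # ~0.716, matching the worked example
print(zscore[2])   # ~1.225
```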