❖ Data in the real world is dirty
✔ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
✔ noisy: containing errors or outliers
✔ inconsistent: containing discrepancies in codes
or names
❖ No quality data, no quality mining results!
✔ Quality decisions must be based on quality data
✔ Data warehouse needs consistent integration of
quality data
Major Tasks in Data Pre-processing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization
Data pre-processing methods
Data Cleaning
• Real-world data is incomplete, noisy, and
inconsistent.
• Data cleaning fills in missing values, smooths
out noise while identifying outliers, and corrects
inconsistencies in the data.
Data cleaning methods:
1) Missing Values
• Ignore the tuple
This is usually done when the class label is missing.
It is not effective unless the tuple contains several
attributes with missing values.
• Fill in the missing value manually
This is time-consuming for a large data set with
many missing values.
• Use a global constant to fill in the missing value
e.g., −∞ or “unknown”.
But there is a chance a mining program will
misinterpret “unknown” as an interesting concept.
• Use the attribute mean to fill in the missing value. For
example, if the average income of customers is 25,000,
use this value to replace a missing value for income.
• Use the attribute mean for all samples belonging to the
same class as the given tuple
• Use the most probable value to fill in the missing
value. This value may be determined by regression,
inference-based tools, or decision tree induction.
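The mean-fill strategy above can be sketched in a few lines of Python. The income figures below are invented for illustration, with None marking a missing entry:

```python
from statistics import mean

# Hypothetical customer incomes; None marks a missing entry.
incomes = [25000, 30000, None, 20000, None, 25000]

# Replace each missing value with the mean of the observed values.
observed = [v for v in incomes if v is not None]
fill = mean(observed)                      # (25000+30000+20000+25000)/4 = 25000
filled = [fill if v is None else v for v in incomes]
```

Filling with the class-conditional mean or a regression estimate follows the same pattern, with `fill` computed per class or per tuple instead of globally.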
2) Noisy Data
• Noise is a random error or variance in a
measured variable. Noisy data may be due to
faulty data collection instruments, data entry
problems, and technology limitations.
Handling Noisy Data
1. Binning:
Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a
number of “buckets,” or bins.
1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries
Example: Data for price (in dollars):
15, 4, 8, 21, 21, 24, 28, 25, 34
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
a) Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
Bin 1: 9, 9, 9 🡪 [(4+8+15)/3 = 9]
Bin 2: 22, 22, 22 🡪 [(21+21+24)/3 = 22]
Bin 3: 29, 29, 29 🡪 [(25+28+34)/3 = 29]
b) Smoothing by bin medians
Each value in a bin is replaced by the median of all the values
belonging to the same bin.
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
c) Smoothing by bin boundaries
In smoothing by bin boundaries, the minimum and maximum values
of a bin are the bin boundaries, and each bin value is replaced by
the closest boundary value.
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
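The three smoothing variants from the example above can be sketched in plain Python (no external libraries). In the boundary case, a value equidistant from both boundaries is sent to the lower one, which is one possible tie-breaking convention:

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth(bin_values, how):
    """Replace every value in a (sorted) bin by its mean, median, or closest boundary."""
    if how == "mean":
        return [mean(bin_values)] * len(bin_values)
    if how == "median":
        return [median(bin_values)] * len(bin_values)
    if how == "boundaries":
        lo, hi = bin_values[0], bin_values[-1]
        return [lo if v - lo <= hi - v else hi for v in bin_values]
    raise ValueError(how)

prices = [15, 4, 8, 21, 21, 24, 28, 25, 34]
bins = equal_frequency_bins(prices, 3)     # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
means = [smooth(b, "mean") for b in bins]  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
```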
Example 2: Partition the given data into 4 bins using the
equi-depth (equal-frequency) binning method and perform
smoothing according to the following methods
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24,
30, 40, 45, 45, 45, 71, 72, 73, 75
a) Smoothing by bin mean
b) Smoothing by bin median
c) Smoothing by bin boundaries
Divide the data into 4 equal-depth bins:
bin 1: 11, 13, 13, 15, 15, 16
bin 2: 19, 20, 20, 20, 21, 21
bin 3: 22, 23, 24, 30, 40, 45
bin 4: 45, 45, 71, 72, 73, 75
Smoothing by bin means
bin 1: 13.83, 13.83, 13.83, 13.83, 13.83, 13.83 🡪 [(11+13+13+15+15+16)/6 = 13.83]
bin 2: 20.17, 20.17, 20.17, 20.17, 20.17, 20.17 🡪 [(19+20+20+20+21+21)/6 = 20.17]
bin 3: 30.67, 30.67, 30.67, 30.67, 30.67, 30.67 🡪 [(22+23+24+30+40+45)/6 = 30.67]
bin 4: 63.5, 63.5, 63.5, 63.5, 63.5, 63.5 🡪 [(45+45+71+72+73+75)/6 = 63.5]
Smoothing by bin boundaries
bin 1: 11, 11, 11, 16, 16, 16
bin 2: 19, 19, 19, 19, 21, 21 (20 is equidistant from both
boundaries; ties are broken toward the lower boundary)
bin 3: 22, 22, 22, 22, 45, 45
bin 4: 45, 45, 75, 75, 75, 75
Smoothing by bin medians
bin 1: 14, 14, 14, 14, 14, 14 🡪 [(13+15)/2 = 14]
bin 2: 20, 20, 20, 20, 20, 20 🡪 [(20+20)/2 = 20]
bin 3: 27, 27, 27, 27, 27, 27 🡪 [(24+30)/2 = 27]
bin 4: 71.5, 71.5, 71.5, 71.5, 71.5, 71.5 🡪 [(71+72)/2 = 71.5]
2. Regression
• Data can be smoothed by fitting the data to a
function, such as with regression.
• Linear regression involves finding the “best”
line to fit two attributes, so that one attribute can
be used to predict the other.
• Multiple linear regression is an extension where
more than two attributes are involved and the data
are fit to a multidimensional surface.
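As a sketch, a least-squares line between two attributes can be fit directly from the attribute means. The x and y values below are invented noisy observations roughly following y = 2x + 1:

```python
# Minimal least-squares fit of y = slope*x + intercept for two attributes.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]            # noisy observations of roughly 2x + 1
slope, intercept = linear_fit(xs, ys)      # about 1.97 and 1.09
smoothed = [slope * x + intercept for x in xs]  # smoothed values on the fitted line
```

Each noisy y is then replaced by the corresponding point on the fitted line, which is the sense in which regression "smooths" the data.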
3. Clustering
• Outliers may be detected by clustering.
• Similar values are organized into groups/clusters.
• Values that fall outside the clusters may be considered
outliers.
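A toy one-dimensional illustration of this idea: sort the values, start a new cluster whenever the gap to the previous value exceeds a threshold, and flag single-member clusters as outliers. The data and the gap threshold of 5 are assumptions for the sketch:

```python
def cluster_1d(values, gap=5):
    """Group sorted values into clusters, splitting where the gap exceeds `gap`."""
    s = sorted(values)
    clusters = [[s[0]]]
    for v in s[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([v])      # large gap: start a new cluster
        else:
            clusters[-1].append(v)
    return clusters

values = [20, 21, 22, 24, 25, 60, 18, 19, 23]
clusters = cluster_1d(values)         # [[18, 19, 20, 21, 22, 23, 24, 25], [60]]
outliers = [c[0] for c in clusters if len(c) == 1]   # 60 falls outside the group
```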
Data Cleaning as a process
• The first step in data cleaning is discrepancy
detection.
✔ Discrepancies are caused by poorly designed
data entry forms, human error in data entry,
deliberate errors, data decay, inconsistent
data representation, and inconsistent use of
codes.
✔ Field overloading may also cause discrepancies.
• Use metadata for discrepancy detection.
• Data should be examined using unique rules,
consecutive rules, and null rules.
• A unique rule says that each value of the given attribute
must be different from all other values for that attribute.
• A consecutive rule says there can be no missing values
between the lowest and highest values for the attribute,
and that all values must also be unique.
• A null rule specifies the use of blanks, question marks,
special characters, or other strings that may indicate the null
condition, and how such values should be handled.
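A sketch of checking one attribute column against a unique rule and a null rule. The set of null markers and the sample IDs are assumptions; a real tool would read them from metadata:

```python
# Strings that, by assumption, indicate the null condition in this data set.
NULL_MARKERS = {"?", "", "unknown"}

def check_column(values):
    """Return the row indices that violate the null rule and the unique rule."""
    null_rows, duplicate_rows, seen = [], [], set()
    for i, v in enumerate(values):
        if v in NULL_MARKERS:
            null_rows.append(i)       # null rule: flag for special handling
        elif v in seen:
            duplicate_rows.append(i)  # unique rule: value already used
        else:
            seen.add(v)
    return {"null_rows": null_rows, "duplicate_rows": duplicate_rows}

ids = ["C001", "C002", "?", "C002", "unknown"]
report = check_column(ids)            # rows 2 and 4 are null; row 3 duplicates C002
```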
• Data scrubbing tools use simple domain
knowledge to detect errors and make corrections
in the data; they often employ parsing and fuzzy
matching techniques.
• Data auditing tools find discrepancies by
analyzing the data to discover rules and
relationships, and detecting data that violate such
conditions.
• Data transformation defines and applies a series
of transformations to correct the discrepancies.
• Extraction/Transformation/Loading (ETL) tools
allow users to specify transformations through a
graphical user interface.