❖ Data in the real world is dirty
✔ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
✔ noisy: containing errors or outliers
✔ inconsistent: containing discrepancies in codes
or names
❖ No quality data, no quality mining results!
✔ Quality decisions must be based on quality data
✔ Data warehouse needs consistent integration of
quality data
Major Tasks in Data Pre-processing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization
Data pre-processing methods
Data Cleaning
• Real-world data is incomplete, noisy, and
inconsistent.
• Data cleaning fills in missing values, smooths
out noise while identifying outliers, and corrects
inconsistencies in the data.
Data cleaning methods:
1) Missing Values
• Ignore the tuple
This is usually done when the class label is missing.
It is not effective unless the tuple contains several
attributes with missing values.
• Fill in the missing value manually
This is time-consuming for a large data set with
many missing values.
• Use a global constant to fill in the missing value
e.g., −∞ or “unknown”.
But there is a chance a mining program will
misinterpret “unknown” as an interesting concept.
• Use the attribute mean to fill in the missing value. For
example, if the average income of customers is 25,000,
use this value to replace a missing value for income.
• Use the attribute mean for all samples belonging to the
same class as the given tuple
• Use the most probable value to fill in the missing
value. This value may be determined by regression,
inference-based tools, or decision tree induction.
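The mean-fill strategy above can be sketched in a few lines of Python. The income figures below are invented for illustration, with None marking a missing entry:

```python
from statistics import mean

# Hypothetical customer incomes; None marks a missing entry.
incomes = [25000, 30000, None, 20000, None, 25000]

# Replace each missing value with the mean of the observed values.
observed = [v for v in incomes if v is not None]
fill = mean(observed)                      # (25000+30000+20000+25000)/4 = 25000
filled = [fill if v is None else v for v in incomes]
```

Filling with the class-conditional mean or a regression estimate follows the same pattern, with `fill` computed per class or per tuple instead of globally.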
2) Noisy Data
• Noise is a random error or variance in a
measured variable. Noisy data may be due to
faulty data collection instruments, data entry
problems, and technology limitations.
Handling Noisy Data
1. Binning:
Binning methods smooth a sorted data value by
consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a
number of “buckets,” or bins.
1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries
Example: Data for price (in dollars):
15, 4, 8, 21, 21, 24, 28, 25, 34
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
a) Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
Bin 1: 9, 9, 9 🡪 [(4+8+15)/3 = 9]
Bin 2: 22, 22, 22 🡪 [(21+21+24)/3 = 22]
Bin 3: 29, 29, 29 🡪 [(25+28+34)/3 = 29]
b) Smoothing by bin medians
Each value in a bin is replaced by the median of all the values
belonging to the same bin.
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
c) Smoothing by bin boundaries
In smoothing by bin boundaries, the minimum and maximum values
of a bin are the bin boundaries, and each bin value is replaced by
the closest boundary value.
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
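The three smoothing variants from the example above can be sketched in plain Python (no external libraries). In the boundary case, a value equidistant from both boundaries is sent to the lower one, which is one possible tie-breaking convention:

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    s = sorted(values)
    size = len(s) // n_bins
    return [s[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth(bin_values, how):
    """Replace every value in a (sorted) bin by its mean, median, or closest boundary."""
    if how == "mean":
        return [mean(bin_values)] * len(bin_values)
    if how == "median":
        return [median(bin_values)] * len(bin_values)
    if how == "boundaries":
        lo, hi = bin_values[0], bin_values[-1]
        return [lo if v - lo <= hi - v else hi for v in bin_values]
    raise ValueError(how)

prices = [15, 4, 8, 21, 21, 24, 28, 25, 34]
bins = equal_frequency_bins(prices, 3)     # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
means = [smooth(b, "mean") for b in bins]  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
```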
Example 2: Partition the given data into 4 bins using the
equi-depth (equal-frequency) binning method and perform
smoothing according to the following methods
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24,
30, 40, 45, 45, 45, 71, 72, 73, 75
a) Smoothing by bin mean
b) Smoothing by bin median
c) Smoothing by bin boundaries
Divide the data into 4 equal-depth bins:
bin 1: 11, 13, 13, 15, 15, 16
bin 2: 19, 20, 20, 20, 21, 21
bin 3: 22, 23, 24, 30, 40, 45
bin 4: 45, 45, 71, 72, 73, 75
Smoothing by bin means
bin 1: 13.83, 13.83, 13.83, 13.83, 13.83, 13.83 🡪 [(11+13+13+15+15+16)/6 = 13.83]
bin 2: 20.17, 20.17, 20.17, 20.17, 20.17, 20.17 🡪 [(19+20+20+20+21+21)/6 = 20.17]
bin 3: 30.67, 30.67, 30.67, 30.67, 30.67, 30.67 🡪 [(22+23+24+30+40+45)/6 = 30.67]
bin 4: 63.5, 63.5, 63.5, 63.5, 63.5, 63.5 🡪 [(45+45+71+72+73+75)/6 = 63.5]
Smoothing by bin boundaries
bin 1: 11, 11, 11, 16, 16, 16
bin 2: 19, 19, 19, 19, 21, 21 (20 is equidistant from both
boundaries; ties are broken toward the lower boundary)
bin 3: 22, 22, 22, 22, 45, 45
bin 4: 45, 45, 75, 75, 75, 75
Smoothing by bin medians
bin 1: 14, 14, 14, 14, 14, 14 🡪 [(13+15)/2 = 14]
bin 2: 20, 20, 20, 20, 20, 20 🡪 [(20+20)/2 = 20]
bin 3: 27, 27, 27, 27, 27, 27 🡪 [(24+30)/2 = 27]
bin 4: 71.5, 71.5, 71.5, 71.5, 71.5, 71.5 🡪 [(71+72)/2 = 71.5]
2. Regression
• Data can be smoothed by fitting the data to a
function, such as with regression.
• Linear regression involves finding the “best”
line to fit two attributes, so that one attribute can
be used to predict the other.
• Multiple linear regression is an extension where
more than two attributes are involved and the data
are fit to a multidimensional surface.
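As a sketch, a least-squares line between two attributes can be fit directly from the attribute means. The x and y values below are invented noisy observations roughly following y = 2x + 1:

```python
# Minimal least-squares fit of y = slope*x + intercept for two attributes.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]            # noisy observations of roughly 2x + 1
slope, intercept = linear_fit(xs, ys)      # about 1.97 and 1.09
smoothed = [slope * x + intercept for x in xs]  # smoothed values on the fitted line
```

Each noisy y is then replaced by the corresponding point on the fitted line, which is the sense in which regression "smooths" the data.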
3. Clustering
• Outliers may be detected by clustering.
• Similar values are organized into groups/clusters.
• Values that fall outside the clusters may be considered
outliers.
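A toy one-dimensional illustration of this idea: sort the values, start a new cluster whenever the gap to the previous value exceeds a threshold, and flag single-member clusters as outliers. The data and the gap threshold of 5 are assumptions for the sketch:

```python
def cluster_1d(values, gap=5):
    """Group sorted values into clusters, splitting where the gap exceeds `gap`."""
    s = sorted(values)
    clusters = [[s[0]]]
    for v in s[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([v])      # large gap: start a new cluster
        else:
            clusters[-1].append(v)
    return clusters

values = [20, 21, 22, 24, 25, 60, 18, 19, 23]
clusters = cluster_1d(values)         # [[18, 19, 20, 21, 22, 23, 24, 25], [60]]
outliers = [c[0] for c in clusters if len(c) == 1]   # 60 falls outside the group
```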
Data Cleaning as a process
• The first step in data cleaning is discrepancy
detection.
✔ Discrepancies are caused by poorly designed
data entry forms, human error in data entry,
deliberate errors, data decay, inconsistent
data representation, and inconsistent use of
codes.
✔ Field overloading may also cause discrepancies.
• Use metadata for discrepancy detection.
• Data should be examined using unique rules,
consecutive rules, and null rules.
• A unique rule says that each value of the given attribute
must be different from all other values for that attribute.
• A consecutive rule says there can be no missing values
between the lowest and highest values for the attribute,
and that all values must also be unique.
• A null rule specifies the use of blanks, question marks,
special characters, or other strings that may indicate the null
condition, and how such values should be handled.
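A sketch of checking one attribute column against a unique rule and a null rule. The set of null markers and the sample IDs are assumptions; a real tool would read them from metadata:

```python
# Strings that, by assumption, indicate the null condition in this data set.
NULL_MARKERS = {"?", "", "unknown"}

def check_column(values):
    """Return the row indices that violate the null rule and the unique rule."""
    null_rows, duplicate_rows, seen = [], [], set()
    for i, v in enumerate(values):
        if v in NULL_MARKERS:
            null_rows.append(i)       # null rule: flag for special handling
        elif v in seen:
            duplicate_rows.append(i)  # unique rule: value already used
        else:
            seen.add(v)
    return {"null_rows": null_rows, "duplicate_rows": duplicate_rows}

ids = ["C001", "C002", "?", "C002", "unknown"]
report = check_column(ids)            # rows 2 and 4 are null; row 3 duplicates C002
```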
• Data scrubbing tools use simple domain
knowledge to detect errors and make corrections
in the data; they often employ parsing and fuzzy
matching techniques.
• Data auditing tools find discrepancies by
analyzing the data to discover rules and
relationships, and detecting data that violate such
conditions.
• Data transformation defines and applies a series
of transformations to correct the discrepancies.
• Extraction/Transformation/Loading (ETL) tools
allow users to specify transformations through a
graphical user interface.