Data preprocessing
Python for AI
What is Data?
• Collection of data objects and
their attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field, characteristic,
or feature
• A collection of attributes
describes an object
– Object is also known as
record, point, case, sample,
entity, or instance
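A minimal sketch of these terms in Python, assuming a data set is stored as a list of dicts: each dict is one object (record) and each key is an attribute. The records and attribute names are invented for illustration.

```python
# Each dict is one data object (record); each key is an attribute.
# The records below are made up for illustration only.
people = [
    {"id": 1, "eye_color": "brown", "temperature": 36.6},
    {"id": 2, "eye_color": "blue", "temperature": 37.1},
    {"id": 3, "eye_color": "green", "temperature": 36.9},
]

# The attributes of the data set are the keys shared by the objects.
attributes = sorted(people[0].keys())

# One attribute (a "column") collected across all objects.
eye_colors = [person["eye_color"] for person in people]

print(attributes)
print(eye_colors)
```

Reading down the list gives the objects; reading one key across all dicts gives one attribute.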
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute
values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of
values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
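The distinction can be sketched in a few lines of Python. The IDs, ages, and the age bounds of 0 and 150 are assumptions chosen only for illustration.

```python
METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot

# Same attribute (height), two different attribute values.
height_m = 1.80
height_ft = height_m / METERS_PER_FOOT

# Different attributes, same kind of value (integers).
ids = [1001, 1002, 1003]  # hypothetical IDs
ages = [23, 35, 41]       # hypothetical ages

# Averaging ages is meaningful; averaging IDs is not.
mean_age = sum(ages) / len(ages)

# For IDs, the useful property is uniqueness, not magnitude.
ids_are_unique = len(set(ids)) == len(ids)

def is_valid_age(age, lowest=0, highest=150):
    """Age has a minimum and a maximum; the bounds here are assumed."""
    return lowest <= age <= highest

print(round(height_ft, 2), mean_age, ids_are_unique)
```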
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification); not effective in certain cases
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class
to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
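Three of the strategies above (global constant, attribute mean, per-class mean) can be sketched with the standard library. The records, the attribute names "income" and "label", and the use of None to mark a missing value are assumptions made for this example.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
rows = [
    {"income": 30.0, "label": "low"},
    {"income": None, "label": "low"},
    {"income": 50.0, "label": "low"},
    {"income": 90.0, "label": "high"},
    {"income": None, "label": "high"},
    {"income": 110.0, "label": "high"},
]

def fill_constant(rows, attr, constant):
    """Replace missing values with a global constant such as "unknown"."""
    return [r[attr] if r[attr] is not None else constant for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [r[attr] if r[attr] is not None else m for r in rows]

def fill_class_mean(rows, attr, label):
    """Replace missing values with the mean of the same class.

    Assumes every class has at least one observed value.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[label], []).append(r[attr])
    means = {cls: mean(vals) for cls, vals in observed.items()}
    return [r[attr] if r[attr] is not None else means[r[label]] for r in rows]

print(fill_mean(rows, "income"))                 # missing values become 70.0
print(fill_class_mean(rows, "income", "label"))  # 40.0 for "low", 100.0 for "high"
```

In this toy data the overall mean (70.0) sits between the two classes, while the per-class means stay close to the other values of each class, which is why the per-class variant is called "smarter" above.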
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer
and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
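As one possible illustration of the regression option, the sketch below fits a straight line by ordinary least squares and replaces each noisy value with the fitted one. The choice of a single-variable linear model and the sample points are assumptions, not something the slides prescribe.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothed data: each y is replaced by its value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```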
Data smoothing
• Data smoothing applies an algorithm that removes noise
from a data set, for example binning, regression, or
clustering.
Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
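The definition above does not name a detection rule. One common choice, used here purely as an assumption, is to flag values whose z-score exceeds a threshold. The threshold of 2.0 and the readings are illustrative.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sensor readings
outliers = find_outliers(readings)
print(outliers)  # [95]
```

A large outlier inflates both the mean and the standard deviation, so this simple rule can miss outliers in small samples; rank-based rules (for example, based on quartiles) are less sensitive to that effect.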
Binning Methods for Data
Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
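The worked example above can be reproduced with a short sketch. The function names are hypothetical, the number of values is assumed to be divisible by the number of bins, and ties in boundary smoothing go to the lower boundary.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size.

    Assumes len(values) is divisible by n_bins.
    """
    data = sorted(values)
    depth = len(data) // n_bins
    return [data[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum.

    Ties go to the lower boundary (an arbitrary choice).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)

print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75, and 29.25; rounding gives the 9, 23, and 29 shown in the example.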
Duplicate Data
• Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
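A minimal sketch of merging near-duplicate records, assuming two records refer to the same person when their names match after lowercasing and collapsing whitespace. That matching rule, the field names, and the records are all assumptions for illustration.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def merge_duplicates(records):
    """Merge records with the same normalized name, collecting distinct emails."""
    merged = {}
    for rec in records:
        key = normalize_name(rec["name"])
        entry = merged.setdefault(key, {"name": rec["name"].strip(), "emails": []})
        if rec["email"] not in entry["emails"]:
            entry["emails"].append(rec["email"])
    return list(merged.values())

# Hypothetical records from two sources.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},  # exact duplicate
]

merged_people = merge_duplicates(records)
for person in merged_people:
    print(person)
```

Matching on the name alone would wrongly merge two different people who share a name, so real data usually needs additional attributes before records are merged.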
Data Transformation:
Normalization
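Particularly useful for classification (neural networks, distance
measurements, nearest-neighbor classification, etc.)
For a value $v$ of attribute $A$ with minimum $\min_A$, maximum $\max_A$,
mean $\mu_A$, and standard deviation $\sigma_A$:
• min-max normalization to a new range $[\mathit{new\_min}_A, \mathit{new\_max}_A]$:
$v' = \frac{v - \min_A}{\max_A - \min_A}(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
• z-score normalization:
$v' = \frac{v - \mu_A}{\sigma_A}$
• normalization by decimal scaling:
$v' = \frac{v}{10^j}$
where $j$ is the smallest integer such that $\max(|v'|) < 1$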
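The three normalizations can be sketched as follows, assuming the usual textbook definitions: min-max rescales linearly to a target range, z-score subtracts the mean and divides by the standard deviation, and decimal scaling divides by a power of ten. The function names and sample values are illustrative, and the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max].

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)
    return [(v - low) * scale + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation.

    Assumes the standard deviation is not zero.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest j >= 0 that brings every |v| below 1.

    Restricting j to non-negative integers is a simplification made here.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(min_max([10, 20, 30, 40, 50]))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score([10, 20, 30, 40, 50]))
print(decimal_scaling([-986, 200, 917]))  # ([-0.986, 0.2, 0.917], 3)
```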
Discretization/Quantization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization/Quantization:
– divide the range of a continuous attribute into
intervals
[Figure: a continuous range with points x1 to x5 and intervals y1 to y6]
– Some classification algorithms only accept
categorical attributes.
– Reduce data size by discretization
– Prepare for further analysis
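One simple way to pick the intervals, used here as an assumption since no method is specified above, is equal-width discretization: split the range from minimum to maximum into intervals of the same width. The ages and the interval labels are illustrative.

```python
import bisect

def equal_width_cut_points(values, n_intervals):
    """Return the n_intervals - 1 interior cut points of equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    return [low + i * width for i in range(1, n_intervals)]

def discretize(value, cut_points):
    """Return the 0-based index of the interval that contains value.

    A value equal to a cut point falls into the interval to its right.
    """
    return bisect.bisect_right(cut_points, value)

ages = [3, 17, 25, 38, 52, 64, 83]  # hypothetical continuous attribute
cuts = equal_width_cut_points(ages, 4)
labels = ["y1", "y2", "y3", "y4"]   # one categorical value per interval

codes = [discretize(a, cuts) for a in ages]
categories = [labels[c] for c in codes]

print(cuts)        # [23.0, 43.0, 63.0]
print(categories)  # ['y1', 'y1', 'y2', 'y2', 'y3', 'y4', 'y4']
```

The result is a categorical attribute that algorithms accepting only categorical inputs can use.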
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
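A minimal sketch of simple random sampling with the standard library, one of several possible sampling schemes. The population of 100 integers, the sample size of 10, and the seed are assumptions for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # stand-in for the full data set

# Sampling without replacement: each object is selected at most once.
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same object may be selected several times.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```

Processing only the sample is cheaper than processing the whole population, at the cost of some sampling error.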
Example and code
• Download code in the classroom
• In class: follow a step-by-step tutorial