Data Pre Processing

The document provides an overview of data preprocessing, defining key concepts such as data, attributes, and attribute values. It discusses the importance of handling missing and noisy data, methods for data smoothing, and techniques for data transformation like normalization and discretization. Additionally, it highlights the significance of sampling in data selection for analysis.

Data preprocessing

Python for AI

What is Data?
• A data set is a collection of data objects and their attributes

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute

• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
• But the properties of the attribute values can differ
– ID has no limit, but age has a maximum and a minimum value
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective in certain cases

• Fill in the missing value manually: tedious and often infeasible

• Use a global constant to fill in the missing value: e.g.,
“unknown”, which may effectively act as a new class

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples of the same class
to fill in the missing value: smarter

• Use the most probable value to fill in the missing value:
inference-based methods such as regression, a Bayesian formula, or a decision tree
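
The two mean-based strategies above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the attribute name `income`, and the label name `cls` are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"cls": "A", "income": 30.0},
    {"cls": "A", "income": None},
    {"cls": "A", "income": 50.0},
    {"cls": "B", "income": 80.0},
    {"cls": "B", "income": None},
    {"cls": "B", "income": 100.0},
]

def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean over all known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the same class.

    Assumes every class has at least one known value for attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

global_filled = fill_with_mean(records, "income")               # gaps become 65.0
class_filled = fill_with_class_mean(records, "income", "cls")   # gaps become 40.0 (A) and 90.0 (B)
```

The class-conditional version fills each gap with a value closer to its own group, which is why the slide calls it the smarter option.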
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer
and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
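
As a rough sketch of the regression option, the code below fits a straight line by ordinary least squares and replaces each observed value with the fitted one. The choice of a linear model and the sample readings are assumptions made for illustration; the slides do not specify which regression function to use.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy readings that roughly follow y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 2.8, 5.1, 6.9, 9.0]

slope, intercept = fit_line(xs, ys)

# Smoothed data: each noisy y is replaced by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
```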
Data smoothing
• Data smoothing applies a specialized algorithm to remove
noise from a data set.

Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
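
One simple way to flag such values is sketched below. Note that the slides do not prescribe a detection rule; this example swaps in the common 1.5 × IQR rule as an assumption, and the value 250 is a hypothetical outlier appended to otherwise ordinary price data.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    This is the 1.5 * IQR rule of thumb, one of several possible criteria.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical prices with one suspicious value (250) added.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 250]
outliers = iqr_outliers(prices)  # [250]
```

Flagged values are candidates for manual inspection, not automatic deletion; whether an outlier is noise or a genuine observation depends on the data.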

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
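
The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, the code assumes the number of values is divisible by the number of bins, and ties between the two boundaries are resolved toward the lower one (an assumption; no tie occurs in this data).

```python
def equi_depth_bins(values, n_bins):
    """Sort values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
by_bounds = smooth_by_boundaries(bins)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75, and 29.25; the slide, like the sketch, shows them rounded to whole dollars.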
Duplicate Data
• Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
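
A minimal sketch of the email example follows. It assumes, purely for illustration, that two records refer to the same person when their names match after lowercasing and collapsing whitespace; the records and function names are hypothetical, and real data usually needs more careful matching.

```python
# Hypothetical records: the same person appears with different email addresses.
people = [
    {"name": "Ana Lopez", "email": "ana@example.com"},
    {"name": "ana  lopez", "email": "a.lopez@example.org"},
    {"name": "Bo Chen", "email": "bo@example.com"},
]

def normalize(name):
    """Lowercase and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Group rows by normalized name and collect every email seen for it."""
    merged = {}
    for row in rows:
        merged.setdefault(normalize(row["name"]), []).append(row["email"])
    return merged

merged = merge_duplicates(people)
# {'ana lopez': ['ana@example.com', 'a.lopez@example.org'], 'bo chen': ['bo@example.com']}
```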
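Data Transformation: Normalization
Particularly useful for classification (neural networks, distance measurements,
nearest-neighbor classification, etc.)
• min-max normalization
$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$

• z-score normalization
$v' = \frac{v - \mathrm{mean}_A}{\mathrm{stddev}_A}$

• normalization by decimal scaling
$v' = \frac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$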
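
The three methods can be sketched in plain Python as follows, assuming the usual textbook definitions: min-max rescales to a target range, z-score subtracts the mean and divides by the standard deviation (the population standard deviation is used here as an assumption), and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max].

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

data = [200, 300, 400, 600, 1000]

scaled = min_max(data)               # [0.0, 0.125, 0.25, 0.5, 1.0]
standardized = z_score(data)         # mean 0, standard deviation 1
decimals, j = decimal_scaling(data)  # j == 4, because 1000 / 10**3 is not < 1
```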
Discretization/Quantization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization/Quantization:
– divide the range of a continuous attribute into intervals
[Figure: points x1 to x5 and intervals y1 to y6 on the attribute's range]
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization
– Prepare for further analysis
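
As one possible illustration, the sketch below discretizes a continuous attribute into equal-width intervals. Equal-width splitting is an assumption chosen for simplicity (the slides do not name a specific scheme), and the ages and function names are hypothetical.

```python
from bisect import bisect_right

def equal_width_cut_points(values, n_intervals):
    """Return the n_intervals - 1 cut points that split the range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]

def discretize(value, cut_points):
    """Return the index of the interval containing value.

    A value equal to a cut point is assigned to the upper interval.
    """
    return bisect_right(cut_points, value)

ages = [3, 17, 25, 42, 58, 63]                # hypothetical continuous attribute
cuts = equal_width_cut_points(ages, 3)        # [23.0, 43.0]
labels = [discretize(a, cuts) for a in ages]  # [0, 0, 1, 1, 2, 2]
```

The integer labels can then be treated as an ordinal or categorical attribute by algorithms that do not accept continuous input.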
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
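
A minimal sketch of simple random sampling with Python's standard `random` module is shown below. The population of 100 integer IDs, the sample size of 10, and the seed are hypothetical choices; the slides do not say which sampling scheme to use.

```python
import random

# Hypothetical population: 100 record IDs.
population = list(range(1, 101))

# A seeded generator makes the sample reproducible between runs.
rng = random.Random(42)

# Sampling without replacement: each record is picked at most once.
without_replacement = rng.sample(population, k=10)

# Sampling with replacement: the same record may be picked more than once.
with_replacement = rng.choices(population, k=10)
```

A sample is only useful if it is representative, so in practice the sample size and scheme should be checked against the properties of the full data set.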

Example and code
• Download code in the classroom
• In class: follow a step-by-step tutorial
