Data Pre Processing

The document provides an overview of data preprocessing, defining key concepts such as data, attributes, and attribute values. It discusses the importance of handling missing and noisy data, methods for data smoothing, and techniques for data transformation like normalization and discretization. Additionally, it highlights the significance of sampling in data selection for analysis.

Uploaded by

Anh Thư Nguyễn

Data preprocessing

Python for AI

What is Data?
• Collection of data objects and their attributes
(in a data table, objects are the rows and attributes are the columns)

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute

• Distinction between attributes and attribute values

– Same attribute can be mapped to different attribute
values
• Example: height can be measured in feet or meters

– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or
names
• No quality data, no quality mining results!
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective in certain cases

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!

• Use the attribute mean to fill in the missing value

• Use the attribute mean for all samples of the same class
to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
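The two mean-based strategies above can be sketched in plain Python. This is only an illustration under assumptions: the records, the field names `income` and `label`, and the helper function names are hypothetical, and `None` is used to mark a missing value.

```python
from statistics import mean

# Toy records (hypothetical data); None marks a missing value.
records = [
    {"income": 30.0, "label": "low"},
    {"income": None, "label": "low"},
    {"income": 50.0, "label": "low"},
    {"income": 90.0, "label": "high"},
    {"income": None, "label": "high"},
    {"income": 110.0, "label": "high"},
]

def fill_with_attribute_mean(rows, attr):
    """Replace missing values of `attr` with the mean of all observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of `attr` with the mean over rows of the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

global_filled = fill_with_attribute_mean(records, "income")
class_filled = fill_with_class_mean(records, "income", "label")
```

Both helpers return new lists and leave the input untouched. The class-based version fails with a `KeyError` if a class has no observed value at all, so a real pipeline would need a fallback for that case.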
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer
and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
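As a rough illustration of smoothing by regression, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the fitted one. The choice of a linear model, the toy data, and the function names are assumptions made for the example; the slide does not prescribe a particular regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical noisy measurements
smoothed = smooth_by_regression(xs, ys)
```

The smoothed values lie exactly on the fitted line, so the noise is removed at the cost of also removing any structure the line cannot express.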
Data smoothing
• Data smoothing applies a specialized algorithm to remove noise from the
given data set.

Outliers
• Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
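One common way to flag such objects for a single numeric attribute is the interquartile-range (IQR) rule. The slides do not name a detection rule, so the rule itself, the multiplier `k=1.5`, the function name, and the extra value 200 appended to the price list are all assumptions made for this sketch.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Price list with one artificially added extreme value (200).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 200]
outliers = iqr_outliers(prices)
```

Flagged values still need a decision: they may be errors to remove, or rare but valid observations worth keeping.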

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
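The worked example above can be reproduced with a short sketch. The function names are hypothetical, and the code assumes the number of values is divisible by the number of bins, as it is for the twelve prices here.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    data = sorted(values)
    if len(data) % n_bins != 0:
        raise ValueError("this sketch assumes len(values) is divisible by n_bins")
    depth = len(data) // n_bins
    return [data[i:i + depth] for i in range(0, len(data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean (not rounded)."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the nearer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
means = smooth_by_means(bins)
boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown in the example.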
Duplicate Data
• Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
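A minimal sketch of merging near-duplicate person records follows. It assumes that two records describe the same person when their normalized name and birth date match; that matching rule, the field names, and the toy records are illustrative assumptions, and real entity resolution usually needs fuzzier matching.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def merge_duplicates(records):
    """Group records by (normalized name, birth date) and collect all e-mails."""
    merged = {}
    for rec in records:
        key = (normalize_name(rec["name"]), rec["birth_date"])
        entry = merged.setdefault(
            key,
            {"name": rec["name"], "birth_date": rec["birth_date"], "emails": []},
        )
        if rec["email"] not in entry["emails"]:
            entry["emails"].append(rec["email"])
    return list(merged.values())

# Hypothetical records: the first two describe the same person.
people = [
    {"name": "An Nguyen", "birth_date": "2001-05-17", "email": "an@example.org"},
    {"name": "an  nguyen", "birth_date": "2001-05-17", "email": "an.nguyen@example.com"},
    {"name": "Binh Tran", "birth_date": "1999-11-02", "email": "binh@example.org"},
]
unique_people = merge_duplicates(people)
```

Each merged entry keeps the first spelling of the name and accumulates every distinct e-mail address seen for that person.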
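Data Transformation: Normalization
Particularly useful for classification (neural networks, distance measurements,
nearest-neighbor classification, etc.)
Here $v$ is a value of attribute $A$ and $v'$ is its normalized value.
• min-max normalization

$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$

• z-score normalization

$v' = \frac{v - \mu_A}{\sigma_A}$, where $\mu_A$ and $\sigma_A$ are the mean and standard deviation of $A$

• normalization by decimal scaling

$v' = \frac{v}{10^j}$

where $j$ is the smallest integer such that $\max(|v'|) < 1$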
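A minimal sketch of the three normalizations in plain Python, using their standard definitions. The function names and sample values are hypothetical. Two assumptions are worth flagging: the z-score uses the population standard deviation, and the decimal-scaling helper only searches non-negative exponents `j`.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scores = [10, 20, 30, 40, 50]
scaled = min_max(scores)
standardized = z_score(scores)
decimals, j = decimal_scaling([-986, 217, 500, 917])
```

Both `min_max` and `z_score` divide by zero when every value is identical, so constant attributes need to be handled separately.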
Discretization/Quantization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization/Quantization:
– divide the range of a continuous attribute into intervals
[Figure: cut points x1, ..., x5 on the attribute axis delimit the intervals y1, ..., y6]
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization
– Prepare for further analysis
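One simple way to divide a range into intervals is equal-width discretization, sketched below. The slides do not say which discretization scheme is meant, so equal-width cut points, the function names, and the sample ages are assumptions made for the example.

```python
import bisect

def equal_width_cut_points(values, n_intervals):
    """Return the n_intervals - 1 cut points that split [min, max] evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + i * width for i in range(1, n_intervals)]

def discretize(value, cut_points):
    """Return the index of the interval containing value (0 is the leftmost)."""
    return bisect.bisect_right(cut_points, value)

ages = [3, 18, 25, 40, 63]  # hypothetical continuous attribute
cuts = equal_width_cut_points(ages, 3)
codes = [discretize(a, cuts) for a in ages]
```

With n cut points the attribute falls into n + 1 intervals, and a value equal to a cut point is assigned to the interval on its right.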
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.

• Statisticians sample because obtaining the entire set of data of interest
is too expensive or time consuming.

• Sampling is used in data mining because processing the entire set of data
of interest is too expensive or time consuming.
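As an illustration, simple random sampling can be sketched with Python's standard `random` module. The slides do not specify a sampling scheme, so the choice of simple random sampling (with and without replacement), the function names, and the toy population are assumptions.

```python
import random

def sample_without_replacement(data, k, seed=None):
    """Draw k distinct items; each item can be selected at most once."""
    rng = random.Random(seed)
    return rng.sample(data, k)

def sample_with_replacement(data, k, seed=None):
    """Draw k items independently; the same item may be selected repeatedly."""
    rng = random.Random(seed)
    return rng.choices(data, k=k)

population = list(range(1000))  # stand-in for a large data set
subset = sample_without_replacement(population, 50, seed=0)
bootstrap = sample_with_replacement(population, 50, seed=0)
```

Passing a fixed `seed` makes the draw reproducible, which helps when a preliminary investigation on a sample has to be repeated later.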

Example and code
• Download code in the classroom
• In class: follow a step-by-step tutorial
