
Data Preparation

1
Understanding Data

What are the types of the attributes or fields?

What kinds of values does each attribute have?

Which attributes are discrete, and which are continuous-valued?

What do the data look like?

How are the values distributed?

Are there ways we can visualize the data?

Can we spot any outliers?

Can we measure the similarity of some data objects with respect to
others?

Knowing basic statistics about the data helps answer these questions (see the
sketch below).
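
A minimal sketch in pandas, using a hypothetical toy DataFrame (the column
names and values below are illustrative, not from these slides):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, None, 41, 29],                  # numeric, one missing value
    "income": [30_000, 52_000, 47_000, None, 39_000],  # numeric, one missing value
    "city":   ["Addis Ababa", "Bahir Dar", "Addis Ababa", "Gondar", None],
})

print(df.dtypes)            # data type of each attribute
print(df.describe())        # min, max, mean, quartiles of numeric attributes
print(df.isna().sum())      # number of missing values per attribute
print(df["city"].unique())  # distinct values of a discrete attribute
```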

2
Understanding Data

3
Understand the Current Schema/Data
 To understand one attribute:
 min, max, avg, histogram, number of missing values, value range
 data type, length of values, etc.
 synonyms, formats
 To understand the relationship between two attributes:
 various plots

4
Why

Knowing such basic statistics regarding each attribute makes it easier to fill
missing values, smooth noisy values, and spot outliers during data
preprocessing.

Knowledge of the attributes and attribute values can also help in fixing
inconsistencies incurred during data integration.

Plotting the measures of central tendency shows us if the data are
symmetric or skewed.

Quantile plots, histograms, and scatter plots are other graphic displays of
basic statistical descriptions.

These can all be useful during data preprocessing and can provide insight
into areas for mining.

The field of data visualization provides many additional techniques for
viewing data through graphical means.
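
A hedged sketch of these displays using matplotlib on the toy df from the
earlier example; the ordered-values plot only approximates a true quantile
plot:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(df["income"].dropna(), bins=5)   # histogram
axes[0].set_title("Histogram of income")

income = df["income"].dropna().sort_values()  # ordered values stand in
axes[1].plot(income.values, marker="o")       # for a quantile plot
axes[1].set_title("Ordered income values")

axes[2].scatter(df["age"], df["income"])      # scatter plot
axes[2].set_title("age vs. income")

plt.tight_layout()
plt.show()
```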
5
Types of Data


Datasets are made up of data objects and attributes.

A data object represents an entity:

In a sales database, the objects may be customers, store
items, and sales;

In a medical database, the objects may be patients;

In a university database, the objects may be students,
professors, and courses.

6
Attribute

An attribute is a data field, representing a characteristic or feature of a data
object.

Attributes describing a customer object include, for example, customer ID, name,
and address.

Observed values for a given attribute are known as observations.

An attribute vector (or feature vector) is the set of attributes describing an object.

Types of attributes:

nominal, binary (symmetric and asymmetric), ordinal, numeric

7
Identify Data Problems
 Data Quality Problems
 missing values
 incorrect values, illegal values, outliers
 synonyms
 misspellings
 conflicting data (e.g., age and birth year)
 wrong value formats
 variations of values
 duplicate tuples
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
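
Several of these problems can be screened for mechanically; a minimal sketch
on the toy df from earlier (the specific checks are illustrative):

```python
print(df.isna().sum())            # missing values per attribute
print(df[df["age"] < 0])          # illegal values, e.g., negative ages
print(df[df.duplicated()])        # duplicate tuples
print(df["city"].value_counts())  # spot synonyms and misspellings by eye
```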

8
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

9
1. Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., from
faulty instruments, human or computer error, or transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation = “” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary = “−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age = “42”, Birthday = “03/07/2010”
was rating “1, 2, 3”, now rating “A, B, C”
discrepancies between duplicate records
intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?

10
A. Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of
entry
history or changes of the data not registered
Missing data may need to be inferred

11
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually: tedious, and often infeasible
Fill it in automatically with
a global constant: e.g., “unknown” (which may form a new class!)
the attribute mean
the attribute mean for all samples belonging to the same
class: smarter
the most probable value: inference-based, e.g., a Bayesian
formula or a decision tree
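
Hedged sketches of the automatic strategies on the toy df from earlier; the
'class' column in the last variant is hypothetical:

```python
df_const = df.fillna({"city": "unknown"})  # global constant for a nominal attribute

df_mean = df.copy()                        # attribute mean for a numeric attribute
df_mean["income"] = df_mean["income"].fillna(df_mean["income"].mean())

# Attribute mean per class (assumes a hypothetical 'class' column exists):
# df["income"] = (df.groupby("class")["income"]
#                   .transform(lambda s: s.fillna(s.mean())))
```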

12
B. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistent naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data to regression functions (see the sketch after
this list)
Clustering
detect and remove outliers
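
As a minimal sketch of the regression option, fit a least-squares line to
hypothetical toy data and replace each value with its fitted value:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0, 12.1])  # 30.0 looks like noise

slope, intercept = np.polyfit(x, y, deg=1)      # fit a least-squares line
y_smooth = slope * x + intercept                # smoothed (fitted) values
print(np.round(y_smooth, 1))
```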

14
Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
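
A minimal sketch reproducing this worked example in plain Python (the
rounding and tie-breaking choices are assumptions):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins        # equal-frequency: 4 values per bin
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```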

15
2. Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Tuple duplication
• There are two or more identical tuples for a given unique data entry
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
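
A hedged sketch of these issues in pandas; the sources, column names, and
conversion factor are hypothetical:

```python
import pandas as pd

a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})
b = pd.DataFrame({"cust_no": [2, 3], "weight_lb": [181.9, 150.0]})

# Schema integration: map B.cust_no onto A.cust_id before merging.
b = b.rename(columns={"cust_no": "cust_id"})
# Resolve the value conflict: convert British units (pounds) to metric (kg).
b["weight_kg"] = b.pop("weight_lb") * 0.4536

merged = pd.merge(a, b, on="cust_id", how="outer", suffixes=("_a", "_b"))
print(merged)
```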
16
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple
databases
• Object identification: the same attribute or object may
have different names in different databases
• Derivable data: one attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
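
A minimal sketch of correlation-based redundancy detection on hypothetical
toy data; an attribute pair with |r| near 1 is a removal candidate:

```python
import pandas as pd

sales = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue":  [120, 240, 360, 480],  # derivable: 12 * monthly
    "num_employees":   [5, 3, 9, 4],
})
print(sales.corr())  # monthly vs. annual revenue correlate perfectly (r = 1.0)
```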
17
3. Data Reduction
Data reduction: obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
Why data reduction?
A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data
set.
When dimensionality increases, data become increasingly sparse.
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful.
The possible combinations of subspaces grow exponentially.
Reduce time and space required in data mining
Allow easier visualization

18
Data reduction strategies

• Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need to store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
• Compression: dimensionality and numerosity reduction are considered
forms of data compression.
• Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels.
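
Hedged sketches of two of these strategies, assuming scikit-learn is
available; the data and parameter choices are illustrative only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # toy data: 1000 rows, 10 attributes

# Dimensionality reduction: encode the 10 attributes into 3 components.
X_reduced = PCA(n_components=3).fit_transform(X)

# Numerosity reduction: keep a 10% simple random sample of the rows.
sample = X[rng.choice(len(X), size=100, replace=False)]
print(X_reduced.shape, sample.shape)  # (1000, 3) (100, 10)
```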

19
Attribute Subset Selection
Attribute subset selection reduces the dataset size by removing
irrelevant or redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using
all attributes.
Mining on a reduced set of attributes has an additional benefit of
reducing the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
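
One possible sketch using scikit-learn's SelectKBest for univariate attribute
scoring (greedy stepwise wrapper methods are a common alternative; k=2 is an
illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.get_support())      # boolean mask over the original attributes
X_reduced = selector.transform(X)  # dataset restricted to the best 2 attributes
```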

20
Lab Exercise

• Practice the different data preprocessing tasks in Python

21
