Data Preparation
Understanding Data
• Types of attributes or fields
• Kind of values each attribute has
• Which attributes are discrete, and which are continuous-valued?
• What do the data look like?
• How are the values distributed?
• Are there ways we can visualize the data?
• Can we spot any outliers?
• Can we measure the similarity of some data objects with respect to others?
• Knowing basic statistics about the data
Understand the Current Schema/Data
To understand one attribute:
• min, max, avg, histogram, amount of missing values
• value range
• data type, length of values, etc.
• synonyms, formats
To understand the relationship between two attributes:
• various plots
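To make these checks concrete, here is a minimal pandas sketch, assuming a hypothetical price attribute (the values are borrowed from the binning example later in this section); the DataFrame and column name are illustrative, not part of the original slides.

```python
import pandas as pd

# Hypothetical attribute; replace with your own data.
df = pd.DataFrame({"price": [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, None]})

# min, max, avg, quartiles for one attribute
print(df["price"].describe())

# amount of missing values and the attribute's data type
print("missing:", df["price"].isna().sum())
print("dtype:", df["price"].dtype)

# distinct values and their frequencies (helps spot synonyms and format variations)
print(df["price"].value_counts().head())
```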
Why
• Knowing such basic statistics about each attribute makes it easier to fill in missing values, smooth noisy values, and spot outliers during data preprocessing.
• Knowledge of the attributes and attribute values can also help in fixing inconsistencies incurred during data integration.
• Plotting the measures of central tendency shows us whether the data are symmetric or skewed.
• Quantile plots, histograms, and scatter plots are other graphic displays of basic statistical descriptions.
• These can all be useful during data preprocessing and can provide insight into areas for mining.
• The field of data visualization provides many additional techniques for viewing data through graphical means.
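As a hedged illustration of these graphic displays, the sketch below draws a histogram, a box plot, and a simple quantile plot for a hypothetical numeric attribute; the data and column name are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric attribute.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], name="price")

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: shows how the values are distributed (symmetric vs. skewed).
axes[0].hist(prices, bins=5)
axes[0].set_title("Histogram")

# Box plot: shows median, quartiles, and potential outliers.
axes[1].boxplot(prices)
axes[1].set_title("Box plot")

# Quantile plot: sorted values against their f-values (i - 0.5) / n.
f = (np.arange(1, len(prices) + 1) - 0.5) / len(prices)
axes[2].plot(f, np.sort(prices.values), marker="o")
axes[2].set_title("Quantile plot")

plt.tight_layout()
plt.show()
```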
Types of Data
• Datasets are made up of data objects and attributes.
• A data object represents an entity:
  • In a sales database, the objects may be customers, store items, and sales;
  • In a medical database, the objects may be patients;
  • In a university database, the objects may be students, professors, and courses.
Attribute
• An attribute is a data field, representing a characteristic or feature of a data object.
• Attributes describing a customer object include, for example, customer ID, name, and address.
• Observed values for a given attribute are known as observations.
• An attribute vector (or feature vector) is the set of attributes describing an object.
• Types of attributes: nominal, binary (symmetric and asymmetric), ordinal, numeric.
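A minimal sketch, assuming a small hypothetical customer table, of how these attribute types can be represented in pandas: nominal as an unordered category, binary as 0/1, ordinal as an ordered category, and numeric as int/float.

```python
import pandas as pd

# Hypothetical customer data illustrating the four attribute types.
df = pd.DataFrame({
    "hair_color": ["black", "brown", "blond"],   # nominal
    "is_smoker": [0, 1, 0],                      # binary (asymmetric)
    "size": ["small", "large", "medium"],        # ordinal
    "income": [42000.0, 55000.0, 61000.0],       # numeric
})

# Declare the nominal and ordinal attributes explicitly.
df["hair_color"] = df["hair_color"].astype("category")
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df.dtypes)         # inspect the declared/inferred attribute types
print(df["size"].min())  # ordering is meaningful only for ordinal/numeric attributes
```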
Identify Data Problems
Data Quality Problems:
• missing values
• incorrect values, illegal values, outliers
• synonyms
• misspellings
• conflicting data (e.g., age and birth year)
• wrong value formats
• variations of values
• duplicate tuples
Data Quality Dimensions:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: updated in a timely manner
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
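A hedged sketch of how a few of these problems could be surfaced with pandas; the column names, the age/birth-year consistency rule, and the reference year are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records containing several of the problems listed above.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Carol"],
    "age": [42, 42, -10, 35],                 # -10 is an illegal value
    "birth_year": [2010, 2010, 1990, 1988],   # 2010 conflicts with age = 42
    "salary": [50000, 50000, None, 61000],    # missing value
})

print(df.isna().sum())                                      # missing values per attribute
print(df[(df["age"] < 0) | (df["age"] > 120)])              # illegal values (simple range rule)
print(df[(2024 - df["birth_year"] - df["age"]).abs() > 1])  # conflicting age vs. birth year
print(df[df.duplicated()])                                  # duplicate tuples
```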
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
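Normalization appears under data transformation but is not expanded later in this section, so here is a minimal sketch of two common forms, min-max and z-score normalization; the column name and target range are assumptions.

```python
import pandas as pd

# Hypothetical numeric attribute to normalize.
df = pd.DataFrame({"income": [12000.0, 35000.0, 47000.0, 98000.0]})

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
lo, hi = df["income"].min(), df["income"].max()
df["income_minmax"] = (df["income"] - lo) / (hi - lo)

# Z-score normalization: v' = (v - mean) / std
df["income_zscore"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```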
1. Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., Occupation = “ ” (missing data)
• noisy: containing noise, errors, or outliers
  e.g., Salary = “−10” (an error)
• inconsistent: containing discrepancies in codes or names
  e.g., Age = “42”, Birthday = “03/07/2010”
  was rating “1, 2, 3”, now rating “A, B, C”
  discrepancy between duplicate records
• intentional (e.g., disguised missing data)
  Jan. 1 as everyone’s birthday?
A. Incomplete (Missing) Data
Data is not always available
• e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
• equipment malfunction
• inconsistency with other recorded data, leading to deletion
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• history or changes of the data not being registered
Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with:
  • a global constant, e.g., “unknown” (a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class (smarter)
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree
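A minimal pandas sketch of the automatic filling strategies above; the column and class names are hypothetical.

```python
import pandas as pd

# Hypothetical data with a missing income value.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [30000.0, None, 52000.0, 48000.0],
})

# 1. Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 2. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3. Fill with the attribute mean of samples in the same class (smarter)
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```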
B. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitations
• inconsistency in naming conventions
Other data problems which require data cleaning:
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.
• Regression: smooth by fitting the data to regression functions
• Clustering: detect and remove outliers
Binning Methods
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
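A short sketch reproducing the equal-frequency partition, smoothing by bin means, and smoothing by bin boundaries from the example above, using plain NumPy.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.array_split(prices, 3)  # equal-frequency (equi-depth) partition

# Smoothing by bin means: replace every value by its (rounded) bin mean.
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]
print([b.tolist() for b in by_means])   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace every value by the closer of min/max.
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
print([b.tolist() for b in by_bounds])  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```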
2. Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Tuple duplication
• There are two or more identical tuples for a given unique data entry case
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different sources are
different
• Possible reasons: different representations, different scales, e.g., metric vs.
British units
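A hedged sketch of combining two hypothetical sources whose customer keys are named differently (the A.cust-id vs. B.cust-# situation above); all table and column names are assumptions.

```python
import pandas as pd

# Two hypothetical sources with differently named keys for the same entity.
sales_a = pd.DataFrame({"cust_id": [1, 2, 3], "total": [100, 250, 80]})
sales_b = pd.DataFrame({"cust_num": [2, 3, 4], "region": ["EU", "US", "US"]})

# Schema integration: align the key attributes, then merge into one coherent store.
merged = sales_a.merge(sales_b, left_on="cust_id", right_on="cust_num", how="outer")

# Tuple duplication: drop exact duplicate rows after integration.
merged = merged.drop_duplicates()

print(merged)
```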
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
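A minimal sketch of correlation and covariance analysis for spotting redundant numeric attributes; the attribute names and values are hypothetical.

```python
import pandas as pd

# Hypothetical table where annual_revenue is (almost) 12 * monthly_revenue.
df = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40],
    "annual_revenue": [120, 240, 365, 480],
    "employees": [3, 9, 5, 7],
})

print(df.corr())  # Pearson correlation: values near +/-1 suggest redundant attributes
print(df.cov())   # covariance: the scale-dependent counterpart of correlation
```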
3. Data Reduction
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis may take a very long time to run on the complete data set
• When dimensionality increases, data become increasingly sparse
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  • The number of possible combinations of subspaces grows exponentially
• Reduce the time and space required for data mining
• Allow easier visualization
Data reduction strategies
• Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant
attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
• Compression, where dimensionality reduction and numerosity reduction are considered forms of data compression.
• Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels.
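A hedged sketch of two of these strategies, numerosity reduction by random sampling and discretization of a numeric attribute into ranges; the sampling fraction, bin edges, and labels are assumptions.

```python
import pandas as pd

# Hypothetical numeric attribute.
df = pd.DataFrame({"price": [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]})

# Numerosity reduction: keep a 50% random sample instead of all tuples.
sample = df.sample(frac=0.5, random_state=0)

# Discretization: replace raw prices with ranges (one level of a concept hierarchy).
df["price_range"] = pd.cut(df["price"], bins=[0, 10, 20, 30, 40],
                           labels=["very low", "low", "medium", "high"])

print(sample)
print(df)
```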
Attribute Subset Selection
Attribute subset selection reduces the dataset size by removing
irrelevant or redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using
all attributes.
Mining on a reduced set of attributes has an additional benefit of
reducing the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
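A minimal sketch of attribute subset selection with scikit-learn's univariate SelectKBest; the synthetic data and the choice of k are assumptions, and other search strategies (e.g., stepwise forward selection or backward elimination) are equally valid.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 attributes, only a few of which carry class information.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, n_redundant=2, random_state=0)

# Keep the 3 attributes that score highest against the class label.
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print("kept attribute indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)  # (200, 3)
```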
Lab Exercise
• Practice the different data preprocessing tasks in Python