Data Preprocessing
Data preprocessing is a data mining technique that transforms raw data into an understandable format.
Real-world data is often incomplete, noisy, and inconsistent, and preprocessing is a proven method of resolving these issues.
Data preprocessing prepares raw data for further processing and is used in database-driven applications such as
customer relationship management and rule-based applications.
Preprocessing Steps
Data cleaning
Data integration
Data transformation
Data reduction
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation of the data that is much smaller in volume yet produces the same or similar
analytical results
(Figure: forms of data preprocessing)
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing values
per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new
class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in
the missing value: smarter
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or decision tree (a small pandas sketch of simple fill strategies follows)
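As an illustration, here is a minimal pandas sketch of the constant, attribute-mean, and class-conditional-mean fill strategies; the table, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical sales tuples with missing customer income values.
df = pd.DataFrame({
    "cust_class": ["A", "A", "B", "B", "B"],
    "income":     [50000, None, 42000, None, 38000],
})

# Global constant: flag missing income with a special "unknown" code.
filled_constant = df["income"].fillna(-1)

# Attribute mean: replace missing income with the overall mean income.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: use the mean income of tuples in the same class.
filled_class_mean = df.groupby("cust_class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```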
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, or bin boundaries (see the sketch after this list)
Cluster Analysis
Clustering: detect and remove outliers
Regression
smooth by fitting the data to regression functions (e.g., a linear function y = x + 1)
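A minimal sketch of equi-depth binning and smoothing, assuming pandas and numpy are available; the price values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sorted prices to be smoothed.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equi-depth (equal-frequency) bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: each value is replaced by the mean of its bin.
by_means = prices.groupby(bins).transform("mean")

# Smooth by bin boundaries: each value moves to the nearer bin boundary.
def to_boundary(s):
    lo, hi = s.min(), s.max()
    return np.where((s - lo) <= (hi - s), lo, hi)

by_boundaries = prices.groupby(bins).transform(to_boundary)
```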
Data cleaning as a process
Discrepancy detection
Use metadata (e.g., the domain, range, dependencies, and distribution of each attribute)
Check for field overloading (new attribute definitions squeezed into unused portions of existing attributes)
Unique rules: each value of the given attribute must differ from all other values of that attribute
Consecutive rules: no values may be missing between the lowest and highest value of the attribute, and all values must be unique (e.g., cheque numbers)
Null rules: specify how blanks, question marks, special characters, or other strings indicating the null condition are recorded
(A small pandas sketch of the unique and null rules follows.)
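For illustration, a minimal pandas sketch of checking a unique rule and a null rule; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical customer table used to illustrate simple rule checks.
df = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "phone":   ["555-0101", None, "555-0103", "555-0104"],
})

# Unique rule: cust_id must not repeat across tuples.
violates_unique = df[df["cust_id"].duplicated(keep=False)]

# Null rule: flag tuples whose phone field holds the null/blank condition.
violates_null = df[df["phone"].isna()]

print(violates_unique, violates_null, sep="\n")
```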
Data Integration
Data integration:
combines data from multiple sources.
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world entities from
multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different sources
are different
possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundant Data in Data Integration
Redundant data often occur when multiple databases are
integrated
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table.
Redundancy may often be detected by correlation analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation analysis: measures how strongly one attribute implies another; a correlation close to ±1 suggests that one of the two attributes is redundant (see the sketch below).
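A minimal sketch, assuming numeric attributes and hypothetical values: the Pearson correlation matrix flags annual_income as redundant given monthly_income.

```python
import pandas as pd

# Hypothetical table in which annual_income is derivable from monthly_income.
df = pd.DataFrame({
    "monthly_income": [3000, 4200, 5100, 2800, 6000],
    "annual_income":  [36000, 50400, 61200, 33600, 72000],
    "age":            [25, 41, 37, 29, 52],
})

# Pearson correlation matrix; coefficients near +/-1 indicate redundancy.
print(df.corr().round(2))
```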
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
min-max normalization
Min-max normalization performs a linear transformation on the original
data.
Suppose that min_A and max_A are the minimum and the maximum values
for attribute A. Min-max normalization maps a value v of A to v' in the
range [new_min_A, new_max_A] by computing:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
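A minimal numpy sketch of the formula above; the income values and the target range [0, 1] are illustrative.

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map an attribute's values onto [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

# Hypothetical incomes rescaled to [0, 1].
print(min_max_normalize([12000, 73600, 98000]))
```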
Data Transformation: Normalization
Z-score Normalization:
In z-score normalization, the values of attribute A are normalized based on the
mean and standard deviation of A. A value v of A is normalized to v'
by computing:
v' = (v - mean_A) / stddev_A
where mean_A and stddev_A are the mean and the standard deviation,
respectively, of attribute A.
This method of normalization is useful when the actual minimum and
maximum of attribute A are unknown.
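A minimal numpy sketch of z-score normalization; the values are hypothetical.

```python
import numpy as np

def z_score_normalize(values):
    """Normalize an attribute by its mean and standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

print(z_score_normalize([73600, 54000, 32000, 61000]))
```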
Data Transformation: Normalization
Normalization by Decimal Scaling
Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.
The number of decimal places moved depends on the maximum
absolute value of A.
A value v of A is normalized to v' by computing v' = v / 10^j, where j
is the smallest integer such that max(|v'|) < 1.
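A minimal numpy sketch of decimal scaling; it assumes the attribute has at least one nonzero value.

```python
import numpy as np

def decimal_scaling_normalize(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    v = np.asarray(values, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

# Hypothetical values in [-986, 917] are divided by 10^3.
print(decimal_scaling_normalize([-986, 917]))
```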
Data reduction
Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical
results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
Data reduction strategies
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
Data cube aggregation
Aggregation operations are applied to the data in the construction
of a data cube, so that analysis can work on a coarser, summarized level.
For example, quarterly sales figures can be rolled up into annual totals, as in the sketch below.
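A minimal pandas sketch of the roll-up idea, with hypothetical branch/quarter sales figures aggregated to annual totals per branch.

```python
import pandas as pd

# Hypothetical quarterly sales, aggregated to one annual total per branch.
sales = pd.DataFrame({
    "branch":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 301, 279, 412, 512],
})

annual = sales.groupby("branch", as_index=False)["amount"].sum()
print(annual)
```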
Attribute subset selection
Irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed, typically with greedy heuristics such as the following (a forward-selection sketch appears after this list):
Stepwise forward selection:
Stepwise backward elimination
Combination of forward selection and backward
elimination:
Decision tree induction:
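A greedy stepwise forward-selection sketch. The score function is a hypothetical placeholder (e.g., cross-validated accuracy of a model trained on that attribute subset, with a defined value for the empty subset); higher is assumed to be better.

```python
def forward_selection(attributes, score):
    """Greedy stepwise forward selection of attributes."""
    selected, remaining = [], list(attributes)
    best = score(selected)
    while remaining:
        # Add the single attribute that improves the score the most.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        new_score = score(selected + [candidate])
        if new_score <= best:      # no further improvement: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected
```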
Dimensionality reduction
Encoding mechanisms are used to reduce the data size.
Wavelet transforms
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it into a
numerically different vector, X', of wavelet coefficients.
Principal components analysis, or PCA
Unlike attribute subset selection, which reduces the attribute set size by
retaining a subset of the initial set of attributes, PCA “combines” the
essence of attributes by creating an alternative, smaller set of variables.
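A minimal numpy sketch of PCA via the singular value decomposition, projecting a hypothetical 5 x 3 data matrix onto its first two principal components.

```python
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1]])

Xc = X - X.mean(axis=0)                      # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                    # scores on the first 2 components
print(X_reduced.shape)                       # (5, 2)
```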
Data compression
Numerosity reduction
The data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need to store only
the model parameters instead of the actual data) or nonparametric
methods such as clustering, sampling, and the use of histograms.
Regression and Log-Linear Models
Histograms
Clustering
Sampling
Data compression
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the
given data.
In linear regression, the data are modeled to fit a straight line:
y = wx + b, where y is the response variable and x is the predictor variable (see the sketch below).
Log-linear models approximate discrete multidimensional
probability distributions.
This allows a higher-dimensional data space to be constructed from
lower-dimensional spaces.
Log-linear models are therefore also useful for dimensionality
reduction.
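A minimal numpy sketch of regression as numerosity reduction: only the two fitted parameters w and b need to be stored; the (x, y) observations are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

w, b = np.polyfit(x, y, deg=1)    # least-squares fit of y = w*x + b
y_approx = w * x + b              # reconstruct approximate values from w, b
print(w, b)
```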
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction.
Example: the following data are a list of prices of commonly sold items at
AllElectronics (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21,
21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
(An equal-width histogram of these prices is sketched below.)
Sampling
Sampling can be used as a data reduction technique because it allows a
large data set to be represented by a much smaller random sample (or
subset) of the data.
Sampling
Simple random sample without replacement (SRSWOR) of size s
Simple random sample with replacement (SRSWR) of size s (both illustrated in the sketch after this list)
Cluster sample
Stratified sample
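A minimal numpy sketch of SRSWOR and SRSWR over a hypothetical data set of 100 tuples; a stratified sample could be drawn analogously within each stratum.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1, 101)     # hypothetical data set of N = 100 tuples
s = 8                        # desired sample size

srswor = rng.choice(data, size=s, replace=False)   # without replacement
srswr = rng.choice(data, size=s, replace=True)     # with replacement
print(srswor)
print(srswr)
```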
Data Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the
attribute into intervals.
Interval labels can then be used to replace actual data values.
Supervised discretization: uses class information of the tuples
Unsupervised discretization: does not use class information
Top-down discretization or splitting: starts with one or a few split points and recursively divides the resulting intervals
Bottom-up discretization or merging: starts by treating all continuous values as potential split points and merges neighboring intervals
(A small pandas sketch of unsupervised discretization follows.)
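A minimal pandas sketch of unsupervised, equal-frequency discretization; the price values and the choice of four intervals are illustrative.

```python
import pandas as pd

# Hypothetical continuous prices split into 4 equal-frequency intervals,
# whose labels can then replace the raw values.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 40, 45, 52])
intervals = pd.qcut(prices, q=4)
print(intervals.value_counts().sort_index())
```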
Data Discretization and Concept Hierarchy Generation
A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
Concept hierarchies can be used to reduce the data by
collecting and replacing low-level concepts (such as numerical
values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior).
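A minimal pandas sketch of replacing numerical ages with higher-level concepts; the age values and the interval boundaries are illustrative.

```python
import pandas as pd

ages = pd.Series([13, 22, 25, 33, 35, 41, 52, 60, 64, 70])
concepts = pd.cut(ages,
                  bins=[0, 29, 59, 120],
                  labels=["youth", "middle-aged", "senior"])
print(concepts.tolist())
```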