St. Vincent Pallotti College of Engineering & Technology
Data Warehousing and Mining
(BEIT701T)
7th Sem B.E. (IT)
Presented By
Samir Siddiqui
CR FINAL YEAR IT
Department of Information Technology
December 22, 2022 Data Mining: Concepts and Techniques 2
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes
or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
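The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, assuming each record is a plain dictionary; the field names, the helper `find_problems`, and the reference date are hypothetical, and the birthday "03/07/1997" is read as 3 July 1997 purely for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: the attribute value is missing or blank
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: the value lies outside its valid domain
    if record.get("salary", 0) < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: stored age disagrees with the age derived from the birthday
    born = record["birthday"]
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    derived_age = today.year - born.year - (0 if had_birthday else 1)
    if derived_age != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

# Record built from the examples on this slide (birthday format is an assumption)
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2022, 12, 22))
for p in problems:
    print(p)
```

Each check corresponds to one bullet on the slide; real cleaning rules would depend on the schema at hand.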
Why Is Data Dirty?
Incomplete data may come from
“Not applicable” data value when collected
Different considerations between the time when the data was
collected and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality
data
Data extraction, cleaning, and transformation comprise
the majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
Forms of Data Preprocessing
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based such as Bayesian formula
or decision tree
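The two mean-based options above can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None`; the function names and the income figures are made up for the example.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace each None with the mean of observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    # Note: a class with no observed value at all would raise KeyError here
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

income = [30, None, 50, 90, None, 110]            # made-up values
label = ["low", "low", "low", "high", "high", "high"]

global_fill = fill_with_mean(income)
class_fill = fill_with_class_mean(income, label)
print(global_fill)
print(class_fill)
```

The global mean gives both missing entries the same value, while the class-wise mean gives each one a value typical of its own class, which is why the slide calls it "smarter".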
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
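The worked example above can be reproduced with a short sketch. It assumes the data is already sorted and divides evenly into the bins, rounds bin means to the nearest integer as the slide does, and sends ties to the lower boundary; the function names are illustrative only.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```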
Regression
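[Figure: regression line y = x + 1 on axes x and y, with the observed value Y1 and the fitted value Y1' marked at X1]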
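Smoothing by regression can be sketched as a least-squares line fit followed by replacing each observed value with its fitted value. The data points below are made up so that they scatter around y = x + 1, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]   # made-up noisy readings near y = x + 1

intercept, slope = fit_line(xs, ys)
# Smoothed data: each observed y is replaced by the value on the fitted line
smoothed = [intercept + slope * x for x in xs]
print(intercept, slope)
print(smoothed)
```

The fitted line comes out close to y = x + 1, and the smoothed values lie exactly on it.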
Cluster Analysis
Data Integration
Data integration:
Combines data from multiple sources into a coherent
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different
scales, e.g., metric vs. British units
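A small sketch of these integration issues, assuming two hypothetical sources: source A keys customers by `cust-id` and stores weight in kilograms, source B keys them by `cust-#` and stores weight in pounds. All rows, the `integrate` helper, and the tolerance are invented for illustration.

```python
LB_PER_KG = 2.20462  # pounds per kilogram

source_a = [{"cust-id": 1, "weight_kg": 70.0}, {"cust-id": 3, "weight_kg": 60.0}]
source_b = [{"cust-#": 1, "weight_lb": 154.3},
            {"cust-#": 2, "weight_lb": 200.0},
            {"cust-#": 3, "weight_lb": 176.0}]

def integrate(a_rows, b_rows, tolerance_kg=0.5):
    """Merge both sources on the customer key and report value conflicts."""
    merged = {row["cust-id"]: dict(row) for row in a_rows}
    conflicts = []
    for row in b_rows:
        key = row["cust-#"]                  # schema integration: cust-# matches cust-id
        kg = row["weight_lb"] / LB_PER_KG    # convert to the common (metric) unit
        if key not in merged:
            merged[key] = {"cust-id": key, "weight_kg": kg}
        elif abs(merged[key]["weight_kg"] - kg) > tolerance_kg:
            conflicts.append(key)            # same entity, different values
    return merged, conflicts

merged, conflicts = integrate(source_a, source_b)
print(sorted(merged))
print(conflicts)
```

Customer 1 agrees across sources once the units are converted, while customer 3 is flagged as a data value conflict that still needs resolving.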
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization
Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (μ_A: mean, σ_A: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that Max(|v'|) < 1
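The three normalization methods can be checked against the slide's numbers with a short sketch. The function names are illustrative, and the values used for decimal scaling are made up.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization of v using the attribute mean and standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
scaled, j = decimal_scaling([-986, 917])   # made-up attribute values

print(round(income_min_max, 3))
print(income_z)
print(scaled, j)
```

For the made-up values the largest magnitude is 986, so j = 3 and every scaled value falls strictly between -1 and 1.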
Data Reduction Strategies
Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run
on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results
Data reduction strategies
Data cube aggregation:
Dimensionality reduction — e.g., remove unimportant attributes
Data Compression
Numerosity reduction — e.g., fit data into models
Discretization and concept hierarchy generation
Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Further reduce the size of data to deal with
Reference appropriate levels
Use the smallest representation which is enough to
solve the task
Queries regarding aggregated information should be
answered using data cube, when possible
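Moving from a lower to a higher level of aggregation can be sketched as summing a measure over the dimensions that are dropped. The example below assumes a tiny cube of call minutes keyed by (customer, month); the figures and the `roll_up` helper are made up for illustration.

```python
from collections import defaultdict

# Cube cells: (customer, month) -> call minutes (made-up figures)
calls = {
    ("alice", "2022-01"): 120,
    ("alice", "2022-02"): 80,
    ("bob", "2022-01"): 45,
    ("bob", "2022-02"): 55,
}

def roll_up(cells, keep):
    """Sum the measure over every dimension whose index is not in keep."""
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in keep)] += amount
    return dict(totals)

per_customer = roll_up(calls, keep=(0,))   # drop the month dimension
grand_total = roll_up(calls, keep=())      # drop every dimension

print(per_customer)
print(grand_total)
```

A query about total minutes per customer can then be answered from the two-cell `per_customer` result instead of scanning every monthly cell.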
Attribute Subset Selection
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
Reduces the number of attributes appearing in the discovered
patterns, making them easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Decision-tree induction
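Step-wise forward selection can be sketched as a greedy loop that keeps adding the attribute that most improves a scoring function and stops when nothing helps. The scoring function below is a toy stand-in (it simply rewards three attributes assumed to be relevant and penalizes subset size); in practice the score would come from a statistical test or a classifier's accuracy.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Evaluate adding each remaining attribute and pick the best one
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

RELEVANT = {"A1", "A4", "A6"}  # assumption made only for this toy example

def toy_score(subset):
    """Reward relevant attributes, lightly penalize subset size."""
    return len(RELEVANT & set(subset)) - 0.1 * len(subset)

chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(chosen)
```

The loop evaluates only a small fraction of the exponentially many subsets, which is the point of using a heuristic here.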
Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
A4?
  A1?
    Class 1
    Class 2
  A6?
    Class 1
    Class 2
> Reduced attribute set: {A1, A4, A6}
Dimensionality Reduction: Principal
Component Analysis (PCA)
Given N data vectors in n dimensions, find k ≤ n orthogonal
vectors (principal components) that best represent the data
Steps
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
Used when the number of dimensions is large
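The steps above can be sketched in plain Python for the simplest case of keeping only the first principal component. This is a simplified illustration: the data is only mean-centered rather than fully range-normalized, power iteration is used as a stand-in for a full eigen-decomposition, and the points are made up. A real project would more likely rely on a numerical library.

```python
import math

def center(rows):
    """Subtract each attribute's mean so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=100):
    """Approximate the unit eigenvector of the largest eigenvalue (power iteration)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

data = [[2.0, 1.9], [4.0, 4.1], [6.0, 5.9], [8.0, 8.1]]  # made-up 2-D points
centered = center(data)
pc1 = first_component(covariance_matrix(centered))
# Project each point onto the first component: 2 attributes reduced to 1
reduced = [sum(x * c for x, c in zip(row, pc1)) for row in centered]
print(pc1)
print(reduced)
```

Because the two attributes are almost perfectly correlated, the first component points roughly along the diagonal and a single coordinate per point preserves nearly all of the variance.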
Principal Component Analysis
[Figure: original axes X1 and X2 with principal component axes Y1 and Y2]
Chapter 2: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy
generation
Summary
Summary
Data preparation or preprocessing is a big issue for both
data warehousing and data mining
Descriptive data summarization is needed for quality data
preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but data
preprocessing is still an active area of research
Question Bank
Q1. What is the need for data preprocessing? Explain in brief. [6M][S-17], [6M][W-16], [5M][S-19]
Q2. Summarize the data preprocessing steps in brief. [7M][W-17], [7M][S-18]
Q3. What is data cleaning? Explain different methods of data cleaning. [7M][W-17], [6M][W-16]
Q4. What is data transformation? Explain different methods of transformation. [8M][S-17]
Q5. Write short notes on:
a. Missing value b. Noisy data c. Cluster d. Outlier
Q6. Write a short note on data cleaning. OR How can data cleaning be handled in preprocessing? [6M][S-18], [3M][S-16]
Q7. What is data reduction? Explain different methods of data reduction. [7M][W-17], [7M][S-18], [4M][S-16], [7M][W-16], [4M][S-19]
Q8. What is normalization? Explain various types of normalization techniques with examples. [7M][S-18]
Question Bank
Q9. Explain data discretization and concept hierarchy generation. [6M][S-17], [7M][S-19]
Q10. What are the measures of data dispersion? [4M][S-19]
Q11. What is the need for multidimensional analysis? [5M][S-16]
Q12. Write short notes on:
a. Binning b. Regression c. Clustering d. Smoothing
e. Generalization f. Aggregation
Q13. Explain MIN-MAX normalization and Z-score normalization. [7M][W-17], [4M][S-16], [6M][S-19]
Q14. Explain the various issues to be considered in data integration. Also give the various forms of preprocessing. [6M][S-16]
Q15. What are the challenges in data preprocessing?