Unit-II Data Preprocessing
Data Preprocessing
Major tasks in data preprocessing; Data cleaning: missing values, noisy data; Data reduction: overview of data reduction strategies, principal components analysis, attribute subset selection, histograms, sampling; Data transformation: data transformation strategies overview, data transformation by normalization.
Why Data Preprocessing?
• Real-world data are typically huge in size and may be
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
Major tasks in Data preprocessing:
• The overall process of making data more suitable for data mining
• It includes several tasks that make the data more relevant
• Data cleaning: remove noise and inconsistencies in the data
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: merge data from multiple sources
– Integration of multiple databases, data cubes, or files
• Data transformation or data discretization: normalization may be applied to improve the accuracy and efficiency of the algorithms
– Normalization
• Data reduction: reduce the data size by aggregating, eliminating redundant features, or clustering
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data Cleaning
• Data is cleansed through processes such as
filling in missing values, smoothing the
noisy data, or resolving the inconsistencies
in the data.
• To remove noise and inconsistencies in the
data.
1. Missing values
2. Noisy data
3. Inconsistent Data
Data Integration
• Data integration: merge data from multiple sources or data stores.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set and improve the accuracy and speed of the subsequent data mining process.
• The semantic heterogeneity and structure of data pose great challenges in data integration.
• How can we match schema and objects from different sources?
– Entity identification problem
– Redundancy and correlation analysis
– Tuple duplication
– Data value conflict detection and resolution
Data Reduction
Data reduction:
This process can reduce the data size by
aggregating, eliminating redundant
features or clustering.
Various methods in reduction process:
1. Data Cube Aggregation
2. Dimensionality Reduction
3. Data Compression
4. Numerosity Reduction
Data transformation
• In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
• Normalization is often applied to improve the accuracy and efficiency of algorithms
– Smoothing
– Attribute construction
– Aggregation
– Normalization
– Discretization
– Concept hierarchy generation for nominal data
Data Cleaning - Missing values
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus
deleted
– data not entered due to misunderstanding
– certain data may not have been considered important at the time of entry
– the history or changes of the data were not registered
• Missing data may need to be inferred.
Missing Values – Data Cleaning
• Suppose the data set has no recorded value for several attributes.
• How can you go about filling in the missing values for this
attribute?
• Methods of filling missing values :
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value
5) Use the attribute mean or median for all samples belonging
to the same class as the given tuple
6) Use the most probable value to fill in the missing value
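A minimal pandas sketch (not from the slides) of methods 1, 3, 4, and 5; the small AGE/CLASS table and the sentinel value −1 are assumptions made only for illustration.

```python
# Illustrative sketch: several ways to handle missing values with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":   [25, 28, None, 29, None, 49],
    "class": ["A", "A", "B", "A", "A", "B"],
})

# 1) Ignore the tuple: drop rows that contain any missing value
dropped = df.dropna()

# 3) Use a global constant (a sentinel such as -1, or "Unknown" for nominal data)
global_filled = df.fillna({"age": -1})

# 4) Use a measure of central tendency (mean or median) of the attribute
mean_filled = df.fillna({"age": df["age"].mean()})

# 5) Use the mean of all samples belonging to the same class as the tuple
class_mean_filled = df.copy()
class_mean_filled["age"] = df.groupby("class")["age"].transform(
    lambda s: s.fillna(s.mean())
)
```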
1. Ignore the tuple – Missing values
1) Usually applied when the class label is missing, or when the attribute with missing values does not contribute to any of the classes.
2) Effective only when a tuple contains several attributes with missing values.
Example — original data set (missing AGE values shown as blank):
AGE  CLASS
25   A
28   A
45   B
29   A
     A
     A
     B
49   A
33   A
11   B
     A
20   B

After ignoring the tuples with missing AGE values:
AGE  CLASS
25   A
28   A
45   B
29   A
49   A
33   A
11   B
20   B
Contd…
• Drawbacks of ignoring the tuple:
– Poor practice when the percentage of missing values per attribute varies considerably
– Not effective when only a few of the attribute values are missing in a tuple
2. Fill in the missing value manually
• This method fills the missing values with the
assistance of humans.
Disadvantages:
1) This method is time consuming
2) It is not efficient
3) It is not feasible for large data sets with many missing values
Contd…
• Example of manual filling (missing AGE values filled in by a human):

Before:            After:
AGE  CLASS         AGE  CLASS
25   A             25   A
28   A             28   A
45   B             45   B
29   A             29   A
     A             26   A
     A             39   A
     B             23   B
49   A             49   A
33   A             33   A
11   B             11   B
     A             21   A
20   B             20   B
3. Use a global constant to fill in the missing value
• Replace all missing values of the attribute by the same constant, such as a label like “Unknown” or −∞.
Binning: Noisy Data
• Binning methods smooth sorted data values by consulting their “neighborhood” (the values around them). Example:
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29,
34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin medians:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
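A small Python sketch (illustration only) that reproduces the equal-frequency bins and the smoothing-by-bin-means result shown above:

```python
# Equal-frequency binning (4 values per bin) and smoothing by bin means.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

bin_size = 4
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

smoothed_by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

print(bins)               # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```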
Regression:: Noisy Data
• Data can be smoothed by fitting the data to a regression function.
• Regression analysis is used to model the relationship
between one or more independent(predictor) variables
and dependent(response) variable(which is
continuous-valued )
• Two types of regression :
– Linear regression
» Straight line regression or single
» Multiple
– Non-linear regression
D) Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called the response variable or measurement) and of one or more independent variables (also called explanatory variables or predictors).
• The parameters are estimated so as to give a “best fit” of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.
• Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
[Figure: data points with the fitted line y = x + 1; the line predicts Y1′ for input X1]
Linear regression
• In straight-line regression analysis, a random variable y (called the response variable) is modeled as a linear function of another random variable x (called the predictor variable), with the equation
      y = b + wx            ……(1)
• The regression coefficients b (y-intercept) and w (slope) are estimated by the method of least squares:
      w = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²            ……(2)
      b = ȳ − w x̄            ……(3)
  where x̄ and ȳ are the means of x and y.
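A minimal NumPy sketch (not from the slides) that estimates w and b with the least-squares formulas above; the x and y values are made up for illustration:

```python
# Straight-line regression coefficients by the method of least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor (illustrative values)
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])      # response

x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b = y_mean - w * x_mean                                              # y-intercept

print(f"y = {b:.3f} + {w:.3f} x")   # fitted line used to predict y from x
```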
Handling Redundancy in Data Integration
• χ² (chi-square) test for two nominal attributes:
      χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)        450
Not like science fiction    50 (210)    1000 (840)       1050
Sum (col.)                 300          1200             1500

(Expected counts, computed from the row and column sums, are shown in parentheses.)
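Working the χ² formula over the four cells of this table gives a value of roughly 507.9, a very large value, so the two attributes are strongly correlated. A minimal Python sketch (illustration only) of that computation:

```python
# Chi-square statistic for the "science fiction vs. chess" contingency table.
observed = {  # (likes sci-fi, plays chess) -> observed count
    ("like", "chess"): 250, ("like", "no_chess"): 200,
    ("not_like", "chess"): 50, ("not_like", "no_chess"): 1000,
}
row_sums = {"like": 450, "not_like": 1050}
col_sums = {"chess": 300, "no_chess": 1200}
total = 1500

chi_square = 0.0
for (row, col), obs in observed.items():
    expected = row_sums[row] * col_sums[col] / total   # e.g. 450*300/1500 = 90
    chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 2))   # ~507.9: large value, attributes are correlated
```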
1. Stepwise forward selection: the procedure starts with an empty set of attributes and, at each step, adds the best of the remaining attributes; the final reduced set here is
R = {A1,A4,A6}
2. Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
Step 1: consider, the initial original data set
{A1,A2,A3,A4,A5,A6}
Step 2: Now, initialize reduced set ‘R’ with original data set of attributes .
R={A1,A2,A3,A4,A5,A6} ----> original set
Step 3: select the worst attribute from initialized attributes dataset and
remove from reduced set R
i) R = {A1,A2,A3,A4,A5,A6}
ii) R = {A1,A3,A4,A5,A6} - A2 is removed
iii) R = {A1,A4,A5,A6} - A3 is removed
iv) R = {A1,A4,A6} - A5 is removed
At each subsequent iteration or step, the worst remaining attribute is removed; the procedure stops when a stopping criterion (for example, a threshold on the attribute-evaluation measure or on the number of remaining attributes) is met.
Step 4: No further attributes are removed, so R is the final reduced subset
R = {A1,A4,A6}
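As a concrete illustration (not from the slides), scikit-learn's SequentialFeatureSelector can perform stepwise backward elimination; the data set, the decision-tree evaluator, and the choice to keep two attributes are assumptions for the example:

```python
# Stepwise backward elimination: start from the full attribute set and
# repeatedly drop the worst attribute until the stopping criterion is met.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # 4 attributes, 3 classes

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,                  # stopping criterion: keep 2 attributes
    direction="backward",                    # remove the worst attribute each step
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())                # boolean mask of the reduced subset R
```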
3. Combination of forward selection and
backward elimination:
The stepwise forward selection and backward
elimination methods can be combined so that,
at each step, the procedure selects the best
attribute and removes the worst from among
the remaining attributes.
Decision tree induction:
– Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
– Decision tree induction constructs a flowchart-like structure:
• where each internal (nonleaf) node denotes a test on an
attribute
• Each branch corresponds to an outcome of the test,
• each external (leaf) node denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes.
– When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
– All attributes that do not appear in the tree are assumed
to be irrelevant.
– The set of attributes appearing in the tree form the reduced
subset of attributes.
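A minimal scikit-learn sketch (illustration only) of this idea: fit a decision tree and keep the attributes that actually appear in it; the data set and tree depth are arbitrary assumptions.

```python
# Attribute subset selection via decision tree induction: attributes that do
# not appear in the tree are assumed to be irrelevant.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Internal nodes store the index of the attribute they test; leaves store -2.
used = {data.feature_names[i] for i in tree.tree_.feature if i >= 0}
print(used)   # attributes appearing in the tree = reduced subset
```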
Principal Components Analysis(PCA)
• Suppose that the data ‘D’ to be reduced consist of tuples or data vectors described by ‘n’ attributes or dimensions.
• Principal components analysis (PCA; also
called the Karhunen-Loeve, or K-L, method)
• PCA searches for k n-dimensional orthogonal
vectors that can best be used to represent the
data, where k ≤ n.
Contd…
• PCA “combines” the essence of the attributes by creating an alternative, smaller set of variables.
• The initial data can then be projected onto this smaller set.
• PCA often reveals relationships that were not previously suspected, and thereby allows interpretations that would not ordinarily result.
Procedure of PCA:-
1. The input data are normalized, so that each
attribute falls within the same range. This step
helps ensure that attributes with large domains
will not dominate attributes with smaller
domains.
2. PCA computes k orthonormal vectors that
provide a basis for the normalized input data.
These are unit vectors that each point in a
direction perpendicular to the others. These
vectors are referred to as the principal
components. The input data are a linear
combination of the principal components.
Contd….
3. The principal components are sorted in order
of decreasing “significance” or strength. The
principal components essentially serve as a
new set of axes for the data providing
important information about variance.
Ex: Y1 and Y2, for the given set
of data originally mapped to the
axes X1 and X2.
Contd….
4. Because the components are sorted in decreasing order of
“significance,” the data size can be reduced by eliminating
the weaker components(low variance). Using the strongest
principal components, it should be possible to reconstruct
a good approximation of the original data.
Conclusion:
• PCA can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can
be handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to
multiple regression and cluster analysis.
• PCA tends to be better at handling sparse data.
Procedure of PCA: (for easy understanding)
• Input data are normalized: all attribute values are mapped to the same range.
• Compute k orthonormal vectors, called the principal components. These are unit vectors perpendicular to each other.
• Thus the input data are a linear combination of the principal components.
• Principal Components are ordered in the decreasing order of
“Significance” or strength.
• Size of the data can be reduced by eliminating the components with less
“Significance” or the weaker components are removed.
• Thus the Strongest Principal Component can be used to reconstruct a
good approximation of the original data.
• PCA can be applied to ordered & unordered attributes, sparse and skewed
data.
• It can also be applied to multidimensional data by reducing the problem to two dimensions.
• Works only for numeric data.
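A minimal scikit-learn sketch (not from the slides) of the procedure: normalize the data, compute the principal components, and keep the k strongest; the data set and k = 2 are assumptions for the example.

```python
# PCA as a dimensionality-reduction step: normalize, then project the data
# onto the strongest principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)            # n = 4 attributes

X_norm = StandardScaler().fit_transform(X)   # step 1: normalize the input data
pca = PCA(n_components=2)                    # keep k = 2 strongest components
X_reduced = pca.fit_transform(X_norm)        # project data onto the new axes

print(X_reduced.shape)                       # (150, 2): reduced representation
print(pca.explained_variance_ratio_)         # "significance" of each component
```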
B) Principal Component Analysis (PCA)
[Figure: data plotted on the original axes x1 and x2, with the principal component directions overlaid]
2) Data Compression: the process of reducing the amount of data required to represent a given quantity of information.
• Input data → encoder → compressed data → decoder → reconstructed (uncompressed) data
• Data compression types:
1. Lossless compression: an exact replica; no data is lost (e.g., text compression)
2. Lossy compression: some information is lost; less important information from the media is removed (e.g., images, audio, video)
• Data encoding or transformations are applied so
as to obtain a reduced or “compressed”
representation of the original data.
• If the original data can be reconstructed from the
compressed data without any loss of
information, the data reduction is called lossless.
• If, instead, we can reconstruct only an
approximation of the original data, then the data
reduction is called lossy.
• Two popular and effective methods of lossy data
compression are wavelet transforms and
principal component analysis
3) Numerosity reduction
• Numerosity reduction techniques replace the original data volume by
alternative, smaller forms of data representation.
• These techniques are divided into:
– Parametric
– Nonparametric
• Parametric methods: a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.)
1. Regression
2. Log-linear models
• Nonparametric methods: store reduced representations of the data without assuming a model.
1. Histograms
2. Clustering
3. Sampling
4. Data cube aggregation
• Linear Regression:
– Data are modeled to fit a straight line:
      y = wx + b
– where y is called the “response variable”,
  x is called the “predictor variable”,
  and w and b are the regression coefficients: b is the y-intercept and w is the slope of the line.
– These regression coefficients can be solved for by the “method of least squares”.
• Multiple Regression:
– An extension of linear regression
– The response variable y is modeled as a linear function of a multidimensional (multi-predictor) feature vector.
• Log-Linear Models:
– Estimate the probability of each cell in a base cuboid for a set of discretized attributes.
– Higher-order data cubes are constructed from lower-order data cubes.
Histograms
• A histogram partitions the data distribution of an attribute into disjoint subsets, referred to as buckets or bins.
• The following data are a list of prices of commonly sold items at AllElectronics.
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5,
8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
28, 28, 30, 30, 30.
• To build a histogram we need to know:
– How are the buckets determined and the attribute values partitioned?
Step 1: arrange the data in sorted order
Step 2: count the frequency of each value in the data set (1 occurs 2 times; 5 occurs 5 times; 8 occurs 2 times; and so on)
Step 3: apply the chosen partitioning rule to form the buckets (bins)
Types of Histograms:
• The partitioning strategies:
1. Singleton buckets-
2. Equal-width
3. Equal-frequency
Contd…
• Equal-width:
  In an equal-width histogram, the width of each bucket range is uniform (e.g., $10).
• Equal-frequency (or equal-depth):
  In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant.
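A small Python sketch (illustration only) that builds the equal-width histogram with $10-wide buckets for the price list given earlier:

```python
# Equal-width histogram: count how many prices fall into each $10-wide bucket.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
buckets = Counter((p - 1) // width for p in prices)   # bucket 0: 1-10, 1: 11-20, ...
for b in sorted(buckets):
    lo, hi = b * width + 1, (b + 1) * width
    print(f"{lo:2d}-{hi:2d}: {buckets[b]} values")
```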
Sampling
• Sampling represents a large data set by a much smaller random sample (subset) of the data.
• Let D be the data set with N tuples and S be the
sample selected from D
• Types of sampling :
– Non-probability Sampling
– Probability sampling
1. Simple random sample without replacement(SRSWOR)
2. Simple random sample with replacement (SRSWR)
3. Stratified sampling(statistical)
4. Cluster sampling (grouping)
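A minimal pandas/NumPy sketch (not from the slides) of the four sampling types; the ten-tuple data set, the stratum labels, and the cluster assignment are assumptions for the example.

```python
# SRSWOR, SRSWR, stratified sampling, and cluster sampling on a toy data set D.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame({"id": range(10), "stratum": ["young"] * 6 + ["senior"] * 4})

srswor = D.sample(n=4, replace=False, random_state=0)   # SRSWOR: no tuple drawn twice
srswr = D.sample(n=4, replace=True, random_state=0)     # SRSWR: a tuple may repeat

# Stratified sample: draw proportionally from each stratum (group)
stratified = D.groupby("stratum").sample(frac=0.5, random_state=0)

# Cluster sample: partition tuples into clusters, then pick whole clusters at random
D["cluster"] = D["id"] // 5                              # two clusters of 5 tuples
chosen = rng.choice(D["cluster"].unique(), size=1, replace=False)
cluster_sample = D[D["cluster"].isin(chosen)]
```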
1. Simple random sample without replacement (SRSWOR)
[Figure: a simple random sample drawn from the raw data without replacement (SRSWOR) and with replacement (SRSWR)]
Sampling: Cluster or Stratified Sampling
[Figure: cluster sampling and stratified sampling of the raw data]
Data Transformation
• Data are transformed or consolidated so that
the resulting mining process may be more
efficient, and the patterns found may be
easier to understand
• What should this new form of the data look like?
• Example: converting kilograms to grams expresses the same information in smaller units but with larger values.
Data Transformation Strategies
• Data are transformed or consolidated into forms
appropriate for mining is called data transformation
• Strategies for data transformation include the following:
1. Smoothing: works to remove noise from the data. Techniques for smoothing include
   1. Binning
   2. Regression
   3. Clustering
2. Attribute construction (or feature construction): new attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation
• summary or aggregation operations are applied to
the data.
• Example: daily sales data may be aggregated so as to
compute monthly and annual total amounts. This
step is typically used in constructing a data cube for
data analysis at multiple abstraction levels.
[Figure: multidimensional aggregation example]
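A minimal pandas sketch (not from the slides) of this kind of aggregation; the 90 days of sales figures are invented for illustration.

```python
# Aggregate daily sales into monthly and annual totals, as done when
# constructing a data cube at multiple abstraction levels.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.resample("MS", on="date")["sales"].sum()   # monthly totals
annual = daily.resample("YS", on="date")["sales"].sum()    # annual totals
print(monthly)
print(annual)
```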
4. Normalization
– The attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
– When multiple attributes have values on different scales, this may lead to poor data models for data mining operations; so the attributes are normalized to bring them onto the same scale.
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
5. Discretization
• The raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized into higher-level
concepts, resulting in a concept hierarchy for the numeric attribute.
• Example: the values of the attribute price can be grouped into intervals such as ($X ...$Y]. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.
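A minimal pandas sketch (illustration only) of discretizing a numeric age attribute into conceptual labels; the bin edges and label names are assumptions.

```python
# Discretization: replace raw numeric ages with conceptual labels.
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 58, 67, 72])
labels = pd.cut(
    ages,
    bins=[0, 20, 60, 120],                 # intervals (0, 20], (20, 60], (60, 120]
    labels=["youth", "adult", "senior"],
)
print(labels.tolist())   # ['youth', 'youth', 'adult', 'adult', 'adult', 'adult', 'senior', 'senior']
```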
6. Concept hierarchy generation for nominal
data
• where attributes such as street can be generalized to
higher-level concepts, like city or country
Many hierarchies for nominal attributes
are implicit within the database schema
and can be automatically defined at the
schema definition level.
Data Transformation by Normalization
• The measurement unit used can affect the data
analysis.
• For example, changing measurement units from
meters to inches for height, or from kilograms to
pounds for weight, may lead to very different results
• Why normalization?
– expressing an attribute in smaller units leads to a larger range for that attribute; to make the analysis independent of the choice of measurement units, the data should be normalized or standardized.
• Ex: give a common range representation like
[-1,1] or [0.0,1.0]
• Normalization is particularly useful for classification algorithms involving neural networks, nearest-neighbour classification, and other distance-measure-based techniques.
Different types of Normalization
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
To describe these techniques, let
D be the data set,
‘A’ be a numeric attribute of D
with n observed values
V = {v1, v2, ..., vn}, where vi is each value from V.
1. Min-max normalization
• Min-max normalization performs a linear transformation on the original data, e.g., into the range [0.0, 1.0] or [−1.0, 1.0].
• Suppose that attribute A has values {v1, v2, ..., vn}, with
  – minA = the minimum value in the list
  – maxA = the maximum value in the list
• Min-max normalization then maps a value vi of A to vi′ in the range [new_minA, new_maxA] by computing
      vi′ = (vi − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Example:: Min-Max Normalization
Q. The minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. By min-max
normalization, transform a value of $73,600 for income to map
income to the range [0.0,1.0].
A. By min-max normalization, a value of $73,600 for income is transformed to:
      (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716
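A tiny Python sketch (not from the slides) implementing the min-max formula and checking it against this example:

```python
# Min-max normalization of a single value of attribute A.
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map a value v of attribute A linearly into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716
```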