Unit-II Data Preprocessing
Data Preprocessing
Major tasks in data preprocessing; Data cleaning: missing values, noisy data; Data reduction: overview of data reduction strategies, principal components analysis, attribute subset selection, histograms, sampling; Data transformation: data transformation strategies overview, data transformation by normalization.
Why Data Preprocessing?
• Real-world data are typically huge in size and may be
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
Major tasks in Data preprocessing:
• The overall process of making data more suitable for data mining
• It includes several tasks that make the data more relevant
• Data cleaning: remove noise and inconsistencies in the data
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: merge data from multiple sources
– Integration of multiple databases, data cubes, or files
• Data transformation or data discretization: normalization may be applied to improve the accuracy and efficiency of the algorithms
– Normalization
• Data reduction: reduce the data size by aggregating, eliminating redundant features, or clustering
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data Cleaning
• Data is cleansed through processes such as
filling in missing values, smoothing the
noisy data, or resolving the inconsistencies
in the data.
• To remove noise and inconsistencies in the
data.
1. Missing values
2. Noisy data
3. Inconsistent Data
Data Integration
• Data integration: merge data from multiple sources or data stores.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set and improve the accuracy and speed of the subsequent data mining process.
• The semantic heterogeneity and structure of data pose great challenges in data integration.
• How can we match schema and objects from different sources?
– Entity identification problem
– Redundancy and correlation analysis
– Tuple duplication
– Data value conflict detection and resolution
Data Reduction
Data reduction:
This process can reduce the data size by
aggregating, eliminating redundant
features or clustering.
Various methods in reduction process:
1. Data Cube Aggregation
2. Dimensionality Reduction
3. Data Compression
4. Numerosity Reduction
Data transformation
• In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
• Normalization is often applied to improve the accuracy and efficiency of algorithms
– Smoothing
– Attribute construction
– Aggregation
– Normalization
– Discretization
– Concept hierarchy generation for nominal data
Data Cleaning - Missing values
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus
deleted
– data not entered due to misunderstanding
– certain data may not have been considered important at the time of entry
– the history or changes of the data were not registered
• Missing data may need to be inferred.
Missing Values – Data Cleaning
• Suppose the data set has no recorded value for several attributes.
• How can you go about filling in the missing values for this
attribute?
• Methods of filling missing values :
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value
5) Use the attribute mean or median for all samples belonging
to the same class as the given tuple
6) Use the most probable value to fill in the missing value
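A minimal pandas sketch (not from the slides) of methods 1, 3, 4, and 5; the small AGE/CLASS table and the sentinel value −1 are assumptions made only for illustration.

```python
# Illustrative sketch: several ways to handle missing values with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":   [25, 28, None, 29, None, 49],
    "class": ["A", "A", "B", "A", "A", "B"],
})

# 1) Ignore the tuple: drop rows that contain any missing value
dropped = df.dropna()

# 3) Use a global constant (a sentinel such as -1, or "Unknown" for nominal data)
global_filled = df.fillna({"age": -1})

# 4) Use a measure of central tendency (mean or median) of the attribute
mean_filled = df.fillna({"age": df["age"].mean()})

# 5) Use the mean of all samples belonging to the same class as the tuple
class_mean_filled = df.copy()
class_mean_filled["age"] = df.groupby("class")["age"].transform(
    lambda s: s.fillna(s.mean())
)
```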
1. Ignore the tuple – Missing values
1) Usually applied when the class label is missing, or when the attribute with missing values does not contribute to any of the classes.
2) Effective only when a tuple contains several attributes with missing values.
Example — original data set (missing AGE values shown as blank):
AGE  CLASS
25   A
28   A
45   B
29   A
     A
     A
     B
49   A
33   A
11   B
     A
20   B

After ignoring the tuples with missing AGE values:
AGE  CLASS
25   A
28   A
45   B
29   A
49   A
33   A
11   B
20   B
Contd…
• Drawbacks of ignoring the tuple:
– Poor practice when the percentage of missing values per attribute varies considerably
– Not effective when only a few of the attribute values are missing in a tuple
2. Fill in the missing value manually
• This method fills the missing values with the
assistance of humans.
Disadvantages:
1) This method is time consuming
2) It is not efficient
3) It is not feasible for large data sets with many missing values
Contd…
• Example of manual filling (missing AGE values filled in by a human):

Before:            After:
AGE  CLASS         AGE  CLASS
25   A             25   A
28   A             28   A
45   B             45   B
29   A             29   A
     A             26   A
     A             39   A
     B             23   B
49   A             49   A
33   A             33   A
11   B             11   B
     A             21   A
20   B             20   B
3. Use a global constant to fill in the missing value
• Replace all missing values of the attribute by the same constant, such as a label like “Unknown” or −∞.
Binning: Noisy Data
• Binning methods smooth sorted data values by consulting their “neighborhood” (the values around them). Example:
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29,
34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin medians:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
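A small Python sketch (illustration only) that reproduces the equal-frequency bins and the smoothing-by-bin-means result shown above:

```python
# Equal-frequency binning (4 values per bin) and smoothing by bin means.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

bin_size = 4
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

smoothed_by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

print(bins)               # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_means)  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```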
Regression:: Noisy Data
• Data can be smoothed by fitting the data to a regression function.
• Regression analysis is used to model the relationship
between one or more independent(predictor) variables
and dependent(response) variable(which is
continuous-valued )
• Two types of regression :
– Linear regression
» Straight line regression or single
» Multiple
– Non-linear regression
D) Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called the response variable or measurement) and of one or more independent variables (also called explanatory variables or predictors).
• The parameters are estimated so as to give a “best fit” of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.
• Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
[Figure: data points with the fitted line y = x + 1; the line predicts Y1′ for input X1]
Linear regression
• In straight-line regression analysis, a random variable y (called the response variable) is modeled as a linear function of another random variable x (called the predictor variable), with the equation
      y = b + wx            ……(1)
• The regression coefficients b (y-intercept) and w (slope) are estimated by the method of least squares:
      w = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²            ……(2)
      b = ȳ − w x̄            ……(3)
  where x̄ and ȳ are the means of x and y.
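A minimal NumPy sketch (not from the slides) that estimates w and b with the least-squares formulas above; the x and y values are made up for illustration:

```python
# Straight-line regression coefficients by the method of least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor (illustrative values)
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])      # response

x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b = y_mean - w * x_mean                                              # y-intercept

print(f"y = {b:.3f} + {w:.3f} x")   # fitted line used to predict y from x
```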
Handling Redundancy in Data Integration
• χ² (chi-square) test for two nominal attributes:
      χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)        450
Not like science fiction    50 (210)    1000 (840)       1050
Sum (col.)                 300          1200             1500

(Expected counts, computed from the row and column sums, are shown in parentheses.)
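Working the χ² formula over the four cells of this table gives a value of roughly 507.9, a very large value, so the two attributes are strongly correlated. A minimal Python sketch (illustration only) of that computation:

```python
# Chi-square statistic for the "science fiction vs. chess" contingency table.
observed = {  # (likes sci-fi, plays chess) -> observed count
    ("like", "chess"): 250, ("like", "no_chess"): 200,
    ("not_like", "chess"): 50, ("not_like", "no_chess"): 1000,
}
row_sums = {"like": 450, "not_like": 1050}
col_sums = {"chess": 300, "no_chess": 1200}
total = 1500

chi_square = 0.0
for (row, col), obs in observed.items():
    expected = row_sums[row] * col_sums[col] / total   # e.g. 450*300/1500 = 90
    chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 2))   # ~507.9: large value, attributes are correlated
```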
1. Stepwise forward selection: the procedure starts with an empty set of attributes and, at each step, adds the best of the remaining attributes; the final reduced set here is
R = {A1,A4,A6}
2. Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
Step 1: consider, the initial original data set
{A1,A2,A3,A4,A5,A6}
Step 2: Now, initialize reduced set ‘R’ with original data set of attributes .
R={A1,A2,A3,A4,A5,A6} ----> original set
Step 3: select the worst attribute from initialized attributes dataset and
remove from reduced set R
i) R = {A1,A2,A3,A4,A5,A6}
ii) R = {A1,A3,A4,A5,A6} - A2 is removed
iii) R = {A1,A4,A5,A6} - A3 is removed
iv) R = {A1,A4,A6} - A5 is removed
At each subsequent iteration or step, the worst remaining attribute is removed; the procedure stops when a stopping criterion (for example, a threshold on the attribute-evaluation measure or on the number of remaining attributes) is met.
Step 4: No further attributes are removed, so R is the final reduced subset
R = {A1,A4,A6}
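As a concrete illustration (not from the slides), scikit-learn's SequentialFeatureSelector can perform stepwise backward elimination; the data set, the decision-tree evaluator, and the choice to keep two attributes are assumptions for the example:

```python
# Stepwise backward elimination: start from the full attribute set and
# repeatedly drop the worst attribute until the stopping criterion is met.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # 4 attributes, 3 classes

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,                  # stopping criterion: keep 2 attributes
    direction="backward",                    # remove the worst attribute each step
    cv=5,
)
selector.fit(X, y)
print(selector.get_support())                # boolean mask of the reduced subset R
```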
3. Combination of forward selection and
backward elimination:
The stepwise forward selection and backward
elimination methods can be combined so that,
at each step, the procedure selects the best
attribute and removes the worst from among
the remaining attributes.
Decision tree induction:
– Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
– Decision tree induction constructs a flowchart-like structure:
• where each internal (nonleaf) node denotes a test on an
attribute
• Each branch corresponds to an outcome of the test,
• each external (leaf) node denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes.
– When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
– All attributes that do not appear in the tree are assumed
to be irrelevant.
– The set of attributes appearing in the tree form the reduced
subset of attributes.
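A minimal scikit-learn sketch (illustration only) of this idea: fit a decision tree and keep the attributes that actually appear in it; the data set and tree depth are arbitrary assumptions.

```python
# Attribute subset selection via decision tree induction: attributes that do
# not appear in the tree are assumed to be irrelevant.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Internal nodes store the index of the attribute they test; leaves store -2.
used = {data.feature_names[i] for i in tree.tree_.feature if i >= 0}
print(used)   # attributes appearing in the tree = reduced subset
```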
Principal Components Analysis(PCA)
• Suppose that the data ‘D’ to be reduced consist of tuples or data vectors described by ‘n’ attributes or dimensions.
• Principal components analysis (PCA; also
called the Karhunen-Loeve, or K-L, method)
• PCA searches for k n-dimensional orthogonal
vectors that can best be used to represent the
data, where k ≤ n.
Contd…
• PCA “combines” the essence of the attributes by creating an alternative, smaller set of variables.
• The initial data can then be projected onto this smaller set.
• PCA often reveals relationships that were not previously suspected, and thereby allows interpretations that would not ordinarily result.
Procedure of PCA:-
1. The input data are normalized, so that each
attribute falls within the same range. This step
helps ensure that attributes with large domains
will not dominate attributes with smaller
domains.
2. PCA computes k orthonormal vectors that
provide a basis for the normalized input data.
These are unit vectors that each point in a
direction perpendicular to the others. These
vectors are referred to as the principal
components. The input data are a linear
combination of the principal components.
Contd….
3. The principal components are sorted in order
of decreasing “significance” or strength. The
principal components essentially serve as a
new set of axes for the data providing
important information about variance.
Ex: Y1 and Y2, for the given set
of data originally mapped to the
axes X1 and X2.
Contd….
4. Because the components are sorted in decreasing order of
“significance,” the data size can be reduced by eliminating
the weaker components(low variance). Using the strongest
principal components, it should be possible to reconstruct
a good approximation of the original data.
Conclusion:
• PCA can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can
be handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to
multiple regression and cluster analysis.
• PCA tends to be better at handling sparse data.
Procedure of PCA: (for easy understanding)
• Input data are normalized: all attribute values are mapped to the same range.
• Compute k orthonormal vectors, called the principal components. These are unit vectors perpendicular to each other.
• Thus the input data are a linear combination of the principal components.
• Principal Components are ordered in the decreasing order of
“Significance” or strength.
• Size of the data can be reduced by eliminating the components with less
“Significance” or the weaker components are removed.
• Thus the Strongest Principal Component can be used to reconstruct a
good approximation of the original data.
• PCA can be applied to ordered & unordered attributes, sparse and skewed
data.
• It can also be applied to multidimensional data by reducing the problem to two dimensions.
• Works only for numeric data.
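A minimal scikit-learn sketch (not from the slides) of the procedure: normalize the data, compute the principal components, and keep the k strongest; the data set and k = 2 are assumptions for the example.

```python
# PCA as a dimensionality-reduction step: normalize, then project the data
# onto the strongest principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)            # n = 4 attributes

X_norm = StandardScaler().fit_transform(X)   # step 1: normalize the input data
pca = PCA(n_components=2)                    # keep k = 2 strongest components
X_reduced = pca.fit_transform(X_norm)        # project data onto the new axes

print(X_reduced.shape)                       # (150, 2): reduced representation
print(pca.explained_variance_ratio_)         # "significance" of each component
```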
B) Principal Component Analysis (PCA)
[Figure: data plotted on the original axes x1 and x2, with the principal component directions overlaid]
2) Data Compression: the process of reducing the amount of data required to represent a given quantity of information.
• Input data → encoder → compressed data → decoder → reconstructed (uncompressed) data
• Data compression types:
1. Lossless compression: an exact replica; no data is lost (e.g., text compression)
2. Lossy compression: some information is lost; less important information from the media is removed (e.g., images, audio, video)
• Data encoding or transformations are applied so
as to obtain a reduced or “compressed”
representation of the original data.
• If the original data can be reconstructed from the
compressed data without any loss of
information, the data reduction is called lossless.
• If, instead, we can reconstruct only an
approximation of the original data, then the data
reduction is called lossy.
• Two popular and effective methods of lossy data
compression are wavelet transforms and
principal component analysis
3) Numerosity reduction
• Numerosity reduction techniques replace the original data volume by
alternative, smaller forms of data representation.
• These techniques are divided into:
– Parametric
– Nonparametric
• Parametric methods: a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.)
1. Regression
2. Log-linear models
• Nonparametric methods: store reduced representations of the data without assuming a model.
1. Histograms
2. Clustering
3. Sampling
4. Data cube aggregation
• Linear Regression:
– Data are modeled to fit a straight line:
      y = wx + b
– where y is called the “response variable”,
  x is called the “predictor variable”,
  and w and b are the regression coefficients: b is the y-intercept and w is the slope of the line.
– These regression coefficients can be solved for by the “method of least squares”.
• Multiple Regression:
– An extension of linear regression
– The response variable y is modeled as a linear function of a multidimensional (multi-predictor) feature vector.
• Log-Linear Models:
– Estimate the probability of each cell in a base cuboid for a set of discretized attributes.
– Higher-order data cubes are constructed from lower-order data cubes.
Histograms
• A histogram partitions the data distribution of an attribute into disjoint subsets, referred to as buckets or bins.
• The following data are a list of prices of commonly sold items at AllElectronics.
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5,
8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
28, 28, 30, 30, 30.
• To build a histogram we need to know:
– How are the buckets determined and the attribute values partitioned?
Step 1: arrange the data in sorted order
Step 2: count the frequency of each value in the data set (1 occurs 2 times; 5 occurs 5 times; 8 occurs 2 times; and so on)
Step 3: apply the chosen partitioning rule to form the buckets (bins)
Types of Histograms:
• The partitioning strategies:
1. Singleton buckets-
2. Equal-width
3. Equal-frequency
Contd…
• Equal-width:
  In an equal-width histogram, the width of each bucket range is uniform (e.g., $10).
• Equal-frequency (or equal-depth):
  In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant.
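A small Python sketch (illustration only) that builds the equal-width histogram with $10-wide buckets for the price list given earlier:

```python
# Equal-width histogram: count how many prices fall into each $10-wide bucket.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
buckets = Counter((p - 1) // width for p in prices)   # bucket 0: 1-10, 1: 11-20, ...
for b in sorted(buckets):
    lo, hi = b * width + 1, (b + 1) * width
    print(f"{lo:2d}-{hi:2d}: {buckets[b]} values")
```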
Sampling
• Sampling represents a large data set by a much smaller random sample (subset) of the data.
• Let D be the data set with N tuples and S be the
sample selected from D
• Types of sampling :
– Non-probability Sampling
– Probability sampling
1. Simple random sample without replacement(SRSWOR)
2. Simple random sample with replacement (SRSWR)
3. Stratified sampling(statistical)
4. Cluster sampling (grouping)
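A minimal pandas/NumPy sketch (not from the slides) of the four sampling types; the ten-tuple data set, the stratum labels, and the cluster assignment are assumptions for the example.

```python
# SRSWOR, SRSWR, stratified sampling, and cluster sampling on a toy data set D.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame({"id": range(10), "stratum": ["young"] * 6 + ["senior"] * 4})

srswor = D.sample(n=4, replace=False, random_state=0)   # SRSWOR: no tuple drawn twice
srswr = D.sample(n=4, replace=True, random_state=0)     # SRSWR: a tuple may repeat

# Stratified sample: draw proportionally from each stratum (group)
stratified = D.groupby("stratum").sample(frac=0.5, random_state=0)

# Cluster sample: partition tuples into clusters, then pick whole clusters at random
D["cluster"] = D["id"] // 5                              # two clusters of 5 tuples
chosen = rng.choice(D["cluster"].unique(), size=1, replace=False)
cluster_sample = D[D["cluster"].isin(chosen)]
```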
1. Simple random sample without replacement (SRSWOR)
[Figure: a simple random sample drawn from the raw data without replacement (SRSWOR) and with replacement (SRSWR)]
Sampling: Cluster or Stratified Sampling
[Figure: cluster sampling and stratified sampling of the raw data]
Data Transformation
• Data are transformed or consolidated so that
the resulting mining process may be more
efficient, and the patterns found may be
easier to understand
• What should this new form of the data look like?
• Example: converting kilograms to grams expresses the same information in smaller units but with larger values.
Data Transformation Strategies
• Data are transformed or consolidated into forms
appropriate for mining is called data transformation
• Strategies for data transformation include the following:
1. Smoothing: works to remove noise from the data. Techniques for smoothing include
   1. Binning
   2. Regression
   3. Clustering
2. Attribute construction (or feature construction): new attributes are constructed and added from the given set of attributes to help the mining process.
3. Aggregation
• summary or aggregation operations are applied to
the data.
• Example: daily sales data may be aggregated so as to
compute monthly and annual total amounts. This
step is typically used in constructing a data cube for
data analysis at multiple abstraction levels.
[Figure: multidimensional aggregation example]
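A minimal pandas sketch (not from the slides) of this kind of aggregation; the 90 days of sales figures are invented for illustration.

```python
# Aggregate daily sales into monthly and annual totals, as done when
# constructing a data cube at multiple abstraction levels.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.resample("MS", on="date")["sales"].sum()   # monthly totals
annual = daily.resample("YS", on="date")["sales"].sum()    # annual totals
print(monthly)
print(annual)
```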
4. Normalization
– The attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
– When multiple attributes have values on different scales, this may lead to poor data models for data mining operations; so the attributes are normalized to bring them onto the same scale.
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
5. Discretization
• The raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized into higher-level
concepts, resulting in a concept hierarchy for the numeric attribute.
• Example: the values of the attribute price can be grouped into intervals such as ($X ...$Y]. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.
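A minimal pandas sketch (illustration only) of discretizing a numeric age attribute into conceptual labels; the bin edges and label names are assumptions.

```python
# Discretization: replace raw numeric ages with conceptual labels.
import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 58, 67, 72])
labels = pd.cut(
    ages,
    bins=[0, 20, 60, 120],                 # intervals (0, 20], (20, 60], (60, 120]
    labels=["youth", "adult", "senior"],
)
print(labels.tolist())   # ['youth', 'youth', 'adult', 'adult', 'adult', 'adult', 'senior', 'senior']
```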
6. Concept hierarchy generation for nominal
data
• where attributes such as street can be generalized to
higher-level concepts, like city or country
Many hierarchies for nominal attributes
are implicit within the database schema
and can be automatically defined at the
schema definition level.
Data Transformation by Normalization
• The measurement unit used can affect the data
analysis.
• For example, changing measurement units from
meters to inches for height, or from kilograms to
pounds for weight, may lead to very different results
• Why normalization?
– expressing an attribute in smaller units leads to a larger range for that attribute; to make the analysis independent of the choice of measurement units, the data should be normalized or standardized.
• Ex: give a common range representation like
[-1,1] or [0.0,1.0]
• Normalization is particularly useful for classification algorithms involving neural networks, nearest-neighbour classification, and other distance-measure-based techniques.
Different types of Normalization
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
To describe these techniques, let
D be the data set,
‘A’ be a numeric attribute of D
with n observed values
V = {v1, v2, ..., vn}, where vi is each value from V.
1. Min-max normalization
• Min-max normalization performs a linear transformation on the original data, e.g., into the range [0.0, 1.0] or [−1.0, 1.0].
• Suppose that attribute A has values {v1, v2, ..., vn}, with
  – minA = the minimum value in the list
  – maxA = the maximum value in the list
• Min-max normalization then maps a value vi of A to vi′ in the range [new_minA, new_maxA] by computing
      vi′ = (vi − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Example:: Min-Max Normalization
Q. The minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. By min-max
normalization, transform a value of $73,600 for income to map
income to the range [0.0,1.0].
A. By min-max normalization, a value of $73,600 for income is transformed to:
      (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716
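A tiny Python sketch (not from the slides) implementing the min-max formula and checking it against this example:

```python
# Min-max normalization of a single value of attribute A.
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map a value v of attribute A linearly into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716
```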