UNIT-II

Data Preprocessing
Major tasks in data pre-processing, Data
cleaning: Missing values, noisy Data; Data
reduction: Overview of data reduction
strategies, principal components analysis,
attribute subset selection, histograms,
sampling; Data transformation: Data
transformation strategies overview, data
transformation by normalization.
Why Data Preprocessing?
• Data in the real world is huge in size and may be
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names

• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data

• Real-world data tend to be dirty, incomplete and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process.
• Detecting anomalies, rectifying them early and reducing the data to be analyzed can bring large benefits for decision making, so data preprocessing is needed.
Why Data Preprocessing
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustworthy the data are
– Interpretability: how easily the data can be understood?

Major tasks in Data preprocessing:
• Overall process of making data more suitable for data mining
• It includes several tasks employed in the process to make data more
relevant
• Data cleaning : to remove noise and inconsistencies in the data.
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
• Data Integration : merge data from multiple sources
-- Integration of multiple databases, data cubes, or files.
• Data transformation or data discretization : normalization may be
applied to improve the accuracy and efficiency of the algorithms
– Normalization
• Data reduction: can reduce the data size by aggregating, eliminating
redundant features or clustering.
– Dimensionality reduction
– Numerosity reduction
– Data compression
Data Cleaning
• Data is cleansed through processes such as
filling in missing values, smoothing the
noisy data, or resolving the inconsistencies
in the data.
• To remove noise and inconsistencies in the
data.
1. Missing values
2. Noisy data
3. Inconsistent Data
Data Integration
• Data Integration:
 merge data from multiple sources or data stores.
 Careful integration can help reduce and avoid redundancies
and inconsistencies in the resulting data set
 improve the accuracy and speed of the subsequent data
mining process.
 The semantic heterogeneity and structure of data pose great
challenges in data integration.
 How can we match schema and objects from different
sources?
 entity identification problem
 Redundancy and Correlation Analysis
 Tuple Duplication
 Data Value Conflict Detection and Resolution
Data Reduction
Data reduction:
This process can reduce the data size by
aggregating, eliminating redundant
features or clustering.
Various methods in reduction process:
1. Data Cube Aggregation
2. Dimensionality Reduction
3. Data Compression
4. Numerosity Reduction
Data transformation
• In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may
be more efficient, and the patterns found may be
easier to understand.
• Normalization is commonly applied to improve the accuracy
and efficiency of algorithms:
– Smoothing
– Attribute construction
– Aggregation
– Normalization
– Discretization
– Concept hierarchy generation for nominal data
Data Cleaning - Missing values
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus
deleted
– data not entered due to misunderstanding
– certain data may not have been considered important at the time of entry
– failure to register history or changes of the data
• Missing data may need to be inferred.
Missing Values – Data Cleaning
• Suppose a data set has tuples with no recorded value for one or more attributes.
• How can you go about filling in the missing values for such an attribute?
• Methods of filling in missing values:
1) Ignore the tuple
2) Fill in the missing value manually
3) Use a global constant to fill in the missing value
4) Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value
5) Use the attribute mean or median for all samples belonging
to the same class as the given tuple
6) Use the most probable value to fill in the missing value
1. Ignore the tuple – Missing values
1) Typically used when the class label is missing, or when the attribute with missing values does not contribute to any of the classes.
2) Effective only when a tuple contains several attributes with missing values.

Example: original data set           After ignoring the tuples with missing values:

AGE    CLASS                         AGE    CLASS
25     A                             25     A
28     A                             28     A
45     B                             45     B
29     A                             29     A
?      A                             49     A
?      A                             33     A
?      B                             11     B
49     A                             20     B
33     A
11     B
?      A
20     B
Contd…
• Drawbacks of ignoring the tuple:
– Poor practice when the percentage of missing values per attribute varies considerably
– Not effective when only a few of the attribute values are missing in a tuple
2. Fill in the missing value manually
• This method fills in the missing values with the
assistance of humans.
Disadvantages:
1) This method is time consuming
2) It is not efficient
3) It is not feasible for large data sets with many missing values
Contd…
• Example of manual filling:

Before (with missing values)         After manual filling
AGE    CLASS                         AGE    CLASS
25     A                             25     A
28     A                             28     A
45     B                             45     B
29     A                             29     A
?      A                             26     A
?      A                             39     A
?      B                             23     B
49     A                             49     A
33     A                             33     A
11     B                             11     B
?      A                             21     A
20     B                             20     B
3. Use a global constant to fill in the missing value

• Replace all missing attribute values by the same constant, such as a label.
• Ex: “Unknown”, −∞, none, N/A, etc.
• Disadvantages:
– The mining results may be expressed in terms of the constant itself
– Sometimes the mining program may mistakenly think that the constant values form an interesting concept
4. Use a measure of central tendency for the attribute
(e.g., the mean or median) to fill in the missing value

• A measure of central tendency indicates the “middle” value of a data distribution.
• In general,
– for symmetric data distributions the mean is computed
– for skewed data distributions the median is employed
Contd…
Use the attribute mean for numerical data
Ex 1: mean substitution for the missing values
MEAN = (sum of the available attribute values) / (count of tuples with a recorded value)
     = (25 + 28 + 45 + 29 + 49 + 33 + 11 + 20) / 8
     = 240 / 8 = 30

Before (with missing values)         After mean substitution
AGE    CLASS                         AGE    CLASS
25     A                             25     A
28     A                             28     A
45     B                             45     B
29     A                             29     A
?      A                             30     A
?      A                             30     A
?      B                             30     B
49     A                             49     A
33     A                             33     A
11     B                             11     B
?      A                             30     A
20     B                             20     B
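The same substitution can be written in a few lines of code. Below is a minimal sketch using pandas (an assumption; the slides do not prescribe any tool), with NaN used to encode the missing AGE values:

```python
import pandas as pd
import numpy as np

# Toy data set from the slides; NaN marks a missing AGE value (assumed encoding).
df = pd.DataFrame({
    "AGE":   [25, 28, 45, 29, np.nan, np.nan, np.nan, 49, 33, 11, np.nan, 20],
    "CLASS": ["A", "A", "B", "A", "A", "A", "B", "A", "A", "B", "A", "B"],
})

# Mean of the available values: (25+28+45+29+49+33+11+20)/8 = 30
mean_age = df["AGE"].mean()                  # pandas skips NaN by default
df["AGE_filled"] = df["AGE"].fillna(mean_age)

# For a skewed distribution, the median would be used instead:
# df["AGE"].fillna(df["AGE"].median())
print(df)
```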
Contd
5. Use attribute mean or median for all samples
belonging to same class as the given tuple
• The missing value is filled using the mean or median of the attribute's data distribution.
• In general,
– for symmetric data distributions the mean is computed
– for skewed data distributions the median is employed
• Here, however, the mean or median is evaluated over the values of the same class only.
• For example, if classifying customers according to
credit risk, we may replace the missing value with the
mean income value for customers in the same credit
risk category as that of the given tuple.
Contd…
Use the attribute mean for samples of the same class
Ex 1: class-wise mean substitution for the missing values
Class A mean = (sum of available values in class A) / (count of class A tuples with a recorded value)
             = (25 + 28 + 29 + 49 + 33) / 5 = 164 / 5 = 32.8
Class B mean = (45 + 11 + 20) / 3 = 76 / 3 ≈ 25.3

Before (with missing values)         After class-wise mean substitution
AGE    CLASS                         AGE    CLASS
25     A                             25     A
28     A                             28     A
45     B                             45     B
29     A                             29     A
?      A                             32.8   A
?      A                             32.8   A
?      B                             25.3   B
49     A                             49     A
33     A                             33     A
11     B                             11     B
?      A                             32.8   A
20     B                             20     B
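A small pandas sketch of the same class-wise imputation, again assuming missing values are encoded as NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "AGE":   [25, 28, 45, 29, np.nan, np.nan, np.nan, 49, 33, 11, np.nan, 20],
    "CLASS": ["A", "A", "B", "A", "A", "A", "B", "A", "A", "B", "A", "B"],
})

# Per-class means over the available values:
# class A -> (25+28+29+49+33)/5 = 32.8, class B -> (45+11+20)/3 ≈ 25.3
class_means = df.groupby("CLASS")["AGE"].transform("mean")

# Fill each missing AGE with the mean of its own class.
df["AGE_filled"] = df["AGE"].fillna(class_means)
print(df)
```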
6. Use the most probable value to fill
in the missing value
• This may be determined with
– Regression
– inference-based tools using a Bayesian formalism
– decision tree induction.
• This is a popular strategy and is widely used.
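As an illustration of the "most probable value" idea, here is a sketch using a decision tree regressor from scikit-learn; the data set, attribute names, and tree depth are hypothetical, chosen only to show the fit-on-known / predict-on-missing pattern:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data set: predict the missing 'income' from the other attributes.
df = pd.DataFrame({
    "age":            [25, 32, 47, 51, 23, 45, 36, 29],
    "years_employed": [2, 8, 20, 25, 1, 18, 10, 4],
    "income":         [30, 52, 90, np.nan, 28, np.nan, 61, 40],  # in $1000s
})

known   = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Train a decision tree on the tuples whose income is known ...
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(known[["age", "years_employed"]], known["income"])

# ... and use it to predict the most probable value for the missing tuples.
df.loc[df["income"].isna(), "income"] = tree.predict(missing[["age", "years_employed"]])
print(df)
```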
Noisy Data
• Noise is a random error or variance in a measured
variable.
• Recording errors, unusual values, or inconsistencies in the
dataset
• Noisy Data may be due to faulty data collection
instruments, data entry problems and technology
limitation.
• How to Handle Noisy Data?
• Binning
• Regression
• Outlier analysis or Clustering
Binning::Noisy Data
• Binning methods smooth a sorted data value
by consulting its “neighborhood,” that is,
the values around it. The sorted values are
distributed into a number of “buckets,” or
bins. Because binning methods consult the
neighborhood of values, they perform local
smoothing.
Binning Methods for Data Smoothing/Simple
Discretization Method or Binning Technique:
Example 2:

• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29,
34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23   (mean 22.75, rounded)
- Bin 3: 29, 29, 29, 29   (mean 29.25, rounded)
* Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
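A minimal NumPy sketch of the example above, assuming the sorted values split evenly into three equal-frequency bins; it reproduces smoothing by bin means and by bin boundaries:

```python
import numpy as np

# Sorted prices from the slide example.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into equal-frequency (equi-depth) bins of 4 values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)

# Smoothing by bin boundaries: each value is replaced by the closer of the
# two bin boundaries (the minimum and maximum of the bin).
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[9. ...], [22.75 ...], [29.25 ...]]
print(by_bounds)  # [[ 4  4  4 15], [21 21 25 25], [26 26 26 34]]
```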
Regression:: Noisy Data
• Data can be smoothed by fitting the data to a
regression function.
• Regression analysis is used to model the relationship
between one or more independent (predictor) variables
and a dependent (response) variable (which is
continuous-valued).
• Two types of regression:
– Linear regression
» Straight-line (simple) regression
» Multiple regression
– Non-linear regression
D) Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors).
• The parameters are estimated so as to give a "best fit" of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships.
• Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.
[Figure: data points in the (x, y) plane with the fitted line y = x + 1; a point with observed value Y1 at X1 and its predicted value Y1' on the line.]
Linear regression
• In straight-line regression analysis, a random variable y (called a response variable) can be modeled as a linear function of another random variable x (called a predictor variable), with the equation
y = b + w·x ......(1)
where
b = Ȳ − w·X̄ ......(2)
w = Σ_{i=1..D} (xi − X̄)(yi − Ȳ) / Σ_{i=1..D} (xi − X̄)² ......(3)
and X̄ (the mean of x) and Ȳ (the mean of y) are computed over the D training data points.
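The least-squares formulas (2) and (3) can be computed directly. The following sketch uses NumPy on a small made-up data set (the x and y values are illustrative only):

```python
import numpy as np

# Toy data, assumed for illustration: y is roughly a linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Method of least squares, as in equations (2) and (3) above.
w = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - w * x_bar

print(f"y = {b:.3f} + {w:.3f} x")        # fitted line
print("prediction at x=6:", b + w * 6)    # use the model to predict a new value
```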
Clustering
C) Combined computer and human inspection:
• Outliers may be identified through a combination of computer and human
inspection
• Ex : in one application an information-theoretic measure was used to help
identify outlier patterns in a handwritten character database for
classification.
• The measure’s value reflected the “surprise” content of the predicted
character label with respect to the known label.
• Patterns whose surprise content is above a threshold are output to a list.
• A human can then sort through the patterns in the list to identify the actual
garbage ones.
• This is much faster than having to manually search through the entire
database.
• Outlier patterns may be informative (ex: identifying useful data exceptions,
such as different versions of the characters “0” or “7” or garbage)
• The garbage values can then be excluded from use in subsequent data
mining.
II) Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration:
– Integrate metadata from different sources
• Entity identification problem:
– Identify real-world entities from multiple data sources, e.g., Bill Clinton =
William Clinton, or A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric
vs. British units

Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple
databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Correlation Analysis (Nominal Data)

• χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population

Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)       200 (360)        450
Not like science fiction     50 (210)     1000 (840)       1050
Sum (col.)                  300           1200             1500

• χ² (chi-square) calculation (the numbers in parentheses are the expected
counts, calculated from the data distribution in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• Degrees of freedom = (no. of columns − 1)(no. of rows − 1) = 1; for 1 degree of
freedom, the χ² value needed to reject the hypothesis of independence at the
0.05 significance level is 3.84.
• Since 507.93 far exceeds this value, it shows that like_science_fiction and
play_chess are correlated in the group.
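The calculation above can be reproduced with a few lines of NumPy; the sketch below derives the expected counts from the row and column sums and then adds up the (Observed − Expected)²/Expected terms:

```python
import numpy as np

# Observed contingency table from the example above.
observed = np.array([[250.0,  200.0],
                     [ 50.0, 1000.0]])

# Expected counts under independence: (row sum * column sum) / grand total.
row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
expected = row_sums @ col_sums / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)  # [[ 90. 360.], [210. 840.]]
print(chi2)      # ~507.93
```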
Data reduction

• Warehouse may store terabytes of data: Complex


data analysis/mining may take a very long time to run
on the complete data set.

• Data reduction technique can be applied to obtain a


reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity
of the original data.
Overview of Data reduction strategies

1) Dimensionality reduction : irrelevant, weakly


relevant or redundant attributes or dimensions may
be detected and removed.
2) Data compression : encoding mechanisms are used
to reduce the data set size.
3) Numerosity reduction : the data are replaced or
estimated by alternative, smaller data representations,
such as parametric models or nonparametric
methods such as clustering, sampling and the use of
histograms.
Dimensionality reduction---Lossy Technique
Data compression---Lossy and Lossless
Numerosity reduction---Lossless
1) Dimensionality reduction
• The process of reducing the number of
random variables or attributes under
consideration.
• Dimensionality reduction methods:
A) Attribute subset selection
• irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed
B) principal components analysis PCA
• which transform or project the original data onto a
smaller space.
C) Wavelet transforms
A) Attribute subset selection
– reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
• Ex: 1. If the task is to analyze how many students will pass,
then attributes such as student name, section, and
phone number are irrelevant.
2. If classifying customers based on whether or not they
are likely to purchase a popular new CD,
attributes such as the customer’s telephone
number are likely to be irrelevant, unlike
attributes such as age or music taste.
• If done by Domain expert:
– time consuming
– Complex when data’s behavior is not well known
How can we find a ‘good’ subset of the original attributes?
• For ‘n’ attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset can therefore be
prohibitively expensive, especially as n and the number of data classes increase.
• Hence, heuristic methods that explore a reduced
search space are commonly used for attribute subset
selection.
• These methods are typically greedy: while searching through
attribute space, they always select what looks like the best attribute
at the time. They are effective in practice and often come close to
an optimal solution.
• Other approaches include:
– tests of statistical significance (selecting the best or worst attributes, typically assuming independence)
– the information gain measure used in building decision trees
for classification
Goal of attribute subset selection :
The goal of attribute subset selection is to find
a minimum set of attributes such that the
resulting probability distribution of the data
classes is as close as possible to the original
distribution obtained using all attributes.
• Heuristic methods of attribute subset
selection include the following techniques.
1. step-wise forward selection
2. step-wise backward elimination
3. combining forward selection and backward
elimination
4. decision-tree induction
1.Stepwise forward selection:
The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and
added to the reduced set. At each subsequent iteration or step, the
best of the remaining original attributes is added to the set.
Step 1: consider, the initial original data set
{A1,A2,A3,A4,A5,A6}
Step 2: Now, start with an empty set of attributes as the reduced set ‘R’.
R={ } ----> empty set
Step 3: select the best attributes from original attributes dataset and added to
the reduced set R
i) R = {A1}
ii) R= {A1,A4}
iii) R= {A1,A4,A6}
At each subsequent iteration or step, best of the remaining original
attributes is added to the set.
Step 4: There are no more attributes left to scan in the original data set,
so R is the final reduced subset

R = {A1,A4,A6}
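A sketch of greedy stepwise forward selection. The score function, the toy attribute "utilities", and the size penalty are hypothetical stand-ins for whatever evaluation measure (e.g., cross-validated accuracy or information gain) is actually used:

```python
def forward_selection(attributes, score, max_attrs=None):
    """Greedy stepwise forward selection over a list of attribute names."""
    selected = []                      # R = { } : start with an empty reduced set
    remaining = list(attributes)
    best_score = float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Try adding each remaining attribute and keep the best one.
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_attr] <= best_score:
            break                      # no remaining attribute improves the result
        best_score = candidate_scores[best_attr]
        selected.append(best_attr)
        remaining.remove(best_attr)
    return selected

# Example with a toy scoring function (hypothetical attribute utilities,
# minus a small penalty per attribute to discourage large subsets):
utility = {"A1": 0.5, "A2": 0.1, "A3": 0.15, "A4": 0.4, "A5": 0.05, "A6": 0.3}
toy_score = lambda subset: sum(utility[a] for a in subset) - 0.18 * len(subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
# -> ['A1', 'A4', 'A6'], matching the reduced set in the slide example
```

Stepwise backward elimination is the mirror image: start with all attributes and repeatedly remove the attribute whose removal hurts the score least, stopping when any further removal degrades it.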
2. Stepwise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
Step 1: consider, the initial original data set
{A1,A2,A3,A4,A5,A6}
Step 2: Now, initialize reduced set ‘R’ with original data set of attributes .
R={A1,A2,A3,A4,A5,A6} ----> original set
Step 3: select the worst attribute from initialized attributes dataset and
remove from reduced set R
i) R = {A1,A2,A3,A4,A5,A6}
ii) R = {A1,A3,A4,A5,A6} - A2 is removed
iii) R = {A1,A4,A5,A6} - A3 is removed
iv) R = {A1,A4,A6} - A5 is removed
At each subsequent iteration or step, the worst remaining attribute is removed
(e.g., the attribute whose evaluation measure falls below a chosen threshold);
the iteration stops when every remaining attribute is above the threshold.
Step 4: When no further attribute should be removed, R is the final reduced subset

R = {A1,A4,A6}
3. Combination of forward selection and
backward elimination:
The stepwise forward selection and backward
elimination methods can be combined so that,
at each step, the procedure selects the best
attribute and removes the worst from among
the remaining attributes.
Decision tree induction:

– Decision tree algorithms (e.g., ID3, C4.5, and CART) were given
for classification.
– Decision tree induction constructs a flowchart-like structure:
• where each internal (nonleaf) node denotes a test on an
attribute
• Each branch corresponds to an outcome of the test,
• each external (leaf) node denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes.
– When decision tree induction is used for attribute subset
selection, a tree is constructed from the given data.
– All attributes that do not appear in the tree are assumed
to be irrelevant.
– The set of attributes appearing in the tree form the reduced
subset of attributes.
Principal Components Analysis(PCA)
• Suppose that the data ‘D’ to be reduced
consist of ‘n’ attributes or dimensions of
tuples or data vectors described.
• Principal components analysis (PCA; also
called the Karhunen-Loeve, or K-L, method)
• PCA searches for k n-dimensional orthogonal
vectors that can best be used to represent the
data, where k ≤ n.
Contd…
• PCA “combines” the essence of attributes by
creating an alternative, smaller set of
variables.
• The initial data can then be projected onto
this smaller set.
• PCA often reveals relationships that were not
previously suspected, and thereby allows interpretations
that would not ordinarily result.
Procedure of PCA:-
1. The input data are normalized, so that each
attribute falls within the same range. This step
helps ensure that attributes with large domains
will not dominate attributes with smaller
domains.
2. PCA computes k orthonormal vectors that
provide a basis for the normalized input data.
These are unit vectors that each point in a
direction perpendicular to the others. These
vectors are referred to as the principal
components. The input data are a linear
combination of the principal components.
Contd….
3. The principal components are sorted in order
of decreasing “significance” or strength. The
principal components essentially serve as a
new set of axes for the data providing
important information about variance.
Ex: Y1 and Y2, for the given set
of data originally mapped to the
axes X1 and X2.
Contd….
4. Because the components are sorted in decreasing order of
“significance,” the data size can be reduced by eliminating
the weaker components(low variance). Using the strongest
principal components, it should be possible to reconstruct
a good approximation of the original data.
Conclusion:
• PCA can be applied to ordered and unordered attributes, and
can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can
be handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to
multiple regression and cluster analysis.
• PCA tends to be better at handling sparse data.
Procedure of PCA: (for easy understanding)
• Input data Normalized. All attributes values are mapped to the same
range.
• Compute N Orthonormal vectors called as principal components. These
are unit vectors perpendicular to each other.
• Thus input data = linear combination of principal components
• Principal Components are ordered in the decreasing order of
“Significance” or strength.
• Size of the data can be reduced by eliminating the components with less
“Significance” or the weaker components are removed.
• Thus the Strongest Principal Component can be used to reconstruct a
good approximation of the original data.
• PCA can be applied to ordered & unordered attributes, sparse and skewed
data.
• It can also be applied on multi dimensional data by reducing the same into
2 dimensional data.
• Works only for numeric data.
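A compact NumPy sketch of the procedure just described (normalize, compute orthonormal components from the covariance matrix, sort by significance, keep the k strongest); the data matrix here is randomly generated purely for illustration:

```python
import numpy as np

# Toy data matrix: rows are tuples, columns are the original attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # a nearly redundant attribute

# 1. Normalize so that no attribute dominates because of its range.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigen-decompose the covariance matrix; the eigenvectors are the
#    orthonormal principal components.
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing "significance" (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep only the k strongest components and project the data onto them.
k = 2
X_reduced = Xn @ eigvecs[:, :k]
print(X_reduced.shape)            # (100, 2)
print(eigvals / eigvals.sum())    # fraction of variance per component
```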
B) Principal Component Analysis (PCA)

• Find a projection that captures the largest amount of variation in data


• The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space

[Figure: scatter of data points in the (x1, x2) plane with the principal component directions overlaid.]
2) Data Compression: the process of
reducing the amount of data required to
represent a given quantity of information.
• Input data → encoder → compressed data → decoder → uncompressed (reconstructed) data
• Data compression types
1. Lossless compression → exact replica; no data
is lost (e.g., text compression)
2. Lossy compression → some information is lost; less
important information from the media is
removed (e.g., images, audio, video)
• Data encoding or transformations are applied so
as to obtain a reduced or “compressed”
representation of the original data.
• If the original data can be reconstructed from the
compressed data without any loss of
information, the data reduction is called lossless.
• If, instead, we can reconstruct only an
approximation of the original data, then the data
reduction is called lossy.
• Two popular and effective methods of lossy data
compression are wavelet transforms and
principal component analysis
3) Numerosity reduction
• Numerosity reduction techniques replace the original data volume by
alternative, smaller forms of data representation.
• These techniques divided into:
– parametric
– Nonparametric
• parametric methods: a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual
data. (Outliers may also be stored.)
1. Regression
2. log-linear models
• Nonparametric methods: a model for storing reduced representations of
the data.
1. histograms 3. sampling
2. clustering 4. data cube aggregation.
(OR)
• Linear Regression :
– data are modeled to fit in a straight line.
y = wx + b
– Where, y is called the “Response Variable”,
x is called the “Predictor Variable”,
and w and b are called the regression coefficients;
b is the y-intercept and w is the slope of the line.
– These regression coefficients can be solved by using “method of
least squares”.
• Multiple Regression :
– Extension of linear regression
– Response variable Y is modeled as a multidimensional vector.
• Log-Linear Models:
– Estimates the probability of each cell in a base cuboid for a set of
discretized attributes.
– In this higher order data cubes are constructed from lower
ordered data cubes.
• The following data are a list of prices of commonly
sold items at AllElectronics
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5,
8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15,
15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
28, 28, 30, 30, 30.
• To solve we need to know:
– How are the buckets determined and the attribute values
partitioned?
Step 1: arrange the data in sorted order
Step 2: count the frequency of each value in the data set
(1 occurs 2 times; 5 occurs 5 times; 8 occurs 2 times; and so on)
Step 3: apply the chosen partitioning rule for the bins or buckets.
Types of Histograms:
• The partitioning strategies:
1. Singleton buckets-
2. Equal-width
3. Equal-frequency
Contd

• Equi-Width:
In an equal-width
histogram, the width of
each bucket range is
uniform(ex: $10)

• Equi-Frequency (or
equal-depth):
In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each
bucket is constant.
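A short NumPy sketch building both histogram types for the price list above; the three $10-wide buckets follow the slide, while using quantiles to obtain equal-frequency boundaries is an implementation choice of this sketch:

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                   20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# Equal-width histogram: bucket ranges of uniform width ($10 each).
counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
print(edges)     # bucket boundaries: 1-10, 11-20, 21-30
print(counts)    # number of prices falling in each bucket

# Equal-frequency (equi-depth) histogram: roughly the same number of
# values per bucket; boundaries taken from quantiles of the data.
edges_eq = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
print(edges_eq)
```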
Sampling
• Sampling helps in taking large data set into
smaller random data samples(subsets)
• Let D be the data set with N tuples and S be the
sample selected from D
• Types of sampling :
– Non-probability Sampling
– Probability sampling
1. Simple random sample without replacement(SRSWOR)
2. Simple random sample with replacement (SRSWR)
3. Stratified sampling(statistical)
4. Cluster sampling (grouping)
1. Simple random sample without replacement (SRSWOR)

Once an object is selected, it is removed from the population

• S is obtained by drawing tuples from the data set D,
• where N is the number of tuples in D and the sample size is smaller than N.
• The probability of drawing any tuple in D is 1/N, that is,
all tuples are equally likely to be sampled.
• Ex: 1. coin tossed
2. dice rolled ( 6 face 3 chance) sampling
2. Simple random sample with replacement (SRSWR)
A selected object is not removed from the population

• This is similar to SRSWOR, except that each time a tuple


is drawn from D, it is recorded and then replaced. That is,
after a tuple is drawn, it is placed back in D so that it may
be drawn again.
• Example of SRSWOR and SRSWR
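Since the example figure is not reproduced here, the following NumPy sketch illustrates both schemes on a toy data set of 100 tuple IDs:

```python
import numpy as np

rng = np.random.default_rng(42)
D = np.arange(1, 101)          # a data set D with N = 100 tuples (toy IDs)
s = 10                         # desired sample size

# SRSWOR: once a tuple is selected it cannot be drawn again.
srswor = rng.choice(D, size=s, replace=False)

# SRSWR: a drawn tuple is placed back, so it may appear more than once.
srswr = rng.choice(D, size=s, replace=True)

print(srswor)
print(srswr)
```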
3. Stratified Sampling (SS)
• If D is divided into mutually disjoint parts called strata, a
stratified sample of D is generated by obtaining an SRS at each
stratum. This helps ensure a representative sample, especially
when the data are skewed.
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
4. Cluster sample(CS)

• If the tuples in D are grouped into M mutually disjoint “clusters,”


then an SRS of s clusters can be obtained, where s < M.
• For example, tuples in a database are usually retrieved a page at a time, so
each page can be considered a cluster. A reduced data representation can then be
obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.
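A brief pandas sketch of stratified sampling (GroupBy.sample requires a recent pandas); the customer data and strata are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "customer_id": np.arange(1000),
    "age_group":   rng.choice(["youth", "middle_aged", "senior"],
                              size=1000, p=[0.2, 0.5, 0.3]),
})

# Stratified sample: draw the same fraction (10%) from every stratum,
# so that even small strata remain represented.
stratified = df.groupby("age_group", group_keys=False).sample(frac=0.10, random_state=0)
print(stratified["age_group"].value_counts())
```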
Sampling: With or without Replacement

[Figure: a raw data set with two sampling paths, SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement), each producing a sample of the raw data.]
Sampling: Cluster or Stratified Sampling

[Figure: raw data on the left and the corresponding cluster/stratified sample on the right.]
Data Transformation
• Data are transformed or consolidated so that
the resulting mining process may be more
efficient, and the patterns found may be
easier to understand
• What is this new form? In other words, what should the
transformed data look like?
• Example: kilograms → grams (a smaller unit, but larger values to represent the same data)
Data Transformation Strategies
• Data are transformed or consolidated into forms
appropriate for mining is called data transformation
• Strategies for data transformation include the following:
1.Smoothing: which works to remove noise from the data.
Techniques for smoothing are
1.binning
2.regression
3.clustering.
2.Attribute construction (or feature construction): where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3. Aggregation
• summary or aggregation operations are applied to
the data.
• Example: daily sales data may be aggregated so as to
compute monthly and annual total amounts. This
step is typically used in constructing a data cube for
data analysis at multiple abstraction levels.
Multidimensional aggregation example
4. Normalization
– In this attribute data are scaled so as to fall within a
smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
– When multiple attributes are having values on
different scales, this may lead to poor data models
for data mining operations. So, they are normalized
to bring all the attributes on same scale.
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
5. Discretization
• The raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult,
senior). The labels, in turn, can be recursively organized into higher-level
concepts, resulting in a concept hierarchy for the numeric attribute.
• Example: the attribute ‘Price’ can be discretized into intervals ($X ... $Y]. More than one
concept hierarchy can be defined for the same attribute to accommodate
the needs of various users.
6. Concept hierarchy generation for nominal
data
• where attributes such as street can be generalized to
higher-level concepts, like city or country
Many hierarchies for nominal attributes
are implicit within the database schema
and can be automatically defined at the
schema definition level.
Data Transformation by Normalization
• The measurement unit used can affect the data
analysis.
• For example, changing measurement units from
meters to inches for height, or from kilograms to
pounds for weight, may lead to very different results
• Why normalization?
– expressing an attribute in smaller units will lead to a
larger range for that attribute, So, to make the attribute
choice free measurement units, the data should be
normalized or standardized.
• Ex: give a common range representation like
[-1,1] or [0.0,1.0]
• Normalization is particularly used for
classification algorithms in the neural
networks or nearest neighbour classification
and many distance measures techniques
Different types of Normalization
1. Min-Max normalization
2. Z-score normalization
3. Decimal Scaling
To implement these techniques, let us consider
a numeric attribute ‘A’ from the data set D,
with n observed values
V = {v1, v2, ..., vn}, where vi denotes each value in V.
1. Min-max normalization
• Min-max normalization performs a linear transformation on
the original data, e.g., onto [0.0, 1.0] or [−1.0, 1.0].
• Suppose that for attribute A with values {v1, v2, ..., vn},
– minA is the minimum value in the list
– maxA is the maximum value in the list
• Then min-max normalization maps a value vi of A to vi′ in
the range [new_minA, new_maxA] by computing
vi′ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Example:: Min-Max Normalization
Q. The minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. By min-max
normalization, transform a value of $73,600 for income to map
income to the range [0.0,1.0].
A. By min-max normalization, a value of $73,600 for income is
transformed to:
v′ = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
So income $73,600 is mapped to 0.716 in the range [0.0, 1.0].
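The same mapping as a tiny Python function, reproducing the $73,600 → 0.716 result:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income example from the slide: $73,600 with min $12,000 and max $98,000.
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))   # 0.716
```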


2. Z-score Normalization
• Also referred as “zero-mean normalization”
• values for an attribute, A are normalized based
on the mean (i.e., average) and standard
deviation of A.
• A value, vi, of A is normalized to vi′ by
computing
vi′ = (vi − Ā) / σA
where Ā and σA are the mean and standard
deviation, respectively, of attribute A.
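A small NumPy sketch of z-score normalization on made-up income values:

```python
import numpy as np

# Hypothetical income values (in $1000s).
income = np.array([54.0, 16.0, 73.6, 98.0, 12.0, 60.0])

mean, std = income.mean(), income.std()
z_scores = (income - mean) / std      # result has zero mean, unit standard deviation
print(z_scores.round(3))
```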
3. Decimal Scaling
• Normalization by moving the decimal point of
the values of attribute A.
• The number of decimal places moved depends
on the maximum absolute value of A:
vi′ = vi / 10^j, where j is the smallest integer such that max(|vi′|) < 1.
Example:: Decimal scaling
Q. By using Decimal scaling find normalized value for
the recorded values of A range from −986 to 917.
A. The maximum absolute value of A is 986.
Divide each value by 1000 (i.e., j = 3)
hence,
−986/1000 normalizes to −0.986 and
917/1000 normalizes to 0.917.
therefore, range from −986 to 917 is normalized to
-0.986 to 0.917.
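A short sketch of decimal scaling that reproduces the −986/917 example; it assumes the maximum absolute value is not an exact power of ten:

```python
import numpy as np

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    values = np.asarray(values, dtype=float)
    # Smallest j so that the largest magnitude drops below 1
    # (assumes the maximum is not an exact power of ten).
    j = int(np.ceil(np.log10(np.abs(values).max())))
    return values / (10 ** j), j

normalized, j = decimal_scaling([-986, 917])
print(j)            # 3
print(normalized)   # [-0.986  0.917]
```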
