Data Preprocessing Guide for Analysis
Xuan–Hieu Phan
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 1 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 2 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 3 / 103
Why data preprocessing?
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 4 / 103
Major tasks in data preprocessing
Data cleaning:
“clean” the data by filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
Data integration:
merge data smoothly from multiple sources, e.g., databases, data cubes, or files, into a
coherent data store such as a data warehouse.
Data reduction:
reduce data in different ways, e.g., dimensionality reduction, removing irrelevant
variables/attributes, data reduction using sampling, etc.
Data transformation:
perform data type conversion, discretization, data smoothing, data scaling and
normalization, etc.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 5 / 103
Major tasks in data preprocessing (cont’d)
Major tasks of data preprocessing [1]
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 6 / 103
Cross–industry standard process for data mining
1 Business understanding
2 Data understanding
3 Data preparation
4 Modeling
5 Evaluation
6 Deployment
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 7 / 103
CRISP–DM process
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 8 / 103
CRISP–DM phase 1: Business understanding
This phase focuses on understanding the objectives and requirements of the project. It
includes four tasks:
1 Determine business objectives:
thoroughly understand, from a business perspective, what the customer/company
really wants to accomplish, and then define business success criteria.
2 Assess situation:
determine resource availability and project requirements, assess risks and contingencies,
and conduct a cost–benefit analysis.
3 Determine data mining goals:
in addition to defining the business objectives, you should also define what success
looks like from a technical data mining perspective.
4 Produce project plan:
select technologies and tools and define detailed plans for each project phase.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 9 / 103
CRISP–DM phase 2: Data understanding
This phase drives the focus to identify, collect, and analyze the data sets that can help
you accomplish the project goals. This phase also has four tasks:
1 Collect initial data:
acquire the necessary data and (if necessary) load it into your analysis tool.
2 Describe data:
examine the data and document its surface properties like data format, number of
records, or field identities.
3 Explore data:
dig deeper into the data. Query it, visualize it, and identify relationships among the
data.
4 Verify data quality:
how clean/dirty is the data? Document any quality issues.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 10 / 103
CRISP–DM phase 3: Data preparation
This phase prepares the final data for modeling. It has five tasks:
1 Select data:
determine which data sets will be used and document reasons for inclusion/exclusion.
2 Clean data:
often this is the lengthiest task. Without it, you will likely fall victim to garbage-in,
garbage-out. A common practice during this task is to correct, impute, or remove
erroneous values.
3 Construct data:
derive new attributes that will be helpful. For example, derive someone’s body mass
index from height and weight fields.
4 Integrate data:
create new data sets by combining data from multiple sources.
5 Format data:
re-format data as necessary. For example, you might convert string values that store
numbers to numeric values so that you can perform mathematical operations.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 11 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 12 / 103
Data cleaning
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 13 / 103
Handling missing values
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 14 / 103
Ignore tuples with missing values
This is usually done when the class label is missing (assuming the mining task
involves classification).
This method is not very effective, unless the tuple contains several attributes with
missing values.
It is especially poor when the percentage of missing values per attribute varies
considerably.
By ignoring tuples, we do not make use of the remaining attributes’ values in the
tuples.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 15 / 103
Fill in the missing values manually
In general, this approach is time consuming and may not be feasible given a large
data set with many missing values.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 16 / 103
Use a global constant to fill in the missing values
Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞.
If missing values are replaced by, say, “Unknown” then the mining program may
mistakenly think that they form an interesting concept, since they all have a value
in common – that of “Unknown”.
Hence, although this method is simple, it is not reliable.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 17 / 103
Use a measure of central tendency for the attribute
Use a measure of central tendency for the attribute (e.g., the mean or median)
to fill in the missing values.
For normal (symmetric) data distributions, the mean can be used, while skewed
data distributions should employ the median.
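A minimal pandas sketch of this idea (the column names and values below are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical data with missing entries
df = pd.DataFrame({"age": [23, 35, np.nan, 41, 29],
                   "income": [1200, np.nan, 800, 15000, np.nan]})

# Roughly symmetric attribute: fill with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Skewed attribute (a few very large incomes): fill with the median
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```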
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 18 / 103
Class–based filling
Use the attribute mean or median for all samples belonging to the same class as
the given tuple.
For example, if classifying customers according to credit_risk, we may replace the
missing value with the mean income value for customers in the same credit risk
category as that of the given tuple.
If the data distribution for a given class is skewed, the median value is a better
choice.
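A pandas sketch of class-based filling, assuming a hypothetical credit_risk label:

```python
import pandas as pd
import numpy as np

# Hypothetical customer table with a credit_risk class label
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "high"],
    "income":      [5200, np.nan, 900, 1100, np.nan],
})

# Fill missing income with the median income of the same credit_risk class
df["income"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.median()))
print(df)
```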
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 19 / 103
Use the most probable value to fill in the missing values
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 20 / 103
Missing values may not be errors in data
It is important to note that, in some cases, a missing value may not imply an
error in the data!
For example, when applying for a credit card, candidates may be asked to supply
their driver’s license number. Candidates who do not have a driver’s license
may naturally leave this field blank.
Ideally, each attribute should have one or more rules regarding the null condition.
The rules may specify whether or not nulls are allowed and/or how such values
should be handled or transformed.
Fields may also be intentionally left blank if they are to be provided in a later
step of the business process. Hence, although we can try our best to clean the data
after it is seized, good database and data entry procedure design should help
minimize the number of missing values or errors in the first place.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 21 / 103
Handling incorrect and inconsistent values
The key methods that are used for removing or correcting the incorrect and inconsistent
entries are as follows:
Inconsistency detection
Domain knowledge
Data-centric methods
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 22 / 103
Inconsistency detection
This is typically done when the data is available from different sources in
different formats.
For example, a person’s name may be spelled out in full in one source, whereas
the other source may only contain the initials and a last name.
In such cases, the key issues are duplicate detection and inconsistency
detection.
These topics are studied under the general umbrella of data integration within the
database field.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 23 / 103
Domain knowledge
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 24 / 103
Data–centric methods
In these cases, the statistical behavior of the data is used to detect outliers. For
example, the two isolated data points in the above figure marked as “noise” are
outliers. These isolated points might have arisen because of errors in the data
collection process.
However, this may not always be the case because the anomalies may be the
result of interesting behavior of the underlying system. Therefore, any detected
outlier may need to be manually examined before it is discarded. The use of
data–centric methods for cleaning can sometimes be dangerous because they can
result in the removal of useful knowledge from the underlying system.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 25 / 103
Handling noisy data
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 26 / 103
Handling noisy data: binning
Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it.
The sorted values are put into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local
smoothing.
Example (next slide): in smoothing by bin means, each value in a bin is replaced
by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in
Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given
bin are identified as the bin boundaries. Each bin value is then replaced by the
closest boundary value. In general, the larger the width, the greater the effect of the
smoothing.
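A NumPy sketch of smoothing by bin means and by bin boundaries; Bin 1 = {4, 8, 15} as in the slide's example, the remaining values are assumed for illustration:

```python
import numpy as np

# Sorted values, partitioned into equi-depth bins of 3 values each
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = prices.reshape(-1, 3)

# Smoothing by bin means: replace every value by the mean of its bin
by_means = np.repeat(bins.mean(axis=1), 3)        # 9, 9, 9, 22, 22, 22, ...

# Smoothing by bin boundaries: snap each value to the closer of min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)
print(by_bounds)                                  # 4, 4, 15, 21, 21, 24, 25, 25, 34
```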
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 27 / 103
Handling noisy data: binning example
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 28 / 103
Handling noisy data: regression
Data smoothing can also be done by regression, a technique that conforms data
values to a function.
Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 29 / 103
Handling noisy data: outlier analysis
Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be considered outliers.
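A minimal scikit-learn sketch of this idea using DBSCAN, whose label −1 marks points that fall outside every cluster (the data below are hypothetical):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense clusters plus two isolated points
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               [[2.5, 9.0], [-4.0, 6.0]]])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])   # indices of the potential outliers
```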
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 30 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 31 / 103
Data integration
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 32 / 103
Entity identification problem
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 33 / 103
Redundancy and correlation analysis
Inconsistent attribute naming can also cause redundancies in the resulting data.
Some redundancies can be detected by correlation analysis.
Correlation analysis can measure how strongly one attribute implies the other.
Correlation analysis for nominal/categorical data:
use the χ2 (chi–square) test of independence (previous lecture: Data understanding).
Correlation analysis for numeric attributes:
use covariance or correlation coefficient, e.g., Pearson correlation (previous lecture).
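A short SciPy sketch of both tests; the 2×2 counts and the numeric attributes below are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Nominal attributes: chi-square test of independence on a contingency table
table = np.array([[250, 200],
                  [50, 1000]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)            # a tiny p-value suggests the attributes are correlated

# Numeric attributes: Pearson correlation coefficient
x = np.array([12, 14, 18, 23, 27, 28, 34, 37, 39, 40], dtype=float)
y = 2.0 * x + np.random.default_rng(0).normal(0, 3, size=x.size)
r, p = pearsonr(x, y)
print(r)                  # close to 1 -> one attribute largely implies the other
```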
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 34 / 103
Tuple duplication
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 35 / 103
Data value conflict detection and resolution
Data integration also involves the detection and resolution of data value conflicts.
For example, for the same real–world entity, attribute values from different sources
may differ.
This may be due to differences in representation, scaling, or encoding. For instance,
a weight attribute may be stored in metric units in one system and British imperial
units in another. For a hotel chain, the price of rooms in different cities may involve
not only different currencies but also different services (e.g., free breakfast) and
taxes.
When exchanging information between schools, for example, each school may have
its own curriculum and grading scheme. One university may adopt a quarter system,
offer three courses on database systems, and assign grades from A+ to F, whereas
another may adopt a semester system, offer two courses on databases, and assign
grades from 1 to 10.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 36 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 37 / 103
Data transformation
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 38 / 103
Data conversion and discretization
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 39 / 103
Numeric to categorical data: discretization
Discretization is the task that divides the value range of a numeric attribute into k
sub–ranges.
Then, the attribute is treated as containing k different categorical values, labeled from
1 to k, depending on the sub–range in which the original value lies.
For example, consider the age attribute. One could create sub–ranges [0, 10], [11, 20],
[21, 30], and so on. The symbolic value for any record in the sub–range [11, 20] is “2”
and the symbolic value for a record in the sub–range [21, 30] is “3”. Because these
are symbolic values, no ordering is assumed between the values “2” and “3”.
Variations within a sub–range are not distinguishable after discretization. Thus, the
discretization process does lose some information for the mining process.
However, for some applications, this loss of information is not a big problem.
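A pandas sketch of the age example above; pd.cut is one possible way to implement the sub–ranges (the ages below are hypothetical):

```python
import pandas as pd

# Hypothetical ages; sub-ranges (0, 10], (10, 20], (20, 30], ... labeled "1", "2", ...
ages = pd.Series([4, 15, 18, 25, 33, 47, 62])
edges = list(range(0, 80, 10))
labels = [str(i) for i in range(1, len(edges))]
age_cat = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(age_cat.tolist())   # ['1', '2', '2', '3', '4', '5', '7']
```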
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 40 / 103
Discretization challenges and techniques
One challenge with discretization is that the data may be non–uniformly distributed
across the different intervals.
For example, for the salary attribute, a large subset of the population may be
grouped in the [40,000, 80,000] sub–range, but very few will be grouped in the
[1,040,000, 1,080,000] sub–range (both sub–ranges have the same size).
Thus, the use of sub–ranges of equal size may not be very helpful in discriminating
between different data segments. Many attributes, such as age, are not as
non–uniformly distributed, and therefore ranges of equal size may work reasonably
well.
The discretization process can be performed in a variety of ways depending on
application–specific goals:
Equi–width sub–ranges
Equi–log sub–ranges
Equi–depth sub–ranges
Clustering–based discretization
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 41 / 103
Discretization: equi–width sub–ranges
In this case, each sub–range [a, b] is chosen in such a way that b − a is the same for
each sub–range.
This approach has the drawback that it will not work for data sets that are
distributed non–uniformly across the different sub–ranges.
To determine the actual values of the sub–ranges, the minimum and maximum
values of each attribute are determined. This range [min, max] is then divided into
k sub–ranges of equal length.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 42 / 103
Discretization: equi–log sub–ranges
Each sub–range [a, b] is chosen in such a way that log(b) − log(a) has the same
value.
This kind of sub–range selection has the effect of geometrically increasing
sub–ranges [a, aα], [aα, aα²], [aα², aα³], and so on, for some α > 1.
This kind of sub–range may be useful when the attribute shows an exponential
distribution across a range.
In fact, if the attribute frequency distribution for an attribute can be modeled in
functional form, then a natural approach would be to select sub–ranges [a, b] such
that f (b) − f (a) is the same for some function f (·).
The idea is to select this function f (·) in such a way that each sub–range contains
an approximately similar number of records. However, in most cases, it is hard to
find such a function f (·) in closed form.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 43 / 103
Discretization: equi–depth sub–ranges
In this case, the sub–ranges are selected so that each sub–range has an equal
number of records.
The idea is to provide the same level of granularity to each sub–range.
An attribute can be divided into equi–depth sub–ranges by first sorting it, and then
selecting the division points on the sorted attribute value, such that each sub–range
contains an equal number of records.
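A NumPy sketch contrasting the three range-based schemes above on a hypothetical skewed attribute:

```python
import numpy as np

rng = np.random.default_rng(0)
salary = rng.lognormal(mean=10.5, sigma=0.6, size=1000)   # skewed, hypothetical
k = 5

# Equi-width: k sub-ranges of equal length over [min, max]
width_edges = np.linspace(salary.min(), salary.max(), k + 1)

# Equi-log: log(b) - log(a) constant, i.e. geometrically growing sub-ranges
log_edges = np.geomspace(salary.min(), salary.max(), k + 1)

# Equi-depth: edges at quantiles, so each sub-range holds about the same count
depth_edges = np.quantile(salary, np.linspace(0, 1, k + 1))

# Count how many records fall into each sub-range under each scheme
counts = [np.bincount(np.digitize(salary, e[1:-1]), minlength=k)
          for e in (width_edges, log_edges, depth_edges)]
print(counts)   # the equi-depth counts are all close to 200
```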
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 44 / 103
Clustering–based discretization
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 45 / 103
Categorical to numeric data: binarization
In some cases, it is desirable to use numeric data mining algorithms on categorical
data.
Because binary data is a special form of both numeric and categorical data, it is
possible to convert the categorical attributes to binary form and then use numeric
algorithms on the binarized data.
If a categorical attribute has k different values, then k different binary attributes are
created. Each binary attribute corresponds to one possible value of the categorical
attribute.
Therefore, exactly one of the k attributes takes on the value of 1, and the remaining
take on the value of 0.
ID Male Female IsStudent IsEmployee IsRetired
001 1 0 1 0 0
002 0 1 1 0 0
003 1 0 0 1 0
004 1 0 0 0 1
005 0 1 0 1 0
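A pandas sketch of this binarization (one-hot encoding); the attribute names below are hypothetical:

```python
import pandas as pd

# Hypothetical categorical attributes similar to the table above
df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"],
                   "Role":   ["Student", "Student", "Employee", "Retired", "Employee"]})

# Each categorical attribute with k values becomes k binary (0/1) attributes,
# and exactly one of them is 1 per record
binarized = pd.get_dummies(df, columns=["Gender", "Role"], dtype=int)
print(binarized)
```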
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 46 / 103
Text to numeric data
Vector space model (a.k.a. bag of words): converting documents, paragraphs,
sentences into sparse vectors. The vector dimension is the size of vocabulary of the
text collection. Weights in vectors can be binary, raw frequency, normalized
frequency, or TF–IDF.
N–gram model: this is a special case of the vector space model where consecutive words
or tokens can be combined to form n–grams (uni–grams, bi–grams, tri–grams, etc.); see
the sketch after this list.
Latent semantic and topics models: transforming documents, paragraphs, sentences
into dense semantic/topic vectors. The dimension is much smaller, e.g., 50, 100, 200,
300. Well–known models are LSA, LDA, and various topic models.
Text embeddings: using deep representation learning models to transform texts
(words, sentences, paragraphs, documents) into embeddings (dense) vectors.
Well–known techniques are Word2Vec, GloVe, Doc2Vec, etc.
Popular libraries: NLTK, Gensim, CoreNLP, and spaCy. For Vietnamese NLP:
VLSP libraries, VnCoreNLP, PhoBERT, and many more from
https://github.com/topics/vietnamese-nlp.
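A minimal scikit-learn sketch of the bag-of-words/TF–IDF conversion with uni–grams and bi–grams (the documents below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data preprocessing cleans the data",
        "sampling reduces the data set",
        "topic models give dense vectors"]

# Vector space model with uni-grams and bi-grams, weighted by TF-IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(X.shape)                            # (3, vocabulary size)
print(vectorizer.get_feature_names_out()[:5])
```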
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 47 / 103
Time–series to discrete sequence data
Converting time–series data to discrete sequence using symbolic aggregate approximation
(SAX). This method comprises two steps:
Window–based averaging: the series is divided into windows of length w, and the
average time–series value over each window is computed.
Value–based discretization: the averaged time–series values are discretized into a
smaller number of equi–depth intervals.
The idea is to ensure that each symbol has an approximately equal frequency in the
time–series. This is identical to the equi–depth discretization of numeric attributes
that was discussed earlier.
The interval boundaries are constructed by assuming that the (windowed) time–series
values follow a Gaussian distribution. The mean and standard deviation of the
windowed values are estimated in a data–driven manner.
The quantiles of the Gaussian distribution are used to determine the boundaries of the
intervals. This is more efficient than sorting all the data values to determine quantiles,
and it may be a more practical approach for a long time–series. The number of
intervals is normally from 3 to 10.
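A short NumPy/SciPy sketch of SAX under these assumptions; the window length and number of symbols below are chosen arbitrarily:

```python
import numpy as np
from scipy.stats import norm

def sax_transform(series, window=8, n_symbols=4):
    """Symbolic Aggregate approXimation (SAX), as a sketch."""
    x = np.asarray(series, dtype=float)
    # z-normalize so the Gaussian breakpoints below are meaningful
    x = (x - x.mean()) / x.std()
    # Step 1: window-based averaging (drop the ragged tail)
    n_windows = len(x) // window
    paa = x[: n_windows * window].reshape(n_windows, window).mean(axis=1)
    # Step 2: breakpoints at the quantiles of N(0, 1), so that each of the
    # n_symbols symbols is approximately equally frequent
    breakpoints = norm.ppf(np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(paa, breakpoints)        # symbol indices 0 .. n_symbols-1

# Hypothetical example: a noisy sine wave discretized into 4 symbols
t = np.linspace(0, 4 * np.pi, 64)
print(sax_transform(np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=64)))
```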
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 48 / 103
Data smoothing
Smoothing, which works to remove noise from data. Techniques include binning,
regression, and clustering.
Data smoothing has been presented in the Data cleaning section (subsection:
Handling noisy data).
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 49 / 103
Data aggregation
Aggregation, where summary or aggregation operations are applied to the data.
This step is typically used in constructing a data cube for data analysis at multiple
levels of abstraction.
A data cube is a common concept in data warehousing that represents a
multidimensional dataset where each attribute may have a concept hierarchy. For
example, the attribute Time can be at different levels of granularity, like {year >
quarter > month > week > day > hour}.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 50 / 103
Construction of attributes
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 51 / 103
Data scaling and normalization
In many cases, the different features represent different scales of reference and may
therefore not be comparable to one another.
For example, an attribute such as age is drawn on a very different scale than an
attribute such as salary. The latter attribute is typically orders of magnitude larger
than the former. As a result, any aggregate function computed on the different
features (e.g., Euclidean distances) will be dominated by the attribute of larger
magnitude.
To address this problem, values of attributes are usually re–scaled or normalized to
a more suitable range.
There are three main standardization techniques:
Min-max normalization
Standard score normalization
Decimal scaling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 52 / 103
Min–max normalization
Let D = {x_1, x_2, . . . , x_n} be a univariate data sample consisting of n
observations/values drawn from a variable/attribute X. In min–max normalization,
each value is scaled as follows:

x′_i = ((x_i − min_X) / (max_X − min_X)) · (max_X^new − min_X^new) + min_X^new        (1)

where:
min_X and max_X are the minimum and the maximum values in D (i.e., before
normalization), respectively.
min_X^new and max_X^new are the minimum and the maximum values of the data attribute
after normalization, respectively.
After transformation, all the new values x′_i (i = 1..n) lie in [min_X^new, max_X^new].
Normally, min_X^new = 0 and max_X^new = 1, so the new range is [0, 1].
Min–max normalization is also called range normalization. This normalization
technique can also be applied to any numeric data attributes in bivariate and
multivariate data.
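A NumPy sketch of Eq. (1), together with the z–score transform used on the next slides (the function names are ours, not from the slides):

```python
import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    """Range (min-max) normalization, Eq. (1)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Standard score normalization: (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

print(min_max([12, 18, 40]))        # [0.0, 0.214..., 1.0]
```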
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 53 / 103
Standard score normalization
In standard score (z–score) normalization, each value is transformed as
x′_i = (x_i − µ̂) / σ̂, where µ̂ and σ̂ are the sample mean and standard deviation
of the attribute.
After transformation, the new attribute has mean µ̂′ = 0 and standard deviation σ̂′ = 1.
The vast majority of the normalized values will typically lie in the range [−3, 3]
under the normal distribution assumption.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 54 / 103
Decimal scaling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 55 / 103
Bivariate data sample
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 56 / 103
Min–max normalization example
max_Age = 40 and min_Age = 12. For example, x_31 = 18; after normalization,
x′_31 = (18 − 12) / (40 − 12) = 0.214.
Similarly, max_Income = 6000 and min_Income = 300.
After normalization, we have the new data sample D′ with two new attributes Age′
and Income′ as follows:
xi      Age′ (X′_1)    Income′ (X′_2)
x1      0.0            0.0
x2      0.071          0.035
x3      0.214          0.123
x4      0.393          0.298
x5      0.536          0.561
x6      0.571          0.649
x7      0.786          0.702
x8      0.893          1.0
x9      0.964          0.386
x10     1.0            0.421
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 57 / 103
Standard score normalization example
For z–normalization, we first compute the mean and standard deviation of both
attributes:
µ̂Age = 27.2 and σ̂Age = 9.77
µ̂Income = 2680 and σ̂Income = 1726.15
The new data after z–normalization is as follows:
xi      Age′ (X′_1)    Income′ (X′_2)
x1      −1.56          −1.38
x2      −1.35          −1.26
x3      −0.94          −0.97
x4      −0.43          −0.39
x5      −0.02          0.48
x6      0.08           0.76
x7      0.70           0.94
x8      1.0            1.92
x9      1.21           −0.10
x10     1.31           0.01
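A short sketch reproducing the Age′ columns of the two examples; the raw Age values are inferred from the slide's statistics (min 12, max 40, mean 27.2, std 9.77) and are therefore an assumption:

```python
import numpy as np

age = np.array([12, 14, 18, 23, 27, 28, 34, 37, 39, 40], dtype=float)

mm = (age - age.min()) / (age.max() - age.min())   # min-max, Eq. (1)
z = (age - age.mean()) / age.std()                 # z-score

print(np.round(mm, 3))   # 0.0, 0.071, 0.214, 0.393, ... as in the first table
print(np.round(z, 2))    # -1.56, -1.35, -0.94, ... as in the second table
```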
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 58 / 103
Distance between points before and after normalization
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 59 / 103
Concept hierarchy generation for nominal data
Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher–level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at
the schema definition level.
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 61 / 103
Overview of data reduction
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 62 / 103
Data reduction methods
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 63 / 103
Principal component analysis (PCA)
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 64 / 103
Steps of PCA
PCA: Y1 and Y2 are the first two principal components for the given data. [1]
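The slide body is a figure; as a complement, a minimal scikit-learn sketch of PCA on hypothetical 2-D data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=200)

pca = PCA(n_components=2)
Y = pca.fit_transform(X)               # Y[:, 0], Y[:, 1]: the principal components

print(pca.explained_variance_ratio_)   # share of variance captured by Y1 and Y2
```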
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 66 / 103
Attribute subset selection
Data sets for analysis may contain hundreds of attributes, many of which may be
irrelevant to the mining task or redundant.
For example, if the task is to classify customers based on whether or not they are
likely to purchase a popular new CD at AllElectronics when notified of a sale,
attributes such as the customer’s telephone_number are likely to be irrelevant,
unlike attributes such as age or music_taste.
Although it may be possible for a domain expert to pick out some of the useful
attributes, this can be a difficult and time–consuming task, especially when the
data’s behavior is not well known.
Leaving out relevant attributes or keeping irrelevant attributes may be detrimental,
causing confusion for the mining algorithm employed. This can result in discovered
patterns of poor quality.
In addition, the added volume of irrelevant or redundant attributes can slow down
the mining process.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 67 / 103
Attribute subset selection (cont’d)
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of attributes such
that the resulting probability distribution of the data classes is as close as possible
to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: It reduces the
number of attributes appearing in the discovered patterns, helping to make the
patterns easier to understand.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 68 / 103
How to find a “good” subset of the original attributes?
1 Stepwise forward selection: the procedure starts with an empty reduced set of
attributes. The best of the original attributes is determined and added to the
reduced set, and similarly for the second, third steps, etc.
2 Stepwise backward elimination: the procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
3 Combination of forward selection and backward elimination: the stepwise
forward selection and backward elimination methods can be combined in each step.
4 Decision tree induction: decision tree algorithms (e.g., ID3, C4.5, and CART)
were originally intended for classification. When used for attribute selection, a tree is
built from the data; attributes that appear in the tree are assumed to be relevant, and
those that do not appear are discarded.
The stopping criteria for the methods may vary. The procedure may employ a threshold
on the measure used to determine when to stop the attribute selection process.
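A minimal scikit-learn sketch of stepwise forward selection (backward elimination only differs in the direction argument); the data and the base model below are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 20 attributes, only a few of them informative
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=4, random_state=0)

# Greedily add the attribute that most improves cross-validated accuracy,
# until 4 attributes are kept; direction="backward" gives backward elimination
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=4,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the selected attributes
```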
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 70 / 103
Histograms
Histograms use binning to approximate data distributions and are a popular form of
data reduction.
A histogram for an attribute, X, partitions the data distribution of X into disjoint
subsets, referred to as buckets or bins.
If each bucket represents only a single attribute–value/frequency pair, the buckets
are called singleton buckets. Often, buckets instead represent continuous ranges for
the given attribute.
How are the buckets determined and the attribute values partitioned? There are
several partitioning ways:
Equal–width: in an equal–width histogram, the width of each bucket range is
uniform.
Equal–frequency (or equal–depth): in an equal–frequency histogram, each bucket
contains roughly the same number of contiguous data samples.
Histograms are highly effective at approximating both sparse and dense data, as well
as highly skewed and uniform data. The histograms can be extended for multiple
attributes.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 71 / 103
Example of histograms
The following data are a list of AllElectronics prices for commonly sold items [1]. The
numbers have been sorted: {1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30}.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 72 / 103
Clustering
Clustering techniques partition data objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to objects in other
clusters.
Similarity is commonly defined in terms of how “close” the objects are in space,
based on a distance function.
In data reduction, the cluster representations of the data are used to replace the
actual data. The effectiveness of this technique depends on the data’s nature. It is
much more effective for data that can be organized into distinct clusters than for
smeared data.
Different clustering approaches and techniques will be given later in lecture Data
clustering.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 73 / 103
Data sampling
Data sampling can be used as a data reduction method because it allows a large data set
to be represented by a much smaller random/non–random data sample (a subset). There
are different sampling techniques and they are classified into two main types:
Random sampling:
Simple random sampling (with and without replacement)
Stratified sampling
Cluster sampling
Systematic sampling
Multi–stage sampling
Non–random sampling:
Convenience sampling
Judgement sampling
Quota sampling
Snowball sampling
Sampling techniques will be presented in detail in the next section.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 74 / 103
Data cube aggregation
A data cube is a common concept in data warehousing: it represents a
multidimensional dataset where each attribute may have a concept hierarchy. For
example, the attribute Time can be at different levels of granularity, like {year >
quarter > month > week > day > hour}.
Drill–down allows viewing the data in more detail, while roll–up provides a more
general, summarized view.
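A minimal pandas sketch of roll-up style aggregation; the table and hierarchy levels below are hypothetical:

```python
import pandas as pd

# Hypothetical sales records at the (year, month, city) level of granularity
sales = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2024, 2024, 2024],
    "month":  [1, 1, 2, 1, 2, 2],
    "city":   ["Hanoi", "Hue", "Hanoi", "Hue", "Hanoi", "Hue"],
    "amount": [100, 80, 120, 90, 130, 70],
})

# Roll-up along the Time hierarchy: month -> year
print(sales.groupby(["year", "city"])["amount"].sum())

# A simple 2-D slice of the cube: year x city totals
print(sales.pivot_table(values="amount", index="year", columns="city", aggfunc="sum"))
```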
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 75 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 76 / 103
Data sampling
Data sampling is the process of selecting a subset of data observations (data objects,
elements, instances, points), called a data sample, from a large population of data.
The resulting data sample normally consists of data points that are good
representatives of the population, reflecting the nature and characteristics of the
population properly.
Data sampling is critical because most data analysis tasks will be performed on a
data sample rather than on the whole population.
Data sampling can be seen as a data reduction technique because it allows a large
data set to be represented by a much smaller (random/non–random) data sample.
Data sampling techniques are classified into two main types:
1 Random sampling
2 Non–random sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 77 / 103
Data sampling (cont’d)
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 78 / 103
Random sampling methods
In random sampling, every data element in the population has a chance to be chosen via a
random selection process. Popular random sampling techniques are as follows:
1 Simple random sampling
2 Stratified sampling
3 Cluster sampling
4 Systematic sampling
5 Multi–stage sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 79 / 103
Simple random sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 80 / 103
Bootstrap resampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 81 / 103
Stratified sampling
In this technique, the data elements are first divided into different groups or strata.
Elements in each group/stratum are similar in some way, and different from the
elements in other strata.
After division, the simple random sampling will be applied for each group/stratum
to choose random elements into the final data sample.
With this technique, we need an understanding of the population in order to divide
the elements into strata.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 82 / 103
Stratified sampling example
The previous example: randomly select 30 employees from the 100 employees of a
company.
Now we have a constraint that half of the 30 selected employees must be female.
Simple random sampling will not always meet this constraint.
Suppose that, among the 100 employees of the company, 25 are female. We then
apply stratified random sampling by dividing the 100 employees into two
groups/strata: 75 male and 25 female.
Then we apply simple random sampling within each group, randomly selecting 15
employees from each. The final data sample consists of 30 employees, half of whom
are female.
This sampling technique is usually applied when we have some constraints for the
resulting data sample and the population distribution is skewed.
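A minimal pandas sketch of this stratified example (the employee table below is hypothetical):

```python
import pandas as pd

# Hypothetical population: 75 male and 25 female employees
employees = pd.DataFrame({"id": range(100),
                          "gender": ["M"] * 75 + ["F"] * 25})

# Stratified sampling: simple random sampling of 15 within each stratum,
# so the final sample of 30 is half female by construction
sample = employees.groupby("gender").sample(n=15, random_state=0)
print(sample["gender"].value_counts())   # F 15, M 15
```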
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 83 / 103
Cluster sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 84 / 103
Cluster sampling (cont’d)
Step 4: collect all the data elements in each selected cluster to form the resulting
data sample.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 85 / 103
Multi–stage cluster sampling
In multi–stage cluster sampling, rather than collecting data from every single element
in the selected clusters, we randomly select individual elements from within each
cluster to form the final data sample. This is called double–stage cluster sampling.
We can also continue this procedure, taking progressively smaller and smaller
random samples, which is usually called multi–stage cluster sampling.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 86 / 103
Cluster sampling: advantages and disadvantages
Advantages:
Cluster sampling is time– and cost–efficient, especially for samples that are widely
geographically spread and would be difficult to properly sample otherwise.
Because cluster sampling uses randomization, if the population is clustered properly,
your study will have high external validity because your sample will reflect the
characteristics of the larger population.
Disadvantages:
Internal validity is less strong than with simple random sampling, particularly as you
use more stages of clustering.
If your clusters are not a good mini–representation of the population as a whole, then
it is more difficult to rely upon your sample to provide valid results.
Cluster sampling is much more complex to plan than other forms of sampling.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 87 / 103
Systematic sampling
In this technique, only the first element is selected randomly. From the second
element onward, the selection is determined by the previous one.
First, all the elements in the population are placed in a sequence in random order.
Suppose the population size is N and the size of the resulting sample is n (normally
n ≪ N). Then divide the population sequence into n sub-sequences of length
k = ⌊N/n⌋ (the last sub-sequence may be longer than k).
Sampling starts by randomly selecting the first element from the first sub-sequence,
say at position/index i_1. Then the position of the second element is i_2 = i_1 + k,
the third element is at i_3 = i_2 + k = i_1 + 2k, and in general
i_n = i_{n−1} + k = i_1 + (n − 1)k.
For example, with N = 22 and n = 5, k = ⌊22/5⌋ = 4 (the last sub-sequence has
length 6). Suppose the first element is selected at i_1 = 2; then the indexes of the
other elements of the sample are 6, 10, 14, and 18.
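A NumPy sketch of systematic sampling with 0-based indices (the slide's example uses 1-based positions):

```python
import numpy as np

def systematic_sample(population, n, seed=None):
    """Random start in the first sub-sequence, then every k-th element."""
    rng = np.random.default_rng(seed)
    N = len(population)
    k = N // n                      # sub-sequence length, k = floor(N / n)
    i1 = rng.integers(0, k)         # random index within the first sub-sequence
    idx = i1 + k * np.arange(n)     # i1, i1 + k, i1 + 2k, ...
    return idx, [population[i] for i in idx]

# N = 22 and n = 5 give k = 4; a start index of 2 yields 2, 6, 10, 14, 18
idx, sample = systematic_sample(list(range(22)), n=5)
print(idx, sample)
```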
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 88 / 103
Systematic sampling example
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 89 / 103
Multi–stage sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 90 / 103
Multi–stage sampling example
In order to evaluate the knowledge of high–school students nationwide, independently
of the national high–school graduation examination, the Ministry of Education and
Training (MOET) wants to organize interviews with 10,000 students from different
regions of the country.
MOET cannot interview students from all high schools in all districts/cities/provinces
due to time, human–resource, and cost constraints.
MOET will perform a multi–stage sampling as follows:
Step 1: all provinces and cities are put into different clusters based on the
geographical regions. Clusters are Northwest, Northeast, Red River Delta, North
Central Coast, South Central Coast, Central Highlands, Southeast, and Mekong
River Delta. From each region, several provinces and cities will be selected
randomly. For example:
Son La, Lai Chau, Ha Giang from Northwest
Hanoi, Hai Duong from Red River Delta
...
Bac Lieu, Can Tho, Tra Vinh from Mekong River Delta
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 91 / 103
Multi–stage sampling example (cont’d)
Step 2: All high schools from the selected provinces/cities above will be classified
into different sub–clusters such as Khu vuc 1 (KV1), Khu vuc 2 (KV2), Khu vuc 2
nong thon (KV2–NT), and Khu vuc 3 (KV3). From each of these clusters, MOET
randomly selects 25 high schools, giving a total of 100 high schools. These 100 high
schools represent all geographical regions and may be located in cities, towns, the
countryside, or mountainous areas.
Step 3: From each of the 100 selected high schools, select 100 12th–grade students in
order to form the resulting data sample of 10,000 students.
The final data sample of 10,000 students is very diverse, including students from
different regions of the country. They may live in cities, towns, or the countryside,
or belong to ethnic minorities. This is obviously a suitable sample for MOET to
evaluate high–school education quality. In reality, MOET can even perform a more
complicated sampling to meet other constraints it may have.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 92 / 103
Non–random sampling methods
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 93 / 103
Convenience sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 94 / 103
Applications of convenience sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 95 / 103
Advantages of convenience sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 96 / 103
Judgement sampling
Judgmental sampling, also called purposive sampling or authoritative sampling, is a
non–probability sampling technique in which the sample members are chosen only
on the basis of the researcher’s knowledge and judgment.
As the researcher’s knowledge is instrumental in creating a sample in this technique,
there are chances that the results will be highly accurate with a minimum margin of
error.
The process of selecting a sample involves the researchers carefully picking and
choosing each individual to be a part of the sample. The researcher’s knowledge is
primary in this sampling process as the members of the sample are not randomly
chosen.
Judgmental sampling is most effective in situations where there are only a restricted
number of people in a population who possess the qualities that a researcher expects
from the target population.
Judgmental sampling is usually used in situations where the target population
comprises highly intellectual individuals who cannot be chosen by using any other
probability or non–probability sampling technique.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 97 / 103
Advantages of judgement sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 98 / 103
Quota sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 99 / 103
Snowball sampling
This technique is usually used when we have little information or find it difficult to
approach elements in the population.
Then, if we find a first appropriate element, we ask that element (normally a
person) to introduce more appropriate elements in the population that we did not
know about or could not approach directly.
In this way, the sample will be expanded quickly, like a snowball.
For example, suppose we need to interview a group of LGBT people but do not know
who they are. We can find the first person to interview and politely ask him/her to
introduce other LGBT people in his/her community. For this reason, this technique
is also called referral sampling.
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 100 / 103
Outline
1 Introduction
2 Data cleaning
3 Data integration
4 Data transformation
5 Data reduction
6 Data sampling
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 101 / 103
References
[1] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan
Kaufmann, Elsevier, 2012 [Book1].
[2] C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
[3] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets.
Cambridge University Press, 2014 [Book3].
[4] M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and
Algorithms. Cambridge University Press, 2013 [Book4].
[5] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a
Highly Connected World. Cambridge University Press, 2010 [Book5].
[6] J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with
Data. O’Reilly, 2017 [Book6].
[7] J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2015
[Book7].
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 102 / 103
Summary
data analysis and mining course @ Xuan–Hieu Phan data preprocessing and preparation 103 / 103