M2 - Preprocessing
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Transformations for Individual
Predictors
• Transformations of predictor variables may be needed for several
reasons.
• Some modeling techniques may have strict requirements, such as the
predictors having a common scale.
• In other cases, creating a good model may be difficult due to specific
characteristics of the data (e.g., outliers).
• Techniques: Centering and scaling, and skewness transformations
Centering and Scaling
• The most straightforward and common data transformation is to center and scale the predictor
variables.
• To center a predictor variable, the average predictor value is subtracted from all the values.
• As a result of centering, the predictor has a zero mean.
• Similarly, to scale the data, each value of the predictor variable is divided by its standard
deviation.
• Scaling the data coerces the values to have a common standard deviation of one.
• These manipulations are generally used to improve the numerical stability of some calculations.
• The real downside to these transformations is a loss of interpretability of the individual values
since the data are no longer in the original units.
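• As a brief sketch of centering and scaling (the predictor values below are hypothetical), the manual computation and scikit-learn's StandardScaler give the same result:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor values (not from the slides)
x = np.array([12.0, 15.0, 9.0, 20.0, 14.0]).reshape(-1, 1)

# Manual centering and scaling: subtract the mean, divide by the standard deviation
centered_scaled = (x - x.mean()) / x.std(ddof=0)

# Equivalent preprocessing with scikit-learn
scaler = StandardScaler()            # centers to mean 0, scales to unit variance
same_result = scaler.fit_transform(x)

print(centered_scaled.ravel())
print(same_result.ravel())           # matches the manual computation
```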
Examples
• Min-max normalization: to [new_min_A, new_max_A]
  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
• Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
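• The worked examples above can be checked with a short script; the decimal-scaling values are hypothetical, chosen only to illustrate the choice of j:
```python
import numpy as np

v, min_a, max_a = 73_600, 12_000, 98_000

# Min-max normalization to [0.0, 1.0]
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))      # 0.716

# Z-score normalization with mean 54,000 and std 16,000
v_zscore = (v - 54_000) / 16_000
print(round(v_zscore, 3))      # 1.225

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1
values = np.array([-986, 217, 450])          # hypothetical values
j = int(np.ceil(np.log10(np.max(np.abs(values)))))
print(values / 10 ** j)        # [-0.986  0.217  0.45 ]
```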
Transformations to Resolve
Skewness
• An un-skewed distribution is one that is roughly symmetric.
• This means that the probability of falling on either side of the distribution’s mean is
roughly equal.
• A right-skewed distribution has a larger number of points on the left side of the
distribution (smaller values) than on the right side (larger values).
• If the predictor distribution is roughly symmetric, the skewness values will be
close to zero.
• As the distribution becomes more right skewed, the skewness statistic becomes larger.
• As the distribution becomes more left skewed, the value becomes negative.
Transformations to Resolve
Skewness
• The formula for the sample skewness statistic is
  skewness = Σ(x_i − x̄)³ / ((n − 1) v^(3/2)), where v = Σ(x_i − x̄)² / (n − 1),
• where x is the predictor variable, n is the number of values, and x̄ is the
sample mean of the predictor.
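• A minimal sketch that implements the sample skewness formula above directly (the toy values are illustrative only):
```python
import numpy as np

def sample_skewness(x):
    """Sample skewness as defined on the slide:
    sum((x - xbar)^3) / ((n - 1) * v^(3/2)), with v = sum((x - xbar)^2) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    v = np.sum((x - xbar) ** 2) / (n - 1)
    return np.sum((x - xbar) ** 3) / ((n - 1) * v ** 1.5)

# A right-skewed toy sample gives a positive value;
# its mirror image gives the same magnitude with a negative sign.
right_skewed = [1, 1, 2, 2, 3, 10]
print(sample_skewness(right_skewed))                 # > 0
print(sample_skewness([-v for v in right_skewed]))   # < 0, same magnitude
```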
Transformations to Resolve
Skewness
• Replacing the data with the log, square root, or inverse may help to remove the skew.
• For the data in Fig. 3.2, after the transformation, the distribution is not entirely symmetric, but
these data are better behaved than when they were in the natural units.
Transformations to Resolve
Skewness
• Statistical methods can be used to empirically identify an appropriate transformation.
• Box and Cox (1964) propose a family of transformations that are indexed by a parameter, denoted as λ:
  x* = (x^λ − 1) / λ if λ ≠ 0, and x* = log(x) if λ = 0.
• In addition to the log transformation, this family can identify the square transformation (λ = 2), square root (λ =
0.5), inverse (λ = −1), and others in between.
• Using the training data, λ can be estimated. Box and Cox (1964) show how to use maximum likelihood
estimation to determine the transformation parameter.
• This procedure would be applied independently to each predictor whose values are all greater than zero.
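• A sketch of estimating λ by maximum likelihood with scipy.stats.boxcox, which requires strictly positive data; the skewed predictor here is simulated, not the textbook's Fig. 3.2 data:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A strongly right-skewed, strictly positive predictor (simulated)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Box-Cox: lambda is estimated from the data by maximum likelihood
x_transformed, lam = stats.boxcox(x)

print(f"estimated lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_transformed):.2f}")
```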
Binning Predictors
• Equal Frequency:
• Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], no. of bins = 3
• Each bin holds the same number of values.
• Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
• Equal Width:
• Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], no. of bins = 3
• Bins have equal width, with the range of each bin defined as [min, min + w), [min + w, min + 2w), …, where w = (max
− min) / (no. of bins).
• Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
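• Both binning schemes in the example above can be reproduced with pandas (a sketch: pd.qcut for equal frequency, pd.cut for equal width):
```python
import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-frequency binning: each of the 3 bins gets the same number of values
equal_freq = pd.qcut(values, q=3)
print(values.groupby(equal_freq).apply(list))
# -> [5, 10, 11, 13] / [15, 35, 50, 55] / [72, 92, 204, 215]

# Equal-width binning: bin width w = (max - min) / 3 = 70
equal_width = pd.cut(values, bins=3)
print(values.groupby(equal_width).apply(list))
# -> [5, ..., 72] / [92] / [204, 215]
```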
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
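• A small numpy sketch of the two smoothing schemes applied to the price bins above, rounding bin means to the nearest dollar:
```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)           # equal-frequency bins of 4 values each

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = np.repeat(np.rint(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)
print(by_means)    # [[9 9 9 9], [23 23 23 23], [29 29 29 29]]

# Smoothing by bin boundaries: every value moves to the closer of its bin's min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
print(by_bounds)   # [[4 4 4 15], [21 21 25 25], [26 26 26 34]]
```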
Incomplete (Missing) Data
• In many cases, some predictors have no values for a given sample.
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• error on part of the researcher
• unavailable respondents
• accidental deletion of observations
• forgetfulness on part of the respondents
• error in accounting, etc.
• Missing data is a concern and may need to be inferred
Missing Data
• These missing data could be structurally missing, such as the number
of movies a man has watched.
• In other cases, the value cannot or was not determined at the time of
model building.
• It is important to understand why the values are missing.
• It is important to know if the pattern of missing data is related to the outcome. This is called
“informative missingness” since the missing data pattern is instructional on its own.
• Informative missingness can induce significant bias in the model.
• Example: Predicting a patient’s response to a drug.
• Suppose the drug was extremely ineffective or had significant side effects. The patient may be likely to miss
doctor visits or to drop out of the study. In this case, there clearly is a relationship between the probability of
missing values and the treatment.
Missing Data
• Customer ratings can often have informative missingness; people are more
compelled to rate products when they have strong opinions (good or bad).
• In this case, the data are more likely to be polarized by having few values in the middle of the rating
scale.
• In the Netflix Prize machine learning competition to predict which movies people will like based on their
previous ratings, the “Napoleon Dynamite Effect” confounded many of the contestants because people
who did rate the movie Napoleon Dynamite either loved or hated it.
Missing Data
• Missing data should not be confused with censored data where the exact value is missing but something
is known about its value.
• Censoring is a condition in which the value of a measurement or observation is only partially known.
• Example: Suppose a study is conducted to measure the impact of a drug on mortality rate.
• In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a
situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the
age of 75.
• For example, a company that rents movie disks by mail may use the duration that a customer has kept a
movie in their models.
• If a customer has not yet returned a movie, we do not know the actual time span, only that it is at least as
long as the current duration.
• Censored data can also be common when using laboratory measurements. Some assays cannot measure
below their limit of detection. In such cases, we know that the value is smaller than the limit but was not
precisely measured.
Types of missing values
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
Missing Completely at Random
(MCAR)
• This happens when the missing values have no hidden dependency on
any other variable or any characteristic of observations.
• Example: If a doctor forgets to record the age of every tenth patient
entering an ICU, the presence of missing value would not depend on
the characteristic of the patients.
Missing at Random (MAR)
• In this case, the probability of missing value depends on the
characteristics of observable data.
• Example: In survey data, high-income respondents are less likely to
inform the researcher about the number of properties owned. The
missing value for the variable number of properties owned will
depend on the income variable.
Missing Not at Random (MNAR)
• This happens when the missing values depend on both characteristics
of the data and also on missing values.
• In this case, determining the mechanism of the generation of missing
value is difficult.
• For example, missing values for a variable like blood pressure may
partially depend on the values of blood pressure as patients who have
low blood pressure are less likely to get their blood pressure checked
frequently.
Dealing with missing
values
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same
class: smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
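• A sketch of the automatic fill-in strategies listed above on a small hypothetical table (the column names are invented for illustration):
```python
import numpy as np
import pandas as pd

# Hypothetical sales records with missing customer income
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [40_000, np.nan, 55_000, np.nan, 65_000],
})

# Fill with a global constant (a sentinel standing in for "unknown")
df["income_const"] = df["income"].fillna(-1)

# Fill with the attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class/segment (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)
print(df)
```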
Dealing with missing values -
Deletion
• If the proportion of missing observations in data is small relative to
the total number of observations, we can simply remove those
observations.
• However, this is often not the case.
• Deleting the rows containing missing values may mean parting with useful
information or patterns.
Dealing with missing values -
Deletion
• For example, consider the following
questionnaire, as answered by 10
subjects:
• A researcher is hoping to model
income (dependent variable) based
on age and gender (independent
variables). Using listwise deletion, the
researcher would remove subjects 3,
4, and 8 from the sample before
performing any further analysis.
Dealing with missing values
• For large data sets, removal of samples based on missing values is not
a problem, assuming that the missingness is not informative.
• In smaller data sets, there is a steep price in removing samples.
• If we do not remove the missing data, there are two general
approaches.
• A few predictive models, especially tree-based techniques, can specifically
account for missing data.
• Missing data can be imputed.
• In this case, we can use information in the training set predictors to, in essence, estimate
the values of other predictors.
• This amounts to a predictive model within a predictive model.
Dealing with missing values
• Imputation is just another layer of modeling where we try to estimate values of the predictor variables based on other predictor
variables.
• The most relevant scheme for accomplishing this is to use the training set to build an imputation model for each predictor in the
data set.
• Prior to model training or the prediction of new samples, missing values are filled in using imputation.
• This extra layer of models adds uncertainty.
• If we are using resampling to select tuning parameter values or to estimate performance, the imputation should be incorporated within the
resampling.
• This will increase the computational time for building models, but it will also provide honest estimates of model performance.
• If the number of predictors affected by missing values is small, an exploratory analysis of the relationships between the predictors
is a good idea.
• For example, visualizations or methods like PCA can be used to determine if there are strong relationships between the predictors.
• If a variable with missing values is highly correlated with another predictor that has few missing values, a focused model can often
be effective for imputation.
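• One way to keep imputation inside the resampling, as recommended above, is to make the imputer a pipeline step so it is re-estimated within every cross-validation fold; the sketch below uses simulated data and scikit-learn:
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Simulated data with roughly 10 % of the values knocked out at random
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation is a pipeline step, so it is refit inside every CV fold
model = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```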
Dealing with missing values
• Troyanskaya et al. (2001) found the nearest neighbor approach to be
fairly robust.
• One could create a simple linear regression model using the data to
predict the missing values.
Dealing with missing values – kNN
Imputation
• One popular technique for imputation is a K-nearest neighbor model.
• A new sample is imputed by finding the samples in the training set
“closest” to it and averaging these nearby points to fill in the value.
• Advantage of this approach is that the imputed data are confined to
be within the range of the training set values.
• A disadvantage is that the entire training set is required every time a
missing value needs to be imputed. Also, the number of neighbors is a
tuning parameter, as is the method for determining “closeness” of
two points.
Dealing with missing values – kNN
Imputation
• In the presence of missing coordinates, the Euclidean distance is calculated by ignoring
the missing values and scaling up the weight of the non-missing coordinates (NaN
Euclidean distance).
• For example, the Euclidean distance between the two points (3, NA, 5) and (1, 0, 0) is:
  sqrt( (3/2) × ((3 − 1)² + (5 − 0)²) ) = sqrt(43.5) ≈ 6.595
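• The same distance can be obtained from scikit-learn's nan_euclidean_distances, which provides a quick check of the worked example above:
```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

a = np.array([[3, np.nan, 5]])
b = np.array([[1, 0, 0]])

# Missing coordinates are ignored and the remaining squared differences
# are scaled up by (total coords / present coords) = 3/2
print(nan_euclidean_distances(a, b))   # [[6.595...]]
```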
Dealing with missing values – kNN
Imputation
• Example: Use 2NN to impute the missing values.
Object ID   X1   X2   X3   X4
    1        2    6    7    4
    2        3    5   10    -
    3        -    9    8    8
    4        5    6    9    2
    5        4    4    8    7
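• scikit-learn's KNNImputer follows this kind of scheme (NaN-aware Euclidean distance, averaging the K nearest donors); applied to the table above with K = 2 it gives a check of the hand computation:
```python
import numpy as np
from sklearn.impute import KNNImputer

# The table above, with '-' encoded as NaN (rows are objects 1-5)
X = np.array([
    [2,      6,  7,      4],
    [3,      5, 10, np.nan],
    [np.nan, 9,  8,      8],
    [5,      6,  9,      2],
    [4,      4,  8,      7],
])

# 2-nearest-neighbour imputation using the NaN-aware Euclidean distance
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# Expect X4 of object 2 ≈ (2 + 7) / 2 = 4.5 (neighbours: objects 4 and 5)
# and X1 of object 3 = (2 + 4) / 2 = 3.0 (neighbours: objects 1 and 5)
```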
Removing Predictors
• Potential advantages to removing predictors prior to modelling
1. Fewer predictors means decreased computational time and complexity.
2. If two predictors are highly correlated, this implies that they are measuring
the same underlying information.
• Removing one should not compromise the performance of the model and might lead to
a more parsimonious and interpretable model.
3. Some models can be crippled by predictors with degenerate distributions.
• In these cases, there can be a significant improvement in model
performance and/or stability without the problematic variables.
Removing Predictors
• Consider a predictor variable that has a single unique value; we refer to this type of data as a zero
variance predictor.
• For some models, such an uninformative variable may have little effect on the calculations.
• A tree-based model is impervious to this type of predictor since it would never be used in a split.
• However, a model such as linear regression would find these data problematic and is likely to cause an error in the
computations.
• In either case, these data have no information and can easily be discarded.
• Similarly, some predictors might have only a handful of unique values that occur with very low
frequencies.
• These “near-zero variance predictors” may have a single value for the vast majority of the
samples.
Removing Predictors
• Consider a text mining application where keyword counts are collected for a large set of documents.
• After filtering out commonly used “stop words” such as the and of, predictor variables can be created for
interesting keywords.
• Suppose a keyword occurs in a small group of documents but is otherwise unused.
• A hypothetical distribution of such a word count distribution is given in Table 3.1.
• Of the 531 documents that were searched, there were only four unique counts. The majority of the documents (523) do not have the keyword; while six documents have two
occurrences, one document has three and another has six occurrences.
• Since 98 % of the data have values of zero, a minority of documents might have an undue influence on the model. Also, if any resampling is used there is a strong possibility that
one of the resampled data sets will only contain documents without the keyword.
Removing Predictors
• How can the user diagnose this mode of problematic data?
• First, the number of unique points in the data must be small relative to the number of samples.
• In the document example, there were 531 documents in the data set, but only four unique values, so the
percentage of unique values is 0.8 %.
• A small percentage of unique values is, in itself, not a cause for concern as many “dummy variables”
generated from categorical predictors would fit this description.
• The problem occurs when the frequency of these unique values is severely disproportionate.
• The ratio of the most common frequency to the second most common reflects the imbalance in the
frequencies.
• Most of the documents in the data set (n = 523) do not have the keyword.
• After this, the most frequent case is documents with two occurrences (n = 6).
• The ratio of these frequencies, 523/6 = 87, is rather high and is indicative of a strong imbalance.
Removing Predictors
• Given this, a rule of thumb for detecting near-zero variance predictors
is:
• The fraction of unique values over the sample size is low (say 10 %).
• The ratio of the frequency of the most prevalent value to the frequency of the
second most prevalent value is large (say around 20).
• If both of these criteria are true and the model in question is
susceptible to this type of predictor, it may be advantageous to
remove the variable from the model.
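• A sketch of this rule of thumb in the spirit of caret's nearZeroVar, applied to the keyword-count example (523 zeros, six twos, one three, one six); the cutoffs are the suggested values of 20 and 10 %:
```python
import numpy as np

def near_zero_variance(x, freq_ratio_cutoff=20, unique_pct_cutoff=10):
    """Flag a predictor whose most common value dominates and whose
    fraction of unique values is small (rule of thumb from the slide)."""
    x = np.asarray(x)
    values, counts = np.unique(x, return_counts=True)
    if len(values) == 1:                     # zero-variance predictor
        return True
    counts = np.sort(counts)[::-1]
    freq_ratio = counts[0] / counts[1]       # most common / second most common
    unique_pct = 100 * len(values) / len(x)
    return freq_ratio > freq_ratio_cutoff and unique_pct < unique_pct_cutoff

# Keyword counts from the document example: 523 zeros, 6 twos, 1 three, 1 six
keyword = np.array([0] * 523 + [2] * 6 + [3] + [6])
print(near_zero_variance(keyword))           # True (ratio 523/6 ≈ 87, 0.8 % unique)
```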
Between-Predictor Correlations
• Collinearity is the technical term for the situation where a pair of
predictor variables have a substantial correlation with each other.
• It is also possible to have relationships between multiple predictors at
once (called multicollinearity).
• For example, the cell segmentation data set (given in textbook 1) has a number of
predictors that reflect the size of the cell.
• There are measurements of the cell perimeter, width, and length as well as
other, more complex calculations.
Between-Predictor Correlations
• Good reasons to avoid data with highly correlated predictors:
• Redundant predictors frequently add more complexity to the model than
information they provide to the model.
• In situations where obtaining the predictor data is costly (either in time or
money), fewer variables is obviously better.
• Mathematical disadvantages to having correlated predictor data:
• Using highly correlated predictors in techniques like linear regression can
result in highly unstable models, numerical errors, and degraded predictive
performance.
Between-Predictor Correlations
• A heuristic approach to dealing with this issue is to remove the minimum number of predictors to
ensure that all pairwise correlations are below a certain threshold.
Between-Predictor Correlations
• Feature extraction methods (e.g., principal components) are another
techniques for mitigating the effect of strong correlations between
predictors.
• However, these techniques make the connection between the
predictors and the outcome more complex.
Between-Predictor Heuristic Approach
• Example: Consider the following four predictors, apply the between-predictor
heuristic approach, and decide which predictor(s) to remove (a worked sketch follows the list).
A: [1, 2, 3, 4, 5]
B: [2, 4, 6, 8, 10]
C: [5, 3, 2, 4, 1]
D: [10, 8, 6, 4, 2]
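• A worked sketch of one simple greedy form of the heuristic on these four predictors, using an arbitrary threshold of |r| > 0.9 (caret's findCorrelation uses a more refined criterion based on mean absolute correlations):
```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 2, 4, 1],
    "D": [10, 8, 6, 4, 2],
})

corr = df.corr().abs()
print(corr.round(2))   # |r| = 1.0 for A-B, A-D, B-D; |r| = 0.7 for C with the rest

# Greedy pass: drop any predictor highly correlated (|r| > 0.9) with one already kept
threshold, kept = 0.9, []
for col in corr.columns:
    if all(corr.loc[col, k] < threshold for k in kept):
        kept.append(col)
print(kept)            # ['A', 'C'] -- B and D are removed
```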
Adding Predictors
• When a predictor is categorical, such as gender or race, it is common to decompose the predictor into a set of more
specific variables.
• For example, the credit scoring data as in Table 3.2 contains a predictor based on how much money was in the applicant’s
savings account.
• These data were encoded into several groups, including a group for “unknown.”
• Table 3.2 shows the values of this predictor and the number of applicants falling into each bin.
• To use these data in models, the categories are re-encoded into smaller bits of information called “dummy variables.”
• Usually, each category gets its own dummy variable that is a zero/one indicator for that group. Table 3.2 shows the possible dummy
variables for these data.
Adding Predictors
• Only four dummy variables are needed here; once we know the value of four of the dummy
variables, the fifth can be inferred.
• However, the decision to include all of the dummy variables can depend on the choice of the model.
• Models that include an intercept term, such as simple linear regression, would have numerical issues if
each dummy variable was included in the model.
• The reason is that, for each sample, these variables all add up to one and this would provide the same
information as the intercept. If the model is insensitive to this type of issue, using the complete set of
dummy variables would help improve interpretation of the model.
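• A sketch of dummy-variable encoding with pandas; the savings-account category labels below are illustrative stand-ins, not the exact bins of Table 3.2. Dropping one level (drop_first=True) avoids the redundancy with an intercept term discussed above:
```python
import pandas as pd

# Hypothetical savings-account categories (labels are illustrative only)
savings = pd.Series(["<100", "100-500", "unknown", ">1000", "<100"], name="savings")

# Full set of dummy variables: one 0/1 indicator per category
print(pd.get_dummies(savings))

# Dropping one level removes the redundancy with an intercept term
print(pd.get_dummies(savings, drop_first=True))
```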
Adding Predictors
• Many of the models automatically generate highly complex, nonlinear relationships between the
predictors and the outcome.
• Example: Logistic regression is a well-known classification model that, by default, generates linear
classification boundaries.
• Figure 3.11 shows another illustrative example with two predictors and two classes.
• The left-hand panel shows the basic logistic regression classification boundaries when the predictors
are added in the usual (linear) manner.
• The right-hand panel shows a logistic model with the basic linear terms and an additional term with
the square of predictor B.
• Since logistic regression is a well-characterized and stable model, using this model with some
additional nonlinear terms may be preferable to highly complex techniques.
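• A sketch of the idea behind the right-hand panel of Fig. 3.11: adding the square of predictor B to an otherwise linear logistic model. The data are simulated, not the textbook's example:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated two-class data where the class depends on B^2, so a purely
# linear boundary in (A, B) is a poor fit
rng = np.random.default_rng(1)
A, B = rng.normal(size=500), rng.normal(size=500)
y = (A + B ** 2 + 0.5 * rng.normal(size=500) > 1).astype(int)

X_linear = np.column_stack([A, B])
X_quad = np.column_stack([A, B, B ** 2])     # add the squared term for predictor B

for name, X in [("linear terms only", X_linear), ("plus B^2", X_quad)]:
    acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(f"{name}: accuracy {acc:.2f}")
```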
Imputing Missing Data using Linear
Regression
• Example:
• The sales of a company (in million dollars) for each year are shown in the table below.
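• The original sales table is not reproduced here, so the sketch below uses hypothetical yearly figures to show the idea: fit a simple linear regression on the complete years and predict the missing one:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly sales in million dollars; one year is missing
years = np.array([2016, 2017, 2018, 2019, 2020])
sales = np.array([12.0, 14.5, 16.8, np.nan, 21.9])

observed = ~np.isnan(sales)
model = LinearRegression().fit(years[observed].reshape(-1, 1), sales[observed])

# Impute the missing year from the fitted trend line
sales[~observed] = model.predict(years[~observed].reshape(-1, 1))
print(sales)
```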
Data Reduction and Feature Extraction
• The primary advantage of PCA, and the reason that it has retained its popularity as a data reduction method, is that it
creates components that are uncorrelated.
• Some predictive models prefer predictors to be uncorrelated (or at least low correlation) in order to find solutions and to
improve the model’s numerical stability.
• PCA preprocessing creates new predictors with desirable characteristics for these kinds of models.
• While PCA delivers new predictors with desirable characteristics, it must be used with understanding and care.
• Practitioners must understand that PCA seeks predictor-set variation without regard to any further understanding of the
predictors (i.e., measurement scales or distributions) or to knowledge of the modeling objectives (i.e., response variable).
• PCA can generate components that summarize characteristics of the data that are irrelevant to the underlying structure of
the data and also to the ultimate modeling objective.
Data Reduction and Feature
Extraction
• Because PCA seeks linear combinations of predictors that maximize variability, it will naturally first be drawn to summarizing
predictors that have more variation.
• If the original predictors are on measurement scales that differ in orders of magnitude [consider demographic predictors such
as income level (in dollars) and height (in feet)], then the first few components will focus on summarizing the higher
magnitude predictors (e.g., income), while latter components will summarize lower variance predictors (e.g., height).
• This means that the PC weights will be larger for the higher variability predictors on the first few components.
• In addition, it means that PCA will be focusing its efforts on identifying the data structure based on measurement scales rather
than based on the important relationships within the data for the current problem. For most data sets, predictors are on
different scales.
• In addition, predictors may have skewed distributions.
• Hence, to help PCA avoid summarizing distributional differences and predictor scale information, it is best to first transform
skewed predictors and then center and scale the predictors prior to performing PCA.
• Centering and scaling enables PCA to find the underlying relationships in the data without being influenced by the original
measurement scales.
Data Reduction and Feature
Extraction
• PCA does not consider the modeling objective or response variable when summarizing variability.
• Because PCA is blind to the response, it is an unsupervised technique.
• If the predictive relationship between the predictors and response is not connected to the predictors’
variability, then the derived PCs will not provide a suitable relationship with the response.
• In this case, a supervised technique, like PLS (Partial Least Squares), will derive components while
simultaneously considering the corresponding response.
• Once we have decided on the appropriate transformations of the predictor variables, we can then apply PCA.
• For data sets with many predictor variables, we must decide how many components to retain.
• A heuristic approach for determining the number of components to retain is to create a scree plot, which
contains the ordered component number (x-axis) and the amount of summarized variability (y-axis).
Data Reduction and Feature
Extraction
• For most data sets, the first few PCs will summarize a majority of the variability, and the plot will show a
steep descent; variation will then taper off for the remaining components. Generally, the component
number prior to the tapering off of variation is the maximal component that is retained.
• In Fig. 3.6, the variation tapers off at component 5. Using this rule of thumb, four PCs would be retained. In
an automated model building process, the optimal number of components can be determined by
cross-validation.
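• A sketch of the recommended sequence (center and scale, then PCA) together with a scree plot of the proportion of variance explained; the correlated predictors are simulated stand-ins for the textbook data:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated correlated predictors stand in for the real data set
X, _ = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=3, random_state=0)

# Center and scale first, then extract principal components
pipeline = make_pipeline(StandardScaler(), PCA())
pipeline.fit(X)
explained = pipeline.named_steps["pca"].explained_variance_ratio_

# Scree plot: component number vs. proportion of variance summarized
plt.plot(range(1, len(explained) + 1), explained, marker="o")
plt.xlabel("Component")
plt.ylabel("Proportion of variance explained")
plt.show()
```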
Handling Redundancy in Data Integration
Handle Noisy Data
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
• street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
• {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
• E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
• E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Roll Up
• For example, if the data cube displays the daily income of a customer, we can use a
roll-up operation to find that customer's monthly income (a pandas sketch follows below).
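• A small pandas sketch of a roll-up from daily to monthly granularity; the column names and figures are hypothetical:
```python
import pandas as pd

# Hypothetical daily income records for one customer
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "income": [120.0, 80.0, 150.0, 60.0],
})

# Roll up: aggregate the daily values to a coarser (monthly) level
monthly = daily.groupby(daily["date"].dt.to_period("M"))["income"].sum()
print(monthly)   # 2024-01 -> 200.0, 2024-02 -> 210.0
```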
Drill Down
• This operation is the reverse of the roll-up operation.
• It allows us to take particular information and then subdivide it
further for finer-granularity analysis.
• It zooms into more detail.
• For example- if India is an attribute of a country column and we wish
to see villages in India, then the drill-down operation splits India into
states, districts, towns, cities, villages and then displays the required
information.
Slicing
• This operation filters the unnecessary portions.
• Suppose in a particular dimension, the user doesn't need everything
for analysis, rather a particular attribute.
• For example, with country = “Jamaica”, the slice will display only the data about Jamaica
and will not display the other countries present in the country list.
Dicing
• This operation performs a multidimensional cut: instead of cutting along a
single dimension, it selects a sub-cube by choosing value ranges on two or
more dimensions.
• For example- the user wants to see the annual salary of Jharkhand
state employees.
Pivot
• This operation is important from a viewing point of view.
• It rotates the data cube to present an alternative view of the data.
• It doesn't change the data present in the data cube.
• For example, if the user is comparing year versus branch, using the
pivot operation, the user can change the viewpoint and now compare
branch versus item type.