Module2_Data_Preprocessing

References:
• Kuhn, Max, and Kjell Johnson, “Applied Predictive Modeling”, 3rd Edition, Springer, 2019.
• Han, Jiawei, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, 3rd Edition.
Data Preprocessing
• Data preparation, also sometimes called “pre-processing,” is the act of
cleaning and consolidating raw data prior to using it for business
analysis.
• It might not be the most celebrated of tasks, but careful data
preparation is a key component of successful data analysis.
• Most industry observers report that data preparation steps for
business analysis or machine learning consume 70 to 80% of the time
spent by data scientists and analysts.
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Transformations for Individual
Predictors
• Transformations of predictor variables may be needed for several
reasons.
• Some modeling techniques may have strict requirements, such as the
predictors having a common scale.
• In other cases, creating a good model may be difficult due to specific
characteristics of the data (e.g., outliers).
• Techniques: Centering and scaling, and skewness transformations
Centering and Scaling
• The most straightforward and common data transformation is to center and scale the predictor variables.
• To center a predictor variable, the average predictor value is subtracted from all the values.
• As a result of centering, the predictor has a zero mean.
• Similarly, to scale the data, each value of the predictor variable is divided by its standard
deviation.
• Scaling the data coerces the values to have a common standard deviation of one.
• These manipulations are generally used to improve the numerical stability of some calculations.
• The real downside to these transformations is a loss of interpretability of the individual values
since the data are no longer in the original units.
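A minimal NumPy sketch, with made-up values, showing that centering yields a zero mean and scaling yields a unit standard deviation:

```python
import numpy as np

# Hypothetical predictor values (not from the slides).
x = np.array([12.0, 19.0, 29.0, 37.0, 45.0])

centered = x - x.mean()              # subtract the average: zero mean
scaled = centered / x.std(ddof=1)    # divide by the standard deviation: unit spread

print(centered.mean())               # ~0.0
print(scaled.std(ddof=1))            # 1.0
```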
Examples
• Min-max normalization: to [new_min_A, new_max_A]

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (centering and scaling) (μ_A: mean, σ_A: standard deviation):

  v' = (v − μ_A) / σ_A

• Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

  (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
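A NumPy sketch that reproduces the worked numbers above (the income values are illustrative):

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])  # illustrative values

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
print(min_max[2])                     # 73,600 maps to ~0.716

# Z-score normalization using the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
print((73_600.0 - mu) / sigma)        # 1.225

# Decimal scaling: divide by 10**j so that the largest |value| falls below 1
j = int(np.ceil(np.log10(np.abs(income).max())))
print(income / 10**j)                 # all values now lie in (-1, 1)
```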
Transformations to Resolve
Skewness
• An un-skewed distribution is one that is roughly symmetric.
• This means that the probability of falling on either side of the distribution’s mean is
roughly equal.
• A right-skewed distribution has a larger number of points on the left side of the distribution (smaller values) than on the right side (larger values).
• If the predictor distribution is roughly symmetric, the skewness values will be
close to zero.
• As the distribution becomes more right skewed, the skewness statistic becomes larger.
• As the distribution becomes more left skewed, the value becomes negative.
Transformations to Resolve
Skewness
• The formula for the sample skewness statistic is

  skewness = Σ(x_i − x̄)³ / ((n − 1) · v^(3/2)),  where v = Σ(x_i − x̄)² / (n − 1)

• where x is the predictor variable, n is the number of values, and x̄ is the sample mean of the predictor.
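A NumPy sketch of this statistic (the sample values are made up):

```python
import numpy as np

def sample_skewness(x: np.ndarray) -> float:
    """Sample skewness as defined above: sum((x - mean)^3) / ((n - 1) * v^(3/2))."""
    n = len(x)
    dev = x - x.mean()
    v = np.sum(dev**2) / (n - 1)
    return np.sum(dev**3) / ((n - 1) * v**1.5)

# A strongly right-skewed toy sample (illustrative values only).
x = np.array([1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 9.5, 20.0])
print(sample_skewness(x))            # clearly positive: right skew
print(sample_skewness(np.log(x)))    # much closer to zero after a log transform
```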
Transformations to Resolve
Skewness
• Replacing the data with the log, square root, or inverse may help to remove the skew.
• For the data in Fig. 3.2, after the transformation, the distribution is not entirely symmetric, but these data are better behaved than when they were in the natural units.
Transformations to Resolve
Skewness
• Statistical methods can be used to empirically identify an appropriate transformation.
• Box and Cox (1964) propose a family of transformations that are indexed by a parameter, denoted λ:

  x* = (x^λ − 1) / λ   if λ ≠ 0
  x* = log(x)          if λ = 0

• In addition to the log transformation, this family can identify the square transformation (λ = 2), square root (λ = 0.5), inverse (λ = −1), and others in between.
• Using the training data, λ can be estimated. Box and Cox (1964) show how to use maximum likelihood estimation to determine the transformation parameter.
• This procedure would be applied independently to each predictor whose data contain values greater than zero.
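A minimal SciPy sketch; boxcox estimates λ by maximum likelihood, and the sample values below are made up:

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive toy data (illustrative values only).
x = np.array([1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 9.5, 20.0])

# boxcox returns the transformed data and the maximum-likelihood estimate of lambda.
x_transformed, lam = stats.boxcox(x)
print(lam)                                        # estimated lambda
print(stats.skew(x), stats.skew(x_transformed))   # skewness before vs. after
```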
Binning Predictors

First sort the data and partition it into (equal-frequency) bins

• Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

• Equal-width (distance) partitioning


• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will
be: W = (B –A)/N.
• The most straightforward approach, but outliers may dominate the presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately same number of
samples
• Good data scaling
• Managing categorical attributes can be tricky
Binning Predictors
• Equal frequency:
• Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215] , no. of bins=3

• Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]

• Equal Width:
• Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], no. of bins=3
• Bins have equal width, with bin boundaries at min + w, min + 2w, …, min + nw, where w = (max − min) / (no. of bins).
• Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
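A minimal pandas sketch of both partitioning schemes on the slide's input list:

```python
import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-frequency (equal-depth) binning: three bins of ~4 values each.
print(values.groupby(pd.qcut(values, q=3), observed=True).apply(list))

# Equal-width binning: each bin spans (215 - 5) / 3 = 70 units.
print(values.groupby(pd.cut(values, bins=3), observed=True).apply(list))
```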
Binning Methods for Data Smoothing
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

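A NumPy sketch of the same smoothing on the slide's price data (the slide rounds the bin means of 22.75 and 29.25 to 23 and 29):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Partition the sorted values into equal-frequency bins of 4 values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 4, axis=1)

# Smoothing by bin boundaries: each value snaps to the nearer of its bin's min/max.
lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means.round(2))
print(by_bounds)
```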
Incomplete (Missing) Data
• In many cases, some predictors have no values for a given sample.
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• error on part of the researcher
• unavailable respondents
• accidental deletion of observations
• forgetfulness on part of the respondents
• error in accounting, etc.
• Missing data is a concern and may need to be inferred
Missing Data
• These missing data could be structurally missing, such as the number
of movies a man has watched.
• In other cases, the value cannot or was not determined at the time of
model building.
• It is important to understand why the values are missing.
• It is important to know if the pattern of missing data is related to the outcome. This is called
“informative missingness” since the missing data pattern is instructional on its own.
• Informative missingness can induce significant bias in the model.
• Example: Predicting a patient’s response to a drug.
• Suppose the drug was extremely ineffective or had significant side effects. The patient may be likely to miss
doctor visits or to drop out of the study. In this case, there clearly is a relationship between the probability of
missing values and the treatment.
Missing Data
• Customer ratings can often have informative missingness; people are more
compelled to rate products when they have strong opinions (good or bad).
• In this case, the data are more likely to be polarized by having few values in the middle of the rating
scale.
• In the Netflix Prize machine learning competition to predict which movies people will like based on their
previous ratings, the “Napoleon Dynamite Effect” confounded many of the contestants because people
who did rate the movie Napoleon Dynamite either loved or hated it.
Missing Data
• Missing data should not be confused with censored data where the exact value is missing but something
is known about its value.
• Censoring is a condition in which the value of a measurement or observation is only partially known.
• Example: Suppose a study is conducted to measure the impact of a drug on mortality rate.
• In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a
situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the
age of 75.
• For example, a company that rents movie disks by mail may use the duration that a customer has kept a
movie in their models.
• If a customer has not yet returned a movie, we do not know the actual time span, only that it is at least as long as the current duration.
• Censored data can also be common when using laboratory measurements. Some assays cannot measure
below their limit of detection. In such cases, we know that the value is smaller than the limit but was not
precisely measured.
Types of missing values
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
Missing Completely at Random
(MCAR)
• This happens when the missing values have no hidden dependency on
any other variable or any characteristic of observations.
• Example: If a doctor forgets to record the age of every tenth patient
entering an ICU, the presence of missing value would not depend on
the characteristic of the patients.
Missing at Random (MAR)
• In this case, the probability of missing value depends on the
characteristics of observable data.
• Example: In survey data, high-income respondents are less likely to
inform the researcher about the number of properties owned. The
missing value for the variable number of properties owned will
depend on the income variable.
Missing Not at Random (MNAR)
• This happens when the missing values depend both on the observed characteristics of the data and on the missing values themselves.
• In this case, determining the mechanism of the generation of missing
value is difficult.
• For example, missing values for a variable like blood pressure may
partially depend on the values of blood pressure as patients who have
low blood pressure are less likely to get their blood pressure checked
frequently.
Dealing with missing
values
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
  • a global constant: e.g., “unknown”, a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree
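A small pandas sketch of the automatic fill-in options; the data frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing incomes (not from the slides).
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30_000, np.nan, 52_000, 48_000, np.nan],
})

# Fill with a global constant, with the attribute mean,
# or with the mean of samples in the same class.
constant = df["income"].fillna(-1)
overall_mean = df["income"].fillna(df["income"].mean())
class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(pd.DataFrame({"constant": constant, "overall": overall_mean, "by_class": class_mean}))
```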
Dealing with missing values -
Deletion
• If the proportion of missing observations in data is small relative to
the total number of observations, we can simply remove those
observations.
• However, this is often not the case.
• Deleting the rows containing missing values may mean discarding useful information or patterns.
Dealing with missing values -
Deletion
• For example, consider the following
questionnaire, as answered by 10
subjects:
• A researcher is hoping to model
income (dependent variable) based
on age and gender (independent
variables). Using listwise deletion, the
researcher would remove subjects 3,
4, and 8 from the sample before
performing any further analysis.
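A pandas sketch of listwise deletion; the questionnaire values below are hypothetical, chosen only so that subjects 3, 4, and 8 carry the missing fields, as in the example:

```python
import numpy as np
import pandas as pd

# Hypothetical questionnaire responses; subjects 3, 4, and 8 have a missing field.
survey = pd.DataFrame({
    "subject": range(1, 11),
    "age":     [23, 31, np.nan, 45, 52, 29, 38, np.nan, 61, 27],
    "gender":  ["F", "M", "M", np.nan, "F", "F", "M", "F", "M", "F"],
    "income":  [40, 55, 48, 61, 70, 45, 58, np.nan, 75, 42],
})

# Listwise deletion: drop every row with at least one missing value.
complete_cases = survey.dropna()
print(complete_cases["subject"].tolist())   # subjects 3, 4, and 8 are removed
```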
Dealing with missing values
• For large data sets, removal of samples based on missing values is not
a problem, assuming that the missingness is not informative.
• In smaller data sets, there is a steep price in removing samples.
• If we do not remove the missing data, there are two general
approaches.
• A few predictive models, especially tree-based techniques, can specifically
account for missing data.
• Missing data can be imputed.
• In this case, we can use information in the training set predictors to, in essence, estimate
the values of other predictors.
• This amounts to a predictive model within a predictive model.
Dealing with missing values
• Imputation is just another layer of modeling where we try to estimate values of the predictor variables based on other predictor
variables.
• The most relevant scheme for accomplishing this is to use the training set to build an imputation model for each predictor in the
data set.
• Prior to model training or the prediction of new samples, missing values are filled in using imputation.
• This extra layer of models adds uncertainty.
• If we are using resampling to select tuning parameter values or to estimate performance, the imputation should be incorporated within the
resampling.
• This will increase the computational time for building models, but it will also provide honest estimates of model performance.
• If the number of predictors affected by missing values is small, an exploratory analysis of the relationships between the predictors
is a good idea.
• For example, visualizations or methods like PCA can be used to determine if there are strong relationships between the predictors.
• If a variable with missing values is highly correlated with another predictor that has few missing values, a focused model can often
be effective for imputation.
Dealing with missing values
• Troyanskaya et al. (2001) found the nearest neighbor approach to be
fairly robust.
• One could create a simple linear regression model using the data to
predict the missing values.
Dealing with missing values – kNN
Imputation
• One popular technique for imputation is a K-nearest neighbor model.
• A new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value.
• An advantage of this approach is that the imputed data are confined to be within the range of the training set values.
• A disadvantage is that the entire training set is required every time a missing value needs to be imputed. Also, the number of neighbors is a tuning parameter, as is the method for determining the “closeness” of two points.
Dealing with missing values – kNN
Imputation
• In the presence of missing coordinates, the Euclidean distance is calculated by ignoring the missing values and scaling up the weight of the non-missing coordinates (NaN-Euclidean distance).

• For example, the Euclidean distance between the two points (3, NA, 5) and (1, 0, 0) is:

  sqrt( (3/2) × [ (3 − 1)² + (5 − 0)² ] ) = sqrt(43.5) ≈ 6.595
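scikit-learn implements this distance directly; a quick check of the slide's example:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

a = np.array([[3.0, np.nan, 5.0]])
b = np.array([[1.0, 0.0, 0.0]])

# Missing coordinates are skipped and the remaining squared differences are
# scaled up by (total coordinates / observed coordinates) = 3/2.
print(nan_euclidean_distances(a, b))   # [[6.595...]]
```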
Dealing with missing values – kNN
Imputation
• Example: Use 2NN to impute the missing values.
Object ID | X1 | X2 | X3 | X4
    1     |  2 |  6 |  7 |  4
    2     |  3 |  5 | 10 |  -
    3     |  - |  9 |  8 |  8
    4     |  5 |  6 |  9 |  2
    5     |  4 |  4 |  8 |  7
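A sketch of this exercise using scikit-learn's KNNImputer, which relies on the NaN-Euclidean distance described above and averages the two nearest neighbors:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [2, 6, 7, 4],        # object 1
    [3, 5, 10, np.nan],  # object 2
    [np.nan, 9, 8, 8],   # object 3
    [5, 6, 9, 2],        # object 4
    [4, 4, 8, 7],        # object 5
], dtype=float)

imputer = KNNImputer(n_neighbors=2)   # distances ignore the missing coordinates
print(imputer.fit_transform(X))
```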
Removing Predictors
• Potential advantages to removing predictors prior to modelling
1. Fewer predictors means decreased computational time and complexity.
2. If two predictors are highly correlated, this implies that they are measuring
the same underlying information.
• Removing one should not compromise the performance of the model and might lead to
a more parsimonious and interpretable model.
3. Some models can be crippled by predictors with degenerate distributions.
• In these cases, there can be a significant improvement in model
performance and/or stability without the problematic variables.
Removing Predictors
• Consider a predictor variable that has a single unique value; we refer to this type of data as a zero
variance predictor.
• For some models, such an uninformative variable may have little effect on the calculations.
• A tree-based model is impervious to this type of predictor since it would never be used in a split.
• However, a model such as linear regression would find these data problematic, and they are likely to cause an error in the computations.
• In either case, these data have no information and can easily be discarded.

• Similarly, some predictors might have only a handful of unique values that occur with very low
frequencies.
• These “near-zero variance predictors” may have a single value for the vast majority of the
samples.
Removing Predictors
• Consider a text mining application where keyword counts are collected for a large set of documents.
• After filtering out commonly used “stop words” such as the and of, predictor variables can be created for
interesting keywords.
• Suppose a keyword occurs in a small group of documents but is otherwise unused.
• A hypothetical distribution of such a word count distribution is given in Table 3.1.
• Of the 531 documents that were searched, there were only four unique counts. The majority of the documents (523) do not have the keyword; while six documents have two
occurrences, one document has three and another has six occurrences.
• Since 98 % of the data have values of zero, a minority of documents might have an undue influence on the model. Also, if any resampling is used there is a strong possibility that
one of the resampled data sets will only contain documents without the keyword.
Removing Predictors
• How can the user diagnose this mode of problematic data?
• First, the number of unique points in the data must be small relative to the number of samples.
• In the document example, there were 531 documents in the data set, but only four unique values, so the
percentage of unique values is 0.8 %.
• A small percentage of unique values is, in itself, not a cause for concern as many “dummy variables”
generated from categorical predictors would fit this description.
• The problem occurs when the frequency of these unique values is severely disproportionate.
• The ratio of the most common frequency to the second most common reflects the imbalance in the
frequencies.
• Most of the documents in the data set (n = 523) do not have the keyword.
• After this, the most frequent case is documents with two occurrences (n = 6).
• The ratio of these frequencies, 523/6 = 87, is rather high and is indicative of a strong imbalance.
Removing Predictors
• Given this, a rule of thumb for detecting near-zero variance predictors
is:
• The fraction of unique values over the sample size is low (say 10 %).
• The ratio of the frequency of the most prevalent value to the frequency of the
second most prevalent value is large (say around 20).
• If both of these criteria are true and the model in question is
susceptible to this type of predictor, it may be advantageous to
remove the variable from the model.
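A small pandas sketch of this rule of thumb (modeled loosely on caret's nearZeroVar, but written from the two criteria above), checked against the keyword-count example:

```python
import pandas as pd

def near_zero_variance(s: pd.Series, freq_cut: float = 20.0, unique_cut: float = 0.10) -> bool:
    """Flag a predictor using the two rules of thumb above (a sketch, not caret's nearZeroVar)."""
    counts = s.value_counts()
    if len(counts) == 1:                       # zero-variance predictor
        return True
    freq_ratio = counts.iloc[0] / counts.iloc[1]
    pct_unique = s.nunique() / len(s)
    return freq_ratio > freq_cut and pct_unique < unique_cut

# The keyword-count example: 523 zeros, six 2s, one 3, one 6 (531 documents in total).
keyword = pd.Series([0] * 523 + [2] * 6 + [3] + [6])
print(near_zero_variance(keyword))             # True: ratio 523/6 ≈ 87, 4/531 ≈ 0.8 % unique
```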
Between-Predictor Correlations
• Collinearity is the technical term for the situation where a pair of
predictor variables have a substantial correlation with each other.
• It is also possible to have relationships between multiple predictors at
once (called multicollinearity).
• For example, the cell segmentation dataset (given in textbook 1) has a number of predictors that reflect the size of the cell.
• There are measurements of the cell perimeter, width, and length as well as
other, more complex calculations.
Between-Predictor Correlations
• Good reasons to avoid data with highly correlated predictors:
• Redundant predictors frequently add more complexity to the model than
information they provide to the model.
• In situations where obtaining the predictor data is costly (either in time or
money), fewer variables is obviously better.
• Mathematical disadvantages to having correlated predictor data:
• Using highly correlated predictors in techniques like linear regression can
result in highly unstable models, numerical errors, and degraded predictive
performance.
Between-Predictor Correlations
• A heuristic approach to dealing with this issue is to remove the minimum number of predictors to
ensure that all pairwise correlations are below a certain threshold.
Between-Predictor Correlations
• Feature extraction methods (e.g., principal components) are another technique for mitigating the effect of strong correlations between predictors.
• However, these techniques make the connection between the
predictors and the outcome more complex.
Between Predictor- Heuristic
Approach
• Example: Consider the following 4 predictors and apply the Between
Predictor Heuristic approach and remove predictor(s).
A: [1, 2, 3, 4, 5]
B: [2, 4, 6, 8, 10]
C: [5, 3, 2, 4, 1]
D: [10, 8, 6, 4, 2]
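A simplified greedy sketch of this heuristic on the four predictors above; the 0.75 threshold is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 2, 4, 1],
    "D": [10, 8, 6, 4, 2],
})
print(df.corr().abs().round(2))      # A, B, and D are perfectly correlated; C is ~0.7

# While any pairwise correlation exceeds the threshold, drop the predictor of that
# pair with the larger mean absolute correlation with everything else.
threshold = 0.75
while True:
    cols = list(df.columns)
    c = df.corr().abs().to_numpy()
    np.fill_diagonal(c, 0.0)
    if c.max() <= threshold:
        break
    i, j = np.unravel_index(np.argmax(c), c.shape)
    drop = cols[i] if c[i].mean() >= c[j].mean() else cols[j]
    df = df.drop(columns=drop)

print(list(df.columns))              # the surviving predictors
```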
Adding Predictors
• When a predictor is categorical, such as gender or race, it is common to decompose the predictor into a set of more
specific variables.

• For example, the credit scoring data as in Table 3.2 contains a predictor based on how much money was in the applicant’s
savings account.

• These data were encoded into several groups, including a group for “unknown.”

• Table 3.2 shows the values of this predictor and the number of applicants falling into each bin.

• To use these data in models, the categories are re-encoded into smaller bits of information called “dummy variables.”
• Usually, each category gets its own dummy variable that is a zero/one indicator for that group. Table 3.2 shows the possible dummy variables for these data.
Adding Predictors
• Only four dummy variables are needed here; once we know the value of four of the dummy
variables, the fifth can be inferred.
• However, the decision to include all of the dummy variables can depend on the choice of the model.
• Models that include an intercept term, such as simple linear regression, would have numerical issues if
each dummy variable was included in the model.
• The reason is that, for each sample, these variables all add up to one and this would provide the same
information as the intercept. If the model is insensitive to this type of issue, using the complete set of
dummy variables would help improve interpretation of the model.
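A pandas sketch of dummy-variable encoding; the category labels are hypothetical stand-ins, since Table 3.2 is not reproduced in the slides:

```python
import pandas as pd

# Hypothetical savings-account categories (stand-ins for the groups in Table 3.2).
savings = pd.Series(["< 100", "100-500", "500-1000", "> 1000", "unknown", "< 100"])

# Full set of dummy variables: one 0/1 indicator per category.
full = pd.get_dummies(savings, dtype=int)

# For models with an intercept (e.g., linear regression), drop one level so the
# remaining columns do not always sum to one.
reduced = pd.get_dummies(savings, drop_first=True, dtype=int)

print(full)
print(reduced)
```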
Adding Predictors
• Many of the models automatically generate highly complex, nonlinear relationships between the
predictors and the outcome.
• Example: Logistic regression is a well-known classification model that, by default, generates linear classification boundaries.
• Figure 3.11 shows another illustrative example with two predictors and two classes.
• The left-hand panel shows the basic logistic regression classification boundaries when the predictors
are added in the usual (linear) manner.
• The right-hand panel shows a logistic model with the basic linear terms and an additional term with
the square of predictor B.
• Since logistic regression is a well-characterized and stable model, using this model with some
additional nonlinear terms may be preferable to highly complex techniques.
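A small scikit-learn sketch of this idea on simulated data; the data and the decision to square the second predictor are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated two-predictor data; the class boundary is curved in predictor B.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # columns: predictor A, predictor B
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Linear terms only, versus linear terms plus the square of predictor B.
linear = LogisticRegression(max_iter=1000).fit(X, y)
X_quad = np.column_stack([X, X[:, 1] ** 2])
quadratic = LogisticRegression(max_iter=1000).fit(X_quad, y)

print(linear.score(X, y), quadratic.score(X_quad, y))   # the quadratic fit should do better
```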
Imputing Missing Data using Linear
Regression
• Example:
• The sales of a company (in million dollars) for each year are shown in the table below.

x (year)  | 2005 | 2006 | 2007 | 2008 | 2009 | 2012
y (sales) |  12  |  19  |  29  |  37  |  45  |  ?

a) Find the least-squares regression line y = ax + b.

b) Use the least-squares regression line to find the missing value.

We know: a = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
         b = (1/n)(Σy − aΣx)

• Let us change the variable x into t such that t = x − 2005, so that t represents the number of years after 2005. The table of values becomes:

t         |  0 |  1 |  2 |  3 |  4 |  7
y (sales) | 12 | 19 | 29 | 37 | 45 |  ?
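Working the example with NumPy and the formulas above (t = x − 2005):

```python
import numpy as np

t = np.array([0, 1, 2, 3, 4], dtype=float)   # years after 2005
y = np.array([12, 19, 29, 37, 45], dtype=float)

n = len(t)
a = (n * np.sum(t * y) - np.sum(t) * np.sum(y)) / (n * np.sum(t**2) - np.sum(t) ** 2)
b = (np.sum(y) - a * np.sum(t)) / n
print(a, b)                    # a = 8.4, b = 11.6

# Impute the missing 2012 value (t = 7).
print(a * 7 + b)               # 70.4 million dollars
```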
Data Transformations for Multiple
Predictors
• These transformations act on groups of predictors, typically the entire
set under consideration.
• Discussions on:
• Transformations to Resolve Outliers
• Data Reduction and Feature Extraction
Data Transformations for Multiple Predictors:
Transformations to Resolve Outliers
• Outliers are samples that are exceptionally far from the mainstream of the data.
• Under certain assumptions, there are formal statistical definitions of an outlier.
• Even with a thorough understanding of the data, outliers can be hard to define.
• However, we can often identify an unusual value by looking at a figure.
• When one or more samples are suspected to be outliers, the first step is to make sure that the values are
scientifically valid (e.g., positive blood pressure) and that no data recording errors have occurred.
• Great care should be taken not to hastily remove or change values, especially if the sample size is small.
• With small sample sizes, apparent outliers might be a result of a skewed distribution where there are not
yet enough data to see the skewness.
• Also, the outlying data may be an indication of a special part of the population under study that is just
starting to be sampled.
Transformations to Resolve Outliers
• Depending on how the data were collected, a “cluster” of valid points that reside
outside the mainstream of the data might belong to a different population than
the other samples.
• There are several predictive models that are resistant to outliers.
• Tree-based classification models create splits of the training data and the prediction equation is a set of
logical statements such as “if predictor A is greater than X, predict the class to be Y ,” so the outlier does
not usually have an exceptional influence on the model.
• Support vector machines for classification generally disregard a portion of the training set samples
when creating a prediction equation.
• The excluded samples may be far away from the decision boundary and outside of the data mainstream.
Transformations to Resolve Outliers
• If a model is considered to be sensitive to outliers, one data transformation that can minimize the problem is the spatial sign (Serneels et al. 2006).
• This procedure projects the predictor values onto a multidimensional sphere.
• This has the effect of making all the samples the same distance from the center of the sphere.
• Mathematically, each sample is divided by its squared norm:

  x*_ij = x_ij / sqrt( Σ_{j=1..P} x_ij² )
Transformations to Resolve Outliers
• Since the denominator is intended to measure the squared distance to the
center of the predictor’s distribution, it is important to center and scale the
predictor data prior to using this transformation.

• Unlike centering or scaling, this manipulation of the predictors transforms them as a group.
• Removing predictor variables after applying the spatial sign transformation may be problematic.
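A minimal NumPy sketch of the spatial sign transformation on simulated data; centering and scaling precede the projection, as recommended above:

```python
import numpy as np

# Hypothetical two-predictor data with a small outlying cluster (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 1, size=(8, 2))])

# Center and scale first, then divide each sample (row) by its norm.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
spatial_sign = Z / np.linalg.norm(Z, axis=1, keepdims=True)

print(np.linalg.norm(spatial_sign, axis=1)[:5])   # every sample now has unit norm
```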
Transformations to Resolve Outliers
• Example:
• In these data, at least eight samples cluster away from
the majority of other data. These data points are likely a
valid, but poorly sampled subpopulation of the data.
• The modeler would investigate why these points are
different; perhaps they represent a group of interest,
such as highly profitable customers.
• The spatial sign transformation is shown on the right-
hand panel where all the data points are projected to be
a common distance away from the origin.
• The outliers still reside in the Northwest section of the
distribution but are contracted inwards.
• This mitigates the effect of the samples on model
training.
Data Transformations for Multiple Predictors:
Data Reduction and Feature Extraction
• Data reduction techniques are another class of predictor transformations.
• These methods reduce the data by generating a smaller set of predictors that
seek to capture a majority of the information in the original variables.
• Fewer variables can be used that provide reasonable fidelity to the original data.
• For most data reduction techniques, the new predictors are functions of the
original predictors; therefore, all the original predictors are still needed to create
the surrogate variables.
• This class of methods is often called signal extraction or feature extraction
techniques.
Data Reduction and Feature
Extraction
• Principal Component Analysis (PCA) is a commonly used data reduction
technique.
• This method seeks to find linear combinations of the predictors, known as principal components (PCs),
which capture the most possible variance.
• The first PC is defined as the linear combination of the predictors that captures the most variability of all
possible linear combinations.
• Then, subsequent PCs are derived such that these linear combinations capture the most remaining
variability while also being uncorrelated with all previous PCs.
• Mathematically, the jth PC can be written as:

  PC_j = (a_j1 × Predictor 1) + (a_j2 × Predictor 2) + … + (a_jP × Predictor P)
Data Reduction and Feature
Extraction
• Consider the data in Fig. 3.5.
• This set contains a subset of two correlated predictors,
average pixel intensity of channel 1 and entropy of intensity
values in the cell (a measure of cell shape), and a categorical
response.
• Given the high correlation between the predictors (0.93), we
could infer that average pixel intensity and entropy of
intensity values measure redundant information about the
cells and that either predictor or a linear combination of these
predictors could be used in place of the original predictors.
• In this example, two PCs can be derived (right plot in Fig. 3.5);
this transformation represents a rotation of the data about
the axis of greatest variation.
• The first PC summarizes 97 % of the original variability, while
the second summarizes 3 %.
Data Reduction and Feature
Extraction
• Hence, it is reasonable to use only the first PC for modeling since it accounts for the majority of information in the data.

• The primary advantage of PCA, and the reason that it has retained its popularity as a data reduction method, is that it
creates components that are uncorrelated.

• Some predictive models prefer predictors to be uncorrelated (or at least low correlation) in order to find solutions and to
improve the model’s numerical stability.

• PCA preprocessing creates new predictors with desirable characteristics for these kinds of models.

• While PCA delivers new predictors with desirable characteristics, it must be used with understanding and care.

• Practitioners must understand that PCA seeks predictor-set variation without regard to any further understanding of the
predictors (i.e., measurement scales or distributions) or to knowledge of the modeling objectives (i.e., response variable).

• PCA can generate components that summarize characteristics of the data that are irrelevant to the underlying structure of
the data and also to the ultimate modeling objective.
Data Reduction and Feature
Extraction
• Because PCA seeks linear combinations of predictors that maximize variability, it will naturally first be drawn to summarizing
predictors that have more variation.
• If the original predictors are on measurement scales that differ in orders of magnitude [consider demographic predictors such
as income level (in dollars) and height (in feet)], then the first few components will focus on summarizing the higher
magnitude predictors (e.g., income), while latter components will summarize lower variance predictors (e.g., height).
• This means that the PC weights will be larger for the higher variability predictors on the first few components.
• In addition, it means that PCA will be focusing its efforts on identifying the data structure based on measurement scales rather
than based on the important relationships within the data for the current problem. For most data sets, predictors are on
different scales.
• In addition, predictors may have skewed distributions.
• Hence, to help PCA avoid summarizing distributional differences and predictor scale information, it is best to first transform
skewed predictors and then center and scale the predictors prior to performing PCA.
• Centering and scaling enables PCA to find the underlying relationships in the data without being influenced by the original
measurement scales.
Data Reduction and Feature
Extraction
• PCA does not consider the modeling objective or response variable when summarizing variability.
• Because PCA is blind to the response, it is an unsupervised technique.
• If the predictive relationship between the predictors and response is not connected to the predictors’
variability, then the derived PCs will not provide a suitable relationship with the response.
• In this case, a supervised technique, like PLS (Partial Least Squares), will derive components while
simultaneously considering the corresponding response.
• Once we have decided on the appropriate transformations of the predictor variables, we can then apply PCA.
• For data sets with many predictor variables, we must decide how many components to retain.
• A heuristic approach for determining the number of components to retain is to create a scree plot, which
contains the ordered component number (x-axis) and the amount of summarized variability (y-axis).
Data Reduction and Feature
Extraction
• For most data sets, the first few PCs will summarize a majority of the variability, and the plot will show a
steep descent; variation will then taper off for the remaining components. Generally, the component
number prior to the tapering off of variation is the maximal component that is retained.
• In Fig. 3.6, the variation tapers off at component 5. Using this rule of thumb, four PCs would be retained. In an automated model-building process, the optimal number of components can be determined by cross-validation.
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  • Object identification: The same attribute or object may have different names in different databases
  • Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

Handle Noisy Data
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)

Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and
is usually associated with each dimension in a data warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
• Concept hierarchies can be automatically formed for both numeric and nominal data. For numeric data, use the discretization methods shown earlier.
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
• street < city < state < country
• Specification of a hierarchy for a set of values by explicit data
grouping
• {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
• E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
• E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation

• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of the hierarchy
• Exceptions, e.g., weekday, month, quarter, year

  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
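A minimal pandas sketch of this automatic ordering, on a tiny hypothetical location table (the values are made up; only the idea of ranking attributes by distinct counts comes from the slide):

```python
import pandas as pd

# Hypothetical location table; column names follow the slide's example.
geo = pd.DataFrame({
    "country":           ["US", "US", "US", "US", "CA"],
    "province_or_state": ["IL", "IL", "IL", "CA", "ON"],
    "city":              ["Urbana", "Urbana", "Champaign", "LA", "Toronto"],
    "street":            ["S 1st St", "N Neil St", "Green St", "Main St", "King St"],
})

# Fewer distinct values -> higher level in the generated hierarchy.
hierarchy = geo.nunique().sort_values().index.tolist()
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```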
Data Cube
• A data cube is a multidimensional data structure used to represent
data in a way that facilitates online analytical processing (OLAP) and
efficient analysis from various perspectives.
• It organizes data along multiple dimensions, such as time, product,
location, or customer, with cells containing aggregated measures like
sales, count, or average
Data Cube Operations
Roll UP
• An operation that aggregates data along a dimension, combining lower-level values into higher-level summaries.

• For example, if the data cube displays a customer's daily income, we can use a roll-up operation to find the monthly income.
Drill Down
• This operation is the reverse of the roll-up operation.
• It allows us to take particular information and then subdivide it further for finer-granularity analysis.
• It zooms into more detail.
• For example- if India is an attribute of a country column and we wish
to see villages in India, then the drill-down operation splits India into
states, districts, towns, cities, villages and then displays the required
information.
Slicing
• This operation filters the unnecessary portions.
• Suppose in a particular dimension, the user doesn't need everything
for analysis, rather a particular attribute.
• For example, country=“Jamaica", this will display only about Jamaica
and only display other countries present on the country list.
Dicing
• This operation performs a multidimensional cut: rather than restricting a single dimension, it selects a range of values across two or more dimensions.

• As a result, it looks more like a sub-cube out of the whole.

• For example- the user wants to see the annual salary of Jharkhand
state employees.
Pivot
• This operation is very important from a viewing point of view.
• It basically transforms the data cube in terms of view.
• It doesn't change the data present in the data cube.
• For example, if the user is comparing year versus branch, using the
pivot operation, the user can change the viewpoint and now compare
branch versus item type.
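The cube operations above map naturally onto grouped aggregation in pandas; a minimal sketch on a made-up fact table (the column names and values are hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "month":   ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "country": ["India", "Jamaica", "India", "Jamaica", "India"],
    "branch":  ["B1", "B2", "B1", "B2", "B1"],
    "sales":   [100, 150, 120, 90, 130],
})

# Roll-up: aggregate months up to the year level.
rollup = sales.groupby(["country", "year"])["sales"].sum()

# Slice: keep a single member of one dimension.
jamaica = sales[sales["country"] == "Jamaica"]

# Dice: restrict several dimensions at once (a sub-cube).
sub_cube = sales[(sales["country"] == "India") & (sales["year"] == 2024)]

# Pivot: change the viewpoint, e.g. year versus branch.
view = sales.pivot_table(index="year", columns="branch", values="sales", aggfunc="sum")

print(rollup, jamaica, sub_cube, view, sep="\n\n")
```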
