Lec 3 Data Preprocessing and Transformation

The document discusses the importance of data preprocessing and transformation in big data analytics, highlighting key steps such as data collection, cleaning, integration, and reduction. It addresses common issues like missing values, noise, and inconsistencies, and outlines techniques for handling these challenges to improve data quality. The document emphasizes that effective data preprocessing is crucial for deriving accurate insights from analytics.


BIG DATA ANALYTICS

DATA PREPROCESSING AND TRANSFORMATION

Data Collection
Issues with Data
Data Cleaning: dealing with missing values, noise and outliers
Data Integration: removing inconsistencies, and deduplication
Data Reduction: Sampling and Feature Selection


Data Collection



Data Collection

Data collection is the first step in the data analysis pipeline
▷ Often from multiple sources

Importance: The quality and quantity of collected data directly influence the insights derived from big data analytics

Challenges: Ensuring data accuracy, dealing with large volumes, and integrating diverse data formats



Issues in Data Collection and Techniques

Identifying and addressing common issues in data collection is essential for ensuring the integrity of data

Incomplete data collection
Biases in data due to collection methods
Collection of irrelevant or redundant data

To overcome common issues, several techniques can be employed (a sketch follows below):

Automation: Use scripts and APIs to collect data systematically
Validation: Implement real-time data validation to catch errors early
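As a concrete illustration (not from the slides), here is a minimal Python sketch of collecting records over an API and validating them on arrival; the endpoint URL and the required fields are hypothetical.

```python
import requests  # any HTTP client would do

REQUIRED_FIELDS = {"id", "timestamp", "value"}  # hypothetical schema

def fetch_and_validate(url):
    """Collect records from an API and validate each one on ingest."""
    records = requests.get(url, timeout=10).json()
    clean, rejected = [], []
    for rec in records:
        # Real-time validation: catch missing fields / bad types early
        if REQUIRED_FIELDS.issubset(rec) and isinstance(rec["value"], (int, float)):
            clean.append(rec)
        else:
            rejected.append(rec)
    return clean, rejected

# clean, rejected = fetch_and_validate("https://api.example.com/readings")
```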
Data Preprocessing

Data preprocessing is a very important step
It helps improve quality of data
Makes the data ready and more suitable for analytics
Should follow and be guided by a thorough EDA
Issues with data

Bad Formatting: Grade 'A' vs. 'a'
Trailing Space: Extra spaces in commentary, white font ',' to avoid plagiarism detection
Duplicates and Redundant Data: A ball repeated could be confused with a wide/no ball, a grade repeated confused with repetition
Empty Rows: Could cause a lot of trouble during programming
Synonyms, Abbreviations: rhb, right hand batsman
Skewed Distribution and Outliers: Outliers could be points of interest or could be just noise, errors, extremities
Missing Values: Missing grades, missing score
Different norms, units, and standards: miles vs. kilometers (in 1999, NASA lost equipment worth $125m because of a metric vs. imperial unit mix-up)
Steps in Preprocessing

Steps and processes are performed when necessary

[Diagram: Data Preprocessing consists of Data Cleaning, Data Integration, Data Transformation, and Data Reduction]



Data Cleaning



Data Cleaning

Data cleaning is a critical process that ensures the accuracy and completeness of data in analytics
It involves correcting or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset

Objective: Enhance data quality to produce reliable analytics
Common Issues: Inconsistencies, missing values, noise, and outliers.

Also called data scrubbing, data munging, data wrangling

Dealing with: Missing values, Noise
Data Cleaning: Missing Values

Missing data is very common and generally significantly consequential

Causes:
changes in experiments
human/data entry error
measurement impossible
hardware failure
human bias
combined datasets
source: Azure AI Gallery

Missing values can have a meaning, e.g. absence of a medical test could mean that it was not conducted for a reason

Knowing why and how data is missing could help in data imputation

Data imputation is the process of filling in missing or incomplete data in a dataset.
Data Cleaning: Missing Values

Knowing why and how data is missing could help in data imputation

Missing Completely at Random (MCAR)
Missingness independent of any observed or unobserved variables
The data is missing in a purely random way.

Missing at Random (MAR)
Missingness independent of missing values or unobserved variables
Missingness depends on observed variables with complete info
Data is not missing completely randomly, but the missingness can be explained using other variables in the dataset.

Missing Not at Random (MNAR)
Missingness depends on the missing values or unobserved variable(s)


Data Cleaning: Missing Values - MCAR

Missing Completely at Random (MCAR)
Missingness independent of any observed or unobserved variables
Values of a variable being missing is completely unsystematic/random
This assumption can somewhat be verified by examining complete and incomplete cases
Data is likely a representative sample and analysis will be unbiased

Age | 25  26  29  30  30  31  44  46  48  51  52  54
IQ  | 121, 91, 110, 118, 93, 116 observed; the other six values are missing, scattered across ages

Note that values of the age variable are roughly the "same" when the IQ value is missing and when it is not



Data Cleaning: Missing Values - MAR

Missing at Random (MAR)
Missingness independent of missing values or unobserved variables
Missingness depends on observed variables with complete info
The event that a value for Variable 1 is missing depends only on other observed variables with no missing values
Not statistically verifiable (relies on subjective judgment)

Age | 25  26  29  30  30  31  44   46  48   51   52  54
IQ  | -   -   -   -   -   -   118  93  116  141  97  104

Note that only young people have missing values for IQ
Shouldn't be the case that only high IQ people have missing values
Or that only males have IQ values missing (unobserved variable)
Data Cleaning: Missing Values - MNAR

Missing Not at Random (MNAR)
Missingness depends on the missing values or unobserved variable(s)
Pattern is non-random, non-ignorable, and typically arises due to the variable on which the data is missing
Generally very hard to ascertain the assumption
e.g. only low IQ people have missing values
Or only males have missing IQ values

Age | 25  26  29  30  30  31  44  46  48  51  52  54
IQ  | 133, 121, 110, 118, 116, 141, 104 observed; the remaining (low) IQ values are missing



Data Cleaning: Dealing with missing values

Ignore the objects with missing attributes
May lose many objects
Ignore the attribute which has "many" missing values
May lose many meaningful attributes; what if class label is missing?
Impute Data
Domain knowledge and understanding of missing values help

source: towards data science


Data Cleaning: Data Imputation

Manually fill in; works for small data and few missing values
Use a global constant, e.g. MGMT Major, or Unknown, or ∞
Substitute a measure of central tendency, e.g. mode, mean or median
Missed Quiz: student mean, class mean, class mean in this or all quizzes, the student mean in remaining quizzes
Cricket DLS system
Use class-wise mean or median
For a missing player's score in a match, use the player's average, average of Pak batsmen, average of Pak batsmen against India, average of middle order Pak batsmen against India in Summer in Sharjah
Use average of top k similar objects ▷ based on non-missing attributes (see the sketch below)
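A minimal pandas/scikit-learn sketch of three of these options on a made-up table: global mean, class-wise (team-wise) mean, and the average of the k most similar objects via KNNImputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "team":  ["Pak", "Pak", "Pak", "Ind", "Ind", "Ind"],
    "balls": [30,    28,    55,    12,    40,    38],
    "runs":  [45.0, np.nan, 80.0,  10.0, np.nan, 52.0],
})

# Substitute a measure of central tendency (global mean)
df["runs_mean"] = df["runs"].fillna(df["runs"].mean())

# Class-wise mean (here: per team)
df["runs_team"] = df.groupby("team")["runs"].transform(lambda s: s.fillna(s.mean()))

# Average of top-k similar objects; similarity uses the non-missing attributes
imputed = KNNImputer(n_neighbors=2).fit_transform(df[["balls", "runs"]])
df["runs_knn"] = imputed[:, 1]
print(df)
```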
Data Cleaning: Data Imputation

Advanced techniques for imputing missing values (sketched below):
Expectation Maximization Imputation
Regression based Imputation
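As a sketch of regression-based imputation (in the spirit of EM), scikit-learn's IterativeImputer regresses each incomplete feature on the others and iterates; the numbers here are made up.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[25.0, 121.0],
              [26.0,  91.0],
              [30.0, np.nan],   # missing IQ, predicted from Age
              [44.0, 118.0],
              [52.0, np.nan]])

# Each column with missing entries is modelled by regression on the
# other columns, iterating until the estimates stabilise.
X_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
print(np.round(X_filled, 1))
```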



Data Cleaning: Noise

Noise: Random error or variation in measured data
Elimination is generally difficult
Analytics should be robust to have acceptable quality despite presence of noise



Data Cleaning: Handling Noise and Outliers

Noise and outliers can distort the true picture of data insights and must be managed carefully

Age | Salary
25  | 50,000
30  | 55,000
35  | 60,000
40  | 650,000

Table: Data with Outlier in Salary



Data Cleaning: Noise

Dealing with noise

Smoothing by Binning (see the sketch below)
Essentially replace each value by the average of values in the bin
Could be mean, median, midrange etc. of values in the bin
Could use equal width or equal depth (sized) bins

Smoothing by local neighborhoods
k-nearest neighbors, blurring, boundaries

Smoothing is also used for data reduction and discretization

Smoothing Time Series
Moving Average
Divide by variance of each period/cycle
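A minimal pandas sketch of smoothing by equal-depth bins and of a moving average, on a made-up series.

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (equal-sized) bins via quantiles; labels=False yields bin ids
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"raw": prices, "bin": bins, "smoothed": smoothed}))

# Moving average for time series smoothing
print(prices.rolling(window=3, center=True).mean())
```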
Data Cleaning: Correcting Inconsistencies

Inconsistencies in data can arise from various sources such as human error, data migration, or integration of multiple datasets

ID | Product Name | Price
1  | Product-A    | 20
2  | product-a    | 20
3  | PRODUCT-A    | 19

Table: Inconsistent Data Entries



Data Cleaning: Correcting Inconsistencies

Data can contain inconsistent values
e.g. an address with both ZIP code and city, but they don't match

source: medium.com

Some are easy to detect, e.g. negative age of a person
Some require consulting an external source
Correcting inconsistencies may require additional information (a sketch follows below)
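A sketch of catching both kinds of inconsistency on the product table above; the canonicalisation rule (strip spaces, unify case) is a simple assumed convention.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "ProductName": ["Product-A ", "product-a", "PRODUCT-A"],
    "Age": [25, -3, 40],  # a negative age is an easy-to-detect inconsistency
})

# Canonicalise text fields: strip stray spaces, unify case
df["ProductName"] = df["ProductName"].str.strip().str.lower()

# Flag values violating simple domain rules; fixing them may need
# external information (e.g. the correct age from another source)
print(df[df["Age"] < 0])
```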
Data Cleaning: Identifying Outliers

Outliers are either
Objects that have characteristics substantially different from most other data ▷ the object is an outlier
Value of a variable that is substantially different from the variable's typical values ▷ the feature value is an outlier

Unlike noise, outliers can be legitimate data or values
Outliers could be points of interest
Consider students record in Zambeel: what values of age could be noise, inconsistencies, or genuine outliers?
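One standard way to flag candidate outliers (not named on the slide, but common practice) is the 1.5×IQR rule; a sketch on the salary column shown earlier.

```python
import pandas as pd

salary = pd.Series([50_000, 55_000, 60_000, 650_000])

q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
# Values outside the usual 1.5*IQR fences are candidate outliers; whether
# they are noise or points of interest still needs domain judgment
fence_lo, fence_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salary[(salary < fence_lo) | (salary > fence_hi)])   # flags 650,000
```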
Data Integration



Data Integration

Data integration involves combining data from different sources to provide a unified view. This process is crucial for comprehensive analysis but comes with challenges

Objective: To merge diverse datasets into a coherent whole
Common Issues: Inconsistencies, entity resolution, duplication

Inconsistencies arise when data from different sources conflict in format, scale, or interpretation

Date (Source 1) | Date (Source 2)
2024-04-14      | 14/04/2024
2024-04-15      | 15/04/2024

Table: Format inconsistencies in date fields from two sources.
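A sketch of resolving the date-format conflict above with pandas: parse each source with its own explicit format, after which both sit on one canonical scale.

```python
import pandas as pd

s1 = pd.Series(["2024-04-14", "2024-04-15"])   # Source 1: ISO year-month-day
s2 = pd.Series(["14/04/2024", "15/04/2024"])   # Source 2: day/month/year

d1 = pd.to_datetime(s1, format="%Y-%m-%d")
d2 = pd.to_datetime(s2, format="%d/%m/%Y")

print((d1 == d2).all())   # True: identical dates once normalised
```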
Data Integration

Merging data from multiple sources
e.g. RO and Admissions Data, Cricinfo and PCB Data

Data merging causes or requires:
Entity identification problem
Data duplication and redundancy
Data conflict & inconsistencies
Data Integration

Entity Identification Problem: Objects do not have same IDs in all sources
e.g. Sentiment analysis on cricket match tweets to assess player contribution; Network Reconciliation Project

Schema Integration
Object Matching
Make sure that player ID in cricinfo dataset is the same as player code in PCB data (source of domestic games)

Check metadata, names of attributes, range, data types and formats



Data Integration

Object Duplication: instances/objects etc. may be duplicated

Occasionally two or more objects can have all feature values identical, yet they could be different instances
e.g. two students with the same grades in all courses



Data Integration

Redundancy and Correlation Analyses

Redundant (not necessarily duplicate) features
Sometimes caused by data integration ▷ Data duplication
An attribute is redundant if it can be derived from one or more others
e.g. if runs scored and balls faced are given, then no need to store strike rate
If aggregate score in course is given in absolute grading, then no need to store letter grade

Covariance/Correlation and χ²-statistics are used for pairs of numerical or ordinal/categorical attributes (a sketch follows below)
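A sketch of both checks on made-up data: correlation for a derivable numeric attribute, and scipy's chi2_contingency for two categorical attributes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"runs": [45, 80, 10, 52, 33],
                   "balls": [30, 55, 12, 40, 25]})
df["strike_rate"] = 100 * df["runs"] / df["balls"]

# High correlation among the columns hints that an attribute is derivable
print(df.corr())

# Chi-square test of independence for two categorical attributes
table = pd.crosstab(pd.Series(["A", "A", "B", "B", "B"], name="grade"),
                    pd.Series(["CS", "CS", "EE", "EE", "CS"], name="major"))
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```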



Data Integration

Data Value Conflict Detection and Resolution

Sometimes there are two conflicting values in different sources
e.g. name is spelled differently in educational and NADRA's records
This might require expert knowledge



Entity Resolution

Entity resolution is the process of linking and merging records that correspond to the same entities from different databases.

Name (Source 1) | Email (Source 1)      | Email (Source 2)
John Doe        | johndoe@example.com   | john.doe@example.com
Jane Smith      | janesmith@example.com | jane.smith@example.com

Table: Different email formats for the same individuals across sources.
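A sketch of linking the two sources on a normalised key, assuming (purely for illustration) that case and dots in the local part of an email are insignificant; real entity resolution needs stronger matching rules.

```python
import pandas as pd

src1 = pd.DataFrame({"name": ["Jane Smith"], "email": ["janesmith@example.com"]})
src2 = pd.DataFrame({"name": ["Jane Smith"], "email": ["jane.smith@example.com"]})

def email_key(e: str) -> str:
    """Matching key: lower-case, dots stripped from the local part."""
    local, _, domain = e.lower().partition("@")
    return local.replace(".", "") + "@" + domain

for df in (src1, src2):
    df["key"] = df["email"].map(email_key)

# Records sharing a key are linked as the same entity
print(src1.merge(src2, on="key", suffixes=("_s1", "_s2")))
```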



Data Integration: Data Duplication

Duplication occurs when identical or nearly identical records exist across datasets, leading to redundancy and possible errors in analysis.

Customer ID | Name
1           | John Doe
1           | John Doe

Table: Duplicate records in customer data.



Data Reduction



Data Reduction

Sometimes we do not need all the data
We reduce the data in either direction:
Reduce instances
Reduce dimensions

Helps reduce computational complexity
Reduces storage requirements
Makes data visualization more effective
Get a representative sample of data

[Figure: a four-class dataset and a random sample drawn from it]


Data Reduction: Sampling

Equal probability sampling of k out of n objects
select objects from an ordered sampling window
first select an object, then every (n/k)-th element (going circular)
If there is some peculiar regularity in how the objects are ordered, there is a risk of getting a very bad sample

Random Sampling of k out of n objects
Randomly permute objects (shuffle)
Select the first k in this order
Deals with the above regularity issue, but if there is big imbalance among classes or groups, we can get a very unrepresentative sample
Data Reduction: Sampling

Stratified Sampling of k out of n objects (see the sketch below)
Suppose data is grouped into groups (strata)
Randomly sample a k/n fraction from each stratum
New sample will exhibit the distribution of the population
Works for imbalanced classes but is computationally expensive

Clustered Sampling of k out of n objects
Cluster data items based on some 'similarity' (details later)
Randomly sample a k/n fraction from each cluster
Efficient but not necessarily optimal; the similarity definition matters
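A minimal pandas sketch of stratified sampling: draw the same fraction from every stratum so the sample preserves the population's class proportions; the labels are made up.

```python
import pandas as pd

df = pd.DataFrame({"label": ["stay"] * 97 + ["attrite"] * 3,
                   "x": range(100)})

# Sample the same fraction from each stratum (class label)
stratified = df.groupby("label").sample(frac=0.30, random_state=0)
print(stratified["label"].value_counts())  # roughly 29 'stay', 1 'attrite'
```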
Data Reduction: Sampling

Imbalanced Classes: Classes or groups have huge difference in frequencies and the target class is rare
Class imbalance is a common issue where some classes are significantly underrepresented in the data, potentially leading to biased models.

Attrition prediction: 97% stay, 3% attrite (in a month)
Medical diagnosis: 95% healthy, 5% diseased
eCommerce: 99% do not buy, 1% buy
Security: > 99.99% of people are not terrorists
Similar situation with multiple classes
Predictions can be 97% correct, but useless
Data Reduction: Feature Selection

More importantly, one does dimensionality reduction

We will study in quite detail the Curse of Dimensionality (problems associated with high dimensions and difficulties in dealing with higher dimensional vectors)

We will discuss these techniques for dimensionality reduction (time permitting):
Locality Sensitive Hashing
Johnson-Lindenstrauss Transform
AMS Sketch
PCA and SVD



Data Reduction: Feature Selection & Extraction

Represent data by fewer (and “better”) attributes

The new features should be so that the probability


distribution of class is roughly the same as the one
obtained from original features

Feature Feature
orgina Selection Extraction
l
data

new
represent
.



Data Reduction: Feature Selection and Correlation Analysis

Feature selection reduces the number of input variables by selecting only the relevant features, often using statistical tests for association like correlation coefficients or chi-square tests.

High correlation between two features might mean redundancy.
Chi-square tests are used to determine the independence of two categorical variables (a sketch follows below).
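A sketch of chi-square-based selection with scikit-learn's SelectKBest on synthetic count data; which columns get picked depends on the random draw, so the outcome is only "likely".

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 5))    # chi2 requires non-negative features
y = (X[:, 0] + X[:, 3] > 9).astype(int)   # class driven by features 0 and 3

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support(indices=True))  # likely columns 0 and 3
```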



Data Transformation



Data Transformation

Data transformation involves converting raw data into a format that is more appropriate for analysis.
Values in original data are transformed via a mathematical function so that:

Compatibility with machine learning algorithms
Analytics is more efficient - improved data consistency
Analytics is more meaningful - enhanced model accuracy
Visualization is more meaningful and easier

[Figure: Data Transformation; source: 7B Software]



Data Transformation

Values in original data are transformed via a mathematical function
Depending on given data and requirements of analytics, this may include:

Ordinal to Numeric ▷ We will discuss it later
Smoothing ▷ e.g. by binning, see dealing with noise
Aggregation (e.g. GPA from grades)
Discretization and Quantization ▷ needed e.g. for decision trees
Standardization, scaling and normalization

source: www.audiolabs-erlangen.de
Standardization and Scaling

The goal is to make an entire set of values have a particular property
e.g. variables to have the same range, same unit (or lack thereof)
or to shift the data to a manageable range, e.g. shifting to positive

Variety of possibilities for different applications


Standardization and Scaling

Scaling data so it falls in a smaller, comparable or manageable range

Data could be in different units e.g. kilometers and miles
Units might not be known
Small units mean larger values and larger ranges
In values of "norms" and many distance measures, attributes with smaller units get more weight than attributes with larger units
After scaling, all attributes get the same weight
Huge implications in distance values (see clustering & recommenders)



MAX-MIN Scaling

Transform the data (values of an attribute X) so that all values are ≤ 1:

    x'_i = x_i / X_max

[Figure: number line, X ranging over [X_min = 70, X_max = 100] maps to X' ending at 1]

new max is 1 ▷ new min could be negative
Preserves relationships among original objects
max, min, median and all quantiles are the same objects
May get a very narrow range within [0, 1]

Original Value | Scaled Value
10             | 0
20             | 0.5
30             | 1

Table: before and after Min-Max Scaling
MAX-MIN Scaling

Transform the data (values of an attribute X) to the interval [0, 1]:

    x'_i = (x_i − X_min) / (X_max − X_min)

[Figure: number line, X ranging over [X_min, X_max] maps to X' in [0, 1]]

First shift everything to [0, sth] by subtracting X_min
We get a different (scaled) std-dev; can suppress effect of outliers
If attribute Y is also scaled similarly, then X and Y are comparable
Two sections, one with harsh and one with lenient grading, become comparable (see the sketch below)
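A numpy sketch of both variants from these slides on the example column.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Divide by the max: new max is 1, new min could be negative
x_div = x / x.max()                          # [0.333, 0.667, 1.0]

# Min-max scaling: map onto [0, 1]
x_mm = (x - x.min()) / (x.max() - x.min())   # [0.0, 0.5, 1.0]
print(x_div, x_mm)
```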
z-score Normalization

Transform the data to a scale with mean 0 and std-dev 1:

    x'_i = (x_i − x̄) / σ_x

Good if we don't know min/max (no full data) or outliers are dominant; in such cases MAX-MIN scaled data is harder to interpret
Resulting data have properties of the standard normal ▷ µ = 0, σ = 1
Stable data, common scale; all variables are unit-less and scalar
Again the relative order of points is maintained
It makes no difference to the shape of a distribution

Sec1     | 90   | 10   | 50   | 30   | 40   | 80   | 74   | 68  | 61
Sec2     | 63   | 40   | 35   | 38   | 21   | 18   | 28   | 19  | 30
Sec1 (z) | 1.47 | −1.9 | −.24 | −1.0 | −.65 | .99  | 0.75 | .5  | .21
Sec2 (z) | 2.3  | .3   | −.14 | .13  | .3   | −1.6 | −.74 | .04 | −.57
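A numpy sketch of z-score normalisation on the two sections above, using the population standard deviation.

```python
import numpy as np

sec1 = np.array([90, 10, 50, 30, 40, 80, 74, 68, 61], dtype=float)
sec2 = np.array([63, 40, 35, 38, 21, 18, 28, 19, 30], dtype=float)

def zscore(x):
    # (x - mean) / population std-dev -> mean 0, std-dev 1, unit-less
    return (x - x.mean()) / x.std()

print(np.round(zscore(sec1), 2))
print(np.round(zscore(sec2), 2))
# The two sections now sit on a common scale and can be compared
```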
Other families of transformation

In statistical analysis we often transform a variable X by a function f(X) of that variable

It changes the distribution of X or the relationship of X with another variable Y
"Transformations are needed because there is no guarantee that the world works on the scales it happens to be measured on"
Often it helps and is needed to transform the results back to the original scale by taking the inverse transform
Mathematical transformations are applied to data to improve its properties for analysis, which includes enhancing normality, linear relationships, and uniformity across features
Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X) of that variable

Convenience
Improve the statistical properties of the data
Reduced skew
Equal Spreads - homogeneity of variance
Linear relationship: Normalize relationships between features for better correlation analysis
Additive relations
Enhance algorithm convergence speeds and accuracy

For one variable the first three reasons apply
Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Convenience

The transformed scale may be as natural as the original and more convenient for a specific purpose
Since transformations often change units, one can transform the data to a unit that is easier to think about
z-score normalization is extremely useful for comparing variables expressed in different units
Rather than 101/120, 130/140, and 10/73, it is easier to work with percentages
We might want to work with sines rather than degrees


Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Reducing Skew

Many statistical models assume data is from a certain distribution with fixed parameters ▷ Generally the (easiest) normal distribution
Needed to say something like the probability to get a max/mean etc.
Assumption doesn't have to be true ▷ Data might have skew



Reasons for Transformation

In statistical analysis we often transform a variable X by a function f(X)

Equal Spread, Homoskedasticity

Data is transformed to achieve approximately equal spread across the regression line (marginals)
Homoskedasticity: Subsets of data having roughly equal spread
Its opposite property is heteroskedasticity



Common Transformations

In statistical analysis we often transform a variable X by a function f(X)

All the following transformations improve normality
Some reduce the relative distance among values while still preserving the relative order
They reduce the relative distance of values on the right side (larger values) more than the values on the left side
They are used to reduce right skew of data
The issue of dealing with left skew of data is discussed afterwards



Transformations to Reduce Right Skew

Right skew in data can be handled effectively using transformations that compress large values more than smaller ones

Logarithmic Transformation: Reduces multiplicative relationships to additive.
Square Root Transformation: Mildly reduces skew and is useful for count data.



Common Transformations: Logarithms

    x' = log x

It has a major effect on the shape of the distribution
Commonly used to reduce right skewness
Often appropriate for measured variables (real numbers)
Since the log of negative numbers is not defined and that of numbers 0 < x < 1 is negative, we must shift values to a minimum of 1.00
Can use different bases (commonly used: natural log, base 2, base 10)
One often tries multiple first to settle on one

Higher bases pull in larger values more drastically
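A numpy sketch of the right-skew reducing transforms from these slides on a made-up skewed sample; note the shift to a minimum of 1 before taking logs.

```python
import numpy as np

x = np.array([1, 2, 2, 3, 4, 5, 8, 15, 40, 120], dtype=float)  # right-skewed

x_shifted = x - x.min() + 1.0    # ensure a minimum of 1 before the log
x_log  = np.log10(x_shifted)     # strongest compression of large values
x_cbrt = np.cbrt(x)              # weaker than log; handles 0 and negatives
x_sqrt = np.sqrt(x)              # mild; commonly applied to count data

for name, t in [("log10", x_log), ("cbrt", x_cbrt), ("sqrt", x_sqrt)]:
    print(name, np.round(t, 2))
```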


Common Transformations: Cube-root

    x' = x^(1/3)

Has significant effect on shape of distribution ▷ weaker than log
Reduces right skew
Can be applied to 0 and negative numbers
Cube root of a volume has the units of a length



Common Transformations: Square-root

    x' = √x

Reduces right skew
Square root of an area has the unit of a length
Commonly applied to counted data
Negative values must first be shifted to positives
Important consideration: roots of x ∈ (0, 1) are ≥ x, while roots of x ∈ [1, ∞) are ≤ x, so we must be careful
Might not be desirable to treat some numbers differently than others, though the relative order of values will be maintained
Reciprocal and Negative Reciprocal Transformations

    x' = 1/x   OR   x' = −1/x

Cannot be applied to 0 ▷ used when all data is positive or negative
population density (people per unit area) becomes area/person
persons per doctor becomes doctors per person
rates of erosion become time to erode a unit depth

Reciprocal reverses order among values of the same sign
Makes very large numbers very small and very small numbers very large
Negative reciprocal preserves order among values of the same sign; this is commonly used
Left Skewed Data: Squares and higher powers

    x' = x^2

All the above transformations essentially deal with right skew
Left skew (or negative skew) can be reduced by applying transformations that expand smaller values more significantly
For left skew, first reflect the data (multiply by −1) and then apply these transformations
Generally one needs to shift the data to a new minimum of 1.0 after reflection and then apply the transform

Squaring: Amplifies larger values disproportionately compared to smaller ones; suitable for data with negative values after adjustment. Has a moderate effect on the shape of the distribution and can be used to reduce left skew
Cubing: Stronger effect than squaring; can also handle zero and negative values.

Transformation to make linear relationship

Suppose we want to describe a variable Y in terms of X
We want to express it as a linear relationship

    Y = aX + b

Transformation in many cases helps us fit a good line



Transformation to make linear relationship

    Y = aX + b

[Figures: scatter plots with the fitted line Y = aX + b on the original scale]


Transformation to make linear relationship

    Y = aX + b

Instead, express as Y = aX² + b

Can also do log Y = aX + b (see the sketch below)
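A numpy sketch of both options on synthetic data: regress Y on X² for a quadratic relationship, and regress log Y on X for an exponential one; polyfit stands in for any least-squares routine.

```python
import numpy as np

x = np.linspace(1, 10, 20)
rng = np.random.default_rng(0)

# Quadratic relationship: transform X so a straight line fits
y_quad = 3 * x**2 + 5 + rng.normal(0, 2, x.size)
a, b = np.polyfit(x**2, y_quad, deg=1)       # fit Y = a*X^2 + b
print(a, b)                                   # close to 3 and 5

# Exponential relationship: transform Y instead
y_exp = np.exp(0.4 * x + 1.0)
a, b = np.polyfit(x, np.log(y_exp), deg=1)    # fit log Y = a*X + b
print(a, b)                                   # close to 0.4 and 1.0
```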
