Data Preprocessing: A Primer
Data preprocessing is a foundational step in the data
mining process.
It entails preparing raw data for analysis by transforming
and refining it. This presentation walks through profiling, cleaning,
reducing, transforming, enriching, and validating data to ensure it is
ready for analysis.
What Is Data Preprocessing?
Data preprocessing is the process of converting raw data
into a clean and analyzable format.
It involves multiple steps, including cleaning,
transformation, and reduction.
This initial phase is pivotal, as the quality and precision of
data preprocessing can dictate the success of subsequent
analytical procedures.
Steps of Data Preprocessing
Data Profiling
Data profiling is the process of examining datasets to gather
descriptive statistics about the data.
It provides a summary of a dataset's attributes, patterns,
anomalies, and unique values.
By understanding the data's structure, relationships, and
inconsistencies, data profiling lays the groundwork for
further data preprocessing and quality enhancement, ensuring
that the data is well-understood before any advanced
processing or analysis.
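As a minimal illustration, a first profiling pass with pandas might look like the sketch below (the file name and columns are hypothetical placeholders):

```python
import pandas as pd

# Load a dataset to profile (file name is a placeholder for illustration).
df = pd.read_csv("customers.csv")

# Structure: column names, dtypes, and non-null counts.
df.info()

# Descriptive statistics for numeric and categorical attributes.
print(df.describe(include="all"))

# Missing values, unique values, and duplicate rows.
print(df.isna().sum())
print(df.nunique())
print("duplicate rows:", df.duplicated().sum())
```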
Data Cleaning
Dirty data can be a major impediment in data analysis.
Errors, inconsistencies, and redundancies can mislead
analysts and produce skewed results.
Data cleaning becomes imperative to ensure the integrity
of data by spotting and correcting inaccuracies.
Handling Missing Values
Missing values are a common issue in datasets. Their
presence can distort data analysis and lead to incorrect
interpretations.
Techniques like imputation, predictive filling, and
elimination are employed based on the nature and pattern
of the missing data to ensure completeness.
Imputation: Replace missing values with statistical
measures like mean, median, or mode. For categorical data,
the mode is often used.
Deletion: Remove rows with missing values, especially if
the data is randomly missing and its absence doesn't create
bias.
Prediction: Use algorithms to predict and fill missing
values based on other attributes.
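A minimal sketch of these three options, assuming pandas and scikit-learn and a small made-up table:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, None, 40, 35, None],
    "income": [50_000, 62_000, None, 58_000, 61_000],
    "city":   ["Oslo", "Bergen", None, "Oslo", "Oslo"],
})

# Imputation: mean for numeric columns, mode for the categorical column.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Deletion: drop any rows that still contain missing values.
df_dropped = df.dropna()

# Prediction: estimate remaining numeric gaps from similar rows (k-nearest neighbours).
imputer = KNNImputer(n_neighbors=2)
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
```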
Handling Outliers
Outliers can introduce variance and bias.
While sometimes they carry significant information, they
can also distort results. Various statistical methods are
available to detect outliers.
Based on their nature, outliers can be adjusted, removed, or
even retained.
Identifying Outliers:
Use statistical measures like IQR (Interquartile Range) or Z-score, or
visual tools like box plots and scatter plots, to detect outliers.
Understand if the outliers are genuine or data errors.
Handling Outliers:
Deletion: Remove outliers if they are the result of data entry
errors.
Transformation: Use log or square root transformations to
reduce the impact of outliers.
Capping: Limit the maximum and minimum values for certain
attributes.
Imputation: Replace outliers with mean/median/mode values.
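For example, the IQR rule and the handling options above could be sketched as follows (illustrative data; pandas and NumPy assumed):

```python
import numpy as np
import pandas as pd

values = pd.Series([12, 14, 15, 13, 14, 120, 15, 13, 14, 16])

# Identification: the IQR rule flags anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Deletion: keep only values inside the fences.
cleaned = values[(values >= lower) & (values <= upper)]

# Capping: clip extreme values to the fences instead of removing them.
capped = values.clip(lower=lower, upper=upper)

# Transformation: a log transform compresses large values and reduces skew.
logged = np.log1p(values)

# Imputation: replace flagged outliers with the median of the series.
imputed = values.mask((values < lower) | (values > upper), values.median())
```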
Data Reduction
Data reduction refers to the process of transforming large volumes of
data into a reduced representation, retaining as much meaningful
information as possible.
The goal is to simplify, compress, or condense the original data,
making it more manageable and easier to analyze, without losing
significant information.
Common Techniques:
Dimensionality Reduction: Reducing the number of random
variables under consideration. Techniques include Principal
Component Analysis (PCA), Linear Discriminant Analysis
(LDA), and feature selection methods.
Binarization: Reducing data by converting numerical values into
binary values (0 or 1).
Histogram Analysis: Dividing data into bins and then representing
each value by its bin.
Clustering: Grouping similar data points together. Algorithms
include K-means, hierarchical clustering, and DBSCAN.
Aggregation: Summarizing and grouping data in various ways,
like computing the sum, average, or count for groups of data.
Sampling: Using a subset of the data that's representative of
the entire dataset.
Data Compression: Techniques such as Run-Length Encoding (RLE)
or lossy compression as used in JPEG images.
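A brief sketch of three of these techniques (PCA, sampling, and binning), assuming scikit-learn and a synthetic dataset:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 10)),
                  columns=[f"x{i}" for i in range(10)])

# Dimensionality reduction: project 10 features onto 3 principal components.
components = PCA(n_components=3).fit_transform(df)

# Sampling: keep a 10% random subset that stands in for the full dataset.
sample = df.sample(frac=0.1, random_state=0)

# Histogram analysis / binning: represent a continuous column by its bin index.
df["x0_bin"] = pd.cut(df["x0"], bins=5, labels=False)
```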
Challenges:
Loss of Information: Some data reduction techniques can lead
to the loss of original data.
Complexity: The process can be computationally intensive or
complex, especially with high-dimensional data.
Reversibility: Some reduction techniques are irreversible,
meaning the original data cannot be reconstructed from the
reduced data.
Data Transformations
Data transformation is the process of converting data from one
format, structure, or value to another to make it suitable for various
analytical needs or specific tasks.
The aim is to improve the data's quality and usability by ensuring it
is in the most appropriate form to meet the requirements of
different operations, such as data analysis, machine learning, or
visualization.
Reasons for Data Transformation:
Compatibility: Ensuring data from different sources aligns
well for consolidated analyses.
Performance: Optimizing data for faster queries or algorithm
processing.
Analysis Requirements: Certain algorithms or analytical
techniques require data in specific formats.
Common Techniques:
Normalization: Scaling numerical data to fall within a smaller,
standard range, like 0-1.
Standardization (Z-Score Normalization): Rescaling data so
it has a mean of 0 and a standard deviation of 1.
One-Hot Encoding: Converting categorical variables into binary
indicator columns so they can be supplied to machine learning
algorithms.
Binning: Converting continuous data into discrete intervals or
bins.
Log Transformation: Used to make skewed data closer to a normal
(Gaussian) distribution.
Feature Extraction: Creating new variables from existing ones, for
example via Principal Component Analysis (PCA).
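The techniques above can be sketched in a few lines, assuming pandas and scikit-learn and an invented table:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "income": [30_000.0, 45_000.0, 80_000.0, 1_200_000.0],
    "city":   ["Oslo", "Bergen", "Oslo", "Trondheim"],
})

# Normalization: rescale income into the 0-1 range.
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardization (z-score): mean 0, standard deviation 1.
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Binning: map income into three labelled intervals.
df["income_band"] = pd.cut(df["income"], bins=3, labels=["low", "mid", "high"])

# Log transformation: compress the long right tail of income.
df["income_log"] = np.log1p(df["income"])

# One-hot encoding: expand the categorical column into indicator columns.
df = pd.get_dummies(df, columns=["city"])
```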
Handling Complex Data:
Date and Time Transformation: Extracting specific
components like day, month, year, or time of day.
Text Transformation: Techniques like tokenization,
stemming, or encoding to convert text into numerical data.
Spatial Transformation: Converting spatial data (like latitude
and longitude) into distinct zones or distances.
Data Integration:
Aggregation: Combining multiple data rows into a single row,
often using methods like sum, average, or count.
Pivoting: Rotating data from a long format to a wide format, or
vice versa.
Joining: Combining data from multiple sources based on
common attributes.
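A short sketch combining date extraction, joining, aggregation, and pivoting with pandas (the two tables are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["A", "B", "A", "C"],
    "amount":   [120.0, 80.0, 200.0, 50.0],
    "ordered_at": pd.to_datetime(
        ["2024-01-05", "2024-01-07", "2024-02-11", "2024-02-12"]),
})
customers = pd.DataFrame({"customer": ["A", "B", "C"],
                          "region":   ["North", "South", "North"]})

# Date and time transformation: extract month and weekday components.
sales["month"] = sales["ordered_at"].dt.month
sales["weekday"] = sales["ordered_at"].dt.day_name()

# Joining: combine the two sources on the shared 'customer' attribute.
merged = sales.merge(customers, on="customer", how="left")

# Aggregation: total and average spend per region.
summary = merged.groupby("region")["amount"].agg(["sum", "mean"])

# Pivoting: long format to wide format (regions as rows, months as columns).
wide = merged.pivot_table(index="region", columns="month",
                          values="amount", aggfunc="sum")
```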
Data Enrichment
Data enrichment is the process of enhancing, refining, and
improving raw data by supplementing it with relevant information
from external sources. The main goal of data enrichment is to add
value to the original dataset, making it more comprehensive,
accurate, and insightful for decision-making or analysis.
Purpose of Data Enrichment:
Completeness: Filling gaps in datasets with missing or incomplete
information.
Accuracy: Correcting or verifying existing data entries.
Enhanced Insights: Adding depth and context to facilitate better
analyses and informed decision-making.
Common Techniques & Sources:
Third-party Databases: Leveraging external databases to pull in
relevant data, such as demographic information or industry statistics.
Web Scraping: Extracting data from websites to supplement existing
datasets.
Geospatial Enrichment: Augmenting datasets with geographical or
locational data.
Social Media & Online Platforms: Pulling user-generated content or
sentiments to enhance consumer data.
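As a minimal illustration, enrichment often amounts to joining internal records against an external reference table (both tables below are hypothetical):

```python
import pandas as pd

# Internal dataset: customers known only by a postal code.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "postal_code": ["0150", "5003", "7010"],
})

# External reference data, e.g. from a third-party or open dataset.
postal_lookup = pd.DataFrame({
    "postal_code":   ["0150", "5003", "7010"],
    "city":          ["Oslo", "Bergen", "Trondheim"],
    "median_income": [620_000, 540_000, 560_000],
})

# Enrichment: supplement internal records with the external attributes.
enriched = customers.merge(postal_lookup, on="postal_code", how="left")
```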
Benefits:
Enhanced Decision-making: Offers a broader perspective by adding
layers of context to existing data.
Personalization: Helps in tailoring products, services, or content to
individual preferences or profiles.
Better Segmentation: Facilitates a deeper understanding of customer
segments.
Improved Data Quality: Boosts the reliability and accuracy of the
dataset.
Data Validation
Data validation is the process of ensuring that data is accurate,
reliable, and meets the specified criteria before it's used in any
system or analysis. It checks the quality and integrity of the data,
ensuring that it's free from errors and inconsistencies.
Purpose of Data Validation:
Accuracy: Ensure that the data collected or entered is correct.
Consistency: Make sure data is logical and consistent across
datasets.
Completeness: Check that no essential data points are missing.
Reliability: Ensure that the data can be trusted for decision-making
and analysis.
Common Techniques:
Range Check: Verifying that a data value falls within a
specified range.
Format Check: Ensuring data is in a specific format, like a
valid email address or phone number.
List Check: Validating data against a predefined list of
acceptable values.
Consistency Check: Ensuring data doesn't have contradictions,
such as a date of birth indicating a person is 150 years old.
Uniqueness Check: Verifying that entries in a unique field, like
a user ID, are not duplicated.
Logical Check: Confirming that data combinations make logical
sense, such as gender and salutation alignment.
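These checks can be expressed as simple rules over a DataFrame; the sketch below uses pandas and an invented table, and the email pattern is deliberately simplistic:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "age":     [34, 151, 28, -3],
    "email":   ["a@example.com", "not-an-email", "c@example.com", "d@example.com"],
    "country": ["NO", "SE", "XX", "DK"],
})

violations = {}

# Range check: ages must fall between 0 and 120.
violations["age_out_of_range"] = df[~df["age"].between(0, 120)]

# Format check: a crude email pattern (illustrative, not exhaustive).
email_ok = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
violations["bad_email"] = df[~email_ok]

# List check: country codes must come from a predefined list.
violations["unknown_country"] = df[~df["country"].isin({"NO", "SE", "DK"})]

# Uniqueness check: user_id must not be duplicated.
violations["duplicate_id"] = df[df["user_id"].duplicated(keep=False)]

for rule, rows in violations.items():
    print(rule, "->", len(rows), "violation(s)")
```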
Types of Data Validation:
Manual Verification: Humans checking data for errors, often
used for subjective data.
Automated Validation: Using software or algorithms to check
data against certain rules or patterns.
Real-time Validation: Validating data immediately as it's entered
into a system, often seen in online forms.
Benefits:
Reduced Errors: Minimizing the number of inaccuracies and
mistakes in datasets.
Improved Decision-making: Reliable data leads to more
accurate analyses and better decisions.
Efficiency: Identifying and correcting errors early can save time
and resources later on.
Compliance: Meeting regulatory and industry standards that
require accurate data.
Challenges & Considerations:
False Positives/Negatives: Validation rules might incorrectly
flag valid data or miss invalid data.
Complexity: As data grows in volume and variety, validation
can become more complex.
Balancing Rigor with Flexibility: Overly strict validation
rules can reject data that's slightly off but still valuable.
Post-validation Activities:
Data Cleansing: Once data validation identifies errors, the
next step is often to clean or correct the data.
Feedback Loops: Especially in real-time validation, providing
users with immediate feedback can help correct errors at the
source.