
Data Cleaning & Data Preprocessing
Dr. Najah Al-shanableh
Data Cleaning
• Data cleaning involves identifying and
correcting errors in datasets, such as missing
values, inconsistent data, and outliers.
Data Preprocessing
• Data preprocessing is the process of
transforming raw data into a suitable format
for analysis and modeling.
Data Quality
Understanding the Multi-Dimensionality of Data Quality

• Data is said to be of good quality as long as it can satisfy the requirements of its
intended use. Factors that make up data quality include:
• 1. Accuracy
• Accuracy refers to how well the information recorded reflects a real event or object.
Data inaccuracy can be caused by faulty data collection instruments or computer
errors, purposeful submission of incorrect data values by users, errors in data
transmission, inconsistencies in naming conventions and input formats, etc.
• 2. Completeness
• Data is considered “complete” when all the mandatory or necessary features are
present. The incompleteness of data can occur due to unavailability of requisite
information, equipment malfunctions during data collection, unintended deletion,
or failure to record history or modifications.
• 3. Consistency
• If the same information stored and used in multiple instances matches across the
various sources or datastores (allowing for formatting differences), then the data is
consistent. Consistency is quantitatively expressed as the percentage of values that
match across the different stored instances.
Understanding the Multi-Dimensionality of Data Quality

• 4. Timeliness
• Timeliness also affects data quality, as data is of value only when it is available
when needed. If the data is outdated, or corrections are incorporated only after the
dataset has been evaluated or analyzed, data quality suffers.
• 5. Believability
• The believability describes the trust the users have in the data. If the data was at
any point found to be rife with errors and inconsistencies, its users will likely harbor
reservations when it comes to using this data in the future.
• 6. Interpretability
• Interpretability of data defines how easy it is to understand the information present
in the dataset and derive meaning from it. The availability of statistical data
collection and processing methodologies for the users can affect the
interpretability of the dataset.

• Ref: https://www.ovaledge.com/blog/data-quality-metrics
Best Practice

• Ref: https://estuary.dev/data-quality/
Data Cleaning & Data Preprocessing
• Data quality is paramount in data science and machine learning. The input
data quality heavily influences machine learning models' performance. In
this context, data cleaning and preprocessing are not just preliminary
steps but crucial components of the machine learning pipeline.
• Data cleaning involves identifying and correcting errors in the dataset,
such as dealing with missing or inconsistent data, removing duplicates,
and handling outliers. Ensuring you train the machine learning model on
accurate and reliable data is essential. Without proper cleaning, the model
may learn from incorrect data, leading to inaccurate predictions or
classifications.
• On the other hand, data preprocessing is a broader concept that includes
data cleaning and other steps to prepare the data for machine learning
algorithms. These steps may include data transformation, feature
selection, normalization, and reduction. The goal of data preprocessing is
to convert raw data into a suitable format that machine learning
algorithms can learn from.
Data Cleaning & Data Preprocessing
• The importance of data cleaning and data
preprocessing cannot be overstated, as it can
significantly impact the model's performance. A
well-cleaned and preprocessed dataset can lead
to more accurate and reliable machine learning
models, while a poorly cleaned and preprocessed
dataset can lead to misleading results and
conclusions.
Common Data Preprocessing Techniques
• Data cleaning, transformation, feature selection, normalization, and data reduction.
What is Data Cleaning?
• In data science and machine learning, the quality of input
data is paramount. It's a well-established fact that data
quality heavily influences the performance of machine
learning models. This makes data cleaning, the detection and
correction (or removal) of corrupt or inaccurate records from
a dataset, a critical step in the data science pipeline.
• Data cleaning is not just about erasing data or filling in
missing values. It's a comprehensive process involving
various techniques to transform raw data into a format
suitable for analysis. These techniques include handling
missing values, removing duplicates, data type conversion,
and more. Each technique has its specific use case and is
applied based on the nature of the data and the requirements
of the analysis.
Data Cleaning Process
• 1. Identifying duplicates
• 2. Fixing errors
• 3. Filtering outliers
• 4. Handling missing values.
Step-by-Step Guide to Data Cleaning
1. Identifying and Removing Duplicate or Irrelevant Data: Duplicate data can arise
from various sources, such as the same individual participating in a survey multiple
times or redundant fields in the data collection process. Irrelevant data refers to
information you can safely remove because it is not likely to contribute to the
model's predictive capacity. This step is particularly important when dealing with
large datasets.
2. Fixing Syntax Errors: Syntax errors can occur due to inconsistencies in data entry,
such as date formats, spelling mistakes, or grammatical errors. You must identify
and correct these errors to ensure the data's consistency. This step is crucial in
maintaining the quality of data.
3. Filtering out Unwanted Outliers: Outliers, or data points that significantly deviate
from the rest of the data, can distort the model's learning process. These outliers
must be identified and handled appropriately by removal or statistical treatment.
This process is a part of data reduction.
4. Handling Missing Data: Missing data is a common issue in data collection.
Depending on the extent and nature of the missing data, you can employ different
strategies, including dropping the data points or imputing missing values. This step
is especially important when dealing with large datasets.
5. Validating Data Accuracy: Validate the accuracy of the data through cross-checks
and other verification methods. Ensuring data accuracy is crucial for maintaining
the reliability of the machine-learning model. This step is particularly important for
data scientists as it directly impacts the model's performance.
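
As a concrete illustration, the first four steps above can be sketched in pandas. This is a minimal sketch on a hypothetical DataFrame; the column names and values are invented for the example.

import pandas as pd

# Hypothetical example data; columns and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "city": ["amman ", "amman ", "Amman", "Irbid", "irbid"],
    "age": [34, 34, 29, 210, 41],  # 210 is an obvious data-entry outlier
})

# Step 1: identify and remove exact duplicate rows.
df = df.drop_duplicates()

# Step 2: fix syntax errors, e.g. trim whitespace and normalize capitalization.
df["city"] = df["city"].str.strip().str.title()

# Step 3: filter out unwanted outliers with a simple IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Step 4: handle missing values; dropping incomplete rows is the simplest strategy.
df = df.dropna()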
Common Data Cleaning Techniques

• Handling Missing Values: Missing data can occur for
various reasons, such as errors in data collection or transfer.
There are several ways to handle missing data, depending
on the nature and extent of the missing values.
• Imputation: Here, you replace missing values with
substituted values. The substituted value could be a central
tendency measure like mean, median, or mode for
numerical data or the most frequent category for
categorical data. More sophisticated imputation methods
include regression imputation and multiple imputation.
• Deletion: You remove the instances with missing values
from the dataset. While this method is straightforward, it
can lead to loss of information, especially if the missing
data is not random.
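
A minimal pandas sketch of both strategies, on a hypothetical DataFrame with invented values:

import pandas as pd

# Hypothetical data with missing entries; names are illustrative only.
df = pd.DataFrame({
    "income": [42000.0, None, 58000.0, 61000.0, None],
    "segment": ["A", "B", None, "B", "B"],
})

# Imputation: mean (or median) for numerical data,
# mode (most frequent category) for categorical data.
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].mean())
imputed["segment"] = imputed["segment"].fillna(imputed["segment"].mode()[0])

# Deletion: drop every row containing a missing value.
# Straightforward, but information is lost if the missingness is not random.
deleted = df.dropna()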
Common Data Cleaning Techniques

• Removing Duplicates: Duplicate entries can occur for various reasons, such as data
entry errors or data merging. These duplicates can skew the data and lead to
biased results. Techniques for removing duplicates involve identifying these
redundant entries based on key attributes and eliminating them from the dataset.
• Data Type Conversion: Sometimes, the data may be in an inappropriate format for
a particular analysis or model. For instance, a numerical attribute may be recorded
as a string. In such cases, data type conversion, also known as typecasting, is used
to change the data type of a particular attribute or set of attributes. This process
involves converting the data into a suitable format that machine learning
algorithms can easily process.

• Outlier Detection: Outliers are data points that significantly deviate from other
observations. They can be caused by variability in the data or errors. Outlier
detection techniques are used to identify these anomalies. These techniques
include statistical methods, such as the Z-score or IQR method, and machine
learning methods, such as clustering or anomaly detection algorithms.
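
To make the last two techniques concrete, here is a small pandas sketch of type conversion and Z-score outlier detection; the price values are invented, and the 3-standard-deviation threshold is one common convention.

import pandas as pd

# Hypothetical data: a numeric attribute that arrived as strings.
prices = ["10.5", "12.0", "11.2", "9.8", "10.9", "11.5",
          "10.2", "11.8", "10.7", "11.1", "10.4", "250.0"]
df = pd.DataFrame({"price": prices})

# Data type conversion (typecasting): string -> float.
# pd.to_numeric(..., errors="coerce") is safer when malformed values may appear.
df["price"] = df["price"].astype(float)

# Outlier detection with the Z-score method: flag points more than
# 3 standard deviations from the mean (here, the 250.0 entry).
z = (df["price"] - df["price"].mean()) / df["price"].std()
outliers = df[z.abs() > 3]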
Data cleaning
• Data cleaning is a vital step in the data science
pipeline. It ensures that the data used for
analysis and modeling is accurate, consistent,
and reliable, leading to more robust machine
learning models.
Best Practices for Data Cleaning

Here are some practical tips and best practices for data
cleaning:
• Maintain a strict data quality measure while importing new
data.
• Use efficient and accurate algorithms to fix typos and fill in
missing values.
• Validate data accuracy with known factors and cross-
checks.
• Remember that data cleaning is not a one-time process but
a continuous one. As new data comes in, it should also be
cleaned and preprocessed before being used in the model.
• By following these practices, we can ensure that our data is
clean and structured to maximize the performance of our
machine-learning models.
Data Preprocessing
• Data preprocessing is critical in data science, particularly for
machine learning applications. It involves preparing and
cleaning the dataset to make it more suitable for machine
learning algorithms. This process can reduce complexity,
prevent overfitting, and improve the model's overall
performance.
• The data preprocessing phase begins with understanding
your dataset's nuances and the data's main issues through
Exploratory Data Analysis. Real-world data often presents
inconsistencies, typos, missing data, and different scales.
You must address these issues to make the data more
useful and understandable. This process of cleaning and
solving most of the issues in the data is what we call the
data preprocessing step.
Data Preprocessing
• Skipping the data preprocessing step can affect the
performance of your machine learning model and
downstream tasks. Most models can't handle missing
values, and some are affected by outliers, high
dimensionality, and noisy data. By preprocessing the data,
you make the dataset more complete and accurate, which
is critical for making necessary adjustments in the data
before feeding it into your machine learning model.
• Data preprocessing techniques include data cleaning,
dimensionality reduction, feature engineering, sampling
data, transformation, and handling imbalanced data. Each
of these techniques has its own set of methods and
approaches for handling specific issues in the data.
Common Data Preprocessing Techniques
Data Scaling
• Data scaling is a technique used to standardize the range of
independent variables or features of data. It aims to
standardize the data's range of features to prevent any
feature from dominating the others, especially when
dealing with large datasets. This is a crucial step in data
preprocessing, particularly for algorithms sensitive to the
range of the data, such as deep learning models.

• There are several ways to achieve data scaling, including
Min-Max normalization and Standardization. Min-Max
normalization scales the data within a fixed range (usually 0
to 1), while Standardization rescales data to a mean of 0
and a standard deviation of 1.
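
As a sketch, both scalers are available in scikit-learn; the feature matrix here is hypothetical:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: the second feature would otherwise
# dominate the first because of its much larger range.
X = np.array([[1.0, 20000.0],
              [2.0, 35000.0],
              [3.0, 50000.0]])

# Min-Max normalization: each feature rescaled to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature rescaled to mean 0, standard deviation 1.
X_std = StandardScaler().fit_transform(X)

In practice, a scaler should be fitted on the training data only and then applied to the validation and test sets, so that no information leaks from the held-out data.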
Common Data Preprocessing Techniques
Encoding Categorical Variables
• Machine learning models require inputs to be
numerical. If your data contains categorical variables,
you must encode them as numerical values
before fitting and evaluating a model. This
process, known as encoding categorical variables,
is a common data preprocessing technique. One
common method is One-Hot Encoding, which
creates new binary columns for each
category/label in the original columns.
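
A minimal sketch with pandas (scikit-learn's OneHotEncoder is a common alternative); the color column is invented for the example:

import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one new binary column per category/label.
encoded = pd.get_dummies(df, columns=["color"])
# Resulting columns: color_blue, color_green, color_red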
Common Data Preprocessing Techniques
Data Splitting
• Data Splitting is a technique to divide the dataset
into two or three sets, typically training,
validation, and test sets. You use the training set
to train the model and the validation set to tune
the model's parameters. The test set provides an
unbiased evaluation of the final model. This
technique is essential when dealing with large
datasets, as it ensures the model is not overfitted to a
particular subset of the data.
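
A sketch of a 60/20/20 train/validation/test split using scikit-learn, on a hypothetical synthetic dataset:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 4 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# First split off a 20% test set, then carve a validation set
# out of the remaining 80% (0.25 * 80% = 20% of the original).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)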
Common Data Preprocessing Techniques
Handling Missing Values
• Missing data in the dataset can lead to misleading
results. Therefore, it's essential to handle missing
values appropriately. Techniques for handling missing
values include deletion (removing the rows with
missing values) and imputation (replacing the missing
values with statistical measures like the mean, median,
or mode). This step is crucial in ensuring the quality
of data used for training machine learning models.
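
scikit-learn packages these strategies in SimpleImputer; a minimal sketch on a hypothetical numeric matrix:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Replace each missing value with its column median
# ("mean" and "most_frequent" are other built-in strategies).
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)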
Common Data Preprocessing Techniques
Feature Selection
• Feature selection is a process in machine learning where you
automatically select those features in your data that contribute most to
the prediction variable or output in which you are interested. Having
irrelevant features in your data can decrease the accuracy of many
models, especially linear algorithms like linear and logistic regression. This
process is particularly important for data scientists working with high-
dimensional data, as it reduces overfitting, improves accuracy, and
reduces training time.
• Three benefits of performing feature selection before modeling your data
are:
• Reduces Overfitting: Less redundant data means less opportunity to make
noise-based decisions.
• Improves Accuracy: Less misleading data means modeling accuracy improves.
• Reduces Training Time: Fewer features mean less algorithm complexity, so models
train faster.
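
As an illustration, scikit-learn's SelectKBest is one simple univariate feature selection method; the Iris dataset is used here only because it ships with the library:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Small built-in dataset, used purely for illustration.
X, y = load_iris(return_X_y=True)

# Keep the k=2 features with the strongest ANOVA F-statistic
# with respect to the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # boolean mask of the selected features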
Real-World Applications
• Examples include retail customer
segmentation, predictive maintenance in
manufacturing, and fraud detection in finance.
Real-World Applications of Data Cleaning and Data Preprocessing
Improving Customer Segmentation in Retail
• One of the most common data cleaning and preprocessing applications is in the
retail industry, particularly in customer segmentation. Retailers often deal with
vast amounts of customer data, which can be messy and unstructured. They can
ensure the data's quality by employing data-cleaning techniques such as handling
missing values, removing duplicates, and correcting inconsistencies.
• When preprocessed through techniques like normalization and encoding, this
cleaned data can significantly enhance the performance of machine learning
models for customer segmentation, leading to more accurate targeting and
personalized marketing strategies.
Enhancing Predictive Maintenance in Manufacturing
• The manufacturing sector also benefits immensely from data cleaning and data
preprocessing. In predictive maintenance, for instance, machine learning models
predict equipment failures. However, the sensor data collected can be noisy and
contain outliers. One can improve the data quality by applying data cleaning
techniques to remove these outliers and fill in missing values.
• Further, preprocessing steps like feature scaling can help create more accurate
predictive models, reducing downtime and saving costs.
Key Takeaways
• Data cleaning and preprocessing are crucial to
improving machine learning model accuracy
and reliability.
