Data Analytics 2024 Data Cleaning
How To
Prepare
Messy Data
for Analysis
using SQL?
How to clean data using SQL? Swipe next
Data Analytics 2024 Data Cleaning
You work as a data analyst for a retail company that collects
daily sales data from various stores.
Let’s walk through the process of cleaning a dataset
using SQL with a real-world example.
Below is a sample of the SalesTransactions table:
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1001 2001 12/09/2024 P101 5 500
1002 2001 12/09/2024 P102 NULL 1500
1003 2002 11/09/2024 P103 3 450
1003 2002 11/09/2024 P103 3 450
1004 2003 10/09/2024 P104 10000 100000
1005 2004 09/09/2024 XYZ 2 300
Issues in the Data:
There are several issues in this table
Missing values (NULL in QuantitySold)
Duplicate transactions (same TransactionID repeated)
Inconsistent date format (dd/mm/yyyy instead of yyyy-mm-dd)
Outliers in QuantitySold and TotalAmount (e.g., 10,000 units
sold)
Invalid ProductID (e.g., XYZ is not a valid product).
Data Analytics 2024 Data Cleaning
Let’s clean this data using SQL.
Handling Missing Values
Problem :
The QuantitySold column has a missing value (NULL) for
ProductID = P102. We can estimate this missing value based on
the TotalAmount and the product's price.
First, identify rows where QuantitySold is missing (NULL).
SELECT *
FROM SalesTransactions
WHERE QuantitySold IS NULL;
If a price for the product is available in a Products table, calculate
the missing QuantitySold by dividing TotalAmount by the price.
UPDATE SalesTransactions
SET QuantitySold = TotalAmount / (
SELECT PricePerUnit
FROM Products
WHERE Products.ProductID = .ProductID
SalesTransactions
WHERE QuantitySold IS NULL;
Data Analytics 2024 Data Cleaning
If you can’t calculate QuantitySold, you may choose to remove
the row:
DELETE FROM SalesTransactions
WHERE QuantitySold IS NULL;
After Handling Missing Values
We used SQL to calculate the missing QuantitySold by dividing
the TotalAmount by the product's price from another table
(Products).
The missing value in QuantitySold is now replaced by the correct
value 10.
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1001 2001 12/09/2024 P101 5 500
1002 2001 12/09/2024 P102 10 1500
Removing Duplicate Records
Problem :
There’s a duplicate record for TransactionID = 1003, where the
exact same transaction is repeated twice.
Data Analytics 2024 Data Cleaning
Duplicate records can lead to inaccurate analysis. We can remove
duplicates by using the ROW_NUMBER() function.
WITH DuplicateCheck AS (
SELECT *, ROW_NUMBER() OVER
(PARTITION BY TransactionID, CustomerID, TransactionDate
ORDER BY TransactionID) AS RowNum
FROM SalesTransactions
DELETE FROM SalesTransactions
WHERE TransactionID IN (
SELECT TransactionID FROM DuplicateCheck WHERE RowNum > 1
);
After Removing Duplicate Records
The duplicate row has been deleted, leaving only one record
for TransactionID = 1003.
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1001 2001 12/09/2024 P101 5 500
1002 2001 12/09/2024 P102 10 1500
1003 2002 11/09/2024 P103 3 450
Data Analytics 2024 Data Cleaning
Standardizing Date Formats
Problem :
The TransactionDate is in different formats like dd/mm/yyyy
(e.g., 12/09/2024), which makes it inconsistent for analysis.
Now, let’s convert TransactionDate from the dd/mm/yyyy format
to yyyy-mm-dd for consistency.
UPDATE SalesTransactions
SET TransactionDate = STR_TO_DATE(TransactionDate, '%d/%m/%Y')
WHERE TransactionDate LIKE '%/%';
After Standardizing Date Formats
All transaction dates are now in the same format: 2024-09-12,
making it easier for queries, comparisons, and reporting.
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1001 2001 12-09-2024 P101 5 500
1002 2001 12-09-2024 P102 10 5
1 00
1003 2002 11-09-2024 P103 3 5
4 0
Data Analytics 2024 Data C g
leanin
Identify & Handle Outliers
Problem :
There’s an abnormally high QuantitySold of 10,000 units for
TransactionID = 1004, which is likely an error or an extreme outlier.
We can flag or delete such extreme values.
Identify outliers in QuantitySold
SELECT *
FROM SalesTransactions
WHERE QuantitySold > 1000;
Remove the outlier
DELETE FROM SalesTransactions
WHERE QuantitySold > 1000;
After Identify & Handle Outliers
We flagged this as an outlier using a simple rule (e.g., quantities greater
than 1,000), and then removed this transaction from the data.
TransactionID C
ustomerID TransactionDate ProductID Q S
uantity old TotalAmount
1004 3
200 Deleted
Data Analytics 2024 Data Cleaning
Validating Product IDs
Problem :
The ProductID for TransactionID = 1005 is XYZ, which is invalid
because there is no such product in the Products table.
Identify invalid ProductIDs
SELECT *
FROM SalesTransactions
WHERE ProductID NOT IN (SELECT ProductID FROM Products);
Remove rows with invalid ProductIDs
DELETE FROM SalesTransactions
WHERE ProductID NOT IN (SELECT ProductID FROM Products);
After Identify & Handle Outliers
The row with the invalid ProductID = XYZ has been deleted, leaving
only valid product entries.
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1005 2004 Deleted
Data Analytics 2024 Data Cleaning
Final Cleaned Data
After performing these steps, your cleaned SalesTransactions
table would look something like this:
TransactionID CustomerID TransactionDate ProductID QuantitySold TotalAmount
1001 2001 12-09-2024 P101 5 500
1002 2001 12-09-2024 P102 10 1500
1003 2002 11-09-2024 P103 3 450
Summary
Handled missing values by filling or removing rows with NULL
values
Removed duplicates to ensure each transaction is unique
Standardized date formats for consistent querying
Handled outliers in QuantitySold and TotalAmount to ensure
realistic data
Validated ProductIDs by removing invalid entries.
By applying these SQL techniques, you ensure the dataset is clean,
accurate, and ready for analysis or reporting.
Data Analytics 2024 Data Cleaning
Dr. Jayen Thakker MetricMinds.in
Was This Helpful?
Be sure to save it so you can
come back to it later
Follow us for more.