3 Data - Transforming ST

The document provides an overview of data cleaning and transformation techniques essential for data analytics and visualization. It covers common data quality issues, workflows for cleaning data, and various transformation methods such as normalization, encoding, and aggregation. Tools like Python (Pandas) and Power BI are recommended for implementing these techniques.

Uploaded by raghav.balbharati

ECE 4446 DATA ANALYTICS AND VISUALIZATION

Dr Ramya S, E & C Dept, MIT, Manipal

Data Cleaning with Examples

• An Introduction to Cleaning and Preparing Data for Analysis

Introduction to Data Cleaning

• Definition: The process of detecting and correcting (or removing) corrupt or inaccurate records.
• Why it matters: Dirty data leads to incorrect analysis.
• Goal: Improve data quality and consistency.

Common Data Quality Issues

• Missing values
• Duplicates
• Inconsistent formats
• Outliers
• Typos and spelling errors
• Incorrect data types

Data Cleaning Workflow

1. Data profiling
2. Identifying issues
3. Cleaning methods
4. Validation
5. Documentation
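
The first two workflow steps (profiling and identifying issues) can be sketched in Pandas as below. This is a minimal illustration, not a complete profiling routine; the DataFrame and its column names (`Name`, `Age`) are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", None],
    "Age": [25, 31, 31, 40],
})

# Profile the data before changing anything.
missing_per_column = df.isna().sum()          # missing values per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row
column_types = df.dtypes                      # helps spot incorrect data types

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(column_types)
```

Recording these counts before and after cleaning also gives a simple basis for the validation and documentation steps.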

Removing Duplicates

• Use Pandas: df.drop_duplicates()
• Example:
  - Remove repeated name entry
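
A minimal sketch of removing a repeated name entry with `drop_duplicates()`. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical data: the "Ravi" row is entered twice.
df = pd.DataFrame({"Name": ["Asha", "Ravi", "Ravi"], "Age": [25, 31, 31]})

# Drop rows that are identical across all columns, keeping the first occurrence.
deduped = df.drop_duplicates().reset_index(drop=True)

# Alternative: treat rows as duplicates when only the Name matches.
deduped_by_name = df.drop_duplicates(subset="Name", keep="first")

print(deduped)
```

Which columns define a duplicate is a judgment call: matching on `Name` alone would also merge two different people who share a name.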

Handling Missing Data

• Techniques:
  - Remove rows/columns
  - Imputation (mean, median, mode)
• Example:
  - Fill missing Age using median
  - Drop row with no name
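
The two example steps might look like this in Pandas. The data is hypothetical, and the order (impute first, then drop) follows the order of the bullets; reversing it would change the median used.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Name.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", None, "Meera"],
    "Age": [25, np.nan, 40, 35],
})

# Fill the missing Age with the median of the observed ages (25, 40, 35).
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)

# Drop any row that has no Name.
df = df.dropna(subset=["Name"]).reset_index(drop=True)

print(df)
```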

Data Type Conversion

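• Convert salary from string to numeric (60K → 60000)
• Convert DOB to a datetime (displayed as YYYY-MM-DD)
• Code Example:
  import pandas as pd
  # Replace the 'K' suffix with '000', then cast to int (works for whole-number values such as '60K')
  df['Salary'] = df['Salary'].replace({'K': '000'}, regex=True).astype(int)
  # Parse date strings into datetime64 values
  df['DOB'] = pd.to_datetime(df['DOB'])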
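
A self-contained variant of the conversion, as a sketch. It assumes salaries may also appear as plain digits or with a decimal before the suffix (for example `72.5K`) and that dates arrive as day/month/year text. The sample values, the `parse_salary` helper and the `DOB_iso` column are hypothetical.

```python
import pandas as pd

# Hypothetical raw data; the DD/MM/YYYY input format is an assumption.
df = pd.DataFrame({
    "Salary": ["60K", "45000", "72.5K"],
    "DOB": ["07/08/1995", "15/01/1990", "23/11/1988"],
})

def parse_salary(text: str) -> int:
    """Convert strings such as '60K' or '45000' to an integer amount."""
    text = text.strip().upper()
    if text.endswith("K"):
        return int(round(float(text[:-1]) * 1000))
    return int(float(text))

df["Salary"] = df["Salary"].map(parse_salary)

# Parse with an explicit format so day and month are not swapped silently.
df["DOB"] = pd.to_datetime(df["DOB"], format="%d/%m/%Y")
df["DOB_iso"] = df["DOB"].dt.strftime("%Y-%m-%d")

print(df)
```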

Handling Outliers

• Techniques:
  - Z-score
  - IQR method
• Example:
  - Remove salaries < 1,000 or > 10,00,000 (10 lakh, i.e., 1,000,000)
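
The three rules can be compared on a small sample as sketched below. The salary values are hypothetical, and the cut-offs of 1.5 × IQR and |z| > 3 are common conventions assumed here rather than values given in the slides.

```python
import pandas as pd

# Hypothetical salaries with one implausibly low and one implausibly high value.
salaries = pd.Series([500, 42000, 48000, 50000, 52000, 55000, 5000000])

# Fixed-threshold rule from the example: keep 1,000 <= salary <= 1,000,000.
in_range = salaries[(salaries >= 1000) & (salaries <= 1_000_000)]

# IQR method: keep values within 1.5 * IQR of the quartiles.
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
iqr_kept = salaries[salaries.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Z-score method: keep values within 3 standard deviations of the mean.
z = (salaries - salaries.mean()) / salaries.std(ddof=0)
z_kept = salaries[z.abs() <= 3]

print(in_range.tolist())
print(iqr_kept.tolist())
print(len(z_kept))
```

On a sample this small the z-score rule removes nothing: with seven points no |z| can exceed √6 ≈ 2.45, and the extreme value inflates the standard deviation it is measured against. The IQR method is less affected by this.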

Standardization & Validation

• Standardize emails (e.g., lowercase)
• Validate formats using regex
• Code:
  df['Email'] = df['Email'].str.lower()
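
A sketch that combines standardization with regex validation. The pattern is deliberately simple (something@something.something) and is an assumption for illustration; it is not a full email-address validator. The sample addresses and the `EmailValid` column are hypothetical.

```python
import pandas as pd

# Hypothetical addresses: stray whitespace, mixed case, and one missing a domain suffix.
df = pd.DataFrame({"Email": ["  Asha@Example.COM", "ravi@example", "meera@example.org "]})

# Standardize: trim whitespace and lowercase.
df["Email"] = df["Email"].str.strip().str.lower()

# Validate: the whole string must match local@domain.suffix with no spaces.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"
df["EmailValid"] = df["Email"].str.fullmatch(EMAIL_PATTERN)

print(df)
```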

Summary

• Clean data = reliable analysis
• Always document your changes
• Use tools: Python (Pandas), Power BI, Excel, OpenRefine

Data Transforming with Examples

• Overview of common data transformation techniques with examples

What is Data Transformation?

• Data Transformation is the process of converting data into a suitable format for analysis.
• It includes normalization, standardization, encoding, aggregation, etc.

Types of Data Transformation

• Normalization
• Standardization
• Encoding (Label, One-Hot)
• Aggregation
• Binning
• Log Transformation
• Pivoting/Unpivoting
• Datetime Extraction

Normalization

• Normalization scales numerical values into a specific range (usually [0, 1]) to bring all features to the same scale.

• Example:
  - Original: [10, 20, 30]
  - Min-Max Normalized: [(10-10)/(30-10), (20-10)/(30-10), (30-10)/(30-10)] → [0.0, 0.5, 1.0]

Normalization Example

• Original: [150, 160, 170, 180]

• Min-Max Normalized: [0.0, 0.33, 0.67, 1.0]
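
Both normalization examples follow x' = (x - min) / (max - min). A minimal sketch that reproduces them; the function name `min_max_normalize` is a hypothetical helper, not a library call.

```python
import pandas as pd

def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("min-max normalization is undefined when all values are equal")
    return (values - lo) / (hi - lo)

first = min_max_normalize(pd.Series([10, 20, 30]))
second = min_max_normalize(pd.Series([150, 160, 170, 180]))

print(first.tolist())
print(second.round(2).tolist())
```

Because the minimum and maximum drive the result, a single outlier can compress all other values into a narrow band.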

Standardization Example
• Standardization transforms data to have a mean of 0 and a standard deviation of 1 using the Z-score.

• Original Scores: [55, 85, 75, 65]

• Z = (X - μ) / σ

Step 1. Calculate the mean (μ) of the dataset.

• Sum the values: 55 + 85 + 75 + 65 = 280
• Divide by the number of values: μ = 280 / 4 = 70
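
The remaining steps of the Z-score calculation can be sketched in NumPy. The slides do not say whether σ is the population or the sample standard deviation; the population form (`ddof=0`) is assumed here.

```python
import numpy as np

scores = np.array([55, 85, 75, 65], dtype=float)

# Mean, as computed in Step 1.
mu = scores.mean()

# Population standard deviation (assumption: divide by N, not N - 1).
# Squared deviations are 225, 225, 25, 25, so sigma = sqrt(500 / 4) = sqrt(125).
sigma = scores.std(ddof=0)

# Apply Z = (X - mu) / sigma to every score.
z = (scores - mu) / sigma

print(mu, round(sigma, 2))
print(np.round(z, 2))
```

With the sample standard deviation (`ddof=1`) the Z-scores would be slightly smaller in magnitude.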
Label Encoding

• Converts categorical labels into integers. Best suited to categories with an ordinal relationship; it is also commonly used for binary categories, as in the example below.

• Example:
  - Gender: ['Male', 'Female', 'Female', 'Male'] → [1, 0, 0, 1]
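
One way to reproduce this encoding in Pandas is through the categorical dtype, which assigns codes in sorted category order (Female = 0, Male = 1). This is a sketch; other encoders may assign codes differently.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Female", "Male"])

# Convert to a categorical; categories are sorted alphabetically by default.
categories = gender.astype("category")
codes = categories.cat.codes

# Keep the code-to-label mapping so the encoding can be reversed later.
mapping = dict(enumerate(categories.cat.categories))

print(codes.tolist())
print(mapping)
```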

Label & One-Hot Encoding

• Label Encoding: ['Male', 'Female'] → [1, 0]

One-Hot Encoding

• Creates binary columns for each category. Useful when categories are nominal (no order).

• Example:
  - Color: ['Red', 'Green', 'Blue'] →
  - Red: [1,0,0], Green: [0,1,0], Blue: [0,0,1]
• One-Hot Encoding: ['Red', 'Blue'] → [1, 0], [0, 1]
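
A minimal sketch of the Color example using `pd.get_dummies`. The columns are reordered to Red, Green, Blue only so the output matches the vectors shown above; by default Pandas orders the new columns alphabetically.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(colors["Color"], dtype=int)

# Reorder columns to match the example.
one_hot = one_hot[["Red", "Green", "Blue"]]

print(one_hot)
```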

Binning (Discretization)

• Converts continuous data into discrete bins. Helps in reducing the impact of outliers.

• Example:
  - Age: [5, 12, 37, 45, 67] → Bins: Child (0–12), Adult (13–59), Senior (60+)
  - Result: [Child, Child, Adult, Adult, Senior]
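
The age bins can be expressed with `pd.cut`. In this sketch the bin edges [0, 12, 59, ∞] are chosen so that the right-closed intervals reproduce the Child/Adult/Senior ranges for whole-number ages.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 12, 37, 45, 67])

# Intervals are right-closed: [0, 12], (12, 59], (59, inf).
groups = pd.cut(
    ages,
    bins=[0, 12, 59, np.inf],
    labels=["Child", "Adult", "Senior"],
    include_lowest=True,  # so that an age of 0 falls in the first bin
)

print(groups.tolist())
```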

Log Transformation

• Reduces skewness in highly skewed data. Useful when data spans multiple orders of magnitude.

• Example:
  - Salary: [1000, 10000, 100000] → log10(Salary): [3.0, 4.0, 5.0]
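
The salary example in NumPy, as a short sketch. The logarithm is only defined for positive values, so zeros or negatives would need separate handling (not shown).

```python
import numpy as np

salary = np.array([1000, 10000, 100000])

# Base-10 logarithm: each tenfold increase in salary adds 1 to the result.
log_salary = np.log10(salary)

print(log_salary)
```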

Aggregation

• Summarizes data by group (e.g., mean, sum, count). Often used in grouped analysis or dashboards.

• Example: Sales by Region
  - Region A: [100, 200], Region B: [300, 400]
  - → Sum: A: 300, B: 700
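
The regional totals can be reproduced with `groupby`. The second call is an optional extension that computes several summaries at once.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Sales": [100, 200, 300, 400],
})

# Total sales per region.
totals = sales.groupby("Region")["Sales"].sum()

# Several aggregates in one pass.
summary = sales.groupby("Region")["Sales"].agg(["sum", "mean", "count"])

print(totals)
print(summary)
```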

Pivoting

• Converts row-based data to column-based format. Often used to restructure data for analysis.

• Example:
• Data:
• | Product | Month | Sales |
• |---------|-------|-------|
• | A       | Jan   | 100   |
• → Pivot to have Months as columns with Sales as values
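
A sketch of pivoting and of the reverse operation (unpivoting) in Pandas. Only the row A / Jan / 100 comes from the example; the other three rows are hypothetical, added so the pivot has something to spread across columns.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 120, 90, 110],  # only A/Jan/100 is from the example
})

# Pivot: one row per Product, one column per Month, Sales as the cell values.
wide = long_df.pivot(index="Product", columns="Month", values="Sales")

# Unpivot: melt the month columns back into rows.
back = wide.reset_index().melt(id_vars="Product", var_name="Month", value_name="Sales")

print(wide)
print(back)
```

`pivot` requires each Product/Month pair to be unique; when pairs repeat, `pivot_table` with an aggregation function is the usual alternative.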

Datetime Extraction

• Extracts components such as year, month, day from datetime values. Useful for time series analysis.

• Example:
  - Date: '2025-08-07' → Year: 2025, Month: 8, Day: 7, Day of Week: Thursday
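
The same extraction in Pandas, using the `.dt` accessor on a datetime column. The output column names are chosen for the example.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2025-08-07"]))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "DayOfWeek": dates.dt.day_name(),
})

print(parts)
```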

Power BI: Data Transformation

• In Power BI, data can be transformed using Power Query and DAX functions before visualizing it.

• Example:
• Sales Data:
• | Date       | Sales |
• |------------|-------|
• | 2025-01-01 | 5000  |
• Transformations:
  - Extract Year from Date
  - Apply log on Sales
  - Group by Month

Power BI Example

• Data:
• | Date       | Sales |
• |------------|-------|
• | 2025-01-01 | 5000  |

• Transformations:
  - Extract Year/Month
  - Group by Month
  - Apply Log to Sales
• → Useful for trend analysis & normalization
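
For comparison, the same three transformations can be sketched in Pandas. This is not Power Query or DAX code; it only mirrors the steps outside Power BI. Only the row 2025-01-01 / 5000 comes from the example; the other rows are hypothetical.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-02-01"]),
    "Sales": [5000, 7000, 6000],  # only the first row is from the example
})

# Extract Year and Month from the Date column.
sales["Year"] = sales["Date"].dt.year
sales["Month"] = sales["Date"].dt.month

# Apply a base-10 log to Sales.
sales["LogSales"] = np.log10(sales["Sales"])

# Group by Year and Month and total the sales.
monthly = sales.groupby(["Year", "Month"], as_index=False)["Sales"].sum()

print(monthly)
```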
