3 Data - Transforming ST

The document provides an overview of data cleaning and transformation techniques essential for data analytics and visualization. It covers common data quality issues, workflows for cleaning data, and various transformation methods such as normalization, encoding, and aggregation. Tools like Python (Pandas) and Power BI are recommended for implementing these techniques.

ECE 4446 DATA ANALYTICS AND VISUALIZATION

Dr Ramya S, E & C Dept, MIT, Manipal


Data Cleaning with Examples

• An Introduction to Cleaning and Preparing Data for Analysis


Introduction to Data Cleaning

• Definition: The process of detecting and correcting (or removing) corrupt or inaccurate records.
• Why it matters: Dirty data leads to incorrect analysis.
• Goal: Improve data quality and consistency.


Common Data Quality Issues

• Missing values
• Duplicates
• Inconsistent formats
• Outliers
• Typos and spelling errors
• Incorrect data types


Data Cleaning Workflow

• 1. Data profiling
• 2. Identifying issues
• 3. Cleaning methods
• 4. Validation
• 5. Documentation
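
The profiling and issue-identification steps can be sketched in Pandas as follows. This is a minimal illustration, not the course's own code; the DataFrame and its column names are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data containing one missing value and one duplicate row
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", None],
    "Age": [25, 31, 31, 40],
})

# Step 1-2: profile the data and identify issues
missing_per_column = df.isna().sum()          # missing values in each column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row
column_types = df.dtypes                      # data type inferred for each column

print(missing_per_column)
print(duplicate_rows)
print(column_types)
```

Recording these counts before and after cleaning is one simple way to cover the validation and documentation steps.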

Removing Duplicates

• Use Pandas: df.drop_duplicates()
• Example:
  - Remove a repeated name entry
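
A short sketch of the call above, using hypothetical sample rows. By default drop_duplicates() compares all columns and keeps the first occurrence; the subset argument restricts the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical data: the "Ravi" row is entered twice
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi"],
    "City": ["Manipal", "Udupi", "Udupi"],
})

# Drop rows that repeat an earlier row in every column
deduped = df.drop_duplicates()

# Drop rows that repeat an earlier Name, regardless of the other columns
deduped_by_name = df.drop_duplicates(subset="Name", keep="first")

print(deduped)
```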


Handling Missing Data

• Techniques:
  - Remove rows/columns
  - Imputation (mean, median, mode)
• Example:
  - Fill missing Age using the median
  - Drop the row with no name
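
The two example steps might look like this in Pandas. The data is hypothetical, and the order (drop first, then impute) is an assumption; imputing before dropping would compute the median over a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has no Name, another has no Age
df = pd.DataFrame({
    "Name": ["Asha", None, "Ravi", "Meera"],
    "Age": [25, 40, np.nan, 31],
})

# Drop the row with no name (copy so later assignment is unambiguous)
cleaned = df.dropna(subset=["Name"]).copy()

# Fill the missing Age with the median of the remaining ages
median_age = cleaned["Age"].median()
cleaned["Age"] = cleaned["Age"].fillna(median_age)

print(cleaned)
```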


Data Type Conversion

• Convert salary from string to numeric (60K → 60000)
• Convert DOB to YYYY-MM-DD
• Code Example:
  df['Salary'] = df['Salary'].replace({'K': '000'}, regex=True).astype(int)
  df['DOB'] = pd.to_datetime(df['DOB'])


Handling Outliers

• Techniques:
  - Z-score
  - IQR method
• Example:
  - Remove salaries < 1000 or > 10,00,000
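
The three approaches above can be sketched as follows. The salary values are hypothetical, the 1.5 × IQR fences are the conventional choice, and the Z-score cutoff of 2 is an assumption made only because this sample is tiny (a cutoff of 3 is more usual on larger data).

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salaries = pd.Series([30000, 32000, 35000, 31000, 33000, 500000])

# IQR method: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_kept = salaries[(salaries >= lower) & (salaries <= upper)]

# Z-score method: keep values within 2 population standard deviations
z = (salaries - salaries.mean()) / salaries.std(ddof=0)
z_kept = salaries[z.abs() <= 2]

# Fixed-range rule from the example: keep 1000 <= salary <= 10,00,000
range_kept = salaries[salaries.between(1000, 1_000_000)]

print(iqr_kept.tolist())
print(z_kept.tolist())
```

Note that the fixed-range rule keeps 500000, while both statistical methods flag it; which rule is appropriate depends on the domain.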


Standardization & Validation

• Standardize emails (e.g., lowercase)
• Validate formats using regex
• Code:
  df['Email'] = df['Email'].str.lower()
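
A sketch that combines the standardization line with a regex check. The sample addresses are hypothetical, and the pattern is a deliberately simple assumption that accepts most everyday addresses; it is not a complete email grammar.

```python
import pandas as pd

# Hypothetical emails: mixed case, stray spaces, and one invalid entry
df = pd.DataFrame({
    "Email": ["Asha@Example.COM", " ravi@example.com ", "not-an-email"],
})

# Standardize: trim whitespace and convert to lowercase
df["Email"] = df["Email"].str.strip().str.lower()

# Validate: local part, "@", then a domain with at least one dot
pattern = r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"
df["EmailValid"] = df["Email"].str.match(pattern)

print(df)
```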

Summary

• Clean data = reliable analysis
• Always document your changes
• Use tools: Python (Pandas), Power BI, Excel, OpenRefine


Data Transforming with Examples

• Overview of common data transformation techniques with examples


What is Data Transformation?

• Data transformation is the process of converting data into a suitable format for analysis.
• It includes normalization, standardization, encoding, aggregation, etc.


Types of Data Transformation

• Normalization
• Standardization
• Encoding (Label, One-Hot)
• Aggregation
• Binning
• Log Transformation
• Pivoting/Unpivoting
• Datetime Extraction


Normalization

• Normalization scales numerical values into a specific range (usually [0, 1]) to bring all features to the same scale.

• Example:
  - Original: [10, 20, 30]
  - Min-Max Normalized: [(10-10)/(30-10), (20-10)/(30-10), (30-10)/(30-10)] → [0.0, 0.5, 1.0]


Normalization Example

• Original: [150, 160, 170, 180]
• Min-Max Normalized: [0.0, 0.33, 0.67, 1.0]
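
The min-max formula, (x - min) / (max - min), can be reproduced in Pandas as a quick check of the numbers above. The variable names are illustrative.

```python
import pandas as pd

heights = pd.Series([150, 160, 170, 180])

# Min-max normalization: (x - min) / (max - min) maps the data onto [0, 1]
normalized = (heights - heights.min()) / (heights.max() - heights.min())

print(normalized.round(2).tolist())
```

The formula divides by zero when all values are equal, so a constant column needs separate handling.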


Standardization Example
• Standardization transforms data to have a mean of 0 and a standard deviation of 1 using the Z-score.

• Original Scores: [55, 85, 75, 65]
• Z = (X - μ) / σ

Step 1. Calculate the mean (μ) of the dataset.
• Sum the values: 55 + 85 + 75 + 65 = 280
• Divide by the number of values: μ = 280 / 4 = 70
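
The remaining steps of this example are not present in the text, so the sketch below carries the calculation through under one stated assumption: σ is the population standard deviation (dividing by N, which is NumPy's default). With the sample standard deviation (dividing by N - 1) the Z-scores would be slightly smaller in magnitude.

```python
import numpy as np

scores = np.array([55, 85, 75, 65], dtype=float)

# Step 1: mean
mu = scores.mean()

# Step 2 (assumed): population standard deviation, ddof=0
sigma = scores.std()

# Step 3: Z-score of each value
z = (scores - mu) / sigma

print(mu, round(sigma, 2), z.round(2).tolist())
```

Under this assumption the squared deviations sum to 500, so σ = √(500 / 4) = √125 ≈ 11.18, and the standardized scores have mean 0 and standard deviation 1.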
Label Encoding

• Converts categorical labels into integers. Useful when categories have an ordinal relationship.

• Example:
  - Gender: ['Male', 'Female', 'Female', 'Male'] → [1, 0, 0, 1]
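
One way to produce these codes in Pandas is the category dtype, which numbers categories in sorted order (Female = 0, Male = 1). For genuinely ordinal data an explicit mapping makes the order visible; the Size column below is a hypothetical illustration.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Female", "Male"])

# Category codes follow alphabetical order: Female -> 0, Male -> 1
gender_codes = gender.astype("category").cat.codes

# Hypothetical ordinal feature with an explicit, meaningful order
size_order = {"Small": 0, "Medium": 1, "Large": 2}
sizes = pd.Series(["Small", "Large", "Medium"])
size_codes = sizes.map(size_order)

print(gender_codes.tolist())
print(size_codes.tolist())
```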


Label & One-Hot Encoding

• Label Encoding: ['Male', 'Female'] → [1, 0]


One-Hot Encoding

• Creates binary columns for each category. Useful when categories are nominal (no order).

• Example:
  - Color: ['Red', 'Green', 'Blue'] → Red: [1,0,0], Green: [0,1,0], Blue: [0,0,1]
• One-Hot Encoding: ['Red', 'Blue'] → [1, 0], [0, 1]
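
A minimal sketch of the Color example using pd.get_dummies. Pandas orders the new columns alphabetically, so they are reordered here only to match the Red, Green, Blue layout above.

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(colors["Color"], dtype=int)

# Reorder columns to match the example layout
one_hot = one_hot[["Red", "Green", "Blue"]]

print(one_hot)
```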


Binning (Discretization)

• Converts continuous data into discrete bins. Helps in reducing the impact of outliers.

• Example:
  - Age: [5, 12, 37, 45, 67] → Bins: Child (0–12), Adult (13–59), Senior (60+)
  - Result: [Child, Child, Adult, Adult, Senior]
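
The age bins above can be expressed with pd.cut. The edges 0, 12, 59 and infinity are chosen so that each interval includes its right edge, which places 12 in Child and 59 in Adult; this reading of the bin boundaries is an assumption based on the ranges listed.

```python
import pandas as pd

ages = pd.Series([5, 12, 37, 45, 67])

# Intervals: [0, 12], (12, 59], (59, inf)
age_groups = pd.cut(
    ages,
    bins=[0, 12, 59, float("inf")],
    labels=["Child", "Adult", "Senior"],
    include_lowest=True,  # so that age 0 falls in the first bin
)

print(age_groups.tolist())
```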


Log Transformation

• Reduces skewness in highly skewed data. Useful when data spans multiple orders of magnitude.

• Example:
  - Salary: [1000, 10000, 100000] → log10(Salary): [3.0, 4.0, 5.0]
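
The same transformation with NumPy, as a quick check of the values above:

```python
import numpy as np

salary = np.array([1000, 10000, 100000])

# Base-10 logarithm compresses values spanning several orders of magnitude
log_salary = np.log10(salary)

print(log_salary.tolist())
```

The logarithm is undefined for zero and negative values; for data containing zeros, np.log1p (which computes log(1 + x)) is a common alternative.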


Aggregation

• Summarizes data by group (e.g., mean, sum, count). Often used in grouped analysis or dashboards.

• Example:
  - Sales by Region:
  - Region A: [100, 200], Region B: [300, 400]
  - → Sum: A: 300, B: 700
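
The Sales by Region example maps directly onto a Pandas groupby. The column names are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Sales": [100, 200, 300, 400],
})

# Sum of sales within each region
totals = sales.groupby("Region")["Sales"].sum()

# Several summaries at once: sum, mean and count per region
summary = sales.groupby("Region")["Sales"].agg(["sum", "mean", "count"])

print(totals.to_dict())
print(summary)
```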


Pivoting

• Converts row-based data to column-based format. Often used to restructure data for analysis.

• Example:
  - Data:
    Product | Month | Sales
    A       | Jan   | 100
  - → Pivot to have Months as columns with Sales as values
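
A sketch of pivoting and unpivoting in Pandas. Only the row A | Jan | 100 comes from the example; the other rows are hypothetical, added so the pivot has more than one cell.

```python
import pandas as pd

# Long (row-based) format; rows other than A/Jan/100 are hypothetical
long_df = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 120, 90, 110],
})

# Pivot: one row per Product, one column per Month, Sales as values
wide_df = long_df.pivot(index="Product", columns="Month", values="Sales")

# Unpivot: melt the wide table back into long format
unpivoted = wide_df.reset_index().melt(
    id_vars="Product", var_name="Month", value_name="Sales"
)

print(wide_df)
```

pivot raises an error if a Product/Month pair appears more than once; pivot_table with an aggregation function handles that case.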


Datetime Extraction

• Extracts components such as year, month, and day from datetime values. Useful for time series analysis.

• Example:
  - Date: '2025-08-07' → Year: 2025, Month: 8, Day: 7, Day of Week: Thursday
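
In Pandas these components are available through the .dt accessor once the column has been parsed as datetime. The output column names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2025-08-07"]))

# Extract individual components from the datetime values
parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "DayOfWeek": dates.dt.day_name(),
})

first = parts.iloc[0].to_dict()
print(first)
```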


Power BI: Data Transformation

• In Power BI, data can be transformed using Power Query and DAX functions before visualizing it.

• Example:
• Sales Data:
  | Date       | Sales |
  |------------|-------|
  | 2025-01-01 | 5000  |
• Transformations:
  - Extract Year from Date
  - Apply log on Sales
  - Group by Month


Power BI Example

• Data:
  | Date       | Sales |
  |------------|-------|
  | 2025-01-01 | 5000  |

• Transformations:
  - Extract Year/Month
  - Group by Month
  - Apply Log to Sales
  - → Useful for trend analysis & normalization
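
For comparison, the same three transformations can be sketched in Pandas. This is a Python analogue, not Power Query or DAX code; only the 2025-01-01 / 5000 row comes from the example, and the other rows are hypothetical so that grouping has something to combine.

```python
import numpy as np
import pandas as pd

# First row is from the example; the remaining rows are hypothetical
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-02-01"]),
    "Sales": [5000, 7000, 6000],
})

# Extract Year and Month from the date
sales["Year"] = sales["Date"].dt.year
sales["Month"] = sales["Date"].dt.month

# Apply a base-10 log to Sales
sales["LogSales"] = np.log10(sales["Sales"])

# Group by month (within year) and total the sales
monthly = sales.groupby(["Year", "Month"])["Sales"].sum()

print(monthly)
```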
