ECE 4446 DATA ANALYTICS AND VISUALIZATION
Dr Ramya S, E & C Dept, MIT, Manipal
Data Cleaning with Examples
• An Introduction to Cleaning and Preparing Data for Analysis
Introduction to Data Cleaning
• Definition: The process of detecting and correcting (or removing) corrupt or inaccurate records.
• Why it matters: Dirty data leads to incorrect analysis.
• Goal: Improve data quality and consistency.
Common Data Quality Issues
• Missing values
• Duplicates
• Inconsistent formats
• Outliers
• Typos and spelling errors
• Incorrect data types
Data Cleaning Workflow
1. Data profiling
2. Identifying issues
3. Cleaning methods
4. Validation
5. Documentation
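The first two workflow steps (profiling and identifying issues) can be sketched in Pandas as follows. This is a minimal illustration, not a complete profiling routine; the sample records and column names are hypothetical and were invented only for this example.

```python
import pandas as pd

# Hypothetical sample records, invented only to illustrate profiling.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi", None],
    "Age": [25, 30, 30, None],
    "Salary": ["60K", "45K", "45K", "52K"],
})

# Step 1: profile the data.
column_types = df.dtypes                # Salary is stored as text, not as a number
missing_per_column = df.isna().sum()    # number of missing values in each column
duplicate_rows = df.duplicated().sum()  # rows that exactly repeat an earlier row

# Step 2: the profile points at the issues to fix in the cleaning step.
print(column_types)
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Here the profile would flag one missing Name, one missing Age, one duplicate row, and a Salary column that still needs type conversion.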
Removing Duplicates
• Use Pandas: df.drop_duplicates()
• Example:
  - Remove repeated name entry
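A short sketch of the duplicate-removal example, assuming a small hypothetical table with one repeated name entry. The `subset` and `keep` arguments are standard `drop_duplicates` parameters; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the "Ravi" row appears twice.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Ravi"],
    "Age": [25, 30, 30],
})

# Drop rows that are identical in every column, keeping the first occurrence.
deduped = df.drop_duplicates()

# Alternatively, treat rows as duplicates when only the Name repeats.
deduped_by_name = df.drop_duplicates(subset="Name", keep="first")

print(deduped)
```

Deduplicating on a subset of columns is the stricter choice; it is worth checking first that two different people cannot share the same value in that column.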
Handling Missing Data
• Techniques:
  - Remove rows/columns
  - Imputation (mean, median, mode)
• Example:
  - Fill missing Age using median
  - Drop row with no name
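The two example actions above can be sketched as below. The records are hypothetical, and the order of operations (drop first, then impute) is an assumption made here so that the median is computed only from rows that are kept.

```python
import pandas as pd

# Hypothetical data: one row has no Name, another has no Age.
df = pd.DataFrame({
    "Name": ["Asha", None, "Ravi", "Meera"],
    "Age": [25, 33, None, 41],
})

# Drop any row whose Name is missing.
cleaned = df.dropna(subset=["Name"]).copy()

# Impute the remaining missing Age values with the median Age.
median_age = cleaned["Age"].median()
cleaned["Age"] = cleaned["Age"].fillna(median_age)

print(cleaned)
```

The median is usually preferred over the mean when the column may contain outliers, since a few extreme values do not shift it much.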
Data Type Conversion
• Convert salary from string to numeric (60K → 60000)
• Convert DOB to YYYY-MM-DD
• Code Example:
    df['Salary'] = df['Salary'].replace({'K': '000'}, regex=True).astype(int)
    df['DOB'] = pd.to_datetime(df['DOB'])
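The string replacement shown on this slide works for whole-number values such as "60K" but would turn "1.5K" into "1.5000". The sketch below is one possible, more defensive variant; `parse_salary` is a hypothetical helper, the sample values are invented, and the DOB column is assumed to already use the YYYY-MM-DD layout.

```python
import pandas as pd

# Hypothetical raw data with mixed salary formats and one unparseable date.
df = pd.DataFrame({
    "Salary": ["60K", "45000", "72K"],
    "DOB": ["1990-05-17", "1985-12-01", "not a date"],
})

def parse_salary(value: str) -> int:
    """Convert strings such as '60K' or '45000' to an integer amount."""
    text = value.strip().upper()
    if text.endswith("K"):
        # Multiply instead of appending zeros so that '1.5K' becomes 1500.
        return int(float(text[:-1]) * 1000)
    return int(float(text))

df["Salary"] = df["Salary"].map(parse_salary)

# errors="coerce" turns unparseable dates into NaT instead of raising an error.
df["DOB"] = pd.to_datetime(df["DOB"], format="%Y-%m-%d", errors="coerce")

print(df)
```

Coercing bad dates to NaT keeps the conversion from failing and leaves the problem rows easy to find with `df["DOB"].isna()`.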
Handling Outliers
• Techniques:
  - Z-score
  - IQR method
• Example:
  - Remove salaries < 1000 or > 10,00,000
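Both techniques named above can be sketched on a small hypothetical salary list. The 1.5 × IQR fences are the conventional choice; the Z-score cutoff of 2 is an assumption made for this tiny sample (a cutoff of 3 is common for larger datasets, but no value in a sample of seven can reach it).

```python
import pandas as pd

# Hypothetical salaries with one obviously extreme value.
salaries = pd.Series([42000, 45000, 47000, 50000, 52000, 55000, 900000])

# IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = salaries[(salaries < lower_fence) | (salaries > upper_fence)]

# Z-score method: flag values far from the mean in standard-deviation units.
z_scores = (salaries - salaries.mean()) / salaries.std(ddof=0)
z_cutoff = 2  # assumed cutoff for this small sample
z_outliers = salaries[z_scores.abs() > z_cutoff]

print(iqr_outliers.tolist(), z_outliers.tolist())
```

The IQR method is less sensitive to the outlier itself, because a single extreme value inflates the mean and standard deviation used by the Z-score.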
Standardization & Validation
• Standardize emails (e.g., lowercase)
• Validate formats using regex
• Code:
    df['Email'] = df['Email'].str.lower()
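The regex validation step is not shown on the slide, so the sketch below is one possible version. The pattern is deliberately simple and is an assumption, not a complete email validator; the addresses and the `EmailValid` column name are hypothetical.

```python
import pandas as pd

# Hypothetical addresses: stray whitespace, mixed case, and a missing domain suffix.
df = pd.DataFrame({
    "Email": [" Asha@Example.COM", "ravi@example", "meera@example.org"],
})

# Standardize: trim whitespace and convert to lowercase.
df["Email"] = df["Email"].str.strip().str.lower()

# Validate: simple "name@domain.tld" shape (assumed pattern, intentionally loose).
email_pattern = r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"
df["EmailValid"] = df["Email"].str.fullmatch(email_pattern)

print(df)
```

Standardizing before validating matters here: the pattern only accepts lowercase letters, so the first address passes only after it has been cleaned.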
Summary
• Clean data = reliable analysis
• Always document your changes
• Use tools: Python (Pandas), Power BI, Excel, OpenRefine
Data Transforming with Examples
• Overview of common data transformation techniques with examples
What is Data Transformation?
• Data transformation is the process of converting data into a format suitable for analysis.
• It includes normalization, standardization, encoding, aggregation, and related techniques.
Types of Data Transformation
• Normalization
• Standardization
• Encoding (Label, One-Hot)
• Aggregation
• Binning
• Log Transformation
• Pivoting/Unpivoting
• Datetime Extraction
Normalization
• Normalization scales numerical values into a specific range (usually [0, 1]) to bring all features to the same scale.
• Min-Max formula: X' = (X - X_min) / (X_max - X_min)
• Example:
  - Original: [10, 20, 30]
  - Min-Max Normalized: [(10-10)/(30-10), (20-10)/(30-10), (30-10)/(30-10)] → [0.0, 0.5, 1.0]
Normalization Example
• Original: [150, 160, 170, 180]
• Min-Max Normalized: [0.0, 0.33, 0.67, 1.0]
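The min-max examples above can be reproduced with a few lines of NumPy. `min_max_normalize` is a hypothetical helper name; returning zeros for a constant column is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    arr = np.asarray(values, dtype=float)
    value_range = arr.max() - arr.min()
    if value_range == 0:
        # All values are equal; return zeros rather than divide by zero (assumed convention).
        return np.zeros_like(arr)
    return (arr - arr.min()) / value_range

first = min_max_normalize([10, 20, 30])
second = min_max_normalize([150, 160, 170, 180])

print(first)
print(second.round(2))
```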
Standardization Example
• Standardization transforms data to have a mean of 0 and a standard deviation of 1 using the Z-score.
• Original Scores: [55, 85, 75, 65]
• Z = (X - μ) / σ
• Step 1. Calculate the mean (μ) of the dataset.
  - Sum the values: 55 + 85 + 75 + 65 = 280
  - Divide by the number of values: μ = 280 / 4 = 70
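The remaining steps of the Z-score calculation can be carried through in NumPy as sketched below. It assumes the population standard deviation (dividing by N, `ddof=0`); using the sample standard deviation (`ddof=1`) would give slightly smaller Z-scores.

```python
import numpy as np

scores = np.array([55, 85, 75, 65], dtype=float)

# Step 1: mean of the dataset (280 / 4 = 70).
mu = scores.mean()

# Step 2: population standard deviation (assumption: divide by N, not N - 1).
# Squared deviations are 225, 225, 25, 25, so the variance is 500 / 4 = 125.
sigma = scores.std(ddof=0)

# Step 3: apply Z = (X - mu) / sigma to every score.
z_scores = (scores - mu) / sigma

print(mu, round(sigma, 2))
print(z_scores.round(2))
```

After the transformation the scores have mean 0 and standard deviation 1, which is the property standardization is meant to guarantee.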
Label Encoding
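• Converts categorical labels into integers. Best suited to categories with an ordinal relationship, because the integer codes imply an order.
• Example:
  - Gender: ['Male', 'Female', 'Female', 'Male'] → [1, 0, 0, 1] (Female = 0, Male = 1; for a two-category variable like this the order is arbitrary)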
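A minimal Pandas sketch of label encoding. `pd.factorize` with `sort=True` assigns codes in alphabetical order, which reproduces the Gender example; the "Size" values and their order are hypothetical, added to show how an explicit ordinal order can be supplied.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Female", "Male"])

# Alphabetical codes: Female -> 0, Male -> 1.
gender_codes, gender_categories = pd.factorize(gender, sort=True)

# Hypothetical ordinal example: state the order explicitly instead of relying on the alphabet.
size = pd.Categorical(
    ["Low", "High", "Medium"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
size_codes = size.codes  # Low -> 0, Medium -> 1, High -> 2

print(gender_codes.tolist(), list(gender_categories))
print(size_codes.tolist())
```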
Label & One-Hot Encoding
• Label Encoding: ['Male', 'Female'] → [1, 0]
One-Hot Encoding
• Creates binary columns for each category. Useful when categories are nominal (no order).
• Example:
  - Color: ['Red', 'Green', 'Blue'] → Red: [1,0,0], Green: [0,1,0], Blue: [0,0,1]
  - One-Hot Encoding: ['Red', 'Blue'] → [1, 0], [0, 1]
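The Color example can be sketched with `pd.get_dummies`. Pandas orders the new columns alphabetically, so the columns are reordered here (an extra step added for this illustration) to match the Red, Green, Blue layout used above.

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue"], name="Color")

# One binary column per category; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(colors, dtype=int)

# Reorder the columns to match the Red, Green, Blue order of the example.
one_hot = one_hot[["Red", "Green", "Blue"]]

print(one_hot)
```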
Binning (Discretization)
• Converts continuous data into discrete bins. Helps reduce the impact of outliers.
• Example:
  - Age: [5, 12, 37, 45, 67] → Bins: Child (0–12), Adult (13–59), Senior (60+)
  - Result: [Child, Child, Adult, Adult, Senior]
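The age bins above map directly onto `pd.cut`. The bin edges below are one way to express the stated ranges for whole-number ages: intervals are closed on the right, so 12 falls in Child and 59 in Adult.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 12, 37, 45, 67])

# Edges for Child (0-12), Adult (13-59), Senior (60+); np.inf leaves the top bin open-ended.
bin_edges = [0, 12, 59, np.inf]
bin_labels = ["Child", "Adult", "Senior"]

# include_lowest=True makes the first bin include an age of exactly 0.
age_group = pd.cut(ages, bins=bin_edges, labels=bin_labels, include_lowest=True)

print(age_group.tolist())
```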
Log Transformation
• Reduces skewness in highly skewed data. Useful when data spans multiple orders of magnitude.
• Example:
  - Salary: [1000, 10000, 100000] → log10(Salary): [3.0, 4.0, 5.0]
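The salary example in NumPy. Note that the logarithm is undefined for zero and negative values, so this sketch assumes all inputs are strictly positive.

```python
import numpy as np

salary = np.array([1000, 10000, 100000], dtype=float)

# Base-10 log: each factor of 10 in salary becomes a step of 1.
log_salary = np.log10(salary)

print(log_salary)
```

The original values span two orders of magnitude, while the transformed values are evenly spaced, which is why the transformation helps with right-skewed data.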
Aggregation
• Summarizes data by group (e.g., mean, sum, count). Often used in grouped analysis or dashboards.
• Example:
  - Sales by Region: Region A: [100, 200], Region B: [300, 400]
  - Sum: A: 300, B: 700
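The Sales by Region example as a Pandas `groupby`. The column names are assumptions chosen to match the wording of the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Sales": [100, 200, 300, 400],
})

# Total sales per region.
total_by_region = sales.groupby("Region")["Sales"].sum()

# Several aggregates at once: sum, mean, and count per region.
summary = sales.groupby("Region")["Sales"].agg(["sum", "mean", "count"])

print(total_by_region)
print(summary)
```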
Pivoting
• Converts row-based data to column-based format. Often used to restructure data for analysis.
• Example:
  Product | Month | Sales
  A       | Jan   | 100
  → Pivot to have Months as columns with Sales as values
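A sketch of pivoting and its inverse in Pandas. Only the first row (A, Jan, 100) comes from the example above; the other three rows are hypothetical, added so that the pivoted table has more than one cell.

```python
import pandas as pd

# Long (row-based) format. Rows after the first are hypothetical.
long_df = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 120, 90, 110],
})

# Pivot: one row per Product, one column per Month, Sales as the cell values.
wide_df = long_df.pivot(index="Product", columns="Month", values="Sales")

# Unpivot (melt): back to one row per Product/Month pair.
long_again = wide_df.reset_index().melt(
    id_vars="Product", var_name="Month", value_name="Sales"
)

print(wide_df)
print(long_again)
```

`pivot` requires each Product/Month pair to appear only once; when pairs repeat, `pivot_table` with an aggregation function is the usual alternative.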
Datetime Extraction
• Extracts components such as year, month, and day from datetime values. Useful for time series analysis.
• Example:
  - Date: '2025-08-07' → Year: 2025, Month: 8, Day: 7, Day of Week: Thursday
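The same extraction using the Pandas `.dt` accessor. The output column names are assumptions chosen to mirror the example.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2025-08-07"]))

# Each .dt attribute pulls one component out of every datetime in the Series.
parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
    "DayOfWeek": dates.dt.day_name(),
})

print(parts)
```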
Power BI: Data Transformation
• In Power BI, data can be transformed using Power Query and DAX functions before it is visualized.
• Example:
  Sales Data:
  | Date       | Sales |
  |------------|-------|
  | 2025-01-01 | 5000  |
• Transformations:
  - Extract Year from Date
  - Apply log on Sales
  - Group by Month
Power BI Example
• Data:
  | Date       | Sales |
  |------------|-------|
  | 2025-01-01 | 5000  |
• Transformations:
  - Extract Year/Month
  - Group by Month
  - Apply Log to Sales
• Useful for trend analysis and normalization
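For comparison, the three transformations listed above can be sketched in Pandas. This is not Power Query or DAX code; it is only a Python analogue of the same steps. Only the first row (2025-01-01, 5000) comes from the example; the other rows are hypothetical, and base-10 is an assumed choice since the log base is not specified.

```python
import numpy as np
import pandas as pd

# First row is from the example above; the remaining rows are hypothetical.
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-01", "2025-01-15", "2025-02-01"]),
    "Sales": [5000, 7000, 6000],
})

# Extract Year/Month from Date.
sales["Year"] = sales["Date"].dt.year
sales["Month"] = sales["Date"].dt.month

# Apply log to Sales (base 10 assumed).
sales["LogSales"] = np.log10(sales["Sales"])

# Group by Month (together with Year, so months from different years stay separate).
monthly_sales = sales.groupby(["Year", "Month"])["Sales"].sum()

print(monthly_sales)
```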