Assi2_DSBDA
Importing all the required libraries
#used for data manipulation
import pandas as pd
#provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays
import numpy as np
#SimpleImputer is used for handling missing values in a dataset
from sklearn.impute import SimpleImputer
#The LabelEncoder is used for encoding categorical variables into numerical values.
from sklearn.preprocessing import LabelEncoder
#Z-score is a measure of how many standard deviations a data point is from the mean.
from scipy.stats import zscore, skew, shapiro, probplot
import matplotlib.pyplot as plt # Import matplotlib.pyplot
import seaborn as sns
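LabelEncoder is imported above but is not used further in this script; as a quick illustration on made-up data (the grade values below are hypothetical, not from academic.csv):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
grades = ['B', 'A', 'C', 'A']        # hypothetical categorical column
encoded = le.fit_transform(grades)   # classes are sorted alphabetically, so A=0, B=1, C=2
print(list(encoded))                 # [1, 0, 2, 0]
print(list(le.classes_))             # ['A', 'B', 'C']
```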
# Load the dataset
data = pd.read_csv("/Users/apple/Downloads/BVCOEW/TE SEM 6/DSBDA Practicals/academic.csv")
Check for missing values and display them
print("Missing values before handling: ")
missing_values = data.isnull().sum() #isnull() flags each missing value (as a Boolean) and sum() totals the missing values per column
print("Missing Values:")
print(missing_values)
for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head()) #head() returns the first 5 values of the column
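To make the isnull().sum() step concrete, here is a minimal sketch on a small made-up DataFrame (the 'Name' and 'Fees' columns are hypothetical, not the real academic.csv data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column
df = pd.DataFrame({
    'Name': ['Asha', None, 'Ravi'],
    'Fees': [5000.0, 6000.0, np.nan],
})
missing = df.isnull().sum()   # one count of missing values per column
print(missing['Name'])        # 1
print(missing['Fees'])        # 1
```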
Handling, i.e. imputing, the missing values with various strategies:
Handling Categorical with Mode:
#The SimpleImputer handles missing values in a dataset by filling them with a specified strategy (mean, median, most_frequent).
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent') #handle strings with mode
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
#fit_transform calculates the most frequent value of each categorical column in data_categorical and then replaces missing values with these calculated values.
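A minimal sketch of most_frequent imputation on a made-up categorical column (the 'Branch' column and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical 'Branch' column with one missing value
df = pd.DataFrame({'Branch': ['CS', 'IT', np.nan, 'CS']})
imp = SimpleImputer(strategy='most_frequent')
df[['Branch']] = imp.fit_transform(df[['Branch']])
print(df['Branch'].tolist())  # ['CS', 'IT', 'CS', 'CS'] -- the mode 'CS' fills the gap
```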
Handling Numeric with Mode
#handle_missing_values = SimpleImputer(strategy='most_frequent') #handle numeric with mode
data_numeric = data.select_dtypes(include='number') #creates a new DataFrame named data_numeric, selecting only the numeric columns
#data[data_numeric.columns] = handle_missing_values.fit_transform(data_numeric) #for most_frequent on numeric columns
#fit_transform calculates the most frequent value of each numeric column and then replaces missing values with these calculated values.
Handling with Mean
#handle_missing_values_numeric_mean = SimpleImputer(strategy='mean') #handle numeric with mean
#data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
#fit_transform calculates the mean of each numeric column and then replaces missing values with these calculated values.
Handling with Median
handle_missing_values_numeric_median = SimpleImputer(strategy='median') #handle numeric with median
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
#fit_transform calculates the median of each numeric column and then replaces missing values with these calculated values.
Display the data after handling missing values:
print("\nData after handling missing values:")
print(data)
Now handling outliers (an outlier is an extreme value that deviates from the general pattern or distribution of the data).
A z-score tells you how far a particular data point is from the average (or mean) of a group of data points, expressed in terms of standard deviations.
#Calculate Z-Scores:
z_scores = zscore(data.select_dtypes(include='number'), axis=0) #axis=0 indicates that z-scores should be calculated along columns
Outliers are data points that are significantly different from the majority of the other data points in a set.
#Identify Outliers:
outliers = (z_scores > 3) | (z_scores < -3)
The mask method is used to replace values in the DataFrame based on a condition.
#Mask Outliers in the DataFrame:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
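The zscore/mask pipeline above can be sketched on made-up data (the 'Marks' column is hypothetical). Note that with n points the largest possible |z| is (n-1)/sqrt(n), so a z-score above 3 can only occur with roughly a dozen or more observations:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical 'Marks' column: eleven typical values plus one extreme entry (500)
df = pd.DataFrame({'Marks': [58.0, 59.0, 59.0, 60.0, 60.0, 60.0,
                             60.0, 61.0, 61.0, 61.0, 62.0, 500.0]})
z = zscore(df, axis=0)               # standard deviations from each column's mean
outliers = (z > 3) | (z < -3)        # Boolean mask flagging extreme values
masked = df.mask(outliers, np.nan)   # replace flagged values with NaN
print(masked['Marks'].tolist())      # the 500.0 becomes nan; all other values survive
```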
Data Transformation: Using Square root transformation
for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())
Calculate skewness before transformation
skew_before = data_no_outliers['Fees'].skew()
print(f"\nSkewness before transformation: {skew_before}")
# Apply square root transformation to 'Fees' column
#Applying np.sqrt to the 'Fees' column of data_no_outliers and storing the transformed values in a new column named 'Fees_sqrt'
data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])
Calculate skewness after square root transformation
skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()
print(f"\nSkewness after square root transformation: {skew_after_sqrt}")
In this case, the skewness before the transformation was -0.5783358295678959, indicating slight negative skewness, i.e. the distribution was already somewhat left-skewed. After applying the square root transformation, the skewness became more negative (-1.0555250171550188), shifting the distribution further to the left. This is expected: a square root transformation compresses large values and so reduces right (positive) skew, which makes it a poor fit for data that is already left-skewed. Reducing skewness is one step towards a more symmetric distribution; a perfectly normal distribution is not always necessary or achievable in practice, but making the distribution more symmetric and closer to normal can still be beneficial.
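For contrast, on right-skewed data a square root transformation does pull the skewness down. A minimal sketch with made-up values, chosen as perfect squares so the transformed column becomes exactly symmetric:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (perfect squares of 2..8)
fees = pd.Series([4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0])
print(fees.skew())          # positive -> right-skewed
fees_sqrt = np.sqrt(fees)   # gives 2, 3, 4, 5, 6, 7, 8
print(fees_sqrt.skew())     # 0.0 -> perfectly symmetric after the transformation
```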
Displaying the transformed data:
print("\nTransformed Data: ")
#Focus on the output of Fees and Fees_sqrt (the data is transformed from a large number to its square root for easier understanding and handling)
print(data_no_outliers)
Plot histogram and Q-Q plot after square root transformation
plt.figure(figsize=(12, 6)) # Create a new figure 12 inches wide and 6 inches tall
plt.subplot(1, 2, 1) # Create a subplot grid with 1 row and 2 columns and select the first (leftmost) subplot
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True) # Seaborn histogram of the distribution of the 'Fees_sqrt' column
plt.title('Histogram of Square Root-transformed Fees')
plt.subplot(1, 2, 2)
probplot(data_no_outliers['Fees_sqrt'], dist="norm", plot=plt) # Use probplot directly
plt.title('Q-Q Plot of Square Root-transformed Fees')
plt.show()