Assi2_DSBDA
Importing all the required libraries
#used for data manipulation
import pandas as pd
#provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays
import numpy as np
#SimpleImputer is used for handling missing values in a dataset
from sklearn.impute import SimpleImputer
#The LabelEncoder is used for encoding categorical variables into numerical values.
from sklearn.preprocessing import LabelEncoder
#Z-score is a measure of how many standard deviations a data point is from the mean.
from scipy.stats import zscore, skew, shapiro, probplot
import matplotlib.pyplot as plt # Import matplotlib.pyplot
import seaborn as sns
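LabelEncoder is imported above but is not used further in this script; as a quick illustration on made-up data (the grade values below are hypothetical, not from academic.csv):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
grades = ['B', 'A', 'C', 'A']        # hypothetical categorical column
encoded = le.fit_transform(grades)   # classes are sorted alphabetically, so A=0, B=1, C=2
print(list(encoded))                 # [1, 0, 2, 0]
print(list(le.classes_))             # ['A', 'B', 'C']
```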
# Load the dataset
data = pd.read_csv("/Users/apple/Downloads/BVCOEW/TE SEM 6/DSBDA Practicals/academic.csv")
Check for missing values and display them
print("Missing values before handling: ")
missing_values = data.isnull().sum() #isnull() flags each missing value (as a Boolean) and sum() totals the missing values per column
print("Missing Values:")
print(missing_values)
for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head()) #head() returns the first 5 values of the column
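To make the isnull().sum() step concrete, here is a minimal sketch on a small made-up DataFrame (the 'Name' and 'Fees' columns are hypothetical, not the real academic.csv data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column
df = pd.DataFrame({
    'Name': ['Asha', None, 'Ravi'],
    'Fees': [5000.0, 6000.0, np.nan],
})
missing = df.isnull().sum()   # one count of missing values per column
print(missing['Name'])        # 1
print(missing['Fees'])        # 1
```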
Handling, i.e. imputing, the missing values with various strategies:
Handling Categorical with Mode:
#The SimpleImputer handles missing values in a dataset by filling them with a specified strategy (mean, median, most_frequent).
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent') #handle strings with mode
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
#fit_transform calculates the most frequent value of each categorical column in data_categorical and then replaces missing values with these calculated values.
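A minimal sketch of most_frequent imputation on a made-up categorical column (the 'Branch' column and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical 'Branch' column with one missing value
df = pd.DataFrame({'Branch': ['CS', 'IT', np.nan, 'CS']})
imp = SimpleImputer(strategy='most_frequent')
df[['Branch']] = imp.fit_transform(df[['Branch']])
print(df['Branch'].tolist())  # ['CS', 'IT', 'CS', 'CS'] -- the mode 'CS' fills the gap
```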
Handling Numeric with Mode
#handle_missing_values = SimpleImputer(strategy='most_frequent') #handle numeric with mode
data_numeric = data.select_dtypes(include='number') #creates a new DataFrame named data_numeric, selecting only the numeric columns
#data[data_numeric.columns] = handle_missing_values.fit_transform(data_numeric) #for most_frequent on numeric columns
#fit_transform calculates the most frequent value of each numeric column and then replaces missing values with these calculated values.
Handling with Mean
#handle_missing_values_numeric_mean = SimpleImputer(strategy='mean') #handle numeric with mean
#data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
#fit_transform calculates the mean of each numeric column and then replaces missing values with these calculated values.
Handling with Median
handle_missing_values_numeric_median = SimpleImputer(strategy='median') #handle numeric with median
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
#fit_transform calculates the median of each numeric column and then replaces missing values with these calculated values.
Display the data after handling missing values:
print("\nData after handling missing values:")
print(data)
Now handling outliers (an outlier is an extreme value that deviates from the general pattern or distribution of the data).
A z-score tells you how far a particular data point is from the average (or mean) of a group of data points, expressed in terms of standard deviations.
#Calculate Z-Scores:
z_scores = zscore(data.select_dtypes(include='number'), axis=0) #axis=0 indicates that z-scores should be calculated along columns
Outliers are data points that are significantly different from the majority of the other data points in a set.
#Identify Outliers:
outliers = (z_scores > 3) | (z_scores < -3)
The mask method is used to replace values in the DataFrame based on a condition.
#Mask Outliers in the DataFrame:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
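The zscore/mask pipeline above can be sketched on made-up data (the 'Marks' column is hypothetical). Note that with n points the largest possible |z| is (n-1)/sqrt(n), so a z-score above 3 can only occur with roughly a dozen or more observations:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical 'Marks' column: eleven typical values plus one extreme entry (500)
df = pd.DataFrame({'Marks': [58.0, 59.0, 59.0, 60.0, 60.0, 60.0,
                             60.0, 61.0, 61.0, 61.0, 62.0, 500.0]})
z = zscore(df, axis=0)               # standard deviations from each column's mean
outliers = (z > 3) | (z < -3)        # Boolean mask flagging extreme values
masked = df.mask(outliers, np.nan)   # replace flagged values with NaN
print(masked['Marks'].tolist())      # the 500.0 becomes nan; all other values survive
```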
Data Transformation: Using Square root transformation
for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())
Calculate skewness before transformation
skew_before = data_no_outliers['Fees'].skew()
print(f"\nSkewness before transformation: {skew_before}")
# Apply square root transformation to 'Fees' column
#Applying np.sqrt to the 'Fees' column of data_no_outliers and storing the transformed values in a new column named 'Fees_sqrt'
data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])
Calculate skewness after square root transformation
skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()
print(f"\nSkewness after square root transformation: {skew_after_sqrt}")
In this case, the skewness before the transformation was -0.5783358295678959, indicating slight negative skewness, i.e. the distribution was already somewhat left-skewed. After applying the square root transformation, the skewness became more negative (-1.0555250171550188), shifting the distribution further to the left. This is expected: a square root transformation compresses large values and so reduces right (positive) skew, which makes it a poor fit for data that is already left-skewed. Reducing skewness is one step towards a more symmetric distribution; a perfectly normal distribution is not always necessary or achievable in practice, but making the distribution more symmetric and closer to normal can still be beneficial.
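For contrast, on right-skewed data a square root transformation does pull the skewness down. A minimal sketch with made-up values, chosen as perfect squares so the transformed column becomes exactly symmetric:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (perfect squares of 2..8)
fees = pd.Series([4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0])
print(fees.skew())          # positive -> right-skewed
fees_sqrt = np.sqrt(fees)   # gives 2, 3, 4, 5, 6, 7, 8
print(fees_sqrt.skew())     # 0.0 -> perfectly symmetric after the transformation
```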
Displaying the transformed data:
print("\nTransformed Data: ")
#Focus on the output of Fees and Fees_sqrt (the data is transformed from a large number to its square root for easier understanding and handling)
print(data_no_outliers)
Plot histogram and Q-Q plot after square root transformation
plt.figure(figsize=(12, 6)) # Create a new figure 12 inches wide and 6 inches tall
plt.subplot(1, 2, 1) # Create a subplot grid with 1 row and 2 columns and select the first (leftmost) subplot
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True) # Seaborn histogram of the distribution of the 'Fees_sqrt' column
plt.title('Histogram of Square Root-transformed Fees')
plt.subplot(1, 2, 2)
probplot(data_no_outliers['Fees_sqrt'], dist="norm", plot=plt) # Use probplot directly
plt.title('Q-Q Plot of Square Root-transformed Fees')
plt.show()