0% found this document useful (0 votes)

31 views21 pages

Mini Project2 DAV Answers - Jupyter Notebook

This document discusses data preprocessing steps for a Bigmart sales dataset in Python. It defines necessary libraries, loads the dataset, drops unnecessary columns, extracts the target label, performs encoding of categorical variables, handles missing data through imputation, splits the data into train and test sets, and applies linear regression to calculate RMSE. Standardization is also applied before another train-test split and linear regression. The goal is to clean, preprocess and analyze the dataset to build a predictive model for sales.

Uploaded by

Priscella Coc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

31 views21 pages

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Priscella Coc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Data Analysis and Visualization MPA-2

NITHIN RAJ

KISHORE KUMAR M

VISHNU VARADHAN REDDY

Define the necessary libraries (1 mark)

In [1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.linear_model import LinearRegression
from sklearn import metrics as mt
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,MinMaxScaler,MaxAbsScaler,MaxAbsScaler,RobustScaler,No
from sklearn.model_selection import train_test_split

Load the dataset into the dataframe (1 mark)

In [2]: df = pd.read_csv('BigmartSales.csv')

Drop the "Item_Identifier" and "Outlet_Identifier" columns (1 mark)

In [3]: # df.drop() drops the data in the dataframe-df.

# By Default it drops the row data.
# To drop the column data set the axis=1
In [4]: print('Columns in the dataset before dropping are: ',df.columns)
df = df.drop(['Item_Identifier','Outlet_Identifier'],axis=1)
print('Columns in the dataset after dropping are: ',df.columns)

Columns in the dataset before dropping are: Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visib
ility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier',
'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')
Columns in the dataset after dropping are: Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales'],
dtype='object')

Extract the target labels (1 mark)

In [5]: # The target label in a dataset is the output data.

# It changes based on the other features in the dataset

In [6]: target_label = df.Item_Outlet_Sales

target_label

Out[6]: 0 3735.1380
1 443.4228
2 2097.2700
3 732.3800
4 994.7052
...
8518 2778.3834
8519 549.2850
8520 1193.1136
8521 1845.5976
8522 765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

Replace the field "Item_Fat_Content" with numerical value (1 mark)

In [7]: # df.replace({'old_data':'new_data'}) replaces the old_data in a dataframe with the new_data provided as the key-value

In [8]: print('Before Replacing: ',df['Item_Fat_Content'].unique())

df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'Low Fat':0,'LF':0,'Regular':1,'reg':1,'low fat':0})
print('After Replacing: ',df['Item_Fat_Content'].unique())

Before Replacing: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']

After Replacing: [0 1]

Perform ordinal encoding of the "Item_Type", "Outlet_Type", "Outlet_Location_Type" and "Outlet_Type" field (1 mark)

In [9]: # Encoding is the process of transforming the categorical (discrete) features into ordinal integers.
# This is the preprocessing step to be done before using the dataset for ML model training

In [10]: ordEnc = OrdinalEncoder()

df['Item_Type'] = ordEnc.fit_transform(df['Item_Type'].values.reshape(-1, 1))
df['Item_Type']

Out[10]: 0 4.0
1 14.0
2 10.0
3 6.0
4 9.0
...
8518 13.0
8519 0.0
8520 8.0
8521 13.0
8522 14.0
Name: Item_Type, Length: 8523, dtype: float64
In [11]: df['Outlet_Type'] = ordEnc.fit_transform(df['Outlet_Type'].values.reshape(-1, 1))
df['Outlet_Type']

Out[11]: 0 1.0
1 2.0
2 1.0
3 0.0
4 1.0
...
8518 1.0
8519 1.0
8520 1.0
8521 2.0
8522 1.0
Name: Outlet_Type, Length: 8523, dtype: float64

In [12]: df['Outlet_Location_Type'] = ordEnc.fit_transform(df['Outlet_Location_Type'].values.reshape(-1, 1))

df['Outlet_Location_Type']

Out[12]: 0 0.0
1 2.0
2 0.0
3 2.0
4 2.0
...
8518 2.0
8519 1.0
8520 1.0
8521 2.0
8522 0.0
Name: Outlet_Location_Type, Length: 8523, dtype: float64
In [13]: df.isna().sum()

Out[13]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Imputation of "Outlet_Size" field with mode value (1 mark)

In [14]: # fillna() is the method used to place the custom values at the NaN in a dataframe of series

In [15]: print('The Mode of Outlet Size is: ',df['Outlet_Size'].mode())

The Mode of Outlet Size is: 0 Medium

Name: Outlet_Size, dtype: object

In [16]: df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')

In [17]: df.isna().sum()

Out[17]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Check for null values (1 mark)

In [18]: df.isnull().sum()

Out[18]: Item_Weight 1463

Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Imputation of "Item_Weight" field with mode value (1 mark)

In [19]: print('The Mode of Item Weight is: ',df['Item_Weight'].mode())

The Mode of Item Weight is: 0 12.15

Name: Item_Weight, dtype: float64
In [20]: df['Item_Weight'] = df['Item_Weight'].fillna(12.15)

In [21]: df.isna().sum()

Out[21]: Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

Display all field in the dataset using boxplot (1 mark)

In [22]: # Box plot is used to find the outliers present in a data set. Mostly used for a univariate analysis.
# Also can be applied to bivariate analysis having 1 numerical and 1 categorical data
# It is called as grouped boxplot
In [23]: plt.figure(figsize=(10,5))
sns.boxplot(df)
plt.xticks(rotation=90)
plt.title('Bigmart Sales Data')
plt.show()
Split the dataset into train and test(20%), apply Linear Regression and calculate RMSE value (1 mark)
In [24]: # train_test_split is the method in sklearn.model_selection
# It is used to create the training and testing data from a complete data
# It gets the parameters - input data, output data,
# test_size=the size of the data that has to be selected for the testing of the ML model
# it returns four values - xtrain,xtest,ytrain,ytest that are given to the ML model for training and testing

In [25]: df['Outlet_Size'] = ordEnc.fit_transform(df['Outlet_Size'].values.reshape(-1, 1))

X=df.drop('Item_Outlet_Sales',axis=1)
Y=df['Item_Outlet_Sales']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.2,)
# Create and fit the linear regression model
model = LinearRegression()
model.fit(xtrain, ytrain)

# Make predictions on the test set
ypred = model.predict(xtest)

# Calculate RMSE
rmse1 = math.sqrt(mt.mean_squared_error(ytest, ypred))

print(f"Root Mean Squared Error (RMSE): {rmse1}")

Root Mean Squared Error (RMSE): 1177.361349688933

Apply StandardScaller and split the dataset into train and test(20%) (1 mark)

In [26]: # StandardScaler standardize features by removing the mean and scaling to unit variance.
# Standardization of a dataset is a common requirement for many machine learning estimators:
# they might behave badly if the individual features do not more or less look like standard normally distributed data
In [27]: sc = StandardScaler()
df_sc = sc.fit_transform(df)
df1 = pd.DataFrame(df_sc)

df1.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

X1=df1.drop('Item_Outlet_Sales',axis=1)
Y1=df1['Item_Outlet_Sales']
x1train,x1test,y1train,y1test=train_test_split(X1,Y,test_size=0.2)
# Create and fit the linear regression model
model1 = LinearRegression()
model1.fit(x1train, y1train)

Out[27]: LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Display all field in the dataset using boxplot (1 mark)

In [28]: plt.figure(figsize=(10,5))
sns.boxplot(df1)
plt.xticks(rotation=90)
plt.title('Bigmart Sales Data')
plt.show()
Apply Linear Regression and calculate RMSE value (1 mark)
In [29]: # Make predictions on the test set
y1pred = model1.predict(x1test)

# Calculate RMSE
rmse2 = math.sqrt(mt.mean_squared_error(y1test, y1pred))

print(f"Root Mean Squared Error (RMSE): {rmse2}")

Root Mean Squared Error (RMSE): 1161.6406081768139

Apply MinMaxScaler, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [30]: # MinMaxScaler Transform features by scaling each feature to a given range.

# This estimator scales and translates each feature individually such that
# it is in the given range on the training set, e.g. between zero and one.
# This transformation is often used as an alternative to zero mean, unit variance scaling.
In [31]: mmsc = MinMaxScaler()
df_mmsc = mmsc.fit_transform(df)
df2 = pd.DataFrame(df_mmsc)

df2.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

X2=df2.drop('Item_Outlet_Sales',axis=1)
Y2=df2['Item_Outlet_Sales']
x2train,x2test,y2train,y2test=train_test_split(X2,Y,test_size=0.2)
# Create and fit the linear regression model
model2 = LinearRegression()
model2.fit(x2train, y2train)
# Make predictions on the test set
y2pred = model2.predict(x2test)

# Calculate RMSE
rmse3 = math.sqrt(mt.mean_squared_error(y2test, y2pred))

print(f"Root Mean Squared Error (RMSE): {rmse3}")

Root Mean Squared Error (RMSE): 1176.5289257439433

Apply RobustScaler,Split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [32]: # RobustScaler scales features using statistics that are robust to outliers.
# This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile R
# The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
In [33]: rsc = RobustScaler()
df_rsc = rsc.fit_transform(df)
dfr = pd.DataFrame(df_rsc)

dfr.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xr=dfr.drop('Item_Outlet_Sales',axis=1)
Yr=dfr['Item_Outlet_Sales']
xrtrain,xrtest,yrtrain,yrtest=train_test_split(Xr,Y,test_size=0.2)
# Create and fit the linear regression model
modelr = LinearRegression()
modelr.fit(xrtrain, yrtrain)
# Make predictions on the test set
yrpred = modelr.predict(xrtest)

# Calculate RMSE
rmse4 = math.sqrt(mt.mean_squared_error(yrtest, yrpred))

print(f"Root Mean Squared Error (RMSE): {rmse4}")

Root Mean Squared Error (RMSE): 1143.8487793222237

Apply MaxAbsScaler, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [34]: # MaxAbsScaler scales each feature by its maximum absolute value.

# This estimator scales and translates each feature individually such that
# the maximal absolute value of each feature in the training set will be 1.0.
# It does not shift/center the data, and thus does not destroy any sparsity.
# MaxAbsScaler doesn’t reduce the effect of outliers; it only linearily scales them down.
In [35]: masc = MaxAbsScaler()
df_masc = masc.fit_transform(df)
dfa = pd.DataFrame(df_masc)

dfa.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xa=dfa.drop('Item_Outlet_Sales',axis=1)
Ya=dfa['Item_Outlet_Sales']
xatrain,xatest,yatrain,yatest=train_test_split(Xa,Y,test_size=0.2)
# Create and fit the linear regression model
modela = LinearRegression()
modela.fit(xatrain, yatrain)
# Make predictions on the test set
yapred = modela.predict(xatest)

# Calculate RMSE
rmse5 = math.sqrt(mt.mean_squared_error(yatest, yapred))

print(f"Root Mean Squared Error (RMSE): {rmse5}")

Root Mean Squared Error (RMSE): 1195.9232136536114

Apply Normalizer, split the dataset into train and test(20%), apply LinearRegression and calculate RMSE (1 mark)

In [36]: # Normalizer normalizes samples individually to unit norm.

# Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of oth
# so that its norm (l1, l2 or inf) equals one.
In [37]: nsc = Normalizer()
df_nsc = masc.fit_transform(df)
dfn = pd.DataFrame(df_nsc)

dfn.columns=['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
'Item_MRP', 'Outlet_Establishment_Year', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales']

Xn=dfn.drop('Item_Outlet_Sales',axis=1)
Yn=dfn['Item_Outlet_Sales']
xntrain,xntest,yntrain,yntest=train_test_split(Xn,Y,test_size=0.2)
# Create and fit the linear regression model
modeln = LinearRegression()
modeln.fit(xntrain, yntrain)
# Make predictions on the test set
ynpred = modeln.predict(xntest)

# Calculate RMSE
rmse6 = math.sqrt(mt.mean_squared_error(yntest, ynpred))

print(f"Root Mean Squared Error (RMSE): {rmse6}")

Root Mean Squared Error (RMSE): 1218.0003678085768

Define a function valuelabel to place the legend of each bar in the histogram (1 mark)
In [38]: def valuelabel(ax, spacing=3):

# For each bar: Place a label
for rect in ax.patches:
# Get X and Y placement of label from rect.
y_value = rect.get_height()
x_value = rect.get_x() + rect.get_width() / 2

# Number of points between bar and label
space = spacing
# Vertical alignment for positive values
va = 'bottom'

# If value of bar is negative: Place label below bar
if y_value < 0:
# Invert space to place label below
space *= -1
# Vertically align label at top
va = 'top'

# Use Y value as label and format number with one decimal place
label = "{:.1f}".format(y_value)

# Create annotation
ax.annotate(
label, # Use `label` as label
(x_value, y_value), # Place label at end of the bar
xytext=(0, space), # Vertically shift label by `space`
textcoords="offset points", # Interpret `xytext` as offset in points
ha='center', # Horizontally center label
va=va) # Vertically align label differently for
# positive and negative values.

Plot a histogram to display the RMSE value of each scaler (1 mark)

In [39]: rmses = [rmse1,rmse2,rmse3,rmse4,rmse5,rmse6]
rmse_Series = pd.Series(rmses)
labels = ['rmse1','rmse2','rmse3','rmse4','rmse5','rmse6']
# Creating histogram
plt.figure(figsize=(10,5))
ax = rmse_Series.plot(kind='bar')
ax.set_xticklabels(labels)
valuelabel(ax)
# Show plot
plt.show()

07 - Combustion - Optimisation PDF
100% (1)
07 - Combustion - Optimisation PDF
90 pages
Map of The GD&T World
No ratings yet
Map of The GD&T World
2 pages
Forklift Battery Maintenance Guide
No ratings yet
Forklift Battery Maintenance Guide
3 pages
Supermart Grocery Sales - Retail Analytics Dataset (Finance Analyst)
No ratings yet
Supermart Grocery Sales - Retail Analytics Dataset (Finance Analyst)
19 pages
Cbse 10th Bio Atom Bomb Free
No ratings yet
Cbse 10th Bio Atom Bomb Free
6 pages
Pipe Support Span Chart
No ratings yet
Pipe Support Span Chart
1 page
Drip Irrigation Pipes
No ratings yet
Drip Irrigation Pipes
8 pages
Data - Analytics Lab - Manual JNTUH R22 Regulation
No ratings yet
Data - Analytics Lab - Manual JNTUH R22 Regulation
26 pages
CODE
No ratings yet
CODE
24 pages
Project 12 Big Mart Sales Prediction
No ratings yet
Project 12 Big Mart Sales Prediction
15 pages
Experiment No 11
No ratings yet
Experiment No 11
19 pages
House Price Prediction Using Machine Learning in Python
No ratings yet
House Price Prediction Using Machine Learning in Python
13 pages
MATLAB Solution To Microwave Engineering Pozar 4th Ed. Example 1.5
No ratings yet
MATLAB Solution To Microwave Engineering Pozar 4th Ed. Example 1.5
5 pages
Machine Learning Record VR19
No ratings yet
Machine Learning Record VR19
46 pages
SSF Plastics Baseline Draft For Review
No ratings yet
SSF Plastics Baseline Draft For Review
19 pages
PRJ Sales Forecasting
No ratings yet
PRJ Sales Forecasting
22 pages
Ass 1 ML
No ratings yet
Ass 1 ML
21 pages
ML Lab1
No ratings yet
ML Lab1
11 pages
BigMart Sales Data Analysis
No ratings yet
BigMart Sales Data Analysis
16 pages
Week 01.a
No ratings yet
Week 01.a
4 pages
Data Cleaning
No ratings yet
Data Cleaning
7 pages
Oddstudents
No ratings yet
Oddstudents
35 pages
Pandas
No ratings yet
Pandas
20 pages
Machine Exercise 3
No ratings yet
Machine Exercise 3
22 pages
Lab1 Features Selections-Class-GI2
No ratings yet
Lab1 Features Selections-Class-GI2
25 pages
Porter Case Study
No ratings yet
Porter Case Study
153 pages
ML Complete Notes Hridoy
No ratings yet
ML Complete Notes Hridoy
5 pages
ML LAB Manual-1
No ratings yet
ML LAB Manual-1
33 pages
BigMart PDF
100% (1)
BigMart PDF
42 pages
Mini Project (BDA) Output
No ratings yet
Mini Project (BDA) Output
5 pages
Data Preprocessing Example Programs1
No ratings yet
Data Preprocessing Example Programs1
9 pages
Data Preprocessing 2
No ratings yet
Data Preprocessing 2
5 pages
Practice Questions2
No ratings yet
Practice Questions2
2 pages
Aerofit Case Study
No ratings yet
Aerofit Case Study
16 pages
Big Sales Mart Final Script PDF
No ratings yet
Big Sales Mart Final Script PDF
36 pages
Lab File
No ratings yet
Lab File
96 pages
HET Ka FML
No ratings yet
HET Ka FML
13 pages
DS Food
No ratings yet
DS Food
23 pages
Deep Learning Assignments
No ratings yet
Deep Learning Assignments
13 pages
Class 10 - DECEMBER PREBOARD EXAM
No ratings yet
Class 10 - DECEMBER PREBOARD EXAM
11 pages
Data Science Tutorial 1686911993
No ratings yet
Data Science Tutorial 1686911993
41 pages
Machine Learning Lab Assignment 2
No ratings yet
Machine Learning Lab Assignment 2
23 pages
ML Lab Manual 2025-2
No ratings yet
ML Lab Manual 2025-2
35 pages
Dmdw-Lab Manual
No ratings yet
Dmdw-Lab Manual
61 pages
5-2a Dataframes Column Operations - Instruction
No ratings yet
5-2a Dataframes Column Operations - Instruction
2 pages
Lte Users Guide v0.1 en
No ratings yet
Lte Users Guide v0.1 en
11 pages
ML Practical 4D
No ratings yet
ML Practical 4D
11 pages
Project Paarth
No ratings yet
Project Paarth
21 pages
Customer Segmentation With K-Means Clustering and Visualization - Colab
No ratings yet
Customer Segmentation With K-Means Clustering and Visualization - Colab
3 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
Assignment 1
No ratings yet
Assignment 1
7 pages
Task 6
No ratings yet
Task 6
14 pages
EDS - Python Cheat Sheet
0% (1)
EDS - Python Cheat Sheet
3 pages
Data Pre Processing
No ratings yet
Data Pre Processing
2 pages
Exp 2 Data Preprocessing - Cleaning The Dataset Obtained From The UCI ML Repository
No ratings yet
Exp 2 Data Preprocessing - Cleaning The Dataset Obtained From The UCI ML Repository
9 pages
Practicals
No ratings yet
Practicals
42 pages
Data Clearning
No ratings yet
Data Clearning
7 pages
Certificate
No ratings yet
Certificate
25 pages
Dataframe
No ratings yet
Dataframe
19 pages
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
No ratings yet
Supermart Grocery Sales - Retail Analytics Dataset - (Data Analyst)
17 pages
Handle Missing Data in Real-Time
No ratings yet
Handle Missing Data in Real-Time
5 pages
House Price Prediction for Analysts
No ratings yet
House Price Prediction for Analysts
91 pages
Exp 01-B Feature Selection and Extraction
No ratings yet
Exp 01-B Feature Selection and Extraction
12 pages
Data Wrangling Python.
No ratings yet
Data Wrangling Python.
8 pages
Project 4: Final Project: Bigmart Sales Prediction: Chapter 1: Problem Statement
No ratings yet
Project 4: Final Project: Bigmart Sales Prediction: Chapter 1: Problem Statement
35 pages
Data Analysis: Data Preparation
No ratings yet
Data Analysis: Data Preparation
9 pages
BIG Mart Data Analyst Project
No ratings yet
BIG Mart Data Analyst Project
19 pages
Day7 PandasCoreFeatures
No ratings yet
Day7 PandasCoreFeatures
4 pages
Environmental Fact Sheet (# 33) Crude Palm Kernel Oil (CPKO)
No ratings yet
Environmental Fact Sheet (# 33) Crude Palm Kernel Oil (CPKO)
6 pages
LG - TV - LG Uj6500
100% (1)
LG - TV - LG Uj6500
37 pages
Delhivery Mani
No ratings yet
Delhivery Mani
79 pages
PP 11 - Bony Anatomy of The Hip
No ratings yet
PP 11 - Bony Anatomy of The Hip
14 pages
SCL/SCM Drive Installation Guide
No ratings yet
SCL/SCM Drive Installation Guide
50 pages
Mymms d2b Catalog
No ratings yet
Mymms d2b Catalog
12 pages
The Opportunity Cost of Using Excess Capacity
No ratings yet
The Opportunity Cost of Using Excess Capacity
8 pages
q14 SVC 052 Chaudhry r0
No ratings yet
q14 SVC 052 Chaudhry r0
5 pages
Polygon Shafts & Components Guide
No ratings yet
Polygon Shafts & Components Guide
6 pages
Anthotypes Explore The Darkroom in Your Garden and Make Photographs Using Plants 1466261005 9781466261006 - Compress
No ratings yet
Anthotypes Explore The Darkroom in Your Garden and Make Photographs Using Plants 1466261005 9781466261006 - Compress
100 pages
Stress Strain
No ratings yet
Stress Strain
17 pages
Pindell Dewey 1982
No ratings yet
Pindell Dewey 1982
34 pages
XPS FOAM - SquareEdge
No ratings yet
XPS FOAM - SquareEdge
4 pages
Science Teaching Reflection
No ratings yet
Science Teaching Reflection
2 pages
Golgi Apparatus Structure and Function Relationship
No ratings yet
Golgi Apparatus Structure and Function Relationship
3 pages
Transport 2 QP - Merged
No ratings yet
Transport 2 QP - Merged
11 pages
Randeberg 2007
No ratings yet
Randeberg 2007
11 pages
LKG GK Syllabus Whole Session
No ratings yet
LKG GK Syllabus Whole Session
6 pages
All About Kerogen
No ratings yet
All About Kerogen
7 pages
The Cellular Approach: Smart Energy Region Wunsiedel. Testbed For Smart Grid, Smart Metering and Smart Home Solutions
No ratings yet
The Cellular Approach: Smart Energy Region Wunsiedel. Testbed For Smart Grid, Smart Metering and Smart Home Solutions
6 pages
Weight-For-Age BOYS: 6 Months To 2 Years (Percentiles)
No ratings yet
Weight-For-Age BOYS: 6 Months To 2 Years (Percentiles)
1 page

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Mini Project2 DAV Answers - Jupyter Notebook

Uploaded by

Data Analysis and Visualization MPA-2

VISHNU VARADHAN REDDY

Define the necessary libraries (1 mark)

In [1]: import pandas as pd

Load the dataset into the dataframe (1 mark)

Drop the "Item_Identifier" and "Outlet_Identifier" columns (1 mark)

In [3]: # df.drop() drops the data in the dataframe-df.

Extract the target labels (1 mark)

In [5]: # The target label in a dataset is the output data.

In [6]: target_label = df.Item_Outlet_Sales

Replace the field "Item_Fat_Content" with numerical value (1 mark)

In [8]: print('Before Replacing: ',df['Item_Fat_Content'].unique())

Before Replacing: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']

In [10]: ordEnc = OrdinalEncoder()

In [12]: df['Outlet_Location_Type'] = ordEnc.fit_transform(df['Outlet_Location_Type'].values.reshape(-1, 1))

Out[13]: Item_Weight 1463

Imputation of "Outlet_Size" field with mode value (1 mark)

In [15]: print('The Mode of Outlet Size is: ',df['Outlet_Size'].mode())

The Mode of Outlet Size is: 0 Medium

In [16]: df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')

Out[17]: Item_Weight 1463

Check for null values (1 mark)

Out[18]: Item_Weight 1463

Imputation of "Item_Weight" field with mode value (1 mark)

In [19]: print('The Mode of Item Weight is: ',df['Item_Weight'].mode())

The Mode of Item Weight is: 0 12.15

Display all field in the dataset using boxplot (1 mark)

In [25]: df['Outlet_Size'] = ordEnc.fit_transform(df['Outlet_Size'].values.reshape(-1, 1))

Root Mean Squared Error (RMSE): 1177.361349688933

Display all field in the dataset using boxplot (1 mark)

Root Mean Squared Error (RMSE): 1161.6406081768139

In [30]: # MinMaxScaler Transform features by scaling each feature to a given range.

Root Mean Squared Error (RMSE): 1176.5289257439433

Root Mean Squared Error (RMSE): 1143.8487793222237

In [34]: # MaxAbsScaler scales each feature by its maximum absolute value.

Root Mean Squared Error (RMSE): 1195.9232136536114

In [36]: # Normalizer normalizes samples individually to unit norm.

Root Mean Squared Error (RMSE): 1218.0003678085768

Plot a histogram to display the RMSE value of each scaler (1 mark)

You might also like