Assignment 1

The document details the process of loading and preprocessing a dataset from a CSV file containing sales data from Big Mart. It includes steps for handling missing values, encoding categorical variables, and transforming the data for analysis. The final dataset consists of various features related to items and their sales, ready for further analysis or modeling.


import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Big Mart sales dataset
df1 = pd.read_csv("Datasets/bigmart.csv")

df1.head()

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility \
0           FDA15         9.30          Low Fat         0.016047
1           DRC01         5.92          Regular         0.019278
2           FDN15        17.50          Low Fat         0.016760
3           FDX07        19.20          Regular         0.000000
4           NCD19         8.93          Low Fat         0.000000

               Item_Type  Item_MRP Outlet_Identifier \
0                  Dairy  249.8092            OUT049
1            Soft Drinks   48.2692            OUT018
2                   Meat  141.6180            OUT049
3  Fruits and Vegetables  182.0950            OUT010
4              Household   53.8614            OUT013

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type \
0                       1999      Medium               Tier 1
1                       2009      Medium               Tier 3
2                       1999      Medium               Tier 1
3                       1998         NaN               Tier 3
4                       1987        High               Tier 3

         Outlet_Type  Item_Outlet_Sales
0  Supermarket Type1          3735.1380
1  Supermarket Type2           443.4228
2  Supermarket Type1          2097.2700
3      Grocery Store           732.3800
4  Supermarket Type1           994.7052

df1.shape

(8523, 12)

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Item_Identifier 8523 non-null object
1 Item_Weight 7060 non-null float64
2 Item_Fat_Content 8523 non-null object
3 Item_Visibility 8523 non-null float64
4 Item_Type 8523 non-null object
5 Item_MRP 8523 non-null float64
6 Outlet_Identifier 8523 non-null object
7 Outlet_Establishment_Year 8523 non-null int64
8 Outlet_Size 6113 non-null object
9 Outlet_Location_Type 8523 non-null object
10 Outlet_Type 8523 non-null object
11 Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB

df1.isnull().sum()

Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64

df1.isnull().sum() / df1.shape[0] * 100

Item_Identifier 0.000000
Item_Weight 17.165317
Item_Fat_Content 0.000000
Item_Visibility 0.000000
Item_Type 0.000000
Item_MRP 0.000000
Outlet_Identifier 0.000000
Outlet_Establishment_Year 0.000000
Outlet_Size 28.276428
Outlet_Location_Type 0.000000
Outlet_Type 0.000000
Item_Outlet_Sales 0.000000
dtype: float64
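The two checks above (missing counts and percentages) can be combined into one summary table; the following is a small convenience sketch added here for reference, not part of the original assignment:

# Combine missing-value counts and percentages in a single frame,
# keeping only the columns that actually have missing entries.
missing = pd.DataFrame({
    "count": df1.isnull().sum(),
    "percent": df1.isnull().sum() / len(df1) * 100,
})
missing[missing["count"] > 0]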

# Drop the identifier and outlet metadata columns that are not used in this analysis
df2 = df1.drop(["Item_Identifier", "Outlet_Identifier",
                "Outlet_Establishment_Year", "Outlet_Type"], axis=1)

df2.head()

   Item_Weight Item_Fat_Content  Item_Visibility              Item_Type \
0         9.30          Low Fat         0.016047                  Dairy
1         5.92          Regular         0.019278            Soft Drinks
2        17.50          Low Fat         0.016760                   Meat
3        19.20          Regular         0.000000  Fruits and Vegetables
4         8.93          Low Fat         0.000000              Household

   Item_MRP Outlet_Size Outlet_Location_Type  Item_Outlet_Sales
0  249.8092      Medium               Tier 1          3735.1380
1   48.2692      Medium               Tier 3           443.4228
2  141.6180      Medium               Tier 1          2097.2700
3  182.0950         NaN               Tier 3           732.3800
4   53.8614        High               Tier 3           994.7052

df2.isnull().sum()

Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Size 2410
Outlet_Location_Type 0
Item_Outlet_Sales 0
dtype: int64

df2.isnull().sum() / df2.shape[0] * 100

Item_Weight 17.165317
Item_Fat_Content 0.000000
Item_Visibility 0.000000
Item_Type 0.000000
Item_MRP 0.000000
Outlet_Size 28.276428
Outlet_Location_Type 0.000000
Item_Outlet_Sales 0.000000
dtype: float64

# Check the Item_Weight distribution for outliers before imputing
sns.boxplot(df2["Item_Weight"])

<Axes: >

# Fill missing Item_Weight values with the column mean
df2["Item_Weight"] = df2["Item_Weight"].fillna(df2["Item_Weight"].mean())
df2.head()

   Item_Weight Item_Fat_Content  Item_Visibility              Item_Type \
0         9.30          Low Fat         0.016047                  Dairy
1         5.92          Regular         0.019278            Soft Drinks
2        17.50          Low Fat         0.016760                   Meat
3        19.20          Regular         0.000000  Fruits and Vegetables
4         8.93          Low Fat         0.000000              Household

   Item_MRP Outlet_Size Outlet_Location_Type  Item_Outlet_Sales
0  249.8092      Medium               Tier 1          3735.1380
1   48.2692      Medium               Tier 3           443.4228
2  141.6180      Medium               Tier 1          2097.2700
3  182.0950         NaN               Tier 3           732.3800
4   53.8614        High               Tier 3           994.7052

df2.isnull().sum()
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Size 2410
Outlet_Location_Type 0
Item_Outlet_Sales 0
dtype: int64
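The boxplot drawn earlier is the usual check before choosing the fill statistic: if Item_Weight showed strong outliers, the median would be a more robust fill value than the mean. A hedged alternative sketch, added for illustration (the assignment keeps the mean-imputed df2):

# Alternative: rebuild the working frame and impute Item_Weight with the
# median, which is less sensitive to outliers than the mean.
df2_alt = df1.drop(["Item_Identifier", "Outlet_Identifier",
                    "Outlet_Establishment_Year", "Outlet_Type"], axis=1)
df2_alt["Item_Weight"] = df2_alt["Item_Weight"].fillna(df2_alt["Item_Weight"].median())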

df2["Outlet_Size"].value_counts()

Outlet_Size
Medium 2793
Small 2388
High 932
Name: count, dtype: int64

df2["Outlet_Size"] =
df2["Outlet_Size"].fillna(df2["Outlet_Size"].mode()[0])

df2.head()

   Item_Weight Item_Fat_Content  Item_Visibility              Item_Type \
0         9.30          Low Fat         0.016047                  Dairy
1         5.92          Regular         0.019278            Soft Drinks
2        17.50          Low Fat         0.016760                   Meat
3        19.20          Regular         0.000000  Fruits and Vegetables
4         8.93          Low Fat         0.000000              Household

   Item_MRP Outlet_Size Outlet_Location_Type  Item_Outlet_Sales
0  249.8092      Medium               Tier 1          3735.1380
1   48.2692      Medium               Tier 3           443.4228
2  141.6180      Medium               Tier 1          2097.2700
3  182.0950      Medium               Tier 3           732.3800
4   53.8614        High               Tier 3           994.7052

df2.isnull().sum()

Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Size 0
Outlet_Location_Type 0
Item_Outlet_Sales 0
dtype: int64
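Filling every missing Outlet_Size with the overall mode ("Medium") is the simplest option, but missing sizes may cluster by store type. A group-wise alternative sketch, added for illustration (it uses df1 because Outlet_Type was dropped from df2, and it assumes each Outlet_Type has at least one non-missing Outlet_Size):

# Most frequent Outlet_Size within each Outlet_Type ...
size_mode_by_type = df1.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode()[0])
# ... mapped back onto the rows whose Outlet_Size is missing.
outlet_size_filled = df1["Outlet_Size"].fillna(df1["Outlet_Type"].map(size_mode_by_type))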

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Encode the remaining categorical columns as integer codes
df2["Item_Fat_Content"] = le.fit_transform(df2["Item_Fat_Content"])
df2["Outlet_Size"] = le.fit_transform(df2["Outlet_Size"])
df2["Outlet_Location_Type"] = le.fit_transform(df2["Outlet_Location_Type"])

df2.head()

   Item_Weight  Item_Fat_Content  Item_Visibility              Item_Type \
0         9.30                 1         0.016047                  Dairy
1         5.92                 2         0.019278            Soft Drinks
2        17.50                 1         0.016760                   Meat
3        19.20                 2         0.000000  Fruits and Vegetables
4         8.93                 1         0.000000              Household

   Item_MRP  Outlet_Size  Outlet_Location_Type  Item_Outlet_Sales
0  249.8092            1                     0          3735.1380
1   48.2692            1                     2           443.4228
2  141.6180            1                     0          2097.2700
3  182.0950            1                     2           732.3800
4   53.8614            0                     2           994.7052
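Because the same LabelEncoder instance is reused for all three columns, its classes_ attribute only reflects the last fit (Outlet_Location_Type). A small sanity-check sketch, added here and not part of the assignment, to see which integer each category received (LabelEncoder assigns codes in alphabetical order of the labels):

# Mapping learned by the most recent fit (Outlet_Location_Type).
print(dict(zip(le.classes_, le.transform(le.classes_))))
# To inspect another column, fit a fresh encoder on it first, e.g.:
print(LabelEncoder().fit(df1["Item_Fat_Content"]).classes_)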

df2["Item_Type"].value_counts()

Item_Type
Fruits and Vegetables 1232
Snack Foods 1200
Household 910
Frozen Foods 856
Dairy 682
Canned 649
Baking Goods 648
Health and Hygiene 520
Soft Drinks 445
Meat 425
Breads 251
Hard Drinks 214
Others 169
Starchy Foods 148
Breakfast 110
Seafood 64
Name: count, dtype: int64

import category_encoders as ce

# Binary-encode Item_Type (16 categories) into a few 0/1 digit columns
BE = ce.BinaryEncoder(cols=["Item_Type"])

X = BE.fit_transform(df2["Item_Type"])

# Append the encoded columns and drop the original Item_Type column
df3 = pd.concat([df2, X], axis=1)
df3 = df3.drop("Item_Type", axis=1)

df3.head()

   Item_Weight  Item_Fat_Content  Item_Visibility  Item_MRP  Outlet_Size \
0         9.30                 1         0.016047  249.8092            1
1         5.92                 2         0.019278   48.2692            1
2        17.50                 1         0.016760  141.6180            1
3        19.20                 2         0.000000  182.0950            1
4         8.93                 1         0.000000   53.8614            0

   Outlet_Location_Type  Item_Outlet_Sales  Item_Type_0  Item_Type_1 \
0                     0          3735.1380            0            0
1                     2           443.4228            0            0
2                     0          2097.2700            0            0
3                     2           732.3800            0            0
4                     2           994.7052            0            0

   Item_Type_2  Item_Type_3  Item_Type_4
0            0            0            1
1            0            1            0
2            0            1            1
3            1            0            0
4            1            0            1
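Binary encoding represents the 16 Item_Type categories with only five 0/1 digit columns (Item_Type_0 through Item_Type_4), whereas plain one-hot encoding would add 16 indicator columns. A quick comparison sketch, added for reference (note that df2 still holds the original Item_Type strings because only df3 dropped that column):

# One-hot encode Item_Type for comparison with the binary encoding above.
one_hot = pd.get_dummies(df2["Item_Type"], prefix="Item_Type")
print(one_hot.shape)  # expected: (8523, 16), one indicator column per category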
