A09Ass02 - Jupyter Notebook

The document outlines an assignment for data wrangling using Python, focusing on handling missing values, outliers, and data transformations on an open-source dataset. It includes steps for importing libraries, reading a CSV file, checking for missing values, and performing statistical analysis on student performance data. The assignment also emphasizes documenting the approach taken for data transformations to improve understanding or distribution of the data.

4/4/24, 9:48 AM A09Ass02 - Jupyter Notebook

Name: Suryal D. Khirade

Roll No: T190424399
Assignment No: 02

Data Wrangling II

Perform the following operations using Python on any open source dataset (e.g. data.csv):
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following reasons: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution. Reason and document your approach properly.

Import all the required Python Libraries.

In [1]: import numpy as np

import pandas as pd
import random as rd
import seaborn as sns

In [19]: df = pd.read_csv("C:\\Users\\alisu\\Downloads\\archive (1)\\Student Perform

In [20]: df.describe()

Out[20]: math score reading score writing score

count 1000.00000 1000.000000 1000.000000

mean 66.08900 69.169000 68.054000

std 15.16308 14.600192 15.195657

min 0.00000 17.000000 10.000000

25% 57.00000 59.000000 57.750000

50% 66.00000 70.000000 69.000000

75% 77.00000 79.000000 79.000000

max 100.00000 100.000000 100.000000


In [21]: df.isnull().sum()

Out[21]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

Create a DataFrame from the dictionary

In [22]: data={'studentid':[i for i in range(1,101)],

'Age':[rd.randint(15,18) for i in range(100)],
'class':[rd.choice(['9 th','10 th' ,'11 th' ,'12 th']) for _ in range
,'attendence':[rd.uniform(10,100) for _ in range(100)],
'score':[rd.randint(30,100) for _ in range(100) ]
}

In [23]: df = pd.DataFrame(data)  # build the DataFrame from the dictionary first

df.to_csv('StudentPerformance', index=False)  # then save the new DataFrame

In [24]: df

Out[24]: studentid Age class attendence score

0 1 17 11 th 47.894979 50

1 2 18 11 th 46.373847 100

2 3 18 11 th 46.636907 76

3 4 16 11 th 25.126863 77

4 5 16 10 th 55.518510 41

... ... ... ... ... ...

95 96 17 10 th 34.682479 59

96 97 16 10 th 52.756471 87

97 98 15 10 th 72.759473 30

98 99 16 12 th 59.991836 82

99 100 15 12 th 35.561988 91

100 rows × 5 columns
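The synthetic table above can be reproduced along the following lines. This is only a sketch: the seed is an assumption added so that reruns give the same rows, the name `synthetic_df` is hypothetical, and the row count of 100 is taken from the output shown above.

```python
import random as rd

import pandas as pd

rd.seed(42)  # assumed seed, only so that reruns give the same table

n = 100  # row count taken from the "100 rows x 5 columns" output
data = {
    "studentid": list(range(1, n + 1)),
    "Age": [rd.randint(15, 18) for _ in range(n)],
    "class": [rd.choice(["9 th", "10 th", "11 th", "12 th"]) for _ in range(n)],
    "attendence": [rd.uniform(10, 100) for _ in range(n)],  # spelling kept from the notebook
    "score": [rd.randint(30, 100) for _ in range(n)],
}

synthetic_df = pd.DataFrame(data)  # hypothetical name, to avoid overwriting df
print(synthetic_df.shape)
```

Keeping the synthetic data in its own variable avoids clobbering the `df` that holds the student performance dataset.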

Load the Dataset into pandas dataframe.


In [31]: import pandas as pd

df = pd.read_csv("C:\\Users\\alisu\\Downloads\\archive (1)\\Student Perform
df.head()

Out[31]:
   gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree      standard                     none          72             72             74
1  female         group C                 some college      standard                completed          69             90             88
2  female         group B              master's degree      standard                     none          90             95             93
3    male         group A           associate's degree  free/reduced                     none          47             57             44
4    male         group C                 some college      standard                     none          76             78             75

In [32]: df.shape

Out[32]: (1000, 8)

In [33]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

In [34]: df.describe()

Out[34]: math score reading score writing score

count 1000.00000 1000.000000 1000.000000

mean 66.08900 69.169000 68.054000

std 15.16308 14.600192 15.195657

min 0.00000 17.000000 10.000000

25% 57.00000 59.000000 57.750000

50% 66.00000 70.000000 69.000000

75% 77.00000 79.000000 79.000000

max 100.00000 100.000000 100.000000


Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.

Data Preprocessing

In [35]: df.isnull().sum()

Out[35]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [36]: df.nunique()

Out[36]: gender 2
race/ethnicity 5
parental level of education 6
lunch 2
test preparation course 2
math score 81
reading score 72
writing score 77
dtype: int64
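The null counts and unique counts above show that this dataset is already clean. On a messier file, the same scan could be extended to catch inconsistent category labels, roughly as sketched below. The toy data, the helper `normalise_labels` and the `report` dictionary are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data with deliberately inconsistent labels and one missing value
toy = pd.DataFrame({
    "gender": ["female", "Female ", "male", "MALE", None],
    "lunch": ["standard", "standard", "free/reduced", " standard", "free/reduced"],
})

def normalise_labels(col: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and lower-case string labels; missing values stay missing."""
    return col.str.strip().str.lower()

missing = toy.isnull().sum()  # missing values per column

# Compare the number of distinct labels before and after normalisation.
# A drop in the count points to labels that differ only by case or spacing.
report = {}
for name in ["gender", "lunch"]:
    raw_count = toy[name].nunique()
    clean_count = normalise_labels(toy[name]).nunique()
    report[name] = (raw_count, clean_count)

print(missing)
print(report)
```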

In [37]: df["gender"].value_counts() #categorical column

Out[37]: gender
female 518
male 482
Name: count, dtype: int64

In [38]: df['gender'] = df['gender'].fillna('female')  # assign back instead of chained inplace=True
df.isnull().sum()

Out[38]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [39]: df['gender'].mode()  # most frequent category

Out[39]: 0 female
Name: gender, dtype: object


In [40]: df['race/ethnicity'].value_counts()

Out[40]: race/ethnicity
group C 319
group D 262
group B 190
group E 140
group A 89
Name: count, dtype: int64

In [41]: df['race/ethnicity'] = df['race/ethnicity'].fillna('group C')  # lowercase, matching the existing category label

df.isnull().sum()

Out[41]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [42]: df['lunch'].value_counts()

Out[42]: lunch
standard 645
free/reduced 355
Name: count, dtype: int64

In [43]: df['lunch'].mode()

Out[43]: 0 standard
Name: lunch, dtype: object

In [44]: df['lunch'].mode()[0]

Out[44]: 'standard'

In [45]: df['lunch'] = df['lunch'].fillna(df['lunch'].mode()[0])  # fill with the most frequent value
df.isnull().sum()

Out[45]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
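The mode-based filling used for the categorical columns above could be wrapped in one small helper, as in this sketch. The toy data and the function name `fill_with_mode` are hypothetical. The helper works on a copy and assigns each filled column back, rather than relying on `inplace=True` on a selected column.

```python
import pandas as pd

# Hypothetical toy data with missing categorical values
toy = pd.DataFrame({
    "lunch": ["standard", None, "free/reduced", "standard", None],
    "gender": ["female", "male", None, "female", "female"],
})

def fill_with_mode(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy in which missing values of each listed column are replaced by its mode."""
    out = frame.copy()
    for name in columns:
        most_common = out[name].mode()[0]  # mode() ignores missing values by default
        out[name] = out[name].fillna(most_common)
    return out

filled = fill_with_mode(toy, ["lunch", "gender"])
print(filled.isnull().sum())
```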


In [46]: df.isnull().sum()

Out[46]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [47]: sns.distplot(df['math score'])

C:\Users\alisu\AppData\Local\Temp\ipykernel_17132\2354272343.py:1: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

sns.distplot(df['math score'])

Out[47]: <Axes: xlabel='math score', ylabel='Density'>


In [48]: sns.kdeplot(df['reading score'])  # roughly bell-shaped; skewness is about -0.26, i.e. mildly left-skewed

Out[48]: <Axes: xlabel='reading score', ylabel='Density'>


In [49]: sns.distplot(df['writing score'],hist=False,)

C:\Users\alisu\AppData\Local\Temp\ipykernel_17132\3897196859.py:1: UserWarning:

`distplot` is a deprecated function and will be removed in seaborn v0.14.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).

sns.distplot(df['writing score'],hist=False,)

Out[49]: <Axes: xlabel='writing score', ylabel='Density'>

In [50]: df['math score'] = df['math score'].fillna(df['math score'].mean())  # fill numeric gaps with the column mean

df.isnull().sum()

Out[50]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
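For a numeric column, the mean is one possible fill value and the median is a common alternative when the distribution is skewed or contains outliers. The sketch below compares the two on a small hypothetical series. The values are made up for illustration and are not the dataset's.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries
scores = pd.Series([72, 69, np.nan, 47, 76, np.nan], name="math score")

mean_value = scores.mean()      # (72 + 69 + 47 + 76) / 4 = 66.0
median_value = scores.median()  # middle of 47, 69, 72, 76 = 70.5

mean_filled = scores.fillna(mean_value)
median_filled = scores.fillna(median_value)

print(mean_filled.tolist())
print(median_filled.tolist())
```

The single low score of 47 pulls the mean below the median here, which is the usual reason to prefer the median for skewed data.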


Scan all numeric variables for outliers. If there are outliers, use any of the suitable 
techniques to deal with them.

In [51]: sns.boxplot(df)

Out[51]: <Axes: >

In [52]: Q1=df['math score'].quantile(0.25)

Q3=df['math score'].quantile(0.75)
IQR=Q3-Q1
lower= Q1-(1.5*IQR)
upper=Q3+(1.5*IQR)
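With the quartiles reported by `df.describe()` for the math score (Q1 = 57, Q3 = 77), the IQR is 20, so the fences come out at 57 - 30 = 27 and 77 + 30 = 107. The sketch below wraps the same rule in a hypothetical helper, `iqr_fences`, and uses it to list the flagged values. The toy series is made up so that its quartiles match those figures.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical scores chosen so that Q1 = 57 and Q3 = 77
toy = pd.Series([0, 8, 57, 60, 66, 70, 77, 80, 100], name="math score")

low, high = iqr_fences(toy)
outliers = toy[(toy < low) | (toy > high)]  # values outside the fences

print(low, high)
print(outliers.tolist())
```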


In [53]: df['math score'] = df['math score'].clip(lower, upper)  # cap values at the IQR fences

sns.boxplot(df['math score'])

Out[53]: <Axes: >
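Capping values at the fences keeps every row, whereas filtering drops the rows that fall outside them. The sketch below contrasts the two on hypothetical data. The fences 27 and 107 follow from the quartiles in the `describe()` output, and the names `capped` and `filtered` are illustrative only.

```python
import pandas as pd

# Hypothetical scores; 0 and 8 fall below the lower fence
toy = pd.DataFrame({"math score": [0, 8, 57, 60, 66, 70, 77, 80, 100]})
low, high = 27.0, 107.0  # fences derived from Q1 = 57 and Q3 = 77

# Option 1: cap the values, keeping all rows
capped = toy.copy()
capped["math score"] = capped["math score"].clip(lower=low, upper=high)

# Option 2: drop the rows outside the fences
filtered = toy[toy["math score"].between(low, high)]

print(len(capped), capped["math score"].min())
print(len(filtered))
```

Capping suits small datasets where losing rows is costly. Filtering is cleaner when the extreme values are likely data-entry errors.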

Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.

In [54]: df['math score'].skew()
-0.12912399951580147

df['reading score'].skew()
-0.2595478399998487

df['writing score'].skew()
-0.3125529577143879

from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
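The skewness values in this cell are all small and negative, so the three score columns are only mildly left-skewed and may not need a skew-reducing transformation. For a strongly left-skewed variable, one common option is to reflect the values and then apply a logarithm. The sketch below shows that idea on synthetic data. The data, the seed and the variable names are assumptions, and this is not a step taken in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed for reproducibility

# Synthetic, strongly left-skewed scores: most values near 100, a long tail downwards
left_skewed = pd.Series(100 - rng.exponential(scale=10, size=1000)).clip(lower=0)

# Reflect so the long tail points to the right, then compress it with log(1 + x).
# Note that reflection reverses the ordering: high scores become small values.
reflected = left_skewed.max() - left_skewed
transformed = np.log1p(reflected)

skew_before = left_skewed.skew()
skew_after = transformed.skew()

print(skew_before, skew_after)
```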

In [55]: scaler.fit(df[['math score']])

Out[55]: StandardScaler()


In [58]: df.head()

Out[58]:
   gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree      standard                     none          72             72             74
1  female         group C                 some college      standard                completed          69             90             88
2  female         group B              master's degree      standard                     none          90             95             93
3    male         group A           associate's degree  free/reduced                     none          47             57             44
4    male         group C                 some college      standard                     none          76             78             75

In [59]: scaler.fit(df[['reading score']])

Out[59]: StandardScaler()

In [60]: scaled_rscore=scaler.transform(df[['reading score']])
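`StandardScaler` rescales a column to zero mean and unit variance by computing z = (x - mean) / std, where std is the population standard deviation (`ddof=0`). The sketch below reproduces that arithmetic directly in pandas, without scikit-learn, on the five reading scores shown by `df.head()`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The first five reading scores from df.head()
reading = pd.Series([72, 90, 95, 57, 78], name="reading score")

mean = reading.mean()
std = reading.std(ddof=0)  # population standard deviation, as StandardScaler uses

scaled = (reading - mean) / std  # z-scores: mean 0, standard deviation 1
restored = scaled * std + mean   # inverse transform back to the original scale

print(scaled.round(3).tolist())
```

Standardising changes only the scale. The shape of the distribution, including its skewness, stays the same.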
