A09Ass02 - Jupyter Notebook

The document outlines an assignment for data wrangling using Python, focusing on handling missing values, outliers, and data transformations on an open-source dataset. It includes steps for importing libraries, reading a CSV file, checking for missing values, and performing statistical analysis on student performance data. The assignment also emphasizes documenting the approach taken for data transformations to improve understanding or distribution of the data.


Name: Suryal D. Khirade


Roll No: T190424399
Assignment No: 02

Data Wrangling II

Perform the following operations using Python on any open source dataset (e.g., data.csv):

1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution. Reason about and document your approach properly.

Import all the required Python Libraries.

In [1]: import numpy as np
        import pandas as pd
        import random as rd
        import seaborn as sns

In [19]: df = pd.read_csv("C:\\Users\\alisu\\Downloads\\archive (1)\\Student Perform

In [20]: df.describe()

Out[20]:         math score  reading score  writing score
         count   1000.00000    1000.000000    1000.000000
         mean      66.08900      69.169000      68.054000
         std       15.16308      14.600192      15.195657
         min        0.00000      17.000000      10.000000
         25%       57.00000      59.000000      57.750000
         50%       66.00000      70.000000      69.000000
         75%       77.00000      79.000000      79.000000
         max      100.00000     100.000000     100.000000


In [21]: df.isnull().sum()

Out[21]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

Create a DataFrame from the dictionary

In [22]: data = {'studentid': [i for i in range(1, 101)],
                 'Age': [rd.randint(15, 18) for _ in range(100)],
                 'class': [rd.choice(['9 th', '10 th', '11 th', '12 th']) for _ in range(100)],
                 'attendence': [rd.uniform(10, 100) for _ in range(100)],
                 'score': [rd.randint(30, 100) for _ in range(100)]
                }

In [23]: df.to_csv('StudentPerformance', index=False)   # save the student performance data read above
         df = pd.DataFrame(data)                        # df now holds the synthetic dictionary data

In [24]: df

Out[24]:     studentid  Age  class  attendence  score
         0           1   17  11 th   47.894979     50
         1           2   18  11 th   46.373847    100
         2           3   18  11 th   46.636907     76
         3           4   16  11 th   25.126863     77
         4           5   16  10 th   55.518510     41
         ..        ...  ...    ...         ...    ...
         95         96   17  10 th   34.682479     59
         96         97   16  10 th   52.756471     87
         97         98   15  10 th   72.759473     30
         98         99   16  12 th   59.991836     82
         99        100   15  12 th   35.561988     91

         100 rows × 5 columns
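The synthetic DataFrame above is clean by construction, and the student performance dataset loaded next also turns out to have no missing values, so the later cleaning steps run on complete data. For practice, a few gaps and extreme values could be planted in the synthetic frame first; the following is only an illustrative sketch (the 5% NaN fraction and the planted value 999 are assumptions, not part of the original notebook).

# sketch: deliberately plant missing values and outliers in the synthetic data
rng = np.random.default_rng(42)
mask = rng.random(len(df)) < 0.05                            # blank out roughly 5% of attendence
df.loc[mask, 'attendence'] = np.nan
df.loc[df.sample(3, random_state=42).index, 'score'] = 999   # three implausible scores act as outliers
df.isnull().sum()                                            # the planted NaNs now show up here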

Load the dataset into a pandas DataFrame.


In [31]: import pandas as pd
         df = pd.read_csv("C:\\Users\\alisu\\Downloads\\archive (1)\\Student Perform
         df.head()

Out[31]:    gender race/ethnicity parental level of education         lunch test preparation course  math score  reading score  writing score
        0   female        group B           bachelor's degree      standard                    none          72             72             74
        1   female        group C                some college      standard               completed          69             90             88
        2   female        group B             master's degree      standard                    none          90             95             93
        3     male        group A          associate's degree  free/reduced                    none          47             57             44
        4     male        group C                some college      standard                    none          76             78             75

In [32]: df.shape

Out[32]: (1000, 8)

In [33]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

In [34]: df.describe()

Out[34]:         math score  reading score  writing score
         count   1000.00000    1000.000000    1000.000000
         mean      66.08900      69.169000      68.054000
         std       15.16308      14.600192      15.195657
         min        0.00000      17.000000      10.000000
         25%       57.00000      59.000000      57.750000
         50%       66.00000      70.000000      69.000000
         75%       77.00000      79.000000      79.000000
         max      100.00000     100.000000     100.000000


Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.

Data Preprocessing

In [35]: df.isnull().sum()

Out[35]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [36]: df.nunique()

Out[36]: gender 2
race/ethnicity 5
parental level of education 6
lunch 2
test preparation course 2
math score 81
reading score 72
writing score 77
dtype: int64
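`nunique()` gives a quick sense of how many distinct labels each categorical column carries; listing the labels themselves helps catch inconsistencies such as stray spaces or mixed capitalization. A minimal sketch of that scan (the loop over object-dtype columns is my own generalization, not a cell from the notebook):

# sketch: list the distinct labels of every categorical column to spot typos, stray spaces, or case variants
for col in df.select_dtypes(include='object').columns:
    print(col, sorted(df[col].str.strip().unique()))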

In [37]: df["gender"].value_counts() #categorical column

Out[37]: gender
female 518
male 482
Name: count, dtype: int64

In [38]: df['gender'].fillna('female', inplace=True)   # 'female' is the most frequent category (see value_counts above)
         df.isnull().sum()

Out[38]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [39]: df['gender'].mode()   # most frequent value, confirming the fill choice above

Out[39]: 0 female
Name: gender, dtype: object


In [40]: df['race/ethnicity'].value_counts()

Out[40]: race/ethnicity
group C 319
group D 262
group B 190
group E 140
group A 89
Name: count, dtype: int64

In [41]: df['race/ethnicity'].fillna('group C', inplace=True)   # use the existing label casing ('group C') so no new category is introduced
         df.isnull().sum()

Out[41]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64

In [42]: df['lunch'].value_counts()

Out[42]: lunch
standard 645
free/reduced 355
Name: count, dtype: int64

In [43]: df['lunch'].mode()

Out[43]: 0 standard
Name: lunch, dtype: object

In [44]: df['lunch'].mode()[0]

Out[44]: 'standard'

In [45]: df['lunch'].fillna(df['lunch'].mode()[0],inplace=True)
df.isnull().sum()

Out[45]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64


In [46]: df.isnull().sum()

Out[46]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
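Since this dataset has no missing values, the `fillna` calls above are effectively no-ops that demonstrate the pattern. A generic policy, sketched below on the assumption that gaps did exist, is to impute categorical columns with the mode and numeric columns with the median (less sensitive to outliers than the mean):

# sketch: generic imputation policy (mode for categorical, median for numeric columns)
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
df.isnull().sum()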

In [47]: sns.distplot(df['math score'])

         C:\Users\alisu\AppData\Local\Temp\ipykernel_17132\2354272343.py:1: UserWarning:

         `distplot` is a deprecated function and will be removed in seaborn v0.14.0.

         Please adapt your code to use either `displot` (a figure-level function with
         similar flexibility) or `histplot` (an axes-level function for histograms).

         For a guide to updating your code to use the new functions, please see
         https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

           sns.distplot(df['math score'])

Out[47]: <Axes: xlabel='math score', ylabel='Density'>
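The warning above notes that `distplot` is deprecated; an equivalent call with the current seaborn API would be along these lines (parameter choices are mine, not from the notebook):

# sketch: non-deprecated replacement for distplot
sns.histplot(df['math score'], kde=True)   # histogram with a KDE overlay
# a density-only curve (like distplot(..., hist=False)) would be sns.kdeplot(df['math score'])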


In [48]: sns.kdeplot(df['reading score'])   # normally distributed

Out[48]: <Axes: xlabel='reading score', ylabel='Density'>


In [49]: sns.distplot(df['writing score'], hist=False)

         C:\Users\alisu\AppData\Local\Temp\ipykernel_17132\3897196859.py:1: UserWarning:

         `distplot` is a deprecated function and will be removed in seaborn v0.14.0.

         Please adapt your code to use either `displot` (a figure-level function with
         similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).

         For a guide to updating your code to use the new functions, please see
         https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

           sns.distplot(df['writing score'], hist=False)

Out[49]: <Axes: xlabel='writing score', ylabel='Density'>

In [50]: df['math score'].fillna(df['math score'].mean(), inplace=True)
         df.isnull().sum()

Out[50]: gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64


Scan all numeric variables for outliers. If there are outliers, use any of the suitable 
techniques to deal with them.

In [51]: sns.boxplot(df)

Out[51]: <Axes: >

In [52]: Q1 = df['math score'].quantile(0.25)
         Q3 = df['math score'].quantile(0.75)
         IQR = Q3 - Q1
         lower = Q1 - (1.5 * IQR)
         upper = Q3 + (1.5 * IQR)


In [53]: df['math score'] = df['math score'].clip(lower, upper)   # cap values outside the IQR fences
         sns.boxplot(df['math score'])

Out[53]: <Axes: >
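The capping above treats only the math score; the same IQR rule can be applied to every numeric column. A short sketch of that generalization (my own loop, mirroring the math-score step):

# sketch: IQR-based capping applied to every numeric column
for col in df.select_dtypes(include='number').columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # values beyond the fences are pulled back to them
df.describe()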

Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.

In [54]: df['math score'].skew()       # -0.12912399951580147
         df['reading score'].skew()    # -0.2595478399998487
         df['writing score'].skew()    # -0.3125529577143879
         from sklearn.preprocessing import StandardScaler, MinMaxScaler
         scaler = StandardScaler()
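All three skewness values are small and negative, so the score distributions are already close to symmetric; standardization only rescales them (mean 0, standard deviation 1) without changing their shape. Had a variable been strongly right-skewed, a log or power transform would be the usual remedy; a minimal sketch of both options, applied to 'math score' purely to illustrate the calls:

# sketch: transforms commonly used to reduce (positive) skewness
log_scores = np.log1p(df['math score'])                 # log(1 + x) compresses a long right tail
print(log_scores.skew())                                # compare against the original skewness

from sklearn.preprocessing import PowerTransformer      # Yeo-Johnson variant also handles zeros and negatives
pt = PowerTransformer()
yj_scores = pt.fit_transform(df[['math score']])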

In [55]: scaler.fit(df[['math score']])

Out[55]: StandardScaler()


In [58]: df.head()

Out[58]:    gender race/ethnicity parental level of education         lunch test preparation course  math score  reading score  writing score
        0   female        group B           bachelor's degree      standard                    none          72             72             74
        1   female        group C                some college      standard               completed          69             90             88
        2   female        group B             master's degree      standard                    none          90             95             93
        3     male        group A          associate's degree  free/reduced                    none          47             57             44
        4     male        group C                some college      standard                    none          76             78             75

In [59]: scaler.fit(df[['reading score']])

Out[59]: StandardScaler()

In [60]: scaled_rscore=scaler.transform(df[['reading score']])
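The notebook stops after computing `scaled_rscore`; a short usage sketch showing how the standardized values might be attached to the DataFrame and checked (the new column name is my choice):

# sketch: attach the standardized reading scores and confirm mean ≈ 0, std ≈ 1
df['reading score scaled'] = scaled_rscore.ravel()
print(df['reading score scaled'].mean(), df['reading score scaled'].std())
df.head()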
