
Day09 DataWrangling

Uploaded by

Anjana Mahawar

Data Wrangling

Objectives
After completing this lab you will be able to:

• Handle missing values
• Correct data format
• Standardize and normalize data

Data wrangling is the process of converting data from the initial format to a format that may be
better for analysis.

import pandas as pd

Use the Pandas method read_csv() to load the data from the data file.

df = pd.read_csv("carData.csv")

Use the method head() to display the first five rows of the dataframe.

# To see what the data set looks like, we'll use the head() method.
df.head()

   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors  \
0          3                  ?  alfa-romero        gas         std           two
1          3                  ?  alfa-romero        gas         std           two
2          1                  ?  alfa-romero        gas         std           two
3          2                164         audi        gas         std          four
4          2                164         audi        gas         std          four

    body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \
0  convertible           rwd            front        88.6  ...          130
1  convertible           rwd            front        88.6  ...          130
2    hatchback           rwd            front        94.5  ...          152
3        sedan           fwd            front        99.8  ...          109
4        sedan           4wd            front        99.4  ...          136

   fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  \
0         mpfi  3.47    2.68                9.0         111      5000        21
1         mpfi  3.47    2.68                9.0         111      5000        21
2         mpfi  2.68    3.47                9.0         154      5000        19
3         mpfi  3.19    3.40               10.0         102      5500        24
4         mpfi  3.19    3.40                8.0         115      5500        18

   highway-mpg  price
0           27  13495
1           27  16500
2           26  16500
3           30  13950
4           22  17450

[5 rows x 26 columns]

As we can see, several question marks appeared in the dataframe; those are missing values
which may hinder our further analysis. So, how do we identify all those missing values and deal
with them?

How to work with missing data?

Steps for working with missing data:

• Identify missing data
• Deal with missing data
• Correct data format

In the car dataset, missing data is marked with a question mark, "?". We replace "?" with NaN
(Not a Number), the default missing-value marker in pandas and NumPy, for reasons of
computational speed and convenience. Here we use the method .replace(A, B, inplace=True) to
replace A with B.

import numpy as np

# replace "?" with NaN
df.replace("?", np.nan, inplace=True)
df.head(5)

   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors  \
0          3                NaN  alfa-romero        gas         std           two
1          3                NaN  alfa-romero        gas         std           two
2          1                NaN  alfa-romero        gas         std           two
3          2                164         audi        gas         std          four
4          2                164         audi        gas         std          four

    body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \
0  convertible           rwd            front        88.6  ...          130
1  convertible           rwd            front        88.6  ...          130
2    hatchback           rwd            front        94.5  ...          152
3        sedan           fwd            front        99.8  ...          109
4        sedan           4wd            front        99.4  ...          136

   fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  \
0         mpfi  3.47    2.68                9.0         111      5000        21
1         mpfi  3.47    2.68                9.0         111      5000        21
2         mpfi  2.68    3.47                9.0         154      5000        19
3         mpfi  3.19    3.40               10.0         102      5500        24
4         mpfi  3.19    3.40                8.0         115      5500        18

   highway-mpg  price
0           27  13495
1           27  16500
2           26  16500
3           30  13950
4           22  17450

[5 rows x 26 columns]
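
As an alternative to replacing "?" after loading, the marker can be declared while the file is read. The sketch below assumes a small in-memory CSV (the name sample_csv and its three rows are made up for illustration, not taken from carData.csv) and uses the na_values parameter of read_csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for carData.csv.
sample_csv = io.StringIO(
    "symboling,normalized-losses,make\n"
    "3,?,alfa-romero\n"
    "2,164,audi\n"
    "1,?,bmw\n"
)

# na_values tells read_csv to treat "?" as missing while parsing.
sample_df = pd.read_csv(sample_csv, na_values="?")

print(sample_df["normalized-losses"].isnull().sum())  # number of missing entries
print(sample_df["normalized-losses"].dtype)           # parsed as a numeric column
```

Because "?" never enters the column as text, the column is parsed as numeric right away, which may remove the need for a later type conversion.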

# Descriptive stats
df.describe()

        symboling  wheel-base      length       width      height  \
count  205.000000  205.000000  205.000000  205.000000  205.000000
mean     0.834146   98.756585  174.049268   65.907805   53.724878
std      1.245307    6.021776   12.337289    2.145204    2.443522
min     -2.000000   86.600000  141.100000   60.300000   47.800000
25%      0.000000   94.500000  166.300000   64.100000   52.000000
50%      1.000000   97.000000  173.200000   65.500000   54.100000
75%      2.000000  102.400000  183.100000   66.900000   55.500000
max      3.000000  120.900000  208.100000   72.300000   59.800000

       curb-weight  engine-size  compression-ratio    city-mpg  highway-mpg
count   205.000000   205.000000         205.000000  205.000000   205.000000
mean   2555.565854   126.907317          10.142537   25.219512    30.751220
std     520.680204    41.642693           3.972040    6.542142     6.886443
min    1488.000000    61.000000           7.000000   13.000000    16.000000
25%    2145.000000    97.000000           8.600000   19.000000    25.000000
50%    2414.000000   120.000000           9.000000   24.000000    30.000000
75%    2935.000000   141.000000           9.400000   30.000000    34.000000
max    4066.000000   326.000000          23.000000   49.000000    54.000000

After the replacement, the missing values are represented by NaN, the default missing-value
marker. Pandas provides two methods to detect missing data: .isnull() and .notnull(). Each
returns boolean values indicating, entry by entry, whether the value is (or is not) missing.
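
To make the two detection methods concrete, here is a minimal sketch on a two-row toy frame (the frame and the names toy, missing_mask and present_mask are illustrative assumptions, not part of the car dataset).

```python
import numpy as np
import pandas as pd

# Toy data: one numeric gap (NaN) and one text gap (None).
toy = pd.DataFrame({"bore": [3.47, np.nan], "make": ["audi", None]})

missing_mask = toy.isnull()    # True where a value is missing
present_mask = toy.notnull()   # the exact complement of isnull()

print(missing_mask)
print(present_mask)
```

Both NaN and None count as missing, and the two masks are always opposites of each other.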

# Find the rows that contain at least one null value.
# any(axis=1) checks across the columns of each row.
df.loc[df.isnull().any(axis=1)]

symboling normalized-losses make fuel-type aspiration \
0 3 NaN alfa-romero gas std

1 3 NaN alfa-romero gas std

2 1 NaN alfa-romero gas std

5 2 NaN audi gas std

7 1 NaN audi gas std

9 0 NaN audi gas turbo

14 1 NaN bmw gas std


15 0 NaN bmw gas std

16 0 NaN bmw gas std

17 0 NaN bmw gas std

27 1 148 dodge gas turbo

43 0 NaN isuzu gas std

44 1 NaN isuzu gas std

45 0 NaN isuzu gas std

46 2 NaN isuzu gas std

48 0 NaN jaguar gas std

49 0 NaN jaguar gas std

55 3 150 mazda gas std

56 3 150 mazda gas std

57 3 150 mazda gas std

58 3 150 mazda gas std

63 0 NaN mazda diesel std

66 0 NaN mazda diesel std

71 -1 NaN mercedes-benz gas std

73 0 NaN mercedes-benz gas std

74 1 NaN mercedes-benz gas std

75 1 NaN mercury gas turbo

82 3 NaN mitsubishi gas turbo

83 3 NaN mitsubishi gas turbo

84 3 NaN mitsubishi gas turbo

109 0 NaN peugot gas std

110 0 NaN peugot diesel turbo

113 0 NaN peugot gas std


114 0 NaN peugot diesel turbo

124 3 NaN plymouth gas turbo

126 3 NaN porsche gas std

127 3 NaN porsche gas std

128 3 NaN porsche gas std

129 1 NaN porsche gas std

130 0 NaN renault gas std

131 2 NaN renault gas std

181 -1 NaN toyota gas std

189 3 NaN volkswagen gas std

191 0 NaN volkswagen gas std

192 0 NaN volkswagen diesel turbo

193 0 NaN volkswagen gas std

num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6
...
1 two convertible rwd front 88.6
...
2 two hatchback rwd front 94.5
...
5 two sedan fwd front 99.8
...
7 four wagon fwd front 105.8
...
9 two hatchback 4wd front 99.5
...
14 four sedan rwd front 103.5
...
15 four sedan rwd front 103.5
...
16 two sedan rwd front 103.5
...
17 four sedan rwd front 110.0
...
27 NaN sedan fwd front 93.7
...
43 four sedan rwd front 94.3
...
44 two sedan fwd front 94.5
...
45 four sedan fwd front 94.5
...
46 two hatchback rwd front 96.0
...
48 four sedan rwd front 113.0
...
49 two sedan rwd front 102.0
...
55 two hatchback rwd front 95.3
...
56 two hatchback rwd front 95.3
...
57 two hatchback rwd front 95.3
...
58 two hatchback rwd front 95.3
...
63 NaN sedan fwd front 98.8
...
66 four sedan rwd front 104.9
...
71 four sedan rwd front 115.6
...
73 four sedan rwd front 120.9
...
74 two hardtop rwd front 112.0
...
75 two hatchback rwd front 102.7
...
82 two hatchback fwd front 95.9
...
83 two hatchback fwd front 95.9
...
84 two hatchback fwd front 95.9
...
109 four wagon rwd front 114.2
...
110 four wagon rwd front 114.2
...
113 four wagon rwd front 114.2
...
114 four wagon rwd front 114.2
...
124 two hatchback rwd front 95.9
...
126 two hardtop rwd rear 89.5
...
127 two hardtop rwd rear 89.5
...
128 two convertible rwd rear 89.5
...
129 two hatchback rwd front 98.4
...
130 four wagon fwd front 96.1
...
131 two hatchback fwd front 96.1
...
181 four wagon rwd front 104.5
...
189 two convertible fwd front 94.5
...
191 four sedan fwd front 100.4
...
192 four sedan fwd front 100.4
...
193 four wagon fwd front 100.4
...

engine-size fuel-system bore stroke compression-ratio horsepower \
0 130 mpfi 3.47 2.68 9.0
111
1 130 mpfi 3.47 2.68 9.0
111
2 152 mpfi 2.68 3.47 9.0
154
5 136 mpfi 3.19 3.40 8.5
110
7 136 mpfi 3.19 3.40 8.5
110
9 131 mpfi 3.13 3.40 7.0
160
14 164 mpfi 3.31 3.19 9.0
121
15 209 mpfi 3.62 3.39 8.0
182
16 209 mpfi 3.62 3.39 8.0
182
17 209 mpfi 3.62 3.39 8.0
182
27 98 mpfi 3.03 3.39 7.6
102
43 111 2bbl 3.31 3.23 8.5
78
44 90 2bbl 3.03 3.11 9.6
70
45 90 2bbl 3.03 3.11 9.6
70
46 119 spfi 3.43 3.23 9.2
90
48 258 mpfi 3.63 4.17 8.1
176
49 326 mpfi 3.54 2.76 11.5
262
55 70 4bbl NaN NaN 9.4
101
56 70 4bbl NaN NaN 9.4
101
57 70 4bbl NaN NaN 9.4
101
58 80 mpfi NaN NaN 9.4
135
63 122 idi 3.39 3.39 22.7
64
66 134 idi 3.43 3.64 22.0
72
71 234 mpfi 3.46 3.10 8.3
155
73 308 mpfi 3.80 3.35 8.0
184
74 304 mpfi 3.80 3.35 8.0
184
75 140 mpfi 3.78 3.12 8.0
175
82 156 spdi 3.58 3.86 7.0
145
83 156 spdi 3.59 3.86 7.0
145
84 156 spdi 3.59 3.86 7.0
145
109 120 mpfi 3.46 3.19 8.4
97
110 152 idi 3.70 3.52 21.0
95
113 120 mpfi 3.46 2.19 8.4
95
114 152 idi 3.70 3.52 21.0
95
124 156 spdi 3.59 3.86 7.0
145
126 194 mpfi 3.74 2.90 9.5
207
127 194 mpfi 3.74 2.90 9.5
207
128 194 mpfi 3.74 2.90 9.5
207
129 203 mpfi 3.94 3.11 10.0
288
130 132 mpfi 3.46 3.90 8.7
NaN
131 132 mpfi 3.46 3.90 8.7
NaN
181 161 mpfi 3.27 3.35 9.2
156
189 109 mpfi 3.19 3.40 8.5
90
191 136 mpfi 3.19 3.40 8.5
110
192 97 idi 3.01 3.40 23.0
68
193 109 mpfi 3.19 3.40 9.0
88

peak-rpm city-mpg highway-mpg price
0 5000 21 27 13495
1 5000 21 27 16500
2 5000 19 26 16500
5 5500 19 25 15250
7 5500 19 25 18920
9 5500 16 22 NaN
14 4250 20 25 24565
15 5400 16 22 30760
16 5400 16 22 41315
17 5400 15 20 36880
27 5500 24 30 8558
43 4800 24 29 6785
44 5400 38 43 NaN
45 5400 38 43 NaN
46 5000 24 29 11048
48 4750 15 19 35550
49 5000 13 17 36000
55 6000 17 23 10945
56 6000 17 23 11845
57 6000 17 23 13645
58 6000 16 23 15645
63 4650 36 42 10795
66 4200 31 39 18344
71 4750 16 18 34184
73 4500 14 16 40960
74 4500 14 16 45400
75 5000 19 24 16503
82 5000 19 24 12629
83 5000 19 24 14869
84 5000 19 24 14489
109 5000 19 24 12440
110 4150 25 25 13860
113 5000 19 24 16695
114 4150 25 25 17075
124 5000 19 24 12764
126 5900 17 25 32528
127 5900 17 25 34028
128 5900 17 25 37028
129 5750 17 28 NaN
130 NaN 23 31 9295
131 NaN 23 31 9895
181 5200 19 24 15750
189 5500 24 29 11595
191 5500 19 24 13295
192 4500 33 38 13845
193 5500 25 31 12290

[46 rows x 26 columns]

# Now filter the other way round.
# any(axis=0) tests each column, looking down its rows.
df.loc[:, df.isnull().any(axis=0)]
## Student practice: what exactly is the code above doing?

     normalized-losses  num-of-doors  bore  stroke  horsepower  peak-rpm  price
0                  NaN           two  3.47    2.68         111      5000  13495
1                  NaN           two  3.47    2.68         111      5000  16500
2                  NaN           two  2.68    3.47         154      5000  16500
3                  164          four  3.19    3.40         102      5500  13950
4                  164          four  3.19    3.40         115      5500  17450
..                 ...           ...   ...     ...         ...       ...    ...
200                 95          four  3.78    3.15         114      5400  16845
201                 95          four  3.78    3.15         160      5300  19045
202                 95          four  3.58    2.87         134      5500  21485
203                 95          four  3.01    3.40         106      4800  22470
204                 95          four  3.78    3.15         114      5400  22625

[205 rows x 7 columns]
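
The difference between the two axis arguments is easy to mix up, so here is a small sketch on a made-up three-row frame (the values and the names rows_with_gaps and cols_with_gaps are illustrative assumptions).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "bore": [3.19, np.nan, 3.78],
    "price": [13950.0, 16430.0, np.nan],
})

# any(axis=1) collapses across the columns: one True/False per row.
rows_with_gaps = toy.loc[toy.isnull().any(axis=1)]

# any(axis=0) collapses down the rows: one True/False per column.
cols_with_gaps = toy.loc[:, toy.isnull().any(axis=0)]

print(rows_with_gaps.index.tolist())
print(cols_with_gaps.columns.tolist())
```

The first selection keeps every column but only the incomplete rows; the second keeps every row but only the incomplete columns.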

missing_data = df.isnull()
missing_data.sum()

symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64

"True" means the value is a missing value while "False" means the value is not a missing value.

df.isnull().sum()

symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64

# Print the True/False counts for every column of the boolean mask.
for column in missing_data.columns.values.tolist():
    print(missing_data[column].value_counts())
    print("")

symboling
False 205
Name: count, dtype: int64

normalized-losses
False 164
True 41
Name: count, dtype: int64

make
False 205
Name: count, dtype: int64

fuel-type
False 205
Name: count, dtype: int64

aspiration
False 205
Name: count, dtype: int64

num-of-doors
False 203
True 2
Name: count, dtype: int64

body-style
False 205
Name: count, dtype: int64
drive-wheels
False 205
Name: count, dtype: int64

engine-location
False 205
Name: count, dtype: int64

wheel-base
False 205
Name: count, dtype: int64

length
False 205
Name: count, dtype: int64

width
False 205
Name: count, dtype: int64

height
False 205
Name: count, dtype: int64

curb-weight
False 205
Name: count, dtype: int64

engine-type
False 205
Name: count, dtype: int64

num-of-cylinders
False 205
Name: count, dtype: int64

engine-size
False 205
Name: count, dtype: int64

fuel-system
False 205
Name: count, dtype: int64

bore
False 201
True 4
Name: count, dtype: int64

stroke
False 201
True 4
Name: count, dtype: int64

compression-ratio
False 205
Name: count, dtype: int64

horsepower
False 203
True 2
Name: count, dtype: int64

peak-rpm
False 203
True 2
Name: count, dtype: int64

city-mpg
False 205
Name: count, dtype: int64

highway-mpg
False 205
Name: count, dtype: int64

price
False 201
True 4
Name: count, dtype: int64

Based on the summary above, each column has 205 rows of data, and seven of the columns
contain missing data:

• "normalized-losses": 41 missing data
• "num-of-doors": 2 missing data
• "bore": 4 missing data
• "stroke": 4 missing data
• "horsepower": 2 missing data
• "peak-rpm": 2 missing data
• "price": 4 missing data
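
Counts are easier to judge next to the share of the column they represent. The sketch below builds such a summary on a made-up four-row frame (the data and the name summary are illustrative assumptions); the mean of a boolean mask is the fraction of True values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "normalized-losses": [np.nan, 164, np.nan, 95],
    "price": [13495, 16500, np.nan, 22625],
    "make": ["alfa-romero", "audi", "isuzu", "volvo"],
})

summary = pd.DataFrame({
    "missing": toy.isnull().sum(),          # absolute count per column
    "percent": toy.isnull().mean() * 100,   # share of rows that are missing
})

# Keep only the columns that actually have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the car data, the same idea would show that the 41 gaps in "normalized-losses" are 20% of its 205 rows, while the other six columns are each missing under 2%.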

How to deal with missing data?

Whole columns should be dropped only if most entries in the column are empty. In our dataset,
none of the columns are empty enough to drop entirely. We have some freedom in choosing
which method to use to replace missing data; however, some methods may seem more
reasonable than others. We will apply each method to several different columns:

Replace by mean:

• "normalized-losses": 41 missing data, replace them with mean
• "stroke": 4 missing data, replace them with mean
• "bore": 4 missing data, replace them with mean
• "horsepower": 2 missing data, replace them with mean
• "peak-rpm": 2 missing data, replace them with mean

Replace by frequency:

• "num-of-doors": 2 missing data, replace them with "four". Reason: 84% of sedans have four
doors. Since four doors is the most frequent value, it is the most likely to occur.

Drop the whole row:

• "price": 4 missing data, simply delete the whole row. Reason: price is what we want to
predict. Any data entry without price data cannot be used for prediction; therefore any row
without price data is not useful to us.

print(df.dtypes)
df.head()

symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors  \
0          3                NaN  alfa-romero        gas         std           two
1          3                NaN  alfa-romero        gas         std           two
2          1                NaN  alfa-romero        gas         std           two
3          2                164         audi        gas         std          four
4          2                164         audi        gas         std          four

    body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \
0  convertible           rwd            front        88.6  ...          130
1  convertible           rwd            front        88.6  ...          130
2    hatchback           rwd            front        94.5  ...          152
3        sedan           fwd            front        99.8  ...          109
4        sedan           4wd            front        99.4  ...          136

   fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  \
0         mpfi  3.47    2.68                9.0         111      5000        21
1         mpfi  3.47    2.68                9.0         111      5000        21
2         mpfi  2.68    3.47                9.0         154      5000        19
3         mpfi  3.19    3.40               10.0         102      5500        24
4         mpfi  3.19    3.40                8.0         115      5500        18

   highway-mpg  price
0           27  13495
1           27  16500
2           26  16500
3           30  13950
4           22  17450

[5 rows x 26 columns]

avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
# Replace NaN with the mean value in the "normalized-losses" column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)
# Check that all NaN values in "normalized-losses" were replaced by the average
df['normalized-losses'].isnull().sum()

Average of normalized-losses: 122.0

avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)

Average of bore: 3.3297512437810943


df["bore"] = df["bore"].replace(np.nan, avg_bore)
# check that the NaN values in "bore" were replaced by the average
df['bore'].isnull().sum()

# Calculate the mean value for the "stroke" column
avg_stroke = df["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", avg_stroke)

# Replace NaN with the mean value in the "stroke" column
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

Average of stroke: 3.255422885572139

avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)

Average horsepower: 104.25615763546799

df['horsepower']= df['horsepower'].replace(np.nan, avg_horsepower)

avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)

Average peak rpm: 5125.369458128079

df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_peakrpm)

# Check whether any missing (null) values remain in each column.
df.isnull().sum()

symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 4
dtype: int64

Let's now address "num-of-doors". To see which values are present in a particular column, we
can use the ".value_counts()" method:

df['num-of-doors'].value_counts()

num-of-doors
four 114
two 89
Name: count, dtype: int64

We can see that four doors is the most common type. We can also use the ".idxmax()" method
to find the most common type automatically:

df['num-of-doors'].value_counts().idxmax()

'four'
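
Another way to reach the same answer is Series.mode(). The sketch below compares the two on a made-up series (the values and the names doors, by_counts and by_mode are illustrative assumptions).

```python
import numpy as np
import pandas as pd

doors = pd.Series(["four", "two", "four", np.nan, "four", "two"], name="num-of-doors")

by_counts = doors.value_counts().idxmax()  # label with the highest count
by_mode = doors.mode()[0]                  # mode() returns a Series; take its first entry

print(by_counts, by_mode)
```

Both approaches ignore NaN by default. When two values are tied for most frequent, mode() returns all of them, so taking the first entry is a choice worth checking.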

The replacement procedure is very similar to what we have seen previously:

# replace the missing 'num-of-doors' values with the most frequent value
freq_numofdoors = df['num-of-doors'].value_counts().idxmax()
df['num-of-doors'] = df["num-of-doors"].replace(np.nan, freq_numofdoors)

Finally, let's drop all rows that do not have price data:

# Drop every row with NaN in the "price" column. axis=0 selects rows;
# inplace=True updates the existing dataframe df.
df.dropna(subset=["price"], axis=0, inplace=True)

df.tail()

     symboling  normalized-losses   make  fuel-type  aspiration  num-of-doors  \
200         -1                 95  volvo        gas         std          four
201         -1                 95  volvo        gas       turbo          four
202         -1                 95  volvo        gas         std          four
203         -1                 95  volvo     diesel       turbo          four
204         -1                 95  volvo        gas       turbo          four

     body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \
200       sedan           rwd            front       109.1  ...          141
201       sedan           rwd            front       109.1  ...          141
202       sedan           rwd            front       109.1  ...          173
203       sedan           rwd            front       109.1  ...          145
204       sedan           rwd            front       109.1  ...          141

     fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  \
200         mpfi  3.78    3.15                9.5         114      5400
201         mpfi  3.78    3.15                8.7         160      5300
202         mpfi  3.58    2.87                8.8         134      5500
203          idi  3.01    3.40               23.0         106      4800
204         mpfi  3.78    3.15                9.5         114      5400

     city-mpg  highway-mpg  price
200        23           28  16845
201        19           25  19045
202        18           23  21485
203        26           27  22470
204        19           25  22625

[5 rows x 26 columns]

# reset the index, because we dropped four rows
df.reset_index(drop=True, inplace=True)
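
To see why the reset is needed, consider this minimal sketch on a made-up three-row frame (the data and the names dropped and renumbered are illustrative assumptions).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"make": ["audi", "isuzu", "volvo"],
                    "price": [13950.0, np.nan, 22625.0]})

dropped = toy.dropna(subset=["price"])
print(dropped.index.tolist())      # labels keep a gap where the row was removed

renumbered = dropped.reset_index(drop=True)
print(renumbered.index.tolist())   # labels run consecutively again
```

With drop=True the old labels are discarded; without it they would be kept as an extra "index" column. This is the same effect that moves the last row label in our dataset from 204 to 200.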

df.tail()

     symboling  normalized-losses   make  fuel-type  aspiration  num-of-doors  \
196         -1                 95  volvo        gas         std          four
197         -1                 95  volvo        gas       turbo          four
198         -1                 95  volvo        gas         std          four
199         -1                 95  volvo     diesel       turbo          four
200         -1                 95  volvo        gas       turbo          four

     body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  \
196       sedan           rwd            front       109.1  ...          141
197       sedan           rwd            front       109.1  ...          141
198       sedan           rwd            front       109.1  ...          173
199       sedan           rwd            front       109.1  ...          145
200       sedan           rwd            front       109.1  ...          141

     fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  \
196         mpfi  3.78    3.15                9.5         114      5400
197         mpfi  3.78    3.15                8.7         160      5300
198         mpfi  3.58    2.87                8.8         134      5500
199          idi  3.01    3.40               23.0         106      4800
200         mpfi  3.78    3.15                9.5         114      5400

     city-mpg  highway-mpg  price
196        23           28  16845
197        19           25  19045
198        18           23  21485
199        26           27  22470
200        19           25  22625

[5 rows x 26 columns]

df.isnull().sum()

symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0
dtype: int64

# this returns the number of rows and number of columns.


df.shape

(201, 26)

Good! Now, we have a dataset with no missing values.

We are almost there! The last step in data cleaning is checking and making sure that all data is in
the correct format (int, float, text or other).

In Pandas, we use the .dtypes attribute to check the data types and the .astype() method to
change a data type.

df.dtypes

symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object

df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")


df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

df.dtypes

symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object
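
In the dtypes listing above, "horsepower" is still reported as object, because that column was not part of the conversion. A minimal sketch of how such a column could be converted is shown below; the four-value series is a made-up stand-in for the real column, mixing text values with an imputed float mean.

```python
import pandas as pd

# Hypothetical slice: text values mixed with an already-imputed float mean.
horsepower = pd.Series(["111", "154", 104.25615763546799, "102"], dtype="object")

as_float = horsepower.astype("float")                     # raises if a value cannot be parsed
as_numeric = pd.to_numeric(horsepower, errors="coerce")   # unparseable entries become NaN

print(as_float.dtype, as_numeric.dtype)
```

astype() is the stricter choice, since it fails loudly on unexpected text; to_numeric with errors="coerce" is more forgiving but can silently introduce new NaN values.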

Wonderful!
Now we have finally obtained the cleaned dataset with no missing values with all data in its
proper format.

What is standardization? Standardization is the process of transforming data into a common
format, allowing the researcher to make meaningful comparisons.

Example: transform mpg to L/100km. In our dataset, the fuel consumption columns "city-mpg"
and "highway-mpg" are expressed in mpg (miles per gallon). Assume we are developing an
application in a country that reports fuel consumption in L/100km. We will need to apply a
data transformation to convert mpg into L/100km.

The formula for unit conversion is: L/100km = 235 / mpg. We can do many mathematical
operations directly in Pandas.
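
The constant 235 in the conversion formula is a rounded value. The sketch below shows where it comes from, assuming the dataset's mpg figures use US gallons (an assumption; the lab does not say which gallon is meant). The helper name mpg_to_l_per_100km is made up for illustration.

```python
LITERS_PER_US_GALLON = 3.785411784   # exact by definition
KM_PER_MILE = 1.609344               # exact by definition

# miles/gallon -> km/liter -> liters per 100 km
factor = 100 * LITERS_PER_US_GALLON / KM_PER_MILE

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km."""
    return factor / mpg

print(round(factor, 3))
print(round(mpg_to_l_per_100km(21), 2))
```

The unrounded factor is about 235.215, so using 235 understates consumption by roughly 0.1%, which is negligible for this lab.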

df.head()

   symboling  normalized-losses         make  fuel-type  aspiration  \
0          3                122  alfa-romero        gas         std
1          3                122  alfa-romero        gas         std
2          1                122  alfa-romero        gas         std
3          2                164         audi        gas         std
4          2                164         audi        gas         std

   num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...  \
0           two  convertible           rwd            front        88.6  ...
1           two  convertible           rwd            front        88.6  ...
2           two    hatchback           rwd            front        94.5  ...
3          four        sedan           fwd            front        99.8  ...
4          four        sedan           4wd            front        99.4  ...

   engine-size  fuel-system  bore  stroke  compression-ratio  horsepower  \
0          130         mpfi  3.47    2.68                9.0         111
1          130         mpfi  3.47    2.68                9.0         111
2          152         mpfi  2.68    3.47                9.0         154
3          109         mpfi  3.19    3.40               10.0         102
4          136         mpfi  3.19    3.40                8.0         115

   peak-rpm  city-mpg  highway-mpg    price
0    5000.0        21           27  13495.0
1    5000.0        21           27  16500.0
2    5000.0        19           26  16500.0
3    5500.0        24           30  13950.0
4    5500.0        18           22  17450.0

[5 rows x 26 columns]

# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]

# check your transformed data


df.head()

# drop the original column "city-mpg" from "df"
df.drop("city-mpg", axis=1, inplace=True)

df.dtypes

symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
highway-mpg int64
price float64
city-L/100km float64
dtype: object

# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df["highway-mpg"] = 235/df["highway-mpg"]

# rename the column from "highway-mpg" to "highway-L/100km"
df = df.rename(columns={'highway-mpg': 'highway-L/100km'})

# check your transformed data


df.head()

   symboling  normalized-losses         make  fuel-type  aspiration  \
0          3                122  alfa-romero        gas         std
1          3                122  alfa-romero        gas         std
2          1                122  alfa-romero        gas         std
3          2                164         audi        gas         std
4          2                164         audi        gas         std

   num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...  \
0           two  convertible           rwd            front        88.6  ...
1           two  convertible           rwd            front        88.6  ...
2           two    hatchback           rwd            front        94.5  ...
3          four        sedan           fwd            front        99.8  ...
4          four        sedan           4wd            front        99.4  ...

   engine-size  fuel-system  bore  stroke  compression-ratio  horsepower  \
0          130         mpfi  3.47    2.68                9.0         111
1          130         mpfi  3.47    2.68                9.0         111
2          152         mpfi  2.68    3.47                9.0         154
3          109         mpfi  3.19    3.40               10.0         102
4          136         mpfi  3.19    3.40                8.0         115

   peak-rpm  highway-L/100km    price  city-L/100km
0    5000.0         8.703704  13495.0     11.190476
1    5000.0         8.703704  16500.0     11.190476
2    5000.0         9.038462  16500.0     12.368421
3    5500.0         7.833333  13950.0      9.791667
4    5500.0        10.681818  17450.0     13.055556

[5 rows x 26 columns]

Why normalization? Normalization is the process of transforming values of several variables into
a similar range. Typical normalizations include scaling the variable so the variable average is 0,
scaling the variable so the variance is 1, or scaling the variable so the variable values range from
0 to 1.
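
The three kinds of scaling mentioned above can be compared on a small series. This is a sketch with four illustrative width values; the names simple, min_max and z_score are assumptions made for the example.

```python
import pandas as pd

width = pd.Series([60.3, 64.1, 65.5, 72.3], name="width")  # illustrative values

# Simple feature scaling: divide by the maximum, so the largest value maps to 1.
simple = width / width.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (width - width.min()) / (width.max() - width.min())

# Z-score: subtract the mean and divide by the standard deviation.
z_score = (width - width.mean()) / width.std()

print(simple.round(3).tolist())
print(min_max.round(3).tolist())
print(z_score.round(3).tolist())
```

Dividing by the maximum keeps positive values within (0, 1] but does not push the minimum down to 0; min-max scaling is the variant that covers the full range from 0 to 1.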

Example: to demonstrate normalization, let's say we want to scale the columns "length",
"width" and "height".

Target: normalize those variables so their values range from 0 to 1.

Approach: replace the original value with (original value)/(maximum value).

df['width']

0 64.1
1 64.1
2 65.5
3 66.2
4 66.4
...
196 68.9
197 68.8
198 68.9
199 68.9
200 68.9
Name: width, Length: 201, dtype: float64

# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

df['width']
# show the scaled columns
df[["length","width","height"]].head()

     length     width    height
0  0.811148  0.890278  0.816054
1  0.811148  0.890278  0.816054
2  0.822681  0.909722  0.876254
3  0.848630  0.919444  0.908027
4  0.848630  0.922222  0.908027

Here we can see we've normalized "length", "width" and "height" in the range of [0,1].

Save the new csv:

df.to_csv('cleanCars.csv')
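
By default, to_csv() also writes the row labels as a first, unnamed column. The sketch below shows the difference on a made-up two-row frame, writing to in-memory buffers instead of a file (the names with_index and without_index are illustrative).

```python
import io
import pandas as pd

toy = pd.DataFrame({"make": ["audi", "volvo"], "price": [13950.0, 22625.0]})

with_index = io.StringIO()
toy.to_csv(with_index)                   # default: row labels are written too

without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # only the data columns are written

print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])
```

Passing index=False may be preferable here, since reading the default output back with read_csv() would produce an extra "Unnamed: 0" column.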
