Day09 DataWrangling
Objectives
After completing this lab you will be able to:
- Identify and handle missing values in a dataset
- Correct data formats (data types)
- Apply data transformation and normalization
Data wrangling is the process of converting data from its initial format to a format that is better suited for analysis.
import pandas as pd
Use the Pandas method read_csv() to load the data from the data file.
df = pd.read_csv("carData.csv")
Use the method head() to display the first five rows of the dataframe.
# To see what the data set looks like, we'll use the head() method.
df.head()
highway-mpg price
0 27 13495
1 27 16500
2 26 16500
3 30 13950
4 22 17450
[5 rows x 26 columns]
As we can see, several question marks appear in the dataframe; these are missing values that may hinder our further analysis. So, how do we identify all those missing values and deal with them?
Steps for working with missing data:
1. Identify missing data
2. Deal with missing data
3. Correct the data format
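The three steps above can be sketched end to end on a tiny toy DataFrame (not the car dataset; the column and values here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy data: "?" marks a missing entry, numbers are stored as strings
toy = pd.DataFrame({"price": ["13495", "?", "16500"]})

# Step 1: identify missing data ("?" -> NaN, then isnull())
toy = toy.replace("?", np.nan)
n_missing = toy["price"].isnull().sum()

# Step 2: deal with missing data (here: drop the incomplete row)
toy = toy.dropna(subset=["price"])

# Step 3: correct the data format (object -> float)
toy["price"] = toy["price"].astype("float")

print(n_missing)            # 1
print(toy["price"].mean())  # 14997.5
```

The same three steps are applied to the car dataset in the rest of this lab.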
In the car dataset, missing data comes as the question mark "?". We replace "?" with NaN (Not a Number), Python's default missing-value marker, for reasons of computational speed and convenience. Here we use the method .replace(A, B, inplace=True) to replace A by B.
import numpy as np
# replace "?" with NaN
df.replace("?", np.nan, inplace=True)
df.head()
highway-mpg price
0 27 13495
1 27 16500
2 26 16500
3 30 13950
4 22 17450
[5 rows x 26 columns]
# Descriptive stats
df.describe()
After the replacement, the question marks have been converted to NaN. We use the following methods to identify these missing values. There are two methods to detect missing data: .isnull() and .notnull(). The output is a boolean value indicating whether the value passed in is in fact missing data.
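As a quick illustration on a toy Series (values invented, not from the car dataset), the two methods return element-wise booleans that mirror each other:

```python
import pandas as pd
import numpy as np

# A three-element Series with one missing value in the middle
s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull().tolist())   # [False, True, False]
print(s.notnull().tolist())  # [True, False, True]
```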
# Let's find the null values by row
df.loc[df.isnull().any(axis=1)]  # any(axis=1) checks across the columns of each row
# Here, every row in which any column contains a null value is displayed
# Let's find the null values by column
df.loc[:, df.isnull().any(axis=0)]  # any(axis=0) checks down the rows of each column
## Student practice: What exactly is the above code doing?
missing_data = df.isnull()
missing_data.sum()
symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
dtype: int64
"True" means the value is a missing value while "False" means the value is not a missing value.
# count how many True/False values each column of missing_data contains
for column in missing_data.columns:
    print(missing_data[column].value_counts())
symboling
False 205
Name: count, dtype: int64
normalized-losses
False 164
True 41
Name: count, dtype: int64
make
False 205
Name: count, dtype: int64
fuel-type
False 205
Name: count, dtype: int64
aspiration
False 205
Name: count, dtype: int64
num-of-doors
False 203
True 2
Name: count, dtype: int64
body-style
False 205
Name: count, dtype: int64
drive-wheels
False 205
Name: count, dtype: int64
engine-location
False 205
Name: count, dtype: int64
wheel-base
False 205
Name: count, dtype: int64
length
False 205
Name: count, dtype: int64
width
False 205
Name: count, dtype: int64
height
False 205
Name: count, dtype: int64
curb-weight
False 205
Name: count, dtype: int64
engine-type
False 205
Name: count, dtype: int64
num-of-cylinders
False 205
Name: count, dtype: int64
engine-size
False 205
Name: count, dtype: int64
fuel-system
False 205
Name: count, dtype: int64
bore
False 201
True 4
Name: count, dtype: int64
stroke
False 201
True 4
Name: count, dtype: int64
compression-ratio
False 205
Name: count, dtype: int64
horsepower
False 203
True 2
Name: count, dtype: int64
peak-rpm
False 203
True 2
Name: count, dtype: int64
city-mpg
False 205
Name: count, dtype: int64
highway-mpg
False 205
Name: count, dtype: int64
price
False 201
True 4
Name: count, dtype: int64
Based on the summary above, each column has 205 rows of data, and seven of the columns contain missing data:
- "normalized-losses": 41 missing values
- "num-of-doors": 2 missing values
- "bore": 4 missing values
- "stroke": 4 missing values
- "horsepower": 2 missing values
- "peak-rpm": 2 missing values
- "price": 4 missing values
Whole columns should be dropped only if most of their entries are empty. In our dataset, none of the columns is empty enough to drop entirely. We have some freedom in choosing how to replace missing data, though some methods are more reasonable than others. We will apply a different method to each affected column:
Replace by mean:
- "normalized-losses": 41 missing values, replace with the mean
- "stroke": 4 missing values, replace with the mean
- "bore": 4 missing values, replace with the mean
- "horsepower": 2 missing values, replace with the mean
- "peak-rpm": 2 missing values, replace with the mean

Replace by frequency:
- "num-of-doors": 2 missing values, replace with "four". Reason: 84% of sedans have four doors; since four doors is the most frequent value, it is the most likely to occur.
Drop the whole row:
- "price": 4 missing values, simply delete the whole row. Reason: price is what we want to predict; any data entry without price cannot be used for prediction, so rows without price data are not useful to us.
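All three strategies can be sketched together on a made-up three-row DataFrame (column names borrowed from the car dataset, values invented for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0],
    "num-of-doors": ["two", np.nan, "two"],
    "price": [13495.0, 16500.0, np.nan],
})

# Strategy 1: replace by mean (numeric column)
toy["horsepower"] = toy["horsepower"].replace(np.nan, toy["horsepower"].mean())

# Strategy 2: replace by frequency (categorical column)
most_common = toy["num-of-doors"].value_counts().idxmax()
toy["num-of-doors"] = toy["num-of-doors"].replace(np.nan, most_common)

# Strategy 3: drop rows that are missing the target variable
toy = toy.dropna(subset=["price"], axis=0)

print(toy["horsepower"].tolist())    # [111.0, 132.5]
print(toy["num-of-doors"].tolist())  # ['two', 'two']
```

The sections below apply exactly these strategies, column by column, to the car dataset.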
print(df.dtypes)
df.head()
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
highway-mpg price
0 27 13495
1 27 16500
2 26 16500
3 30 13950
4 22 17450
[5 rows x 26 columns]
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
# Replace NaN with the mean value in the "normalized-losses" column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)
# check that all NaN values in "normalized-losses" have been replaced by the average
df['normalized-losses'].isnull().sum()
avg_bore = df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
df['bore'] = df['bore'].replace(np.nan, avg_bore)

avg_stroke = df['stroke'].astype('float').mean(axis=0)
df['stroke'] = df['stroke'].replace(np.nan, avg_stroke)

avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
df['horsepower'] = df['horsepower'].replace(np.nan, avg_horsepower)

avg_peakrpm = df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_peakrpm)
# check that the missing values have been replaced with valid values
df.isnull().sum()
symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 4
dtype: int64
Let's now address "num-of-doors". To see which values are present in a particular column, we can use the ".value_counts()" method:
df['num-of-doors'].value_counts()
num-of-doors
four 114
two 89
Name: count, dtype: int64
We can see that four doors is the most common type. We can also use the ".idxmax()" method to find the most common value automatically:
df['num-of-doors'].value_counts().idxmax()
'four'
# replace the missing "num-of-doors" values with the most frequent value
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
Finally, let's drop all rows that do not have price data:
# simply drop whole rows with NaN in the "price" column; axis=0 is for rows,
# inplace=True updates the existing dataframe df
df.dropna(subset=["price"], axis=0, inplace=True)
df.tail()
[5 rows x 26 columns]
df.isnull().sum()
symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0
dtype: int64
df.shape
(201, 26)
We are almost there! The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
In Pandas, we use: .dtypes to check the data type and .astype() to change the data type.
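As a minimal illustration on a toy Series (values invented), .astype() converts numbers stored as strings into a proper numeric dtype so that arithmetic works:

```python
import pandas as pd

# Numbers stored as strings are held with object dtype
s = pd.Series(["13495", "16500"])
print(s.dtype)   # object

# Convert to float so that numeric operations like mean() work
s = s.astype("float")
print(s.dtype)   # float64
print(s.mean())  # 14997.5
```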
df.dtypes
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
# convert the columns that hold numbers stored as strings to numeric types
df[["bore", "stroke", "peak-rpm", "price"]] = df[["bore", "stroke", "peak-rpm", "price"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df.dtypes
symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object
Wonderful! We have finally obtained a cleaned dataset with no missing values and all data in its proper format.
Example: transform mpg to L/100km. In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are given in mpg (miles per gallon). Assume we are developing an application for a country that uses the L/100km standard; we will need to apply a data transformation to convert mpg into L/100km.
The formula for the unit conversion is: L/100km = 235 / mpg. We can do many mathematical operations directly in Pandas.
# convert mpg to L/100km and rename the column accordingly
df["city-mpg"] = 235 / df["city-mpg"]
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
df.head()
[5 rows x 26 columns]
df.dtypes
symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
highway-mpg int64
price float64
city-L/100km float64
dtype: object
Why normalization? Normalization is the process of transforming values of several variables into
a similar range. Typical normalizations include scaling the variable so the variable average is 0,
scaling the variable so the variance is 1, or scaling the variable so the variable values range from
0 to 1.
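The three typical normalizations just mentioned can be sketched side by side on a toy Series (values invented for illustration):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# Simple feature scaling: divide by the maximum (values end at 1)
simple = x / x.max()

# Min-max scaling: values range exactly from 0 to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score scaling: mean 0, standard deviation 1
zscore = (x - x.mean()) / x.std()

print(simple.tolist())  # [0.25, 0.5, 0.75, 1.0]
print(minmax.tolist())
```

This lab uses the first variant, simple feature scaling.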
Example: To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".
Target: normalize those variables so their values range from 0 to 1.
Approach: replace the original value by (original value)/(maximum value).
df['width']
0 64.1
1 64.1
2 65.5
3 66.2
4 66.4
...
196 68.9
197 68.8
198 68.9
199 68.9
200 68.9
Name: width, Length: 201, dtype: float64
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
# show the scaled columns
df[["length","width","height"]].head()
Here we can see we've normalized "length", "width" and "height" in the range of [0,1].
df.to_csv('cleanCars.csv')