Pandas Dataframes
Part II
Pandas Dataframes - Recap
In the previous lecture, we covered
Introduction to pandas
Importing data into Spyder
Creating copy of original data
Attributes of data
Indexing and selecting data
Python for Data Science 2
In this lecture
Data types
◦ Numeric
◦ Character
Checking data types of each column
Count of unique data types
Selecting data based on data types
Concise summary of dataframe
Checking format of each column
Getting unique elements of each column
Data types
The way information gets stored in a dataframe or
a python object affects the analysis and outputs of
calculations
There are two main types of data
◦ numeric and character types
Numeric data types include integers and floats
◦ For example: integer – 10, float – 10.53
Strings are known as objects in pandas which can
store values that contain numbers and / or
characters
◦ For example: ‘category1’
Numeric types
Pandas and base Python use different names for data types
Python data type    Pandas data type    Description
int                 int64               Whole numbers
float               float64             Numbers with decimals
◦ ‘64’ is the number of bits allocated to store each value, which
determines the range and precision of the numbers a “cell” can hold
◦ 64 bits is equivalent to 8 bytes
◦ Allocating space ahead of time allows computers to optimize storage and
processing efficiency
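The name mapping and the 8-byte size can be checked directly. This is a minimal sketch, assuming pandas is installed; the variable names and sample values are made up for illustration.

```python
import pandas as pd

# Small illustrative Series; the values are made up for this sketch.
s_int = pd.Series([10, 20, 30])
s_float = pd.Series([10.53, 20.0, 30.25])

# Base Python type name versus the pandas dtype of the same values.
print(type(10).__name__, "->", s_int.dtype)
print(type(10.53).__name__, "->", s_float.dtype)

# itemsize is the number of bytes used per value: 64 bits / 8 = 8 bytes.
bytes_per_int = s_int.dtype.itemsize
bytes_per_float = s_float.dtype.itemsize
print(bytes_per_int, bytes_per_float)
```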
Character types
Difference between category and object
category
◦ A string variable consisting of only a few different values.
Converting such a string variable to a categorical variable will
save some memory
◦ A categorical variable takes on a limited, fixed number of possible
values
object
◦ The column will be assigned the object data type when it has mixed
types (numbers and strings). A string column that also contains ‘nan’
(blank cells) is read as object as well, whereas a purely numeric
column with blank cells is read as float64
◦ For strings, the length is not fixed
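The memory saving from converting a repetitive string column to category can be observed with memory_usage(). This is a rough sketch: the column values below are hypothetical, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical string column with only three distinct values, repeated.
fuel = pd.Series(["Petrol", "Diesel", "CNG"] * 1000)

# Convert the string column to a categorical column.
fuel_cat = fuel.astype("category")

# deep=True also counts the memory used by the strings themselves.
mem_obj = fuel.memory_usage(deep=True)
mem_cat = fuel_cat.memory_usage(deep=True)
print(mem_obj, mem_cat)

# The fixed set of possible values stored by the categorical column.
print(list(fuel_cat.cat.categories))
```

The categorical version stores each distinct string once and keeps a small integer code per row, which is why it is smaller here.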
Checking data types of each column
dtypes returns a series with the data type of
each column
Syntax: DataFrame.dtypes
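A short sketch of dtypes on a tiny dataframe follows. The column names are taken from this lecture, but the rows are invented, and ‘??’ is only a placeholder for a non-numeric entry. Depending on the pandas version, the string columns are reported as object or as a dedicated string dtype.

```python
import pandas as pd

# Invented rows; "??" is a placeholder for a non-numeric entry.
cars = pd.DataFrame({
    "KM": ["46986", "72937", "??"],
    "MetColor": [1.0, 0.0, None],   # a blank cell becomes NaN
    "Automatic": [0, 0, 1],
    "Doors": ["three", "3", "five"],
})

# dtypes is an attribute (no parentheses) and returns a Series
# indexed by column name.
col_types = cars.dtypes
print(col_types)
```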
Python for Data Science 7
Count of unique data types
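get_dtype_counts() returns counts of
unique data types in the dataframe
Syntax: DataFrame.get_dtype_counts()
This method was deprecated in pandas 0.25 and removed in pandas 1.0.
In current versions, the same counts are given by
DataFrame.dtypes.value_counts()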
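Because get_dtype_counts() is not available in pandas 1.0 and later, the sketch below counts data types with dtypes.value_counts() instead. The dataframe is invented for illustration and is not the lecture's dataset.

```python
import pandas as pd

# Invented rows using column names from this lecture.
cars = pd.DataFrame({
    "KM": ["46986", "72937", "??"],
    "MetColor": [1.0, 0.0, None],
    "Automatic": [0, 0, 1],
    "Doors": ["three", "3", "five"],
})

# Count how many columns share each data type.
counts = cars.dtypes.value_counts()
print(counts)
```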
Selecting data based on data types
pandas.DataFrame.select_dtypes() returns a
subset of the columns from dataframe based on the column
dtypes
Syntax: DataFrame.select_dtypes(include=None,
exclude=None)
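A minimal sketch of select_dtypes() on an invented dataframe is given below. It uses the "number" selector so that the result does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

# Invented rows using column names from this lecture.
cars = pd.DataFrame({
    "KM": ["46986", "72937", "??"],
    "MetColor": [1.0, 0.0, None],
    "Automatic": [0, 0, 1],
    "Doors": ["three", "3", "five"],
})

# Keep only the numeric columns (int64 and float64 here).
numeric_cols = cars.select_dtypes(include="number")

# Keep everything except the numeric columns.
non_numeric = cars.select_dtypes(exclude="number")

print(list(numeric_cols.columns))
print(list(non_numeric.columns))
```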
Python for Data Science 9
Concise summary of dataframe
info() prints a concise summary of a
dataframe, including
◦ data type of index
◦ data type of columns
◦ count of non-null values
◦ memory usage
Syntax: DataFrame.info()
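info() writes its summary to the console rather than returning it. The sketch below, on an invented dataframe, calls it once normally and once with the buf parameter to capture the text in a string.

```python
import io
import pandas as pd

# Invented rows; MetColor has one blank cell to show the non-null count.
cars = pd.DataFrame({
    "KM": ["46986", "72937", "??"],
    "MetColor": [1.0, 0.0, None],
    "Automatic": [0, 0, 1],
})

# Usual call: prints the summary to the console.
cars.info()

# Same summary written into a text buffer so it can be inspected.
buffer = io.StringIO()
cars.info(buf=buffer)
summary = buffer.getvalue()
```

In the printed summary, ‘MetColor’ shows one fewer non-null value than the other columns because of its blank cell.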
Checking format of each column
By using info(), we can see
‘KM’ has been read as object instead of integer
‘HP’ has been read as object instead of integer
‘MetColor’ and ‘Automatic’ have been read as
float64 and int64 respectively since they hold the values 0/1
Ideally, ‘Doors’ should have been read as int64 since it
has values 2, 3, 4, 5. But it has been read as object
Missing values are present in a few variables
Let’s find out why!
Unique elements of columns
numpy.unique() is used to find the unique
elements of a column
Syntax: numpy.unique(array)
‘KM’ contains a special character among its values
Hence, it has been read as object instead of int64
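The check can be sketched as follows. The values are invented, and the string ‘??’ is only a hypothetical stand-in for the special character, which is not named here.

```python
import numpy as np
import pandas as pd

# Invented KM values; "??" is a hypothetical stand-in for the special character.
km = pd.Series(["46986", "72937", "??", "46986"], name="KM")

# numpy.unique returns the sorted distinct values of the column.
unique_km = np.unique(km)
print(unique_km)

# Entries that are not made up purely of digits explain the object dtype.
non_numeric = [v for v in unique_km if not str(v).isdigit()]
print(non_numeric)
```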
Unique elements of columns
Variable ‘HP’:
‘HP’ contains a special character among its values
Hence, it has been read as object instead of int64
Variable ‘MetColor’:
‘MetColor’ has been read as float64 since it has values 0. and 1.
Unique elements of columns
Variable ‘Automatic’ :
‘Automatic’ has been read as int64 since it has values 0 & 1
Variable ‘Doors’ :
‘Doors’ has been read as object instead of int64 because of
values ‘five’ ‘four’ ‘three’ which are strings
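To inspect every column in one pass, the sketch below loops over the columns of an invented dataframe. It uses the pandas method Series.unique() rather than numpy.unique(); unlike the NumPy function, it keeps values in order of first appearance instead of sorting them.

```python
import pandas as pd

# Invented rows; Doors mixes digit strings with spelled-out numbers.
cars = pd.DataFrame({
    "MetColor": [1.0, 0.0, 1.0],
    "Automatic": [0, 0, 1],
    "Doors": ["three", "3", "five"],
})

# Map each column name to the list of its distinct values.
uniques = {col: cars[col].unique().tolist() for col in cars.columns}

for col, values in uniques.items():
    print(col, values)
```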
Summary
Data types
◦ Numeric
◦ Character
Checked data types of each column
Count of unique data types
Selected data based on data types
Concise summary of dataframe
Checked format of each column
Got unique elements of each column
THANK YOU