Introduction to Pandas
import pandas as pd
Inspired by http://rcs.bu.edu/examples/python/data_analysis/
Before starting
● Datasets (available in our Google Drive)
○ Salaries.csv
○ flights.csv
● Prerequisites
○ Good command of Python
○ numpy
○ sklearn
Before starting
● Command in Colab:
from google.colab import drive
drive.mount('/content/drive')
fpath='/content/drive/MyDrive/<your path>'
Pandas
● Allows working with data like a table (relational)
● Provides tools for data manipulation: sorting, slicing,
aggregation, among others.
● Tools for plotting data
● Statistical information
Pandas - DataFrame
import pandas as pd
# read a csv file
dfSal=pd.read_csv('Salaries.csv')
# show the first 5 rows (default)
dfSal.head()
# show the last 5 rows (default)
dfSal.tail()
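The reading step above can be tried without the real file. This is a minimal sketch, assuming a small in-memory table with the same column names as Salaries.csv; the values are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for Salaries.csv: same column names, invented values
csv_text = """rank,discipline,phd,service,sex,salary
Prof,B,56,49,Male,186960
Prof,A,12,6,Male,93000
AsstProf,B,5,3,Female,80225
AssocProf,A,14,8,Female,77500
Prof,B,20,18,Male,104350
AsstProf,A,3,1,Female,72500
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)   # first 2 rows instead of the default 5
last_two = df.tail(2)    # last 2 rows
print(first_two)
print(last_two)
```

Passing n to head() or tail() is handy when the default 5 rows do not fit on screen.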
DataFrame data types
Pandas Type   Native Type   Description
object        string        columns with strings and mixed types
int64         int           numeric (integers)
float64       float         numeric with decimals
datetime64    N/A           date and time values (time series)
DataFrame data types
dfSal['salary'].dtype
dtype('int64')
dfSal.salary.dtype
dtype('int64')
Be careful: if the attribute's name is a pandas reserved word (for example, the name of a DataFrame method), you have to use df['attribute'].xxxxx
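The warning about reserved words can be checked on a tiny table. This is a sketch with invented values; it relies on the fact that DataFrame already has a method called rank, so the column named 'rank' is only reachable with brackets.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "salary": [186960, 80225, 104350],
})

# 'salary' does not clash with any DataFrame attribute: both forms agree
same = df.salary.equals(df["salary"])

# 'rank' clashes with the DataFrame.rank method: dot access returns the method
dot_is_method = callable(df.rank)

# bracket access always returns the column
bracket_is_series = isinstance(df["rank"], pd.Series)

print(same, dot_is_method, bracket_is_series)
```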
dfSal.dtypes
rank object
discipline object
phd int64
service int64
sex object
salary int64
dtype: object
DataFrame attributes
Attribute   Description
dtypes      types of the columns
columns     column names
axes        list with the row labels and the column names
ndim        number of dimensions
size        number of elements
shape       tuple with the dimensionality
values      numpy representation
index       row labels
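A short sketch of the attributes in the table, using a small dataframe with invented values. Note that these are attributes, not methods, so they are read without parentheses.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "service": [49, 3, 18, 8],
    "salary": [186960, 80225, 104350, 77500],
})

print(df.shape)           # tuple: (number of rows, number of columns)
print(df.ndim)            # number of dimensions
print(df.size)            # number of elements: rows * columns
print(list(df.columns))   # column names
print(list(df.index))     # row labels
print(df.dtypes)          # type of each column
print(df.values)          # numpy representation
```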
DataFrame methods
Method                     Description
head([n]), tail([n])       first or last n rows (default 5)
describe()                 descriptive statistics (numeric columns only)
max(), min()               return max/min values for all attributes
df.attribute.max()/min()   return max/min for a given attribute
mean(), median()           return mean/median for all attributes or a given one
std()                      standard deviation
sample([n])                return a random sample of rows (default 1)
dropna()                   drop all rows with missing values
drop()                     drop specified labels from rows or columns
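A few of the methods in the table, sketched on a small dataframe with invented values and one missing entry. None of these calls changes df; each returns a new object.

```python
import pandas as pd

# Invented values; the None in 'service' becomes NaN
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "service": [49, 3, None, 8],
    "salary": [186960, 80225, 104350, 77500],
})

stats = df.describe()                       # numeric columns only
max_salary = df.salary.max()                # max of one attribute
mean_salary = df["salary"].mean()           # mean of one attribute
complete = df.dropna()                      # rows without missing values
no_service = df.drop(columns=["service"])   # drop one column

print(stats)
print(max_salary, mean_salary)
print(complete)
```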
Grouping data
● Pandas groupby can be used to:
○ Split data into groups based on some criteria
○ Calculate statistics (or apply a function) to each group
# grouping by rank (attribute)
dfRank=dfSal.groupby(['rank'])
dfRank.mean(numeric_only=True) # skip the text columns (discipline, sex)
# or
dfSal.groupby(['rank']).mean(numeric_only=True)
# we can calculate statistics
dfSal.groupby('rank')[['salary']].mean()
# or
dfSal.groupby('rank').salary.mean()
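The split-then-calculate idea above can be sketched as follows, assuming a small table with invented values. The agg call is one way to get several statistics per group at once.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "salary": [186960, 80225, 104350, 77500, 72500],
})

# one statistic per group (result is a Series indexed by rank)
mean_by_rank = df.groupby("rank")["salary"].mean()

# several statistics per group (result is a dataframe)
summary = df.groupby("rank")["salary"].agg(["count", "mean", "max"])

print(mean_by_rank)
print(summary)
```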
Filtering data
● We can subset the data applying Boolean indexing (filter)
dfSalG12=dfSal[dfSal['salary'] > 120000]
dfSalG12.head()
# Any operator: > < == >= <= !=
dfWom=dfSal[dfSal['sex']== 'Female']
dfWom.head()
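Filters can also be combined. A minimal sketch with invented values: conditions are joined with & (and) or | (or), and each condition needs its own parentheses.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "salary": [186960, 80225, 104350, 77500, 72500],
})

# two conditions at once: women earning more than 75000
well_paid_women = df[(df["salary"] > 75000) & (df["sex"] == "Female")]

# membership test: rows whose rank is in a list of values
senior = df[df["rank"].isin(["Prof", "AssocProf"])]

print(well_paid_women)
print(senior)
```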
Slicing
● A dataframe can be sliced in several ways
# one particular column
dfSal['salary'] # or dfSal.salary creates a Series
dfSal[['salary']] # creates a dataframe
# two or more columns
dfSal[['rank','salary']] # to store dfRS=dfSal[['rank','salary']]
# Select rows by their position
dfSal[10:20] # from the 11th row to the 20th (positions start at 0)
# create a new dataframe from selected rows of another dataframe
s=[dfSal[0:10],dfSal[50:60],dfSal[100:110]] # select the rows
dfSlice=pd.concat(s) # concat them to a new dataframe
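# Select rows with labels 0 to 20 (loc includes the end label, so 21 rows) and some labels (attributes)
dfSal.loc[:20,['rank','sex','salary']]
# or by column position (this needs iloc; its end is excluded, so :21 gives the same rows)
dfSal.iloc[:21,[0,4,5]]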
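The difference between label-based and position-based selection is easy to miss. A sketch with invented values: loc works with labels and includes the end of the slice, iloc works with integer positions and excludes it.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "salary": [186960, 80225, 104350, 77500, 72500],
})

# loc: row labels 0, 1 and 2 (end label included), columns by name
by_label = df.loc[:2, ["rank", "salary"]]

# iloc: row positions 0 and 1 (end position excluded), columns by position
by_position = df.iloc[:2, [0, 2]]

print(by_label)
print(by_position)
```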
Slicing
● More ways to slice: the iloc method (selection by position)
dfSal.iloc[0] # first row (returned as a Series)
dfSal.iloc[i] # (i+1)th row (remember 0 is the first one)
dfSal.iloc[-1] # last row or dfSal.tail(1)
dfSal.iloc[:,0] # first column or dfSal['rank']
dfSal.iloc[:,-1] # last column or dfSal['salary']
dfSal.iloc[0:7] # first 7 rows or dfSal.head(7)
dfSal.iloc[:,0:2] # first 2 columns or dfSal[['rank','discipline']]
dfSal.iloc[[0,5],[1,3]] # 1st and 6th rows and 2nd and 4th columns
Dropping
● Delete rows with drop
dfSal.drop([5,6], axis=0, inplace=True)
dfSal=dfSal.iloc[:100] # Overwrite the df with the first 100 rows
# deleting using conditions
dfSal.drop(dfSal[(dfSal['salary'] >1000) & (dfSal['sex']=='Male')].index, axis=0, inplace=True)
# delete columns
dfSal.drop(['salary'], axis=1, inplace=True)
# multiple columns at once (instead of the single-column drop above)
dfSal.drop(['sex','salary'], axis=1, inplace=True)
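A sketch of drop without inplace=True, using invented values. Each call returns a new dataframe and leaves df untouched, which makes it easier to try things out without losing data.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
    "salary": [186960, 80225, 104350, 77500, 72500],
})

# drop rows by label
trimmed = df.drop([1, 3])

# drop rows that match a condition
mask = (df["salary"] > 100000) & (df["sex"] == "Male")
without_high_men = df.drop(df[mask].index)

# same rows, written as a filter with the negated condition
same_result = df[~mask]

# drop columns by name (equivalent to axis=1)
narrow = df.drop(columns=["sex", "salary"])

print(without_high_men)
```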
Sorting
● By default, sorting is ascending and returns a new sorted dataframe (the original is not modified)
dfSal.sort_values(by='service') # default ascending=True inplace=False
dfSal.sort_values(by=['service','salary']) # sort salary within service
# sort by service ascending and salary descending
dfSal.sort_values(by=['service','salary'], ascending=[True, False])
# sort the dataframe by column label (attribute name)
dfSal.sort_index(axis=1,ascending=True, inplace=True)
dfSal.head(5)
Note: axis=0 refers to rows
axis=1 refers to columns
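A small sketch of the sorting options above, with invented values. The original row labels travel with the rows, which shows how the order changed.

```python
import pandas as pd

# Invented values, for illustration only
df = pd.DataFrame({
    "service": [3, 18, 3, 8],
    "salary": [80225, 104350, 72500, 77500],
})

# service ascending, and salary descending within equal service
by_both = df.sort_values(by=["service", "salary"], ascending=[True, False])

# sort the columns by their label (axis=1)
by_column_label = df.sort_index(axis=1)

# renumber the rows 0..n-1 after sorting
renumbered = by_both.reset_index(drop=True)

print(by_both)
print(by_column_label)
```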
Missing values (NaN)
● Most of the time missing values are marked as NaN
dfFlig=pd.read_csv('flights.csv')
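dfFlig.isnull() # dataframe of True/False: True where the value is missing
dfFlig[dfFlig.isnull().any(axis=1)].head() # first rows with at least one missing value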
dfSal.iloc[0].isnull().sum() # number of null values in row 0
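The same checks can be tried on a small table. This is a sketch; the column names below are assumptions and the values are invented, they are not taken from flights.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical flights-like table (assumed column names, invented values)
df = pd.DataFrame({
    "carrier": ["AA", "UA", "DL", "AA"],
    "dep_delay": [5.0, np.nan, -2.0, np.nan],
    "arr_delay": [10.0, np.nan, np.nan, 3.0],
})

# number of missing values in each column
per_column = df.isnull().sum()

# rows with at least one missing value
rows_with_nan = df[df.isnull().any(axis=1)]

# number of missing values in the first row
in_first_row = df.iloc[0].isnull().sum()

print(per_column)
print(rows_with_nan)
```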
Missing values methods
Method                      Description
dropna()                    drop rows that contain any missing value
dropna(how='all')           drop missing observations (rows) where all attributes are NaN
dropna(axis=1, how='all')   drop columns if all values are missing
dropna(thresh=n)            drop rows that contain fewer than n non-missing values
fillna(0)                   replace missing values with zeros
sample([n])                 return a random sample of rows (default 1)
isnull()                    returns True if the value is missing
notnull()                   returns True if the value is non-missing
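The options in the table behave quite differently on the same data. A minimal sketch on an invented table in which column c is entirely missing:

```python
import numpy as np
import pandas as pd

# Invented values; column 'c' has no data at all
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

any_missing = df.dropna()                      # drops every row here (c is always NaN)
all_missing = df.dropna(how="all")             # drops only the fully empty row
no_empty_cols = df.dropna(axis=1, how="all")   # drops column c
at_least_two = df.dropna(thresh=2)             # keeps rows with 2 or more values
filled = df.fillna(0)                          # replaces NaN with 0

print(all_missing)
print(at_least_two)
```

Because dropna() with no arguments can remove far more than intended, it is worth checking the shape before and after.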
Graphics with Data Frame
● Pandas DataFrame offers some methods to plot data
# notebook magic (Jupyter/Colab): show the plots inside the notebook
%matplotlib inline
import matplotlib.pyplot as plt
dfSal.plot(x='rank',y='salary')
dfSal['salary'].plot.hist()