Intro Pandas

This document provides an introduction to the Pandas library in Python, covering its capabilities for data manipulation, including reading CSV files, data types, attributes, methods, and handling missing values. It also discusses grouping, filtering, slicing, sorting data, and basic plotting functionalities. The document includes code snippets to illustrate various operations and methods available in Pandas.

Uploaded by

duarte.denio


Introduction to Pandas
import pandas as pd

Inspired by http://rcs.bu.edu/examples/python/data_analysis/

Before starting

● Datasets (available in our Google Drive)

○ Salaries.csv
○ flights.csv
● Prerequisites
○ Good command of Python
○ numpy
○ sklearn
Before starting

● Command in Colab:
from google.colab import drive
drive.mount('/content/drive')
fpath='/content/drive/MyDrive/<your path>'
Pandas

● Allows working with data like a table (relational)

● Provides tools for data manipulation: sorting, slicing, aggregation, among others.
● Tools for plotting data
● Statistical information
Pandas - DataFrame
import pandas as pd
# read a csv file
dfSal=pd.read_csv('Salaries.csv')
# show the first 5 rows (default)
dfSal.head()
# show the last 5 rows (default)
dfSal.tail()
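The same calls can be tried without the course files. The sketch below is only an illustration: the three rows are made up and merely mimic the column layout of Salaries.csv, and the CSV text is read from memory through io.StringIO instead of from disk.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column layout of Salaries.csv
csv_text = """rank,discipline,phd,service,sex,salary
Prof,B,19,18,Male,139750
AssocProf,A,12,6,Female,97000
AsstProf,B,4,3,Male,79750
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

first_two = df.head(2)   # first 2 rows instead of the default 5
last_one = df.tail(1)    # last row only
print(first_two)
print(last_one)
```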
DataFrame data types

Pandas Type    Native Type    Description
object         string         Columns with strings and mixed types
int64          int            numeric
float64        float          numeric with decimals
datetime64     N/A            stores time series

DataFrame data types
Be careful: if the attribute's name is a pandas reserved word, you have to use df['attribute'].xxxxx
dfSal['salary'].dtype
dtype('int64')
dfSal.salary.dtype
dtype('int64')
dfSal.dtypes
rank object
discipline object
phd int64
service int64
sex object
salary int64
dtype: object
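The warning about reserved words applies to this very dataset, because rank is also the name of a DataFrame method. A minimal sketch, using a made-up two-column frame rather than the real Salaries.csv:

```python
import pandas as pd

# Made-up toy data; only the column names follow Salaries.csv
df = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "Prof"],
    "salary": [150000, 90000, 120000],
})

# Attribute access finds the DataFrame.rank method, not the column
rank_attr_is_method = callable(df.rank)

# Bracket access always returns the column as a Series
rank_col = df["rank"]

# 'salary' does not clash with any DataFrame member, so both forms agree
same = df.salary.equals(df["salary"])
print(rank_attr_is_method, type(rank_col).__name__, same)
```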
DataFrame attributes

Attribute    Description
dtypes       types of the columns
columns      column names
axes         list the row labels and column names
ndim         number of dimensions
size         number of elements
shape        tuple with the dimensionality
values       numpy representation
index        row labels
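To see what these attributes return, here is a small sketch on a made-up frame with three rows and two columns (the numbers are illustrative only):

```python
import pandas as pd

# Made-up toy data: 3 rows, 2 numeric columns
df = pd.DataFrame({"phd": [19, 20, 4], "salary": [150000, 90000, 120000]})

shape = df.shape            # (rows, columns)
ndim = df.ndim              # a DataFrame is 2-dimensional
size = df.size              # rows * columns
cols = list(df.columns)     # column names
idx = list(df.index)        # row labels (0, 1, 2 by default)
arr = df.values             # numpy array holding the data
print(shape, ndim, size, cols, idx)
print(arr)
```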

DataFrame methods

Method                      Description
head([n]), tail([n])        first or last n rows (default 5)
describe()                  descriptive statistics (numeric columns)
max(), min()                return max/min values for all attributes
df.attribute.min()/max()    return max/min for a given attribute
mean(), median()            return mean/median for all attributes or a given one
std()                       standard deviation
sample([n])                 return a random sample of rows (default 1)
dropna()                    drop all rows with missing values
drop()                      drop specified labels from rows or columns
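A short sketch of a few of these methods on made-up numbers (not taken from Salaries.csv). The random_state argument is added here only to make the sample repeatable.

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "service": [18, 16, 3, 39],
    "salary": [139750, 173200, 79750, 115000],
})

summary = df.describe()                  # count, mean, std, min, quartiles, max per column
max_salary = df.salary.max()             # maximum of one attribute
mean_service = df["service"].mean()      # mean of one attribute
median_salary = df["salary"].median()    # median of one attribute
two_rows = df.sample(2, random_state=0)  # 2 random rows
print(summary)
```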

Grouping data

● Pandas groupby can be used to:

○ Split data into groups based on some criteria
○ Calculate statistics (or apply a function) to each group
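# grouping by rank (attribute)
dfRank=dfSal.groupby(['rank'])
# numeric_only=True skips the string columns (required in pandas 2.0+)
dfRank.mean(numeric_only=True)
# or
dfSal.groupby(['rank']).mean(numeric_only=True)
# we can calculate statistics for a single attribute
dfSal.groupby('rank')[['salary']].mean() # returns a dataframe
# or
dfSal.groupby('rank').salary.mean() # returns a Series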
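The split-then-compute idea can be checked by hand on a tiny frame. The sketch below uses made-up salaries and also shows agg, which computes several statistics per group in one call.

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf"],
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [150000, 130000, 80000, 90000],
})

# One statistic for one attribute: a Series indexed by rank
mean_by_rank = df.groupby("rank")["salary"].mean()

# Several statistics at once: one row per rank, one column per statistic
stats = df.groupby("rank")["salary"].agg(["mean", "min", "max"])

# Mean of every numeric column; the string column 'sex' is skipped
numeric_means = df.groupby("rank").mean(numeric_only=True)
print(stats)
```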
Filtering data

● We can subset the data applying Boolean indexing (filter)

dfSalG12=dfSal[dfSal['salary'] > 120000]
dfSalG12.head()
# Any operator: > < == >= <= !=
dfWom=dfSal[dfSal['sex']== 'Female']
dfWom.head()
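Conditions can also be combined. A minimal sketch on made-up data: use & for "and", | for "or", and wrap each comparison in parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [150000, 130000, 80000, 90000],
})

# Both conditions must hold
well_paid_women = df[(df["salary"] > 120000) & (df["sex"] == "Female")]

# At least one condition must hold
either = df[(df["salary"] > 120000) | (df["sex"] == "Female")]
print(well_paid_women)
print(either)
```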
Slicing

● A dataframe can be sliced in several ways

# one particular column
dfSal['salary'] # or dfSal.salary creates a Series
dfSal[['salary']] # creates a dataframe
# two or more columns
dfSal[['rank','salary']] # to store dfRS=dfSal[['rank','salary']]
# Selecting rows by their position
dfSal[10:20] # from the 11th row to the 20th (positions 10 to 19; counting starts at 0)
# create a new dataframe from selected rows of another dataframe
s=[dfSal[0:10],dfSal[50:60],dfSal[100:110]] # select the rows
dfSlice=pd.concat(s) # concat them to a new dataframe
# Selecting rows (first 20) and some labels (attributes); loc includes the end label
dfSal.loc[:19,['rank','sex','salary']]
# or by column position (iloc is position based and excludes the end)
dfSal.iloc[:20,[0,4,5]]
Slicing

● More on the iloc method

dfSal.iloc[0] # first row (returned as a Series)
dfSal.iloc[i] # (i+1)th row (remember 0 is the first one)
dfSal.iloc[-1] # last row or dfSal.tail(1)
dfSal.iloc[:,0] # first column or dfSal['rank']
dfSal.iloc[:,-1] # last column or dfSal['salary']
dfSal.iloc[0:7] # first 7 rows or dfSal.head(7)
dfSal.iloc[:,0:2] # first 2 columns or dfSal[['rank','discipline']]
dfSal.iloc[[0,5],[1,3]] # 1st and 6th rows and 2nd and 4th columns
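The difference between loc (label based, end included) and iloc (position based, end excluded) is easy to miss. A small sketch on made-up data:

```python
import pandas as pd

# Made-up toy data with the default row labels 0..4
df = pd.DataFrame({"rank": list("ABCDE"), "salary": [1, 2, 3, 4, 5]})

by_label = df.loc[:2, ["rank", "salary"]]   # labels 0, 1, 2: three rows
by_pos = df.iloc[:2, [0, 1]]                # positions 0, 1: two rows

# After filtering, labels and positions no longer coincide
sub = df[df["salary"] > 2]                  # keeps the rows labelled 2, 3, 4
first_by_pos = sub.iloc[0]                  # first remaining row
first_by_label = sub.loc[2]                 # the same row, addressed by its label
print(len(by_label), len(by_pos))
```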
Dropping

● Delete rows or columns with drop

# delete rows by label
dfSal.drop([5,6], axis=0, inplace=True)
dfSal=dfSal.iloc[:100] # Overwrite the df with the first 100 rows
# deleting using conditions
dfSal.drop(dfSal[(dfSal['salary'] >1000) & (dfSal['sex']=='Male')].index, axis=0, inplace=True)
# delete a column
dfSal.drop(['salary'], axis=1, inplace=True)
# or delete multiple columns at once (run instead of the line above: 'salary' cannot be dropped twice)
dfSal.drop(['sex','salary'], axis=1, inplace=True)
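Without inplace=True, drop returns a new dataframe and leaves the original untouched, which is often easier to reason about. A sketch on made-up data, using the index= and columns= keywords as an alternative to axis=:

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf"],
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [150000, 130000, 80000, 90000],
})

# Drop rows 0 and 1 and the column 'sex' in one call; df itself is not modified
trimmed = df.drop(index=[0, 1], columns=["sex"])

# Conditional deletion written as a filter: keep everything that does NOT match
kept = df[~((df["salary"] > 100000) & (df["sex"] == "Male"))]
print(trimmed)
print(kept)
```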
Sorting

● By default, sorting is ascending and returns a new sorted dataframe

dfSal.sort_values(by='service') # default ascending=True inplace=False
dfSal.sort_values(by=['service','salary']) # sort salary within service
# sort by service ascending and salary descending
dfSal.sort_values(by=['service','salary'], ascending=[True, False])
# sort the dataframe by column label (attribute name)
dfSal.sort_index(axis=1,ascending=True, inplace=True)
dfSal.head(5)

Note: axis=0 refers to rows; axis=1 refers to columns
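A small sketch of a mixed-direction sort on made-up data. Note that the sorted frame keeps the original row labels; reset_index(drop=True) renumbers them if needed.

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({
    "service": [3, 18, 3, 16],
    "salary": [80000, 150000, 90000, 130000],
})

# service ascending, and within equal service, salary descending
sorted_df = df.sort_values(by=["service", "salary"], ascending=[True, False])

# The original frame is unchanged because inplace defaults to False
renumbered = sorted_df.reset_index(drop=True)
print(sorted_df)
```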
Missing values (NaN)

● Most of the time missing values are marked as NaN

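dfFlig=pd.read_csv('flights.csv')
# dataframe of True/False: True where the value is missing
dfFlig.isnull()
# rows that have at least one missing value
dfFlig[dfFlig.isnull().any(axis=1)].head()
dfSal.iloc[0].isnull().sum() # number of null values in row 0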

Missing values methods

Method                      Description
dropna()                    drop missing observations (rows)
dropna(how='all')           drop missing observations (rows) where all attributes are NaN
dropna(axis=1,how='all')    drop columns if all values are missing
dropna(thresh=n)            drop rows that contain fewer than n non-missing values
fillna(0)                   replace missing values with zeros
sample([n])                 return a random sample of rows (default 1)
isnull()                    returns True if the value is missing
notnull()                   returns True if the value is non-missing
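The effect of each option is easiest to see on a tiny frame. In the sketch below the column names dep_delay and arr_delay and all values are made up for illustration and are not taken from flights.csv. Filling with the column mean is shown as one common alternative to fillna(0).

```python
import numpy as np
import pandas as pd

# Made-up toy data: row 1 has one missing value, row 2 is entirely missing
df = pd.DataFrame({
    "dep_delay": [2.0, np.nan, np.nan],
    "arr_delay": [11.0, 20.0, np.nan],
})

missing_per_column = df.isnull().sum()      # count of NaN in each column
complete_rows = df.dropna()                 # rows without any NaN
not_all_missing = df.dropna(how="all")      # drops only the all-NaN row
at_least_one = df.dropna(thresh=1)          # keeps rows with >= 1 non-missing value
zero_filled = df.fillna(0)                  # NaN replaced with 0
mean_filled = df.fillna(df.mean())          # NaN replaced with the column mean
print(missing_per_column)
```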

Graphics with Data Frame

● Pandas DataFrame offers some methods to plot data

# notebook magic: show the plots inside the notebook (Jupyter/Colab only)
%matplotlib inline
import matplotlib.pyplot as plt
dfSal.plot(x='rank',y='salary')
dfSal['salary'].plot.hist()

