Introduction to Pandas
Pandas
• Pandas is a software package written for the Python
programming language for data manipulation and analysis.
• For more information about Pandas, see the project website:
http://pandas.pydata.org/.
Travel Time Index Dataset
• We will be utilizing the Travel Time Index (TTI) dataset in this lecture. TTI serves as a metric
for average travel conditions, offering insights into the extent to which travel times are
extended during congestion in comparison to periods of light traffic. For more
comprehensive information about this dataset, please refer to our publication, which can be
accessed at the following link: https://ascelibrary.org/doi/abs/10.1061/9780784484876.040
• Download travel time index (TTI) data from
– https://uh.edu/tech/cm-lab/hourly_tti.csv
• Download weather data from the same city:
– https://uh.edu/tech/cm-lab/weather.csv
• Here is a link to a Google Colab notebook that shows how to use Pandas to work with and
study the TTI dataset:
– https://colab.research.google.com/drive/1IDsXyqocsJzJ42pDxxYKwyzYZ-Fg_Siw?usp=sharing
What is the Travel Time Index (TTI)?
• Travel time index is a metric used to measure the relative travel
time on a road network compared to the ideal or free-flow
travel time.
• TTI is typically expressed as a ratio or percentage.
$$TTI = \frac{\text{average travel time}}{\text{free-flow travel time}}$$
• A TTI value of 1.3, for example, indicates a 20-minute free-flow
trip requires 26 minutes.
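The ratio above can be checked with a few lines of Python. This is only a minimal sketch: the function name `travel_time_index` and its argument names are illustrative assumptions, not part of Pandas or of the TTI dataset.

```python
def travel_time_index(average_minutes: float, free_flow_minutes: float) -> float:
    """Return TTI = average travel time / free-flow travel time."""
    if free_flow_minutes <= 0:
        raise ValueError("free-flow travel time must be positive")
    return average_minutes / free_flow_minutes


# The slide's example: a 20-minute free-flow trip that takes 26 minutes.
tti = travel_time_index(average_minutes=26, free_flow_minutes=20)
print(round(tti, 2))
```

A value above 1 means the trip takes longer than it would in free-flow conditions.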
90th or 95th percentile travel times
• The 90th or 95th percentile travel times serve as a
straightforward method to gauge the reliability of travel
durations. They provide an estimation of the extent of delays
on specific routes during peak traffic periods, particularly on
the heaviest traffic days.
The 90th or 95th percentile travel times
gauge travel reliability, estimating delays
on busy routes during peak traffic,
especially on heavy traffic days.
The buffer index represents the extra time travelers should add
to their average travel time to ensure timely arrival. For example,
with a 40 percent buffer index, a trip that averages 20 minutes
would require an additional 8 minutes, totaling 28 minutes for a
95 percent on-time arrival.
$$BI = \frac{\text{95th percentile travel time} - \text{average travel time}}{\text{average travel time}} \times 100$$
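The buffer index formula can be sketched in Python as follows. The function name `buffer_index` and the `observed` travel times are made-up illustrations (assumptions), and `np.percentile` is used here as one reasonable way to estimate the 95th percentile from a sample.

```python
import numpy as np


def buffer_index(p95_minutes: float, average_minutes: float) -> float:
    """Return BI (%) = (95th percentile - average) / average * 100."""
    return (p95_minutes - average_minutes) / average_minutes * 100


# The slide's example: average trip of 20 minutes, 95th percentile of 28 minutes.
bi = buffer_index(p95_minutes=28, average_minutes=20)
print(round(bi, 1))

# Hypothetical observed travel times (minutes) for one route.
observed = np.array([18, 19, 20, 20, 21, 22, 24, 30], dtype=float)
bi_observed = buffer_index(np.percentile(observed, 95), observed.mean())
print(round(bi_observed, 1))
```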
The planning time index represents the total travel time
needed for on-time arrival, whereas the buffer index
represents only the extra time. A planning time index of 1.60
means that a trip taking 15 minutes in light traffic requires
24 minutes for 95 percent on-time arrival.
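The planning time index example can be reproduced in the same way. The slide does not give the formula, so this sketch assumes the commonly used definition, the 95th percentile travel time divided by the free-flow travel time; the function name is hypothetical.

```python
def planning_time_index(p95_minutes: float, free_flow_minutes: float) -> float:
    """Assumed definition: PTI = 95th percentile travel time / free-flow travel time."""
    return p95_minutes / free_flow_minutes


# The slide's example: 24 minutes must be planned for a 15-minute free-flow trip.
pti = planning_time_index(p95_minutes=24, free_flow_minutes=15)
print(round(pti, 2))

# Going the other way: total time to plan for a trip, given a PTI.
planned_minutes = 1.60 * 15
print(round(planned_minutes, 1))
```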
Topics
Topic                                         Video   Colab Link
Introduction to Pandas                        Video   N.A.
DataFrame                                     Video   N.A.
Create a DataFrame                            Video   link
Import and export CSV file                    Video   link
Import from Google spreadsheet                N.A.    link
Summary statistics of DataFrame               Video   link
Count unique values of a column               Video   link
Missing data                                  Video   link
Delete columns                                Video   link
Create a standard date time format            Video   link
Merge two datasets through a common column    Video   link
Groupby                                       N.A.    link
Lagged variable                               N.A.    link
Different ways to access external data in Google Colab
• In Google Colab, users have several convenient ways to access external data. These methods
include:
– Drag and Drop Files: Users can easily upload files from their local computer to Colab by simply
dragging and dropping them into the Colab environment. This provides a straightforward and
user-friendly way to import data.
– Import Data from Google Drive: Another method is to import data from Google Drive. To do this,
follow these steps:
• Ensure that the file you want to access is shared with "Anyone with the link."
• Copy the file ID from the shared link of the file in Google Drive.
• In Colab, you can use the ‘gdown’ library to download the file using its file ID. For example:
!gdown https://drive.google.com/uc?id=YOUR_FILE_ID
– Fetching Data from the Web: Data can also be obtained directly from the web by downloading or
scraping it. You can use the ‘wget’ command to download data from a URL and save it to your
Colab environment. For instance:
!wget -O test.csv https://www.example.com/data.csv
• Example: https://colab.research.google.com/drive/1LCXDGbWvM3ozUwSpAGc4UgjZk4gO7vRQ?usp=sharing
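The "copy the file ID" step for Google Drive can also be done in code. The sketch below assumes the share link has the usual `https://drive.google.com/file/d/FILE_ID/view?...` shape; the helper name `drive_download_url` is hypothetical, and `YOUR_FILE_ID` is a placeholder rather than a real file.

```python
import re


def drive_download_url(share_link: str) -> str:
    """Build the uc?id=... download URL from a Google Drive share link."""
    # Assumption: the file ID is the path segment right after "/d/".
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("no file ID found in link")
    return f"https://drive.google.com/uc?id={match.group(1)}"


url = drive_download_url(
    "https://drive.google.com/file/d/YOUR_FILE_ID/view?usp=sharing"
)
print(url)
# In a Colab cell, the resulting URL can then be passed to gdown:
# !gdown <the URL printed above>
```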
DataFrame
• A DataFrame is a 2-dimensional labeled
data structure with columns of potentially
different types. You can think of it like a
spreadsheet. Most of the data we use in
this course is stored in DataFrames.
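As a small illustration, a DataFrame can be built from a dictionary of columns. The column names and values below are made up for this sketch and are not taken from the TTI dataset.

```python
import pandas as pd

# Each dictionary key becomes a column; columns may have different types.
df = pd.DataFrame(
    {
        "hour": [7, 8, 9],                     # integers
        "tti": [1.10, 1.35, 1.20],             # floats
        "weather": ["clear", "rain", "clear"],  # strings
    }
)

print(df)
print(df.dtypes)  # one data type per column
```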
Import and export a CSV file from a URL
• There are two methods for handling CSV data from a URL.
– Method 1: Download the CSV file to Google Colab, then read it with
Pandas.
– Method 2: Read CSV directly from a URL link.
• Example link
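A minimal sketch of exporting and importing CSV data is shown below. To stay runnable without network access, it writes to and reads from an in-memory buffer (an assumption made for this example); the commented lines show how the same calls might look in Colab with the TTI URL from this lecture.

```python
import io

import pandas as pd

# Small made-up table standing in for the real dataset.
df = pd.DataFrame({"hour": [7, 8], "tti": [1.25, 1.5]})

# Export: write the DataFrame as CSV text (index=False omits the row labels).
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Import: read the CSV text back into a new DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)

# Method 1 in Colab: download first, then read the local file.
#   !wget -O hourly_tti.csv https://uh.edu/tech/cm-lab/hourly_tti.csv
#   tti_df = pd.read_csv("hourly_tti.csv")
# Method 2: read_csv also accepts a URL directly.
#   tti_df = pd.read_csv("https://uh.edu/tech/cm-lab/hourly_tti.csv")
```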
Summary Statistics
• Summary statistics are essential for gaining quick insights into your dataset.
• Pandas provides an easy way to calculate these statistics using the describe()
method of a DataFrame.
• Example link: https://colab.research.google.com/drive/1b4t6AGeHzQEZR-50EeFlLfBLlMOavl5B?usp=share_link#scrollTo=yKVlaaQmXNZL
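For instance, calling describe() on a small made-up DataFrame (the values are illustrative only) returns the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column:

```python
import pandas as pd

df = pd.DataFrame({"tti": [1.0, 1.2, 1.4, 1.6]})

# describe() summarizes every numeric column in one table.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "tti"])
```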
Displaying Column Names in Pandas
• Knowing the column names in your dataset is crucial for data
manipulation and analysis.
• In Pandas, we can easily display all the column names: use the
‘.columns’ attribute of your DataFrame.
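A short sketch, using made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2017-09-13 07:00"], "tti": [1.25]})

# .columns is an attribute (no parentheses); it returns an Index of names.
print(df.columns)

# Converting to a plain list is often convenient for printing or looping.
column_names = list(df.columns)
print(column_names)
```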
Counting and Displaying Unique Values in a Column with Pandas
• Counting and displaying unique values within a
column is essential for understanding the
diversity of your data. In Pandas, we can easily
achieve this.
• To count and display unique values in a Pandas
DataFrame column:
– Use the .unique() function to get the unique
values.
– Use the .nunique() function to count them.
• Example link:
– https://colab.research.google.com/drive/1hZUhcf_tnmRx3SizbG7sjYnmwy6F414D?usp=share_link#scrollTo=t5dO41wqYmM_
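The two calls can be tried on a small made-up column (the `weather` values are illustrative, not from the course data):

```python
import pandas as pd

df = pd.DataFrame({"weather": ["clear", "rain", "clear", "fog"]})

# unique() returns the distinct values in order of first appearance.
values = df["weather"].unique()
print(values)

# nunique() returns how many distinct values there are.
count = df["weather"].nunique()
print(count)
```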
Removing Missing Values using .dropna() in Pandas
• Handling missing values is a crucial
step in data preprocessing.
• Pandas provides the .dropna() function
to remove rows containing missing
values.
• To remove rows with missing values:
– Use df.dropna() without any additional
arguments.
– This will remove any row containing at
least one missing value.
• Example link:
– https://colab.research.google.com/drive/19pJ4Q9UCeF-8DWFLe_VKUSF3sUfP2GON?usp=share_link#scrollTo=3GN4R9U4c4yw
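A minimal sketch with one missing value in a made-up table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [7, 8, 9], "tti": [1.1, np.nan, 1.3]})

# dropna() returns a new DataFrame; the original df is left unchanged.
clean = df.dropna()
print(clean)

# Counting missing values per column first shows what will be dropped.
print(df.isna().sum())
```

Because dropna() returns a copy, the result needs to be assigned to a variable (as above) for the removal to persist.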
Creating New Columns Based on Existing Data in Pandas
• Often, you may need to create new
columns in your dataset based on
values from existing columns.
Pandas provides a straightforward
way to achieve this.
• To create a new column based on
existing data:
– Use the DataFrame assignment
operator (=) to define the new column.
– Utilize operations or functions with
existing columns.
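For example, assuming hypothetical columns `average_min` and `free_flow_min`, a TTI column and a simple congestion flag could be derived like this (the 1.25 threshold is an arbitrary choice for the sketch):

```python
import pandas as pd

df = pd.DataFrame(
    {"free_flow_min": [20.0, 15.0], "average_min": [26.0, 18.0]}
)

# Arithmetic on columns is applied row by row.
df["tti"] = df["average_min"] / df["free_flow_min"]

# A comparison produces a True/False column.
df["congested"] = df["tti"] > 1.25
print(df)
```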
Datetime Format
• Converting datetime values to a
common format is essential for
consistent data analysis. To standardize
dates written in different styles
(e.g., 9/13/2017, September 13, 2017,
2017-9-13), we can use the Pandas
to_datetime() function.
• Example link:
– https://colab.research.google.com/drive/1oGDEUkwDsFQJyM8vZOIFLCMpgikZx2aT?usp=share_link#scrollTo=kn2pDmSle71S
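A minimal sketch of both situations follows. The `date` column is made up; when a whole column shares one style, passing an explicit `format` is a common choice, while the three spellings from the slide are parsed one at a time here so the example does not depend on how a given Pandas version handles mixed formats in a single column.

```python
import pandas as pd

# Hypothetical column where every date uses the month/day/year style.
df = pd.DataFrame({"date": ["9/13/2017", "10/1/2017"]})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Once parsed, dates can be written out in a single standard form.
standardized = df["date"].dt.strftime("%Y-%m-%d")
print(standardized.tolist())

# The three spellings from the slide, each parsed individually.
spellings = ["9/13/2017", "September 13, 2017", "2017-9-13"]
same_day = [pd.to_datetime(s) for s in spellings]
print(same_day)
```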
Groupby
• In some cases, we want to split the data into
subsets and apply some functionality on each
subset. Grouping and aggregating data is a
fundamental operation in data analysis,
allowing us to gain insights from structured
data. Pandas' groupby() function is a powerful
tool for achieving this. It enables us to group
data based on a specific column's values,
facilitating subsequent analysis and summary
statistics for each group.
• Example link:
– https://colab.research.google.com/drive/1vKax-2pIZJSel_wQ51ddKZSLJ74gBJgr?usp=sharing#scrollTo=lM1WSNMt8T0M
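As an illustration of split-then-aggregate, the sketch below averages a made-up `tti` column within each `weekday` group (both column names are assumptions for this example):

```python
import pandas as pd

df = pd.DataFrame(
    {"weekday": [0, 0, 1, 1], "tti": [1.0, 1.5, 1.25, 1.75]}
)

# Split rows by weekday, then compute the mean TTI of each subset.
mean_by_day = df.groupby("weekday")["tti"].mean()
print(mean_by_day)

# Several statistics per group can be requested at once with agg().
stats_by_day = df.groupby("weekday")["tti"].agg(["mean", "max"])
print(stats_by_day)
```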
Combining Two DataFrames
• Combining data from multiple sources is a common task in
data analysis. Pandas provides the powerful merge() function
for merging DataFrames.
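A small sketch of merging a TTI table with a weather table on a shared column. The column names (`datetime`, `tti`, `temp_f`) and values are hypothetical; the real files may use different names.

```python
import pandas as pd

tti = pd.DataFrame(
    {"datetime": ["2017-09-13 07:00", "2017-09-13 08:00"], "tti": [1.25, 1.5]}
)
weather = pd.DataFrame(
    {"datetime": ["2017-09-13 07:00", "2017-09-13 09:00"], "temp_f": [75, 80]}
)

# Inner merge: keep only rows whose datetime appears in both tables.
merged = pd.merge(tti, weather, on="datetime", how="inner")
print(merged)

# Left merge: keep every TTI row; missing weather values become NaN.
merged_left = pd.merge(tti, weather, on="datetime", how="left")
print(merged_left)
```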
Convert categorical variable into dummy/indicator variables
• A lot of real-world data is discrete or categorical. For
example, the weekday variable takes the discrete values 0, 1, 2,
..., 6. To use machine learning models in sklearn, we need to
convert these categorical variables into dummy variables using
the Pandas get_dummies() function.
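A minimal sketch with a made-up `weekday` column. Because the weekday codes are stored as integers, the column is named explicitly through the `columns` argument; by default get_dummies() only converts string and categorical columns.

```python
import pandas as pd

df = pd.DataFrame({"weekday": [0, 1, 6, 1], "tti": [1.0, 1.25, 1.5, 1.75]})

# One indicator column is created per distinct weekday value.
dummies = pd.get_dummies(df, columns=["weekday"], prefix="weekday")
print(dummies)
```

Only the weekday values that actually occur in the data (0, 1, and 6 here) produce columns.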
Exercises
1. Create a weekday column for the TTI dataframe.
2. Create a new column showing the previous hour’s TTI value.
3. Create a new column showing last year’s TTI value (at the
same hour of the same day of the same month).