Pandas
1. Series: The Fundamental 1D Structure
A Series is a one-dimensional labeled array capable of holding any data type.
Key Attributes:
values: Underlying data array
index: Axis labels
dtype: Data type
name: Series name
size: Number of elements
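A minimal sketch of how these attributes are read off a Series (the example values here are purely illustrative):
python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='demo')
print(s.values)  # underlying NumPy array: [10 20 30]
print(s.index)   # Index(['a', 'b', 'c'], dtype='object')
print(s.dtype)   # int64
print(s.name)    # demo
print(s.size)    # 3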
Example 1: Basic Series Creation
python
import pandas as pd
import numpy as np
# Create series from list with default index
s1 = pd.Series([10, 20, 30, 40])
print("s1:\n", s1)
# Create series with custom index
s2 = pd.Series([1.1, 2.2, 3.3], index=['a', 'b', 'c'], name='temperatures')
print("\ns2:\n", s2)
Output:
text
s1:
0 10
1 20
2 30
3 40
dtype: int64
s2:
a 1.1
b 2.2
c 3.3
Name: temperatures, dtype: float64
Example 2: Series from Dictionary (Automatic Index Alignment)
python
population_dict = {'New York': 8.4, 'London': 9.0, 'Tokyo': 13.9}
pop_series = pd.Series(population_dict)
print("Population Series:\n", pop_series)
# Update London and add Paris; values align on the index labels
# (.add(..., fill_value=0) would instead sum overlapping labels: London -> 18.1)
new_data = {'London': 9.1, 'Paris': 2.2}
pop_series = pd.Series(new_data).combine_first(pop_series)
print("\nUpdated Population:\n", pop_series)
Output:
text
Population Series:
New York 8.4
London 9.0
Tokyo 13.9
dtype: float64
Updated Population:
London 9.1
New York 8.4
Paris 2.2
Tokyo 13.9
dtype: float64
Example 3: Vectorized Operations & Filtering
python
# Vectorized operations
squares = pd.Series([1, 4, 9, 16, 25])
sqrt = np.sqrt(squares)
print("Square Roots:\n", sqrt)
# Boolean filtering
filtered = sqrt[sqrt > 2]
print("\nValues > 2:\n", filtered)
# Index-based operations
reindexed = filtered.reindex([0, 2, 4], fill_value=0)
print("\nReindexed:\n", reindexed)
Output:
text
Square Roots:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
dtype: float64
Values > 2:
2 3.0
3 4.0
4 5.0
dtype: float64
Reindexed:
0 0.0
2 3.0
4 5.0
dtype: float64
2. DataFrame: The Primary 2D Structure
A DataFrame is a two-dimensional labeled data structure with columns of potentially different
types.
Key Attributes:
index: Row labels
columns: Column labels
dtypes: Column data types
shape: (rows, columns) dimensions
values: Underlying data array
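A minimal sketch of the same idea for DataFrames (df_demo is an illustrative name, not from the examples below):
python
df_demo = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
print(df_demo.index)    # RangeIndex(start=0, stop=2, step=1)
print(df_demo.columns)  # Index(['x', 'y'], dtype='object')
print(df_demo.dtypes)   # x: int64, y: object
print(df_demo.shape)    # (2, 2)
print(df_demo.values)   # underlying 2D NumPy array (object dtype here, since columns differ)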
Example 1: DataFrame Creation
python
# From dictionary of lists
data = {
    'City': ['London', 'New York', 'Tokyo'],
    'Population (M)': [9.2, 8.4, 13.9],
    'Country': ['UK', 'USA', 'Japan']
}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
print("DataFrame:\n", df)
# Add new column
df['GDP (B)'] = [465, 1940, 1020]
print("\nWith GDP:\n", df)
Output:
text
DataFrame:
City Population (M) Country
A London 9.2 UK
B New York 8.4 USA
C Tokyo 13.9 Japan
With GDP:
City Population (M) Country GDP (B)
A London 9.2 UK 465
B New York 8.4 USA 1940
C Tokyo 13.9 Japan 1020
Example 2: Indexing and Selection
python
# Column selection
print("Cities:\n", df['City'])
# Row selection by label
print("\nRow B:\n", df.loc['B'])
# Row selection by position
print("\nFirst row:\n", df.iloc[0])
# Boolean indexing
print("\nLarge cities:\n", df[df['Population (M)'] > 10])
# Multi-axis selection
print("\nSpecific selection:\n", df.loc['C', ['City', 'GDP (B)']])
Output:
text
Cities:
A London
B New York
C Tokyo
Name: City, dtype: object
Row B:
City New York
Population (M) 8.4
Country USA
GDP (B) 1940
Name: B, dtype: object
First row:
City London
Population (M) 9.2
Country UK
GDP (B) 465
Name: A, dtype: object
Large cities:
City Population (M) Country GDP (B)
C Tokyo 13.9 Japan 1020
Specific selection:
City Tokyo
GDP (B) 1020
Name: C, dtype: object
Example 3: Handling Missing Data
python
# Create DF with missing values
df_missing = pd.DataFrame({
    'A': [1, None, 3],
    'B': [np.nan, 5, 6],
    'C': [7, 8, None]
})
print("Original:\n", df_missing)
# Fill missing with column mean
df_filled = df_missing.fillna(df_missing.mean())
print("\nFilled:\n", df_filled)
# Drop rows with any missing values
df_dropped = df_missing.dropna()
print("\nDropped:\n", df_dropped)
Output:
text
Original:
A B C
0 1.0 NaN 7.0
1 NaN 5.0 8.0
2 3.0 6.0 NaN
Filled:
A B C
0 1.0 5.5 7.0
1 2.0 5.0 8.0
2 3.0 6.0 7.5
Dropped:
Empty DataFrame
Columns: [A, B, C]
Index: []
Example 4: Advanced Operations
python
# Add calculated column
df['GDP per Capita'] = df['GDP (B)'] / df['Population (M)']  # thousands of USD per person
print("With GDP per Capita:\n", df)
# Apply function across rows
df['Size Category'] = df.apply(
    lambda row: 'Large' if row['Population (M)'] > 10 else 'Medium',
    axis=1
)
print("\nWith Size Category:\n", df)
# Transpose data
print("\nTransposed:\n", df.T)
Output:
text
With GDP per Capita:
City Population (M) Country GDP (B) GDP per Capita
A London 9.2 UK 465 50.543478
B New York 8.4 USA 1940 230.952381
C Tokyo 13.9 Japan 1020 73.381295
With Size Category:
City Population (M) Country GDP (B) GDP per Capita Size Category
A London 9.2 UK 465 50.543478 Medium
B New York 8.4 USA 1940 230.952381 Medium
C Tokyo 13.9 Japan 1020 73.381295 Large
Transposed:
A B C
City London New York Tokyo
Population (M) 9.2 8.4 13.9
Country UK USA Japan
GDP (B) 465 1940 1020
GDP per Capita 50.543478 230.952381 73.381295
Size Category Medium Medium Large
3. Index Objects: The Core Labeling System
Pandas Index objects are immutable, array-like label containers that also support set operations.
Key Features:
Immutable
Supports set operations (union, intersection, difference)
Can be hierarchical (MultiIndex)
Example:
python
# Create custom index
index = pd.Index(['X', 'Y', 'Z'], name='coordinates')
# Set operations
index1 = pd.Index(['A', 'B', 'C'])
index2 = pd.Index(['B', 'C', 'D'])
print("Union:", index1.union(index2))
print("Intersection:", index1.intersection(index2))
print("Difference:", index1.difference(index2))
# Hierarchical index (MultiIndex)
multi_index = pd.MultiIndex.from_tuples(
    [('Group1', 'A'), ('Group1', 'B'), ('Group2', 'A')],
    names=['Group', 'Subgroup']
)
print("\nMultiIndex:\n", multi_index)
Output:
text
Union: Index(['A', 'B', 'C', 'D'], dtype='object')
Intersection: Index(['B', 'C'], dtype='object')
Difference: Index(['A'], dtype='object')
MultiIndex:
MultiIndex([('Group1', 'A'),
('Group1', 'B'),
('Group2', 'A')],
names=['Group', 'Subgroup'])
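The immutability claim above is easy to verify: element assignment on an Index raises a TypeError. A minimal sketch:
python
idx = pd.Index(['A', 'B', 'C'])
try:
    idx[0] = 'Z'  # Index objects do not support item assignment
except TypeError as e:
    print("Immutable:", e)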
Key Differences Between Series and DataFrames
| Feature           | Series                   | DataFrame                  |
| ----------------- | ------------------------ | -------------------------- |
| Dimensions        | 1D                       | 2D                         |
| Data              | Homogeneous              | Heterogeneous (per column) |
| Indexing          | Single index             | Row and column indexes     |
| Data Access       | Single [] operator       | Multiple [] or .loc/.iloc  |
| Memory Efficiency | Higher                   | Lower                      |
| Use Cases         | Single-variable analysis | Multi-variable datasets    |
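To make the Data Access row concrete, here is a small sketch contrasting the two (variable names are illustrative):
python
s = pd.Series([10, 20], index=['a', 'b'])
df = pd.DataFrame({'col': [10, 20]}, index=['a', 'b'])
print(s['a'])              # Series: a single [] returns a scalar by label
print(df['col'])           # DataFrame: [] selects an entire column
print(df.loc['a', 'col'])  # row-and-column access goes through .loc (or .iloc)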
Data I/O & Data Inspection
This section covers pd.read_csv(), pd.read_excel(), df.to_csv(), and the key inspection methods, with practical examples.
1. Reading Data
CSV Files: pd.read_csv()
Essential Parameters & Usage:
python
df = pd.read_csv(
    'data.csv',
    sep=',',                   # Delimiter
    header=0,                  # Header row (set None for no header)
    index_col='id',            # Column to use as index
    usecols=['col1', 'col2'],  # Columns to read
    dtype={'price': float},    # Data type specification
    parse_dates=['date'],      # Parse dates
    na_values=['NA', '?'],     # Custom missing value markers
    skiprows=5,                # Skip first 5 rows
    nrows=1000,                # Read only first 1000 rows
    encoding='latin-1'         # Handle special characters
)
Example:
python
# Read sales data
sales_df = pd.read_csv(
    'sales.csv',
    parse_dates=['order_date'],
    na_values=['MISSING'],
    dtype={'order_id': 'string', 'quantity': 'Int64'}  # nullable Int64 allows missing quantities
)
print("Data shape:", sales_df.shape)
Excel Files: pd.read_excel()
Key Parameters:
python
df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sheet1',  # Sheet name or index
    header=1,             # Use the second row (index 1) as the header
    usecols='A:D,F',      # Column range (Excel-style)
    skipfooter=3,         # Skip last 3 rows
    engine='openpyxl'     # Engine for .xlsx files (openpyxl must be installed)
)
Example:
python
# Read multi-sheet Excel
with pd.ExcelFile('financials.xlsx') as xls:
    q1_df = pd.read_excel(xls, 'Q1_Sales')
    expenses_df = pd.read_excel(xls, 'Q1_Expenses')
print("Q1 Sales Columns:", q1_df.columns.tolist())
2. Writing Data
CSV Export: df.to_csv()
Essential Parameters:
python
df.to_csv(
    'output.csv',
    index=False,                # Exclude index
    columns=['col1', 'col3'],   # Selected columns
    header=['First', 'Third'],  # Custom column names
    na_rep='NULL',              # Missing value representation
    float_format='%.2f',        # Format floats
    encoding='utf-8-sig',       # Add BOM for Excel compatibility
    date_format='%Y-%m-%d'      # Date formatting
)
Example:
python
# Process and export data
sales_df['total'] = sales_df['unit_price'] * sales_df['quantity']
sales_df.to_csv(
    'processed_sales.csv',
    index=False,
    columns=['order_id', 'order_date', 'total'],
    na_rep='MISSING',
    float_format='%.0f'  # No decimal places
)
3. Data Inspection Methods
Initial Glance: df.head() & df.tail()
python
print("First 3 rows:")
print(sales_df.head(3)) # First n rows (default 5)
print("\nLast 2 rows:")
print(sales_df.tail(2)) # Last n rows
Sample Output:
text
First 3 rows:
order_id order_date unit_price quantity
0 ORD001 2023-01-15 49.99 2
1 ORD002 2023-01-16 19.95 1
2 ORD003 2023-01-16 199.00 NA
Last 2 rows:
order_id order_date unit_price quantity
98 ORD099 2023-03-30 14.99 3
99 ORD100 2023-03-31 79.99 1
Structural Overview: df.info()
python
sales_df.info(verbose=True, show_counts=True)
Output:
text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 100 non-null string
1 order_date 100 non-null datetime64[ns]
2 unit_price 97 non-null float64
3 quantity 96 non-null Int64
dtypes: Int64(1), datetime64[ns](1), float64(1), string(1)
memory usage: 3.2 KB
Key insights:
Detects 3 missing values in unit_price
Shows 4 missing values in quantity
Confirms correct data types
Reports memory usage
Statistical Summary: df.describe()
python
print(sales_df.describe(include='all'))  # in pandas >= 2.0, datetime columns are summarized numerically by default
Output:
text
order_id order_date unit_price quantity
count 100 100 97.000000 96.000000
unique 100 NaN NaN NaN
top ORD076 NaN NaN NaN
freq 1 NaN NaN NaN
mean NaN 2023-02-14 12:00 76.492474 2.187500
min NaN 2023-01-15 00:00 5.990000 1.000000
25% NaN 2023-01-28 00:00 24.990000 1.000000
50% NaN 2023-02-15 00:00 49.990000 2.000000
75% NaN 2023-03-15 00:00 99.990000 3.000000
max NaN 2023-03-31 00:00 499.000000 10.000000
std NaN NaN 87.312073 1.782456
Key insights:
Categorical columns: Unique counts, top values
Numeric columns: Mean, percentiles, min/max
Datetime columns: First/last dates, distribution
Missing values: Excluded from calculations
4. Practical Workflow Example
python
# 1. Read data from multiple sources
sales_data = pd.read_csv('sales.csv', parse_dates=['date'])
inventory = pd.read_excel('inventory.xlsx', sheet_name='Stock')
# 2. Initial inspection
print("Sales Data Overview:")
sales_data.info()
print("\nMissing Values Check:")
print(sales_data.isna().sum())
# 3. Data cleaning
sales_data.fillna({'quantity': 1}, inplace=True)
sales_data.dropna(subset=['product_id'], inplace=True)
# 4. Advanced inspection
print("\nStatistical Summary:")
print(sales_data.describe(percentiles=[0.1, 0.9]))
# 5. Export processed data
sales_data.to_csv('cleaned_sales.csv', index=False)
5. Pro Tips for Efficient Inspection
1. Custom Summary Function:
python
def quick_summary(df):
    return pd.DataFrame({
        'dtype': df.dtypes,
        'missing': df.isna().sum(),
        'unique': df.nunique(),
        'min': df.min(numeric_only=True),
        'max': df.max(numeric_only=True)
    })
print(quick_summary(sales_data))
2. Memory Optimization:
python
# Downcast numeric columns
sales_data['quantity'] = pd.to_numeric(sales_data['quantity'], downcast='integer')
# Convert to categorical
sales_data['category'] = sales_data['category'].astype('category')
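To confirm the conversions actually pay off, compare per-column footprints after converting; a minimal sketch (assumes the sales_data frame from above):
python
# deep=True counts actual object payloads (e.g. strings), not just pointers
print(sales_data.memory_usage(deep=True))
Categorical conversion helps most when a column has few distinct values relative to its length.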
3. Large File Sampling:
python
# Read ~1% random sample of a large CSV: keep the header (row 0),
# keep each data row with probability 0.01
import random
sample_df = pd.read_csv('big_data.csv', skiprows=lambda x: x > 0 and random.random() > 0.01)
Key Takeaways:
| Task                | Method           | Key Insight                               |
| ------------------- | ---------------- | ----------------------------------------- |
| Read CSV            | pd.read_csv()    | Handle encoding, dates, missing values    |
| Read Excel          | pd.read_excel()  | Specify sheets, skip headers/footers      |
| Export Data         | df.to_csv()      | Control index, columns, missing value rep |
| Initial Inspection  | df.head()/tail() | Verify data ingestion                     |
| Structural Analysis | df.info()        | Check dtypes, memory, missing values      |
| Statistical Summary | df.describe()    | Understand distributions, outliers        |
| Advanced Inspection | Custom functions | Create tailored data quality checks       |