Pandas

The document provides an overview of the Pandas library, detailing its key structures: Series (1D) and DataFrame (2D), along with their attributes and examples of creation, indexing, and operations. It also covers Index objects, data input/output methods, and data inspection techniques, emphasizing practical applications for data manipulation and analysis. Key differences between Series and DataFrames are highlighted, along with examples of reading and writing data from various sources.


Pandas

1. Series: The Fundamental 1D Structure

A Series is a one-dimensional labeled array capable of holding any data type.

Key Attributes:

 values: Underlying data array
 index: Axis labels
 dtype: Data type
 name: Series name
 size: Number of elements
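A quick sketch of these attributes on a small, illustrative Series:

```python
import pandas as pd

# Small illustrative Series for inspecting the attributes above
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='scores')

print(s.values)  # underlying NumPy array: [10 20 30]
print(s.index)   # Index(['x', 'y', 'z'], dtype='object')
print(s.dtype)   # int64
print(s.name)    # scores
print(s.size)    # 3
```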

Example 1: Basic Series Creation

python

import pandas as pd
import numpy as np

# Create a Series from a list with the default integer index
s1 = pd.Series([10, 20, 30, 40])
print("s1:\n", s1)

# Create a Series with a custom index and a name
s2 = pd.Series([1.1, 2.2, 3.3], index=['a', 'b', 'c'], name='temperatures')
print("\ns2:\n", s2)

Output:

text

s1:
0    10
1    20
2    30
3    40
dtype: int64

s2:
a    1.1
b    2.2
c    3.3
Name: temperatures, dtype: float64

Example 2: Series from Dictionary (Automatic Index Alignment)

python

population_dict = {'New York': 8.4, 'London': 9.0, 'Tokyo': 13.9}
pop_series = pd.Series(population_dict)
print("Population Series:\n", pop_series)

# Add a second Series: labels are aligned, overlapping labels are summed,
# and non-overlapping labels use fill_value instead of producing NaN
new_data = {'London': 9.1, 'Paris': 2.2}
pop_series = pop_series.add(pd.Series(new_data), fill_value=0)
print("\nUpdated Population:\n", pop_series)

Output:

text

Population Series:
New York     8.4
London       9.0
Tokyo       13.9
dtype: float64

Updated Population:
London      18.1
New York     8.4
Paris        2.2
Tokyo       13.9
dtype: float64

Example 3: Vectorized Operations & Filtering

python

# Vectorized operations
squares = pd.Series([1, 4, 9, 16, 25])
sqrt = np.sqrt(squares)
print("Square Roots:\n", sqrt)

# Boolean filtering
filtered = sqrt[sqrt > 2]
print("\nValues > 2:\n", filtered)

# Index-based operations
reindexed = filtered.reindex([0, 2, 4], fill_value=0)
print("\nReindexed:\n", reindexed)

Output:

text

Square Roots:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

Values > 2:
2    3.0
3    4.0
4    5.0
dtype: float64

Reindexed:
0    0.0
2    3.0
4    5.0
dtype: float64

2. DataFrame: The Primary 2D Structure

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Key Attributes:

 index: Row labels
 columns: Column labels
 dtypes: Column data types
 shape: (rows, columns) dimensions
 values: Underlying data array
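As a quick sketch, the same attributes on a small, illustrative DataFrame:

```python
import pandas as pd

# Small illustrative DataFrame for inspecting the attributes above
df_demo = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']}, index=['r1', 'r2'])

print(df_demo.index)    # Index(['r1', 'r2'], dtype='object')
print(df_demo.columns)  # Index(['x', 'y'], dtype='object')
print(df_demo.dtypes)   # x: int64, y: object
print(df_demo.shape)    # (2, 2)
print(df_demo.values)   # underlying 2D array
```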

Example 1: DataFrame Creation

python

# From a dictionary of lists
data = {
    'City': ['London', 'New York', 'Tokyo'],
    'Population (M)': [9.2, 8.4, 13.9],
    'Country': ['UK', 'USA', 'Japan']
}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
print("DataFrame:\n", df)

# Add a new column
df['GDP (B)'] = [465, 1940, 1020]
print("\nWith GDP:\n", df)

Output:

text

DataFrame:
       City  Population (M) Country
A    London             9.2      UK
B  New York             8.4     USA
C     Tokyo            13.9   Japan

With GDP:
       City  Population (M) Country  GDP (B)
A    London             9.2      UK      465
B  New York             8.4     USA     1940
C     Tokyo            13.9   Japan     1020

Example 2: Indexing and Selection

python

# Column selection
print("Cities:\n", df['City'])

# Row selection by label
print("\nRow B:\n", df.loc['B'])

# Row selection by position
print("\nFirst row:\n", df.iloc[0])

# Boolean indexing
print("\nLarge cities:\n", df[df['Population (M)'] > 10])

# Multi-axis selection
print("\nSpecific selection:\n", df.loc['C', ['City', 'GDP (B)']])

Output:

text

Cities:
A      London
B    New York
C       Tokyo
Name: City, dtype: object

Row B:
City              New York
Population (M)         8.4
Country                USA
GDP (B)               1940
Name: B, dtype: object

First row:
City              London
Population (M)       9.2
Country               UK
GDP (B)              465
Name: A, dtype: object

Large cities:
    City  Population (M) Country  GDP (B)
C  Tokyo            13.9   Japan     1020

Specific selection:
City       Tokyo
GDP (B)     1020
Name: C, dtype: object

Example 3: Handling Missing Data

python

# Create a DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, None, 3],
    'B': [np.nan, 5, 6],
    'C': [7, 8, None]
})
print("Original:\n", df_missing)

# Fill missing values with each column's mean
df_filled = df_missing.fillna(df_missing.mean())
print("\nFilled:\n", df_filled)

# Drop rows with any missing values
df_dropped = df_missing.dropna()
print("\nDropped:\n", df_dropped)

Output:

text

Original:
     A    B    C
0  1.0  NaN  7.0
1  NaN  5.0  8.0
2  3.0  6.0  NaN

Filled:
     A    B    C
0  1.0  5.5  7.0
1  2.0  5.0  8.0
2  3.0  6.0  7.5

Dropped:
Empty DataFrame
Columns: [A, B, C]
Index: []

Example 4: Advanced Operations

python

# Add a calculated column: billions of dollars divided by millions of
# people gives GDP per capita in thousands of dollars
df['GDP per Capita'] = df['GDP (B)'] / df['Population (M)']
print("With GDP per Capita:\n", df)

# Apply a function across rows
df['Size Category'] = df.apply(
    lambda row: 'Large' if row['Population (M)'] > 10 else 'Medium',
    axis=1
)
print("\nWith Size Category:\n", df)

# Transpose data
print("\nTransposed:\n", df.T)

Output:

text

With GDP per Capita:
       City  Population (M) Country  GDP (B)  GDP per Capita
A    London             9.2      UK      465       50.543478
B  New York             8.4     USA     1940      230.952381
C     Tokyo            13.9   Japan     1020       73.381295

With Size Category:
       City  Population (M) Country  GDP (B)  GDP per Capita Size Category
A    London             9.2      UK      465       50.543478        Medium
B  New York             8.4     USA     1940      230.952381        Medium
C     Tokyo            13.9   Japan     1020       73.381295         Large

Transposed:
                        A           B          C
City               London    New York      Tokyo
Population (M)        9.2         8.4       13.9
Country                UK         USA      Japan
GDP (B)               465        1940       1020
GDP per Capita  50.543478  230.952381  73.381295
Size Category      Medium      Medium      Large

3. Index Objects: The Core Labeling System

Pandas Index objects are immutable arrays implementing advanced set operations.

Key Features:

 Immutable
 Supports set operations (union, intersection, difference)
 Can be hierarchical (MultiIndex)

Example:

python
# Create a custom index
index = pd.Index(['X', 'Y', 'Z'], name='coordinates')

# Set operations
index1 = pd.Index(['A', 'B', 'C'])
index2 = pd.Index(['B', 'C', 'D'])
print("Union:", index1.union(index2))
print("Intersection:", index1.intersection(index2))
print("Difference:", index1.difference(index2))

# Hierarchical index (MultiIndex)
multi_index = pd.MultiIndex.from_tuples(
    [('Group1', 'A'), ('Group1', 'B'), ('Group2', 'A')],
    names=['Group', 'Subgroup']
)
print("\nMultiIndex:\n", multi_index)

Output:

text

Union: Index(['A', 'B', 'C', 'D'], dtype='object')
Intersection: Index(['B', 'C'], dtype='object')
Difference: Index(['A'], dtype='object')

MultiIndex:
MultiIndex([('Group1', 'A'),
            ('Group1', 'B'),
            ('Group2', 'A')],
           names=['Group', 'Subgroup'])
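The example above covers set operations and MultiIndex; immutability, the remaining key feature, can be sketched like this (variable names are illustrative):

```python
import pandas as pd

idx = pd.Index(['X', 'Y', 'Z'], name='coordinates')

# Index objects are immutable: item assignment raises TypeError
try:
    idx[0] = 'W'
except TypeError as err:
    print("Immutable:", err)

# To change labels, build a new Index instead (e.g. via rename)
renamed = idx.rename('coords')
print(renamed.name)  # coords
print(idx.name)      # coordinates -- the original is untouched
```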

Key Differences Between Series and DataFrames

Feature            Series                    DataFrame
Dimensions         1D                        2D
Data               Homogeneous (one dtype)   Heterogeneous (dtype per column)
Indexing           Single index              Row and column indexes
Data Access        Single [] operator        Multiple [] or .loc/.iloc
Memory Efficiency  Higher                    Lower
Use Cases          Single-variable analysis  Multi-variable datasets
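The relationship between the two structures shows up in column selection: single brackets on a DataFrame return a Series, while a list of column names returns a DataFrame. A minimal sketch with illustrative data:

```python
import pandas as pd

df_cities = pd.DataFrame({
    'City': ['London', 'Tokyo'],
    'Pop (M)': [9.2, 13.9]
})

col = df_cities['City']    # single brackets -> 1D Series
sub = df_cities[['City']]  # list of columns -> 2D DataFrame

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame
```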

Data I/O & Data Inspection

This section covers pd.read_csv(), pd.read_excel(), df.to_csv(), and key inspection methods, with practical examples.

1. Reading Data

CSV Files: pd.read_csv()

Essential Parameters & Usage:

python

# Illustrative: parameters are shown together for reference;
# combine only those that apply to your file
df = pd.read_csv(
    'data.csv',
    sep=',',                         # Delimiter
    header=0,                        # Header row (use header=None for no header)
    index_col='id',                  # Column to use as the index
    usecols=['id', 'col1', 'col2'],  # Columns to read (must include the index column)
    dtype={'price': float},          # Data type specification
    parse_dates=['date'],            # Parse date columns
    na_values=['NA', '?'],           # Custom missing-value markers
    skiprows=5,                      # Skip the first 5 rows
    nrows=1000,                      # Read only the first 1000 rows
    encoding='latin-1'               # Handle special characters
)

Example:

python

# Read sales data
sales_df = pd.read_csv(
    'sales.csv',
    parse_dates=['order_date'],
    na_values=['MISSING'],
    dtype={'order_id': 'string', 'quantity': 'Int64'}  # nullable Int64 allows missing values
)
print("Data shape:", sales_df.shape)

Excel Files: pd.read_excel()

Key Parameters:

python

df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sheet1',  # Sheet name or index
    header=1,             # Start reading from row 1
    usecols='A:D,F',      # Column range (Excel-style)
    skipfooter=3,         # Skip the last 3 rows
    engine='openpyxl'     # Required for .xlsx files
)

Example:

python

# Read a multi-sheet Excel workbook
with pd.ExcelFile('financials.xlsx') as xls:
    q1_df = pd.read_excel(xls, 'Q1_Sales')
    expenses_df = pd.read_excel(xls, 'Q1_Expenses')

print("Q1 Sales Columns:", q1_df.columns.tolist())

2. Writing Data

CSV Export: df.to_csv()

Essential Parameters:

python

df.to_csv(
    'output.csv',
    index=False,                # Exclude the index
    columns=['col1', 'col3'],   # Write only selected columns
    header=['First', 'Third'],  # Custom column names
    na_rep='NULL',              # Missing-value representation
    float_format='%.2f',        # Format floats
    encoding='utf-8-sig',       # Add a BOM for Excel compatibility
    date_format='%Y-%m-%d'      # Date formatting
)

Example:

python
# Process and export data
sales_df['total'] = sales_df['unit_price'] * sales_df['quantity']
sales_df.to_csv(
    'processed_sales.csv',
    index=False,
    columns=['order_id', 'order_date', 'total'],
    na_rep='MISSING',
    float_format='%.0f'  # No decimal places
)

3. Data Inspection Methods

Initial Glance: df.head() & df.tail()

python

print("First 3 rows:")
print(sales_df.head(3))  # First n rows (default 5)

print("\nLast 2 rows:")
print(sales_df.tail(2))  # Last n rows

Sample Output:

text

First 3 rows:
  order_id order_date  unit_price  quantity
0   ORD001 2023-01-15       49.99         2
1   ORD002 2023-01-16       19.95         1
2   ORD003 2023-01-16      199.00      <NA>

Last 2 rows:
   order_id order_date  unit_price  quantity
98   ORD099 2023-03-30       14.99         3
99   ORD100 2023-03-31       79.99         1

Structural Overview: df.info()

python

sales_df.info(verbose=True, show_counts=True)

Output:

text

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   order_id    100 non-null    string
 1   order_date  100 non-null    datetime64[ns]
 2   unit_price  97 non-null     float64
 3   quantity    96 non-null     Int64
dtypes: Int64(1), datetime64[ns](1), float64(1), string(1)
memory usage: 3.2 KB

Key insights:

 Detects 3 missing values in unit_price
 Shows 4 missing values in quantity
 Confirms correct data types
 Reports memory usage

Statistical Summary: df.describe()


python

print(sales_df.describe(include='all'))
# On pandas 1.x, pass datetime_is_numeric=True as well; the parameter was
# removed in pandas 2.0, where datetimes are summarized numerically by default

Output:

text

       order_id        order_date  unit_price   quantity
count       100               100   97.000000  96.000000
unique      100               NaN         NaN        NaN
top      ORD076               NaN         NaN        NaN
freq          1               NaN         NaN        NaN
mean        NaN  2023-02-14 12:00   76.492474   2.187500
min         NaN  2023-01-15 00:00    5.990000   1.000000
25%         NaN  2023-01-28 00:00   24.990000   1.000000
50%         NaN  2023-02-15 00:00   49.990000   2.000000
75%         NaN  2023-03-15 00:00   99.990000   3.000000
max         NaN  2023-03-31 00:00  499.000000  10.000000
std         NaN               NaN   87.312073   1.782456

Key insights:

 Categorical columns: Unique counts, top values
 Numeric columns: Mean, percentiles, min/max
 Datetime columns: First/last dates, distribution
 Missing values: Excluded from calculations
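When a combined summary is too noisy, describe() can also be restricted to one kind of column at a time. A minimal sketch (data and column names are illustrative):

```python
import numpy as np
import pandas as pd

df_x = pd.DataFrame({
    'price': [5.99, 24.99, 49.99, np.nan],
    'label': ['a', 'b', 'a', 'b']
})

num_summary = df_x.describe()                    # numeric columns only (default)
cat_summary = df_x.describe(include=['object'])  # object/categorical columns

print(num_summary.loc['count', 'price'])   # 3.0 -- NaN is excluded
print(cat_summary.loc['unique', 'label'])  # 2
```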

4. Practical Workflow Example

python

# 1. Read data from multiple sources
sales_data = pd.read_csv('sales.csv', parse_dates=['date'])
inventory = pd.read_excel('inventory.xlsx', sheet_name='Stock')

# 2. Initial inspection
print("Sales Data Overview:")
sales_data.info()
print("\nMissing Values Check:")
print(sales_data.isna().sum())

# 3. Data cleaning
sales_data.fillna({'quantity': 1}, inplace=True)
sales_data.dropna(subset=['product_id'], inplace=True)

# 4. Advanced inspection
print("\nStatistical Summary:")
print(sales_data.describe(percentiles=[0.1, 0.9]))

# 5. Export processed data
sales_data.to_csv('cleaned_sales.csv', index=False)

5. Pro Tips for Efficient Inspection

1. Custom Summary Function:

python

def quick_summary(df):
    """Return a per-column data-quality overview."""
    return pd.DataFrame({
        'dtype': df.dtypes,
        'missing': df.isna().sum(),
        'unique': df.nunique(),
        'min': df.min(numeric_only=True),
        'max': df.max(numeric_only=True)
    })

print(quick_summary(sales_data))

2. Memory Optimization:

python

# Downcast numeric columns
sales_data['quantity'] = pd.to_numeric(sales_data['quantity'], downcast='integer')

# Convert repetitive strings to categorical
sales_data['category'] = sales_data['category'].astype('category')
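Whether a conversion actually helped can be checked with memory_usage(deep=True); a sketch on illustrative data:

```python
import pandas as pd

# A column of heavily repeated labels (illustrative)
labels = pd.Series(['electronics', 'clothing'] * 500)

before = labels.memory_usage(deep=True)
after = labels.astype('category').memory_usage(deep=True)

# With only two distinct values, the categorical version stores small
# integer codes plus two strings, so it is much smaller
print(before, after)
```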

3. Large File Sampling:

python

import random

# Read an ~1% random sample of a large CSV (row 0, the header, is always kept)
sample_df = pd.read_csv('big_data.csv',
                        skiprows=lambda x: x > 0 and random.random() > 0.01)
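An alternative for large files is chunked reading with the chunksize parameter, which yields DataFrames piece by piece so the whole file never has to fit in memory. A sketch using an in-memory CSV in place of a real file:

```python
import io
import pandas as pd

# In-memory CSV standing in for a large file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

rows, total = 0, 0
for chunk in pd.read_csv(csv_data, chunksize=4):  # 4-row DataFrames at a time
    rows += len(chunk)
    total += chunk['value'].sum()

print(rows, total)  # 10 45
```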

Key Takeaways:

Task                 Method             Key Insight
Read CSV             pd.read_csv()      Handle encoding, dates, missing values
Read Excel           pd.read_excel()    Specify sheets, skip headers/footers
Export Data          df.to_csv()        Control index, columns, missing-value rep
Initial Inspection   df.head()/tail()   Verify data ingestion
Structural Analysis  df.info()          Check dtypes, memory, missing values
Statistical Summary  df.describe()      Understand distributions, outliers
Advanced Inspection  Custom functions   Create tailored data-quality checks

You might also like