
DataFrames in Jupyter Notebook

Part 1: Introduction to DataFrames

1. What are DataFrames:


A DataFrame is a two-dimensional, mutable data structure in Pandas, similar to a table
in relational databases or a spreadsheet. It consists of labeled rows and columns, allowing for
efficient data manipulation and analysis.

1.1 Understanding DataFrames and Series in Pandas


● Series: A one-dimensional labeled array capable of holding data of any type; it behaves like a single column of a DataFrame.
● DataFrame: A collection of Series aligned along the same index, resembling a table (a short sketch follows below).
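As a quick illustration (the values below are made up for this sketch), a Series behaves like one labeled column, and a DataFrame is several Series sharing the same index:

Code:
import pandas as pd

ages = pd.Series([25, 30, 27], index=['Alice', 'Bob', 'Cathy'], name='age')  # A Series: one labeled column
print(ages)

cities = pd.Series(['Paris', 'London', 'Rome'], index=['Alice', 'Bob', 'Cathy'], name='city')
df = pd.DataFrame({'age': ages, 'city': cities})  # Two Series aligned on the same index form a DataFrame
print(df)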

1.2 Creating DataFrames from Various Sources: - Lists - Dictionaries - NumPy Arrays

LISTS:
Code:
import pandas as pd  # Importing the pandas library
data = [[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']]  # The first value in each inner list is an ID, and the second is a name
df = pd.DataFrame(data, columns=['ID', 'Name'])  # Creating a DataFrame from the list of lists, with column names explicitly specified
print(df)  # Printing the DataFrame

Output:

The code creates a DataFrame from a list of lists, explicitly defining column
names (ID and Name) for each row of data.

DICTIONARIES:
Code:
import pandas as pd  # Import the pandas library
data = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}  # Define data as a dictionary with column names as keys
df = pd.DataFrame(data)  # Create a DataFrame from the dictionary
print(df)  # Print the DataFrame
Output:

The code creates a DataFrame using a dictionary, where keys represent column
names, and values are lists of data, making it more structured and readable.

NUMPY ARRAYS:
Code:
import numpy as np  # Import NumPy
import pandas as pd  # Import pandas
data = np.array([[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']])  # Create a NumPy array with rows containing ID and Name data
df = pd.DataFrame(data, columns=['ID', 'Name'])  # Convert the NumPy array into a pandas DataFrame and specify column names
print(df)  # Print the DataFrame

Output:

The code creates a NumPy array with ID and Name data and converts it into a
pandas DataFrame. It then prints the DataFrame, displaying the data in a tabular format
with labeled columns.

1.3 Key Features and Benefits of Using DataFrames

● Ease of Data Manipulation: Perform operations like filtering, sorting, and aggregation with minimal code.
● Handling Missing Data: Use methods like .fillna() and .dropna() to manage missing values efficiently.
● Integration: Works seamlessly with NumPy, Matplotlib, and machine learning libraries.
● Data Alignment: Automatic alignment for operations involving multiple DataFrames or Series (see the sketch below).
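A minimal sketch of the last two features, using small made-up Series: arithmetic aligns on index labels automatically, and .fillna() handles the gap that alignment introduces.

Code:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

total = s1 + s2          # Aligned on labels; 'a' has no match, so it becomes NaN
print(total)             # a: NaN, b: 12.0, c: 23.0
print(total.fillna(0))   # Missing value replaced with 0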
2. Loading Data into DataFrames

2.1 Reading CSV, Excel, and JSON Files
CSV:​
​ df = pd.read_csv('data.csv')
Reads data from a CSV file (data.csv) and loads it into a pandas
DataFrame.

Excel:​
​ df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Reads data from an Excel file (data.xlsx) from the specified sheet (Sheet1)
into a pandas DataFrame.

JSON:​
​ df = pd.read_json('data.json')
Loads data from a JSON file (data.json) into a pandas DataFrame

2.2 Connecting to Databases (SQL) and Importing Data

Here connecting to an SQLite database, executing a query to fetch data from a table, and
loading the result into a pandas DataFrame are done.
Code:
import sqlite3  # Import sqlite3 for SQLite database interaction.
import pandas as pd  # Import pandas for data manipulation.
conn = sqlite3.connect('example.db')  # Connect to the SQLite database.
query = "SELECT * FROM data"  # Define the SQL query to fetch all rows from the table.
df = pd.read_sql_query(query, conn)  # Execute the query and store the result in a DataFrame.
print(df)  # Print the DataFrame.

Output:

The code establishes a connection to an SQLite database, runs a query to fetch every row from the data table, and stores the result in a pandas DataFrame for further processing.
2.3 Reading from APIs and Web Sources
From APIs:
import pandas as pd  # Import pandas for building the DataFrame.
import requests  # Import the requests library for HTTP requests.
url = 'https://api.example.com/data'
response = requests.get(url)  # Send a GET request to the API
df = pd.DataFrame(response.json())  # Convert the JSON response to a pandas DataFrame

The code sends a GET request to the API and retrieves the response in JSON
format. It then converts the JSON data into a pandas DataFrame for easy manipulation.

From Web Sources:


​ ​ ​ import pandas as pd # Import pandas for data manipulation.
url = 'https://example.com/data.csv' # Define the URL of the CSV file
df = pd.read_csv(url) # Read CSV file from the URL into a DataFrame

The code reads a CSV file directly from the provided URL and loads its content
into a pandas DataFrame for data analysis and processing.

Part 2: Exploring DataFrames

3. Inspecting Data

3.1 Viewing the Head, Tail, and Random Samples

head(n): Displays the first n rows of the DataFrame. Defaults to 5 if n is not specified.

tail(n): Displays the last n rows of the DataFrame. Defaults to 5 if n is not specified.

sample(n): Randomly selects n rows from the DataFrame for inspection.

Code:

import pandas as pd

df = pd.read_csv('student-dataset.csv') # Load the dataset

print("Head of the Dataset:")

print(df.head()) # View the first 5 rows

print("\nTail of the Dataset:")

print(df.tail())# View the last 5 rows

print("\nRandom Samples from the Dataset:")

print(df.sample(3)) # View 3 random rows


df.head(n):

Use this to check the initial rows of your dataset. It helps ensure the dataset was
loaded correctly and provides insight into the structure of the data.

df.head(3)

Output:

df.tail(n):

Use this to view the last rows of your dataset. It’s useful for verifying data
integrity, particularly in large datasets.

df.tail(3)

Output:

df.sample(n):

This function provides a random subset of rows. It’s useful for checking the
diversity of your data without viewing the entire dataset.

df.sample(3)
Output:

3.2 Checking Column Names and Data Types

columns:
Lists all column names in the DataFrame.
dtypes:
Displays the data types of each column.
info():
Provides a detailed summary of the DataFrame, including column names, data
types, and non-null values.

Code:
import pandas as pd
df = pd.read_csv('sdata.csv')  # Load the dataset (assuming 'sdata.csv' is in the current working directory)
print("Column Names:")
print(df.columns) # View column names
print("\nData Types of Columns:")
print(df.dtypes) # View data types
print("\nDetailed DataFrame Information:")
print(df.info()) # View detailed information

Explanation

1.​ Viewing Column Names (.columns)


○​ Use this to quickly list all column names.

print(df.columns)

Output:
3.3 Checking Data Types (.dtypes)

●​ This method displays the data types of each column.

Example:

print(df.dtypes)

Output:

●​ Numeric columns like latitude, longitude, and age have correct data types (e.g., float64
or int64).
●​ Categorical columns like name, nationality, and gender are of type object.


3.4 Descriptive Statistics with describe(), info(), and value_counts()

describe()

●​ Summarizes numerical data (e.g., count, mean, standard deviation).


●​ Can also summarize categorical data with describe(include=['object']).

info()

●​ Provides a detailed summary of the DataFrame (e.g., non-null values, data types).

value_counts()

●​ Counts the occurrences of unique values in a column.


Code:
import pandas as pd
df = pd.read_csv("sdata.csv")# Load the dataset
print("Descriptive Statistics for Numerical Columns:")
print(df.describe()) # Descriptive statistics for numerical columns
print("\nDescriptive Statistics for Categorical Columns:")
print(df.describe(include=["object"])) # Descriptive statistics for categorical columns
print("\nDetailed DataFrame Info:")
print(df.info()) # Detailed DataFrame info
print("\nValue Counts for 'gender':")
print(df["gender"].value_counts())  # Frequency of values in a categorical column (e.g., 'gender')

Explanation:

1. Descriptive Statistics with describe()

●​ Numerical Data: By default, describe() provides statistical measures for numeric


columns:
■​ count: Number of non-null entries.
■​ mean: Average value.
■​ std: Standard deviation.
■​ min/max: Minimum and maximum values.
■​ 25%, 50%, 75%: Quartiles.

print(df.describe())
Output:
●​ Categorical Data: Use include=["object"] to summarize categorical columns.

print(df.describe(include=["object"]))

Output:

The name column has 200 unique entries, while gender has 2 (e.g., Male/Female).

2. Detailed DataFrame Information (.info())

●​ Use .info() to get an overview of:


○​ Column names.
○​ Data types.
○​ Non-null values (indicates missing data).
○​ Memory usage.

print(df.info())

Output:
3. Value Counts with value_counts()

Use this for analyzing the frequency of unique values in a specific column.

print(df["gender"].value_counts())

​ Output:

4. Basic DataFrame Operations



4.1 Selecting Rows, Columns, and Specific Values

​ 1. Selecting Columns:

●​ Single Column: You can access a single column by using the column name inside
square brackets []. For example, df["name"] gives you all the values in the "name"
column.​

●​ Multiple Columns: To access multiple columns, provide a list of column names


like df[["name", "age"]], which gives you the data from both "name" and "age"
columns.​

2. Selecting Rows:

●​ By Index: You can access rows by their index position using .iloc[]. For example,
df.iloc[0] will return the first row of the DataFrame (index 0).​

●​ By Label: If you want to select rows using labels (row names), you use .loc[]. For
example, df.loc[1] gives the second row (because labels are usually 0-indexed).​

3. Selecting Specific Value:

●​ To retrieve a specific value at the intersection of a row and column, you can use
both .iloc[] (for position-based access) and .loc[] (for label-based access). For
example:
○​ df.iloc[0]["name"]: This fetches the name of the first row in the "name"
column.
○​ df.loc[1, "name"]: This fetches the value from the "name" column in the
row with label 1.
4. Slicing Rows:

●​ Slicing Rows: To get a subset of rows, use .iloc[] with slice notation. For
instance, df.iloc[:5] will return the first 5 rows. The slice notation can be adjusted
to get a specific range of rows.

5. Conditional Selection:

●​ Filtering Rows Based on Conditions: You can filter rows by applying conditions
to columns. For example, df[df["age"] > 20] returns all rows where the "age"
column is greater than 20. The condition can be customized as needed, such as for
text values or numerical ranges.

Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
print(df["name"].head()) # Displays the first 5 names
print(df.iloc[0]) # Displays the first row
print(df.iloc[0]["name"]) # Displays the 'name' from the first row
print(df.iloc[:5]) # Displays the first 5 rows
print(df[df["age"] > 20]) # Displays rows where age is greater than 20

Explanation:

1.​ df["name"].head():
○​ This will show the first 5 entries of the "name" column

Output:

2.​ df.iloc[0]:
○​ This will display the entire first row (as a pandas Series).

Output:
3.​ df.iloc[0]["name"]:
○​ This will return the value in the "name" column of the first row.

Output:

4.​ df.iloc[:5]:
○​ This returns the first 5 rows of the DataFrame.

Output:

5.​ df[df["age"] > 20]:


○​ This filters the DataFrame to show only rows where the "age" column is greater
than 20.

Output:

4.2 Using iloc, loc, and Conditional Filters

iloc[] is used to select data by position (i.e., integer-based indexing). It allows you to
access rows and columns based on their position in the DataFrame.

Syntax:
df.iloc[row_index, column_index]

●​ row_index: The integer index for the row you want to select (starting from 0).
●​ column_index: The integer index for the column you want to select (starting from 0).
Code:
print(df.iloc[0])       # Select the first row
print(df.iloc[0, 0])    # Select the first row and first column
print(df.iloc[:5, :])   # Select the first 5 rows and all columns
print(df.iloc[:3, :2])  # Select specific rows and columns (first 3 rows, first 2 columns)

Explanation:

●​ df.iloc[0]: Returns the entire first row.

Output:

●​ df.iloc[0, 0]: Returns the value at the first row and first column.

Output:

●​ df.iloc[:5, :]: Returns the first 5 rows of all columns.

Output:
●​ df.iloc[:3, :2]: Returns the first 3 rows and the first 2 columns.

Output:
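The section title also mentions loc[] and conditional filters; a brief sketch of both, assuming the same student-dataset.csv with 'name' and 'age' columns:

Code:
import pandas as pd

df = pd.read_csv("student-dataset.csv")

print(df.loc[0, 'name'])               # loc[] selects by label: the 'name' value of the row labeled 0
print(df.loc[0:2, ['name', 'age']])    # Rows labeled 0 through 2 (inclusive) and two columns
print(df.loc[df['age'] > 20, 'name'])  # Conditional filter combined with loc[]: names of students older than 20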

4.3 Renaming and Reordering Columns

In this section, we will explore how to rename and reorder columns in a Pandas
DataFrame using two main techniques: rename() for renaming columns and reindex() for
reordering columns.

1. Renaming Columns Using rename()

You can rename one or more columns in a DataFrame using the rename() method.
This method takes a dictionary where the keys are the current column names, and the
values are the new names you want to assign to those columns.

Syntax:
df.rename(columns={"old_name1": "new_name1", "old_name2": "new_name2"},
inplace=True)

columns:
A dictionary with old column names as keys and new column names as values.
inplace=True:
If set to True, the DataFrame is modified in place; if False, a new DataFrame with
the updated column names is returned

Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
df.rename(columns={"name": "full_name", "age": "years"}, inplace=True)
print(df.head()) # Rename columns "name" to "full_name" and "age" to "years"

Output:
After running the code, the DataFrame's "name" column will be renamed to "full_name,"
and the "age" column will be renamed to "years."

2. Reordering Columns Using reindex()

You can reorder the columns of a DataFrame by passing a list of the column names in the
desired order to the reindex() method.

Syntax:

df = df.reindex(columns=["new_col1", "new_col2", "new_col3", ...])

●​ columns: A list of column names in the desired order.

Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
df.rename(columns={"name": "full_name", "age": "years"}, inplace=True)  # Reuse the renamed columns from the previous example
df = df.reindex(columns=["gender", "full_name", "years", "nationality", "city"])  # Reorder columns: move "gender" to the first position
print(df.head())

Output:

The gender column will be moved to the first position, and the remaining columns
will follow the order as specified in the list.

Part 3: Manipulating DataFrames

5.​ Handling Missing Data​


5.1 Identifying Missing Values with isna() and notna()
isna():
​ Code:
import pandas as pd
df = pd.read_csv("student.csv")# Load the dataset
print(df.isna()) # Return DataFrame True/False indicating missing values
print(df.isna().sum()) # Count of missing values in each column
​ Output:

isna() identifies missing values (NaN), and sum() counts them in each
column.

notna():
Code:
import pandas as pd
df = pd.read_csv("student.csv")  # Load the dataset
print(df.notna())  # Returns True for non-missing values and False for missing ones

Output:

The notna() function identifies valid (non-missing) entries.


5.2 Filling Missing Data with Mean, Median, or Custom Values

● For Numeric Columns: Before filling with the median or mean, check whether the column is empty, and fall back to a default value if it is (a short sketch of this check follows below).
● For Categorical Columns: You may want to fill missing entries with a placeholder string, such as 'Unknown'.
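A minimal sketch of that check for a numeric column (reusing the 'english.grade' column from the examples below; the fallback value 0 is just an assumed default):

Code:
import pandas as pd

df = pd.read_csv("student.csv")  # Same dataset as in 5.1

# If the column has no usable values at all, fall back to an assumed default of 0;
# otherwise fill the gaps with the column mean.
if df['english.grade'].dropna().empty:
    df['english.grade'] = df['english.grade'].fillna(0)
else:
    df['english.grade'] = df['english.grade'].fillna(df['english.grade'].mean())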

​ Filling Missing Data with Mean:

​ ​ Code:
df['english.grade'].fillna(df['english.grade'].mean(), inplace=True)  # Replace NaN with the column's mean value
print("Filled missing values in 'english.grade' with the mean:")  # Confirmation message
print(df['english.grade'])  # Display the updated 'english.grade' column

Output:

This replaces all NaN values in the english.grade column with the mean
of the column.

Filling Missing Data with Median:

​ ​ Code:
df['math.grade'].fillna(df['math.grade'].median(), inplace=True) # Replace
NaN with the column's median value
print("Filled missing values in 'math.grade' with the median:")
print(df['math.grade']) # Display the updated 'math.grade' column
​ ​ Output:

​ ​ ​
​ This replaces all NaN values in the math.grade column with the median of
the column.

Filling Missing Data with a Custom Value:

​ Code:
df['ethnic.group'].fillna("Unknown", inplace=True)  # Replace missing values in the 'ethnic.group' column with the custom value 'Unknown'
print("Filled missing values in 'ethnic.group' with a custom value:")
print(df['ethnic.group'])

Output:

This replaces all NaN values in the ethnic.group column with the
custom string "Unknown"
5.3 Dropping Rows or Columns with Null Values

​ Code:

df_dropped_rows = df.dropna()  # Create a new DataFrame without rows containing NaN values
print("DataFrame after dropping rows with null values:")
print(df_dropped_rows)

Output:

This removes rows where any column has a missing value.
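The heading also covers dropping columns; a short sketch using dropna(axis=1), which removes every column that contains at least one missing value:

Code:
import pandas as pd

df = pd.read_csv("student.csv")  # Load the dataset
df_dropped_cols = df.dropna(axis=1)  # Drop columns containing any NaN values
print("DataFrame after dropping columns with null values:")
print(df_dropped_cols.head())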

6. Sorting and Filtering

6.1 Sorting DataFrames by Columns or Indices

​ Code:

df_sorted = df.sort_values(by='english.grade', ascending=False)  # Sort the DataFrame by 'english.grade' in descending order
df_result = df_sorted[['id', 'english.grade']]  # Select the 'id' and 'english.grade' columns
print("ID and 'english.grade' in descending order:")
print(df_result)  # Print the result
Output:

​ Sorts the rows by the english.grade column in descending order.

6.2 Filtering Rows Based on Conditions



Code:
filtered_df = df[df['math.grade'] > 3.5]  # Filter rows where 'math.grade' is greater than 3.5
df_result = filtered_df[['id', 'math.grade']]  # Select the 'id' and 'math.grade' columns
print("ID and 'math.grade' where 'math.grade' > 3.5:")
print(df_result)  # Print the result

Output:

​ Filters rows where math.grade is greater than 3.5.

6.3 Applying Advanced Filters with Multiple Conditions

​ Code:
advanced_filtered_df = df[(df['math.grade'] > 3.5) & (df['english.grade'] > 3.5)]  # Filter rows where 'math.grade' > 3.5 and 'english.grade' > 3.5
filtered_result = advanced_filtered_df[['id', 'math.grade', 'english.grade']]  # Select only the 'id', 'math.grade', and 'english.grade' columns
print("Filtered rows with 'math.grade' > 3.5 and 'english.grade' > 3.5 (only ID and grades):")
print(filtered_result)  # Print the filtered result

Output:

​ Filters rows where both math.grade and english.grade are greater than 3.5.

7. Adding and Removing Data



7.1 Adding New Columns Based on Calculations

​ Code:
df['average.grade'] = df[['english.grade', 'math.grade', 'sciences.grade']].mean(axis=1)  # Add a column for the average of the three grades
print("Added a new column 'average.grade':")
print(df[['english.grade', 'math.grade', 'sciences.grade', 'average.grade']])  # Display relevant columns

Output:

This code calculates the average of grades for each student across three
subjects and adds it as a new column average.grade to the DataFrame.
7.2 Dropping Unnecessary Columns or Rows
# Drop the 'latitude' and 'longitude' columns as they are unnecessary
df_dropped_columns = df.drop(columns=['latitude', 'longitude'])  # Remove specific columns from the DataFrame
print("Dropped 'latitude' and 'longitude' columns:")
print(df_dropped_columns.head())  # Display the first few rows without the dropped columns

Output:

​ This code drops irrelevant columns (latitude and longitude).


7.3 Concatenating, Merging, and Joining DataFrames
# Create a new DataFrame for demonstration
extra_data = pd.DataFrame({
'id': [1, 2],
'name': ['Alex Doe', 'Taylor Smith'],
'nationality': ['Canada', 'UK'],
'city': ['Toronto', 'London'],
'english.grade': [3.7, 4.0],
'math.grade': [3.5, 3.8],
'sciences.grade': [3.6, 4.0]
})

# Concatenate the two DataFrames vertically
concatenated_df = pd.concat([df, extra_data], ignore_index=True)  # Combine rows from both DataFrames
print("Concatenated DataFrame:")
print(concatenated_df)  # Display the combined DataFrame

# Merge with another DataFrame based on a common column (id)
merge_data = pd.DataFrame({
    'id': [0, 1, 2],
    'extra.info': ['Info A', 'Info B', 'Info C']
})
merged_df = df.merge(merge_data, on='id', how='left')  # Perform a left join
print("Merged DataFrame with extra information:")
print(merged_df)  # Display the merged DataFrame

Output:

​ The code demonstrates three operations: concatenating two DataFrames


vertically, merging DataFrames using a common column, and performing a left join to
add additional information.
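The heading also lists join(); a brief sketch (with small made-up frames) of joining on the index, which is what join() aligns on by default:

Code:
import pandas as pd

left = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[0, 1])
right = pd.DataFrame({'extra.info': ['Info A', 'Info B']}, index=[0, 1])

joined_df = left.join(right, how='left')  # join() aligns on the index by default
print(joined_df)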

Part 4: Data Transformation


8. Advanced Column Operations

8.1 Transforming Data with apply() and Lambda Functions

Pandas apply() function is a versatile tool that allows you to apply a function along an axis (rows
or columns) of a DataFrame. You can use apply() with a lambda function to perform
transformations or calculations on each element.

Syntax:

df['column_name'] = df['column_name'].apply(lambda x: <transformation>)

●​ lambda: A way to define anonymous functions in Python.


●​ apply(): A method used to apply a function to each element along a specified axis
(columns or rows).

Example:

Let's say we want to create a new column that categorizes ages into groups like "Young",
"Middle-aged", and "Old".

Code:

import pandas as pd

df = pd.read_csv("student-dataset.csv")

# Applying a lambda function to categorize ages

df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 25 else 'Middle-aged' if x < 50 else 'Old')

print(df[['name', 'age', 'age_group']])

Output:
8.2 One-Hot Encoding for Categorical Variables
One-Hot Encoding is a technique used to convert categorical variables into a form that
can be provided to ML algorithms to do a better job in prediction. It involves creating a new
binary column for each category/label in the categorical variable.

Syntax:
pd.get_dummies(df['column_name'])

●​ get_dummies(): This function converts categorical variable(s) into dummy/indicator


variables (i.e., binary columns for each category).

Example:

Let's say the gender column contains "Male" and "Female," and we want to
convert this into binary columns.

Code:

import pandas as pd

data = {
    'full_name': ['John Doe', 'Jane Doe', 'Alice Smith', 'Bob Brown'],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [23, 30, 25, 40]
}  # Sample DataFrame

df = pd.DataFrame(data)

gender_dummies = pd.get_dummies(df['gender'], prefix='gender')  # One-hot encode the "gender" column

df = pd.concat([df, gender_dummies], axis=1)  # Concatenate the original DataFrame with the new one-hot encoded columns

print(df[['full_name', 'gender', 'gender_Female', 'gender_Male']])  # Display the final DataFrame
Output:

8.3 Binning and Discretizing Continuous Variables

Binning (also known as discretization) is a process of converting continuous data into


discrete categories or bins. This can be helpful when you want to group numerical values into
categories for analysis.

Syntax:

df['binned_column'] = pd.cut(df['numeric_column'], bins=<number_of_bins>,


labels=<list_of_bin_labels>)

●​ cut(): This function bins continuous variables into discrete intervals.

Example:

Let's bin the age column into 3 age groups: "0-25", "26-50", and "51+".

Code:

import pandas as pd

data = {

'full_name': ['John Doe', 'Jane Doe', 'Alice Smith', 'Bob Brown'],

'gender': ['Male', 'Female', 'Female', 'Male'],

'age': [23, 30, 25, 40]

}# Sample DataFrame

df = pd.DataFrame(data)

bins = [0, 25, 50, 100]# Define bins for ages

labels = ['0-25', '26-50', '51+']


df['age_group_binned'] = pd.cut(df['age'], bins=bins, labels=labels)  # Apply the binning

print(df[['full_name', 'age', 'age_group_binned']])

Output:

Explanation:

●​ The age values are categorized into one of three groups:


○​ "0-25" for ages between 0 and 25,
○​ "26-50" for ages between 26 and 50,
○​ "51+" for ages above 50.

9. Working with Dates and Times​

9.1 Converting Strings to Datetime Objects

When dealing with date and time data, often dates are represented as strings. To perform
any operations (like sorting, filtering, or calculations), we need to convert these strings into
datetime objects.

Code:

import pandas as pd

data = {'event': ['Event A', 'Event B', 'Event C'],# Sample DataFrame with date as string

'date': ['2025-01-23', '2025-02-15', '2025-03-10']}

df = pd.DataFrame(data)

df['date'] = pd.to_datetime(df['date'])# Convert string to datetime object

print(df)# Display the DataFrame


Output:

The pd.to_datetime() function converts the date column from a string format to a
datetime object, making it easier to manipulate and extract components like day, month, year,
etc.

9.2 Extracting Day, Month, Year, and Weekday from Dates

Once dates are in datetime format, it's easy to extract individual components, such as the day,
month, year, or even the weekday.

Code:

import pandas as pd

data = {'event': ['Event A', 'Event B', 'Event C'],# Sample DataFrame with date as datetime object

'date': ['2025-01-23', '2025-02-15', '2025-03-10']}

df = pd.DataFrame(data)

df['date'] = pd.to_datetime(df['date'])

df['day'] = df['date'].dt.day # Extracting day, month, year, and weekday

df['month'] = df['date'].dt.month

df['year'] = df['date'].dt.year

df['weekday'] = df['date'].dt.weekday

print(df)# Display the DataFrame

Output:
Explanation:

●​ The dt accessor allows us to extract components of the datetime object.


○​ df['date'].dt.day: Extracts the day from the date.
○​ df['date'].dt.month: Extracts the month.
○​ df['date'].dt.year: Extracts the year.
○​ df['date'].dt.weekday: Extracts the weekday number (0 = Monday, 6 = Sunday).

9.3 Calculating Date Differences and Working with Time Zones

You can calculate the difference between two dates or work with time zones to handle
timestamps accurately.

Code:

import pandas as pd

data = {'event': ['Event A', 'Event B', 'Event C'],
        'start_date': ['2025-01-23 09:00:00', '2025-02-15 14:00:00', '2025-03-10 19:00:00'],
        'end_date': ['2025-01-23 12:00:00', '2025-02-15 16:00:00', '2025-03-10 21:00:00']}  # Sample DataFrame with datetime strings

df = pd.DataFrame(data)

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

df['duration'] = df['end_date'] - df['start_date']  # Calculate the difference between start_date and end_date

df['start_date'] = df['start_date'].dt.tz_localize('UTC')  # Handling time zones
df['end_date'] = df['end_date'].dt.tz_localize('UTC')

print(df)  # Display the DataFrame

Output:
Explanation:

Date Differences:

○​ The difference between start_date and end_date is calculated using df['end_date']


- df['start_date']. The result is a timedelta object that represents the difference in
time.

Time Zones:

○​ The dt.tz_localize('UTC') method converts the datetime to a specific timezone.


Here, we have localized the datetime columns to 'UTC'.​

10. Pivoting and Reshaping Data in pandas

Reshaping data allows you to rearrange data into the desired format, making it easier for
analysis and visualization.

10.1 Creating Pivot Tables for Summarization

Pivot tables allow you to summarize and aggregate data using rows and columns as
indices.

Code:
import pandas as pd
data = {
'student': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice', 'Bob'],
'subject': ['Math', 'Math', 'English', 'English', 'Science', 'Science'],
'score': [85, 78, 92, 88, 91, 84]
}# Sample DataFrame
df = pd.DataFrame(data)
pivot_table = df.pivot_table(values='score', index='student',
columns='subject', aggfunc='mean')# Creating a pivot table
print(pivot_table)# Display the pivot table

Output:
Explanation:

●​ values: Column to aggregate (score in this case).


●​ index: Rows of the pivot table (student).
●​ columns: Columns of the pivot table (subject).
●​ aggfunc: Aggregation function (default is mean).

10.2 Reshaping Data with melt() and stack()

These methods help transform data from wide to long format for easier analysis.

Code:
melted_df = pd.melt(pivot_table.reset_index(), id_vars=['student'], var_name='subject',
value_name='score')# Melt: Wide to long format
print("Melted DataFrame:")
print(melted_df)
stacked_df = pivot_table.stack()# Stack: Converts columns to rows (reshapes pivot_table)
print("\nStacked DataFrame:")
print(stacked_df)

Output:

Explanation:

●​ melt(): Unpivots the data into a long format. Each row represents a unique combination
of id_vars and value_name.
●​ stack(): Moves columns into rows, creating a MultiIndex DataFrame.

10.3 Flattening Data with unstack()

unstack() converts rows into columns, reversing the effect of stack().


Code:
unstacked_df = stacked_df.unstack()# Unstacking the stacked DataFrame
print("Unstacked DataFrame:")
print(unstacked_df)

Output:

Explanation:

unstack() is the inverse of stack(). It pivots the innermost row index level into columns.

Part 5: Aggregation and Grouping

Grouping Data

What is Grouping?

Grouping is the process of splitting data into subsets based on one or more criteria
(columns) and then applying an operation to these subsets. This is useful for summarizing or
analyzing data by categories.

For example:

●​ Finding the average age of individuals grouped by gender.


●​ Calculating the total portfolio rating for each ethnic group.

11.1 Grouping by Single and Multiple Columns

Grouping by Single Column

We can group data by a single column to compute statistics or summarize values.

Code:

import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset (assuming it is in the current working directory)
group_by_gender = df.groupby('gender').mean(numeric_only=True)  # Grouping by gender
print("Grouped by Gender:\n", group_by_gender)
Output:

Explanation:

●​ The groupby('gender') groups data by the gender column.


●​ The mean() function calculates the average for numeric columns (e.g., age, portfolio
rating).
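Grouping by Multiple Columns

Passing a list of column names to groupby() creates one group per combination of values; a short sketch using the same dataset (the 'nationality' column is used elsewhere in this document):

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset
group_by_gender_nat = df.groupby(['gender', 'nationality']).mean(numeric_only=True)  # One group per (gender, nationality) pair
print("Grouped by Gender and Nationality:\n", group_by_gender_nat)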

11.2 Aggregating Data with Functions

Aggregation Functions

Aggregation functions summarize data using operations like mean, sum, or count.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset
aggregation = df.groupby('gender').agg({  # Grouping by gender
    'age': ['mean', 'min', 'max'],        # Multiple aggregations for age
    'portfolio.rating': 'sum',            # Total portfolio rating
    'refletter.rating': 'count'           # Count of reference letter ratings
})
print("Aggregated Data:\n", aggregation)

Output:

Explanation:

●​ The agg() function allows multiple aggregations on different columns.


●​ Provides detailed summaries for grouped data.
11.3 Custom Aggregations Using Lambda Functions

Custom aggregations are used for more specific calculations. For instance, calculating the
percentage of high ratings in a group.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
custom_agg = df.groupby('gender').agg(# Custom aggregation example
high_ratings_pct=('portfolio.rating', lambda x: (x > 7).mean() * 100)
)
print("Custom Aggregation (High Ratings Percentage):\n", custom_agg)

Output:

Explanation:

agg() Method:

The syntax ('column_name', function) is used to specify which column to apply


the aggregation on and what function to apply.

Here, ('portfolio.rating', lambda x: (x > 7).mean() * 100) checks which rows in the
portfolio.rating column have values greater than 7, calculates the proportion, and
converts it into a percentage.

Lambda Function:

The lambda x: (x > 7).mean() * 100 calculates the percentage of values in the
group where portfolio.rating > 7.

groupby('gender'):

Groups the dataset by the gender column.

Result:

The output shows the percentage of high ratings for each gender group.
12.Advanced Grouping Techniques​

12.1 Performing Hierarchical Grouping

Hierarchical grouping involves grouping by multiple columns to create sub-groups.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
grouped = df.groupby(['gender', 'ethnic.group'])[['portfolio.rating',
'coverletter.rating']].mean()# Perform hierarchical grouping by 'gender' and 'ethnic.group'

print("Hierarchical Grouping (Mean Portfolio and Cover Letter Ratings):\n", grouped)

Output:

Explanation:

●​ Groups the data first by gender and then by ethnic.group.


●​ Calculates the mean of portfolio.rating and coverletter.rating for each sub-group.
●​ The result is a DataFrame indexed hierarchically.

12.2 Using groupby Objects for Transformations

You can use groupby to transform data within groups.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset
df['mean_age_by_gender'] = df.groupby('gender')['age'].transform('mean')  # Add a new column showing the mean age within each 'gender' group
print("Data with Mean Age by Gender:\n", df[['gender', 'age', 'mean_age_by_gender']])
Output:

Explanation:

●​ transform() applies a function to each group and returns a Series with the same index as
the original DataFrame.
●​ Here, it calculates the mean age for each gender group and adds it as a new column.

12.3 Aggregating Multiple Columns with Different Functions

You can apply different aggregation functions to different columns.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset
agg_results = df.groupby('gender').agg({
    'age': ['mean', 'max'],       # Apply mean and max to 'age'
    'portfolio.rating': 'sum',    # Apply sum to 'portfolio.rating'
    'coverletter.rating': 'min'   # Apply min to 'coverletter.rating'
})  # Aggregate with different functions for different columns
print("Aggregating Multiple Columns with Different Functions:\n", agg_results)

Output:

Explanation:

●​ Uses the agg() method to specify different functions for each column.
●​ For example:
○​ mean and max are applied to age.
○​ sum is applied to portfolio.rating.
○​ min is applied to coverletter.rating.
Part 6: Advanced Operations

13. Applying Functions Across DataFrames

13.1 Using Vectorized Operations for Speed


​ Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')  # Load the dataset
df['total.grade'] = df['english.grade'] + df['math.grade'] + df['sciences.grade']  # Add values across columns
print("Calculated 'total.grade' using vectorized operations:")
print(df[['english.grade', 'math.grade', 'sciences.grade', 'total.grade']])  # Display relevant columns
Output:

​ Vectorized operations in pandas are optimized for performance. Here, the


sum of grades is calculated across columns (english.grade, math.grade, sciences.grade)
for each row, creating a new column total.grade.

13.2 Applying Custom Functions Row-Wise and Column-Wise



​ Code:
def grade_category(total):
if total >= 10:
return 'Excellent'
elif total >= 7:
return 'Good'
else:
return 'Needs Improvement'# Define a custom function to categorize grades

df['grade.category'] = df['total.grade'].apply(grade_category)  # Apply the function to classify grades
print("Applied custom function to categorize grades:")
print(df[['total.grade', 'grade.category']])  # Display total grades and their categories

Output:

Custom functions provide flexibility to implement row-wise or column-wise operations. Here, the function grade_category assigns a label (Excellent, Good, Needs Improvement) based on total.grade.
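For a genuinely row-wise application, apply() can also be called on a DataFrame with axis=1, so the function receives each row as a Series. A short sketch, continuing with the df and grade_category defined above, that recomputes the same category from the three grade columns:

Code:
df['grade.category.rowwise'] = df[['english.grade', 'math.grade', 'sciences.grade']].apply(
    lambda row: grade_category(row.sum()), axis=1
)  # Row-wise apply: each row's three grades are summed and classified
print(df[['total.grade', 'grade.category', 'grade.category.rowwise']])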

13.3 Combining Data with merge(), join(), and concat()

​ Code:
import pandas as pd
student_details = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'nationality': ['USA', 'UK', 'India']
})  # DataFrames for demonstration

additional_scores = pd.DataFrame({
    'id': [1, 2],
    'sports.score': [85, 90]
})

# Ensure 'id' columns are of the same type
student_details['id'] = student_details['id'].astype(int)
additional_scores['id'] = additional_scores['id'].astype(int)

# Merge student details with additional scores on 'id'
merged_data = student_details.merge(additional_scores, on='id', how='left')  # Left join
print("Merged DataFrame with additional scores:")
print(merged_data)  # Display merged result

concat_data = pd.concat([student_details, additional_scores], ignore_index=True)  # Combine rows
print("\nConcatenated DataFrame:")
print(concat_data)  # Display concatenated data
Output:

In this example, merge() combines the student_details DataFrame with additional_scores on the id column, performing a left join that retains all student details while adding the available scores. concat() stacks rows from both DataFrames, creating a single unified DataFrame.

14. Window Functions​


14.1 Calculating Rolling Averages and Standard Deviations
import pandas as pd

df = pd.read_csv("student.csv")  # Load the dataset

# Calculate a rolling average and standard deviation for 'math.grade' (window size = 3)
df['math_grade_rolling_avg'] = df['math.grade'].rolling(window=3).mean()  # Rolling average
df['math_grade_rolling_std'] = df['math.grade'].rolling(window=3).std()   # Rolling standard deviation

# Display the rolling calculations
print("Rolling average and standard deviation for 'math.grade':")
print(df[['math.grade', 'math_grade_rolling_avg', 'math_grade_rolling_std']])

Output:
Rolling average and standard deviation are calculated over a window of 3 rows
for 'math.grade'.

14.2 Expanding Windows for Cumulative Metrics



​ Code:
import pandas as pd
df = pd.read_csv("student.csv")  # Load the dataset
df['english_cumulative_sum'] = df['english.grade'].expanding().sum()    # Cumulative sum
df['english_cumulative_mean'] = df['english.grade'].expanding().mean()  # Cumulative mean

# Display the cumulative metrics
print("\nCumulative sum and mean for 'english.grade':")
print(df[['english.grade', 'english_cumulative_sum', 'english_cumulative_mean']])

Output:

Cumulative sum and mean track metrics from the beginning of the dataset for
'english.grade'.
14.3 Ranking Rows with rank() and Custom Sorting

​ Code:
import pandas as pd
df = pd.read_csv("student.csv") # Load the dataset

df_sorted = df.sort_values(by=['math.grade', 'english.grade'], ascending=[False, True])  # Sort by 'math.grade' descending, then 'english.grade' ascending
df_sorted['rank'] = df_sorted['math.grade'].rank(ascending=False)  # Rank students based on 'math.grade'

# Display the ranking with custom sorting
print("\nRanking students with custom sorting:")
print(df_sorted[['name', 'math.grade', 'english.grade', 'rank']])

Output:

This code ranks students based on custom sorting by multiple columns, first by
math.grade in descending order, then by english.grade in ascending order. The rank()
function is used to assign ranks based on the sorted data.​

Part 7: Visualization with DataFrames

15. Basic Visualization

Visualizing data is essential for identifying patterns, relationships, and trends. Using
Matplotlib, you can create various plots to understand the data better. We'll demonstrate basic
visualization techniques using the student-dataset.csv.
15.1 Plotting DataFrame Columns with Matplotlib

You can plot a column or multiple columns directly from a DataFrame.

Code: Plotting Age Distribution by Gender

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset

plt.figure(figsize=(8, 5))  # Plot the age distribution for each gender

for gender, group in df.groupby('gender'):
    plt.plot(group['id'], group['age'], marker='o', label=gender)

plt.title("Age Distribution by Gender")
plt.xlabel("ID")
plt.ylabel("Age")
plt.legend(title='Gender')
plt.grid()
plt.show()

Output:

Explanation:

1.​ Groups data by gender and plots the age against id for each group.
2.​ plt.plot() creates a line chart for each gender, differentiated by a label.
3.​ Adds title, axis labels, and grid for better readability.
15.2 Creating Histograms, Line Charts, and Scatter Plots

1. Creating a Histogram: Portfolio Ratings Distribution

A histogram is used to visualize the frequency distribution of continuous data.

Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset

plt.figure(figsize=(8, 5))# Create a histogram for portfolio ratings


plt.hist(df['portfolio.rating'], bins=5, color='skyblue', edgecolor='black')
plt.title("Portfolio Ratings Distribution")
plt.xlabel("Portfolio Rating")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

Output:

Explanation:

●​ plt.hist() creates a histogram for the portfolio.rating column.


●​ bins=5 divides the data into 5 intervals.
●​ edgecolor='black' highlights the edges of the bars.
2. Creating a Line Chart: Average Grades by Subject

A line chart is useful for visualizing trends over categories or time.

Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')  # Load the dataset
avg_grades = df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade']].mean()  # Calculate average grades for each subject

plt.figure(figsize=(8, 5))  # Plot the average grades as a line chart
avg_grades.plot(kind='line', marker='o', color='purple')
plt.title("Average Grades by Subject")
plt.xlabel("Subjects")
plt.ylabel("Average Grade")
plt.grid()
plt.show()

Output:

Explanation:

●​ The mean() function calculates the average grade for each subject.
●​ The plot(kind='line') method generates a line chart.
●​ marker='o' adds circular markers at each data point.

3.Creating a Scatter Plot: Portfolio Rating vs. Cover Letter Rating

A scatter plot visualizes the relationship between two continuous variables.

Code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('student-dataset.csv')# Load the dataset


plt.figure(figsize=(8, 5))# Create a scatter plot for portfolio rating and cover letter rating
plt.scatter(df['portfolio.rating'], df['coverletter.rating'], color='green', alpha=0.7)
plt.title("Portfolio Rating vs. Cover Letter Rating")
plt.xlabel("Portfolio Rating")
plt.ylabel("Cover Letter Rating")
plt.grid()
plt.show()

Output:

Explanation:

●​ plt.scatter() plots the portfolio.rating on the x-axis and coverletter.rating on the y-axis.
●​ alpha=0.7 adjusts the transparency to improve clarity.

16. Advanced Visualization

Visualizing data is essential for identifying patterns, relationships, and trends. Using Matplotlib, Seaborn, and Plotly, you can build richer plots; this section demonstrates more advanced visualization techniques on the student-dataset.csv.

16.1 Visualizing Grouped Data with Bar and Box Plots

1. Bar Plot: Average Grades by Gender

Bar plots are used to visualize aggregated data, such as averages or counts, for categorical
variables.

Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset

avg_grades_gender = df.groupby('gender')[['english.grade', 'math.grade', 'sciences.grade']].mean()  # Calculate average grades grouped by gender

# Plot a bar plot
avg_grades_gender.plot(kind='bar', figsize=(8, 5), color=['skyblue', 'orange', 'green'])
plt.title("Average Grades by Gender")
plt.xlabel("Gender")
plt.ylabel("Average Grade")
plt.xticks(rotation=0)
plt.legend(title="Subjects")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

Output:

Explanation:

●​ Grouped the dataset by gender to calculate the mean grades for English, Math, and
Sciences.
●​ Created a bar plot for visualizing the average grades by gender.

2. Box Plot: Distribution of Portfolio Ratings by Gender

Box plots help visualize the distribution, spread, and outliers in data across categories.

Code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('student-dataset.csv')# Load the dataset


plt.figure(figsize=(8, 5))# Create a box plot for portfolio ratings by gender
df.boxplot(column='portfolio.rating', by='gender', grid=False, patch_artist=True,
boxprops=dict(facecolor='lightblue'))
plt.title("Portfolio Ratings by Gender")
plt.suptitle("") # Removes the default grouped title
plt.xlabel("Gender")
plt.ylabel("Portfolio Rating")
plt.show()

Output:

Explanation:

●​ Used boxplot() to create a grouped box plot for portfolio.rating by gender.


●​ This plot highlights the distribution of ratings within each gender category.

16.2 Correlation Heatmaps Using Seaborn

Correlation heatmaps are used to visualize the relationship between numeric columns.

Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('student-dataset.csv')  # Load the dataset

corr_matrix = df[['english.grade', 'math.grade', 'sciences.grade', 'portfolio.rating', 'coverletter.rating']].corr()  # Calculate the correlation matrix

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)  # Create a heatmap
plt.title("Correlation Heatmap")
plt.show()
Output:


Explanation:

●​ Calculated the correlation matrix to measure relationships between numeric


columns.
●​ Used Seaborn’s heatmap() to create a visually appealing heatmap with
annotations.

16.3 Creating Interactive Plots with Plotly


​ Code:
​ import plotly.graph_objects as go # Import the plotly library

x = [1, 2, 3, 4, 5]
y = [10, 11, 12, 13, 14]# Create some data for the plot

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='lines', name='Line Plot'))

fig.update_layout(title='Interactive Line Plot',


xaxis_title='X Axis',
yaxis_title='Y Axis')# Add title and labels

fig.show()# Show the interactive plot


Output:

Plotly allows you to create interactive visualizations, such as line, bar, and scatter
plots, with built-in features like zoom, hover, and pan.

Part 8: Performance Optimization

17. Improving Performance​

17.1 Vectorization and Avoiding Loops in DataFrames

●​ Concept: Vectorization refers to performing operations directly on arrays or columns


rather than using Python loops. It leverages the speed of NumPy/Pandas operations,
which are implemented in C.
●​ Why? Loops in Python are slow compared to optimized vectorized operations. Avoiding
them significantly speeds up data processing.

Example: Calculating Total Grade (first with a Python loop, for comparison)


Code:

import pandas as pd

df = pd.read_csv('student-dataset.csv')  # Load the dataset

df['total_grade_loop'] = 0

for i in range(len(df)):
    df.loc[i, 'total_grade_loop'] = (
        df.loc[i, 'english.grade'] +
        df.loc[i, 'math.grade'] +
        df.loc[i, 'sciences.grade'] +
        df.loc[i, 'language.grade']
    )

print(df[['id', 'name', 'english.grade', 'math.grade', 'sciences.grade', 'language.grade', 'total_grade_loop']])  # Print the resulting DataFrame

Output:
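For comparison, the vectorized equivalent computes the same totals in a single column-wise expression, with no Python loop, and is typically much faster on large DataFrames (continuing with the df loaded above):

Code:
df['total_grade_vectorized'] = (
    df['english.grade'] + df['math.grade'] + df['sciences.grade'] + df['language.grade']
)  # One vectorized expression instead of a row-by-row loop
print(df[['total_grade_loop', 'total_grade_vectorized']].head())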

17.2 Optimizing Memory Usage with Data Types


Concept: By default, Pandas assigns larger data types like float64 or int64, which
consume more memory. Downcasting reduces memory usage.
Why? Reducing memory usage improves performance, especially with large datasets.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset

print("Memory usage before optimization:")


print(df.info(memory_usage='deep')) # Check memory usage before optimization

# Downcast numeric columns
df['age'] = pd.to_numeric(df['age'], downcast='integer')
df['portfolio.rating'] = pd.to_numeric(df['portfolio.rating'], downcast='float')

print("\nMemory usage after optimization:")


print(df.info(memory_usage='deep'))# Check memory usage after optimization
Output:
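Another common saving in the same spirit, not shown above, is converting low-cardinality text columns to the category dtype, which stores each distinct string only once (a sketch assuming the 'gender' and 'nationality' columns hold repeated values, continuing with the df above):

Code:
df['gender'] = df['gender'].astype('category')            # Repeated strings stored once
df['nationality'] = df['nationality'].astype('category')
print(df.info(memory_usage='deep'))  # Compare memory usage with the earlier output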

17.3 Processing Large DataFrames Using Dask

●​ Concept: For very large datasets that cannot fit into memory, use Dask. It splits the
DataFrame into smaller chunks and processes them in parallel.
●​ Why? Dask allows working with datasets larger than memory by processing them in
chunks.

Example: Reading and Aggregating Large Data


​ Code:
import dask.dataframe as dd

dask_df = dd.read_csv('student-dataset.csv')  # Load the dataset using Dask

result = dask_df.groupby('gender')['portfolio.rating'].mean().compute()  # Perform an aggregation
print("Mean portfolio rating by gender:\n", result)

Output:

Part 9: Real-World Applications

18.​Case Studies​
18.1 Analyzing Customer Purchases: Identifying Top Customers and Products
In 2020, Amazon, one of the largest e-commerce platforms globally, wanted to
enhance its understanding of customer purchasing behavior to improve marketing
campaigns and inventory management. By analyzing purchase data, Amazon's data
scientists were able to identify their top customers based on total spend and frequency of
purchase. They also used this data to identify top-selling products, especially seasonal
items, which were often out of stock due to unexpected demand. Amazon used data
analytics techniques like clustering and segmentation to categorize customers and predict
future buying behavior. The solution was implemented in early 2021, leading to targeted
promotional campaigns and better inventory management, reducing stockouts during
peak shopping seasons like Prime Day and the holiday rush. This led to improved
customer satisfaction and an increase in sales during high-demand periods. The project
took about four months to complete, leveraging advanced analytics tools such as Amazon
Redshift and AWS data services.

18.2 Time Series Analysis: Forecasting Sales Trends
In 2020, Walmart faced challenges in accurately forecasting demand for essential
products due to the COVID-19 pandemic. The company used time series analysis on
historical sales data from 2018 and 2019 to understand sales trends and seasonal
variations. By using time series models like ARIMA (AutoRegressive Integrated Moving
Average), Walmart was able to predict shifts in product demand, especially for essentials
like toilet paper, sanitizers, and groceries. The company applied these forecasts to adjust
stock levels dynamically and prevent shortages. The data was collected over several
months, starting in March 2020, with a solution fully implemented by May 2020. The
project helped Walmart not only manage supply chain disruptions but also adjust to new
consumer buying patterns caused by the pandemic. This analysis improved inventory
turnover and ensured that essential products were available, even during the early
pandemic months when consumer demand was highly unpredictable.

18.3 Building a KPI Dashboard with Aggregated Metrics

In 2021, Coca-Cola, the multinational beverage corporation, sought to build a KPI


dashboard to monitor the performance of its global supply chain, production efficiency,
and sales performance. The company aggregated data from various regions, focusing on
key metrics such as sales volume, production costs, and customer satisfaction. This
dashboard was built using business intelligence tools like Tableau and Power BI, with
data pulled from Coca-Cola’s SAP system. By June 2021, the dashboard was fully
operational, providing real-time insights for regional managers and executives. It allowed
Coca-Cola to quickly assess operational performance across different countries and adjust
strategies accordingly. For example, when sales dipped in certain markets, the dashboard
provided immediate feedback, enabling quick interventions, such as running targeted
promotions or redistributing stock from overperforming regions. The solution helped the
company streamline decision-making and improve efficiency across its global operations.
This project took six months to implement and had a significant impact on reducing
waste and improving profitability.

19. Practical Projects​


19.1 Financial Data Analysis: Calculating Portfolio Returns

​ Code:
import pandas as pd
import numpy as np

num_days = 252  # Number of business days in 2022 (approximately 252 days)

data = {
    'Date': pd.date_range('2022-01-01', periods=num_days, freq='B'),  # Business days in 2022
    'Stock_A': np.random.normal(100, 1, num_days),  # Stock A prices
    'Stock_B': np.random.normal(50, 1, num_days),   # Stock B prices
    'Stock_C': np.random.normal(200, 1, num_days)   # Stock C prices
}  # Sample stock data for a portfolio (assume the data is daily closing prices)

df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

weights = {'Stock_A': 0.4, 'Stock_B': 0.3, 'Stock_C': 0.3}  # Portfolio weights (assumed)
returns = df.pct_change().dropna()  # Calculate daily returns for each stock
portfolio_returns = (returns * pd.Series(weights)).sum(axis=1)  # Calculate the portfolio return for each day
cumulative_return = (1 + portfolio_returns).cumprod() - 1  # Calculate the cumulative portfolio return

print("Cumulative Portfolio Return:")  # Display the cumulative return
print(cumulative_return.tail())

Output:

Analysis:
●​ Data: In the code, we simulate daily stock prices for three stocks (Stock_A,
Stock_B, and Stock_C), with random values generated from a normal
distribution.
●​ Portfolio Weights: We assume the portfolio is composed of 40% in Stock_A, 30%
in Stock_B, and 30% in Stock_C.
●​ Daily Returns: Using pct_change(), we calculate the daily percentage change in
stock prices.
●​ Portfolio Return: The portfolio return on any given day is calculated by
multiplying the stock's daily return by its weight in the portfolio, and summing
the results.
●​ Cumulative Return: The cumulative portfolio return is the product of daily returns
over time, which is adjusted by subtracting 1.

19.2 Retail Analytics: Category-Wise Sales Comparison

Code:
import pandas as pd

data = {
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Furniture',
'Electronics', 'Furniture', 'Clothing', 'Furniture', 'Electronics'],
'Sales': [1200, 300, 1500, 700, 200, 1800, 400, 500, 350, 2200]
}# Sample retail sales data

df = pd.DataFrame(data)
category_sales = df.groupby('Category')['Sales'].sum()  # Group by category and calculate total sales

# Plot the category-wise sales comparison
category_sales.plot(kind='bar', title='Category-Wise Sales Comparison', xlabel='Category', ylabel='Total Sales')
Output:

Analysis::

●​ Data: The dataset includes product names, their categories (Electronics, Clothing,
Furniture), and the sales amount for each product.
●​ Grouping: We use groupby() to aggregate the data by Category and calculate the
total sales for each category.
●​ Visualization: A bar chart is created to show the sales comparison between
categories. The plot(kind='bar') function generates the plot with labeled axes


19.3 Healthcare Analytics: Patient Data Analysis and Reporting

​ Code:
import pandas as pd

data = {
'Patient_ID': [1, 2, 3, 4, 5],
'Age': [34, 56, 23, 45, 67],
'Diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Hypertension', 'Cancer'],
'Treatment': ['Insulin', 'Medication', 'Insulin', 'Medication', 'Chemotherapy'],
'Outcome': ['Good', 'Fair', 'Good', 'Good', 'Poor']
}# Sample healthcare patient data

df = pd.DataFrame(data)

diagnosis_counts = df['Diagnosis'].value_counts()  # Count the number of patients by diagnosis
outcome_by_diagnosis = pd.crosstab(df['Diagnosis'], df['Outcome'])  # Outcome analysis by diagnosis

print("Diagnosis Count:")
print(diagnosis_counts)# Display diagnosis counts and outcome analysis
print("\nOutcome Analysis by Diagnosis:")
print(outcome_by_diagnosis)

Output:

Analysis:

●​ Data: The dataset contains patient records with information such as Patient_ID,
Age, Diagnosis, Treatment, and Outcome.
●​ Diagnosis Count: We use value_counts() to count the number of patients for each
diagnosis (e.g., Diabetes, Hypertension, Cancer).
●​ Outcome Analysis: A crosstab is used to show the relationship between the
diagnosis and the patient's outcome (e.g., Good, Fair, Poor).
●​ Report Generation: The diagnosis count and the outcome analysis are printed for
further interpretation.

Part 10: Extensions and Integration

20. Integrating with Other Libraries

20.1 Using DataFrames with NumPy for Mathematical Operations

Concept:

NumPy provides powerful mathematical operations that can be applied to Pandas DataFrames. You can use NumPy functions to perform element-wise operations or aggregate computations.
Example: Calculate the square root of the grades using NumPy.

import pandas as pd

import numpy as np

df = pd.read_csv('student-dataset.csv')# Load the dataset

df['math.sqrt'] = np.sqrt(df['math.grade'])# Using NumPy for mathematical operations

df['average.grade'] = np.mean(
    df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade']], axis=1
)  # Row-wise average of the four grade columns

# Print the result
print("DataFrame with NumPy calculations:\n",
      df[['id', 'name', 'math.grade', 'math.sqrt', 'average.grade']])

Output:

20.2 Exporting Data to CSV, Excel, and SQL Databases

Concept:
Pandas makes it easy to export DataFrames to different formats for storage or integration
with other applications. You can export to CSV, Excel, or even SQL databases.

Example: Export the DataFrame to CSV and Excel, and create a table in an SQLite
database.
import sqlite3

df.to_csv('exported_student_data.csv', index=False)  # Export to CSV
print("Data exported to 'exported_student_data.csv'.")

df.to_excel('exported_student_data.xlsx', index=False)  # Export to Excel
print("Data exported to 'exported_student_data.xlsx'.")

conn = sqlite3.connect('students.db')  # Export to an SQLite database
df.to_sql('students', conn, if_exists='replace', index=False)
print("Data exported to 'students' table in 'students.db'.")

Output:

20.3 Using Pandas DataFrames in Machine Learning Workflows

Concept:

Pandas DataFrames are commonly used to preprocess data for machine learning
workflows. You can handle missing values, encode categorical data, and split datasets for
training and testing.

Example:

Prepare the student-dataset.csv for machine learning by encoding categorical columns


and splitting into training and testing sets.

Code:

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['gender_encoded'] = label_encoder.fit_transform(df['gender'])  # Encode the "gender" column

X = df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade',
        'gender_encoded']]  # Select the feature columns
y = df['portfolio.rating']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)  # Split into training and testing sets

print("Training Features:\n", X_train.head())  # Print the training and testing data
print("\nTesting Features:\n", X_test.head())
print("\nTraining Targets:\n", y_train.head())
print("\nTesting Targets:\n", y_test.head())
Output:

21.Beyond Pandas

21.1 Exploring DataFrames in PySpark
Code: ​
​ ​ from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()  # Initialize a Spark session

data = [("Alice", 25), ("Bob", 30), ("Cathy", 27)]
columns = ["Name", "Age"]  # Create a sample DataFrame
df = spark.createDataFrame(data, columns)

df.show()  # Show the contents of the DataFrame
df.printSchema()  # Display the schema of the DataFrame

filtered_df = df.filter(df["Age"] > 26)  # Perform a filter operation
filtered_df.show()

df.groupBy("Age").count().show()  # Group by 'Age' and count occurrences
Output:

●​ DataFrame Creation: A PySpark DataFrame is created with sample data.


●​ Exploration: We show contents, schema, filter rows by a condition, and group
data for aggregation.

21.2 Faster Processing with Vaex and Modin
Code:
import vaex

df = vaex.from_csv("student.csv")  # Load your dataset

if len(df) == 0:  # Check if the dataset is empty
    print("The dataset is empty. Please load a valid dataset.")
else:
    print("Dataset loaded successfully.")
    print("Number of rows:", len(df))  # Proceed with your operations
    print(df.head())  # Display the first few rows

result = df.count()
print("Count result:", result)
Output:

Modin:
​ import os
os.environ["MODIN_ENGINE"] = "ray"  # or "dask"

import modin.pandas as mpd

df = mpd.read_csv('student.csv')  # Load data with Modin

df_filtered = df[df['age'] > 20]  # Perform operations
print(df_filtered)

Output:

Vaex provides memory-efficient, lazy evaluation for large datasets, while Modin
uses parallel processing to speed up Pandas operations. Both libraries offer faster data
manipulation compared to standard Pandas.

21.3 Combining Pandas with SQLAlchemy for Complex Queries

​ Code:
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///students.db')  # Use your database URL here

with engine.begin() as connection:
    connection.execute(text("""
        CREATE TABLE IF NOT EXISTS students (
            id INTEGER PRIMARY KEY,
            name TEXT,
            age INTEGER
        )
    """))  # Create a table (just for demonstration)
    connection.execute(text(
        "INSERT INTO students (name, age) VALUES "
        "('Alice', 25), ('Bob', 30), ('Cathy', 27), ('David', 35)"
    ))  # Insert sample data

query = "SELECT * FROM students WHERE age > 30"  # Query data using Pandas
df = pd.read_sql(query, engine)  # Load the result of the query into a Pandas DataFrame
print(df)

Output:

This example demonstrates how to use SQLAlchemy with Pandas to run complex SQL queries and retrieve the results as a DataFrame. It connects to a SQLite database, queries for students over age 30, and displays the results.
Beginner-Level Lab Exercises
1: Getting Started with DataFrames

Creating DataFrames
Ex1: Create a DataFrame from a dictionary of lists.

Code:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 90, 95]
}
df1 = pd.DataFrame(data)  # Create a DataFrame from the dictionary of lists
print(df1)

Explanation:
This code creates a DataFrame using the pandas library from a dictionary of lists.
Each key in the dictionary (Name, Age, and Score) becomes a column in the DataFrame,
and the corresponding lists represent the values in those columns. Each row in the
DataFrame is formed by aligning the elements from the lists by their positions. The
resulting DataFrame organizes the data in a tabular format, making it easier to manipulate
and analyze. Finally, the DataFrame is printed to display the dataset.

Output:

Ex2: Load a dataset from a CSV file into a Pandas DataFrame and display the first 10
rows.

Code:

df = pd.read_csv('student-dataset.csv')
print(df.head(10)) # Load and display the first 10 rows

Explanation:
This code reads a CSV file named "student-dataset.csv" into a Pandas DataFrame
using the read_csv function. The resulting DataFrame organizes the data from the CSV
file into rows and columns for easy manipulation and analysis. The head(10) method is
then used to display the first 10 rows of the dataset, allowing for a quick inspection of its
structure and contents. This helps in understanding the data before performing further
operations.
Output:

Ex3: Create a DataFrame using NumPy arrays with custom row and column labels.

Code:

import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # Create a DataFrame from a NumPy array
df2 = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3'],
                   columns=['Col1', 'Col2', 'Col3'])
print(df2)

Explanation:
This code creates a DataFrame using NumPy arrays, where the array provides the
data in a matrix-like format. The DataFrame is constructed by mapping the rows of the
array to custom row labels (Row1, Row2, Row3) and the columns to custom column
labels (Col1, Col2, Col3). Each element in the array is placed at the intersection of its
respective row and column label in the DataFrame. The result is a labeled, tabular
structure that is printed for visualization.

Output:
Exploring DataFrames
Ex4: Display the shape, column names, and data types of a DataFrame.

Code:
print(df.shape) # Display shape, columns, and data types
print(df.columns)
print(df.dtypes)

Explanation:

This code provides key information about a DataFrame. The shape function
displays the dimensions of the DataFrame as (number of rows, number of columns). The
columns function lists the names of all columns in the DataFrame. The dtypes function
shows the data types of each column, helping to understand the structure and types of
data stored in the DataFrame. These details are useful for understanding the dataset
before performing further analysis.

Output:

Ex5: Use head(), tail(), and sample() to preview rows.

Code:

print(df.head())    # First 5 rows
print(df.tail())    # Last 5 rows
print(df.sample(3)) # 3 random rows

Explanation:
This code previews specific parts of the DataFrame for quick inspection. The
head function displays the first 5 rows of the DataFrame, while the tail function shows
the last 5 rows. The sample function retrieves a specified number of random rows (in this
case, 3). These methods are useful for gaining insights into the dataset's structure and
content without viewing the entire DataFrame.

Output:

Ex6: Count unique values in a categorical column using value_counts().

Code:
value_counts = df['gender'].value_counts()  # Count unique values in the gender column
print(value_counts)

Explanation:
This code counts the occurrences of each unique value in the specified column of
the DataFrame, which in this case is the "gender" column. The value_counts function
returns the count of each unique value as a series, with the values as the index and their
counts as the corresponding values. This is useful for analyzing the distribution of data in
a categorical column.

Output:
2: Basic Data Manipulation

Selecting and Filtering Data


Ex7: Select specific columns by name and rows by index using iloc and loc.

Code:

selected_columns = df[['name', 'age']]  # Select columns by name
selected_rows = df.iloc[0:5]  # First 5 rows using position
filtered_rows = df.loc[df['age'] > 20, ['name', 'age']]  # Filter using labels

print(selected_columns)
print(selected_rows)
print(filtered_rows)

Explanation:
This code demonstrates different methods for selecting and filtering data in a
DataFrame. First, the selected_columns line selects the "name" and "age" columns using
column names. Then, the selected_rows line uses iloc to select the first 5 rows based on
their position. The filtered_rows line uses loc to filter rows where the "age" column is
greater than 20 and selects only the "name" and "age" columns. These techniques allow
for flexible and precise data selection and filtering in a DataFrame.

Output:

Ex8: Filter rows where a numeric column value exceeds a threshold (e.g., Sales > 1000).

Code:

filtered_df = df[df['age'] > 25] # Filter rows where 'age' is greater than 25
print(filtered_df)

Explanation:
This code filters the rows of the DataFrame where the value in the "age" column
is greater than 25. It creates a new DataFrame,filtered_df, containing only the rows that
meet this condition. This technique is useful for narrowing down the dataset based on a
specific numeric criterion, in this case, age. The filtered DataFrame is then printed for
inspection.

Output:

Ex9: Select rows where a text column matches a specific value using str.contains().

Code:
filtered_city = df[df['city'].str.contains('New', na=False)]  # Select rows where the city contains "New"
print(filtered_city)

Explanation:
This code filters the rows where the "city" column contains the substring "New".
The str.contains() function is used to check if the specified string ("New") is present in
each value of the "city" column. The na=False argument ensures that any missing (NaN)
values are excluded from the result. The filtered rows are then stored in the filtered_city
DataFrame and printed for inspection. This method is useful for searching text columns
for specific substrings.

Output:

Sorting and Renaming


Ex10: Sort the DataFrame by one or more columns in ascending and descending order.

Code:

df = pd.read_csv('student-dataset.csv')
df_sorted_by_age = df.sort_values('age')  # Sort by age in ascending order
df_sorted_multi = df.sort_values(['age', 'math.grade'],
                                 ascending=[True, False])  # Sort by multiple columns (age ascending, math.grade descending)
print(df_sorted_by_age.head())
print(df_sorted_multi.head())

Explanation:
In this example, the sort_values method is used to sort the DataFrame df in
different ways. First, the DataFrame is sorted by the age column in ascending order. Next,
it is sorted by multiple columns: age in ascending order and math.grade in descending
order. The ascending parameter specifies the sort order for each column, with True for
ascending and False for descending.

​ Output:

Ex11: Rename columns to more descriptive names using rename().

Code:

df_renamed = df.rename(columns={
    'english.grade': 'english_performance',
    'math.grade': 'math_performance',
    'sciences.grade': 'science_performance',
    'language.grade': 'language_proficiency',
    'ethnic.group': 'ethnicity'
})  # Rename columns to more descriptive names
Explanation:
In this example, the rename method is used to change column names in the
DataFrame to more descriptive ones. The columns parameter takes a dictionary where
the keys are the original column names, and the values are the new names. This improves
readability and clarity in the DataFrame. To view the renamed DataFrame, use the print(df_renamed) statement.

​ Output:

Ex12: Reset the index of a DataFrame and drop the existing index column.

​ Code:

df_reset = df.reset_index(drop=True)  # Reset the index and drop the existing index column
print(df_reset.head())

Explanation:
The DataFrame df_reset now has a fresh index (from 0 to the number of rows - 1) without retaining the original index column. The drop=True argument ensures that the old index is removed from the DataFrame and not added as a new column. Without this argument, the old index would appear as a separate column in the resulting DataFrame, as the short sketch below illustrates.
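For contrast, a minimal sketch (with a small hypothetical DataFrame) showing reset_index() with and without drop=True:

import pandas as pd

small = pd.DataFrame({'score': [90, 85, 88]}, index=[10, 20, 30])  # hypothetical data

print(small.reset_index())           # the old index is kept as a new 'index' column
print(small.reset_index(drop=True))  # the old index is discarded entirely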
​ Output:

Intermediate-Level Exercises
3: Cleaning and Preprocessing Data

Handling Missing Data

Ex13: Identify missing values in each column using isna() and sum().

Code:​
​ ​ import pandas as pd
import numpy as np
df = pd.read_csv('student-dataset.csv') # Load the dataset
print("Missing values per column:")
print(df.isna().sum())  # isna() flags missing values; sum() counts them per column

Explanation:

This code identifies the number of missing values in each column of the
DataFrame. The isna() function checks for missing (NaN) values in the dataset, returning
a DataFrame of the same shape with True for missing values and False otherwise. The
sum() function then calculates the total count of missing values for each column. In this
case, only the ethnic.group column contains missing values, so its count is displayed,
while other columns show zero if they have no missing values.
​ Output:

Ex14: Replace missing numeric values with the column mean and missing text values with
"Unknown".

Code:
numeric_columns = df.select_dtypes(include=[np.number]).columns
text_columns = df.select_dtypes(include=['object']).columns

df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())  # Replace missing numeric values with the column mean
df[text_columns] = df[text_columns].fillna("Unknown")  # Replace missing text values with "Unknown"

print("Column Means:")
print(df[numeric_columns].mean())
print("\nFilled Numeric Columns:")
print(df[numeric_columns].head())  # Print the filled columns to verify
print("\nFilled Text Columns:")
print(df[text_columns].head())

Explanation:
This code handles missing values in numeric and text columns separately.
Numeric columns are identified using select_dtypes with include=[np.number], and text
columns are identified with include=['object']. For numeric columns, missing values are
replaced with the column mean using the fillna method. For text columns, missing values
are replaced with the string "Unknown" using the same method. The updated DataFrame
is printed to verify the changes. This approach ensures missing values are handled
appropriately based on the data type, maintaining data integrity.
​ Output:

Ex15: Drop rows or columns with more than 50% missing data.

Code:
col_threshold = int(len(df) * 0.5)        # minimum non-missing values a column must have
row_threshold = int(df.shape[1] * 0.5)    # minimum non-missing values a row must have
df.dropna(thresh=col_threshold, axis=1, inplace=True)  # Drop columns with more than 50% missing data
df.dropna(thresh=row_threshold, axis=0, inplace=True)  # Drop rows with more than 50% missing data
Explanation:
The thresh parameter keeps only those columns (axis=1) or rows (axis=0) that contain at least 50% non-missing values. Since the student dataset has no row or column with more than 50% missing data, this step removes nothing here, but the same code applies when such gaps exist; a small sketch with artificial missing values follows below.
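A minimal sketch on a toy DataFrame with deliberately missing values, showing the same thresholds in action:

import pandas as pd
import numpy as np

# Toy DataFrame with artificial missing values (illustrative only)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 4],   # 75% missing, so this column gets dropped
    'c': [1, np.nan, 3, 4]
})

toy = toy.dropna(thresh=int(len(toy) * 0.5), axis=1)      # keep columns with at least 50% non-missing values
toy = toy.dropna(thresh=int(toy.shape[1] * 0.5), axis=0)  # then keep rows with at least 50% non-missing values
print(toy)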

Transforming Data
Ex16: Add a new column based on calculations using other columns​

Code:

df['total_grade'] = df[['english.grade', 'math.grade',
                        'sciences.grade', 'language.grade']].mean(axis=1)
df['grade_variance'] = df[['english.grade', 'math.grade', 'sciences.grade',
                           'language.grade']].var(axis=1)
print("\nNew Columns:")  # Display the new columns
print(df[['total_grade', 'grade_variance']].head())

Explanation:
In this example, new columns are added to the DataFrame based on calculations
using existing columns. The total_grade column is created by calculating the mean of
grades across four subjects for each row using the mean function. Similarly, the
grade_variance column is added by calculating the variance of the same grades using the
var function. These new columns provide insights into the overall performance and
consistency of grades for each student.

Output:
Ex17: Standardize numeric columns by subtracting the mean and dividing by the standard
deviation.

Code:​

numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns  # Standardize all numeric columns
for col in numeric_columns:  # Subtract the mean and divide by the standard deviation
    df[f'{col}_standardized'] = (df[col] - df[col].mean()) / df[col].std()

print("\nStandardized Numeric Columns:")  # Display the original and standardized columns
print(df[[col for col in numeric_columns] +
         [f'{col}_standardized' for col in numeric_columns]].head())

Explanation:
This example standardizes numeric columns in the DataFrame by transforming
each column to have a mean of 0 and a standard deviation of 1. For each numeric
column, the mean is subtracted, and the result is divided by the standard deviation. This
transformation ensures that all numeric data is on the same scale, making it suitable for
machine learning models or statistical analysis. The standardized values are stored in new
columns with "_standardized" appended to the original column names.

Output:


Ex18: Apply a lambda function to modify values in a column (e.g., convert text to
uppercase).

​ Code:
df['city_upper'] = df['city'].apply(lambda x: x.upper() if isinstance(x, str) else x)  # Convert the 'city' column to uppercase using a lambda function
print("\nCity Column in Uppercase:")
print(df[['city', 'city_upper']].head())  # Print the original and updated 'city' columns

Explanation:
In this example, a lambda function is applied to the city column to convert all text
values to uppercase. The apply method processes each value in the column, and the
lambda function checks if the value is a string before converting it to uppercase. The
modified values are stored in a new column named city_upper, while the original city
column remains unchanged.

Output:

4: Aggregation and Grouping

Grouping and Summarizing​



Ex19: Group data by a categorical column and calculate the mean of a numeric column.

​ Code:​
​ ​ group_means = df.groupby('nationality')['latitude'].mean() #mean of nationality
print(group_means)​

Explanation:
In this example, the data is grouped by the nationality column, and the mean of
the latitude column is calculated for each group. The groupby method groups the rows
based on unique values in the nationality column, and the mean function computes the
average latitude for each group. This approach is useful for summarizing and analyzing
data based on categories. The result is a Series with nationalities as the index and their
corresponding mean latitude values.
​ Output:



Ex20: Use multiple aggregation functions (e.g., sum, mean, max) on grouped data.

​ Code:
group_summary = df.groupby('nationality')[['latitude', 'longitude', 'english.grade']].agg(
    ['sum', 'mean', 'max'])  # Apply three aggregation functions at once
print(group_summary)

Explanation:
In this example, multiple aggregation functions (sum,mean and max) are applied
to the grouped data. The DataFrame is grouped by the nationality column, and the
specified numeric columns (latitude,longitude and english.grade) are aggregated using
these functions. The agg method allows applying multiple functions simultaneously,
resulting in a summary table with a hierarchical column structure, where each numeric
column displays the results for all three functions. This provides a detailed summary of
grouped data.

Output:
Ex21: Find the top 3 groups by their total values in a numeric column.

Code:

top_groups = df.groupby('nationality')['latitude'].sum().nlargest(3).index  # Top 3 nationalities by total latitude
print(top_groups)

Explanation:

This code identifies the top three groups in the "nationality" column based on the
total sum of values in the "latitude" column. The groupby function groups the data by
"nationality", and the sum function calculates the total "latitude" value for each group.
The nlargest function is used to find the top three groups with the highest total values.
Finally, the index of these top groups is printed, showing the group names.

​ Output:
Advanced-Level Exercises
5: Advanced Transformations

Reshaping Data
Ex25: Reshape data using melt() to convert wide data into long format.

Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv') # Load the dataset
long_format = pd.melt(
df,
id_vars=['id', 'name'],
value_vars=['english.grade', 'math.grade', 'sciences.grade', 'language.grade'],
var_name='Subject',
value_name='Grade'
) # Convert grade columns into long format
print("Long Format Data:")
print(long_format.head())

​ Explanation:
This code uses the melt function to reshape the DataFrame from a wide format to
a long format. The id_vars parameter specifies the columns that should remain
unchanged ("id" and "name"), while value_vars lists the columns to be unpivoted
("english.grade", "math.grade", "sciences.grade", "language.grade"). The resulting
long-format DataFrame has two new columns: "Subject", which contains the original
column names from value_vars, and "Grade", which contains the corresponding values.
This format is useful for analysis where data needs to be consolidated into a single
column for better processing or visualization.

Output:
Ex26: Use pivot() to transform long data into a summary table.

Code:

summary_table = long_format.pivot(index='id', columns='Subject',
                                  values='Grade')  # Transform long-format data back into a wide format
print("Summary Table:")
print(summary_table.head())

Explanation:
This code uses the pivot function to transform long-format data back into a
wide-format summary table. The index parameter specifies the column ("id") to use as
the row labels, the columns parameter defines which column ("Subject") will become the
new column headers, and the values parameter specifies the column ("Grade") containing
the data to populate the table. This creates a structured summary table where each row
represents an "id" and each column represents a "Subject" with the corresponding
"Grade" values, making it easier to compare observations across subjects.

Output:

Ex27: Stack and unstack hierarchical indices in a multi-index DataFrame.



​ Code:
multi_index_df = df.set_index(['id', 'gender'])  # Create a multi-index DataFrame
stacked = multi_index_df[['english.grade', 'math.grade']].stack()  # Stack the grade columns into the row index
unstacked = stacked.unstack()  # Unstack them back into columns
print("Stacked Data:")
print(stacked.head())
print("Unstacked Data:")
print(unstacked.head())

Explanation:
This code demonstrates how to manipulate hierarchical indices in a DataFrame
using stacking and unstacking. First, a multi-index DataFrame is created by setting "id"
and "gender" as hierarchical row indices. The stack function is then used to convert
specific columns ("english.grade" and "math.grade") into row-level indices, creating a
vertically compact structure. The unstack function reverses this process by converting
row-level indices back into columns, returning the DataFrame to its wide format. These
transformations are useful for reshaping and analyzing multi-dimensional data.

Output:

Date and Time Operations


Ex28: Convert a string column to a datetime format using pd.to_datetime().

Code:
import numpy as np

np.random.seed(0)  # Simulate a date column for demonstration
df['Date'] = pd.to_datetime('2023-01-01') + pd.to_timedelta(
    np.random.randint(0, 365, df.shape[0]), unit='d')  # Build datetime values with pd.to_datetime
print("Data with DateTime:")
print(df[['id', 'Date']].head())

Explanation:
This example converts a string column into a datetime format using the pandas
to_datetime function. Since the dataset did not originally include a date column, a
synthetic one was created by generating random days within a year starting from
"2023-01-01". The to_datetime function ensures that the data is in a proper datetime
format, which allows for operations like date extractions, sorting, and time-based
analysis. This process is useful for preparing date-related data for further analysis or
visualization.
Output:

Ex29: Extract the year, month, and weekday from a datetime column.

Code:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name() # Extract date components
print("Data with Year, Month, and Weekday:")
print(df[['Date', 'Year', 'Month', 'Weekday']].head())

Explanation:
This code extracts specific components from a datetime column named "Date" in
the DataFrame. The year is extracted using the dt.year attribute, the month using
dt.month, and the weekday name using dt.day_name(). These extracted components are
stored as new columns ("Year", "Month", "Weekday") in the DataFrame. This process is
useful for analyzing or visualizing data based on temporal trends, such as identifying
patterns by year, month, or day of the week. The updated DataFrame is then printed to
show the extracted information alongside the original "Date" column.

Output:

Ex30: Group sales data by month and calculate total monthly revenue.

Code:
monthly_grades = df.groupby(df['Date'].dt.to_period('M'))[
    ['english.grade', 'math.grade']].mean()  # Group by month and calculate the mean
print("Monthly Average Grades:")
print(monthly_grades)
Explanation:
This code groups the data by month using the groupby function along with the
to_period('M') method, which converts the datetime values to monthly periods. It then
calculates the mean of the selected columns ("english.grade" and "math.grade") for each
month. This aggregation allows you to analyze trends such as monthly averages or totals.
In the context of sales data, this could be used to calculate total revenue for each month,
but in this example, it is used to calculate the average grades for each month.

Output:
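As noted in the explanation, the same grouping applied to sales data gives total monthly revenue; a minimal sketch, assuming a hypothetical sales DataFrame with Date and Revenue columns:

import pandas as pd

# Hypothetical sales data; the values below are illustrative only
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-28']),
    'Revenue': [250.0, 400.0, 150.0, 600.0]
})

monthly_revenue = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()  # Total revenue per month
print(monthly_revenue)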

6: Performance Optimization

Optimizing DataFrames
Ex31: Reduce memory usage by converting columns to appropriate data types (e.g., float32
or category).

Code:

df['id'] = df['id'].astype('int32')
df['english.grade'] = df['english.grade'].astype('float32')  # Convert grades to float32 and id to int32
print("Reduced memory DataFrame:")
df.info()  # info() prints its summary directly

Explanation:
This code optimizes memory usage by converting columns in the DataFrame to
more efficient data types. The "id" column is converted to int32, which uses less memory
compared to the default int64 type. The "english.grade" column is converted to float32,
which reduces memory usage compared to the default float64 type. The astype() function
is used for this conversion. After converting, the info method is called to display the
updated DataFrame information, showing the reduced memory usage. The same idea applies to repetitive text columns, which can be stored with the category dtype, as sketched below. This technique is useful for working with large datasets to improve performance.

Output:
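A minimal sketch of the category conversion, assuming the gender and nationality columns used in earlier exercises:

import pandas as pd

df = pd.read_csv('student-dataset.csv')

# These columns repeat a small set of values, so 'category' stores them compactly
before = df[['gender', 'nationality']].memory_usage(deep=True).sum()
df['gender'] = df['gender'].astype('category')
df['nationality'] = df['nationality'].astype('category')
after = df[['gender', 'nationality']].memory_usage(deep=True).sum()

print("Bytes before:", before, "after:", after)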

Ex32: Use vectorized operations to replace loops for calculating new columns.

Code:

df['AverageGrade'] = df[['english.grade', 'math.grade', 'sciences.grade',
                         'language.grade']].mean(axis=1)  # Calculate the average grade using vectorized operations
print("Data with Average Grade:")
print(df[['id', 'AverageGrade']].head())

Explanation:
This code replaces the use of loops with a vectorized operation to calculate the
average grade. Instead of iterating over each row manually, it computes the average of the
specified columns (english.grade, math.grade, sciences.grade, language.grade) using the
mean function with axis=1, which operates across rows. This approach is much faster and
more efficient than loops, especially for large datasets. The result is stored in a new
column called AverageGrade, and the DataFrame is printed with the id and AverageGrade
columns for inspection.

Output:
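For contrast, a loop-based version of the same calculation (a sketch reusing the df and grade columns from the exercise above) shows what the single vectorized call replaces:

# Loop-based equivalent, shown only for comparison; the vectorized version above is preferred
averages = []
for _, row in df.iterrows():  # iterates row by row, which is slow on large DataFrames
    total = row['english.grade'] + row['math.grade'] + row['sciences.grade'] + row['language.grade']
    averages.append(total / 4)
df['AverageGrade_loop'] = averages

print(df[['AverageGrade', 'AverageGrade_loop']].head())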

Ex33: Process large datasets in chunks using read_csv() with chunksize.

Code:

chunk_iter = pd.read_csv('student-dataset.csv', chunksize=250)  # Process the dataset in chunks

for chunk in chunk_iter:
    print("Processing a chunk:")
    print(chunk['name'].head())

Explanation:

This code reads and processes a large dataset in chunks using the read_csv
function with the chunksize parameter. By specifying a chunk size of 250, the dataset is
read in smaller, manageable parts instead of all at once, which helps prevent memory
overload when working with large files that cannot fit into RAM. Each chunk is
processed individually within the loop, allowing for incremental analysis or
transformation of the data. In this example, the 'name' column is printed for the first few
rows of each chunk.

Output:
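Because each chunk is only a slice of the data, summary statistics have to be accumulated across chunks; a minimal sketch, assuming the same student-dataset.csv and its gender column:

import pandas as pd

total_rows = 0
gender_counts = pd.Series(dtype='int64')

for chunk in pd.read_csv('student-dataset.csv', chunksize=250):  # process one chunk at a time
    total_rows += len(chunk)
    gender_counts = gender_counts.add(chunk['gender'].value_counts(), fill_value=0)  # accumulate per-chunk counts

print("Total rows:", total_rows)
print("Gender counts across all chunks:\n", gender_counts)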
Working with Large Data
Ex34: Load and filter a dataset with over 1 million rows using Dask.

Code:

import os
import dask.dataframe as dd

def process_large_dataset_with_dask(data_path='large_student_data'):
    dask_df = dd.read_csv(os.path.join(data_path, '*.csv'))  # Read all CSV files in the directory
    filtered_df = dask_df[
        (dask_df['age'] >= 20) &           # Complex filtering and aggregation
        (dask_df['age'] <= 30) &
        (dask_df['english_grade'] > 80)
    ]
    grouped_stats = filtered_df.groupby('nationality').agg({
        'math_grade': ['mean', 'max'],     # Compute group-level statistics
        'english_grade': ['mean', 'min']
    }).compute()
    return grouped_stats

if __name__ == '__main__':
    dask_results = process_large_dataset_with_dask()
    print("\nDask Grouped Statistics:")
    print(dask_results)

Explanation:
This code uses Dask to process a large dataset, which can be spread across
multiple CSV files in a directory. It first loads all CSV files using dd.read_csv(), which
allows Dask to handle large datasets by breaking them into smaller chunks. The dataset is
filtered to include rows where the age is between 20 and 30, and the English grade is
above 80. After filtering, the data is grouped by nationality, and several statistics (mean,
max, min) for the math and English grades are computed using groupby() and agg(). The
compute() method is used to execute the calculations and return the results. Dask allows
for processing large datasets efficiently by utilizing parallel computing.

Output:
Ex35: Split a large dataset into smaller files and recombine them using concat().

Code:
import os
import pandas as pd

def split_and_recombine_dataset(data_path='large_student_data'):
    csv_files = [os.path.join(data_path, f) for f in os.listdir(data_path)
                 if f.endswith('.csv')]  # Collect all CSV files
    chunks = []
    for file in csv_files:  # Read each file in chunks
        chunk_iter = pd.read_csv(file, chunksize=50000)
        chunks.extend(list(chunk_iter))
    for chunk in chunks:
        print(chunk['id'])  # Inspect the 'id' column of each chunk
    full_dataset = pd.concat(chunks, ignore_index=True)  # Recombine using concat
    return full_dataset

Explanation:
This code splits a large dataset into smaller chunks and then recombines them
using concat(). First, it reads all the CSV files from the given directory and processes
them in chunks of 50,000 rows using pd.read_csv() with the chunksize parameter. These
chunks are stored in a list. Then, the code iterates through each chunk and prints the id
column from the chunk. After processing all the chunks, pd.concat() is used to combine
them into one large DataFrame. This allows for efficient handling of large datasets by
working with smaller pieces and then merging them back together.

Output:

Split Dataset:

Combined Dataset:
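The exercise also calls for splitting a large dataset into smaller files in the first place; a minimal sketch of that step (the source file, output directory, and chunk size below are assumptions, not taken from the original code):

import os
import pandas as pd

def split_dataset(source_csv='student-dataset.csv', out_dir='large_student_data', rows_per_file=100):
    # Write the source file out as several smaller CSV files
    os.makedirs(out_dir, exist_ok=True)
    for i, chunk in enumerate(pd.read_csv(source_csv, chunksize=rows_per_file)):
        chunk.to_csv(os.path.join(out_dir, f'part_{i}.csv'), index=False)

split_dataset()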

Real-World Applications
7: Case Studies

Customer Analytics
Ex36: Load a dataset of customer transactions and calculate: Total revenue by customer.
Average order value for each customer. Top 10 customers by total spending.

Dataset:

The dataset, customer-transactions.csv, contains the following columns:

●​ customer_id: Unique ID for each customer.


●​ transaction_id: Unique ID for each transaction.
●​ transaction_date: Date of the transaction.
●​ amount: The total amount for the transaction.

Objective:

1.​ Calculate the total revenue generated by each customer.


2.​ Calculate the average order value (AOV) for each customer.
3.​ Identify the top 10 customers by total spending.

Code:
import pandas as pd

data = {
'customer_id': [1, 2, 1, 3, 2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'transaction_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
115],
'transaction_date': [
'2023-01-01', '2023-01-03', '2023-01-05',
'2023-01-10', '2023-01-12', '2023-01-15', '2023-01-20',
'2023-01-25', '2023-01-28', '2023-02-01', '2023-02-05',
'2023-02-10', '2023-02-15', '2023-02-20', '2023-02-25'
],
'amount': [50.5, 75.0, 60.0, 120.0, 40.0, 90.0, 80.0, 200.0, 150.0, 300.0, 250.0, 100.0,
130.0, 220.0, 310.0]
}# Create a sample dataset

df = pd.DataFrame(data)  # Load the data into a DataFrame

total_revenue = df.groupby('customer_id')['amount'].sum().reset_index()
total_revenue.columns = ['customer_id', 'total_revenue']  # Calculate total revenue by customer

average_order_value = df.groupby('customer_id')['amount'].mean().reset_index()
average_order_value.columns = ['customer_id', 'average_order_value']  # Calculate the average order value for each customer

customer_analytics = pd.merge(total_revenue, average_order_value, on='customer_id')  # Merge total revenue and AOV

top_customers = customer_analytics.sort_values(by='total_revenue',
                                               ascending=False).head(10)  # Identify the top 10 customers by total spending

print("Customer Analytics (Total Revenue & AOV):\n", customer_analytics)
print("\nTop 10 Customers by Total Spending:\n", top_customers)  # Display the results

Output:

Sales Analysis
Ex37: Analyze a dataset with fields like Product, Region, Sales, and Profit: Identify the
most profitable product category in each region. Visualize monthly sales trends using
Matplotlib or Seaborn.

Objective:

1.​ Identify the most profitable product category in each region.


2.​ Visualize monthly sales trends using Matplotlib or Seaborn.
Code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
'Product': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West', 'North', 'South',
'East', 'West'],
'Sales': [1000, 1500, 1200, 1800, 1400, 2000, 1600, 1900, 1700, 1100, 1500, 2200],
'Profit': [300, 400, 350, 500, 450, 600, 550, 700, 500, 300, 400, 800],
'Month': ['2023-01', '2023-01', '2023-01', '2023-01', '2023-02', '2023-02', '2023-02',
'2023-02', '2023-03', '2023-03', '2023-03', '2023-03']
}# Sample dataset for Sales Analysis

df = pd.DataFrame(data)  # Create a DataFrame

most_profitable = df.groupby(['Region', 'Product'])['Profit'].sum().reset_index()
most_profitable = most_profitable.loc[most_profitable.groupby('Region')['Profit'].idxmax()]
print("Most Profitable Product Category in Each Region:\n", most_profitable)  # Identify the most profitable product category in each region

df['Month'] = pd.to_datetime(df['Month'])  # Visualize monthly sales trends
monthly_sales = df.groupby('Month')['Sales'].sum().reset_index()

plt.figure(figsize=(10, 6))  # Plot sales trends
sns.lineplot(data=monthly_sales, x='Month', y='Sales', marker='o')
plt.title('Monthly Sales Trends')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.show()

Output:
Explanation:

1.​ Sales Analysis:


○	We grouped by Region and Product to calculate the total profit for each product in each region.
○​ Used idxmax() to find the product with the highest profit in each region.
○​ Visualized monthly sales trends with a line chart.

Inventory Management
Ex38: Use a dataset with fields like Product ID, Stock, Reorder Level: Identify products
below the reorder level. Calculate the total inventory value for each product.

Inventory Management

Objective:

1.​ Identify products below the reorder level.


2.​ Calculate the total inventory value for each product.

Code:
import pandas as pd

inventory_data = {
    'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Stock': [50, 20, 15, 80, 30],
    'Reorder_Level': [40, 30, 20, 50, 25],
    'Unit_Price': [10, 15, 20, 5, 8]
}  # Sample dataset for Inventory Management

inventory_df = pd.DataFrame(inventory_data)  # Create a DataFrame

below_reorder = inventory_df[inventory_df['Stock'] < inventory_df['Reorder_Level']]  # Identify products below the reorder level
print("Products Below Reorder Level:\n", below_reorder)

inventory_df['Total_Value'] = inventory_df['Stock'] * inventory_df['Unit_Price']  # Calculate the total inventory value for each product
print("\nTotal Inventory Value for Each Product:\n",
      inventory_df[['Product_ID', 'Total_Value']])

Output:
Explanation:

1.​ Inventory Management:


○	Compared Stock with Reorder_Level to filter products needing replenishment.
○	Calculated the total inventory value as Stock * Unit_Price.

8: Visualization and Reporting

Data Visualization
Ex39: Create a bar chart to compare total sales by region.

Objective:

Visualize the total sales in each region using a bar chart

Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Sales': [1000, 1500, 1200, 1800, 1400, 1100, 2000, 1900],
}# Sample dataset for sales analysis

df = pd.DataFrame(data)  # Create a DataFrame

sales_by_region = df.groupby('Region')['Sales'].sum().reset_index()  # Group by region and calculate total sales

plt.figure(figsize=(8, 5))  # Plot bar chart
sns.barplot(data=sales_by_region, x='Region', y='Sales', palette='viridis')
plt.title('Total Sales by Region', fontsize=14)
plt.xlabel('Region', fontsize=12)
plt.ylabel('Total Sales', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
Output:

Ex40: Plot a Correlation Heatmap

This example involves visualizing the correlations between numerical fields in the dataset. We'll
use Seaborn to plot the heatmap.

Code:
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [200, 300, 250, 400, 100, 150, 120, 180],
    'Profit': [50, 70, 60, 100, 20, 30, 25, 40],
    'Discount': [10, 15, 5, 10, 5, 10, 3, 7]
}  # Add additional fields to the dataset
df = pd.DataFrame(data)

correlation_matrix = df[['Sales', 'Profit', 'Discount']].corr()  # Compute the correlation matrix

plt.figure(figsize=(8, 5))  # Plot the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap', fontsize=14)
plt.show()
Output:

The heatmap will display the correlations between Sales, Profit, and Discount:

●​ Positive correlations will appear in warm colors (e.g., red).


●​ Negative correlations will appear in cool colors (e.g., blue).
