Data Frame
1.2 Creating DataFrames from Various Sources: - Lists - Dictionaries - NumPy Arrays
LISTS:
import pandas as pd # Importing the pandas library.
data = [[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']] # The first value in each inner list is an ID, and the second is a name.
df = pd.DataFrame(data, columns=['ID', 'Name']) # Creating a DataFrame from the list of lists, with column names explicitly specified.
print(df) # Printing the DataFrame
Output:
The code creates a DataFrame from a list of lists, explicitly defining column
names (ID and Name) for each row of data.
DICTIONARIES:
Code:
import pandas as pd # Import pandas library.
data = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']} # Define data as a dictionary with column names as keys.
df = pd.DataFrame(data) # Create a DataFrame from the dictionary.
print(df) # Print the DataFrame
Output:
The code creates a DataFrame using a dictionary, where keys represent column
names, and values are lists of data, making it more structured and readable.
NUMPY ARRAYS:
Code:
import numpy as np # Importing NumPy
import pandas as pd # Importing pandas
data = np.array([[1, 'Alice'], [2, 'Bob'], [3, 'Charlie']]) # Creating a NumPy array with rows containing ID and Name data.
df = pd.DataFrame(data, columns=['ID', 'Name']) # Converting the NumPy array into a pandas DataFrame and specifying column names.
print(df) # Printing the DataFrame
Output:
The code creates a NumPy array with ID and Name data and converts it into a
pandas DataFrame. It then prints the DataFrame, displaying the data in a tabular format
with labeled columns.
Excel:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Reads data from an Excel file (data.xlsx) from the specified sheet (Sheet1)
into a pandas DataFrame.
JSON:
df = pd.read_json('data.json')
Loads data from a JSON file (data.json) into a pandas DataFrame
SQL:
Here we connect to an SQLite database, execute a query to fetch data from a table, and load the result into a pandas DataFrame.
Code:
import sqlite3 # Import sqlite3 for SQLite database interaction.
import pandas as pd # Import pandas for data manipulation.
conn = sqlite3.connect('example.db') # Connect to SQLite database.
query = "SELECT * FROM data" # Define SQL query to fetch all rows from
table.
df = pd.read_sql_query(query, conn) # Execute query and store result in
DataFrame.
print(df) # Print the DataFrame.
Output:
The code sends a GET request to the API and retrieves the response in JSON
format. It then converts the JSON data into a pandas DataFrame for easy manipulation.
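A minimal sketch of such a request (the API URL is a placeholder, not from the original example):
import requests # Library for sending HTTP requests
import pandas as pd
response = requests.get('https://api.example.com/data') # Send a GET request to the API (placeholder URL)
json_data = response.json() # Parse the response body as JSON
df = pd.DataFrame(json_data) # Convert the JSON records into a DataFrame
print(df.head())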
The code reads a CSV file directly from the provided URL and loads its content
into a pandas DataFrame for data analysis and processing.
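A minimal sketch of reading a CSV directly from a URL (the URL is a placeholder):
import pandas as pd
url = 'https://example.com/data.csv' # Placeholder URL pointing to a CSV file
df = pd.read_csv(url) # pandas downloads and parses the file directly from the URL
print(df.head())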
3. Inspecting Data
3.1 Viewing the Head, Tail, and Random Samples
head(n): Displays the first n rows of the DataFrame. Defaults to 5 if n is not specified.
tail(n): Displays the last n rows of the DataFrame. Defaults to 5 if n is not specified.
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv') # Load the dataset
df.head(n):
Use this to check the initial rows of your dataset. It helps ensure the dataset was loaded correctly and provides insight into the structure of the data.
df.head(3)
Output:
df.tail(n):
Use this to view the last rows of your dataset. It’s useful for verifying data
integrity, particularly in large datasets.
df.tail(3)
Output:
df.sample(n):
This function provides a random subset of rows. It’s useful for checking the
diversity of your data without viewing the entire dataset.
df.sample(3)
Output:
columns:
Lists all column names in the DataFrame.
dtypes:
Displays the data types of each column.
info():
Provides a detailed summary of the DataFrame, including column names, data
types, and non-null values.
Code:
import pandas as pd
df = pd.read_csv('sdata.csv') # Load the dataset (assuming 'sdata.csv' is in the current working directory)
print("Column Names:")
print(df.columns) # View column names
print("\nData Types of Columns:")
print(df.dtypes) # View data types
print("\nDetailed DataFrame Information:")
print(df.info()) # View detailed information
Explanation
print(df.columns)
Output:
3.3 Checking Data Types (.dtypes)
Example:
print(df.dtypes)
Output:
● Numeric columns like latitude, longitude, and age have correct data types (e.g., float64
or int64).
● Categorical columns like name, nationality, and gender are of type object.
3.4 Descriptive Statistics with describe(), info(), and value_counts()
describe()
● Generates summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns.
info()
● Provides a detailed summary of the DataFrame (e.g., non-null values, data types).
value_counts()
● Counts the occurrences of each unique value in a column.
Explanation:
print(df.describe())
Output:
● Categorical Data: Use include=["object"] to summarize categorical columns.
print(df.describe(include=["object"]))
Output:
The name column has 200 unique entries, while gender has 2 (e.g., Male/Female).
print(df.info())
Output:
3. Value Counts with value_counts()
Use this for analyzing the frequency of unique values in a specific column.
print(df["gender"].value_counts())
Output:
1. Selecting Columns:
● Single Column: You can access a single column by using the column name inside
square brackets []. For example, df["name"] gives you all the values in the "name"
column.
2. Selecting Rows:
● By Index: You can access rows by their index position using .iloc[]. For example,
df.iloc[0] will return the first row of the DataFrame (index 0).
● By Label: If you want to select rows using labels (row names), you use .loc[]. For
example, df.loc[1] gives the second row (because labels are usually 0-indexed).
● To retrieve a specific value at the intersection of a row and column, you can use
both .iloc[] (for position-based access) and .loc[] (for label-based access). For
example:
○ df.iloc[0]["name"]: This fetches the name of the first row in the "name"
column.
○ df.loc[1, "name"]: This fetches the value from the "name" column in the
row with label 1.
4. Slicing Rows:
● Slicing Rows: To get a subset of rows, use .iloc[] with slice notation. For
instance, df.iloc[:5] will return the first 5 rows. The slice notation can be adjusted
to get a specific range of rows.
5. Conditional Selection:
● Filtering Rows Based on Conditions: You can filter rows by applying conditions
to columns. For example, df[df["age"] > 20] returns all rows where the "age"
column is greater than 20. The condition can be customized as needed, such as for
text values or numerical ranges.
Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
print(df["name"].head()) # Displays the first 5 names
print(df.iloc[0]) # Displays the first row
print(df.iloc[0]["name"]) # Displays the 'name' from the first row
print(df.iloc[:5]) # Displays the first 5 rows
print(df[df["age"] > 20]) # Displays rows where age is greater than 20
Explanation:
1. df["name"].head():
○ This will show the first 5 entries of the "name" column
Output:
2. df.iloc[0]:
○ This will display the entire first row (as a pandas Series).
Output:
3. df.iloc[0]["name"]:
○ This will return the value in the "name" column of the first row.
Output:
4. df.iloc[:5]:
○ This returns the first 5 rows of the DataFrame.
Output:
5. df[df["age"] > 20]:
○ This returns the rows where the "age" column is greater than 20.
Output:
iloc[] is used to select data by position (i.e., integer-based indexing). It allows you to
access rows and columns based on their position in the DataFrame.
Syntax:
df.iloc[row_index, column_index]
● row_index: The integer index for the row you want to select (starting from 0).
● column_index: The integer index for the column you want to select (starting from 0).
Code:
print(df.iloc[0]) # Select the first row
print(df.iloc[0, 0])# Select the first row and first column
print(df.iloc[:5, :])# Select the first 5 rows and all columns
print(df.iloc[:3, :2])# Select specific rows and columns (first 3 rows, first 2 columns)
Explanation:
● df.iloc[0]: Returns the first row as a pandas Series.
Output:
● df.iloc[0, 0]: Returns the value at the first row and first column.
Output:
● df.iloc[:5, :]: Returns the first 5 rows and all columns.
Output:
● df.iloc[:3, :2]: Returns the first 3 rows and the first 2 columns.
Output:
In this section, we will explore how to rename and reorder columns in a Pandas
DataFrame using two main techniques: rename() for renaming columns and reindex() for
reordering columns.
You can rename one or more columns in a DataFrame using the rename() method.
This method takes a dictionary where the keys are the current column names, and the
values are the new names you want to assign to those columns.
Syntax:
df.rename(columns={"old_name1": "new_name1", "old_name2": "new_name2"},
inplace=True)
columns:
A dictionary with old column names as keys and new column names as values.
inplace=True:
If set to True, the DataFrame is modified in place; if False, a new DataFrame with
the updated column names is returned
Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
df.rename(columns={"name": "full_name", "age": "years"}, inplace=True)
print(df.head()) # Rename columns "name" to "full_name" and "age" to "years"
Output:
After running the code, the DataFrame's "name" column will be renamed to "full_name,"
and the "age" column will be renamed to "years."
You can reorder the columns of a DataFrame by passing a list of the column names in the
desired order to the reindex() method.
Syntax:
df = df.reindex(columns=[list_of_column_names_in_desired_order])
Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
df = df.reindex(columns=["gender", "name", "age", "nationality", "city"])# Reorder columns: move "gender" to the first column and reorder the others
print(df.head())
Output:
The gender column will be moved to the first position, and the remaining columns
will follow the order as specified in the list.
isna() identifies missing values (NaN), and sum() counts them in each
column.
notna():
Code:
import pandas as pd
df = pd.read_csv("student.csv") # Load the dataset
print(df.notna()) # Returns True for non-missing values and False for missing ones
Output:
Code:
df['english.grade'].fillna(df['english.grade'].mean(), inplace=True)# Replace NaN with the column's mean value
print("Filled missing values in 'english.grade' with the mean:")# Confirmation message
print(df['english.grade']) # Display the updated 'english.grade' column
Output:
This replaces all NaN values in the english.grade column with the mean
of the column.
Code:
df['math.grade'].fillna(df['math.grade'].median(), inplace=True) # Replace NaN with the column's median value
print("Filled missing values in 'math.grade' with the median:")
print(df['math.grade']) # Display the updated 'math.grade' column
Output:
This replaces all NaN values in the math.grade column with the median of
the column.
Code:
df['ethnic.group'].fillna("Unknown", inplace=True) # Replace missing values in the 'ethnic.group' column with the custom value 'Unknown'
print("Filled missing values in 'ethnic.group' with a custom value:")
print(df['ethnic.group'])
Output:
This replaces all NaN values in the ethnic.group column with the
custom string "Unknown"
5.3 Dropping Rows or Columns with Null Values
Code:
df_no_missing_rows = df.dropna() # Drop rows that contain any null values
print(df_no_missing_rows.head())
Output:
Code:
df_no_missing_cols = df.dropna(axis=1) # Drop columns that contain any null values
print(df_no_missing_cols.head())
Output:
Code:
advanced_filtered_df = df[(df['math.grade'] > 3.5) &
(df['english.grade'] > 3.5)]# Filter rows where 'math.grade' > 3.5 and
'english.grade' > 3.5
filtered_result = advanced_filtered_df[['id', 'math.grade',
'english.grade']]# Select only the 'id', 'math.grade', and 'english.grade' columns
print("Filtered rows with 'math.grade' > 3.5 and 'english.grade' > 3.5 (only ID and
grades):")
print(filtered_result)# Print the filtered result
Output:
Filters rows where both math.grade and english.grade are greater than 3.5.
Output:
This code calculates the average of grades for each student across three
subjects and adds it as a new column average.grade to the DataFrame.
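A minimal sketch of this step, assuming the three subject columns are english.grade, math.grade, and sciences.grade:
df['average.grade'] = df[['english.grade', 'math.grade', 'sciences.grade']].mean(axis=1) # Row-wise mean across the three subjects
print(df[['id', 'average.grade']].head())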
7.2 Dropping Unnecessary Columns or Rows
# Drop the 'latitude' and 'longitude' columns as they are unnecessary
df_dropped_columns = df.drop(columns=['latitude', 'longitude']) # Remove specific columns from the DataFrame
print("Dropped 'latitude' and 'longitude' columns:")
print(df_dropped_columns.head()) # Display the first few rows without the dropped columns
Output:
7.3 Concatenating, Merging, and Joining DataFrames
# Create a new DataFrame for demonstration
extra_data = pd.DataFrame({
'id': [1, 2],
'name': ['Alex Doe', 'Taylor Smith'],
'nationality': ['Canada', 'UK'],
'city': ['Toronto', 'London'],
'english.grade': [3.7, 4.0],
'math.grade': [3.5, 3.8],
'sciences.grade': [3.6, 4.0]
})
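A minimal sketch of combining this with the student data, assuming df is the student DataFrame loaded earlier:
combined_df = pd.concat([df, extra_data], ignore_index=True) # Stack the new rows under the existing ones
print(combined_df.tail()) # The appended rows appear at the end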
Output:
Pandas apply() function is a versatile tool that allows you to apply a function along an axis (rows
or columns) of a DataFrame. You can use apply() with a lambda function to perform
transformations or calculations on each element.
Syntax:
df['new_column'] = df['column_name'].apply(lambda x: expression)
Example:
Let's say we want to create a new column that categorizes ages into groups like "Young",
"Middle-aged", and "Old".
Code:
import pandas as pd
df = pd.read_csv("student-dataset.csv")
Output:
8.2 One-Hot Encoding for Categorical Variables
One-Hot Encoding is a technique used to convert categorical variables into a form that
can be provided to ML algorithms to do a better job in prediction. It involves creating a new
binary column for each category/label in the categorical variable.
Syntax:
pd.get_dummies(df['column_name'])
Example:
Let's say the gender column contains "Male" and "Female," and we want to
convert this into binary columns.
Code:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'], 'gender': ['Female', 'Male', 'Male']} # Sample DataFrame (illustrative values)
df = pd.DataFrame(data)
gender_dummies = pd.get_dummies(df['gender']) # One binary column per category
print(gender_dummies)
Syntax:
pd.cut(df['column_name'], bins=[...], labels=[...])
Example:
Let's bin the age column into 3 age groups: "0-25", "26-50", and "51+".
Code:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [22, 45, 67]} # Sample DataFrame (illustrative values)
df = pd.DataFrame(data)
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 50, 120], labels=['0-25', '26-50', '51+']) # Bin ages into three groups
print(df)
Output:
Explanation:
When dealing with date and time data, often dates are represented as strings. To perform
any operations (like sorting, filtering, or calculations), we need to convert these strings into
datetime objects.
Code:
import pandas as pd
data = {'event': ['Event A', 'Event B', 'Event C'], # Sample DataFrame with date as string
        'date': ['2023-01-01', '2023-02-15', '2023-03-30']} # Illustrative date strings
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date']) # Convert the string column to datetime
print(df)
The pd.to_datetime() function converts the date column from a string format to a
datetime object, making it easier to manipulate and extract components like day, month, year,
etc.
Once dates are in datetime format, it's easy to extract individual components, such as the day,
month, year, or even the weekday.
Code:
import pandas as pd
data = {'event': ['Event A', 'Event B', 'Event C'], # Sample DataFrame
        'date': ['2023-01-01', '2023-02-15', '2023-03-30']} # Illustrative date strings
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['weekday'] = df['date'].dt.weekday
Output:
Explanation:
You can calculate the difference between two dates or work with time zones to handle
timestamps accurately.
Code:
import pandas as pd
data = {'start_date': ['2023-01-01', '2023-02-01'], 'end_date': ['2023-01-10', '2023-02-20']} # Sample DataFrame (illustrative dates)
df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
print(df)
Output:
Explanation:
Date Differences: Subtracting one datetime column from another produces a Timedelta (e.g., the number of days between the start and end dates).
Time Zones: tz_localize() attaches a time zone to naive timestamps, and tz_convert() converts between time zones, as sketched below.
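A small sketch of both ideas, using the start_date and end_date columns from the sample DataFrame above (the UTC and New York time zones are illustrative):
df['duration'] = df['end_date'] - df['start_date'] # Difference between two dates (a Timedelta)
df['start_date_utc'] = df['start_date'].dt.tz_localize('UTC') # Attach a time zone to naive timestamps
df['start_date_ny'] = df['start_date_utc'].dt.tz_convert('America/New_York') # Convert to another time zone
print(df)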
Reshaping data allows you to rearrange data into the desired format, making it easier for
analysis and visualization.
Pivot tables allow you to summarize and aggregate data using rows and columns as
indices.
Code:
import pandas as pd
data = {
'student': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice', 'Bob'],
'subject': ['Math', 'Math', 'English', 'English', 'Science', 'Science'],
'score': [85, 78, 92, 88, 91, 84]
}# Sample DataFrame
df = pd.DataFrame(data)
pivot_table = df.pivot_table(values='score', index='student',
columns='subject', aggfunc='mean')# Creating a pivot table
print(pivot_table)# Display the pivot table
Output:
Explanation:
These methods help transform data from wide to long format for easier analysis.
Code:
melted_df = pd.melt(pivot_table.reset_index(), id_vars=['student'], var_name='subject', value_name='score')# Melt: Wide to long format
print("Melted DataFrame:")
print(melted_df)
stacked_df = pivot_table.stack()# Stack: Converts columns to rows (reshapes pivot_table)
print("\nStacked DataFrame:")
print(stacked_df)
Output:
Explanation:
● melt(): Unpivots the data into a long format. Each row represents a unique combination
of id_vars and value_name.
● stack(): Moves columns into rows, creating a MultiIndex DataFrame.
Output:
Explanation:
unstack() is the inverse of stack(). It pivots the innermost row index level into columns.
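A one-line sketch, assuming the stacked_df produced in the previous example:
unstacked_df = stacked_df.unstack() # Pivot the innermost row index level back into columns
print(unstacked_df)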
Grouping Data
What is Grouping?
Grouping is the process of splitting data into subsets based on one or more criteria
(columns) and then applying an operation to these subsets. This is useful for summarizing or
analyzing data by categories.
For example:
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset (assuming 'student-dataset.csv' is in the current working directory)
group_by_gender = df.groupby('gender').mean(numeric_only=True)# Grouping by gender
print("Grouped by Gender:\n", group_by_gender)
Output:
Explanation:
Aggregation Functions
Aggregation functions summarize data using operations like mean, sum, or count.
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset (assuming 'student-dataset.csv' is in the current working directory)
aggregation = df.groupby('gender').agg({  # Grouping by gender
    'age': ['mean', 'min', 'max'],        # Multiple aggregations for age
    'portfolio.rating': 'sum',            # Total portfolio rating
    'refletter.rating': 'count'           # Count of reference ratings
})
print("Aggregated Data:\n", aggregation)
Output:
Explanation:
Custom aggregations are used for more specific calculations. For instance, calculating the
percentage of high ratings in a group.
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
custom_agg = df.groupby('gender').agg(# Custom aggregation example
high_ratings_pct=('portfolio.rating', lambda x: (x > 7).mean() * 100)
)
print("Custom Aggregation (High Ratings Percentage):\n", custom_agg)
Output:
Explanation:
agg() Method:
Here, ('portfolio.rating', lambda x: (x > 7).mean() * 100) checks which rows in the
portfolio.rating column have values greater than 7, calculates the proportion, and
converts it into a percentage.
Lambda Function:
The lambda x: (x > 7).mean() * 100 calculates the percentage of values in the
group where portfolio.rating > 7.
groupby('gender'):
Result:
The output shows the percentage of high ratings for each gender group.
12.Advanced Grouping Techniques
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
grouped = df.groupby(['gender', 'ethnic.group'])[['portfolio.rating', 'coverletter.rating']].mean()# Perform hierarchical grouping by 'gender' and 'ethnic.group'
print("Hierarchical Grouping:\n", grouped)
Output:
Explanation:
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
df['mean_age_by_gender'] = df.groupby('gender')['age'].transform('mean')# Add a new column showing the mean age within each 'gender' group
print("Data with Mean Age by Gender:\n", df[['gender', 'age', 'mean_age_by_gender']])
Output:
Explanation:
● transform() applies a function to each group and returns a Series with the same index as
the original DataFrame.
● Here, it calculates the mean age for each gender group and adds it as a new column.
Code:
import pandas as pd
agg_results = df.groupby('gender').agg({
'age': ['mean', 'max'], # Apply mean and max to 'age'
'portfolio.rating': 'sum', # Apply sum to 'portfolio.rating'
'coverletter.rating': 'min' # Apply min to 'coverletter.rating'
})# Aggregate with different functions for different column
print("Aggregating Multiple Columns with Different Functions:\n", agg_results)
Output:
Explanation:
● Uses the agg() method to specify different functions for each column.
● For example:
○ mean and max are applied to age.
○ sum is applied to portfolio.rating.
○ min is applied to coverletter.rating.
Part 6: Advanced Operations
Output:
Code:
import pandas as pd
student_details = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'nationality': ['USA', 'UK', 'India']
})# DataFrames for demonstration
additional_scores = pd.DataFrame({
'id': [1, 2],
'sports.score': [85, 90]
})# Ensure 'id' columns are of the same type
student_details['id'] = student_details['id'].astype(int)
additional_scores['id'] = additional_scores['id'].astype(int)
merged_data = student_details.merge(additional_scores, on='id', how='left') # Merge student details with additional scores on 'id' (left join)
print("Merged DataFrame with additional scores:")
print(merged_data) # Display merged result
concat_data = pd.concat([student_details, additional_scores], ignore_index=True)# Combine rows
print("\nConcatenated DataFrame:")
print(concat_data) # Display concatenated data
Output:
# Calculate a rolling average and standard deviation for the 'math.grade' column (window size = 3)
df['math_grade_rolling_avg'] = df['math.grade'].rolling(window=3).mean() # Rolling average
df['math_grade_rolling_std'] = df['math.grade'].rolling(window=3).std() # Rolling standard deviation
Output:
Rolling average and standard deviation are calculated over a window of 3 rows
for 'math.grade'.
Output:
Cumulative sum and mean track metrics from the beginning of the dataset for
'english.grade'.
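A minimal sketch of these cumulative metrics, assuming df is the student dataset loaded earlier:
df['english_grade_cumsum'] = df['english.grade'].cumsum() # Running total of 'english.grade'
df['english_grade_cummean'] = df['english.grade'].expanding().mean() # Running mean of 'english.grade'
print(df[['english.grade', 'english_grade_cumsum', 'english_grade_cummean']].head())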
14.3 Ranking Rows with rank() and Custom Sorting
Code:
import pandas as pd
df = pd.read_csv("student.csv") # Load the dataset
Output:
This code ranks students based on custom sorting by multiple columns, first by
math.grade in descending order, then by english.grade in ascending order. The rank()
function is used to assign ranks based on the sorted data.
Visualizing data is essential for identifying patterns, relationships, and trends. Using
Matplotlib, you can create various plots to understand the data better. We'll demonstrate basic
visualization techniques using the student-dataset.csv.
15.1 Plotting DataFrame Columns with Matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset
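A minimal sketch consistent with the explanation below (titles and labels are illustrative):
for gender, group in df.groupby('gender'): # One line per gender group
    plt.plot(group['id'], group['age'], label=gender) # Plot age against id
plt.title('Age by Student ID, per Gender')
plt.xlabel('id')
plt.ylabel('age')
plt.legend()
plt.grid(True)
plt.show()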
Output:
Explanation:
1. Groups data by gender and plots the age against id for each group.
2. plt.plot() creates a line chart for each gender, differentiated by a label.
3. Adds title, axis labels, and grid for better readability.
15.2 Creating Histograms, Line Charts, and Scatter Plots
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset
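A minimal sketch of a histogram, assuming the 'age' column of the student dataset (the bin count is illustrative):
plt.hist(df['age'], bins=10, edgecolor='black') # Histogram of student ages
plt.title('Distribution of Age')
plt.xlabel('age')
plt.ylabel('frequency')
plt.show()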
Output:
Explanation:
Code:
import pandas as pd
avg_grades = df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade']].mean()# Calculate average grades for each subject
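A minimal plotting sketch consistent with the explanation below (titles and labels are illustrative):
import matplotlib.pyplot as plt
avg_grades.plot(kind='line', marker='o') # Line chart of the average grade per subject
plt.title('Average Grade by Subject')
plt.ylabel('average grade')
plt.show()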
Output:
Explanation:
● The mean() function calculates the average grade for each subject.
● The plot(kind='line') method generates a line chart.
● marker='o' adds circular markers at each data point.
Code:
import pandas as pd
import matplotlib.pyplot as plt
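A minimal sketch consistent with the explanation below (the dataset filename and labels are assumed from earlier examples):
df = pd.read_csv('student-dataset.csv') # Load the dataset
plt.scatter(df['portfolio.rating'], df['coverletter.rating'], alpha=0.7) # Portfolio vs. cover-letter rating
plt.title('Portfolio Rating vs. Cover Letter Rating')
plt.xlabel('portfolio.rating')
plt.ylabel('coverletter.rating')
plt.show()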
Output:
Explanation:
● plt.scatter() plots the portfolio.rating on the x-axis and coverletter.rating on the y-axis.
● alpha=0.7 adjusts the transparency to improve clarity.
16.Advanced Visualization
Visualizing data is essential for identifying patterns, relationships, and trends. Using
Matplotlib, you can create various plots to understand the data better. We'll demonstrate basic
visualization techniques using the student-dataset.csv.
Bar plots are used to visualize aggregated data, such as averages or counts, for categorical
variables.
Code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('student-dataset.csv')# Load the dataset
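A minimal sketch consistent with the explanation below (titles and labels are illustrative):
avg_by_gender = df.groupby('gender')[['english.grade', 'math.grade', 'sciences.grade']].mean() # Mean grades per gender
avg_by_gender.plot(kind='bar') # Bar plot of the averages
plt.title('Average Grades by Gender')
plt.ylabel('average grade')
plt.show()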
Output:
Explanation:
● Grouped the dataset by gender to calculate the mean grades for English, Math, and
Sciences.
● Created a bar plot for visualizing the average grades by gender.
Box plots help visualize the distribution, spread, and outliers in data across categories.
Code:
import pandas as pd
import matplotlib.pyplot as plt
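A minimal sketch of a box plot, assuming the student dataset and the grade columns used throughout this section:
df = pd.read_csv('student-dataset.csv') # Load the dataset
df[['english.grade', 'math.grade', 'sciences.grade']].plot(kind='box') # Distribution, spread, and outliers per subject
plt.title('Grade Distributions by Subject')
plt.show()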
Output:
Explanation:
Correlation heatmaps are used to visualize the relationship between numeric columns.
Code:
import seaborn as sns
import matplotlib.pyplot as plt
corr_matrix = df.select_dtypes(include='number').corr()# Correlation matrix of numeric columns (df is the student dataset loaded earlier)
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)# Create a heatmap
plt.title("Correlation Heatmap")
plt.show()
Output:
Explanation:
x = [1, 2, 3, 4, 5]
y = [10, 11, 12, 13, 14]# Create some data for the plot
Plotly allows you to create interactive visualizations, such as line, bar, and scatter
plots, with built-in features like zoom, hover, and pan.
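A minimal sketch using Plotly Express with the x and y lists defined above (the title is illustrative):
import plotly.express as px # Plotly's high-level plotting interface
fig = px.line(x=x, y=y, title='Interactive Line Plot') # Interactive chart with zoom, hover, and pan
fig.show()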
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
df['total_grade_loop'] = 0
for i in range(len(df)):
df.loc[i, 'total_grade_loop'] = (
df.loc[i, 'english.grade'] +
df.loc[i, 'math.grade'] +
df.loc[i, 'sciences.grade'] +
df.loc[i, 'language.grade']
)
print(df[['id', 'name', 'english.grade', 'math.grade', 'sciences.grade', 'language.grade',
'total_grade_loop']])# Print the resulting DataFrame
Output:
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv')# Load the dataset
● Concept: For very large datasets that cannot fit into memory, use Dask. It splits the
DataFrame into smaller chunks and processes them in parallel.
● Why? Dask allows working with datasets larger than memory by processing them in
chunks.
import dask.dataframe as dd
dask_df = dd.from_pandas(df, npartitions=4)# Split the pandas DataFrame into partitions (the partition count is illustrative)
result = dask_df.groupby('gender')['portfolio.rating'].mean().compute()# Perform an aggregation
print("Mean portfolio rating by gender:\n", result)
Output:
18.Case Studies
18.1 Analyzing Customer Purchases: Identifying Top Customers and Products
In 2020, Amazon, one of the largest e-commerce platforms globally, wanted to
enhance its understanding of customer purchasing behavior to improve marketing
campaigns and inventory management. By analyzing purchase data, Amazon's data
scientists were able to identify their top customers based on total spend and frequency of
purchase. They also used this data to identify top-selling products, especially seasonal
items, which were often out of stock due to unexpected demand. Amazon used data
analytics techniques like clustering and segmentation to categorize customers and predict
future buying behavior. The solution was implemented in early 2021, leading to targeted
promotional campaigns and better inventory management, reducing stockouts during
peak shopping seasons like Prime Day and the holiday rush. This led to improved
customer satisfaction and an increase in sales during high-demand periods. The project
took about four months to complete, leveraging advanced analytics tools such as Amazon
Redshift and AWS data services.
18.2 Time Series Analysis: Forecasting Sales Trends
In 2020, Walmart faced challenges in accurately forecasting demand for essential
products due to the COVID-19 pandemic. The company used time series analysis on
historical sales data from 2018 and 2019 to understand sales trends and seasonal
variations. By using time series models like ARIMA (AutoRegressive Integrated Moving
Average), Walmart was able to predict shifts in product demand, especially for essentials
like toilet paper, sanitizers, and groceries. The company applied these forecasts to adjust
stock levels dynamically and prevent shortages. The data was collected over several
months, starting in March 2020, with a solution fully implemented by May 2020. The
project helped Walmart not only manage supply chain disruptions but also adjust to new
consumer buying patterns caused by the pandemic. This analysis improved inventory
turnover and ensured that essential products were available, even during the early
pandemic months when consumer demand was highly unpredictable.
18.3 Building a KPI Dashboard with Aggregated Metrics
Code:
import pandas as pd
import numpy as np
num_days = 252# Number of business days in 2022 (approximately 252 days)
data = {
'Date': pd.date_range('2022-01-01', periods=num_days, freq='B'), # Business
days in 2022
'Stock_A': np.random.normal(100, 1, num_days), # Stock A prices
'Stock_B': np.random.normal(50, 1, num_days), # Stock B prices
'Stock_C': np.random.normal(200, 1, num_days) # Stock C prices
}# Sample stock data for a portfolio (assume the data is daily closing prices)
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
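A minimal sketch of the calculations described in the analysis below (the 40/30/30 weights follow the text; variable names are illustrative):
weights = {'Stock_A': 0.4, 'Stock_B': 0.3, 'Stock_C': 0.3} # Portfolio weights
daily_returns = df.pct_change().dropna() # Daily percentage change of each stock
portfolio_return = (daily_returns * pd.Series(weights)).sum(axis=1) # Weighted daily portfolio return
cumulative_return = (1 + portfolio_return).cumprod() - 1 # Cumulative return over time
print(cumulative_return.tail())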
Output:
Analysis:
● Data: In the code, we simulate daily stock prices for three stocks (Stock_A,
Stock_B, and Stock_C), with random values generated from a normal
distribution.
● Portfolio Weights: We assume the portfolio is composed of 40% in Stock_A, 30%
in Stock_B, and 30% in Stock_C.
● Daily Returns: Using pct_change(), we calculate the daily percentage change in
stock prices.
● Portfolio Return: The portfolio return on any given day is calculated by
multiplying the stock's daily return by its weight in the portfolio, and summing
the results.
● Cumulative Return: The cumulative portfolio return is the product of daily returns
over time, which is adjusted by subtracting 1.
19.2 Retail Analytics: Category-Wise Sales Comparison
Code:
import pandas as pd
data = {
'Product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Furniture',
'Electronics', 'Furniture', 'Clothing', 'Furniture', 'Electronics'],
'Sales': [1200, 300, 1500, 700, 200, 1800, 400, 500, 350, 2200]
}# Sample retail sales data
df = pd.DataFrame(data)
category_sales = df.groupby('Category')['Sales'].sum()# Group by category and calculate total sales
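A minimal sketch of the bar chart described in the analysis below (titles and labels are illustrative):
import matplotlib.pyplot as plt
category_sales.plot(kind='bar') # Bar chart of total sales per category
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()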
Analysis:
● Data: The dataset includes product names, their categories (Electronics, Clothing,
Furniture), and the sales amount for each product.
● Grouping: We use groupby() to aggregate the data by Category and calculate the
total sales for each category.
● Visualization: A bar chart is created to show the sales comparison between
categories. The plot(kind='bar') function generates the plot with labeled axes
19.3 Healthcare Analytics: Patient Data Analysis and Reporting
Code:
import pandas as pd
data = {
'Patient_ID': [1, 2, 3, 4, 5],
'Age': [34, 56, 23, 45, 67],
'Diagnosis': ['Diabetes', 'Hypertension', 'Diabetes', 'Hypertension', 'Cancer'],
'Treatment': ['Insulin', 'Medication', 'Insulin', 'Medication', 'Chemotherapy'],
'Outcome': ['Good', 'Fair', 'Good', 'Good', 'Poor']
}# Sample healthcare patient data
df = pd.DataFrame(data)
print("Diagnosis Count:")
print(diagnosis_counts)# Display diagnosis counts and outcome analysis
print("\nOutcome Analysis by Diagnosis:")
print(outcome_by_diagnosis)
Output:
Analysis:
● Data: The dataset contains patient records with information such as Patient_ID,
Age, Diagnosis, Treatment, and Outcome.
● Diagnosis Count: We use value_counts() to count the number of patients for each
diagnosis (e.g., Diabetes, Hypertension, Cancer).
● Outcome Analysis: A crosstab is used to show the relationship between the
diagnosis and the patient's outcome (e.g., Good, Fair, Poor).
● Report Generation: The diagnosis count and the outcome analysis are printed for
further interpretation.
Concept:
import pandas as pd
import numpy as np
df['average.grade'] = np.mean(df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade']], axis=1)# Row-wise average of the four subject grades (df is the student dataset loaded earlier)
print(df[['id', 'average.grade']].head())
Output:
Concept:
Pandas makes it easy to export DataFrames to different formats for storage or integration
with other applications. You can export to CSV, Excel, or even SQL databases.
Example: Export the DataFrame to CSV and Excel, and create a table in an SQLite
database.
import sqlite3
df.to_csv('exported_student_data.csv', index=False) # Export to CSV
print("Data exported to 'exported_student_data.csv'.")
Output:
Concept:
Pandas DataFrames are commonly used to preprocess data for machine learning
workflows. You can handle missing values, encode categorical data, and split datasets for
training and testing.
Example:
Code:
from sklearn.preprocessing import LabelEncoder # Import the encoder
label_encoder = LabelEncoder()
df['gender_encoded'] = label_encoder.fit_transform(df['gender'])# Encode the "gender" column
print(df[['gender', 'gender_encoded']].head())
21.Beyond Pandas
21.1 Exploring DataFrames in PySpark
Code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()# Initialize a Spark session
df = spark.read.csv('student-dataset.csv', header=True, inferSchema=True)# Load the dataset into a Spark DataFrame
result = df.count()
print("Count result:", result)
Output:
Modin:
import os
os.environ["MODIN_ENGINE"] = "ray" # or "dask"
Output:
Vaex provides memory-efficient, lazy evaluation for large datasets, while Modin
uses parallel processing to speed up Pandas operations. Both libraries offer faster data
manipulation compared to standard Pandas.
21.3 Combining Pandas with SQLAlchemy for Complex Queries
Code:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.db')# Create a database engine (an SQLite file is assumed here)
query = "SELECT * FROM students WHERE age > 30" # Query data using SQL
df = pd.read_sql(query, engine)# Load the result of the query into a Pandas DataFrame
print(df)
Output:
Creating DataFrames
Ex1: Create a DataFrame from a dictionary of lists.
Code:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [85, 90, 95]
} # Create a DataFrame from a dictionary of lists
df1 = pd.DataFrame(data)
print(df1)
Explanation:
This code creates a DataFrame using the pandas library from a dictionary of lists.
Each key in the dictionary (Name, Age, and Score) becomes a column in the DataFrame,
and the corresponding lists represent the values in those columns. Each row in the
DataFrame is formed by aligning the elements from the lists by their positions. The
resulting DataFrame organizes the data in a tabular format, making it easier to manipulate
and analyze. Finally, the DataFrame is printed to display the dataset.
Output:
Ex2: Load a dataset from a CSV file into a Pandas DataFrame and display the first 10
rows.
Code:
df = pd.read_csv('student-dataset.csv')
print(df.head(10)) # Load and display the first 10 rows
Explanation:
This code reads a CSV file named "student-dataset.csv" into a Pandas DataFrame
using the read_csv function. The resulting DataFrame organizes the data from the CSV
file into rows and columns for easy manipulation and analysis. The head(10) method is
then used to display the first 10 rows of the dataset, allowing for a quick inspection of its
structure and contents. This helps in understanding the data before performing further
operations.
Output:
Ex3: Create a DataFrame using NumPy arrays with custom row and column labels.
Code:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # Create a DataFrame from NumPy
arrays
df2 = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3'], columns=['Col1',
'Col2', 'Col3'])
print(df2)
Explanation:
This code creates a DataFrame using NumPy arrays, where the array provides the
data in a matrix-like format. The DataFrame is constructed by mapping the rows of the
array to custom row labels (Row1, Row2, Row3) and the columns to custom column
labels (Col1, Col2, Col3). Each element in the array is placed at the intersection of its
respective row and column label in the DataFrame. The result is a labeled, tabular
structure that is printed for visualization.
Output:
Exploring DataFrames
Ex4: Display the shape, column names, and data types of a DataFrame.
Code:
print(df.shape) # Display shape, columns, and data types
print(df.columns)
print(df.dtypes)
Explanation:
This code provides key information about a DataFrame. The shape function
displays the dimensions of the DataFrame as (number of rows, number of columns). The
columns function lists the names of all columns in the DataFrame. The dtypes function
shows the data types of each column, helping to understand the structure and types of
data stored in the DataFrame. These details are useful for understanding the dataset
before performing further analysis.
Output:
Code:
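A minimal sketch consistent with the explanation below, assuming df is the student DataFrame loaded earlier:
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.sample(3)) # 3 random rows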
Explanation:
This code previews specific parts of the DataFrame for quick inspection. The
head function displays the first 5 rows of the DataFrame, while the tail function shows
the last 5 rows. The sample function retrieves a specified number of random rows (in this
case, 3). These methods are useful for gaining insights into the dataset's structure and
content without viewing the entire DataFrame.
Output:
Code:
value_counts = df['gender'].value_counts() # Count unique values in the gender
column
print(value_counts)
Explanation:
This code counts the occurrences of each unique value in the specified column of
the DataFrame, which in this case is the "gender" column. The value_counts function
returns the count of each unique value as a series, with the values as the index and their
counts as the corresponding values. This is useful for analyzing the distribution of data in
a categorical column.
Output:
2: Basic Data Manipulation
Code:
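A minimal sketch consistent with the explanation below, assuming df is the student DataFrame loaded earlier:
selected_columns = df[['name', 'age']] # Select the "name" and "age" columns
selected_rows = df.iloc[:5] # Select the first 5 rows by position
filtered_rows = df.loc[df['age'] > 20, ['name', 'age']] # Rows where age > 20, keeping only "name" and "age"
print(filtered_rows)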
Explanation:
This code demonstrates different methods for selecting and filtering data in a
DataFrame. First, the selected_columns line selects the "name" and "age" columns using
column names. Then, the selected_rows line uses iloc to select the first 5 rows based on
their position. The filtered_rows line uses loc to filter rows where the "age" column is
greater than 20 and selects only the "name" and "age" columns. These techniques allow
for flexible and precise data selection and filtering in a DataFrame.
Output:
Ex8: Filter rows where a numeric column value exceeds a threshold (e.g., Sales > 1000).
Code:
filtered_df = df[df['age'] > 25] # Filter rows where 'age' is greater than 25
print(filtered_df)
Explanation:
This code filters the rows of the DataFrame where the value in the "age" column
is greater than 25. It creates a new DataFrame,filtered_df, containing only the rows that
meet this condition. This technique is useful for narrowing down the dataset based on a
specific numeric criterion, in this case, age. The filtered DataFrame is then printed for
inspection.
Output:
Ex9: Select rows where a text column matches a specific value using str.contains().
Code:
filtered_city = df[df['city'].str.contains('New', na=False)]# Select rows where the city contains "New"
print(filtered_city)
Explanation:
This code filters the rows where the "city" column contains the substring "New".
The str.contains() function is used to check if the specified string ("New") is present in
each value of the "city" column. The na=False argument ensures that any missing (NaN)
values are excluded from the result. The filtered rows are then stored in the filtered_city
DataFrame and printed for inspection. This method is useful for searching text columns
for specific substrings.
Output:
Code:
df = pd.read_csv('student-dataset.csv')
df_sorted_by_age = df.sort_values('age') # Sort by age in ascending order
df_sorted_multi = df.sort_values(['age', 'math.grade'], ascending=[True, False]) # Sort by multiple columns (age ascending, math.grade descending)
Explanation:
In this example, the sort_values method is used to sort the DataFrame df in
different ways. First, the DataFrame is sorted by the age column in ascending order. Next,
it is sorted by multiple columns: age in ascending order and math.grade in descending
order. The ascending parameter specifies the sort order for each column, with True for
ascending and False for descending.
Output:
Code:
df_renamed = df.rename(columns={
'english.grade': 'english_performance',
'math.grade': 'math_performance',
'sciences.grade': 'science_performance',
'language.grade': 'language_proficiency',
'ethnic.group': 'ethnicity' # Rename columns to more descriptive names
})
Explanation:
In this example, the rename method is used to change column names in the
DataFrame to more descriptive ones. The columns parameter takes a dictionary where
the keys are the original column names, and the values are the new names. This improves
readability and clarity in the DataFrame. To view the renamed DataFrame, use the print(df_renamed) statement.
Output:
Ex12: Reset the index of a DataFrame and drop the existing index column.
Code:
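A minimal sketch consistent with the explanation below:
df = df.reset_index(drop=True) # Reset the index and discard the old index rather than keeping it as a column
print(df.head())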
Explanation:
The DataFrame df now has a fresh index (from 0 to the number of rows - 1)
without retaining the original index column. This argument ensures that the old index is
removed from the DataFrame and not added as a new column. Without this argument, the
old index would appear as a separate column in the resulting DataFrame.
Output:
Intermediate-Level Exercises
3: Cleaning and Preprocessing Data
Ex13: Identify missing values in each column using isna() and sum().
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('student-dataset.csv') # Load the dataset
print("Missing values per column:")
print(df.isna().sum()) # Count missing values in each column
Explanation:
This code identifies the number of missing values in each column of the
DataFrame. The isna() function checks for missing (NaN) values in the dataset, returning
a DataFrame of the same shape with True for missing values and False otherwise. The
sum() function then calculates the total count of missing values for each column. In this
case, only the ethnic.group column contains missing values, so its count is displayed,
while other columns show zero if they have no missing values.
Output:
Ex14: Replace missing numeric values with the column mean and missing text values with
"Unknown".
Code:
numeric_columns = df.select_dtypes(include=[np.number]).columns
text_columns = df.select_dtypes(include=['object']).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())# Replace missing numeric values with the column mean
df[text_columns] = df[text_columns].fillna("Unknown")# Replace missing text values with "Unknown"
print("Column Means:")
print(df[numeric_columns].mean())
print("\nFilled Numeric Columns:")
print(df[numeric_columns].head())# Print the filled dataframe or specific columns
to verify
print("\nFilled Text Columns:")
print(df[text_columns].head())
Explanation:
This code handles missing values in numeric and text columns separately.
Numeric columns are identified using select_dtypes with include=[np.number], and text
columns are identified with include=['object']. For numeric columns, missing values are
replaced with the column mean using the fillna method. For text columns, missing values
are replaced with the string "Unknown" using the same method. The updated DataFrame
is printed to verify the changes. This approach ensures missing values are handled
appropriately based on the data type, maintaining data integrity.
Output:
Ex15: Drop rows or columns with more than 50% missing data.
Code:
col_threshold = int(len(df) * 0.5) # A column must have at least 50% non-missing values to be kept
df.dropna(thresh=col_threshold, axis=1, inplace=True) # Drop columns with more than 50% missing data
row_threshold = int(df.shape[1] * 0.5) # A row must have at least 50% non-missing values to be kept
df.dropna(thresh=row_threshold, axis=0, inplace=True) # Drop rows with more than 50% missing data
Explanation:
Since no row or column in the student dataset has more than 50% missing data, this step does not remove anything. However, this is exactly how you would drop such rows or columns if they existed.
Transforming Data
Ex16: Add a new column based on calculations using other columns
Code:
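A minimal sketch consistent with the explanation below (the new column names follow the explanation):
grade_cols = ['english.grade', 'math.grade', 'sciences.grade', 'language.grade']
df['total_grade'] = df[grade_cols].mean(axis=1) # Row-wise mean of the four subject grades
df['grade_variance'] = df[grade_cols].var(axis=1) # Row-wise variance of the same grades
print(df[['total_grade', 'grade_variance']].head())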
Explanation:
In this example, new columns are added to the DataFrame based on calculations
using existing columns. The total_grade column is created by calculating the mean of
grades across four subjects for each row using the mean function. Similarly, the
grade_variance column is added by calculating the variance of the same grades using the
var function. These new columns provide insights into the overall performance and
consistency of grades for each student.
Output:
Ex17: Standardize numeric columns by subtracting the mean and dividing by the standard
deviation.
Code:
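A minimal sketch consistent with the explanation below:
import numpy as np
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    df[col + '_standardized'] = (df[col] - df[col].mean()) / df[col].std() # Mean 0, standard deviation 1
print(df[[c for c in df.columns if c.endswith('_standardized')]].head())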
Explanation:
This example standardizes numeric columns in the DataFrame by transforming
each column to have a mean of 0 and a standard deviation of 1. For each numeric
column, the mean is subtracted, and the result is divided by the standard deviation. This
transformation ensures that all numeric data is on the same scale, making it suitable for
machine learning models or statistical analysis. The standardized values are stored in new
columns with "_standardized" appended to the original column names.
Output:
Ex18: Apply a lambda function to modify values in a column (e.g., convert text to
uppercase).
Code:
df['city_upper'] = df['city'].apply(lambda x: x.upper() if isinstance(x, str) else x)# Convert the 'city' column to uppercase using a lambda function
print("\nCity Column in Uppercase:")
print(df[['city', 'city_upper']].head())# Print the original and updated 'city' columns
Explanation:
In this example, a lambda function is applied to the city column to convert all text
values to uppercase. The apply method processes each value in the column, and the
lambda function checks if the value is a string before converting it to uppercase. The
modified values are stored in a new column named city_upper, while the original city
column remains unchanged.
Output:
Explanation:
In this example, the data is grouped by the nationality column, and the mean of
the latitude column is calculated for each group. The groupby method groups the rows
based on unique values in the nationality column, and the mean function computes the
average latitude for each group. This approach is useful for summarizing and analyzing
data based on categories. The result is a Series with nationalities as the index and their
corresponding mean latitude values.
Output:
Ex20: Use multiple aggregation functions (e.g., sum, mean, max) on grouped data.
Code:
group_summary = df.groupby('nationality')[['latitude', 'longitude', 'english.grade']].agg(['sum', 'mean', 'max']) # Apply three aggregation functions
print(group_summary)
Explanation:
In this example, multiple aggregation functions (sum,mean and max) are applied
to the grouped data. The DataFrame is grouped by the nationality column, and the
specified numeric columns (latitude,longitude and english.grade) are aggregated using
these functions. The agg method allows applying multiple functions simultaneously,
resulting in a summary table with a hierarchical column structure, where each numeric
column displays the results for all three functions. This provides a detailed summary of
grouped data.
Output:
Ex21: Find the top 3 groups by their total values in a numeric column.
Code:
top_groups = df.groupby('nationality')['latitude'].sum().nlargest(3).index
print(top_groups)
Explanation:
This code identifies the top three groups in the "nationality" column based on the
total sum of values in the "latitude" column. The groupby function groups the data by
"nationality", and the sum function calculates the total "latitude" value for each group.
The nlargest function is used to find the top three groups with the highest total values.
Finally, the index of these top groups is printed, showing the group names.
Output:
Advanced-Level Exercises
5: Advanced Transformations
Reshaping Data
Ex25: Reshape data using melt() to convert wide data into long format.
Code:
import pandas as pd
df = pd.read_csv('student-dataset.csv') # Load the dataset
long_format = pd.melt(
df,
id_vars=['id', 'name'],
value_vars=['english.grade', 'math.grade', 'sciences.grade', 'language.grade'],
var_name='Subject',
value_name='Grade'
) # Convert grade columns into long format
print("Long Format Data:")
print(long_format.head())
Explanation:
This code uses the melt function to reshape the DataFrame from a wide format to
a long format. The id_vars parameter specifies the columns that should remain
unchanged ("id" and "name"), while value_vars lists the columns to be unpivoted
("english.grade", "math.grade", "sciences.grade", "language.grade"). The resulting
long-format DataFrame has two new columns: "Subject", which contains the original
column names from value_vars, and "Grade", which contains the corresponding values.
This format is useful for analysis where data needs to be consolidated into a single
column for better processing or visualization.
Output:
Ex26: Use pivot() to transform long data into a summary table.
Code:
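A minimal sketch consistent with the explanation below, assuming the long_format DataFrame produced in Ex25:
summary_table = long_format.pivot(index='id', columns='Subject', values='Grade') # One row per id, one column per subject
print(summary_table.head())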
Explanation:
This code uses the pivot function to transform long-format data back into a
wide-format summary table. The index parameter specifies the column ("id") to use as
the row labels, the columns parameter defines which column ("Subject") will become the
new column headers, and the values parameter specifies the column ("Grade") containing
the data to populate the table. This creates a structured summary table where each row
represents an "id" and each column represents a "Subject" with the corresponding
"Grade" values, making it easier to compare observations across subjects.
Output:
Explanation:
This code demonstrates how to manipulate hierarchical indices in a DataFrame
using stacking and unstacking. First, a multi-index DataFrame is created by setting "id"
and "gender" as hierarchical row indices. The stack function is then used to convert
specific columns ("english.grade" and "math.grade") into row-level indices, creating a
vertically compact structure. The unstack function reverses this process by converting
row-level indices back into columns, returning the DataFrame to its wide format. These
transformations are useful for reshaping and analyzing multi-dimensional data.
Output:
Code:
import numpy as np
np.random.seed(0) # Simulate a date column for demonstration
df['Date'] = pd.to_datetime('2023-01-01') + pd.to_timedelta(np.random.randint(0, 365, df.shape[0]), unit='d')
print("Data with DateTime:")
print(df[['id', 'Date']].head()) #Convert the String to Datetime
Explanation:
This example converts a string column into a datetime format using the pandas
to_datetime function. Since the dataset did not originally include a date column, a
synthetic one was created by generating random days within a year starting from
"2023-01-01". The to_datetime function ensures that the data is in a proper datetime
format, which allows for operations like date extractions, sorting, and time-based
analysis. This process is useful for preparing date-related data for further analysis or
visualization.
Output:
Ex29: Extract the year, month, and weekday from a datetime column.
Code:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name() # Extract date components
print("Data with Year, Month, and Weekday:")
print(df[['Date', 'Year', 'Month', 'Weekday']].head())
Explanation:
This code extracts specific components from a datetime column named "Date" in
the DataFrame. The year is extracted using the dt.year attribute, the month using
dt.month, and the weekday name using dt.day_name(). These extracted components are
stored as new columns ("Year", "Month", "Weekday") in the DataFrame. This process is
useful for analyzing or visualizing data based on temporal trends, such as identifying
patterns by year, month, or day of the week. The updated DataFrame is then printed to
show the extracted information alongside the original "Date" column.
Output:
Ex30: Group sales data by month and calculate total monthly revenue.
Code:
monthly_grades = df.groupby(df['Date'].dt.to_period('M'))[['english.grade', 'math.grade']].mean()
print("Monthly Average Grades:") # With the help of groupby we can calculate
the mean
print(monthly_grades)
Explanation:
This code groups the data by month using the groupby function along with the
to_period('M') method, which converts the datetime values to monthly periods. It then
calculates the mean of the selected columns ("english.grade" and "math.grade") for each
month. This aggregation allows you to analyze trends such as monthly averages or totals.
In the context of sales data, this could be used to calculate total revenue for each month,
but in this example, it is used to calculate the average grades for each month.
Output:
6: Performance Optimization
Optimizing DataFrames
Ex31: Reduce memory usage by converting columns to appropriate data types (e.g., float32
or category).
Code:
df['id'] = df['id'].astype('int32')
df['english.grade'] = df['english.grade'].astype('float32') #Convert grades to
float32 and id to int32
print("Reduced memory DataFrame:")
print(df.info())
Explanation:
This code optimizes memory usage by converting columns in the DataFrame to
more efficient data types. The "id" column is converted to int32, which uses less memory
compared to the default int64 type. The "english.grade" column is converted to float32,
which reduces memory usage compared to the default float64 type. The astype() function
is used for this conversion. After converting, the info method is called to display the
updated DataFrame information, showing the reduced memory usage. This technique is
useful for working with large datasets to improve performance.
Output:
Ex32: Use vectorized operations to replace loops for calculating new columns.
Code:
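A minimal sketch consistent with the explanation below, assuming df is the student dataset loaded earlier:
df['AverageGrade'] = df[['english.grade', 'math.grade', 'sciences.grade', 'language.grade']].mean(axis=1) # Vectorized row-wise mean, no explicit loop
print(df[['id', 'AverageGrade']].head())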
Explanation:
This code replaces the use of loops with a vectorized operation to calculate the
average grade. Instead of iterating over each row manually, it computes the average of the
specified columns (english.grade, math.grade, sciences.grade, language.grade) using the
mean function with axis=1, which operates across rows. This approach is much faster and
more efficient than loops, especially for large datasets. The result is stored in a new
column called AverageGrade, and the DataFrame is printed with the id and AverageGrade
columns for inspection.
Output:
Code:
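A minimal sketch consistent with the explanation below (the filename follows the dataset used throughout this document):
for chunk in pd.read_csv('student-dataset.csv', chunksize=250): # Read the file 250 rows at a time
    print(chunk['name'].head()) # Process each chunk individually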
Explanation:
This code reads and processes a large dataset in chunks using the read_csv
function with the chunksize parameter. By specifying a chunk size of 250, the dataset is
read in smaller, manageable parts instead of all at once, which helps prevent memory
overload when working with large files that cannot fit into RAM. Each chunk is
processed individually within the loop, allowing for incremental analysis or
transformation of the data. In this example, the 'name' column is printed for the first few
rows of each chunk.
Output:
Working with Large Data
Ex34: Load and filter a dataset with over 1 million rows using Dask.
Code:
import os
import dask.dataframe as dd

def process_large_dataset_with_dask(data_path='large_student_data'):
    dask_df = dd.read_csv(os.path.join(data_path, '*.csv')) # Read all CSV files in the directory
    filtered_df = dask_df[
        (dask_df['age'] >= 20) & # Complex filtering and aggregation
        (dask_df['age'] <= 30) &
        (dask_df['english_grade'] > 80)
    ]
    grouped_stats = filtered_df.groupby('nationality').agg({
        'math_grade': ['mean', 'max'], # Compute group-level statistics
        'english_grade': ['mean', 'min']
    }).compute()
    return grouped_stats

if __name__ == '__main__':
    dask_results = process_large_dataset_with_dask()
    print("\nDask Grouped Statistics:")
    print(dask_results)
Explanation:
This code uses Dask to process a large dataset, which can be spread across
multiple CSV files in a directory. It first loads all CSV files using dd.read_csv(), which
allows Dask to handle large datasets by breaking them into smaller chunks. The dataset is
filtered to include rows where the age is between 20 and 30, and the English grade is
above 80. After filtering, the data is grouped by nationality, and several statistics (mean,
max, min) for the math and English grades are computed using groupby() and agg(). The
compute() method is used to execute the calculations and return the results. Dask allows
for processing large datasets efficiently by utilizing parallel computing.
Output:
Ex35: Split a large dataset into smaller files and recombine them using concat().
Code:
import os
import pandas as pd

def split_and_recombine_dataset(data_path='large_student_data'):
    csv_files = [os.path.join(data_path, f) for f in os.listdir(data_path) if f.endswith('.csv')] # List all CSV files
    chunks = []
    for file in csv_files: # Read each file in chunks
        chunk = pd.read_csv(file, chunksize=50000)
        chunks.extend(list(chunk))
    for chunk in chunks:
        print(chunk['id'])
    full_dataset = pd.concat(chunks, ignore_index=True) # Recombine using concat
    return full_dataset
Explanation:
This code splits a large dataset into smaller chunks and then recombines them
using concat(). First, it reads all the CSV files from the given directory and processes
them in chunks of 50,000 rows using pd.read_csv() with the chunksize parameter. These
chunks are stored in a list. Then, the code iterates through each chunk and prints the id
column from the chunk. After processing all the chunks, pd.concat() is used to combine
them into one large DataFrame. This allows for efficient handling of large datasets by
working with smaller pieces and then merging them back together.
Output:
Real-World Applications
7: Case Studies
Customer Analytics
Ex36: Load a dataset of customer transactions and calculate: Total revenue by customer.
Average order value for each customer. Top 10 customers by total spending.
Dataset:
Objective:
Code:
import pandas as pd
data = {
'customer_id': [1, 2, 1, 3, 2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'transaction_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
115],
'transaction_date': [
'2023-01-01', '2023-01-03', '2023-01-05',
'2023-01-10', '2023-01-12', '2023-01-15', '2023-01-20',
'2023-01-25', '2023-01-28', '2023-02-01', '2023-02-05',
'2023-02-10', '2023-02-15', '2023-02-20', '2023-02-25'
],
'amount': [50.5, 75.0, 60.0, 120.0, 40.0, 90.0, 80.0, 200.0, 150.0, 300.0, 250.0, 100.0,
130.0, 220.0, 310.0]
}# Create a sample dataset
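A minimal sketch of the analysis described in the objective (variable names are illustrative):
df = pd.DataFrame(data)
total_revenue = df.groupby('customer_id')['amount'].sum() # Total revenue by customer
avg_order_value = df.groupby('customer_id')['amount'].mean() # Average order value for each customer
top_customers = total_revenue.nlargest(10) # Top 10 customers by total spending
print("Total revenue by customer:\n", total_revenue)
print("\nAverage order value:\n", avg_order_value)
print("\nTop 10 customers by spending:\n", top_customers)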
Output:
Sales Analysis
Ex37: Analyze a dataset with fields like Product, Region, Sales, and Profit: Identify the
most profitable product category in each region. Visualize monthly sales trends using
Matplotlib or Seaborn.
Objective:
data = {
'Product': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B', 'C', 'D'],
'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West', 'North', 'South',
'East', 'West'],
'Sales': [1000, 1500, 1200, 1800, 1400, 2000, 1600, 1900, 1700, 1100, 1500, 2200],
'Profit': [300, 400, 350, 500, 450, 600, 550, 700, 500, 300, 400, 800],
'Month': ['2023-01', '2023-01', '2023-01', '2023-01', '2023-02', '2023-02', '2023-02',
'2023-02', '2023-03', '2023-03', '2023-03', '2023-03']
}# Sample dataset for Sales Analysis
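A minimal sketch of the analysis described in the objective (this sample data has a Product column rather than a category column, so products stand in for categories; variable names are illustrative):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data)
profit_by_region = df.groupby(['Region', 'Product'])['Profit'].sum() # Profit per product within each region
top_per_region = profit_by_region.groupby(level='Region').idxmax() # Most profitable product in each region
print("Most profitable product per region:\n", top_per_region)
monthly_sales = df.groupby('Month')['Sales'].sum() # Total sales per month
monthly_sales.plot(kind='line', marker='o') # Monthly sales trend
plt.title('Monthly Sales Trend')
plt.show()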
Output:
Explanation:
Inventory Management
Ex38: Use a dataset with fields like Product ID, Stock, Reorder Level: Identify products
below the reorder level. Calculate the total inventory value for each product.
Inventory Management
Objective:
Code:
inventory_data = {
'Product_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
'Stock': [50, 20, 15, 80, 30],
'Reorder_Level': [40, 30, 20, 50, 25],
'Unit_Price': [10, 15, 20, 5, 8]
}# Sample dataset for Inventory Management
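A minimal sketch of the analysis described in the objective (variable names are illustrative):
import pandas as pd
df = pd.DataFrame(inventory_data)
below_reorder = df[df['Stock'] < df['Reorder_Level']] # Products below the reorder level
df['Inventory_Value'] = df['Stock'] * df['Unit_Price'] # Total inventory value per product
print("Products below reorder level:\n", below_reorder)
print("\nInventory value per product:\n", df[['Product_ID', 'Inventory_Value']])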
Output:
Explanation:
Data Visualization
Ex39: Create a bar chart to compare total sales by region.
Objective:
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Sales': [1000, 1500, 1200, 1800, 1400, 1100, 2000, 1900],
}# Sample dataset for sales analysis
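A minimal sketch of the bar chart described in the objective (titles and labels are illustrative):
df = pd.DataFrame(data)
region_sales = df.groupby('Region')['Sales'].sum() # Total sales per region
region_sales.plot(kind='bar') # Bar chart comparing regions
plt.title('Total Sales by Region')
plt.ylabel('Sales')
plt.show()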
This example involves visualizing the correlations between numerical fields in the dataset. We'll
use Seaborn to plot the heatmap.
Code:
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Sales': [200, 300, 250, 400, 100, 150, 120, 180],
'Profit': [50, 70, 60, 100, 20, 30, 25, 40],
'Discount': [10, 15, 5, 10, 5, 10, 3, 7]
}# Add additional fields to the dataset
df = pd.DataFrame(data)
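A minimal sketch of the heatmap, assuming the pandas, Matplotlib, and Seaborn imports from the previous example:
corr_matrix = df[['Sales', 'Profit', 'Discount']].corr() # Pairwise correlations of the numeric fields
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm') # Heatmap of the correlation matrix
plt.title('Correlation Heatmap')
plt.show()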
The heatmap will display the correlations between Sales, Profit, and Discount: