Data Engineering
Data Transformation
WINDOW
FUNCTIONS
DHANESH SARPALE
BASIC EXAMPLE
In the above example, the average salary has
been calculated by aggregating the salaries
based on the job titles of the employees.
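The same kind of calculation can be sketched in pandas. This is a minimal illustration only: the `employees` table, its column names, and its values are hypothetical and not taken from the original example.

```python
import pandas as pd

# Hypothetical employee data, invented for illustration only
employees = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "job_title": ["Engineer", "Engineer", "Analyst", "Analyst"],
    "salary": [100, 120, 80, 90],
})

# Rough equivalent of AVG(salary) OVER (PARTITION BY job_title):
# every row is kept and receives the average salary of its own job title
employees["avg_salary_by_title"] = (
    employees.groupby("job_title")["salary"].transform("mean")
)

print(employees)
```

Unlike a plain GROUP BY, the result still has one row per employee, with the group average repeated on each row.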
GENERIC SYNTAX
MySQL
SELECT <column_list>,
<aggregate_function>(<column_expression>) OVER
(
PARTITION BY <partition_expression>
ORDER BY <order_expression>
ROWS <window_frame>
) AS <alias>
FROM <table_name>;
Python
import pandas as pd
df['<alias>'] = (
    df.groupby('<partition_expression>')['<column_expression>']
    .transform('<aggregate_function>')
)
LIST OF SQL WINDOW FUNCTIONS
DATA ENGINEERING COMMON
OPERATIONS WITH WINDOW
FUNCTIONS
Window functions support aggregating,
transforming, and analyzing data within precise
partitions or windows. Common operations:
1. Data Aggregation
2. Data Cleansing
3. Data Enrichment
4. Data Partitioning
5. Data Ordering
1. DATA AGGREGATION
To perform aggregations over subsets of
data within a given window.
To calculate aggregated values such as
cumulative sums, averages, counts, or
percentages.
To perform these aggregations efficiently
and flexibly, so that data can be
aggregated at different levels of
granularity.
1. DATA AGGREGATION
MySQL
SELECT product_id, category, sales,
SUM(sales) OVER (PARTITION BY category) AS
category_total_sales,
AVG(sales) OVER (PARTITION BY category) AS
category_avg_sales,
SUM(sales) OVER () AS overall_total_sales,
AVG(sales) OVER () AS overall_avg_sales
FROM sales_data;
Python
import pandas as pd
# Assume the data is already loaded into a pandas DataFrame called 'df'
# Calculate the total and average sales for each category,
# and the overall total and average
df['category_total_sales'] = df.groupby('category')['sales'].transform('sum')
df['category_avg_sales'] = df.groupby('category')['sales'].transform('mean')
df['overall_total_sales'] = df['sales'].sum()
df['overall_avg_sales'] = df['sales'].mean()
# Displaying the DataFrame
print(df)
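The cumulative sums, counts, and percentages mentioned earlier can be computed with the same groupby pattern. The following is a minimal sketch on a small hypothetical `sales_data` frame; the values and the new column names (`pct_of_category`, `category_running_sales`, `category_product_count`) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
sales_data = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "category": ["A", "A", "B", "B"],
    "sales": [10, 30, 20, 20],
})

# Each product's share of its category total, as a percentage
category_total = sales_data.groupby("category")["sales"].transform("sum")
sales_data["pct_of_category"] = sales_data["sales"] / category_total * 100

# Cumulative sum of sales within each category, in the current row order
sales_data["category_running_sales"] = (
    sales_data.groupby("category")["sales"].cumsum()
)

# Number of products in each category, repeated on every row
sales_data["category_product_count"] = (
    sales_data.groupby("category")["product_id"].transform("count")
)

print(sales_data)
```

Because each result is written back as a column, category-level and row-level values sit side by side, which is what makes mixing levels of granularity convenient.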
2. DATA CLEANSING
To assist in data cleansing tasks by
identifying and handling duplicates,
missing values, or outliers within specific
windows.
To rank rows based on certain criteria and
identify duplicate records.
To calculate statistical measures within
windows to identify outliers that need to be
handled or removed during the ETL process.
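One common way to handle duplicates is to number the rows within each key and keep only the first. The sketch below assumes a hypothetical `records` table in which `customer_id` identifies a record and `updated_at` decides which copy is the most recent; both names and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with duplicate customer records, for illustration only
records = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "updated_at": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-01-15", "2024-01-10", "2024-01-05",
    ]),
    "email": [
        "a_old@example.com", "a_new@example.com", "b@example.com",
        "c_new@example.com", "c_old@example.com",
    ],
})

# Rough equivalent of
# ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC)
records = records.sort_values(
    ["customer_id", "updated_at"], ascending=[True, False]
)
records["row_num"] = records.groupby("customer_id").cumcount() + 1

# Keep only the most recent row per customer
deduplicated = records[records["row_num"] == 1].drop(columns="row_num")

print(deduplicated)
```

Rows with `row_num` greater than 1 are the older duplicates; they could be written to a reject table instead of being dropped if an audit trail is needed.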
2. DATA CLEANSING
MySQL
SELECT name, score,
RANK() OVER (ORDER BY score DESC) AS `rank`
FROM students;
Python
import pandas as pd
# Assume the data is already loaded into a pandas DataFrame called 'df'
# Assigning ranks to students based on their exam scores
df['rank'] = df['score'].rank(ascending=False, method='min')
# Displaying the DataFrame
print(df)
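Outlier detection within a window can follow the same per-group pattern. The sketch below uses the interquartile range (IQR) rule, which is one possible choice of statistical measure and not one prescribed by this document; the `readings` table, its columns, and the 1.5 multiplier are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sensor readings, invented for illustration only
readings = pd.DataFrame({
    "sensor": ["A"] * 5 + ["B"] * 5,
    "value": [10, 11, 9, 10, 50, 100, 102, 98, 101, 99],
})

# Quartiles computed separately for each sensor and repeated on every row
grouped = readings.groupby("sensor")["value"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles of their own sensor
readings["is_outlier"] = (
    (readings["value"] < q1 - 1.5 * iqr)
    | (readings["value"] > q3 + 1.5 * iqr)
)

# Rows that would be kept after removing the flagged outliers
cleaned = readings[~readings["is_outlier"]]

print(readings)
```

Computing the bounds per sensor matters here: a value of 50 is extreme for sensor A but would be unremarkable if all readings were pooled together.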
3. DATA ENRICHMENT
Window functions provide the ability to
enrich data by computing values based on a
subset of related records within a window.
These values can be used to derive new
information or generate additional features
for your dataset.
For instance, to calculate moving averages,
running totals, or cumulative sums within a
window to provide insights into trends or
patterns in the data.
3. DATA ENRICHMENT
MySQL
SELECT product_id, sales,
AVG(sales) OVER (ORDER BY date_column ROWS
BETWEEN 2 PRECEDING AND CURRENT ROW) AS
moving_average,
SUM(sales) OVER (ORDER BY date_column) AS
running_total,
SUM(sales) OVER (ORDER BY date_column) AS
cumulative_sum
FROM sales_data;
Python
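import pandas as pd
# Assume the data is already loaded into a pandas DataFrame called 'df'
# Sort by date so the windows follow time order, like ORDER BY date_column
df = df.sort_values('date_column')
# Calculate the 3-row moving average of sales
# (the current row and the 2 preceding rows)
df['moving_average'] = df['sales'].rolling(window=3, min_periods=1).mean()
# Calculate the running total (cumulative sum) of sales
df['running_total'] = df['sales'].cumsum()
df['cumulative_sum'] = df['sales'].cumsum()
# Display the DataFrame
print(df)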
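The snippet above computes its windows over the whole DataFrame. To compute them separately for each product, the rolling and cumulative operations can be applied per group. This is a minimal sketch on hypothetical data; the sample values are invented, and the column names follow the SQL example.

```python
import pandas as pd

# Hypothetical daily sales for two products, invented for illustration only
sales_data = pd.DataFrame({
    "product_id": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "date_column": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03"] * 2
    ),
    "sales": [10, 20, 30, 100, 200, 300],
})

# Order rows by date within each product before applying row-based windows
sales_data = sales_data.sort_values(["product_id", "date_column"])
by_product = sales_data.groupby("product_id")["sales"]

# 3-row moving average, restarted for each product
sales_data["moving_average"] = by_product.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Cumulative sum of sales, restarted for each product
sales_data["cumulative_sum"] = by_product.cumsum()

print(sales_data)
```

In SQL terms this corresponds to adding `PARTITION BY product_id` inside each `OVER (...)` clause.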
4. DATA PARTITIONING
Window functions make it possible to partition
data into logical groups based on one or more
columns.
This is particularly helpful during the
transformation phase of ETL when
performing calculations or aggregations
separately for different partitions.
For example, to partition data by region, time
period, or any other relevant attribute and
apply window functions within each partition
to obtain partition-specific results.
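As a minimal sketch of partition-specific results, the example below partitions a hypothetical `orders` table by region, then computes a regional total and a rank within each region. The table, its columns, and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical store revenue by region, invented for illustration only
orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "store": ["E1", "E2", "W1", "W2", "W3"],
    "revenue": [500, 300, 200, 700, 100],
})

by_region = orders.groupby("region")["revenue"]

# Rough equivalent of SUM(revenue) OVER (PARTITION BY region)
orders["region_revenue"] = by_region.transform("sum")

# Rough equivalent of
# RANK() OVER (PARTITION BY region ORDER BY revenue DESC)
orders["rank_in_region"] = (
    by_region.rank(ascending=False, method="min").astype(int)
)

print(orders)
```

Partitioning by more than one attribute, for example region and month, only requires passing a list of columns to `groupby`.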
5. DATA ORDERING
Window functions provide the ability to order
data within each partition based on specified
criteria.
This is useful when performing calculations
or aggregations in a specific order.
For example, to order time series data by
timestamp and use window functions to
calculate moving averages or detect trends
over a specified window size.
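The time series case can be sketched as follows. The `readings` table, its values, the window size of 2, and the simple "rising" flag are all assumptions chosen for illustration; real trend detection would usually need something more robust.

```python
import pandas as pd

# Hypothetical readings that arrive out of order, invented for illustration
readings = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-01", "2024-01-04", "2024-01-02"]
    ),
    "value": [30.0, 10.0, 20.0, 20.0],
})

# Order by timestamp first: row-based windows depend on row order
readings = readings.sort_values("timestamp").reset_index(drop=True)

# Moving average over the current and previous reading
readings["moving_average"] = (
    readings["value"].rolling(window=2, min_periods=1).mean()
)

# Change from the previous reading, and a naive flag for a rising value
readings["change"] = readings["value"].diff()
readings["is_rising"] = readings["change"] > 0

print(readings)
```

Without the sort step, the moving average and the change would be computed in arrival order and would mix non-adjacent timestamps.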