Advanced Customer Segmentation Using Azure Synapse
A Project Report submitted in partial fulfillment of the
requirements for the award of the degree of
Bachelor of Technology in
Computer Science and Engineering
By
2200030700 - Deeksha Supreeth
2200030086 - G.V. Sai Suhruth
2200090081 - G. Jaswanth
under the supervision of
Dr. Praveen Kumar Madhavarapu
Department of Computer Science and Engineering
K L E F, Green Fields, Vaddeswaram - 522502,
Guntur (District), Andhra Pradesh, India.
April, 2025
CERTIFICATE
This is to certify that the Project Report entitled “Advanced Customer Segmentation
Using Azure Synapse” is being submitted by Deeksha Supreeth (2200030700), G.V. Sai
Suhruth (2200030086), and G. Jaswanth (2200090081) in partial fulfillment for the
award of B. Tech III Even Semester in CSE at K L University. This report is a record of
bona fide work carried out under our guidance and supervision.
The results embodied in this report have not been copied from any other Department,
University, or Institute.
Signature of the Supervisor
Dr. Praveen Kumar Madhavarapu
Contents
S. No Contents
1. Abstract
2. Introduction
3. Problem Statement
4. Objectives of the Project
5. Literature Survey
6. System Architecture
7. Technologies Used
8. Implementation
9. Dataset Used
10. Data Flow Diagram
11. Screenshots of Output
12. Results and Discussions
13. Conclusion and Future Work
14. References
Abstract
In today’s competitive market, understanding customer behavior is key to driving
targeted marketing, enhancing user experiences, and boosting sales. This project
aims to implement an advanced customer segmentation solution using Azure
Synapse Analytics. By ingesting structured data stored in Azure Data Lake Storage
and processing it with Synapse Spark pools, the project applies machine learning
techniques to segment customers based on their purchasing behavior. The
segmented output is visualized using Power BI, providing actionable insights into
customer patterns and preferences. This end-to-end pipeline showcases the power
of integrating big data storage, analytics, and machine learning within the Azure
ecosystem.
Introduction
Customer segmentation is a vital data mining technique that divides a customer
base into distinct groups based on common characteristics. Businesses use this to
tailor marketing strategies, personalize services, and improve customer
satisfaction.
In this project, we leverage Azure Synapse Analytics, an analytics service that
combines big data processing and data warehousing. We ingest the dataset into
Azure Data Lake Storage Gen2 and process it using Apache Spark pools in Synapse
to discover customer segments. The final results are saved and visualized in
Power BI, enabling decision-makers to better understand customer clusters,
spending patterns, and engagement behavior.
This solution demonstrates a complete modern data pipeline for intelligent
customer analytics using the cloud.
Problem Statement
Traditional marketing strategies treat all customers alike, which leads to
inefficiencies and reduced customer satisfaction. Businesses need a scalable way to
segment customers and analyze their behavior patterns to target their audience
more precisely.
Challenge: How can we leverage cloud technologies to automate and scale
customer segmentation from raw data to insightful dashboards?
Objectives of the Project
Ingest structured customer data into Azure Data Lake Storage.
Process the data using Apache Spark in Azure Synapse.
Apply K-Means clustering to group customers based on behavior.
Store the clustered data and visualize it in Power BI dashboards.
Help businesses understand different customer segments for targeted
decision-making.
Literature Survey
Several studies have shown the importance of customer segmentation in enhancing
marketing effectiveness. Techniques like clustering and classification have been
applied with tools like Python and R. However, workflows built on these tools
often do not scale well to large datasets.
Azure Synapse provides a cloud-native platform that integrates big data and data
warehousing with ML. Recent case studies show that combining Spark with Data
Lake Storage leads to better performance and flexibility in data processing and
analytics.
System Architecture
→ Azure Data Lake Storage (train.csv)
→ Azure Synapse (Spark Pool for ML)
→ K-Means Clustering → Customer Segments
→ Output stored in ADLS as Parquet
→ Power BI Dashboard connects to output for visualization
Components:
Azure Data Lake Gen2 (Storage)
Synapse Spark Pool (Processing + ML)
Synapse Serverless SQL Pool (Optional querying)
Power BI (Visualization)
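As a small illustration of how the Spark pool addresses files in Data Lake Gen2, the sketch below builds URIs in the abfss scheme. The container name and storage account name are hypothetical placeholders, not the ones used in this project; only train.csv and output/clustered_customers come from this report.

```python
def adls_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a file or folder in ADLS Gen2."""
    clean_path = path.lstrip("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}"


# Hypothetical container and storage account names, for illustration only.
CONTAINER = "customerdata"
ACCOUNT = "examplestorageacct"

INPUT_URI = adls_uri(CONTAINER, ACCOUNT, "train.csv")
OUTPUT_URI = adls_uri(CONTAINER, ACCOUNT, "/output/clustered_customers")

print(INPUT_URI)
print(OUTPUT_URI)
```

Keeping the URI construction in one helper means the input and output locations differ only in their path, which makes it easier to point the same notebook at another container.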
Technologies Used
Technology                      Purpose
Azure Synapse                   Analytics & Spark processing
Azure Data Lake Storage Gen2    File storage
Apache Spark (PySpark)          ML model & data transformation
Power BI                        Visualization
K-Means Algorithm               Clustering
CSV/Parquet                     Data formats
Implementation
1. Upload CSV: Upload train.csv to Azure Data Lake.
2. Data Processing: Read the file with Spark using read.csv() and select the
relevant columns.
3. Feature Engineering: Combine the numerical features using VectorAssembler
and scale them with StandardScaler.
4. Clustering: Apply K-Means to the scaled features to assign each customer to
a segment.
5. Output: Write the results to output/clustered_customers as Parquet.
6. Visualization: Load the results into Power BI for insights.
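To make the feature scaling and the K-Means clustering named in the objectives more concrete, here is a minimal sketch in plain Python rather than PySpark, so it runs without a Spark pool. It is not the project's notebook code: the function names, the two feature columns, and the six sample rows are all assumptions made up for illustration.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n) or 1.0
        for d in range(dims)
    ]
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up rows: [age, purchases per month]. Not taken from train.csv.
rows = [[22, 1], [25, 2], [24, 1], [58, 9], [61, 10], [60, 8]]
scaled = standardize(rows)
labels, centroids = kmeans(scaled, k=2)
print(labels)
```

Scaling before clustering matters because K-Means relies on Euclidean distance, so a column with a large numeric range would otherwise dominate the result. Note that this sketch centers every column, whereas Spark's StandardScaler only centers when withMean is enabled, so the two will not produce identical numbers.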
Dataset Used: train.csv
The dataset used in this project is train.csv, which is referred to as
amazon.csv once it has been uploaded to Azure Data Lake.
Features in the dataset:
Customer ID: Unique identifier for each customer
Age: Age of the customer
Gender: Gender (Male/Female)
Region: Travel origin and destination of the customer
Usefulness:
Understand spending habits
Target different income and age groups
Improve customer retention
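The column selection described for this dataset can be sketched with the standard csv module. The header spellings, the sample rows, and the rule of dropping rows without a numeric age are assumptions for illustration; the real train.csv may differ.

```python
import csv
import io

# Made-up sample in the assumed layout of the dataset; not real project data.
SAMPLE_CSV = """Customer ID,Age,Gender,Region
C001,34,Female,North
C002,,Male,South
C003,45,Male,North
"""


def load_customers(text):
    """Parse CSV text, keep the listed columns, and drop rows without a numeric age."""
    customers = []
    for row in csv.DictReader(io.StringIO(text)):
        age = row["Age"].strip()
        if not age.isdigit():
            continue  # skip rows with a missing or non-numeric age
        customers.append(
            {
                "customer_id": row["Customer ID"],
                "age": int(age),
                "gender": row["Gender"],
                "region": row["Region"],
            }
        )
    return customers


customers = load_customers(SAMPLE_CSV)
print(customers)
```

Converting Age to an integer at load time keeps later numeric steps, such as scaling, from having to deal with strings or blanks.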
Data Flow Diagram
Screenshots of Output
Results and Discussions
The clustering output grouped customers into 4 segments.
Segments were based on age, income, and frequency of purchases.
Power BI visualizations showed patterns like:
o Young frequent buyers
o High-income but low-frequency customers
o Loyal low-income users
Businesses can target each segment with different strategies (e.g., discount
offers, loyalty programs).
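One way to arrive at descriptive labels such as those above is to average each feature within a segment and compare the averages. The sketch below assumes clustered rows carrying a segment label. The field names and all values are made up for illustration and are not results from this project.

```python
from collections import defaultdict
from statistics import mean


def profile_segments(rows, label_key="segment"):
    """Return the per-segment mean of every feature plus the segment size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    profiles = {}
    for label, members in sorted(groups.items()):
        features = [key for key in members[0] if key != label_key]
        profile = {f: mean(m[f] for m in members) for f in features}
        profile["size"] = len(members)
        profiles[label] = profile
    return profiles


# Made-up clustered rows, for illustration only.
clustered = [
    {"age": 22, "income": 30000, "purchases": 12, "segment": 0},
    {"age": 24, "income": 34000, "purchases": 10, "segment": 0},
    {"age": 51, "income": 90000, "purchases": 2, "segment": 1},
    {"age": 55, "income": 110000, "purchases": 4, "segment": 1},
]

profiles = profile_segments(clustered)
for label, profile in profiles.items():
    print(label, profile)
```

In this toy example, segment 0 would read as young frequent buyers and segment 1 as high-income, low-frequency customers. The same kind of summary can be built in Power BI by averaging each feature per segment.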
Conclusion and Future Work
This project demonstrates how to build a scalable, cloud-based customer
segmentation solution using Azure Synapse Analytics. By leveraging Azure Data
Lake Storage for data ingestion, Synapse Spark for data processing and machine
learning, and Power BI for visualization, we created customer clusters that
businesses can use to make data-driven decisions.
The combination of big data processing, machine learning, and interactive
dashboards allows organizations to understand their customers better and unlock
greater business value. This architecture is flexible and can be extended for
real-time analytics, integration with CRM systems, or predictive modeling in
future enhancements.
References
1. Microsoft Azure Docs - https://learn.microsoft.com/azure
2. K-Means Clustering - scikit-learn documentation
3. Apache Spark MLlib Guide - https://spark.apache.org/docs/latest/ml-guide.html
4. Power BI Documentation - https://learn.microsoft.com/power-bi
5. Customer Segmentation in Retail: A Literature Review - ResearchGate