This project represents the culmination of my training in the Microsoft Data Engineer Track at DEBI. It encompasses a complete data engineering pipeline, spanning extraction, transformation, loading, model building, and visualization, all within the Azure cloud environment. The goal of this project is to build a robust and scalable analytics platform for Uber trip data, enabling deeper insights into trip patterns, fare prediction, and ultimately, better business decisions.
This project involved the following steps:
- **Data Acquisition:** The raw dataset, `uber_data.csv`, was sourced from a public GitHub repository.
- **Initial Database Creation:** An initial database, `UberTripsDB`, was created, and the downloaded `uber_data.csv` file was imported into it using a flat file import process. This database served as a staging area for the raw data.
- **Data Extraction using Azure Data Factory (ADF):** ADF was configured to extract the `uber_data.csv` file directly from the GitHub HTTP source.
- **Data Storage in Azure Blob Storage:** The extracted CSV file was stored in an Azure Storage Account, in a container designated for raw data (`RawData`).
- **Data Transformation using Azure Databricks:** The raw data was then transformed in an Azure Databricks notebook. This involved data cleaning, feature engineering, and outlier handling. The transformed data was saved as CSV files in a separate container named `TransformedData`. The transformation process is detailed in the `Transformation.ipynb` notebook.
- **Data Warehouse (DWH) Creation in Azure Synapse Analytics:** A dedicated data warehouse was created in Azure Synapse Analytics. The star schema design for this DWH is visualized in the Data Model section below.
- **Data Loading into Azure Synapse Analytics:** The transformed data from the `TransformedData` container was loaded into the Azure Synapse DWH tables, including a key analytics table named `UberTable_Analytics`, which serves as the foundation for the dashboard.
- **Data Inspection in SQL Server Management Studio (SSMS):** The DWH was connected to SSMS to perform detailed data validation and inspection. A backup of the DWH is available as `UberTripsDWH.bak`.
- **Machine Learning Model Development:** A machine learning model was developed to predict Uber fares based on various features. The model development process (data preprocessing, feature engineering, model selection, training, evaluation, and hyperparameter tuning) is documented in the `UberML.ipynb` notebook. XGBoost was ultimately chosen as the best-performing model, achieving an R-squared score of 0.84. The trained model was saved as `uber_fare_prediction_model_improved.joblib`.
- **Dashboard Creation:** Finally, a Power BI dashboard (`Uber Dashboard.pbix`) was created on top of the `UberTable_Analytics` table in the DWH. This dashboard provides interactive visualizations and insights derived from the processed data.
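To illustrate the kind of cleaning and outlier handling performed in the Databricks transformation step, here is a minimal pandas sketch. The column names and thresholds below are assumptions (the authoritative logic lives in `Transformation.ipynb`):

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the cleaning step; column names follow the
    public Uber/NYC trip-record convention and are assumptions here."""
    df = df.copy()
    # Parse pickup/dropoff timestamps so datetime features can be derived later.
    for col in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
        df[col] = pd.to_datetime(df[col])
    # Drop exact duplicate rows produced by repeated ingestion.
    df = df.drop_duplicates()
    # Remove physically impossible or clearly erroneous records.
    df = df[(df["passenger_count"] > 0) & (df["trip_distance"] > 0)]
    df = df[df["fare_amount"] > 0]
    # Cap extreme distance outliers at the 99th percentile (assumed policy).
    cap = df["trip_distance"].quantile(0.99)
    df = df[df["trip_distance"] <= cap]
    return df.reset_index(drop=True)

demo = pd.DataFrame({
    "tpep_pickup_datetime": ["2016-03-01 00:00:00", "2016-03-01 01:00:00"],
    "tpep_dropoff_datetime": ["2016-03-01 00:10:00", "2016-03-01 01:05:00"],
    "passenger_count": [1, 0],  # second row is invalid (zero passengers)
    "trip_distance": [2.5, 1.0],
    "fare_amount": [9.5, 4.0],
})
cleaned = clean_trips(demo)
print(len(cleaned))  # the invalid row is filtered out, leaving 1 record
```

In the actual pipeline this logic ran on Databricks (typically as PySpark rather than pandas), with the results written back to the `TransformedData` container.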
The machine learning model for fare prediction was developed using Python with the following libraries:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn (including StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, RandomForestRegressor, train_test_split, cross_val_score, GridSearchCV, mean_squared_error, r2_score)
- xgboost
- scipy
- joblib
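The listed scikit-learn components typically fit together as a single preprocessing-plus-model pipeline. Below is a hedged sketch on synthetic data; the feature names are assumptions, and `RandomForestRegressor` stands in for brevity even though XGBoost was the final model choice (see `UberML.ipynb` for the real pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature layout; the real feature set is in UberML.ipynb.
numeric = ["trip_distance", "passenger_count"]
categorical = ["pickup_hour"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),                       # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),  # encode categories
])

model = Pipeline([
    ("prep", preprocess),
    ("reg", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Tiny synthetic dataset: fare is roughly proportional to distance.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 10, 200),
    "passenger_count": rng.integers(1, 5, 200),
    "pickup_hour": rng.integers(0, 24, 200),
})
y = 2.5 + 2.0 * X["trip_distance"] + rng.normal(0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model.fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(f"R^2 on synthetic data: {score:.2f}")
```

A `GridSearchCV` can wrap this same pipeline for hyperparameter tuning, and `joblib.dump(model, "...")` persists the fitted pipeline, which matches how the final model was saved.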
The model uses features such as pickup/dropoff times and locations, passenger count, trip distance, and engineered features like rush hour, weekend, and night flags to predict the fare amount. The model training process includes data preprocessing, outlier handling, feature selection, and hyperparameter tuning. Refer to the UberML.ipynb notebook for the complete code and detailed analysis.
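The engineered time flags mentioned above can be derived from the pickup timestamp. The following is a sketch with assumed hour boundaries (the exact cutoffs used in `UberML.ipynb` may differ):

```python
import pandas as pd

def add_time_flags(df: pd.DataFrame, pickup_col: str = "tpep_pickup_datetime") -> pd.DataFrame:
    """Derive rush-hour, weekend, and night flags from the pickup timestamp.
    Hour boundaries below are assumptions, not the notebook's exact values."""
    df = df.copy()
    ts = pd.to_datetime(df[pickup_col])
    hour = ts.dt.hour
    weekday = ts.dt.dayofweek < 5  # Monday=0 ... Friday=4
    # Rush hour: assumed 7-9 AM and 4-7 PM on weekdays.
    df["is_rush_hour"] = (weekday & (hour.between(7, 9) | hour.between(16, 19))).astype(int)
    df["is_weekend"] = (~weekday).astype(int)
    # Night: assumed 10 PM through 5 AM.
    df["is_night"] = ((hour >= 22) | (hour <= 5)).astype(int)
    return df

demo = pd.DataFrame({"tpep_pickup_datetime": [
    "2016-03-04 08:30:00",  # Friday morning rush
    "2016-03-05 23:15:00",  # Saturday night
]})
flags = add_time_flags(demo)
print(flags[["is_rush_hour", "is_weekend", "is_night"]].values.tolist())
# → [[1, 0, 0], [0, 1, 1]]
```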
This project provided valuable experience in building and deploying a cloud-based data engineering pipeline. Key learnings included working with Azure Data Factory, Databricks, Synapse Analytics, and Power BI. Challenges encountered during the project included handling data quality issues and optimizing the machine learning model. Future work could involve exploring real-time data ingestion and more advanced predictive modeling techniques.
I would like to express my sincere gratitude to my team members for their invaluable contributions to this project. I oversaw the project review and led its overall development.
- Mohamed Zabady ([GitHub](https://github.com/zabady9)) - Handled the machine learning aspects of the project.
- Yasser Elsayed ([GitHub](https://github.com/yasserelsayed7)) - Managed the Azure Data Factory pipeline development.
- Ziad Elsayed ([GitHub](https://github.com/zyadhozain96)) - Contributed to the data transformation process in Databricks.
- Abd-Alaah Mostafa ([GitHub](https://github.com/bedoo123)) - Assisted with data warehouse design in Azure Synapse.
- Saber Elsayed ([GitHub](https://github.com/Saber30454)) - Supported the Power BI dashboard development.
Their dedication, collaboration, and insightful feedback were instrumental in the successful completion of this project. I truly appreciate their support and teamwork.