
Understanding of the Retail Customer Segmentation use case:

High level understanding: This use case gives an overall idea of how a project should be approached
end to end: choosing the tech stack, infrastructure, and tools suited to the problem; ensuring the
solution scales; defining the layers in which data is maintained; applying transformations; deciding
how long transient data is retained after processing; identifying the personas who access the data;
and forming a complete picture of the data flow from sourcing to consumption, with multiple layers in between.

This project integrates multiple technologies and data engineering (DE) strategies to enable
efficient data processing, transformation, and analysis. It spans various layers of a modern data
architecture and supports different personas involved in data handling and consumption.

In addition, the project gives exposure to various technologies in the big data ecosystem, to
methodologies for handling data, and to data storage practices such as the data lake and the
lakehouse. It also builds an understanding of the different kinds of source systems involved.

The points below highlight the approaches, methods, and concepts used to handle, process, and
store data for this use case.

Data Management Concepts Learned


Creating Tables

 With a specified path:

o Data is stored at the specified location, independent of the system's default storage.

o Used for external table creation.

 Without a specified path:

o Data is stored in the system's default managed storage location.

o Typically used for managed tables.
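For illustration, a minimal HiveQL sketch of both variants; the table names, columns, and HDFS path below are hypothetical and not taken from the actual use case.

    -- External table: a path is specified, so data lives at that location
    CREATE EXTERNAL TABLE customers_ext (
        customer_id   INT,
        customer_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/retail/customers_ext';

    -- Managed table: no path specified, so data goes to the default warehouse location
    CREATE TABLE customers_stg (
        customer_id   INT,
        customer_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';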

Managed vs. External Tables


 Managed Tables:

o The system manages both metadata and data.

o Data is stored in a system-defined location.

o Overwriting the table replaces existing data entirely.

o Dropping a managed table removes both metadata and data.

 External Tables:

o Only metadata is managed by the system; data is stored externally.

o The system does not control the lifecycle of the data.

o New data is appended unless explicitly handled otherwise.

o Dropping an external table only removes metadata, while data remains in storage.
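As a small sketch of the drop behaviour described above, reusing the hypothetical tables from the previous example:

    -- Dropping a managed table removes the metadata AND the underlying data files
    DROP TABLE customers_stg;

    -- Dropping an external table removes only the metadata;
    -- the files under /data/retail/customers_ext remain in HDFS
    DROP TABLE customers_ext;
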
Data Processing Stages
Staging Layer

 Analyze the source data: This step helps in understanding its structure, the number of columns, and the delimiter used.

 Create a managed table: Define a table based on the number of columns and delimiter used
in the source data.

 Load data: Use the LOAD command to ingest data into the managed table (INSERT
statements cannot be used here).
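A rough sketch of this staging step in HiveQL; the table name, columns, delimiter, and file path are assumptions rather than the actual source definitions.

    -- Managed staging table matching the source file's column count and delimiter
    CREATE TABLE stg_customer_txn (
        customer_id INT,
        txn_date    STRING,
        amount      DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Ingest the raw file from HDFS into the staging table (LOAD, not INSERT)
    LOAD DATA INPATH '/landing/retail/customer_txn.csv' INTO TABLE stg_customer_txn;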

Curated/Transformed Layer

 Analyze business requirements: This step helps identify the necessary transformations to be applied.

 Create an external table:

o Define a table with required columns.

o Specify a location for storing transformed data.

 Insert transformed data: Select columns from the managed table and insert them into the
external table.

 Validate data: Check the output in the specified HDFS location.

 Cleanup: If the results are correct, drop the managed table (staging layer) and the external table (curated layer) to free up resources; because the curated table is external, its data files remain at the specified location for downstream use.
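A minimal sketch of these curated-layer steps, reusing the hypothetical staging table from above; the transformation (total spend per customer) is only an assumed example of a business rule.

    -- External table for the curated output, with an explicit HDFS location
    CREATE EXTERNAL TABLE cur_customer_spend (
        customer_id INT,
        total_spend DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/curated/retail/customer_spend';

    -- Transform while inserting: select from the managed staging table
    INSERT OVERWRITE TABLE cur_customer_spend
    SELECT customer_id, SUM(amount)
    FROM stg_customer_txn
    GROUP BY customer_id;

    -- After validating the files under /curated/retail/customer_spend,
    -- drop the staging and curated tables; the curated files stay in HDFS
    DROP TABLE stg_customer_txn;
    DROP TABLE cur_customer_spend;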

Outbound Layer

 Export data: Use Sqoop to export data from the external table's HDFS location to a MySQL
table with required columns.

Data Loading and Manipulation


Load Command

 Used to load data from external sources into tables.

 Loads files as-is, so the file format should match the table's definition (e.g., text/CSV, JSON, Parquet).

 Can be used to append new data or overwrite existing data.
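A short illustrative sketch of the two modes (the file paths and table name are assumptions):

    -- Append: the new file is added alongside the table's existing data
    LOAD DATA INPATH '/landing/retail/day2.csv' INTO TABLE stg_customer_txn;

    -- Overwrite: the table's existing data is replaced by the new file
    LOAD DATA INPATH '/landing/retail/day2_fix.csv' OVERWRITE INTO TABLE stg_customer_txn;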

Insert Select

 Inserts data into a table based on the output of a SELECT statement.

 Allows transformations and filtering while inserting data.

 Supports inserting into managed and external tables.
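For example, a hedged sketch that filters and transforms while inserting (the column names and the 500 threshold are assumptions):

    -- Keep only high-value transactions, rounding the aggregated spend
    INSERT INTO TABLE cur_customer_spend
    SELECT customer_id, ROUND(SUM(amount), 2)
    FROM stg_customer_txn
    WHERE amount > 500
    GROUP BY customer_id;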


Data Persistence

 How Data is Stored:

o Managed tables store data in Hive-managed storage.

o External tables reference data stored externally.

 Data Retention:

o Managed table data is removed when the table is dropped.

o External table data persists beyond the table's lifecycle.

 Overwrite vs. Append:

o Overwriting replaces existing data with new data.

o Appending adds new data to the existing dataset.
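In HiveQL terms the difference looks roughly like this (table names carried over from the hypothetical sketches above):

    -- Overwrite: replaces everything currently in the target table
    INSERT OVERWRITE TABLE cur_customer_spend
    SELECT customer_id, SUM(amount) FROM stg_customer_txn GROUP BY customer_id;

    -- Append: adds new rows on top of the existing dataset
    INSERT INTO TABLE cur_customer_spend
    SELECT customer_id, SUM(amount) FROM stg_customer_txn GROUP BY customer_id;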

ETL & ELT Processes


Extract-Load (EL)

 Data is extracted from a source and loaded into the target system without transformation.

 Used for raw data storage before processing.

Extract-Load-Transform (ELT)

 Data is extracted, loaded into the target system, and then transformed within the storage
layer.

 Enables better scalability and performance by utilizing native processing capabilities.

Additional Learnings
 Working with different file formats:

o Text-File, CSV

o Structured / Semi-structured / Flat-File / Optimized / Serialized formats

 Different ways of loading data into Hive tables:

o LOAD command

o INSERT SELECT

o Overwrite vs. Append

 Converting positional (delimiter-separated) fields into named columns, where the table schema assigns a name to each position.

 Data Warehousing and Lakehouse Concepts:

o Warehouse + Data Lake = Lakehouse

o Different Stages: Sourcing to Transient to Raw to Curated to Consumption in a Lakehouse built on top of a Data Lake.

Technologies Used

 Linux – File system operations, automation, and scripting.

 HDFS (Hadoop Distributed File System) – Distributed data storage.

 Sqoop – Data ingestion/export between HDFS and relational databases.

 Hive – Data warehousing and querying engine.

 MapReduce (MR) – Distributed data processing.

 YARN (Yet Another Resource Negotiator) – Resource management and job scheduling.

 SQL – Querying and transforming structured data.

Data Engineering (DE) Strategies

 Curation – Cleaning, enriching, and structuring raw data for further use.

 Merging – Combining multiple datasets for consistency and completeness.

 ETL (Extract-Transform-Load) – Traditional method where data is transformed before loading.

 ELT (Extract-Load-Transform) – Modern approach where transformation happens after loading into the system.

 EL (Extract-Load) – Loading raw data for later transformation based on need.

Data Analytics (DA) Strategies

 Pivoting – Reshaping data for easier analysis and reporting.

 Exploding/Transposing – Transforming row-column structures for better insights.

 Top-N Analysis – Identifying top-performing entities in datasets.
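As a rough HiveQL sketch of a Top-N query; the table, columns, and N = 3 are assumptions, reusing the hypothetical curated table from earlier.

    -- Top 3 customers by total spend, using a window function for ranking
    SELECT customer_id, total_spend
    FROM (
        SELECT customer_id,
               total_spend,
               ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS rn
        FROM cur_customer_spend
    ) ranked
    WHERE rn <= 3;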

Data Processing Layers

 Transient Layer – Temporary storage for raw ingested data before processing.

 Raw Layer – Stores unprocessed data in its native format.

 Curated Layer – Cleaned, transformed, and structured data ready for analytics and reporting.

Personas Involved

 Data Engineers (DE) – Build and maintain the data pipeline, ensuring efficient ingestion and
transformation.

 Data Analysts (DA) – Perform deep analysis, derive insights, and create analytical reports.

 Report Builders – Design and generate dashboards/reports for business users.

 Clients – Consume reports and insights for decision-making.

 Data Governance (DG) Team – Ensure compliance, security, and quality of data.

 Architects (ARCH) – Design the overall data framework and technology stack.
