
Understanding of the Retail Customer Segmentation use case:

High level understanding: This use case gives an overall idea of how a project should be approached
end to end: choosing the tech stack, infrastructure, and tools suited to the problem; ensuring the
solution scales; defining the layers in which data is maintained; applying transformations; deciding
how long transient data is retained after processing; identifying the personas who access the data;
and forming a complete picture of the data flow from sourcing to consumption, with multiple layers in between.

This project integrates multiple technologies and data engineering (DE) strategies to enable
efficient data processing, transformation, and analysis. It spans various layers of a modern data
architecture and supports different personas involved in data handling and consumption.

In addition, the project gives exposure to various technologies in the big data ecosystem, to
methodologies for handling data, and to data storage practices such as the data lake and the
lakehouse. It also builds an understanding of the different kinds of source systems involved.

The points below highlight the approaches, methods, and concepts used to handle, process, and
store data for this use case.

Data Management Concepts Learned


Creating Tables

 With a specified path:

o Data is stored at the specified location, independent of the system's default storage.

o Used for external table creation.

 Without a specified path:

o Data is stored in the system's default managed storage location.

o Typically used for managed tables.
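For illustration, a minimal HiveQL sketch of both variants; the table names, columns, and HDFS path below are hypothetical and not taken from the actual use case.

    -- External table: a path is specified, so data lives at that location
    CREATE EXTERNAL TABLE customers_ext (
        customer_id   INT,
        customer_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/retail/customers_ext';

    -- Managed table: no path specified, so data goes to the default warehouse location
    CREATE TABLE customers_stg (
        customer_id   INT,
        customer_name STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';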

Managed vs. External Tables


 Managed Tables:

o The system manages both metadata and data.

o Data is stored in a system-defined location.

o Overwriting the table replaces existing data entirely.

o Dropping a managed table removes both metadata and data.

 External Tables:

o Only metadata is managed by the system; data is stored externally.

o The system does not control the lifecycle of the data.

o New data is appended unless explicitly handled otherwise.

o Dropping an external table only removes metadata, while data remains in storage.
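As a small sketch of the drop behaviour described above, reusing the hypothetical tables from the previous example:

    -- Dropping a managed table removes the metadata AND the underlying data files
    DROP TABLE customers_stg;

    -- Dropping an external table removes only the metadata;
    -- the files under /data/retail/customers_ext remain in HDFS
    DROP TABLE customers_ext;
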
Data Processing Stages
Staging Layer

 Analyze the source data: This step helps in understanding its structure, the number of columns, and the delimiter used.

 Create a managed table: Define a table based on the number of columns and delimiter used
in the source data.

 Load data: Use the LOAD command to ingest data into the managed table (INSERT
statements cannot be used here).
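A rough sketch of this staging step in HiveQL; the table name, columns, delimiter, and file path are assumptions rather than the actual source definitions.

    -- Managed staging table matching the source file's column count and delimiter
    CREATE TABLE stg_customer_txn (
        customer_id INT,
        txn_date    STRING,
        amount      DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Ingest the raw file from HDFS into the staging table (LOAD, not INSERT)
    LOAD DATA INPATH '/landing/retail/customer_txn.csv' INTO TABLE stg_customer_txn;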

Curated/Transformed Layer

 Analyze business requirements: This step helps identify the necessary transformations to be applied.

 Create an external table:

o Define a table with required columns.

o Specify a location for storing transformed data.

 Insert transformed data: Select columns from the managed table and insert them into the
external table.

 Validate data: Check the output in the specified HDFS location.

 Cleanup: If the results are correct, drop the managed table (staging layer) and the external table (curated layer) to free up resources; because the curated table is external, its data files remain at the specified location for downstream use.
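A minimal sketch of these curated-layer steps, reusing the hypothetical staging table from above; the transformation (total spend per customer) is only an assumed example of a business rule.

    -- External table for the curated output, with an explicit HDFS location
    CREATE EXTERNAL TABLE cur_customer_spend (
        customer_id INT,
        total_spend DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/curated/retail/customer_spend';

    -- Transform while inserting: select from the managed staging table
    INSERT OVERWRITE TABLE cur_customer_spend
    SELECT customer_id, SUM(amount)
    FROM stg_customer_txn
    GROUP BY customer_id;

    -- After validating the files under /curated/retail/customer_spend,
    -- drop the staging and curated tables; the curated files stay in HDFS
    DROP TABLE stg_customer_txn;
    DROP TABLE cur_customer_spend;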

Outbound Layer

 Export data: Use Sqoop to export data from the external table's HDFS location to a MySQL
table with required columns.

Data Loading and Manipulation


Load Command

 Used to load data from external sources into tables.

 Loads files as-is, so the file format should match the table's definition (e.g., text/CSV, JSON, Parquet).

 Can be used to append new data or overwrite existing data.
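A short illustrative sketch of the two modes (the file paths and table name are assumptions):

    -- Append: the new file is added alongside the table's existing data
    LOAD DATA INPATH '/landing/retail/day2.csv' INTO TABLE stg_customer_txn;

    -- Overwrite: the table's existing data is replaced by the new file
    LOAD DATA INPATH '/landing/retail/day2_fix.csv' OVERWRITE INTO TABLE stg_customer_txn;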

Insert Select

 Inserts data into a table based on the output of a SELECT statement.

 Allows transformations and filtering while inserting data.

 Supports inserting into managed and external tables.
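For example, a hedged sketch that filters and transforms while inserting (the column names and the 500 threshold are assumptions):

    -- Keep only high-value transactions, rounding the aggregated spend
    INSERT INTO TABLE cur_customer_spend
    SELECT customer_id, ROUND(SUM(amount), 2)
    FROM stg_customer_txn
    WHERE amount > 500
    GROUP BY customer_id;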


Data Persistence

 How Data is Stored:

o Managed tables store data in Hive-managed storage.

o External tables reference data stored externally.

 Data Retention:

o Managed table data is removed when the table is dropped.

o External table data persists beyond the table's lifecycle.

 Overwrite vs. Append:

o Overwriting replaces existing data with new data.

o Appending adds new data to the existing dataset.
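In HiveQL terms the difference looks roughly like this (table names carried over from the hypothetical sketches above):

    -- Overwrite: replaces everything currently in the target table
    INSERT OVERWRITE TABLE cur_customer_spend
    SELECT customer_id, SUM(amount) FROM stg_customer_txn GROUP BY customer_id;

    -- Append: adds new rows on top of the existing dataset
    INSERT INTO TABLE cur_customer_spend
    SELECT customer_id, SUM(amount) FROM stg_customer_txn GROUP BY customer_id;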

ETL & ELT Processes


Extract-Load (EL)

 Data is extracted from a source and loaded into the target system without transformation.

 Used for raw data storage before processing.

Extract-Load-Transform (ELT)

 Data is extracted, loaded into the target system, and then transformed within the storage
layer.

 Enables better scalability and performance by utilizing native processing capabilities.

Additional Learnings
 Working with different file formats:

o Text-File, CSV

o Structured / Semi-structured / Flat-File / Optimized / Serialized formats

 Different ways of loading data into Hive tables:

o LOAD command

o INSERT SELECT

o Overwrite vs. Append

 Converting positional (delimiter-separated) fields into named columns, where the table schema assigns a name to each position.

 Data Warehousing and Lakehouse Concepts:

o Warehouse + Data Lake = Lakehouse

o Different Stages: Sourcing to Transient to Raw to Curated to Consumption in a Lakehouse built on top of a Data Lake.

Technologies Used

 Linux – File system operations, automation, and scripting.

 HDFS (Hadoop Distributed File System) – Distributed data storage.

 Sqoop – Data ingestion/export between HDFS and relational databases.

 Hive – Data warehousing and querying engine.

 MapReduce (MR) – Distributed data processing.

 YARN (Yet Another Resource Negotiator) – Resource management and job scheduling.

 SQL – Querying and transforming structured data.

Data Engineering (DE) Strategies

 Curation – Cleaning, enriching, and structuring raw data for further use.

 Merging – Combining multiple datasets for consistency and completeness.

 ETL (Extract-Transform-Load) – Traditional method where data is transformed before loading.

 ELT (Extract-Load-Transform) – Modern approach where transformation happens after loading into the system.

 EL (Extract-Load) – Loading raw data for later transformation based on need.

Data Analytics (DA) Strategies

 Pivoting – Reshaping data for easier analysis and reporting.

 Exploding/Transposing – Transforming row-column structures for better insights.

 Top-N Analysis – Identifying top-performing entities in datasets.
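As a rough HiveQL sketch of a Top-N query; the table, columns, and N = 3 are assumptions, reusing the hypothetical curated table from earlier.

    -- Top 3 customers by total spend, using a window function for ranking
    SELECT customer_id, total_spend
    FROM (
        SELECT customer_id,
               total_spend,
               ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS rn
        FROM cur_customer_spend
    ) ranked
    WHERE rn <= 3;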

Data Processing Layers

 Transient Layer – Temporary storage for raw ingested data before processing.

 Raw Layer – Stores unprocessed data in its native format.

 Curated Layer – Cleaned, transformed, and structured data ready for analytics and reporting.

Personas Involved

 Data Engineers (DE) – Build and maintain the data pipeline, ensuring efficient ingestion and
transformation.

 Data Analysts (DA) – Perform deep analysis, derive insights, and create analytical reports.

 Report Builders – Design and generate dashboards/reports for business users.

 Clients – Consume reports and insights for decision-making.

 Data Governance (DG) Team – Ensure compliance, security, and quality of data.

 Architects (ARCH) – Design the overall data framework and technology stack.
