Data Lineage: Explanation and Example

Data lineage refers to the lifecycle and journey of data as it flows through different systems,
applications, and processes within an organization. It provides a detailed map that traces the data's
origins, movements, transformations, and eventual destination. This traceability ensures data integrity,
accuracy, and compliance with regulatory requirements.

Key Concepts of Data Lineage

1. Source: Where the data originates.
2. Transformation: Any changes or processing applied to the data as it moves through the system.
3. Destination: Where the data ends up after all transformations.
4. Metadata: Information about the data, such as its type, structure, and rules applied during transformations.
5. Lineage Tracking: The process of documenting each step the data takes from source to destination.

Example: Data Lineage in a Banking Scenario

Scenario: A bank processes transaction data from various branches, transforms it for reporting, and
stores it in a data warehouse for analysis.

Step-by-Step Data Lineage Example

1. Data Source: Branch Transaction Systems
o Data is collected from branch transaction systems, where transactions like deposits, withdrawals, and transfers are recorded.
o Example Source Tables: Branch_Transactions
2. Data Ingestion: ETL (Extract, Transform, Load) Process
o Data from the branch transaction systems is extracted and loaded into a staging area for
initial processing.
o Example Staging Tables: Staging_Branch_Transactions
3. Data Transformation: Data Cleansing and Aggregation
o Data is cleaned to remove duplicates, correct errors, and standardize formats.
o Transactions are aggregated by date, branch, and transaction type.
o Example Transformation Rules: Convert all date formats to YYYY-MM-DD, aggregate
transaction amounts by branch and date.
4. Data Storage: Data Warehouse
o Transformed data is loaded into the data warehouse.
o Example Data Warehouse Tables: DW_Branch_Summary
5. Data Usage: Reporting and Analytics
o Business intelligence tools query the data warehouse to generate reports and dashboards
for business analysis.
o Example Reports: Daily transaction summary, Branch performance reports.
6. Metadata: Documenting the Lineage
o Metadata is maintained to describe the source, transformations, and destination of the data.
o Example Metadata Information: Source table names, transformation logic, data warehouse table schemas. A minimal sketch of a lineage-tracking table follows this list.
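To make lineage tracking concrete, here is a minimal, illustrative sketch of a lineage-metadata table and one entry for this scenario. The Lineage_Log table and its columns are hypothetical, not part of any standard; real lineage tools capture far richer metadata than this.

-- Hypothetical lineage-tracking table: one row per source-to-target hop.
CREATE TABLE Lineage_Log (
    Lineage_ID      INT PRIMARY KEY,
    Source_Table    VARCHAR(100),
    Target_Table    VARCHAR(100),
    Transform_Logic VARCHAR(500),  -- the rule or SQL applied in this hop
    Loaded_At       TIMESTAMP
);

-- Record the staging-to-warehouse hop from this example.
INSERT INTO Lineage_Log VALUES
(1, 'Staging_Branch_Transactions', 'DW_Branch_Summary',
 'Aggregate SUM(Amount) by Branch_ID, Transaction_Date, Transaction_Type',
 CURRENT_TIMESTAMP);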

Example Data Lineage Diagram

To visualize this example, you can create a data lineage diagram showing the flow from source to
destination.

Branch Transaction Systems (Source) ->
ETL Process (Ingestion) ->
Staging_Branch_Transactions (Staging) ->
Data Transformation (Transformation) ->
DW_Branch_Summary (Data Warehouse) ->
BI Reports (Destination)

Detailed Example

1. Source: Branch_Transactions table in branch systems.
o Columns: Transaction_ID, Branch_ID, Transaction_Date, Transaction_Type, Amount
2. Staging: Staging_Branch_Transactions table in the staging area.
o Transformation: Clean data (e.g., remove duplicates; see the sketch after this list).
o Columns: Transaction_ID, Branch_ID, Transaction_Date, Transaction_Type, Amount
3. Transformation:
o Convert all date formats to YYYY-MM-DD.
o Aggregate data by Branch_ID, Transaction_Date, and Transaction_Type.
o Example Transformation SQL:

SELECT
    Branch_ID,
    Transaction_Date,
    Transaction_Type,
    SUM(Amount) AS Total_Amount
FROM Staging_Branch_Transactions
GROUP BY Branch_ID, Transaction_Date, Transaction_Type;

4. Destination: DW_Branch_Summary table in the data warehouse (see the loading sketch after this list).
o Columns: Branch_ID, Transaction_Date, Transaction_Type, Total_Amount
5. Usage:
o Generate reports such as Daily_Branch_Summary showing total transactions per branch
and day.
o Example Report SQL:

SELECT
    Branch_ID,
    Transaction_Date,
    SUM(Total_Amount) AS Daily_Total
FROM DW_Branch_Summary
GROUP BY Branch_ID, Transaction_Date;
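Two of the steps above have no SQL shown. Here is a minimal sketch of step 2 (deduplication) and step 4 (loading), reusing the tables from this example. The Staging_Clean table name is hypothetical, and CREATE TABLE ... AS syntax varies by database (Db2, for instance, requires AS (...) WITH DATA):

-- Step 2 (cleansing): copy the staging data into a clean table, dropping exact duplicate rows.
CREATE TABLE Staging_Clean AS
SELECT DISTINCT Transaction_ID, Branch_ID, Transaction_Date, Transaction_Type, Amount
FROM Staging_Branch_Transactions;

-- Step 4 (loading): populate the warehouse table from the step 3 aggregation.
INSERT INTO DW_Branch_Summary (Branch_ID, Transaction_Date, Transaction_Type, Total_Amount)
SELECT Branch_ID, Transaction_Date, Transaction_Type, SUM(Amount)
FROM Staging_Clean
GROUP BY Branch_ID, Transaction_Date, Transaction_Type;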

Importance of Data Lineage

1. Data Quality: Ensures data accuracy and consistency by tracking its transformations.
2. Compliance: Helps in meeting regulatory requirements by providing a clear audit trail.
3. Troubleshooting: Facilitates debugging and resolving data issues by tracing errors back to their
source.
4. Impact Analysis: Assesses the impact of changes in data sources or transformations on
downstream systems.

By maintaining clear data lineage, organizations can ensure reliable data management, enhance
transparency, and support data governance initiatives.

1. Data Governance

Definition: Data governance is the overall management of the availability, usability, integrity, and
security of data used in an enterprise. It involves a set of policies, procedures, and standards to ensure
data is managed effectively and used properly across the organization.

Key Components:

 Policies and Procedures: Guidelines for data usage, data quality, and data privacy.
 Data Stewardship: Roles and responsibilities for managing data assets.
 Data Quality Management: Processes to ensure data accuracy, completeness, and consistency.
 Compliance: Ensuring data practices comply with legal and regulatory requirements.
 Data Security: Protecting data from unauthorized access and breaches.

Example: A financial institution implements a data governance program to manage customer data. They
establish policies to ensure data is collected accurately during onboarding, procedures for regular data
quality checks, and roles for data stewards responsible for maintaining data integrity. Compliance with
GDPR (General Data Protection Regulation) is enforced to protect customer privacy.
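As one concrete illustration of a "regular data quality check," a data steward might run something like the following. The Customers table and its columns are hypothetical, assumed here only for the sketch:

-- Hypothetical check 1: customer IDs recorded more than once.
SELECT Customer_ID, COUNT(*) AS copies
FROM Customers
GROUP BY Customer_ID
HAVING COUNT(*) > 1;

-- Hypothetical check 2: records missing a contact email.
SELECT COUNT(*) AS missing_emails
FROM Customers
WHERE Email IS NULL OR Email = '';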

2. Metadata Management

Definition: Metadata management involves the administration of data that describes other data.
Metadata includes information about data sources, structures, definitions, and usage, providing context
and meaning to data.

Types of Metadata:

 Descriptive Metadata: Information about data content, such as titles, authors, and descriptions.
 Structural Metadata: Information about data format and structure, such as tables, columns, and
data types.
 Administrative Metadata: Information for managing data, such as creation dates, modification
dates, and access permissions.

Example: A retail company uses metadata management to keep track of their product database.
Descriptive metadata includes product names and descriptions. Structural metadata details the database
schema, including table names and column data types. Administrative metadata tracks when products
were added or updated in the system and who has access to modify product data.
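Much structural metadata can be read straight from the database's own system catalog. Here is a minimal sketch using the standard INFORMATION_SCHEMA views (available in MySQL, SQL Server, and PostgreSQL, among others; the exact catalog views vary by platform), assuming the product data lives in a table named Products:

-- Structural metadata: columns and data types of the Products table.
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Products'
ORDER BY ORDINAL_POSITION;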

3. Data Cataloging

Definition: Data cataloging is the process of creating an organized inventory of data assets. It includes
collecting metadata, managing data descriptions, and making data discoverable for users within an
organization.

Key Features:

 Search and Discovery: Tools for users to find relevant data assets easily.
 Data Lineage: Information on data origins, transformations, and usage.
 Data Profiling: Summarizing data characteristics and quality metrics.
 User Collaboration: Features for users to annotate, rate, and comment on data assets.

Example: A healthcare provider implements a data catalog to manage their extensive patient records.
The catalog includes metadata for each dataset, such as patient demographics, medical histories, and
treatment plans. Data lineage information shows how patient data flows from initial collection in clinics
to final reports. Data profiling ensures data quality, and healthcare professionals can search and access
the data they need efficiently.
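As a toy illustration, a catalog's search-and-discovery feature can be thought of as queries over an inventory of datasets. The Data_Catalog table below is hypothetical and far simpler than a real catalog tool:

-- Hypothetical catalog inventory: one row per registered dataset.
CREATE TABLE Data_Catalog (
    Dataset_Name  VARCHAR(100),
    Description   VARCHAR(500),
    Owner         VARCHAR(100),
    Source_Table  VARCHAR(100),
    Quality_Score DECIMAL(3,2)  -- filled in by data profiling
);

-- Search and discovery: find datasets about patient demographics.
SELECT Dataset_Name, Description, Owner
FROM Data_Catalog
WHERE LOWER(Description) LIKE '%demographic%';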

Integration of Data Governance, Metadata Management, and Data Cataloging

Scenario: A multinational corporation is developing a customer analytics platform. To ensure data quality, security, and usability, they integrate data governance, metadata management, and data cataloging as follows:

1. Data Governance:
o Establishes policies for collecting and using customer data.
o Defines data stewards responsible for different datasets.
o Ensures compliance with data privacy regulations like GDPR.
2. Metadata Management:
o Collects metadata about customer data sources, including data lineage from collection
points (e.g., website forms) to the analytics platform.
o Documents data definitions, formats, and usage guidelines.
o Provides metadata to the data catalog for easier data discovery.
3. Data Cataloging:
o Creates an inventory of all customer data assets.
o Enables business analysts and data scientists to search for and find relevant customer data
for their analyses.
o Includes data profiling to provide insights into data quality and characteristics.
o Facilitates collaboration by allowing users to add comments and ratings to data assets.

Benefits:

 Enhanced Data Quality: Data governance ensures data is accurate, consistent, and reliable.
 Improved Data Discovery: Metadata management and data cataloging make it easy for users to
find and understand data.
 Regulatory Compliance: Data governance ensures data practices meet legal and regulatory
requirements.
 Efficient Data Usage: Integrated tools and processes streamline data access and usage, boosting
productivity and insights.

By implementing these practices, the corporation ensures that their customer data is well-managed,
easily accessible, and of high quality, enabling better decision-making and improved customer insights.

The IBM Batch Processing Lifecycle in a Banking System

The IBM batch processing lifecycle in a banking system involves a series of steps to process large
volumes of transactions efficiently. This lifecycle can be broken down into several stages, each with
specific tasks and objectives. Here is a step-by-step overview of the IBM batch processing lifecycle:

1. Job Scheduling

Objective: Plan and schedule batch jobs to ensure they run at the appropriate times without conflicts.

Tasks:

 Define batch job schedules using a job scheduler (e.g., IBM Tivoli Workload Scheduler).
 Set up job dependencies and priorities.
 Allocate system resources and time slots for each job.
 Ensure compliance with operational windows and business hours.

Example:

 Schedule nightly transaction processing to start at 11 PM after all daily operations are closed.

2. Job Initiation

Objective: Start the batch job based on the predefined schedule.

Tasks:

 Trigger batch job execution either manually or automatically.
 Ensure all prerequisite jobs have completed successfully.
 Check system readiness and resource availability.

Example:

 At 11 PM, the job scheduler initiates the transaction processing batch job.

3. Data Extraction

Objective: Extract the necessary data from source systems for processing.

Tasks:

 Read input data from databases, files, or other sources.
 Perform initial data validation and integrity checks.
 Log the data extraction process for auditing and troubleshooting.

Example:

 Extract daily transaction data from the banking transaction system into a staging area.

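A minimal sketch of the extraction step, reusing the staging table from the data lineage example; the CURRENT_DATE filter and the INSERT ... SELECT pattern are illustrative assumptions, not a prescribed implementation:

-- Extract today's branch transactions into the staging area.
INSERT INTO Staging_Branch_Transactions
    (Transaction_ID, Branch_ID, Transaction_Date, Transaction_Type, Amount)
SELECT Transaction_ID, Branch_ID, Transaction_Date, Transaction_Type, Amount
FROM Branch_Transactions
WHERE Transaction_Date = CURRENT_DATE;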
4. Data Transformation

Objective: Process and transform the extracted data as required.

Tasks:

 Apply business rules and data transformations.
 Aggregate, filter, or split data as needed.
 Perform calculations and update data fields.

Example:

 Calculate interest for savings accounts based on daily transactions and account balances.

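As a sketch of that calculation, assume a hypothetical Savings_Accounts table with an Account_ID and a current Balance, and a simple annual rate of 3% prorated per day (Balance * 0.03 / 365). Real interest rules (rate tiers, day-count conventions, accrual vs. posting) are considerably more involved:

-- Hypothetical daily interest accrual at a flat 3% annual rate.
SELECT Account_ID,
       Balance,
       ROUND(Balance * 0.03 / 365, 2) AS Daily_Interest
FROM Savings_Accounts;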
5. Data Loading

Objective: Load the processed data into target systems or databases.

Tasks:

 Insert, update, or delete records in the target databases.
 Ensure data integrity and consistency during the loading process.
 Log the data loading process for auditing purposes.

Example:

 Load the processed transaction data into the central accounting system.

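A minimal sketch of an idempotent load using the standard SQL MERGE statement (supported by Db2, Oracle, and SQL Server, among others). The target table GL_Daily_Postings is hypothetical; DW_Branch_Summary reuses the earlier example:

-- Upsert processed daily totals into a hypothetical accounting table.
MERGE INTO GL_Daily_Postings g
USING DW_Branch_Summary s
ON (g.Branch_ID = s.Branch_ID
    AND g.Transaction_Date = s.Transaction_Date
    AND g.Transaction_Type = s.Transaction_Type)
WHEN MATCHED THEN
    UPDATE SET g.Amount = s.Total_Amount
WHEN NOT MATCHED THEN
    INSERT (Branch_ID, Transaction_Date, Transaction_Type, Amount)
    VALUES (s.Branch_ID, s.Transaction_Date, s.Transaction_Type, s.Total_Amount);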
6. Reporting and Notifications

Objective: Generate reports and notify stakeholders of job completion and any issues.

Tasks:

 Produce summary and detailed reports of the batch job results.
 Notify relevant personnel of job completion, errors, or exceptions via email, SMS, or other
means.
 Archive logs and reports for future reference.

Example:

 Generate a daily transaction summary report for management.
 Send an email notification to the operations team upon job completion.

7. Job Monitoring and Control

Objective: Monitor the batch job execution and control its progress.

Tasks:

 Track job execution in real-time using monitoring tools.
 Manage job queues and priorities dynamically.
 Detect and resolve errors or exceptions promptly.

Example:

 Use IBM Tivoli Workload Scheduler to monitor job execution and intervene if any job fails.

8. Error Handling and Recovery

Objective: Handle any errors that occur during the batch job and recover if necessary.

Tasks:

 Identify and log errors and exceptions.
 Implement retry mechanisms and alternative workflows.
 Perform root cause analysis and corrective actions.

Example:

 If a data loading step fails due to a database connectivity issue, automatically retry the step after
a short delay.

9. Job Termination

Objective: Complete the batch job lifecycle and release resources.

Tasks:

 Ensure all processes are terminated correctly.
 Release system resources and clean up temporary files or data.
 Archive job logs and results for compliance and auditing.

Example:

 After successful data loading, close database connections and delete temporary staging files.

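A sketch of the cleanup step, reusing tables from the earlier examples; whether to truncate immediately or archive first is a site-specific policy decision:

-- Release staging space once the load has been verified.
TRUNCATE TABLE Staging_Branch_Transactions;

-- Drop the temporary clean-up table if one was created earlier (hypothetical name).
DROP TABLE Staging_Clean;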
10. Post-Processing Analysis

Objective: Analyze the results of the batch job and prepare for the next cycle.

Tasks:

 Review job logs, reports, and performance metrics.
 Identify areas for improvement in the batch process.
 Plan and implement enhancements for future batch cycles.

Example:

 Analyze job performance metrics to identify bottlenecks and optimize job scheduling for the next
cycle.

Summary

The IBM batch processing lifecycle in a banking system is a comprehensive and systematic approach to
handle large-scale transaction processing efficiently. By following these steps, banks can ensure
accurate, timely, and reliable processing of their batch jobs, ultimately supporting their operational and
business objectives.
