DATA WAREHOUSE AND
INTEGRATION
Lecturer: Nguyễn Văn Hồ, M.A.
Data Warehouse and Integration:
Data Warehouse in the the age of Big data
Nguyễn Văn Hồ, M.A.
[email protected]
Data Warehouse and Integration
Comparison Chart of Database Types
Data warehouse Operational system
Subject oriented Transaction oriented
Large (hundreds of GB up to several TB) Small (MB up to several GB)
Historic data Current data
De-normalized table structure (few tables, Normalized table structure (many tables,
many columns per table) few columns per table)
Batch updates Continuous updates
Usually very complex queries Simple to complex queries
Data Warehouse and Integration
What’s wrong with reporting from Operational System?
OLTP – On Line Transaction Source
Processing 1 Source
1
Designed for working with single Source
2
record at a time
Data is highly “normalized”, i.e. Source
1
duplicate values have been removed Source
2
Source
Getting all data for a record can 1
Source
involve many table joins 2
Can be quite confusing for
‘ad-hoc’ reporting Source
Can also be slow, having an
1
Source
impact on the OLTP system 2
32
Data Warehouse and Integration
A typical enterprise data warehousing system architecture
- Ad-hoc Queries - Statistical tools
An enterprise data warehousing - Standard Reports - Operational BI Apps
Front-end
Tools
system architecture with - Desktop OLAP - Data Mining
ROLAP/MOLAP/Hybrid OLAP LAN
Hybrid OLAP Data Marts
ROLAP MOLAP and OLAP
Data mart Data mart Engines
Departmentally
Structured
Sub setting, summarizing, customizing
DW Storage
Data Metadata Repository
Warehouse Metadata
ETL (Extract, Transform, Load)
ETL Management
Staging DB/ ODS Services
Other Archived External
OLTP ERP
RDMS Systems
Internal Data
sources Data source
Sources Systems
33
DW+BI Systems Used to Be Fairly Straightforward
Modern Data Warehousing
Why Data Lake is needed?
Data Warehouse and Integration
Objectives of Data Lake
• Reduce up-front effort by ingesting data in any
format without requiring a schema initially
• Make acquiring new data easy, so it can be
available for data science & analysis quickly
• Store large volume of multi-structured data in
its native format
• Achieve agility faster than a traditional data
warehouse can
• Speed up decision-making ability
• Storage for additional types of data which
were historically difficult to obtain
Data Warehouse and Integration
Growing your DW/BI/Analytics Environment
Cloud & Modern DW Advanced
Hybrid Multi-Platform Analytics
Platforms Architecture
Self- Agile,
Real-Time
Service Nimble
Reporting
BI Solutions
Data Warehouse and Integration
Achieving Extensibility in a DW
Design with change in mind. Ex: Create a lookup table with
code/descriptions, or implement in a view, rather than hard-coding in ETL.
Plan for a hybrid environment with multiple architectures.
Introduce conformed dimensions first whenever possible.
Try to avoid isolated “stovepipe” implementations unless the isolation
is absolutely intended.
Conduct active prototyping sessions with business users to flush out
requirements. A data modeling tool like Power BI works well for this.
Data Warehouse and Integration
Achieving Extensibility in a DW
Be prepared to do some refactoring along the way. Ex: converting an
attribute to be a conformed dimension.
DimCustomer
First implementation: FactSalesInvoice
CustomerName
CustomerRegion
…
Updated in a later iteration:
DimRegion
FactWarrantyRequest FactSalesInvoice
DimCustomer
Data Warehouse and Integration
Achieving Extensibility in a DW
Introducing new measures:
• Can be a new column in a fact table as long as it’s the same grain & the
same business process
Introducing new attributes:
• Can be a new column in a dimension, or
• Can be via a new foreign key in a fact table as long as it doesn’t affect
the grain
Agility for the things that usually require the most time investment:
• Data modeling
• ETL processes
• Data quality
Data Warehouse and Integration
Achieving Extensibility in a DW
Reusability Downstream Speed of Change
Implemented
DW OLAP Reports
Consider using an OLAP cube or in-memory model (like Analysis
Services) for:
• Summary data (as opposed to summary tables in your DW)
• Year-to-Date type of calculations
• Year-over-Year type of calculations
• Aggregate level calculations (as opposed to row-by-row calculations)
Data Warehouse and Integration
Modern DW: Important Concepts to Know
Polygot Lambda Schema on
Persistence Architecture Read
Using the most Data processing Data structure is
effective data architecture applied at query
storage technology which supports time rather than
to handle different large amounts of when the data is
data storage needs data via a speed initially stored
layer, batch layer,
and serving layer