Week 2: Data Warehouse Architecture
• Lecture Topics:
o Data Warehouse Architecture Components
o Data Storage and Management
o Data Integration
• Readings:
o "What is a Data Warehouse?"[2]
• Practical:
o Designing a simple data warehouse schema
Lecture Notes: Data Warehouse Architecture and Key Components
1. Data Warehouse Architecture Components
A data warehouse is a centralized repository that stores integrated data from multiple sources,
designed to support decision-making processes. The architecture typically comprises the
following components:
a) Source Systems
• These are operational systems (e.g., ERP, CRM, transactional databases) from which
data is extracted.
• Examples: Sales databases, customer support systems, or financial systems.
b) ETL (Extract, Transform, Load) Processes
• ETL tools extract data from source systems, transform it to fit the data warehouse
schema, and load it into the warehouse.
• Practical Example: A retailer extracts daily sales data, cleans it to remove duplicates,
aggregates sales by store, and loads it into the data warehouse.
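The retailer example above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the sample rows, the `daily_store_sales` table, and the `transform` and `load` helpers are all hypothetical names chosen for the sketch, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows extracted from a source system:
# (transaction_id, store_id, amount)
raw_sales = [
    (1, "S01", 20.0),
    (2, "S01", 15.5),
    (2, "S01", 15.5),  # duplicate of transaction 2
    (3, "S02", 40.0),
]


def transform(rows):
    """Drop duplicate transactions, then total the amounts per store."""
    seen = set()
    totals = defaultdict(float)
    for transaction_id, store_id, amount in rows:
        if transaction_id in seen:
            continue
        seen.add(transaction_id)
        totals[store_id] += amount
    return dict(totals)


def load(conn, totals, sale_date):
    """Write one aggregated row per store into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_store_sales "
        "(sale_date TEXT, store_id TEXT, total_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO daily_store_sales VALUES (?, ?, ?)",
        [(sale_date, store, total) for store, total in sorted(totals.items())],
    )
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the real warehouse
store_totals = transform(raw_sales)
load(warehouse, store_totals, "2023-01-15")
loaded_rows = warehouse.execute(
    "SELECT store_id, total_amount FROM daily_store_sales ORDER BY store_id"
).fetchall()
```

The transformation happens in application code before anything reaches the warehouse, which is what distinguishes ETL from the ELT approach discussed later in these notes.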
c) Staging Area
• Temporary storage area where data is cleaned, deduplicated, and transformed before
loading into the warehouse.
• Example: A logistics company might use a staging area to align shipment tracking
data formats from different systems.
d) Data Storage Layer
• Contains the organized, structured, and optimized data for querying and analysis.
• Composed of fact tables (storing measurable data) and dimension tables (storing
descriptive attributes).
o Example: A sales fact table might have sales amounts, while the product
dimension table contains product descriptions.
e) Metadata Layer
• Stores information about the data, such as definitions, structure, and lineage.
• Practical Example: Metadata helps analysts understand what "monthly revenue"
represents and how it was calculated.
f) Query and Reporting Tools
• Allow users to query the data and generate reports or dashboards.
• Example: A financial analyst uses a BI tool like Tableau to visualize quarterly
revenue trends.
g) Data Marts
• Subsets of the data warehouse, tailored for specific departments or business units.
• Example: A marketing data mart might focus on campaign performance metrics.
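One lightweight way to picture a data mart is as a filtered view over warehouse data. The sketch below assumes a hypothetical `warehouse_metrics` table and a `marketing_mart` view; in practice a data mart is often a separate physical store, so treat the view as an illustration of the "subset" idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_metrics "
    "(department TEXT, metric_name TEXT, metric_value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_metrics VALUES (?, ?, ?)",
    [
        ("marketing", "campaign_clicks", 120.0),
        ("marketing", "campaign_spend", 300.0),
        ("finance", "quarterly_revenue", 50000.0),
    ],
)

# The mart exposes only the rows that the marketing team needs.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT metric_name, metric_value FROM warehouse_metrics "
    "WHERE department = 'marketing'"
)
mart_rows = conn.execute(
    "SELECT metric_name, metric_value FROM marketing_mart ORDER BY metric_name"
).fetchall()
```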
2. Data Storage and Management
Efficient data storage and management ensure high performance and scalability for a data
warehouse.
a) Storage Types
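• Relational Databases: Traditional row-oriented databases such as Oracle and SQL Server.
• Columnar Databases: Store data by column, which suits analytical queries, e.g., Amazon
Redshift, Snowflake.
• Cloud-Based Solutions: Managed services that offer scalability and flexibility, e.g.,
Google BigQuery. These categories overlap: Redshift and Snowflake are also cloud services.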
b) Storage Optimization Techniques
• Partitioning: Splitting large tables into smaller parts so that a query scans only the
parts it needs.
o Example: Partitioning a sales table by year.
• Indexing: Creating indexes for frequently queried fields.
o Example: Adding an index on the "product_id" column to speed up product
searches.
• Compression: Reducing data size without losing information.
o Example: Storing a column of repeated numeric values with run-length or dictionary
encoding.
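Two of the techniques above, indexing and lossless compression, can be demonstrated with the Python standard library. This is a sketch under stated assumptions: the `sales` table, the `idx_sales_product` index, and the sample values are hypothetical, SQLite stands in for a warehouse engine, and `zlib` stands in for whatever encoding a real columnar store would apply.

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, product_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(i, i % 50, 9.99) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's plan description for a lookup by product_id."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product_id = 7"
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index yet, so the table is scanned
conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")
plan_after = query_plan(conn)  # the lookup now goes through the index

# Lossless compression: repetitive values shrink and round-trip exactly.
raw = ",".join(["9.99"] * 1000).encode("utf-8")
packed = zlib.compress(raw)
restored = zlib.decompress(packed)
```

Printing `plan_before` and `plan_after` shows the planner switching from a scan to an index search; the exact wording of the plan text varies between SQLite versions.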
c) Data Backup and Recovery
• Regular backups allow data to be restored after failures or corruption.
• Example: A company schedules nightly backups of their data warehouse.
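As a small-scale illustration of backup and recovery, Python's `sqlite3` module provides `Connection.backup`, which copies one database into another. The sketch assumes a hypothetical `sales` table and uses in-memory databases; a real warehouse would rely on its own backup tooling and a scheduler for the nightly run.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL)")
source.execute("INSERT INTO sales VALUES (1, 19.99)")
source.commit()

# Take the backup: copy the full contents of the source database.
backup_copy = sqlite3.connect(":memory:")
source.backup(backup_copy)

# Simulate a failure on the source, then recover the data from the backup.
source.execute("DROP TABLE sales")
recovered = backup_copy.execute("SELECT sale_id, amount FROM sales").fetchall()
```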
d) Data Security
• Ensures sensitive information is protected through access controls, encryption, and
monitoring.
• Example: A healthcare provider encrypts patient data stored in the warehouse.
3. Data Integration
Data integration is the process of combining data from various sources into a unified view,
crucial for analytics and decision-making.
a) Types of Data Integration
• ETL (Extract, Transform, Load): Data is extracted from sources, transformed, and
loaded into the warehouse.
• ELT (Extract, Load, Transform): Data is loaded into the warehouse first and then
transformed.
o Example: ELT is commonly used in modern cloud-based warehouses like Snowflake.
• Data Virtualization: Real-time access to data where it resides, without storing a
separate physical copy.
o Example: A business user queries live data from multiple databases without
moving it to a central repository.
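To contrast ELT with ETL, the sketch below loads raw records into the warehouse unchanged and only then transforms them with SQL run by the warehouse engine itself. The `raw_orders` and `customer_totals` tables and the sample data are hypothetical, and SQLite stands in for a cloud warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw records land in the warehouse exactly as they were extracted.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " alice ", "10.50"), (2, "BOB", "7.25"), (3, "Alice", "4.50")],
)

# Transform: cleaning and aggregation run inside the warehouse as SQL.
conn.execute(
    "CREATE TABLE customer_totals AS "
    "SELECT LOWER(TRIM(customer)) AS customer, "
    "       SUM(CAST(amount AS REAL)) AS total_amount "
    "FROM raw_orders "
    "GROUP BY LOWER(TRIM(customer))"
)
totals = conn.execute(
    "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
).fetchall()
```

Keeping `raw_orders` in place means the transformation can be rerun or changed later without extracting the data again, which is one reason ELT suits warehouses with ample compute.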
b) Data Transformation
• Standardizing data formats, cleansing errors, and enriching data.
• Example: Converting date formats from "MM/DD/YYYY" to "YYYY-MM-DD".
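The date conversion in this example can be written with the standard `datetime` module. The helper name `to_iso_date` is an assumption made for this sketch.

```python
from datetime import datetime


def to_iso_date(us_date: str) -> str:
    """Convert an "MM/DD/YYYY" string to "YYYY-MM-DD".

    Raises ValueError if the input does not match the expected format.
    """
    return datetime.strptime(us_date, "%m/%d/%Y").date().isoformat()


iso = to_iso_date("03/15/2023")
```

Because `strptime` rejects strings that do not match the format, the same call doubles as a basic validity check during transformation.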
c) Data Consolidation Challenges
• Data Silos: Disconnected systems lead to incomplete views.
o Solution: Use APIs or middleware to connect systems.
• Data Quality Issues: Errors or inconsistencies affect trust.
o Solution: Implement data quality checks during integration.
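A data quality check during integration might look like the following sketch. The function `check_quality`, its parameters, and the sample customer records are hypothetical; the two rules shown (required fields present, no duplicate keys) are only examples of the checks a team could choose.

```python
def check_quality(records, key_field, required_fields):
    """Split records into accepted rows and rejected rows with a reason."""
    accepted, rejected = [], []
    seen_keys = set()
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [
            field
            for field in [key_field, *required_fields]
            if record.get(field) in (None, "")
        ]
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif record[key_field] in seen_keys:
            rejected.append((record, "duplicate " + key_field))
        else:
            seen_keys.add(record[key_field])
            accepted.append(record)
    return accepted, rejected


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.com"},
]
accepted, rejected = check_quality(customers, "id", ["email"])
```

Returning the rejected rows with a reason, rather than silently dropping them, lets the team trace quality problems back to the source system.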
d) Tools for Data Integration
• Popular tools include Informatica, Talend, Apache NiFi, and Microsoft Azure Data
Factory.
• Example: Informatica integrates customer data from CRM and ERP systems into a
centralized warehouse.
Conclusion
Understanding these components—data warehouse architecture, data storage and
management, and data integration—lays the foundation for implementing robust, scalable,
and efficient analytical solutions. Practical use of these principles enables organizations to
transform raw data into actionable insights.
4. Practical: Designing a Simple Data Warehouse Schema
Objective: To design a basic schema for a retail business to analyze sales performance.
Steps:
1. Identify Business Requirements:
o Analyze the business’s key metrics, such as total sales, revenue by region, and
product performance.
o Example: A retail chain wants to track daily sales across stores and identify
top-selling products.
2. Define Fact and Dimension Tables:
o Fact Table:
Name: Sales_Fact
Attributes: Sale_ID, Date_ID, Store_ID, Product_ID, Revenue,
Quantity_Sold
o Dimension Tables:
Date_Dimension: Date_ID, Date, Month, Quarter, Year
Store_Dimension: Store_ID, Store_Name, Region, City
Product_Dimension: Product_ID, Product_Name, Category, Brand
3. Create the Schema:
o Star Schema Design:
The Sales_Fact table is at the center, with foreign keys linking to
dimension tables.
o Example:
    Sales_Fact
    -------------------
    Sale_ID | Date_ID | Store_ID | Product_ID | Revenue | Quantity_Sold

    Date_Dimension
    -------------------
    Date_ID | Date | Month | Quarter | Year

    Store_Dimension
    -------------------
    Store_ID | Store_Name | Region | City

    Product_Dimension
    -------------------
    Product_ID | Product_Name | Category | Brand
4. Populate Data:
o Collect data from the source systems and transform it to match the schema.
o Example: Load daily sales transactions into the Sales_Fact table and update
the dimension tables with store and product details.
5. Test and Query:
o Verify the schema by running queries.
o Example Query: "Find the total revenue by product category in Q1 2023
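Steps 2 to 5 can be tried end to end with an in-memory SQLite database. The table and column names follow the schema defined above; the column types, the sample rows, and the use of SQLite are assumptions made for this sketch. The final query is one way to express the example of total revenue by product category in Q1 2023.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 3: create the star schema (column types are assumed).
conn.executescript("""
CREATE TABLE Date_Dimension (
    Date_ID INTEGER PRIMARY KEY, Date TEXT, Month INTEGER,
    Quarter INTEGER, Year INTEGER
);
CREATE TABLE Store_Dimension (
    Store_ID INTEGER PRIMARY KEY, Store_Name TEXT, Region TEXT, City TEXT
);
CREATE TABLE Product_Dimension (
    Product_ID INTEGER PRIMARY KEY, Product_Name TEXT, Category TEXT, Brand TEXT
);
CREATE TABLE Sales_Fact (
    Sale_ID INTEGER PRIMARY KEY,
    Date_ID INTEGER REFERENCES Date_Dimension (Date_ID),
    Store_ID INTEGER REFERENCES Store_Dimension (Store_ID),
    Product_ID INTEGER REFERENCES Product_Dimension (Product_ID),
    Revenue REAL,
    Quantity_Sold INTEGER
);
""")

# Step 4: populate with hypothetical sample data.
conn.executemany("INSERT INTO Date_Dimension VALUES (?, ?, ?, ?, ?)", [
    (1, "2023-01-15", 1, 1, 2023),
    (2, "2023-02-20", 2, 1, 2023),
    (3, "2023-04-05", 4, 2, 2023),
])
conn.execute("INSERT INTO Store_Dimension VALUES (1, 'Downtown', 'North', 'Springfield')")
conn.executemany("INSERT INTO Product_Dimension VALUES (?, ?, ?, ?)", [
    (1, "Laptop", "Electronics", "Acme"),
    (2, "Headphones", "Electronics", "Acme"),
    (3, "Desk", "Furniture", "Oakline"),
])
conn.executemany("INSERT INTO Sales_Fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 1000.0, 1),
    (2, 2, 1, 2, 150.0, 3),
    (3, 2, 1, 3, 300.0, 1),
    (4, 3, 1, 1, 1200.0, 1),  # Q2 sale, excluded by the query below
])

# Step 5: total revenue by product category in Q1 2023.
q1_revenue = conn.execute("""
    SELECT p.Category, SUM(f.Revenue) AS Total_Revenue
    FROM Sales_Fact AS f
    JOIN Date_Dimension AS d ON d.Date_ID = f.Date_ID
    JOIN Product_Dimension AS p ON p.Product_ID = f.Product_ID
    WHERE d.Year = 2023 AND d.Quarter = 1
    GROUP BY p.Category
    ORDER BY p.Category
""").fetchall()
```

The query touches the fact table once and joins only the two dimensions it needs, which is the access pattern a star schema is designed for.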