Data Warehousing
Dr. L. Rajya Lakshmi
Data warehouse: source data
• External data:
• Data from external sources, such as industry-related statistics produced by
external agencies and national statistical offices
• Used to spot industry trends and to compare performance
• The format of this external data may not match that of the enterprise data
Data warehouse: staging area
• Functions performed to make data ready for storage:
• Extraction
• Transformation
• Loading
• Data staging:
• Provides a dedicated work area for the source data
• Provides a set of functions to clean, change, combine, convert, and prepare
the data for storage in the data warehouse
• Why a separate staging area?
• Data come from different operational systems
• To create a subject-oriented view of the data
• Separate area is required to prepare the data
Data warehouse: staging area
• Data extraction
• Has to deal with numerous data sources
• Appropriate technique has to be identified for each source
• Relational database systems, legacy systems, flat files, etc.
• Can use available tools or write in-house programs to extract data
• Extracted data is usually (though not necessarily) stored in a separate
environment for easy movement into the data warehouse
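The extraction step above can be sketched as follows. This is a minimal illustration, not a prescribed tool: the source names (`flat_file`, `crm_db`), the `customers` table, and its columns are hypothetical. Each source type gets its own extraction routine, and all extracted rows land in one staging list tagged with their origin.

```python
import csv
import io
import sqlite3

def extract_from_flat_file(text):
    """Extract rows from CSV text (stands in for a flat-file export)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def extract_from_relational(conn):
    """Extract rows from a relational source with a SQL query."""
    conn.row_factory = sqlite3.Row
    cursor = conn.execute("SELECT cust_id, name, city FROM customers")
    return [dict(row) for row in cursor]

# Hypothetical sources: a flat file and an in-memory relational database.
flat_file = "cust_id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (3, 'Meera', 'Chennai')")

# The staging list plays the role of the separate staging environment.
staging = []
sources = (
    ("flat_file", extract_from_flat_file(flat_file)),
    ("crm_db", extract_from_relational(conn)),
)
for source_name, rows in sources:
    for row in rows:
        row["source_system"] = source_name  # remember where the row came from
        staging.append(row)
```

Note that the flat file yields `cust_id` as text while the database yields an integer; reconciling such differences is left to the transformation step.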
Data warehouse: staging area
• Data transformation
• More challenging than data extraction
• Transformation tasks should be adaptable to revised data as well
• First you clean the data
• Correction of spelling mistakes, resolving conflicts
• Providing default values for data elements with missing values
• Deletion of duplicates, etc.
• Standardization of data elements
• Data types and field lengths for the same data elements
• If two or more terms from different systems have the same meaning, then we have to
resolve the synonyms
• If the same term has different meanings in different systems, then we have to resolve the
homonyms
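A minimal sketch of the cleaning and standardization tasks listed above. The field names, the synonym map, and the default value are assumptions made for illustration; real rules would come from the business.

```python
# Hypothetical synonym map: source-specific field names with the same meaning.
SYNONYMS = {"cust_name": "customer_name", "client_name": "customer_name"}
# Hypothetical defaults for data elements with missing values.
DEFAULTS = {"city": "UNKNOWN"}

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # Resolve synonyms: rename fields to the standard name.
        row = {SYNONYMS.get(key, key): value for key, value in rec.items()}
        # Standardize text values: trim whitespace and normalise case.
        row = {
            key: value.strip().title() if isinstance(value, str) else value
            for key, value in row.items()
        }
        # Provide default values where data is missing or empty.
        for field, default in DEFAULTS.items():
            if not row.get(field):
                row[field] = default
        # Delete duplicates: keep the first record for each name and city.
        identity = (row.get("customer_name"), row.get("city"))
        if identity not in seen:
            seen.add(identity)
            cleaned.append(row)
    return cleaned

raw = [
    {"cust_name": " asha rao ", "city": "pune"},
    {"client_name": "Asha Rao", "city": "Pune"},
    {"cust_name": "ravi kumar", "city": ""},
]
result = clean(raw)
```

Here the first two records collapse into one after standardization, and the third receives the default city.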
Data warehouse: staging area
• Data transformation
• Combine the pieces of data from different sources
• The combining task may involve data elements from a single source record or
from multiple records
• Purging useless data
• Sorting and merging on a large scale
• Assignment of surrogate keys?
• Summarization
• Ex: Total units of each product sold at each store
• The result is a collection of integrated data that is cleaned, standardized, and
summarized
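The surrogate key assignment and the summarization example above (total units of each product sold at each store) can be sketched like this. The sales records and product codes are made up for illustration, and sequential integers are only one possible choice of surrogate key.

```python
from collections import defaultdict
from itertools import count

# Hypothetical detailed sales records after cleaning and standardization.
sales = [
    {"store": "S1", "product": "P-100", "units": 5},
    {"store": "S1", "product": "P-100", "units": 3},
    {"store": "S2", "product": "P-100", "units": 2},
    {"store": "S1", "product": "P-200", "units": 7},
]

# Assign surrogate keys: warehouse-generated integers that replace the
# source systems' own product codes.
next_key = count(1)
product_keys = {}
for sale in sales:
    code = sale["product"]
    if code not in product_keys:
        product_keys[code] = next(next_key)

# Summarize: total units of each product sold at each store.
totals = defaultdict(int)
for sale in sales:
    totals[(sale["store"], product_keys[sale["product"]])] += sale["units"]
```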
Data warehouse: staging area
• Two distinct functions for data loading
• What are they?
• Initial load
• Incremental data load
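The two loading functions can be contrasted in a short sketch. It assumes that the initial load populates an empty warehouse table in full, and that each incremental load applies only rows changed since the previous load, detected here by a timestamp. The `sales_fact` table, its columns, and the change-detection rule are hypothetical.

```python
import sqlite3

def initial_load(conn, rows):
    """Populate the warehouse table from scratch."""
    conn.execute("DROP TABLE IF EXISTS sales_fact")
    conn.execute(
        "CREATE TABLE sales_fact "
        "(sale_id INTEGER PRIMARY KEY, units INTEGER, changed_at TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()

def incremental_load(conn, rows):
    """Apply only rows changed after the latest change already loaded."""
    (last_change,) = conn.execute(
        "SELECT COALESCE(MAX(changed_at), '') FROM sales_fact"
    ).fetchone()
    # ISO dates compare correctly as strings.
    changed = [row for row in rows if row[2] > last_change]
    conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?)", changed)
    conn.commit()
    return len(changed)

conn = sqlite3.connect(":memory:")
day1 = [(1, 5, "2024-01-01"), (2, 3, "2024-01-01")]
# On day 2 the source holds one new sale and a correction to sale 1.
day2 = day1 + [(3, 7, "2024-01-02"), (1, 6, "2024-01-02")]

initial_load(conn, day1)
applied = incremental_load(conn, day2)
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
sale_1_units = conn.execute(
    "SELECT units FROM sales_fact WHERE sale_id = 1"
).fetchone()[0]
```

Only the two changed rows are applied on the second day; the unchanged rows are not reloaded.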
Data warehouse: data storage
• The storage area of a data warehouse is kept separate from that of
operational systems. Why?
• The two systems serve different purposes
• The data formats of the two systems are different
• Operational data could change very frequently
• Data in the DW repository has to be stable
• DW data repositories are open (generally RDBMSs)
• DWs also employ multidimensional databases (MDDBs), especially for summary data
Data warehouse: information delivery
• Ad hoc reports: for casual users
• Complex queries, multi-dimensional and statistical analysis: business
analysts and power users
• Executive information systems: senior executives and high-level
managers
• Can provide online querying and reporting
Data warehouse: metadata
• Similar to the data dictionary or data catalogue of a database management
system
• Data about the data in the data warehouse (the Yellow Pages of the data
warehouse)
• Three categories
• Operational metadata
• Extraction and transformation metadata
• End-user metadata
• Operational metadata: information about the operational source systems from
which the warehouse data originates
• Extraction and transformation metadata: extraction frequency, extraction
methods, business rules for extraction, and information about all data
transformations
• End-user metadata: Navigational map of the data warehouse; enables
end-users to get the required information using their own business terminology
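As a rough illustration of the three categories, the metadata can be pictured as a small catalogue. Every entry below (system names, table names, rules, business terms) is invented for the example; an actual warehouse would hold this in a metadata repository.

```python
# Hypothetical metadata catalogue with the three categories.
metadata = {
    "operational": {
        # Where the warehouse data originally comes from.
        "customers": {"source_system": "crm_db", "source_table": "CUST_MAST"},
    },
    "extraction_transformation": {
        # How and how often the data is extracted and transformed.
        "customers": {
            "frequency": "daily",
            "method": "incremental",
            "rules": ["trim names", "default missing city to UNKNOWN"],
        },
    },
    "end_user": {
        # Business terms mapped to warehouse tables and columns.
        "customer name": "customers.customer_name",
        "units sold": "sales_fact.units",
    },
}

def resolve(business_term):
    """Look up where a business term lives in the warehouse."""
    return metadata["end_user"][business_term.strip().lower()]

location = resolve("Units Sold")
```

The `resolve` lookup plays the part of the navigational map: the user supplies a business term and is pointed to the warehouse column that holds it.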