The arguments put forth by him in favor of a library data warehouse are:
• Data from several heterogeneous data sources (MS Excel spreadsheets, MS Access databases, CSV files, etc.) can be extracted and brought together in a data warehouse.
• Even when DIIT expands into several branches in multiple cities, it can still have one warehouse to support the information needs of the institution.
• Data anomalies can be corrected through an ETL package.
• Missing or incomplete records can be detected and duly corrected.
• Uniformity can be maintained over each attribute of a table.
• Data can be conveniently retrieved for analysis and generating reports (like the report on spending
requested above).
• Fact-based decision making can be easily supported by a data warehouse.
• Ad hoc queries can be easily supported.
DATA MART
GOALS of DATA WAREHOUSE
BI-PROCESS
What are the problems faced in Data Integration?
Challenges in Data Integration
Data Mapping
• Identification of data relationships
• Consolidation of multiple databases into a single database
• Data transformation between data source and data destination
DATA MAPPING
• The process of creating a data element mapping between two distinct data models
• Used as the first step towards a wide variety of data integration tasks, such as:
• Data transformation between a data source and a data destination
• Identification of data relationships
• Discovery of hidden data
• Consolidation of multiple databases into a single database
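A minimal sketch of such a mapping in Python (the source and destination field names below are hypothetical, for illustration only):

```python
# Data-mapping sketch: rename source fields to the destination data model.
# All field names here are assumptions, not taken from a real source system.

source_record = {"StudName": "A. Rao", "DOB": "2001-04-17", "BranchCd": "CSE"}

# Data element mapping between the two data models
field_mapping = {
    "StudName": "student_name",
    "DOB": "date_of_birth",
    "BranchCd": "branch_code",
}

def apply_mapping(record, mapping):
    """Rename source fields to their destination counterparts."""
    return {mapping[src]: value for src, value in record.items() if src in mapping}

print(apply_mapping(source_record, field_mapping))
# {'student_name': 'A. Rao', 'date_of_birth': '2001-04-17', 'branch_code': 'CSE'}
```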
Some Transformation Types
Selecting only certain columns to load
Translating a few coded values
Encoding some free-form values
Deriving a new calculated value
Joining together data derived from multiple sources
Summarizing data from multiple rows
Splitting a column into multiple columns
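A hedged sketch of a few of these transformation types in plain Python (the column names, codes, and values are illustrative assumptions):

```python
# Illustrative ETL transformations: column selection, translating coded values,
# deriving a calculated value, and splitting a column into multiple columns.

rows = [
    {"id": 1, "gender_code": "F", "full_name": "Asha Verma", "fee_paid": 12000, "fee_due": 3000},
    {"id": 2, "gender_code": "M", "full_name": "Ravi Kumar", "fee_paid": 15000, "fee_due": 0},
]

GENDER_LOOKUP = {"M": "Male", "F": "Female"}             # translating coded values

transformed = []
for row in rows:
    first, last = row["full_name"].split(" ", 1)         # splitting a column into multiple columns
    transformed.append({
        "id": row["id"],                                  # selecting only certain columns to load
        "gender": GENDER_LOOKUP.get(row["gender_code"], "Unknown"),
        "first_name": first,
        "last_name": last,
        "total_fee": row["fee_paid"] + row["fee_due"],    # deriving a new calculated value
    })

print(transformed)
```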
Ralph Kimball’s Approach vs W.H. Inmon’s Approach
• Kimball: A data warehouse is made up of all the data marts in the enterprise. Bottom-up approach; faster, cheaper, and less complex.
• W.H. Inmon: A DW is a subject-oriented, non-volatile, integrated, time-variant collection of data. Top-down approach; expensive and complex, but achieves a “single version of the truth” for large organisations and is worth the investment.
Data Integration Technologies
• Data Interchange:
Structured transmission of data between two or more organizations (or trading partners) by electronic means, rather than by e-mail.
• Object Request Brokering:
Middleware that allows applications to make calls to one another and also performs in-process data transformation.
Modelling Techniques
Entity Relationship Modelling
• Logical design technique
• Focuses on reducing data redundancy
• Contributes in the initial stages of constructing a DW
• Problem of creating a huge number of tables with dozens of joins: a massive spider web of joins between tables
Steps To Draw ER Model
• Identify entities
• Identify relationships between entities
• Identify key attributes
• Identify other relevant attributes
• Draw the ER diagram
• Review with business users
Problems posed by ER Modeling
• End users find it difficult to comprehend and traverse
• Lack of software to query a general ER model
• ER modeling cannot be used for a DW, where high-performance access and ad hoc queries are required
Dimensional Modeling
• Logical design technique that focuses on presenting data in a standard format for end-user consumption
• Based on a schema: star or snowflake
• Consists of one large fact table and a number of small dimension tables
• The fact table has a multipart primary key, and each dimension table has a single-part primary key
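A minimal sketch of a star schema using pandas (the table and column names are hypothetical): a fact table whose multipart key references two small dimension tables, joined at query time.

```python
import pandas as pd

# Hypothetical star schema: one large fact table plus two small dimension tables.
dim_date = pd.DataFrame({"date_key": [20240101, 20240102],
                         "calendar_date": ["2024-01-01", "2024-01-02"]})
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "product_name": ["Pen", "Notebook"]})

# The fact table's primary key is multipart (date_key + product_key);
# each dimension table has a single-part primary key.
fact_sales = pd.DataFrame({"date_key": [20240101, 20240101, 20240102],
                           "product_key": [1, 2, 1],
                           "units_sold": [10, 4, 7]})

# Typical dimensional query: join the facts to the dimensions, then aggregate.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby("product_name")["units_sold"].sum())
print(report)
```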
Basics of Data Integration (Extraction, Transformation, Loading)
Difference Between ER Modeling and Dimensional Modeling

ER Modeling                                      | Dimensional Modeling
Optimized for transactional data                 | Optimized for queryability and performance
Eliminates redundant data                        | Does not eliminate redundant data where appropriate
Highly normalized                                | Aggregates most of the attributes of a dimension into a single entity
A complex maze with hundreds of entities         | Logically grouped sets in schemas
Used for transactional systems                   | Used for analytical systems
Split as per entities                            | Split as per dimensions and facts
Data Quality
What is data quality?
In simple terms, data quality tells us how reliable a particular set of
data is and whether or not it will be good enough for a user to employ
in decision-making. This quality is often measured by degrees.
Data Quality Dimensions
There are six primary, or core, dimensions to data quality. These are
the metrics analysts use to determine the data’s viability and its
usefulness to the people who need it.
ACCURACY
COMPLETENESS
VALIDITY
CONSISTENCY
TIMELINESS
UNIQUENESS
Accuracy
• The data must conform to actual, real-world scenarios and reflect real-world objects and events. Analysts should use verifiable sources to confirm the measure of accuracy, determined by how closely the values match the verified, correct information sources.
E.g. the customer's address in the database is their real address; a patient's age in the hospital database matches their actual age.
Completeness
• Completeness measures whether all the mandatory values in a dataset are present.
E.g. data for all students at the university is available; data for all the patients of a hospital is available; data for all employees is available.
Consistency
• Data consistency describes the data’s uniformity as it moves across
applications and networks and when it comes from multiple sources.
Consistency also means that the same datasets stored in different
locations should be the same and not conflict.
• E.g. A customer has cancelled and surrendered his credit card, yet the status still reads "due"
• An employee has left, but his email ID is still active
Timeliness
Timely data is information that is readily available whenever it’s
needed. This dimension also covers keeping the data current; data
should undergo real-time updates to ensure that it is always available
and accessible.
• E.g. Airlines need to provide timely flight information to passengers
• An enterprise needs to publish its quarterly results on time
Uniqueness
• Uniqueness means that no duplicate or redundant information overlaps across the datasets; no record in the dataset exists multiple times. Analysts use data cleansing and deduplication to help address a low uniqueness score.
• For example, when reviewing customer data, you should expect that
each customer has a unique customer ID.
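For instance, a hedged sketch of a deduplication pass on a hypothetical customer table, keeping one row per customer ID:

```python
import pandas as pd

# Hypothetical customer records containing a duplicated customer_id.
customers = pd.DataFrame({
    "customer_id": [101, 102, 101],
    "name": ["Meera", "John", "Meera"],
})

# Uniqueness score: distinct IDs divided by total rows.
uniqueness_score = customers["customer_id"].nunique() / len(customers)
print(f"Uniqueness before cleansing: {uniqueness_score:.0%}")   # 67%

# Deduplication keeps the first occurrence of each customer ID.
deduplicated = customers.drop_duplicates(subset="customer_id", keep="first")
print(deduplicated)
```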
Validity
• Data must be collected according to the organization’s defined
business rules and parameters. The information should also conform
to the correct, accepted formats, and all dataset values should fall
within the proper range.
• Formatting usually includes metadata, such as valid data types, ranges,
patterns, and more.
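A small sketch of validity checks in Python; the business rules shown (the age range and email pattern) are assumptions used only for illustration:

```python
import re

# Assumed business rules: age must lie in 0-120 and email must match a simple pattern.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid(record):
    """Return True when the record conforms to the assumed formats and ranges."""
    return (isinstance(record.get("age"), int)
            and 0 <= record["age"] <= 120
            and bool(EMAIL_PATTERN.match(record.get("email", ""))))

records = [{"age": 34, "email": "user@example.com"},
           {"age": 250, "email": "not-an-email"}]
print([is_valid(r) for r in records])   # [True, False]
```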
Data Profiling
• Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. Data profiling produces critical insights into data that companies can then leverage to their advantage.
Two types of Data Profiling
Data Quality Profiling
Database Profiling
Data Quality Profiling
• Analysing the data from a data source or a database against the business requirements
• Enables issues in data quality to be identified
• The analysis may be presented as:
• Summaries: counts and percentages on completeness of the dataset, uniqueness of columns, etc.
• Details: lists containing information on data records or data problems in individual records
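A minimal pandas sketch of the summary style described above, giving counts and percentages of completeness and uniqueness per column (the sample dataset is hypothetical):

```python
import pandas as pd

# Hypothetical extract to profile.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 3],
    "email": ["a@x.edu", None, "c@x.edu", "c@x.edu"],
})

# Summary-style data quality profile: completeness and uniqueness per column.
summary = pd.DataFrame({
    "non_null_count": df.notna().sum(),
    "completeness_%": (df.notna().mean() * 100).round(1),
    "unique_values": df.nunique(),
    "uniqueness_%": (df.nunique() / len(df) * 100).round(1),
})
print(summary)
```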
Database Profiling
• Analysis of a database with respect to its schema, the relationships between tables, the columns used, the keys of the tables, etc.
• Database profiling is done first, followed by data quality profiling
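A hedged sketch of database profiling against a SQLite database (the file name is an assumption), listing tables, columns, keys, and foreign-key relationships from the schema metadata:

```python
import sqlite3

# Connect to an existing database to profile (file name is hypothetical).
conn = sqlite3.connect("library.db")

# List the tables defined in the schema.
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]

for table in tables:
    # PRAGMA table_info reports column name, declared type, and primary-key flag.
    for cid, name, col_type, notnull, default, pk in conn.execute(f"PRAGMA table_info({table})"):
        print(table, name, col_type, "PK" if pk else "")

    # Foreign keys reveal the relationships between tables.
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print(table, "references", fk[2], "via column", fk[3])

conn.close()
```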
When is it conducted?
At the discovery / requirements gathering phase
• As soon as the source data systems are identified
• Business requirements are laid out
• Data quality profiling needs to be done to avoid correction and re-work

Just before the dimensional modelling process
• Intensive data profiling is done
• More database profiling
• Analysis of schema designs for the DW
• Aims to identify the best method to convert the source data to the dimensional model
• To identify possible errors that may creep in during ETL

During ETL design
• Helps in identifying what data package to extract and what filters to apply
• More data quality profiling
Domino’s data avalanche
• With almost 14,000 locations, Domino’s was already the largest pizza
company in the world by 2015. But when the company launched
its AnyWare ordering system, it was suddenly faced with an avalanche of
data. Users could now place orders through virtually any type of device or
app, including smart watches, TVs, car entertainment systems, and social
media platforms.
• That meant Domino’s had data coming at it from all sides. By putting
reliable data profiling to work, Domino’s now collects and analyzes
data from all of the company’s point-of-sale systems in order to streamline
analysis and improve data quality. As a result, Domino’s has gained deeper
insights into its customer base, enhanced its fraud detection processes,
boosted operational efficiency, and increased sales.