Scenario
Retail:
Large grocery chain: The business has 100 grocery stores spread across five states. Each store has a full
complement of departments, including grocery, frozen foods, dairy, meat, produce, bakery, floral,
and health/beauty aids. Each store has approximately 60,000 individual products, called stock keeping
units (SKUs), on its shelves.
Data is collected at several interesting places in a grocery store. Some of the most useful data is
collected at the cash registers as customers purchase products. The point-of-sale (POS) system scans
product barcodes at the cash register, measuring consumer takeaway at the front door of the grocery
store. Other data is captured at the store's back door, where vendors make deliveries.
Problem Statement:
At the grocery store, management is concerned with the logistics of ordering, stocking, and selling
products while maximizing profit. The profit ultimately comes from charging as much as possible for
each product, lowering costs for product acquisition and overhead, and at the same time attracting as
many customers as possible in a highly competitive environment.
Some of the most significant management decisions have to do with pricing and promotions.
Promotions in a grocery store include temporary price reductions, ads in newspapers and newspaper
inserts, displays in the grocery store, and coupons. The most direct and effective way to create a surge in
the volume of product sold is to lower the price dramatically. A 50-cent reduction in the price of paper
towels, especially when coupled with an ad and display, can cause the sale of the paper towels to jump
by a factor of 10. Unfortunately, such a big price reduction usually is not sustainable because the towels
probably are being sold at a loss. As a result of these issues, the visibility of all forms of promotion is an
important part of analyzing the operations of a grocery store.
Product Vision:
Create a data warehouse and BI platform that provides analytical and reporting capabilities for
promotion-related activities.
The platform should provide numerous benefits such as operational efficiency, a favorable customer
experience, and customer loyalty and retention. Most importantly, it can be used to anticipate demand
for efficient inventory management, cash management, and overall profitability.
Team:
Team of 9
1 PO
1 BA
2 System Architects/Developers (one of whom is also the team lead and project manager)
2 ETL Developers
2 QA
1 Report designer (BI developer)
Envisioning:
The envisioning effort is typically performed during the first week of a project, the goal of which is to
identify the scope of your system and a likely architecture for addressing it. To do this you will do both
high-level requirements modeling and high-level architecture modeling.
Initial Requirements Modeling
Initial Architecture Modeling
For a BI/DW project, the initial architecture views would likely be some form of deployment
diagram capturing the technologies you intend to use, plus a high-level domain model giving an
overview of the business entities and the relationships between them.
Iteration Modeling: Thinking Through What You'll Do This Iteration (Release Plan)
Create a BCP (business continuity/disaster recovery plan).
Gather Business Requirements and Data Realities
Before launching a dimensional modeling effort, the team needs to understand the needs of the
business, as well as the realities of the underlying source data. You uncover the requirements via
sessions with business representatives to understand their objectives based on key performance
indicators, compelling business issues, decision-making processes, and supporting analytic needs. At the
same time, data realities are uncovered by meeting with source system experts and doing high-level
data profiling to assess data feasibility.
Collaborative Dimensional Modeling Workshops
Dimensional models should be designed in collaboration with subject matter experts and data
governance representatives from the business. The data modeler is in charge, but the model
should unfold via a series of highly interactive workshops with business representatives.
A common criticism of the agile approaches is the lack of planning and architecture, coupled
with ongoing governance challenges. The enterprise data warehouse bus matrix is a powerful
tool to address these shortcomings. The bus matrix provides a framework and master plan for
agile development, plus identifies the reusable common descriptive dimensions that provide
both data consistency and reduced time-to-market delivery.
As you flesh out the portfolio of master conformed dimensions, the development crank starts
turning faster and faster. The time-to-market for a new business process data source shrinks as
developers reuse existing conformed dimensions. Ultimately, new ETL development focuses
almost exclusively on delivering more fact tables because the associated dimension tables are
already sitting on the shelf ready to go.
This architecture decomposes the DW/BI planning process into manageable pieces by focusing
on business processes, while delivering integration via standardized conformed dimensions that
are reused across processes.
The bus matrix is created.
Opportunity/Stakeholder Matrix (as PO):
After the enterprise data warehouse bus matrix rows have been identified, you can draft a
different matrix by replacing the dimension columns with business functions, such as marketing,
sales, and finance, and then shading the matrix cells to indicate which business functions are
interested in which business process rows. The opportunity/stakeholder matrix helps identify
which business groups should be invited to the collaborative design sessions for each
process-centric row.
High Level Grooming:
Step 1: Selecting Business Process
The DW/BI project should focus on the business process that is both the most critical to the business
users and the most feasible. Feasibility covers a range of considerations, including data availability and
quality, as well as organizational readiness.
Management wants to better understand customer purchases as captured by the POS system. Thus,
the business process you’re modeling is POS retail sales transactions. This data enables the business
users to analyze which products are selling in which stores on which days under what promotional
conditions in which transactions.
Step 2: Declare the Grain
The design team decides on the granularity.
The most granular data is an individual product on a POS transaction, assuming the POS system rolls up
all sales for a given product within a shopping cart into a single line item.
Although users probably are not interested in analyzing single items associated with a specific POS
transaction, you can’t predict all the ways they’ll want to cull through that data. For example,
they may want to understand the difference in sales on Monday versus Sunday.
Or they may want to assess whether it’s worthwhile to stock so many individual sizes of certain
brands.
Or they may want to understand how many shoppers took advantage of the 50-cents-off
promotion on shampoo.
Or they may want to determine the impact of decreased sales when a competitive diet soda
product was promoted heavily.
Although none of these queries calls for data from one specific transaction, they are broad questions
that require detailed data sliced in precise ways. None of them could have been answered if you
elected to provide access only to summarized data.
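For example, with access to atomic data, a question like the Monday-versus-Sunday comparison above could be answered with a simple query. The following is only a sketch against hypothetical table and column names (retail_sales_fact, date_dim, product_dim), not the final schema:

    -- Compare Monday vs. Sunday sales by product (hypothetical table and column names)
    SELECT d.day_of_week,
           p.product_description,
           SUM(f.extended_sales_dollar_amount) AS sales_dollars
    FROM retail_sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    WHERE d.day_of_week IN ('Monday', 'Sunday')
    GROUP BY d.day_of_week, p.product_description;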
(The following applies only to promotions, which are broken down further at a high level.)
Business analysts at both headquarters and the stores are interested in determining whether a
promotion is effective. Promotions are judged on one or more of the following factors:
Whether the products under promotion experienced a gain in sales, called lift, during the
promotional period. The lift can be measured only if the store can agree on what the baseline
sales of the promoted products would have been without the promotion. Baseline values can be
estimated from prior sales history and, in some cases, with the help of sophisticated models.
Whether the products under promotion showed a drop in sales just prior to or after the
promotion, canceling the gain in sales during the promotion (time shifting). In other words, did
you transfer sales from regularly priced products to temporarily reduced-price products?
Whether the products under promotion showed a gain in sales but other products nearby on
the shelf showed a corresponding sales decrease (cannibalization).
Whether all the products in the promoted category of products experienced a net overall gain in
sales taking into account the time periods before, during, and after the promotion (market
growth).
Whether the promotion was profitable. Usually the profit of a promotion is taken to be the
incremental gain in profit of the promoted category over the baseline sales taking into account
time shifting and cannibalization, as well as the costs of the promotion.
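As a sketch of how the first of these factors, lift, might eventually be computed once the dimensional model is loaded; the baseline_sales table and all column names here are illustrative assumptions, not part of the designed schema:

    -- Lift = promoted-period sales minus estimated baseline sales (hypothetical tables and columns)
    SELECT pr.promotion_name,
           SUM(f.extended_sales_dollar_amount) AS promoted_sales,
           SUM(b.baseline_sales_dollar_amount) AS baseline_sales,
           SUM(f.extended_sales_dollar_amount)
             - SUM(b.baseline_sales_dollar_amount) AS lift
    FROM retail_sales_fact f
    JOIN promotion_dim pr ON f.promotion_key = pr.promotion_key
    JOIN baseline_sales b ON b.product_key = f.product_key
                         AND b.date_key = f.date_key
    WHERE pr.promotion_name <> 'No Promotion'
    GROUP BY pr.promotion_name;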
Release Plan:
Team Backlog Grooming:
User Story 1:
As a Financial Analyst
I need the ability to see the promotions run on a store during a period
In order to identify the lift (gain in sales)
User Story 2:
As a Financial Analyst
I need the ability to see sales of the products after promotion
In order to identify a drop in sales (time shifting)
User Story 3:
As a Financial Analyst
I need the ability to see the profit margin per year
In order to identify the least profitable years
User Story 4:
As a Financial Analyst
I need the ability to see the profit margin per branch and year
In order to identify the least profitable branches
Sprint 0:
Before starting to work on the DW/BI system user stories, it is very helpful to start with what is known
as Iteration 0 (Sprint 0). This iteration can last from one to two weeks, and its main objective is to set up
everything you need from a technical point of view to start building the DW/BI system. This includes:
Version control system (SCM)
Provisioning of work environments: development (developer sandbox), staging, pre-production, and
production, each with its base software installation and configuration, RDBMS, and the platforms
and tools that will be used
Installing and configuring the continuous integration server
Installation and configuration of the agile project management and collaboration tools
Version control system (SCM)
The following base structure can be used as a reference for project artifacts versioning:
dw_bi_system Data Warehousing / Business Intelligence System
├── doc System Documentation
├── provisioning Environment provisioning code
└── src System Source Code
├── apps BI Applications code
├── data System Static data (.csv, .txt, .xml, .sql)
├── db_migrations SQL scripts for databases change management
├── etls ETL Code (Data Warehousing)
├── reports Dashboards and reports source files
└── schemas Metadata schemes or models
Provisioning Environments
One of the agile success factors is automating repetitive tasks, so that development teams can focus on
issues that add value to the DW / BI system.
With regard to the environments where the DW/BI system will run, automation implies creating code
that provisions the operating system, base software, database server, settings, tools, etc. This is
usually called IaC (Infrastructure as Code). Ansible is a platform that lets you write YAML code for
provisioning environments, and it works together with Vagrant, a virtual environment manager.
(YAML (a recursive acronym for "YAML Ain't Markup Language") is a human-readable data-serialization language.
It is commonly used for configuration files and in applications where data is being stored or transmitted)
For a DW/BI system we can have four environments:
Production Environment (PROD): the production-ready DW/BI system
Developer Environment (Developer Sandbox): the DW/BI system's development environment
Quality Assurance Environment (QA): all developers' changes are integrated in this environment
and the DW/BI system's quality controls are performed
Pre-production Environment (PRE-PROD): a production-like environment; tests and end-user
demonstrations run in this environment
An important feature of Ansible is that the same infrastructure code can be used to provision any of the
environments (DEV, QA, PRE-PROD, PROD). In addition, it helps maintain consistency of the software
versions used across all environments.
The infrastructure code is versioned in the /provisioning directory mentioned in the versioning section
above.
The provisioning code for the Pentaho v5.4 CE platform, using the PostgreSQL v9.4 RDBMS and running
on the CentOS v7.1 operating system, can be found in this GitHub repository.
Sprint Planning:
DoR (Definition of Ready): data mapping document (for that user story)
During Sprint
Evolutionary Dimensional Modeling
A well-written user story is the unit of work for building and evolving the DW/BI system. Assume you are
working on user stories 1 and 2 from the backlog above.
To keep track of incremental changes in the structure of the dimensional model, it is advisable to use a
database change management tool. Some of the best-known open source tools
are Flyway, Liquibase, and DBDeploy.
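As a sketch, with a tool such as Flyway each incremental model change would live as a versioned SQL script under src/db_migrations; the file name and columns below are illustrative assumptions, not the final promotion dimension design:

    -- V1__create_promotion_dim.sql (illustrative Flyway migration; names are assumptions)
    CREATE TABLE promotion_dim (
        promotion_key        INTEGER PRIMARY KEY,
        promotion_name       VARCHAR(100) NOT NULL,
        price_reduction_type VARCHAR(50),
        ad_type              VARCHAR(50),
        display_type         VARCHAR(50),
        coupon_type          VARCHAR(50),
        promotion_begin_date DATE,
        promotion_end_date   DATE
    );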
Step 2: Declare the Grain
Step 3: Identify the dimensions
The following descriptive dimensions apply to this case: date, product, store, promotion, cashier, and
method of payment.
Step 4: Identify the Facts
When considering potential facts, you may again discover adjustments need to be made to either your
earlier grain assumptions or choice of dimensions.
The facts collected by the POS system include the sales quantity (for example, the number of cans of
chicken noodle soup), per unit regular, discount, and net paid prices, and extended discount and sales
dollar amounts. The extended sales dollar amount equals the sales quantity multiplied by the net unit
price. Likewise, the extended discount dollar amount is the sales quantity multiplied by the unit discount
amount.
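A minimal DDL sketch of the resulting fact table at the declared grain, using the dimensions from step 3; the column names and the POS transaction number degenerate dimension are illustrative assumptions:

    -- Grain: one row per product per POS transaction (illustrative column names)
    CREATE TABLE retail_sales_fact (
        date_key                        INTEGER NOT NULL,
        product_key                     INTEGER NOT NULL,
        store_key                       INTEGER NOT NULL,
        promotion_key                   INTEGER NOT NULL,
        cashier_key                     INTEGER NOT NULL,
        payment_method_key              INTEGER NOT NULL,
        pos_transaction_number          BIGINT  NOT NULL,
        sales_quantity                  INTEGER,
        regular_unit_price              NUMERIC(10,2),
        discount_unit_price             NUMERIC(10,2),
        net_unit_price                  NUMERIC(10,2),
        extended_discount_dollar_amount NUMERIC(12,2),
        extended_sales_dollar_amount    NUMERIC(12,2)
    );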
--
Dimensional modelers sometimes question whether a calculated derived fact should be stored in the
database. We generally recommend it be stored physically. In this case study, the gross profit calculation
is straightforward, but storing it means it’s computed consistently in the ETL process, eliminating the
possibility of user calculation errors. The cost of a user incorrectly representing gross profit overwhelms
the minor incremental storage cost. Storing it also ensures all users and BI reporting applications refer to
gross profit consistently.
Likewise, some organizations want to perform the calculation in the BI tool. Again, this works if all
users access the data using a common tool, which is seldom the case in our experience. However,
sometimes non-additive metrics on a report such as percentages or ratios must be computed in the BI
application because the calculation cannot be precalculated and stored in a fact table. OLAP cubes excel
in these situations.
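A minimal sketch of storing the derived gross profit fact by computing it once in the ETL rather than in each BI tool; the extended cost and gross profit columns are assumed additions to the fact table sketched above:

    -- Compute the derived gross profit fact once, during the ETL load
    -- (extended_cost_dollar_amount and gross_profit_dollar_amount are assumed columns)
    UPDATE retail_sales_fact
    SET gross_profit_dollar_amount =
            extended_sales_dollar_amount - extended_cost_dollar_amount;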
Product Dimension
The product dimension describes every SKU in the grocery store. Although a typical store may stock
60,000 SKUs, when you account for different merchandising schemes and historical products that are no
longer available, the product dimension may have 300,000 or more rows. The product dimension is
almost always sourced from the operational product master file.
Most retailers administer their product master file at headquarters and download a subset to each
store’s POS system at frequent intervals. It is headquarters’ responsibility to define the appropriate
product master record (and unique SKU number) for each new product.
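A minimal sketch of the product dimension sourced from the product master; the attributes shown are illustrative assumptions, and a real product dimension typically carries many more descriptive attributes:

    -- Product dimension sourced from the operational product master (illustrative attributes)
    CREATE TABLE product_dim (
        product_key            INTEGER PRIMARY KEY,   -- surrogate key
        sku_number             VARCHAR(20) NOT NULL,  -- natural key from the product master
        product_description    VARCHAR(200),
        brand_description      VARCHAR(100),
        category_description   VARCHAR(100),
        department_description VARCHAR(100),
        package_size           VARCHAR(50)
    );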
Retail Schema in Action:
With the retail POS schema designed, consider how it would be put to use in a query environment.
Query:
A business user might be interested in better understanding weekly sales dollar volume by promotion
for the snacks category during January 2013 for stores in the Boston district.
You would place query constraints on month and year in the date dimension, district in the store
dimension, and category in the product dimension.
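A sketch of that query against the illustrative tables above; the store_dim and date_dim attributes (district, month_name, year, week_number_in_year) are assumptions:

    -- Weekly sales dollars by promotion: Snacks category, Boston district, January 2013
    SELECT d.week_number_in_year,
           pr.promotion_name,
           SUM(f.extended_sales_dollar_amount) AS sales_dollars
    FROM retail_sales_fact f
    JOIN date_dim      d  ON f.date_key      = d.date_key
    JOIN store_dim     s  ON f.store_key     = s.store_key
    JOIN product_dim   p  ON f.product_key   = p.product_key
    JOIN promotion_dim pr ON f.promotion_key = pr.promotion_key
    WHERE d.month_name = 'January'
      AND d.year = 2013
      AND s.district = 'Boston'
      AND p.category_description = 'Snacks'
    GROUP BY d.week_number_in_year, pr.promotion_name;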
Dimensional model loading processes
Once you have built a first version of the dimensional model and identified the data sources, you can
begin building the processes to load the dimensional model. These processes are commonly known as
ETL (Extraction, Transformation and Load) or data ingestion.
ETL processes allow us to move, transform and load the data into the temporary repository (staging) and
then to the dimensional model. These processes can be programmed using a programming language, or
constructed in a data integration tool. Some Open Source tools for building ETL processes and data
warehousing activities are: Pentaho Data Integration, Talend, Jaspersoft ETL.
It is important that the ETL processes' metadata is file-based so that it can be versioned in the
/etls directory mentioned in the versioning section above.
It is important to perform unit tests on the ETL processes to ensure they serve their purpose and that
the data remains consistent between the repositories it is moved and transformed between. To generate
test data (fake data) you can use a tool such as Mockaroo.
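As one example of the kind of check such a unit test could run after each load (a sketch; staging_pos_sales is a hypothetical staging table):

    -- Reconcile row counts between the staging area and the fact table after a load
    SELECT (SELECT COUNT(*) FROM staging_pos_sales) AS staging_rows,
           (SELECT COUNT(*) FROM retail_sales_fact) AS fact_rows;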
Continuous Integration
Continuous integration in the context of a DW/BI system supports two main activities:
1. Provisioning environments
2. Execution and scheduling of data loading processes (ETLs)
There are several continuous integration servers; some well-known open source options are go.cd,
Jenkins, and TravisCI.
The picture below shows two pipelines configured in the go.cd tool for the DW/BI system.
Information delivery mechanisms
Once the dimensional model has been populated with data through ETL processes, different solutions
are built for business intelligence activities. These solutions can be categorized into:
1. Reporting: institutional reports, on-demand reports (ad hoc), dashboards
2. OLAP Solutions: data analysis cubes and pivot tables
3. Custom: web portals, visualization applications, infographics
The delivery mechanisms that business users choose should add maximum value in answering their
business questions and supporting their decision-making process.