Data Mesh MD010585
Data Mesh
TERADATA.COM
WHITE PAPER | DATA MESH
It is also the case that despite several decades of intensive academic research, distributed query optimization is still relatively complex and unproven at scale – and that improvements in the performance of multi-core CPUs continue to outpace increases in the performance of network and storage sub-systems.

Whilst development of usable and useful data products is invariably business-led, understanding and respecting these engineering fundamentals when architecting and designing domains remains critical to success.

Finally, although "Data is the new oil" has become a cliché, when Clive Humby coined the phrase in 2006 he was also pointing out that raw data – the crude oil in the analogy – must be refined before high-value data products are created. Raw, un-sessionized weblogs are, by themselves, neither terribly interesting nor remotely comprehensible to most business users. However, when the raw data have been refined through the removal of web-bot traffic and the identification of user sessions, the resulting data are a powerful predictor of customer intent – let's call this 'diesel fuel.' When the sessionized web data are combined with interaction data across channels and re-socialized, we have even more powerful predictors – let's call these 'gasoline.' And when these behavioral data are combined with customer, transaction history, and demographic data, yet more powerful predictors – 'kerosene' or 'jet fuel' – can be created.

Successful data-driven organizations ensure that data products like these can be discovered and connected so that the jet fuel that powers their digital transformation initiatives can be created quickly and efficiently, and so that complex value chains can be optimized.

As large enterprises operating across multiple geographies continue to embrace cloud deployment models and multiple service providers, we believe that what we term "the connected data warehouse" model will be fundamental to successful Data Mesh implementation. Co-location of multiple schemas aligned to specific business domains within a single, scalable database instance provides a natural platform for at-scale Data Mesh deployment, with lightweight governance processes providing interoperability.

The remainder of this paper outlines practical steps to deploying the Data Mesh concept as an effective foundation for enterprise analytics.

Aligning data products with real-world business requirements

The development of data and analytic products is inherently complex, for at least three reasons:

1. Data products often require that data are reused for purposes that were not foreseen when the processes that generate them were created – necessitating complex data transformation, cleansing, and integration;

2. Requirements are often ambiguous and incompletely defined at the start of the project – and are frequently fluid thereafter;

3. Integrating analytic insights into business processes demands that complex trade-offs are revealed, understood, and assessed.

For example, we may be able to improve the predictive accuracy of a fraud detection model by training it on a larger set of features, but at the cost of increased run-times when the model is scored in production. An increase in decision latency from 150ms to 200ms might be an acceptable price to pay for a 20% increase in the lift of a fraud detection model – or it might not. But we are unlikely to know at the start of the project whether an improvement is possible or not – and even less likely to be able to quantify the response-time "cost" or the lift "benefit" so that one can be weighed against the other.

Agile, incremental approaches to the development of data and analytic platforms and products have proven
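The refinement step in the oil analogy – stripping web-bot traffic from raw weblogs and grouping the remaining page views into user sessions – can be sketched in a few lines. This is a minimal illustration only: the 30-minute inactivity window, the user-agent markers, and the event-record fields are assumptions for the sketch, and production sessionization is considerably richer.

```python
from datetime import datetime, timedelta

# Illustrative assumptions: real bot detection and session rules are far richer.
BOT_MARKERS = ("bot", "crawler", "spider")  # assumed user-agent fragments
SESSION_TIMEOUT = timedelta(minutes=30)     # assumed inactivity threshold

def is_bot(event):
    """Crude user-agent screen for web-bot traffic."""
    return any(m in event["user_agent"].lower() for m in BOT_MARKERS)

def sessionize(events):
    """Group each user's page views into sessions: a new session starts
    whenever the gap since that user's previous event exceeds the timeout."""
    human = sorted((e for e in events if not is_bot(e)),
                   key=lambda e: (e["user_id"], e["ts"]))
    sessions, current, prev = [], [], None
    for e in human:
        new_session = (prev is None
                       or e["user_id"] != prev["user_id"]
                       or e["ts"] - prev["ts"] > SESSION_TIMEOUT)
        if new_session and current:
            sessions.append(current)
            current = []
        current.append(e)
        prev = e
    if current:
        sessions.append(current)
    return sessions
```

Even this toy version turns incomprehensible raw events into units of analysis – sessions – that a business user can reason about.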
Federal Government in the US to the COVID epidemic rested on the assumption that COVID infection data were fundamentally sound. In fact, they were not. Different states were collecting what looked like the same data according to different policies and processes – so that what appeared to be the 'same' data could not in fact be reliably compared. Models fed with bad data made bad predictions – and the result was bad public policy. You may not have to deal with 50 states – but you are fortunate indeed if all of the data from your manufacturing plants are created to the same standards and supplied on the same schedule, if you sell products in the same quantities and using the same identifiers that you use to order them, and so on. These are not problems that, by themselves, distributed architectures, Kubernetes clusters, and CI/CD development pipelines will solve, because they are not technology problems in the first place.

Six features of successful approaches

In practice, we observe six critical success factors for reducing time-to-market for the development of new data and analytic products whilst also preserving cross-domain interoperability.

1. Business-driven decomposition; or "subject areas versus domains"

One of the central concepts of domain-driven design that is often misunderstood by organizations pursuing distributed data architectures is "bounded context." Decomposition of a large and complex problem space into a collection of smaller models is not a new idea in

data elements and products that in practice may be shared only infrequently. It often makes more sense to decompose the problem space into domains that are aligned with key business processes and to allow each domain to implement the subject areas applicable to its own activities. This is illustrated below in figure 1.

[Figure 1: Subject areas versus domains. Subject areas are a database concept: they span multiple domains, are data-centric, and an enterprise logical data model (ELDM) is overwhelming to the business. Domains are a business-unit concept: self-contained from a business point of view, business-area centric, with small-to-medium size schemas. Example domains include Demand, Labor, Store Operations, Promotions, Finance, Transactions, Shipping, Pricing, Plan/Forecast, Product, Inventory, Logistics, and Location.]

2 https://www.theatlantic.com/science/archive/2021/03/americas-coronavirus-catastrophe-began-with-data/618287/
This model works well where each domain is defined with an explicit boundary and all users within that domain are working towards a common purpose and use a consistent business vocabulary. This can be further enhanced using global standards, for example, for the use of surrogation to obfuscate PII data in natural keys. Identification, definition, and sizing of domains is also a critical consideration. If domains are defined with too large a context, agility is sacrificed due to the number of products that must be built and maintained within the domain, and the number of people required to do so. Conversely, where domains are drawn with too narrow a focus, organizations find themselves forever creating additional cross-domain, enterprise teams that risk redundancy and duplication. For us, the "two pizza rule" of agile systems development remains a good guide; if the team building a data product cannot be fed with two extra-large pizzas, it may be too big – and further decomposition should be considered.

All of this implies some degree of management and co-ordination between different development teams. Lightweight governance processes ensure minimum levels of co-ordination across domains, providing bounded context. Published meta-data ensure that data products' high-value data elements, refined at significant cost to the organization, can be discovered and re-used in other contexts.

2. Separate schemas by domain to provide agility

One of the primary advantages of embracing domain-driven design is agility: loosely coupled teams working more-or-less independently, and each focusing within their specific areas of business expertise, are able to deliver data products in parallel.
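The surrogation of PII-bearing natural keys mentioned above is commonly implemented as a keyed hash, so that the same natural key yields the same surrogate key in every domain without the PII itself ever being shared. The sketch below is one such approach under stated assumptions: the secret "pepper" value, its central custody, and the function name are all hypothetical.

```python
import hashlib
import hmac

# Assumption for this sketch: a secret pepper held by the central governance
# function rather than by individual domains, so every domain derives the
# same surrogate for a given natural key without exposing the PII itself.
PEPPER = b"example-secret-rotate-me"  # illustrative value only

def surrogate_key(natural_key: str) -> str:
    """Deterministic keyed hash of a PII-bearing natural key (for example,
    a national ID), usable as a join key across domain schemas."""
    return hmac.new(PEPPER, natural_key.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Because the mapping is deterministic, the mortgage and credit-card domains can join on the surrogate while neither ever stores the natural key in the clear.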
A simplified retail banking scenario of business-driven decomposition

Consider a simplified retail banking scenario. There are multiple attributes of a mortgage product that are of limited value outside of the mortgage domain. Loan-to-value ratios, the type of survey on which a property valuation was based, and the date of that valuation all represent important information to the mortgage function. But they have limited value in other domains from across the Bank. Decisions about how these data are captured, cleansed, modelled, transformed, managed, and exploited should therefore be delegated exclusively to the mortgage domain. By contrast, information about customer salaries and mortgage debt will be highly relevant in other domains including unsecured loans, credit card, and risk. Ensuring that these data can be shared and combined across domains is not only highly desirable, but probably essential to ensure regulatory compliance in most geographies. And if delinquency codes can be standardized across all the domains that extend credit to customers, then the task of understanding which customers have or are likely to default across multiple product lines will be greatly simplified.

Our recommended approach to implementation of Data Mesh based architectures is to create separate schemas for each domain. Responsibility for data stewardship, data modeling, and population of the schema content is owned by experts with business knowledge about the specific domain under construction. This approach removes many of the bottlenecks associated with attempting to implement a single, centralized consolidation of all enterprise data into a single schema. The domain-oriented schemas provide a collection of data products aligned to areas of business focus within the enterprise. In our simplified retail bank scenario, for example, the mortgage domain may have a legitimate and urgent requirement to create a new data product to understand and measure the impact of the COVID-19 pandemic on the demand for larger suburban properties. At a minimum, this new data product will probably require the roll-up of mortgage product sales by a new and different geographical hierarchy from that used by the rest of the organization.

A domain-aligned development process and schema makes this possible without lengthy discussion and negotiation across the rest of the organization, so long as interoperability standards that also enable total sales of loan products to be rolled-up according to corporate reporting hierarchies exist and are respected.
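The interoperability standard described above hinges on a conformed identifier shared across hierarchies: the mortgage domain can roll sales up by its own geography while corporate reporting rolls the same records up by region, and the two always reconcile. A minimal sketch, with hypothetical branch identifiers, amounts, and hierarchy mappings:

```python
from collections import defaultdict

# Hypothetical data: mortgage sales keyed by a conformed branch_id, the
# interoperability standard shared across domains. Each hierarchy maps
# branch_id to its own grouping level.
sales = [
    {"branch_id": "b1", "amount": 100.0},
    {"branch_id": "b2", "amount": 250.0},
    {"branch_id": "b3", "amount": 400.0},
]
corporate_hierarchy = {"b1": "north", "b2": "north", "b3": "south"}
mortgage_hierarchy = {"b1": "urban", "b2": "suburban", "b3": "suburban"}

def roll_up(records, hierarchy):
    """Aggregate sales according to a given branch -> group mapping."""
    totals = defaultdict(float)
    for r in records:
        totals[hierarchy[r["branch_id"]]] += r["amount"]
    return dict(totals)
```

Because both hierarchies key off the same conformed `branch_id`, the domain-specific roll-up and the corporate roll-up necessarily sum to the same grand total – which is exactly what makes the mortgage domain's new geography safe to introduce unilaterally.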
Re-use is therefore about much more than merely avoiding the creation of 99 different versions of a nearly identical report. Ultimately, it is about creating layered data architectures that enable high-value data elements to be discovered and re-used to efficiently create multiple data products that support multiple business processes. Interoperability across the domains requires the definition of consistent primary and foreign key relationships and global standards for data typing, naming conventions, and quality metrics.

Robust governance and agile, incremental approaches to delivery can co-exist. Where they do, combined with

5. Supertypes and subtypes

As we have already discussed, to deliver enterprise data products successfully and efficiently we need the domain teams concerned to be able to reliably combine and aggregate data across multiple domains. In the banking scenario described earlier it would be useful to have an enterprise schema to capture information about customers' accounts across all the products they have with the bank.

[Figure: example account attributes – account_interest_amt, check_limit_amt, late_pay_fee_amt, overdraft_fee_amt, reward_points_qty, min_balance_amt]
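One common way to realize the supertype/subtype pattern described here is a shared account supertype carrying the attributes every product has in common, with subtypes adding product-specific attributes. The sketch below is illustrative only – it borrows the attribute names shown in the figure, but the class structure and any defaults are assumptions, not Teradata's reference model.

```python
from dataclasses import dataclass

@dataclass
class Account:
    """Supertype: attributes common to every account product."""
    account_id: str
    account_interest_amt: float
    late_pay_fee_amt: float

@dataclass
class CheckingAccount(Account):
    """Subtype: checking-specific attributes (defaults are illustrative)."""
    check_limit_amt: float = 0.0
    overdraft_fee_amt: float = 0.0
    min_balance_amt: float = 0.0

@dataclass
class CreditCardAccount(Account):
    """Subtype: card-specific attributes."""
    reward_points_qty: int = 0

def total_late_fees(accounts):
    """Cross-product aggregation that needs only supertype attributes."""
    return sum(a.late_pay_fee_amt for a in accounts)
```

The payoff is that enterprise-level questions ("what are the total late fees across all products this customer holds?") can be answered against the supertype alone, while each domain still owns its subtype's attributes.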
Digital transformation and modern business initiatives are driving the need for more, not less, integration across domains. Providing the coherent, cross-functional view across operations required by modern businesses requires that data are not merely technically connected, but also that they are semantically linked. The consistency in implementation across domains required to make this happen does not just spontaneously emerge; rather, it requires a co-ordinated, business-driven approach to data governance. Furthermore, it is still the case that specialist expertise

'Engineered' levels of modeling and integration should be deferred until there is a sophisticated understanding of which data will need to be frequently and reliably compared and/or joined with one another. Since this level of understanding exists only rarely when the first few MVP data products are being developed, organizations should take care to avoid over-investing in data modeling and data engineering during the early stages of a new programme or project by adopting a 'Light Integration' approach, like Teradata's LIMA framework.
Amazon similarly dominates retail today by combining purchase data with behavioral data to understand what customers want better than its competitors do. By enabling partners to leverage the platform it has created, it generates even more data about even more customers in a virtuous circle. Data integration is critical to both business models.

Product performance, supplier performance, and cost management domains can build out dedicated data products in parallel, but unless these domains engage in the kind of lightweight collaboration and governance described earlier in this document, development of one of the high-value analytic applications – demand forecasting – will be significantly more complex and more time consuming. Furthermore, because that application will require large volumes of product, sales, and order data to be routinely and frequently combined, co-locating these data products on the same platform to avoid unnecessary data movement – and so improve performance and scalability – is likely to be highly desirable.

Figure 5 shows a grossly simplified schematic representation of a Retail architecture, illustrating the concepts of co-location, connection, and isolation in the development of data products. The Sales, Orders, and Inventory products are domain-oriented and developed in parallel, but domain interrelationships are defined and the products are co-located on the same platform so that data can be combined to create a scalable and high-performance Demand Forecast data product.

The Customer Experience data products are also domain-oriented and are also built out in parallel, but on a separate platform that enables these data products to be run-time connected to improve both the Customer and the Demand Forecast data products; whilst integration is deferred until run-time, it still requires interrelationships to be defined and modeled. A conscious decision is taken to de-couple the Activity Based Costing data product from the Product Performance data product, at the expense of potential inconsistency in Sales reporting and increased technical debt.

[Figure 5: co-location, connection, and isolation – labels include Demand Forecast, Activity Based Costing, Behavioral, and Object Lake.]
The Teradata logo is a trademark, and Teradata is a registered trademark of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Teradata continually
improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features,
functions and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.