2/3/22, 12:46 PM What Is a Data Lake?
Definition from SearchDataManagement
9 SearchDataManagement
g
Tech Accelerator
What is data architecture? A data management blueprint
Home
> Big data management
> Data center management
> data lake
DEFINITION
data lake
By Craig Stedman,
Editor at Large Ben Lutkevich,
Technical Writer
What is a data lake?
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for analytics applications. While a
traditional data warehouse stores data in hierarchical dimensions and tables, a data lake uses a flat architecture to store data, primarily in files or object
storage. That gives users more flexibility on data management, storage and usage.
Data lakes are often associated with Hadoop systems. In deployments based on the distributed processing framework, data is loaded into the Hadoop
Distributed File System (HDFS) and resides on the different computer nodes in a Hadoop cluster. Increasingly, though, data lakes are being built on cloud
object storage services instead of Hadoop. Some NoSQL databases are also used as data lake platforms.
?What is a Data Lake? How is it Different From a Data Warehouse
https://searchdatamanagement.techtarget.com/definition/data-lake 1/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
Why do organizations use data lakes?
Data lakes commonly store sets of big data that can include a combination of structured, unstructured and semistructured data. Such environments aren't a
good fit for the relational databases that most data warehouses are built on. Relational systems require a rigid schema for data, which typically limits them
to storing structured transaction data. Data lakes support various schemas and don't require any to be defined upfront. That enables them to handle
different types of data in separate formats.
As a result, data lakes are a key data architecture component in many organizations. Companies primarily use them as a platform for big data analytics and
other data science applications requiring large volumes of data and involving advanced analytics techniques, such as data mining, predictive modeling and
machine learning.
THIS ARTICLE IS PART OF
What is data architecture? A data management blueprint
Which also includes:
5 principles of a well-designed data architecture
Data modeling vs. data architecture: What's the difference?
Data architecture vs. information architecture: How they differ
https://searchdatamanagement.techtarget.com/definition/data-lake 2/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
A data lake provides a central location for data scientists and analysts to find, prepare and analyze relevant data. Without one, that process is more
complicated. It's also harder for organizations to take full advantage of their data assets to help drive more informed business decisions and strategies.
Data lake architecture
Many technologies can be used in data lakes, and organizations can combine them in different ways. That means the architecture of a data lake often
varies from organization to organization. For example, one company might deploy Hadoop with the Spark processing engine and HBase, a NoSQL
database that runs on top of HDFS. Another might run Spark against data stored in Amazon Simple Storage Service (S3). A third might choose other
technologies.
Also, not all data lakes store raw data only. Some data sets may be filtered and processed for analysis when they're ingested. If so, the data lake
architecture must enable that and include sufficient storage capacity for prepared data. Many data lakes also include analytics sandboxes, dedicated
storage spaces that individual data scientists can use to work with data.
However, three main architectural principles distinguish data lakes from conventional data repositories:
No data needs to be turned away. Everything collected from source systems can be loaded and retained in a data lake if desired.
Data can be stored in an untransformed or nearly untransformed state, as it was received from the source system.
That data is later transformed and fit into a schema as needed based on specific analytics requirements, an approach known as schema-on-read.
Whatever technology is used in a data lake deployment, some other elements should also be included to ensure that the data lake is functional and that the
data it contains doesn't go to waste. That includes the following:
A common folder structure with naming conventions.
A searchable data catalog to help users find and understand data.
A data classification taxonomy to identify sensitive data, with information such as data type, content, usage scenarios and groups of possible users.
Data profiling tools to provide insights for classifying data and identifying data quality issues.
A standardized data access process to help control and keep track of who is accessing data.
Data protections, such as data masking, data encryption and automated usage monitoring.
https://searchdatamanagement.techtarget.com/definition/data-lake 3/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
Data awareness among the users of a data lake is also a must, especially if they include business users acting as citizen data scientists. In addition to
being trained on how to navigate the data lake, users should understand proper data management and data quality techniques, as well as the
organization's data governance and usage policies.
Sample data lake architecture diagram
k w
This is a sample architectural diagram for a data lake that supports advanced anal
Data lake vs. data warehouse
The biggest distinctions between data lakes and data warehouses are their support for data types and their approach to schema. In a data warehouse that
primarily stores structured data, the schema for data sets is predetermined, and there's a plan for processing, transforming and using the data when it's
loaded into the warehouse. That's not necessarily the case in a data lake. It can house different types of data and doesn't need to have a defined schema
for them or a specific plan for how the data will be used.
To illustrate the differences between the two platforms, think of an actual warehouse versus a lake. A lake is liquid, shifting, amorphous and fed by rivers,
streams and other unfiltered water sources. Conversely, a warehouse is a structure with shelves, aisles and designated places to store the items it
contains, which are purposefully sourced for specific uses.
Data lake vs. data warehouse comparison
k A side-by-side comparison of data lakes and data warehouses.
w
This conceptual difference manifests itself in several ways, including the following:
Technology platforms. A data warehouse architecture usually includes a relational database running on a conventional server, whereas a data lake is
typically deployed in a Hadoop cluster or other big data environment.
Data sources. The data stored in a warehouse is primarily extracted from internal transaction processing applications to support basic business
intelligence (BI) and reporting queries, which are often run in associated data marts created for specific departments and business units. Data lakes
typically store a combination of data from business applications and other internal and external sources, such as websites, IoT devices, social media and
mobile apps.
Users. Data warehouses are useful for analyzing curated data from operational systems through queries written by a BI team or business analysts and
other self-service BI users. Because the data in a data lake is often uncurated and can originate from various sources, it's generally not a good fit for the
https://searchdatamanagement.techtarget.com/definition/data-lake 4/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
average BI user. Instead, data lakes are better suited for use by data scientists who have the skills to sort through the data and extract meaning from it.
Data quality. The data in a data warehouse is generally trusted as a single source of truth because it has been consolidated, preprocessed and cleansed
to find and fix errors. The data in a data lake is less reliable because it's often pulled in from different sources as is and left in its raw state without first being
checked for accuracy and consistency.
Agility and scalability. Data lakes are highly agile platforms: Because they use commodity hardware, most can be reconfigured and expanded as needed
to meet changing data requirements and business needs. Data warehouses are less flexible because of their rigid schema and prepared data sets.
Security. Data warehouses have more mature security protections because they have existed for longer and are usually based on mainstream
technologies that likewise have been around for decades. But data lake security methods are improving, and various security frameworks and tools are
now available for big data environments.
Because of their differences, many organizations use both a data warehouse and a data lake, often in a hybrid deployment that integrates the two
platforms. Frequently, data lakes are an addition to an organization's data architecture and enterprise data management strategy instead of replacing a
data warehouse.
Cloud vs. on-premises data lakes
Initially, most data lakes were deployed in on-premises data centers. But they're now a part of cloud data architectures in many organizations.
The shift began with the introduction of cloud-based big data platforms and managed services that incorporate Hadoop and Spark, plus various other
technologies. In particular, cloud platform market leaders AWS, Microsoft and Google offer big data technology bundles: Amazon EMR, Azure HDInsight
and Google Dataproc, respectively.
The availability of cloud object storage services, such as S3, Azure Blob Storage and Google Cloud Storage, gave organizations lower-cost data storage
alternatives to HDFS, which made data lake deployments in the cloud more appealing financially. Cloud vendors also added data lake development, data
integration and other data management services to automate deployments. Even Cloudera, a Hadoop pioneer that still obtained about 90% of its revenues
from on-premises users as of 2019, now offers a cloud-native platform that supports both object storage and HDFS.
What are the benefits of a data lake?
Data lakes provide a foundation for data science and advanced analytics applications. By doing so, they help enable organizations to manage business
operations more effectively and identify business trends and opportunities. For example, a company can use predictive models on customer buying
behavior to improve its online advertising and marketing campaigns. Analytics in a data lake can also aid in risk management, fraud detection, equipment
maintenance and other business functions.
https://searchdatamanagement.techtarget.com/definition/data-lake 5/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
Like data warehouses, data lakes also help break down data silos by combining data sets from different systems in a single repository. That gives data
science teams a complete view of available data and simplifies the process of finding relevant data and preparing it for analytics uses. It can also help
reduce IT and data management costs by eliminating duplicate data platforms in an organization.
A data lake also offers other benefits, including the following:
It enables data scientists and other users to create data models, analytics applications and queries on the fly.
Data lakes are relatively inexpensive to implement because Hadoop, Spark and many other technologies used to build them are open source and can
be installed on low-cost hardware.
Labor-intensive schema design and data cleansing, transformation and preparation can be deferred until after a clear business need for the data is
identified.
Various analytics methods can be used in data lake environments, including predictive modeling, machine learning, statistical analysis, text mining, real-
time analytics and SQL querying.
What challenges do data lakes pose?
Despite the business benefits that data lakes provide, deploying and managing them can be a difficult process. These are some of the challenges that data
lakes pose for organizations:
Data swamps. One of the biggest challenges is preventing a data lake from turning into a data swamp. If it isn't set up and managed properly, the data
lake can become a messy dumping ground for data. Users may not find what they need, and data managers may lose track of data that's stored in the
data lake, even as more pours in.
Technology overload. The wide variety of technologies that can be used in data lakes also complicates deployments. First, organizations must find the
right combination of technologies to meet their particular data management and analytics needs. Then they need to install them, although the growing
use of the cloud has made that step easier.
Unexpected costs. While the upfront technology costs may not be excessive, that can change if organizations don't carefully manage data lake
environments. For example, companies may get surprise bills for cloud-based data lakes if they're used more than expected. The need to scale up data
lakes to meet workload demands also increases costs.
Data governance. One of the purposes of a data lake is to store raw data as-is for various analytics uses. But without effective governance of data
lakes, organizations may be hit with data quality, consistency and reliability issues. Those problems can hamper analytics applications and produce
flawed results that lead to bad business decisions.
https://searchdatamanagement.techtarget.com/definition/data-lake 6/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
Data lake vendors
The Apache Software Foundation develops Hadoop, Spark and various other open source technologies used in data lakes. The Linux Foundation and
other open source groups also oversee some data lake technologies. The open source software can be downloaded and used for free. But software
vendors offer commercial versions of many of the technologies and provide technical support to their customers. Some vendors also develop and sell
proprietary data lake software.
There are numerous data lake technology vendors, some offering full platforms and others with tools to help users deploy and manage data lakes. Some
prominent vendors include:
AWS. In addition to Amazon EMR and S3, it has supporting tools like AWS Lake Formation for building data lakes and AWS Glue for data integration
and preparation.
Cloudera. Its Cloudera Data Platform can be deployed in the public cloud or hybrid clouds that include on-premises systems, and it's supported by a
data lake service.
Databricks. Founded by Spark's creators, it offers a cloud-based data lakehouse platform that combines elements of both data lakes and data
warehouses.
Dremio. It sells a "SQL lakehouse" platform that supports BI dashboard design and interactive querying on data lakes and is also available as a fully
managed cloud service.
Google. It augments Dataproc and Google Cloud Storage with Google Cloud Data Fusion for data integration and a set of services for moving on-
premises data lakes to the cloud.
HPE. The HPE GreenLake platform supports Hadoop environments in the cloud and on premises, with both file and object storage and a Spark-based
data lakehouse service.
Microsoft. Along with Azure HD Insight and Azure Blob Storage, it offers Azure Data Lake Storage Gen2, a repository that adds a hierarchical
namespace to Blob Storage.
Oracle. Its cloud-based data lake technologies include a big data service for Hadoop and Spark clusters, an object storage service and a set of data
management tools.
Qubole. The cloud-native Qubole data lake platform provides data management, engineering and governance capabilities and supports various
analytics applications.
Snowflake. While it's best known as a cloud data warehouse vendor, the Snowflake platform also supports data lakes and can work with data in cloud
object stores.
https://searchdatamanagement.techtarget.com/definition/data-lake 7/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
This was last updated in October 2021
∙∙
m Continue Reading About data lake
Data fabrics help data lakes seek the truth
∙∙
7 steps to a successful data lake implementation
Building a strong data analytics platform architecture
∙
5 principles of a well-designed data architecture
Data lake or data swamp: It's about governance
Related Terms
big data
Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for ... See complete definitionq
big data analytics
Big data analytics is the often complex process of examining big data to uncover information -- such as hidden patterns, ... See complete definitionq
What is data preparation? An in-depth guide to data prep
Data preparation is the process of gathering, combining, structuring and organizing data so it can be used in business ... See complete definitionq
https://searchdatamanagement.techtarget.com/definition/data-lake 8/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
m Dig Deeper on Big data management
What lies beyond the data warehouse? Data lake storage: Cloud vs on-premise data lakes
By: Stephen Pritchard By: Antony Adshead
Data warehouse vs. data lake: Key differences Key factors for successful data lake implementation
By: Bridget Botelho By: Chris Foot
-ADS BY GOOGLE
https://searchdatamanagement.techtarget.com/definition/data-lake 9/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
Latest TechTarget SearchBusinessAnalytics
resources
BUSINESS ANALYTICS
A 2 App development a focus of MicroStrategy analytics
platform
The longtime independent vendor is making application development and embedded BI
AWS priorities in order to enable data-driven ...
2
CONTENT MANAGEMENT
Yellowfin acquired by B2B software specialist Idera
The formerly independent analytics vendor hopes to expand its customer base by getting
ORACLE
acquired, while its app development tools ...
SAP
SQL SERVER
About Us Contributors Guides
Editorial Ethics Policy Reprints Opinions
Meet The Editors Answers Photo Stories
Contact Us Definitions Quizzes
Advertisers E-Products Tips
Business Partners Events Tutorials
Media Kit Features Videos
Corporate Site
All Rights Reserved, Copyright 2005 - 2022, TechTarget
Privacy Policy
Do Not Sell My Personal Info
https://searchdatamanagement.techtarget.com/definition/data-lake 10/11
2/3/22, 12:46 PM What Is a Data Lake? Definition from SearchDataManagement
https://searchdatamanagement.techtarget.com/definition/data-lake 11/11