Dragon1 Icon for Data Lake. Created by Anonymous, Creative Commons license.
Dragon1 Definition for Data Lake:
A data lake is a governed storage repository architecture that holds a large amount of raw data (i.e., data in its native format) that is only defined and structured upon usage.
Let us define Data Lake
Data lakes are in the news today. More and more IT managers and enterprise architects are making the data lake a core part of their organization's future architecture. Many people think that a data lake is just another kind of data warehouse. However, a data lake and a data warehouse are different things, although they have similarities.
The short definition of a data lake is: "A data lake is a governed storage repository architecture holding a large amount of raw data (i.e. data in its native format) that is only defined and structured upon usage."
Let us investigate the differences.
Courtesy of Zaloni.
What is the Difference Between a Data Lake and a Data Warehouse?
Both are repositories for storing data. That is where the resemblance ends.
A data lake holds data that is structured, semi-structured, and unstructured. The structure of the data and the requirements on it are not defined, and the data itself is not changed, until the data is needed. This speeds up extracting, loading, and working with the data.
A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.
Differences in Treating Data
A data lake stores all data without changing it. A data warehouse stores only data that has first been made fit to store, that is, data that has been defined and structured.
Differences in Processing Data
Data is loaded using two different approaches. In a data lake, data is loaded schema-on-read, meaning the data is loaded raw, as-is, and a structure is applied only when the data is read. In a data warehouse, data is loaded schema-on-write, meaning the data is defined and structured before it is loaded.
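The difference between the two loading approaches can be sketched in a few lines of Python. This is a minimal illustration, not a real data lake or data warehouse product: the sample records, the SCHEMA mapping, and the function names lake_load, lake_read, and warehouse_load are all hypothetical and chosen only for this example.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "ts": "2024-01-05"}',
    '{"user": "bob", "amount": "7", "note": "gift"}',
    '{"user": "cy"}',
]

# Hypothetical schema: field name -> function that casts the raw value.
SCHEMA = {"user": str, "amount": float}


def lake_load(lines):
    """Schema-on-read, load step: keep every record exactly as received."""
    return list(lines)


def lake_read(stored, fields):
    """Schema-on-read, read step: the structure is applied only now."""
    rows = []
    for line in stored:
        record = json.loads(line)
        rows.append({
            name: (cast(record[name]) if name in record else None)
            for name, cast in fields.items()
        })
    return rows


def warehouse_load(lines, fields):
    """Schema-on-write: structure the data before it is stored.

    Records that do not fit the schema are rejected at load time.
    """
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            table.append({name: cast(record[name]) for name, cast in fields.items()})
        else:
            rejected.append(line)
    return table, rejected


lake = lake_load(RAW_EVENTS)
lake_rows = lake_read(lake, SCHEMA)
warehouse_rows, rejected = warehouse_load(RAW_EVENTS, SCHEMA)
```

In this sketch the lake keeps all three records, including the fields the schema does not mention, while the warehouse stores only the two records that fit the schema and rejects the third at load time.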
An Agile Solution
A data lake is an agile solution because definitions and structures only have to be created at the moment the data is needed; models, queries, and apps can then be built on top of them. A data warehouse is less agile because the business processes that depend on parts of the data warehouse do not allow it to be changed suddenly. Data warehouses cannot be changed as quickly as data lakes.
A Secure Solution
A data lake is a relatively new technology, so data lake products tend to be built to the newest security requirements and principles. A data warehouse is a fairly old technology, so products built on it tend to reflect older security requirements and principles.
Hadoop
A data lake is often, but not always, implemented using Hadoop.
Hadoop is an open-source software framework for the distributed storage and processing of very large data sets using the MapReduce programming model. Hadoop runs on computer clusters built from commodity hardware. Many data lake solutions make use of Hadoop or are related to it. But of course, Hadoop is not the only software framework for data lakes.
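To give a feel for the MapReduce programming model, here is a minimal single-process sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and a real cluster would spread the map and reduce steps over many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["data lake data", "lake house"])
```

Each map call looks at one document only and each reduce call at one key only, which is what allows a framework to distribute the work across a cluster.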
Data Lake Architecture Principle
The architecture principle of the data lake concept is: By concentrating all data in one collection and placing smart governance on top of it, without spending time and resources on restructuring or defining the data before usage, the business can be presented with a much better, single, and agile view of its data than would otherwise be possible.
The above picture shows a data lake design pattern compliant with the principle.
If you have comments or remarks about this Dragon1 term or definition, please mail to specs@dragon1.com.