Data Lake Definition

Dragon1 Icon for Data Lake
CREATED BY ANONYMOUS, CREATIVE COMMONS LICENSE

Dragon1 Definition for Data Lake:
A data lake is a governed storage repository architecture that holds a large amount of raw data (i.e., data in its native format) that is only defined and structured upon usage.

Let us define Data Lake

Data lakes are in the news today. More and more IT managers and enterprise architects are making the data lake a core part of their organization's future architecture. Many people think a data lake is just another kind of data warehouse. However, a data lake and a data warehouse are different, although they have similarities.

The short definition of a data lake is: "A data lake is a governed storage repository architecture holding a large amount of raw data (i.e., data in its native format) that is only defined and structured upon usage."

Let us investigate the differences.

Courtesy of Zaloni.

What is the Difference Between a Data Lake and a Data Warehouse?

Both are repositories for storing data. That is their only resemblance.

A data lake holds structured, semi-structured, and unstructured data. The structure of the data and the requirements on it are not defined or changed until the data is needed. This speeds up extracting, loading, and working with the data.

A data warehouse is a large store of data accumulated from various sources within a company and used to guide management decisions.

Differences in Treating Data

A data lake stores all data without changing it. A data warehouse stores data only after it has been made fit for storage, that is, after it has been defined and structured.

Differences in Processing Data

Data is loaded using two different approaches. In a data lake, data is loaded schema-on-read: it is stored as-is, as raw data, and a structure is applied only when the data is read. In a data warehouse, data is loaded schema-on-write: the data is defined and structured before it is loaded.
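The two loading approaches can be contrasted in a minimal Python sketch. It is an illustration only, not how any particular data lake or data warehouse product works: the sample events, the field names, and the functions load_schema_on_write, load_schema_on_read, and read_with_schema are all hypothetical.

```python
import json

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ann", "amount": "12.50", "country": "NL"}',
    '{"user": "bob", "amount": "7", "extra": {"coupon": "X1"}}',
]


def load_schema_on_write(raw_lines):
    """Warehouse style: enforce a fixed structure before storing."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        # Only the predefined columns survive; everything else is dropped.
        table.append({
            "user": str(record["user"]),
            "amount": float(record["amount"]),
        })
    return table


def load_schema_on_read(raw_lines):
    """Lake style: store the raw lines unchanged."""
    return list(raw_lines)


def read_with_schema(lake, fields):
    """Apply a structure only at read time; missing fields become None."""
    for line in lake:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


warehouse = load_schema_on_write(RAW_EVENTS)
lake = load_schema_on_read(RAW_EVENTS)

# A question nobody planned for at load time can still be answered
# from the lake, because the raw "country" field was never discarded.
by_country = list(read_with_schema(lake, ["user", "country"]))
```

In this sketch the warehouse table no longer contains the country field, so answering the same question there would require changing the load step and reloading the data.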

An Agile Solution

A data lake is an agile solution because definitions and structures are created only when the data is needed, at the point where models, queries, and apps are built. A data warehouse is less agile: the business processes that depend on parts of the data warehouse cannot tolerate sudden changes to it. Data warehouses therefore cannot be changed as quickly as data lakes.

A Secure Solution

A data lake is a relatively new technology, so data lake products tend to be built to the newest security requirements and principles. A data warehouse is a fairly old technology, so products built on it may still reflect older security requirements and principles.

Hadoop

A data lake is often, but not always, implemented using Hadoop.

Hadoop is an open-source software framework for the distributed storage and processing of big data sets using the MapReduce programming model. Hadoop runs on computer clusters built from commodity hardware. Many data lake solutions make use of or are related to Hadoop. But of course, Hadoop is not the only software framework for data lakes.
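To give an idea of the MapReduce programming model, here is a minimal single-process sketch in plain Python. It does not use Hadoop or any Hadoop API. The function names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and in a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["raw data in a lake", "a lake of raw data"])
```

Because each map call looks at one document only and each reduce call looks at one key only, the work can be spread over many machines, which is what makes the model suitable for large data sets.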

Data Lake Architecture Principle

The architecture principle of the data lake concept is: By concentrating all data in one collection and placing smart governance on top of it, without spending time and resources on restructuring or defining data before usage, the business can be presented with a much better single and agile data view than otherwise.

The above picture shows a data lake design pattern compliant with the principle.


If you have comments or remarks about this Dragon1 term or definition, please mail to specs@dragon1.com.
