What is a Data Lakehouse?

7th of June, 2023


We can describe a Data Lakehouse as a data architecture pattern that combines the data management structure found in traditional data warehouses with the scalability of cloud storage on a data lake. But it is more than just an architecture pattern.


It is a new technology that pairs the agility and scalability of storing data on a data lake with the performance and structure imposed by a data warehouse, thus overcoming the limitations of a traditional data warehouse. It adds features such as ACID transactions, change data capture, change auditing, and roll-back. These wouldn't be possible without an open-source storage format used with Apache Spark, called the Delta format.


Essentially, the Delta format is an extension on top of Parquet file storage: it maintains metadata around the Parquet files to capture changes and enable version control, optimisation, and roll-back.
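To make that idea concrete, here is a minimal, conceptual sketch in plain Python of how a transaction log alongside immutable data files can provide versioning and time travel. This is not the real Delta Lake implementation (which uses Parquet data files and a richer `_delta_log` protocol); the class and file names are illustrative assumptions only.

```python
# Conceptual sketch: immutable data files plus an ordered log of JSON
# commits recording which files are "live" at each version. NOT the real
# Delta Lake implementation, just an illustration of how metadata around
# unchanging files enables versioning and roll-back.
import json
from pathlib import Path
from tempfile import TemporaryDirectory

class MiniDeltaTable:
    def __init__(self, root: Path):
        self.root = root
        self.log = root / "_delta_log"   # directory of ordered commit files
        self.log.mkdir(parents=True, exist_ok=True)

    def _commit(self, actions):
        # Each commit is a numbered JSON file; the number is the version.
        version = len(list(self.log.glob("*.json")))
        (self.log / f"{version:020d}.json").write_text(json.dumps(actions))
        return version

    def append(self, filename: str, rows):
        # Write an immutable data file, then commit an "add" action for it.
        (self.root / filename).write_text(json.dumps(rows))
        return self._commit([{"add": filename}])

    def live_files(self, as_of_version=None):
        # Replay the log (up to a version) to find the current file set.
        commits = sorted(self.log.glob("*.json"))
        if as_of_version is not None:
            commits = commits[: as_of_version + 1]
        files = set()
        for commit in commits:
            for action in json.loads(commit.read_text()):
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
        return files

    def read(self, as_of_version=None):
        # "Time travel": read the table as it looked at an older version.
        rows = []
        for f in sorted(self.live_files(as_of_version)):
            rows.extend(json.loads((self.root / f).read_text()))
        return rows

with TemporaryDirectory() as tmp:
    table = MiniDeltaTable(Path(tmp))
    table.append("part-0.json", [{"id": 1}])   # commits version 0
    table.append("part-1.json", [{"id": 2}])   # commits version 1
    latest = table.read()                      # sees both data files
    as_of_v0 = table.read(as_of_version=0)     # time travel to version 0
```

Because data files are never rewritten, roll-back is just a matter of replaying fewer commits, which is the same principle that makes Delta's version history and auditing cheap.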

Furthermore, if we look at the technologies that constitute a Lakehouse, they consist of the following:

  1. Metadata layers: With metadata layers, managing and storing data lake metadata, such as schema, lineage, and quality, becomes effortless. Leveraging this metadata can significantly enhance the performance and governance of lakehouse workloads. This is where the Delta format comes in!
  2. Query engines: Ingenious query engines have emerged to efficiently retrieve data from lakehouses. They employ various techniques like columnar storage, in-memory caching, and query optimisation, enabling remarkable performance improvements. These include, but are not limited to, Databricks and Azure Synapse.
  3. Data processing frameworks: Leading frameworks like Apache Spark play a vital role in processing data within lakehouses. They empower organisations to carry out diverse data processing tasks, such as data cleaning, data transformation, and data modelling. This more specifically relates to DataFrame manipulation and lazy execution in Apache Spark.
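The lazy execution mentioned in point 3 can be sketched in a few lines of plain Python. This is a toy illustration of the model Spark's DataFrames follow, not the Spark API itself: transformations only record a plan, and nothing runs until an action triggers it. The `LazyFrame` class and its methods are illustrative assumptions.

```python
# Toy illustration of lazy execution: transformations build a plan;
# the action (collect) executes it. Plain Python, not the Spark API.
class LazyFrame:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []   # recorded transformations, not yet run

    def filter(self, predicate):
        # Transformation: record the step and return a new frame; no work yet.
        return LazyFrame(self.data, self.plan + [("filter", predicate)])

    def select(self, func):
        # Transformation: record a per-row projection; no work yet.
        return LazyFrame(self.data, self.plan + [("select", func)])

    def collect(self):
        # Action: only now is the accumulated plan executed over the data.
        rows = self.data
        for op, fn in self.plan:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
            elif op == "select":
                rows = [fn(r) for r in rows]
        return rows

frame = LazyFrame([{"amount": 5}, {"amount": 20}])
pipeline = frame.filter(lambda r: r["amount"] > 10) \
                .select(lambda r: r["amount"])
result = pipeline.collect()   # filter then select run together here
```

Deferring execution like this is what lets a real engine inspect the whole plan before running it and apply optimisations such as predicate pushdown.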

Lakehouses epitomise the emergence of a modern and personalised data architecture, surpassing the limitations of traditional data warehouse and data lake setups. They are the preferred choice for organisations seeking a data architecture that offers flexibility, scalability, and exceptional performance.