Lakehouse — New kid in data town

May 11, 2023


A brief look at one of the latest architecture paradigms in the data analytics space

As the data volume generated by new digital-era platforms keeps increasing exponentially, and the insights derived from that data become ever more valuable and critical, a wave of architectural innovation and reform is under way. In this article, we will take a close look at one of the most disruptive architectures in the data analytics space.

 

Let’s have a look at a traditional data analytics platform

[Diagram: a traditional data analytics platform]

The core components of the above architecture are data ingestion, a data lake, ETL, and a data warehouse. The data lake and the data warehouse are similar and different at the same time: the similarity is that both are storage platforms for keeping data. Now let's look at the differences.

[Table: differences between a data lake and a data warehouse]

The million-dollar question — Is there a way to combine data lake and data warehouse?

Naturally, someone would ask: "Can we have a platform that can serve as both a data lake and a data warehouse?" The answer is the data lakehouse.

The data lakehouse is a recently emerged architecture that combines the best of both the data lake and the data warehouse. In a nutshell, a data lakehouse enables

  • ACID compliance and full transactional update capabilities on data lake
  • High performance query execution
  • Data science and data engineering workloads
  • Unified real-time and batch data processing
  • Schema enforcement and data governance

[Diagram: data lakehouse architecture]

As you can see in the diagram, the lakehouse storage acts as both data lake and data warehouse. This is a very simplified version of the architecture. We will now look at a few of the innovations and software products that made this architecture possible; they are quite interesting.

Some popular early players (they are still very popular)

The products below are based on the MPP (Massively Parallel Processing) architecture.

Presto: An open-source, distributed SQL query engine that runs on a cluster of machines. Originating at Facebook, Presto is now one of the most popular query engines around. It enables analytics using ANSI SQL on large amounts of data stored in a variety of systems such as HDFS, cloud storage, and NoSQL stores. Presto enables data warehouse-like query performance for BI and reporting tools.
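
For a flavor of how an application talks to Presto, here is a minimal sketch using the presto-python-client (prestodb) DB-API; the coordinator host, catalog, and table names are hypothetical:

    import prestodb

    # Connect to a Presto coordinator (host, catalog, and schema are hypothetical).
    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )

    # Standard ANSI SQL over data that may live in HDFS, S3, or a NoSQL store.
    cur = conn.cursor()
    cur.execute("SELECT region, count(*) AS events FROM web_logs GROUP BY region")
    for row in cur.fetchall():
        print(row)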

Apache Drill: Similar to Presto, Drill is another open-source SQL query engine for big data exploration. Drill is designed from the ground up to support high-performance analysis, using ANSI SQL, on the semi-structured and rapidly evolving data coming from modern big data applications. Drill originated at MapR (since acquired by Hewlett Packard Enterprise) and is inspired by Google's Dremel.

Drill is an Apache top-level project, while Presto is governed by the Presto Foundation under the Linux Foundation.

Amazon Athena: An AWS-native, serverless version of Presto. It provides the functionality of Presto with no infrastructure to install or manage; you pay only for the queries you fire, based on the data they scan. Although Athena is primarily meant to query data stored in the Amazon S3 data lake, federated queries along with source connectors can be used to query a wide variety of data sources.
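
Because Athena is serverless, running a query is just an API call. A minimal boto3 sketch, assuming an existing Glue database and an S3 bucket for results (both names hypothetical):

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Start the query; Athena writes results to the S3 location given below.
    qid = athena.start_query_execution(
        QueryString="SELECT region, count(*) FROM web_logs GROUP BY region",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query completes, then fetch the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
            print(row)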

Amazon Redshift Spectrum: Another AWS product, built on Amazon Redshift, Amazon's popular data warehouse offering. Redshift Spectrum leverages the parallel processing capabilities of an already provisioned Redshift cluster to connect to and query data residing in Amazon S3.
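
The key Spectrum step is registering an external schema that points at the Glue Data Catalog; after that, S3-backed tables can be queried alongside local Redshift tables. A sketch using the redshift_connector package, with a hypothetical cluster endpoint, IAM role, and table:

    import redshift_connector

    # Connect to an already provisioned Redshift cluster (endpoint and credentials hypothetical).
    conn = redshift_connector.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        database="dev",
        user="awsuser",
        password="example-password",
    )
    cur = conn.cursor()

    # One-time setup: expose a Glue Data Catalog database as an external schema.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'analytics'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    """)

    # Query S3-resident data through the external schema, as if it were a local table.
    cur.execute("SELECT region, count(*) FROM spectrum.web_logs GROUP BY region")
    print(cur.fetchall())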

Google BigQuery: Based on Google's Dremel, BigQuery is the data warehouse offering from Google Cloud. Unlike a typical data warehouse solution, BigQuery supports machine learning workloads through BigQuery ML. It can also connect to external big data storage systems and query them using federated queries.
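
Querying BigQuery from Python is short once credentials are configured; the project, dataset, and table below are hypothetical:

    from google.cloud import bigquery

    # Assumes GCP credentials are already configured in the environment.
    client = bigquery.Client()

    query = """
        SELECT region, COUNT(*) AS events
        FROM `my-project.analytics.web_logs`
        GROUP BY region
    """
    for row in client.query(query).result():
        print(row["region"], row["events"])

    # BigQuery ML models are trained with plain SQL, e.g.:
    #   CREATE MODEL `my-project.analytics.demand_model`
    #   OPTIONS (model_type = 'linear_reg', input_label_cols = ['events']) AS SELECT ...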

As you can see, there are a variety of products available, but none of them truly possesses all the capabilities of a data lakehouse.

ACID transaction capabilities range from limited to nonexistent: most of these engines work on immutable storage, so updating records is impossible. It is also difficult to achieve a unified real-time and batch data layer. Moreover, they require different storage solutions to meet all requirements, which deviates from the idea of a single data store serving as both data lake and data warehouse.

Delta Lake, Apache Iceberg and Apache Hudi

These platforms implement the data lakehouse through a metadata layer based on open table formats, built on top of data lake storage such as HDFS, Amazon S3, Azure Blob Storage, and Google Cloud Storage.

Delta Lake: Delta Lake was created by Databricks and later open-sourced. It is based on the popular Apache Parquet file format and uses transaction logs written as JSON files to support ACID transactions on the data lake. The Delta format is supported by many data processing frameworks, which can leverage it to bring data warehouse capabilities to existing data lake storage systems. Other key features include (see the sketch after this list):

  • Scalable metadata handling across billions of partitions and files
  • Time travel to old data for audit or rollback
  • Unified batch/streaming with transaction capabilities
  • Schema evolution/enforcement to prevent bad data
  • Audit history using transaction logs for a full audit trail
  • DML operations: SQL, Scala/Java, and Python APIs to merge, update, and delete datasets

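To make the ACID and time-travel claims concrete, here is a minimal PySpark sketch using the delta-spark package; the table path and data are hypothetical:

    from delta import DeltaTable, configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Local Spark session with the Delta Lake extensions enabled.
    builder = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write the initial table; this becomes version 0 in the transaction log.
    spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
        .write.format("delta").mode("overwrite").save("/tmp/users")

    # ACID upsert: MERGE atomically updates matching rows and inserts new ones.
    updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
    (DeltaTable.forPath(spark, "/tmp/users").alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Time travel: read the table exactly as it was before the merge.
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/users").show()
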
Apache Iceberg: Another open-source project, initially developed at Netflix and now an Apache top-level project. Apache Iceberg is an open table format designed for huge, petabyte-scale tables. The job of a table format is to determine how you manage, organize, and track all of the files that make up a table. Iceberg supports multiple file formats, including Apache Parquet, Apache Avro, and Apache ORC. Using a combination of metadata, manifest, and data files, it supports ACID transactions on data lake storage. Other capabilities include (see the sketch after this list):

  • Full schema evolution to track changes to a table over time
  • Time travel to query historical data and verify changes between updates
  • Partition layout and evolution, enabling updates to partition schemes as queries and data volumes change, without relying on hidden partitions or physical directories
  • Rollback to prior versions to quickly correct issues and return tables to a known good state
  • Advanced planning and filtering capabilities for high performance on large data volumes
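
A minimal sketch of Iceberg's table format in action through Spark SQL, assuming the iceberg-spark-runtime package and a local Hadoop catalog (all names and paths hypothetical):

    from pyspark.sql import SparkSession

    # Spark session with the Iceberg extensions and a local file-based catalog.
    spark = (
        SparkSession.builder.appName("iceberg-demo")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    # Partitioning lives in table metadata, not in physical directory names.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS local.db.events
        (id BIGINT, category STRING, ts TIMESTAMP)
        USING iceberg PARTITIONED BY (category)
    """)
    spark.sql("INSERT INTO local.db.events VALUES (1, 'click', current_timestamp())")

    # Partition evolution: new writes also partition by day(ts); old data is untouched.
    spark.sql("ALTER TABLE local.db.events ADD PARTITION FIELD days(ts)")

    # Every commit is a snapshot; any snapshot_id here can be used for time travel.
    spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()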


Apache Hudi: An open-source project initially developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) started as a streaming data lake platform on top of Apache Hadoop and has since evolved to support incremental batch jobs on any cloud storage platform. Hudi uses Apache Parquet and Apache Avro and enables ACID transaction capabilities. Other features include (see the sketch after this list):

  • Upserts and deletes with fast, pluggable indexing
  • Transactions, rollbacks, and concurrency control
  • Automatic file sizing, data clustering, compaction, and cleaning
  • Built-in metadata tracking for scalable storage access
  • Incremental queries and record-level change streams
  • Backwards-compatible schema evolution and enforcement
  • SQL reads/writes from data processing frameworks such as Spark, Presto, Trino, Hive, and more
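
A minimal sketch of a Hudi upsert with PySpark, assuming the hudi-spark bundle is on the classpath; the table path, keys, and data are hypothetical:

    from pyspark.sql import SparkSession

    # Hudi requires the Kryo serializer; the hudi-spark bundle must be on the classpath.
    spark = (
        SparkSession.builder.appName("hudi-demo")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [(1, "alice", "2023-05-11"), (2, "bob", "2023-05-11")],
        ["id", "name", "dt"],
    )

    hudi_options = {
        "hoodie.table.name": "users",
        "hoodie.datasource.write.recordkey.field": "id",      # key used to match upserts
        "hoodie.datasource.write.partitionpath.field": "dt",  # partition column
        "hoodie.datasource.write.precombine.field": "dt",     # picks the latest duplicate
        "hoodie.datasource.write.operation": "upsert",
    }

    # Re-writing a record with the same key updates it in place instead of appending.
    df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/users")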

 

Products for Enterprise

Now let’s look at some enterprise offerings

Delta Lakehouse by Databricks: Built on Delta Lake by Databricks, this platform runs on all the popular cloud providers. The entire platform is powered by Apache Spark, currently the most popular data processing engine on the market. By combining the capabilities of Delta Lake and Apache Spark, the Databricks lakehouse provides capabilities such as

  • Lightning-fast performance for data processing, with auto-scaling and indexing
  • Data science and machine learning workloads at scale using MLflow
  • Databricks SQL, a serverless SQL query execution engine built for ultra-fast BI and dashboarding requirements
  • Unity Catalog, a unified data catalog for all data in the lakehouse that also provides data governance and security capabilities
  • A built-in dashboarding platform to create reports and visualizations

Dremio: A product offering based on Apache Iceberg. Unlike Databricks, Dremio can run both in the cloud and on-premises. It does not provide built-in machine learning or data science capabilities, as its focus is mainly on data engineering, data warehousing, and fast query performance. With a high-performance SQL query engine and data transfer capabilities, Dremio solves the problem of combining a data lake and a data warehouse. The two services powering the Dremio platform are

Dremio Sonar — A lakehouse engine built for SQL

Dremio Arctic — A metadata and data management service for Apache Iceberg that provides a unique Git-like experience for the lakehouse

Final thoughts

I have done my best to capture the relevant details of each of these platforms. The ecosystem keeps growing, powered by ever more contributions to the open-source community. The beauty of these open-source platforms is that you can dig into all the internal details, and better still, you can check out the source code and start contributing. There is no limit to learning!

Manu Mukundan, Data Architect, NeST Digital




