A brief look at one of the latest architecture paradigms in the data analytics space
As the data volume generated by new digital-era platforms grows continuously and exponentially, and the insights derived from that data become ever more valuable and critical, a wave of architectural innovations and reforms is under way. In this article, we will take a close look at one of the most disruptive architectures in the data analytics space.
Let’s have a look at a traditional data analytics platform
The core components in the above architecture are data ingestion, a data lake, ETL and a data warehouse. A data lake and a data warehouse are similar and different at the same time. The similarity is that both are storage platforms for keeping data. The key difference is in what they store and how: a data warehouse holds cleansed, structured data modelled up front for fast SQL analytics, while a data lake holds raw, semi-structured and unstructured data cheaply at scale, applying structure only when the data is read.
The million-dollar question — Is there a way to combine data lake and data warehouse?
Naturally, someone would ask: "Can we have a platform that can serve as both a data lake and a data warehouse?" The answer is the data lakehouse.
The data lakehouse is a recently emerged architecture that combines the best of both the data lake and the data warehouse. In a nutshell, a data lakehouse enables:
- ACID compliance and full transactional update capabilities on data lake
- High performance query execution
- Data science and data engineering workloads
- Unified real-time and batch data processing
- Schema enforcement and data governance
As you can see in the diagram, the lakehouse storage acts as both data lake and data warehouse. This is a very simplified version of the architecture. We will now look at a few of the innovations and software products that made this architecture possible; they are quite interesting.
Some popular early players (they are still very popular)
The products below are based on the MPP (Massively Parallel Processing) architecture.
Presto: An open-source, distributed SQL query engine that runs on a cluster of machines. Originally developed at Facebook, Presto is now one of the most popular query engines available. It enables analytics using ANSI SQL on large amounts of data stored in a variety of systems such as HDFS, cloud storage and NoSQL stores, and delivers data warehouse-like query performance for BI and reporting tools.
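As a rough illustration, the sketch below shows how a BI-style query might be submitted to Presto from Python using the presto-python-client package; the coordinator host, catalog, schema and table names are placeholders, not part of the original article.

```python
# Hypothetical example: submit an ANSI SQL query to a Presto cluster from Python.
# Requires: pip install presto-python-client
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # assumed coordinator address
    port=8080,
    user="analyst",
    catalog="hive",    # e.g. a Hive catalog over data in HDFS or S3
    schema="sales",
)

cur = conn.cursor()
cur.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 10
""")
for region, total_sales in cur.fetchall():
    print(region, total_sales)
```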
Apache Drill: Similar to Presto, Drill is another open-source SQL query engine for big data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern big data applications, using ANSI SQL. Drill originated at MapR (since acquired by Hewlett Packard Enterprise) and is inspired by Google's Dremel.
Drill is an Apache top-level project, while Presto is developed under the Presto Foundation, part of the Linux Foundation.
Amazon Athena: An AWS-native, serverless service built on Presto. It provides the functionality of Presto with no infrastructure to install or manage; you pay only for the queries you run, based on the amount of data scanned. Although Athena is primarily meant to query data stored in an Amazon S3 data lake, federated queries along with source connectors can be used to query a wide variety of data sources.
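To make the pay-per-query model concrete, here is a hedged sketch of running an Athena query through the boto3 SDK; the database, table and result-bucket names are invented for illustration.

```python
# Hypothetical example: run an Athena query over an S3 data lake via boto3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query; Athena charges are based on the data scanned by this query.
resp = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "sales_lake"},                    # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # assumed result bucket
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```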
Amazon Redshift Spectrum: Another AWS product, built on Amazon Redshift, Amazon's popular data warehouse offering. Redshift Spectrum leverages the parallel processing capabilities of an already provisioned Redshift cluster to connect to and query data stored in Amazon S3.
Google BigQuery: Based on Google's Dremel, BigQuery is the data warehouse offering from Google Cloud. Unlike a typical data warehouse solution, BigQuery also supports machine learning workloads through BigQuery ML, and it can query external big data storage systems using federated queries.
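The sketch below illustrates both sides of that claim with the google-cloud-bigquery client library: a standard analytical query followed by a BigQuery ML training statement. The project, dataset and table names are assumptions made for the example.

```python
# Illustrative only: analytical query plus a BigQuery ML training statement.
# Requires: pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # assumed project id

# Standard SQL analytics over a (hypothetical) dataset.
for row in client.query(
    "SELECT region, SUM(amount) AS total FROM `sales.orders` GROUP BY region"
).result():
    print(row.region, row.total)

# BigQuery ML: train a simple regression model with plain SQL.
client.query("""
    CREATE OR REPLACE MODEL `sales.order_value_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['amount']) AS
    SELECT region, items, amount FROM `sales.orders`
""").result()
```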
As you can see, there is a wide variety of products available, but none of them truly possesses all the capabilities of a data lakehouse.
Their ACID transaction capabilities range from limited to non-existent: most of them work on immutable storage, so updating records in place is impossible. It is also difficult to achieve a unified real-time and batch data layer. Moreover, we still need different storage solutions to meet all our requirements, which deviates from the concept of a single data store serving as both data lake and data warehouse.
Delta Lake, Apache Iceberg and Apache Hudi
These platforms implement the data lakehouse using a metadata layer based on open table formats, built on top of data lake storage such as HDFS, AWS S3, Azure Blob Storage and Google Cloud Storage.
Delta Lake: Delta Lake was created by Databricks and later open-sourced. It stores data in the popular Apache Parquet file format and uses a transaction log written as JSON files to support ACID transactions on the data lake. The Delta format is supported by many data processing frameworks, which can leverage it to enable data warehouse capabilities on existing data lake storage (a minimal usage sketch follows the feature list below). Other key features include:
- Scalable metadata handling on billions of partitions and files with ease
- Time travel to old data for audit or rollback
- Unified batch/streaming with transaction capabilities
- Schema evolution/Enforcement — Prevent bad data
- Audit history using transaction logs for a full audit trail
- DML operations — SQL, Scala/Java and Python APIs to merge, update and delete datasets
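As referenced above, here is a minimal PySpark sketch of Delta Lake's transactional updates and time travel. The storage path, column names and data are illustrative, and the session is assumed to be set up with the delta-spark package.

```python
# Minimal Delta Lake sketch (assumes the delta-spark package/jars are available).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    # Extension and catalog settings required for Delta SQL support.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"  # hypothetical data lake location (could be s3a://...)

# The initial write creates the table plus its JSON transaction log.
spark.createDataFrame(
    [(1, "new"), (2, "new")], ["order_id", "status"]
).write.format("delta").mode("overwrite").save(path)

# ACID update in place, which a plain Parquet data lake cannot do.
table = DeltaTable.forPath(spark, path)
table.update(condition="order_id = 1", set={"status": "'shipped'"})

# Time travel: read the table as it was at an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```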
Apache Iceberg: Another open-source project, initially developed at Netflix and now an Apache top-level project. Apache Iceberg is an open table format designed for huge, petabyte-scale tables. The job of a table format is to determine how you manage, organize and track all of the files that make up a table. Iceberg supports multiple file formats such as Apache Parquet, Apache Avro and Apache ORC, and uses a combination of metadata, manifest and data files to support ACID transactions on data lake storage (a minimal usage sketch follows the feature list below). Other capabilities include:
- Full schema evolution to track changes to a table over time
- Time travel to query historical data and verify changes between updates
- Partition layout evolution, enabling the partition scheme to be updated as queries and data volumes change, without rewriting the table or relying on a physical directory structure
- Rollback to prior versions to quickly correct issues and return tables to a known good state
- Advanced planning and filtering capabilities for high performance on large data volumes
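Below is a hedged PySpark sketch of creating, updating and inspecting an Iceberg table through a Hadoop catalog; the catalog name, warehouse path and table schema are assumptions for illustration, and the iceberg-spark-runtime package is assumed to be on the Spark classpath.

```python
# Minimal Apache Iceberg sketch (assumes iceberg-spark-runtime is available).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a local Hadoop catalog named "demo" backed by a warehouse directory.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create and populate an Iceberg table; metadata and manifest files track every data file.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (order_id BIGINT, status STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 'new'), (2, 'new')")

# Row-level update backed by Iceberg's ACID guarantees.
spark.sql("UPDATE demo.db.orders SET status = 'shipped' WHERE order_id = 1")

# Time travel: list the snapshot history from the table's metadata tables.
# A specific snapshot can then be read back, e.g. via the 'snapshot-id' read option.
spark.sql("SELECT committed_at, snapshot_id, operation FROM demo.db.orders.snapshots").show()
```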
Apache Hudi: An open-source project initially developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) started as a streaming data lake platform on top of Apache Hadoop and has since evolved to support incremental batch jobs on any cloud storage platform. Hudi stores data using Apache Parquet and Apache Avro and enables ACID transaction capabilities (a minimal usage sketch follows the feature list below). Other features include:
- Upserts, Deletes with fast, pluggable indexing
- Transactions, Rollbacks, Concurrency Control
- Automatic file sizing, data clustering, compactions, cleaning
- Built-in metadata tracking for scalable storage access
- Incremental queries, Record level change streams
- Backwards compatible schema evolution and enforcement
- SQL Read/Writes from data processing frameworks such as Spark, Presto, Trino, Hive & more
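The following PySpark sketch shows a basic Hudi upsert with made-up table, key and path names; it assumes the hudi-spark bundle is available to the Spark session.

```python
# Minimal Apache Hudi upsert sketch (assumes the hudi-spark bundle is on the classpath).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

path = "/tmp/hudi/orders"  # hypothetical lake location (could be s3a://..., gs://...)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # record key used for upserts
    "hoodie.datasource.write.partitionpath.field": "region",    # partitioning column
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest value wins on conflict
    "hoodie.datasource.write.operation": "upsert",
}

# The first write creates the Hudi table.
df = spark.createDataFrame(
    [(1, "new", "us", "2024-01-01"), (2, "new", "eu", "2024-01-01")],
    ["order_id", "status", "region", "updated_at"],
)
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)

# Upsert: order 1 is updated in place, order 3 is inserted.
updates = spark.createDataFrame(
    [(1, "shipped", "us", "2024-01-02"), (3, "new", "eu", "2024-01-02")],
    ["order_id", "status", "region", "updated_at"],
)
updates.write.format("hudi").options(**hudi_options).mode("append").save(path)

spark.read.format("hudi").load(path).select("order_id", "status", "region").show()
```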
Products for Enterprise
Now let’s look at some enterprise offerings
Delta Lakehouse by Databricks: Based on Delta Lake and developed by Databricks, this platform runs on the popular cloud providers. The entire platform is powered by Apache Spark, the most popular data processing engine in the market today. By combining the capabilities of Delta Lake and Apache Spark, the Databricks lakehouse provides capabilities such as:
- Lightning-fast performance for data processing with auto-scaling and indexing
- Data science and machine learning workloads at scale using MLflow
- Databricks SQL, a serverless SQL query execution engine built exclusively for ultra-fast BI and dashboarding requirements
- Unity Catalog — a unified data catalog for all data in the lakehouse, which also provides data governance and security capabilities
- Built-in dashboarding platform to create reports and visualizations
Dremio: A product offering based on Apache Iceberg. Unlike Databricks, Dremio can run both in the cloud and on-premises. Also unlike the Databricks platform, Dremio does not provide built-in machine learning or data science capabilities; its focus is mainly on data engineering, data warehousing and fast query performance. With a high-performance SQL query engine and data transfer capabilities, Dremio solves the problem of combining a data lake and a data warehouse. The two services powering the Dremio platform are:
- Dremio Sonar — a lakehouse engine built for SQL
- Dremio Arctic — a metadata and data management service for Apache Iceberg that provides a unique Git-like experience for the lakehouse
Final thoughts
I have done my best to capture the relevant details of each of these platforms. The ecosystem keeps growing, powered by more and more contributions to the open-source community. The beauty of these open-source platforms is that you can dig into all the details of their internals and, even better, check out the source code and start contributing. There is no limit to learning!
Manu Mukundan, Data Architect, NeST Digital