Lakehouse — New kid in data town

May 11, 2023

A brief look at one of the latest architecture paradigms in the data analytics space

As the data volume generated by new digital-era platforms grows continuously and exponentially, and the insights derived from that data become ever more valuable and critical, a wave of architectural innovations and reforms is underway. In this article, we will take a close look at one of the most disruptive architectures in the data analytics space.

Let’s have a look at a traditional data analytics platform.

[Figure: a traditional data analytics platform, with data sources flowing through data ingestion into a data lake, then via ETL into a data warehouse]

The core components in the above architecture are data ingestion, the data lake, ETL and the data warehouse. The data lake and the data warehouse are similar and different at the same time. The similarity is that both are storage platforms for keeping data. The differences, broadly: a data lake holds raw data in any format (structured, semi-structured or unstructured), applies schema on read and is optimized for cheap, scalable storage, while a data warehouse holds curated, structured data, enforces schema on write and is optimized for fast SQL analytics and BI workloads.

The million-dollar question — Is there a way to combine the data lake and the data warehouse?

Obviously, someone would ask: "Can we have a platform that serves as both a data lake and a data warehouse?" The answer is the data lakehouse.

The data lakehouse is a new architecture that emerged recently and combines the best of both the data lake and the data warehouse. In a nutshell, a data lakehouse enables:

  • ACID compliance and full transactional update capabilities on data lake
  • High performance query execution
  • Data science and data engineering workloads
  • Unified real-time and batch data processing
  • Schema enforcement and data governance

[Figure: a simplified lakehouse architecture, with a single lakehouse storage layer serving as both the data lake and the data warehouse]

As you can see in the diagram, the lakehouse storage acts as both the data lake and the data warehouse. This is a very simplified version of the architecture. We will now look at a few of the innovations and software products that made this architecture possible. They are quite interesting.

Some popular early players (they are still very popular)

The products below are based on the MPP (Massively Parallel Processing) architecture.

Presto: An open-source, distributed SQL query engine that runs on a cluster of machines. Originating at Facebook, Presto is now one of the most popular query engines out there. It enables analytics using ANSI SQL on large amounts of data stored in a variety of systems such as HDFS, cloud storage, NoSQL stores etc. Presto brings data-warehouse-like query performance to BI and reporting tools.
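
As a quick illustration, here is a minimal sketch of querying Presto from Python with the presto-python-client package; the host, catalog, schema and table names are hypothetical placeholders.

```python
# pip install presto-python-client
import prestodb

# Connect to a (hypothetical) Presto coordinator
conn = prestodb.dbapi.connect(
    host="presto.example.com",  # placeholder host
    port=8080,
    user="analyst",
    catalog="hive",             # data registered in a Hive metastore
    schema="sales",
)

cur = conn.cursor()
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)
```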

Apache Drill: Similar to Presto, Drill is another open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured, rapidly evolving data coming from modern Big Data applications, using ANSI SQL. Drill originated at MapR (since acquired by Hewlett Packard Enterprise) and is based on Google's Dremel.

Drill is an Apache top-level project, while Presto is governed by the Presto Foundation under the Linux Foundation.

Amazon Athena: This is an AWS-native, managed version of Presto. It provides all the functionality of Presto, but there is no infrastructure to install or manage; you pay only for the queries you run and the data they scan. Even though Athena is primarily meant to query data stored in the AWS S3 data lake, federated queries along with source connectors can be used to query a large variety of data sources.
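
A minimal boto3 sketch of the Athena flow described above: start a query, poll for completion, fetch results. The database, table and S3 output location are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

# Start the query; Athena writes results to the given S3 location
resp = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "sales_db"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
qid = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```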

Amazon Redshift Spectrum: Another AWS offering, built on Amazon Redshift, Amazon's popular data warehouse product. Redshift Spectrum leverages the parallel processing capabilities of an already provisioned Redshift cluster to connect to and query data stored in Amazon S3.
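
A sketch of the Spectrum pattern, assuming a hypothetical Glue Data Catalog database and IAM role: register an external schema over S3 data once, then query it from the cluster like any other schema (shown here via psycopg2).

```python
import psycopg2

# Connect to an existing (hypothetical) Redshift cluster
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",  # placeholder
)
conn.autocommit = True  # DDL such as CREATE EXTERNAL SCHEMA needs a commit
cur = conn.cursor()

# One-time setup: expose an S3-backed schema via the Glue Data Catalog
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'sales_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
""")

# Query S3-resident data exactly like a local Redshift table
cur.execute("SELECT region, COUNT(*) FROM spectrum.orders GROUP BY region")
print(cur.fetchall())
```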

Google BigQuery: Based on Google's Dremel, BigQuery is a data warehouse offering from Google Cloud. Unlike a typical data warehouse solution, BigQuery supports machine learning workloads through BigQuery ML. It can also connect to external big data storage systems and query them using federated queries.
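
A minimal sketch using the google-cloud-bigquery client library; the project, dataset and table names are hypothetical.

```python
from google.cloud import bigquery

# Assumes application default credentials; the project is a placeholder
client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT region, SUM(amount) AS total
    FROM `my-analytics-project.sales.orders`
    GROUP BY region
"""

# query() returns a job; result() blocks until it completes
for row in client.query(sql).result():
    print(row["region"], row["total"])
```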

As you can see, there are a variety of products available, but none of them truly possesses all the capabilities of a data lakehouse.

ACID transaction support ranges from limited to nonexistent: most of these engines work on immutable storage, so updating records in place is impossible. It is also difficult to achieve a unified real-time and batch data layer. Moreover, we end up needing different storage solutions to meet all our requirements, which deviates from the concept of a single data store serving as both data lake and data warehouse.

Delta Lake, Apache Iceberg and Apache Hudi

These platforms implement the data lakehouse using a metadata layer, based on open table formats, on top of data lake storage solutions such as HDFS, AWS S3, Azure Blob Storage, Google Cloud Storage etc.

Delta Lake: Delta Lake was created by Databricks and then open-sourced. It is based on the popular Apache Parquet file format and uses transaction logs written as JSON files to support ACID transactions on the data lake. The Delta format is supported by many data processing frameworks, which can leverage it to enable data warehouse capabilities on existing data lake storage systems (see the PySpark sketch after this list). Other key features include:

  • Scalable metadata handling across billions of partitions and files
  • Time travel to old data for audit or rollback
  • Unified batch/streaming with transaction capabilities
  • Schema evolution/enforcement — prevent bad data
  • Audit history using transaction logs for a full audit trail
  • DML operations — SQL, Scala/Java and Python APIs to merge, update and delete datasets
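
As referenced above, here is a minimal PySpark sketch of Delta Lake's ACID upsert and time travel; the path, schema and data are hypothetical, and the session config assumes the delta-spark package is installed.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes the delta-spark package (and matching JARs) is available
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/orders"  # hypothetical table location

# Seed a Delta table on plain file storage
seed = spark.createDataFrame([(1, "created"), (2, "created")], ["order_id", "status"])
seed.write.format("delta").mode("overwrite").save(path)

# ACID upsert (MERGE) directly on the data lake
updates = spark.createDataFrame([(2, "shipped"), (3, "created")], ["order_id", "status"])
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as it was at version 0
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```
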
Apache Iceberg: Another open-source project, initially developed at Netflix and currently an Apache top-level project. Apache Iceberg is an open table format designed for huge, petabyte-scale tables. The job of a table format is to determine how you manage, organize and track all of the files that make up a table. Iceberg supports multiple file formats such as Apache Parquet, Apache Avro and Apache ORC. Using a combination of metadata, manifest and data files, it supports ACID transactions on data lake storage (see the PySpark sketch after this list). Other capabilities include:

  • Full schema evolution to track changes to a table over time
  • Time travel to query historical data and verify changes between updates
  • Partition layout evolution, enabling updates to partition schemes as queries and data volumes change, without relying on hidden partitions or physical directories
  • Rollback to prior versions to quickly correct issues and return tables to a known good state
  • Advanced planning and filtering capabilities for high performance on large data volumes

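As referenced above, a minimal PySpark sketch of Iceberg's table format in action: partition transforms, schema evolution and snapshot inspection. The catalog name, warehouse path and table are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime JAR matching your Spark version is available
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create a table partitioned by a transform; Iceberg tracks partitions in
# metadata files, not physical directories
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING
    ) USING iceberg PARTITIONED BY (days(ts))
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")

# List snapshots, then time-travel to one of them by snapshot id
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
# spark.sql("SELECT * FROM demo.db.events VERSION AS OF <snapshot_id>").show()
```
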
Apache Hudi: An open-source product initially developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) started as a streaming data lake platform on top of Apache Hadoop and has since evolved to support incremental batch jobs on any cloud storage platform. Hudi uses Apache Parquet and Apache Avro and enables ACID transaction capabilities (see the PySpark sketch after this list). Other features include:

  • Upserts and deletes with fast, pluggable indexing
  • Transactions, rollbacks and concurrency control
  • Automatic file sizing, data clustering, compaction and cleaning
  • Built-in metadata tracking for scalable storage access
  • Incremental queries and record-level change streams
  • Backwards-compatible schema evolution and enforcement
  • SQL reads/writes from data processing frameworks such as Spark, Presto, Trino, Hive and more
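
As referenced above, a minimal PySpark sketch of a Hudi upsert; the table name, fields and storage path are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the hudi-spark bundle matching your Spark version is available
spark = (
    SparkSession.builder.appName("hudi-demo")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [(1, "alice", 1700000000), (2, "bob", 1700000001)],
    ["id", "name", "ts"],
)

hudi_options = {
    "hoodie.table.name": "users",                      # hypothetical table
    "hoodie.datasource.write.recordkey.field": "id",   # key used to match records
    "hoodie.datasource.write.precombine.field": "ts",  # latest ts wins on conflicts
    "hoodie.datasource.write.operation": "upsert",     # update-in-place semantics
}

# With operation=upsert, re-writing the same keys updates records in place
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/users")

# Snapshot read of the table's latest state
spark.read.format("hudi").load("/tmp/hudi/users").show()
```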

Products for Enterprise

Now let’s look at some enterprise offerings

Delta Lakehouse by Databricks: Based on Delta Lake, developed by Databricks, and able to run on all the popular cloud providers. The entire platform is powered by Apache Spark, the most popular data processing platform in the market today. By combining the capabilities of Delta Lake and Apache Spark, the Delta lakehouse provides capabilities such as the following (see the MLflow sketch after this list):

  • Lightning-fast performance for data processing with auto-scaling and indexing
  • Data science and machine learning workloads at scale using MLflow
  • Databricks SQL, a serverless SQL query execution engine built for ultra-fast BI and dashboarding requirements
  • Unity Catalog — a unified data catalog for all data in the Delta lakehouse, which also provides data governance and security capabilities
  • Built-in dashboarding platform to create reports and visualizations
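
As referenced above, a minimal, hedged MLflow tracking sketch; the experiment name, parameters and metric values are hypothetical.

```python
import mlflow

# Hypothetical experiment; on Databricks this would be a workspace path
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    # Record hyperparameters and evaluation metrics for this training run
    mlflow.log_param("max_depth", 8)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.91)
```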

Dremio: A product offering based on Apache Iceberg. Unlike Databricks, Dremio can run both in the cloud and on-premise. Dremio does not provide built-in machine learning or data science capabilities; its focus is mainly on data engineering, data warehousing and fast query performance. With a high-performance SQL query engine and data transfer capabilities, Dremio solves the problem of combining a data lake and a data warehouse. The two services powering the Dremio platform are:

  • Dremio Sonar — a lakehouse engine built for SQL
  • Dremio Arctic — a metadata and data management service for Apache Iceberg that provides a unique Git-like experience for the lakehouse

Final thoughts

I have done my best to capture the relevant details of each of these platforms. The ecosystem keeps growing, powered by more and more contributions from the open-source community. The beauty of these open-source platforms is that you can get all the details of their internals; better still, you can check out the source code and start contributing. There is no limit to learning!

Manu Mukundan, Data Architect, NeST Digital

