

Data Lakehouse. 5 Whys.

As the volume of data generated by new digital-era platforms grows continuously and exponentially, and the insights derived from that data become more valuable and critical, a number of architectural innovations and reforms are taking place. Whether it is IoT sensors generating critical events about device health and operational metrics, or the backend of an online advertising platform where millions of events are produced continuously, there is no shortage of tools and platforms for architecting a solid data lake that can act as a single source of truth for all data. In this article, we will take a close look at one of the most disruptive architectures gaining enormous popularity in the data domain.

Introduction

Traditionally, a data lake and a data warehouse are considered two separate entities addressing two specific use cases. A data lake is meant to address the “single source of truth” use case, where organizations store large volumes of raw data to be consumed by different types of data consumers. A data warehouse, on the other hand, addresses use cases where high-performance aggregated reports are needed to answer specific business and operational questions. As technology evolved, with many open-source contributions in the data processing and storage landscape, the data lakehouse became a reality, providing the best of both worlds.

(Figure: History of analytics platforms)

(Figure: Data lake and data warehouse viewed separately)

If we look at a typical, traditional architecture where the data lake and the data warehouse work in synergy, it looks something like the diagram below.

As you can see, this architecture is relatively complex. For starters, raw and processed data are brought into the data lake for batch processing scenarios (relational and file-based data sources), and ETL processing is required to load the subset of data lake contents that provides actionable insights into the data warehouse. The data warehouse is an isolated system accessed by end users to get critical insights from the data.

For a real-time data processing scenario, where events from a streaming data source such as an IoT event bus are collected, another pipeline needs to be set up to ingest, process, and load events into the data warehouse to deliver real-time insights. There are a number of challenges with this approach.

Complex data architecture: The architecture is based on the popular lambda architecture, where real-time loads and bulk loads of data are handled by two different pipelines, usually built on two different technology stacks. There is no common data pipeline that can handle both real-time and batch data processing. In addition, once data is moved to the data warehouse, different consumers may need to modify it further to create their own data marts and BI cubes, which in turn requires creating and maintaining additional ETL pipelines.

Data duplication: It is obvious from the design that data duplication is inevitable. The same data that already sits in the data lake has to be copied into the data warehouse for reporting. In addition, different reporting requirements push different copies of the same data from the data lake into the warehousing layer, and the duplication keeps growing.

Operational overhead: Since the technology stack is fairly complex, the operational cost of keeping it running is high. Monitoring and alerting for ETL pipelines, data lake processing frameworks, data warehouse storage, and so on need to be well maintained for operational excellence.

Vendor lock-in: The data warehouse is typically sourced from a proprietary vendor, which creates a tight dependency on that vendor: data formats and the underlying technology are proprietary. If the organization has multiple data warehouses from different vendors, combining data across them becomes a nightmare and usually requires copying the data back into the data lake.

With the growth of multiple cloud platforms, the availability of storage and compute resources is practically unlimited. For traditional data lakes built on distributed storage systems such as AWS S3, Azure Data Lake Storage, or HDFS, the main use cases are data processing and machine learning workloads. The data lakehouse, powered by an open table format, brings data warehousing capabilities to the data lake: data stored on cheap storage in open table formats works hand in hand with highly performant, decoupled processing frameworks, making data warehouse capabilities on a data lake a reality.

LAKEHOUSE TOP 5 WHYS

To explain the top 5 whys of the data lakehouse, we are going to look at one of the popular lakehouse implementations, Delta Lake. There are other implementations, such as Apache Iceberg and Apache Hudi, but we will focus on Delta Lake. The code snippets presented below are all based on Apache Spark using the Scala programming language.

Transactions and ACID compliance on immutable data lake storage

For a typical data lake, there are many consumers reading and writing data, and it is impossible to get transactional updates with the ACID guarantees that an RDBMS-based data warehousing system can provide. Delta Lake brings in this capability by keeping a transaction log of all commits made to a table. Users can perform multiple concurrent transactions on their data, and in the event of an error with a data source or a stream, Delta Lake cancels the transaction to ensure that the data is kept clean and intact.

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As the user makes changes to the table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order, so the next commit is written out as 000001.json, the following as 000002.json, and so on.
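To make this concrete, here is a minimal sketch in Spark/Scala, assuming Delta Lake is on the classpath and using an illustrative table path and schema (neither comes from the original snippets): the first save creates the table and its first commit file under _delta_log, and each later write lands as a new, atomic commit.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: writes to a Delta table produce ordered commits in _delta_log.
// The path /tmp/delta/events and the columns are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("delta-acid-demo")
  // Register the Delta Lake extensions on the session.
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

import spark.implicits._

// First write creates the table and the first commit JSON in <path>/_delta_log.
Seq((1, "sensor-a", 21.5), (2, "sensor-b", 22.1))
  .toDF("event_id", "device", "temperature")
  .write.format("delta").save("/tmp/delta/events")

// Each subsequent commit (this append becomes the next JSON file) is atomic:
// readers either see the whole commit or none of it.
Seq((3, "sensor-c", 19.8))
  .toDF("event_id", "device", "temperature")
  .write.format("delta").mode("append").save("/tmp/delta/events")
```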

 

Time travel to access earlier versions of data for auditing

In a traditional data lake system with immutable storage, once the data is overwritten, it is impossible to go back and check the history. This time travel feature is critical for use cases such as auditing data changes for compliance, reproducing experiments (especially in the data science domain, where data is constantly being changed by pipelines), and rolling back to recover from corrupt data writes. Take the example below, where an existing folder containing Parquet data is overwritten using Apache Spark.
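This is a minimal sketch of such an overwrite, reusing the SparkSession from the earlier snippet; the folder /data/raw/events is a hypothetical location that already holds Parquet files.

```scala
import spark.implicits._  // SparkSession "spark" from the earlier sketch

// Hypothetical replacement data; the schema mirrors the earlier example.
val updates = Seq((1, "sensor-a", 23.0), (2, "sensor-b", 24.2))
  .toDF("event_id", "device", "temperature")

// mode("overwrite") replaces the folder contents in place:
// the previous Parquet files are deleted and cannot be recovered from the lake itself.
updates.write
  .mode("overwrite")
  .parquet("/data/raw/events")
```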

Once this operation is done, the target folder contents are overwritten and we cannot go back and check what data was there earlier. If the data lake is backed by an object storage system such as AWS S3, we can recover the old data by enabling S3 versioning, but this approach requires additional work.

Delta Lake solves this challenge with time travel, again implemented using the transaction logs. Every operation done on a Delta table, be it an insert, an update, or an overwrite, is versioned and recorded in the transaction log. Delta Lake provides the option to read different versions of the data using either a timestamp or a version number.
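Both options can be exercised as in the hedged sketch below, which targets the hypothetical /tmp/delta/events table from earlier; the timestamp is an arbitrary illustration.

```scala
import io.delta.tables.DeltaTable

// Read an earlier snapshot of the table by version number...
val firstVersion = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")

// ...or by timestamp (any point after the first commit works).
val asOfDate = spark.read.format("delta")
  .option("timestampAsOf", "2023-06-13 00:00:00")
  .load("/tmp/delta/events")

// The commit history behind these versions can also be inspected directly.
DeltaTable.forPath(spark, "/tmp/delta/events").history().show(false)
```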

 

Openness for data format

Delta Lake is powered by the open table format Delta. It has two components, as described earlier: the metadata part is the Delta log in JSON format, and the data files hold the actual data in Parquet format. The Delta format is open, and a wide variety of data processing and analytical frameworks support Delta tables; some examples are Apache Spark, Apache Flink, and Presto.
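As one illustration of that openness (the table name events and the location are assumptions), the same Delta files can be exposed to Spark SQL as a regular table, while engines such as Flink or Presto read the same log and Parquet files through their own connectors.

```scala
// Register the existing Delta location as a SQL table and query it;
// no copy or proprietary import step is involved.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/delta/events'")

spark.sql(
  """SELECT device, avg(temperature) AS avg_temperature
    |FROM events
    |GROUP BY device""".stripMargin
).show()
```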

Single batch and real-time pipeline

Since transactional capabilities are brought into the data lake by the lakehouse, real-time streaming events and bulk data loads can target the same Delta table. Unlike the lambda architecture, where separate pipelines and storage are required for the real-time and bulk load scenarios, the data lakehouse removes unnecessary complexity, making the development and operation of data pipelines seamless and simple.
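A sketch of what that single target can look like, assuming a hypothetical Kafka topic device-events and the event schema below; both the streaming query and the batch load append to the same Delta path.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed event schema for illustration.
val eventSchema = new StructType()
  .add("event_id", IntegerType)
  .add("device", StringType)
  .add("temperature", DoubleType)

// Real-time path: continuously append parsed events from a stream.
val streamingQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // assumed broker address
  .option("subscribe", "device-events")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/stream")
  .outputMode("append")
  .start("/tmp/delta/events")

// Batch path: a periodic bulk load appends to the very same table.
spark.read.schema(eventSchema).json("/data/raw/backfill/")
  .write.format("delta").mode("append").save("/tmp/delta/events")
```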

 

Schema enforcement and evolution

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch. Schema evolution, in contrast, is a feature that allows users to easily change a table's current schema to accommodate data that changes over time.

For a given write, schema enforcement and schema evolution are mutually exclusive. It is always better to enforce a schema so that the agreement with downstream applications is respected, but there are use cases where schema changes need to be accommodated rather than rejected. Adding a column is a common scenario: a column that is not present in the target Delta table is automatically added at the end of the schema as part of the write transaction. Other schema evolution cases, such as deleting or renaming a column, may require overwriting all the data files to match the new schema.
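A short sketch of both behaviors on the hypothetical events table, using an extra humidity column that the table does not yet have.

```scala
import spark.implicits._

// A batch that carries a column the target table does not have yet.
val withNewColumn = Seq((4, "sensor-d", 20.4, 55.0))
  .toDF("event_id", "device", "temperature", "humidity")

// Schema enforcement: this append is rejected with an AnalysisException
// because "humidity" is not part of the table schema (uncomment to see it fail).
// withNewColumn.write.format("delta").mode("append").save("/tmp/delta/events")

// Schema evolution: opting in with mergeSchema accepts the write and adds
// "humidity" at the end of the schema; existing rows get null for it.
withNewColumn.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/tmp/delta/events")
```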

CONCLUSION

The data lakehouse is a powerful architecture pattern where different types of data analytics workloads can be served by a much simpler data architecture. Data engineers can implement complex data engineering pipelines in frameworks such as Apache Spark and write to the data lakehouse. Data analysts can perform ad hoc analytics on the lakehouse. Data scientists can consume data from the lakehouse to train and develop robust data science models. BI engineers can directly query data in the lakehouse for dashboards and reports without needing to copy it to separate data warehouse platforms. With the open table format concept and support from a wide variety of top-notch data processing frameworks, the data lakehouse is truly the next-generation architecture paradigm for the enterprise data repository.

 



