

Data Lakehouse. 5 Whys.

As the volume of data generated by new digital-era platforms grows continuously and exponentially, and the insights derived from that data become more valuable and critical, a number of architectural innovations and reforms are taking place. Whether it is IoT sensors generating critical events about device health and operational metrics, or the backend of an online advertising platform where millions of events are produced continuously, there is no shortage of tools and platforms for architecting a solid data lake that can act as a single source of truth for all data. In this article, we will take a close look at one of the most disruptive architectures gaining enormous popularity in the data domain.

Introduction

Traditionally, a data lake and a data warehouse are considered two separate entities addressing two specific use cases. A data lake is meant to address the “single source of truth” use case, where organizations store large volumes of raw data to be consumed by different types of data consumers. A data warehouse, on the other hand, addresses use cases where high-performance aggregated reports are needed to answer specific business and operational questions. As technology evolved, with many open-source contributions in the data processing and storage landscape, the data lakehouse became a reality, providing the best of both worlds.

(Figure: History of analytics platforms)

(Figure: Data lake and data warehouse viewed separately)

If we look at a typical, traditional architecture where the data lake and the data warehouse work in synergy, it looks something like the diagram below.

As you can see, this architecture is relatively complex. For starters, raw and processed data are brought into the data lake for batch processing scenarios (relational and file-based data sources), and ETL processing is required to load the subset of data lake contents that provides actionable insights into the data warehouse. The data warehouse is an isolated system accessed by end users to get critical insights from the data.

For a real-time data processing scenario, where events from a streaming data source such as an IoT event bus are collected, another pipeline needs to be set up to ingest, process, and load events into the data warehouse to deliver real-time insights. There are a number of challenges with this approach.

Complex data architecture: The architecture is based on the popular lambda architecture, where real-time loads and bulk loads of data are handled by two different pipelines, usually built on two different technology stacks. There is no common data pipeline that can handle both real-time and batch data processing. In addition, once data is moved to the data warehouse, different consumers may need to modify it further to create their own data marts and BI cubes, which in turn requires creating and maintaining additional ETL pipelines.

Data duplication: It is obvious from the design that data duplication is inevitable. The same data that already sits in the data lake has to be copied into the data warehouse for reporting. In addition, different reporting requirements push different copies of the same data from the data lake into the warehousing layer, and the duplication keeps growing.

Operational overhead: Since the technology stack is fairly complex, the operational cost of keeping it running is high. Monitoring and alerting for ETL pipelines, data lake processing frameworks, data warehouse storage, and so on need to be well maintained for operational excellence.

Vendor lock-in: The data warehouse is typically sourced from a proprietary vendor, which creates a tight dependency on that vendor: data formats and the underlying technology are proprietary. If the organization has multiple data warehouses from different vendors, combining data across them becomes a nightmare and usually requires copying the data back into the data lake.

With the growth of multiple cloud platforms, the availability of storage and compute resources is practically unlimited. For traditional data lakes built on distributed storage systems such as AWS S3, Azure Data Lake Storage, or HDFS, the main use cases are data processing and machine learning workloads. The data lakehouse, powered by an open table format, brings data warehousing capabilities to the data lake: data stored on cheap storage in open table formats works hand in hand with highly performant, decoupled processing frameworks, making data warehouse capabilities on a data lake a reality.

LAKEHOUSE TOP 5 WHYS

To explain the top 5 whys of the data lakehouse, we are going to look at one of the popular lakehouse implementations, Delta Lake. There are other implementations, such as Apache Iceberg and Apache Hudi, but we will focus on Delta Lake. The code snippets presented below are all based on Apache Spark using the Scala programming language.

Transactions and ACID compliance on immutable data lake storage

For a typical data lake, there are many consumers reading and writing data, and it is impossible to get transactional updates with the ACID guarantees that an RDBMS-based data warehousing system can provide. Delta Lake brings in this capability by keeping a transaction log of all commits made to a table. Users can perform multiple concurrent transactions on their data, and in the event of an error with a data source or a stream, Delta Lake cancels the transaction to ensure that the data is kept clean and intact.

When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As the user makes changes to the table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit is written out as a JSON file, starting with 000000.json. Additional changes to the table generate subsequent JSON files in ascending numerical order, so the next commit is written out as 000001.json, the following as 000002.json, and so on.
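To make this concrete, here is a minimal sketch in Spark/Scala, assuming Delta Lake is on the classpath and using an illustrative table path and schema (neither comes from the original snippets): the first save creates the table and its first commit file under _delta_log, and each later write lands as a new, atomic commit.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: writes to a Delta table produce ordered commits in _delta_log.
// The path /tmp/delta/events and the columns are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("delta-acid-demo")
  // Register the Delta Lake extensions on the session.
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

import spark.implicits._

// First write creates the table and the first commit JSON in <path>/_delta_log.
Seq((1, "sensor-a", 21.5), (2, "sensor-b", 22.1))
  .toDF("event_id", "device", "temperature")
  .write.format("delta").save("/tmp/delta/events")

// Each subsequent commit (this append becomes the next JSON file) is atomic:
// readers either see the whole commit or none of it.
Seq((3, "sensor-c", 19.8))
  .toDF("event_id", "device", "temperature")
  .write.format("delta").mode("append").save("/tmp/delta/events")
```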

 

Time travel to access earlier versions of data for auditing

In a traditional data lake system with immutable storage, once the data is overwritten, it is impossible to go back and check the history. This time travel feature is critical for use cases such as auditing data changes for compliance, reproducing experiments (especially in the data science domain, where data is constantly being changed by pipelines), and rolling back to recover from corrupt data writes. Take the example below, where an existing folder containing Parquet data is overwritten using Apache Spark.
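This is a minimal sketch of such an overwrite, reusing the SparkSession from the earlier snippet; the folder /data/raw/events is a hypothetical location that already holds Parquet files.

```scala
import spark.implicits._  // SparkSession "spark" from the earlier sketch

// Hypothetical replacement data; the schema mirrors the earlier example.
val updates = Seq((1, "sensor-a", 23.0), (2, "sensor-b", 24.2))
  .toDF("event_id", "device", "temperature")

// mode("overwrite") replaces the folder contents in place:
// the previous Parquet files are deleted and cannot be recovered from the lake itself.
updates.write
  .mode("overwrite")
  .parquet("/data/raw/events")
```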

Once this operation is done, the target folder contents are overwritten and we cannot go back and check what data was there earlier. If the data lake is backed by an object storage system such as AWS S3, we can recover the old data by enabling S3 versioning, but this approach requires additional work.

Delta Lake solves this challenge with time travel, again implemented using the transaction logs. Every operation done on a Delta table, be it an insert, an update, or an overwrite, is versioned and recorded in the transaction log. Delta Lake provides the option to read different versions of the data using either a timestamp or a version number.
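Both options can be exercised as in the hedged sketch below, which targets the hypothetical /tmp/delta/events table from earlier; the timestamp is an arbitrary illustration.

```scala
import io.delta.tables.DeltaTable

// Read an earlier snapshot of the table by version number...
val firstVersion = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")

// ...or by timestamp (any point after the first commit works).
val asOfDate = spark.read.format("delta")
  .option("timestampAsOf", "2023-06-13 00:00:00")
  .load("/tmp/delta/events")

// The commit history behind these versions can also be inspected directly.
DeltaTable.forPath(spark, "/tmp/delta/events").history().show(false)
```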

 

Openness for data format

Delta Lake is powered by the open table format Delta. It has two components, as described earlier: the metadata part is the Delta log in JSON format, and the data files hold the actual data in Parquet format. The Delta format is open, and a wide variety of data processing and analytical frameworks support Delta tables; some examples are Apache Spark, Apache Flink, and Presto.
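As one illustration of that openness (the table name events and the location are assumptions), the same Delta files can be exposed to Spark SQL as a regular table, while engines such as Flink or Presto read the same log and Parquet files through their own connectors.

```scala
// Register the existing Delta location as a SQL table and query it;
// no copy or proprietary import step is involved.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/delta/events'")

spark.sql(
  """SELECT device, avg(temperature) AS avg_temperature
    |FROM events
    |GROUP BY device""".stripMargin
).show()
```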

Single batch and real-time pipeline

Since transactional capabilities are brought into the data lake by the lakehouse, real-time streaming events and bulk data loads can target the same Delta table. Unlike the lambda architecture, where separate pipelines and storage are required for the real-time and bulk load scenarios, the data lakehouse removes unnecessary complexity, making the development and operation of data pipelines seamless and simple.
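A sketch of what that single target can look like, assuming a hypothetical Kafka topic device-events and the event schema below; both the streaming query and the batch load append to the same Delta path.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed event schema for illustration.
val eventSchema = new StructType()
  .add("event_id", IntegerType)
  .add("device", StringType)
  .add("temperature", DoubleType)

// Real-time path: continuously append parsed events from a stream.
val streamingQuery = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // assumed broker address
  .option("subscribe", "device-events")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("event"))
  .select("event.*")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/events/_checkpoints/stream")
  .outputMode("append")
  .start("/tmp/delta/events")

// Batch path: a periodic bulk load appends to the very same table.
spark.read.schema(eventSchema).json("/data/raw/backfill/")
  .write.format("delta").mode("append").save("/tmp/delta/events")
```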

 

Schema enforcement and evolution

Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Delta Lake uses schema validation on write, which means that all new writes to a table are checked for compatibility with the target table's schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written) and raises an exception to let the user know about the mismatch. Schema evolution, in contrast, is a feature that allows users to easily change a table's current schema to accommodate data that changes over time.

For a given write, schema enforcement and schema evolution are mutually exclusive. It is always better to enforce a schema so that the agreement with downstream applications is respected, but there are use cases where schema changes need to be accommodated rather than rejected. Adding a column is a common scenario: a column that is not present in the target Delta table is automatically added at the end of the schema as part of the write transaction. Other schema evolution cases, such as deleting or renaming a column, may require overwriting all the data files to match the new schema.
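A short sketch of both behaviors on the hypothetical events table, using an extra humidity column that the table does not yet have.

```scala
import spark.implicits._

// A batch that carries a column the target table does not have yet.
val withNewColumn = Seq((4, "sensor-d", 20.4, 55.0))
  .toDF("event_id", "device", "temperature", "humidity")

// Schema enforcement: this append is rejected with an AnalysisException
// because "humidity" is not part of the table schema (uncomment to see it fail).
// withNewColumn.write.format("delta").mode("append").save("/tmp/delta/events")

// Schema evolution: opting in with mergeSchema accepts the write and adds
// "humidity" at the end of the schema; existing rows get null for it.
withNewColumn.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/tmp/delta/events")
```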

CONCLUSION

The data lakehouse is a powerful architecture pattern where different types of data analytics workloads can be served by a much simpler data architecture. Data engineers can implement complex data engineering pipelines in frameworks such as Apache Spark and write to the data lakehouse. Data analysts can perform ad hoc analytics on the lakehouse. Data scientists can consume data from the lakehouse to train and develop robust data science models. BI engineers can directly query data in the lakehouse for dashboards and reports without needing to copy it to separate data warehouse platforms. With the open table format concept and support from a wide variety of top-notch data processing frameworks, the data lakehouse is truly the next-generation architecture paradigm for the enterprise data repository.

 



