
Data Lakehouse - Where the Best of Data Warehouse and Data Lake Meet

March 18, 2022


In a turn of events, the data lakehouse is pitched as a novel paradigm heralding a shift in data management. With data lakes (DL) and data warehouses (DWH) already prevalent, why would enterprises home in on the data lakehouse concept?

You can’t theorise before having data at hand, said Holmes. Today, by contrast, you have mountains of data to work with but find it daunting to theorise, owing to complex data architecture.

Sample this complicated scenario: an enterprise has many systems, including a data warehouse and a data lake, to drive diverse data workloads (BI, machine learning, streaming, and data engineering). It needs resources with varied skillsets to manage diverse data pipelines and data management, and the operational costs of running multiple systems keep mounting.

What the data silos also do here is:

  • Hinder seamless communication
  • Trigger data duplication
  • Lead to inconsistent data governance & security
  • Necessitate increased data movement, hindering performance

This is where the data lakehouse concept comes in, filling the void with a single repository that addresses varied analytics needs.

Data Lakehouse – A Primer

The data lakehouse is an evolutionary architecture that gives enterprises the structured analytics of a DWH on data housed in a cost-effective, cloud-based data lake. Addressing the drawbacks of capturing, processing, and analysing data across multiple systems, the data lakehouse champions a new data management paradigm: a single data repository that meets diverse analytics needs.

The concept combines the best of the data lake and the data warehouse to address the high cost of laying the data-to-business-insights pipeline. A comparison of data lake vs data warehouse helps show how the ups and downs of the DWH and the DL have shaped the new data lakehouse concept.

Data lake vs Data warehouse

A data lake houses data, but without a schema. You can bring raw data (structured, semi-structured and unstructured) into a data lake without defining its structure at capture time. That lack of structure makes it an ordeal to manage and govern the data.

A data warehouse has schemas for analysing organised data, but it cannot accommodate unstructured data and media files, and data management becomes a tedious process, more so when you consider how slowly insights are acquired from the data. On the flip side, as data grows in a DWH, performance decreases.

Today, enterprises use the data lake and the data warehouse as separate entities: the data lake to store data and the DWH to acquire business insights. In other setups, the data warehouse is embedded into a data lake, or the DWH and the data lake are brought together on a single platform. What stands out is the constant need for ETL engineering between the data lake and the DWH, and in all of these scenarios the cost of managing such ecosystems is quite high.
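
To make that ETL hop concrete, here is a minimal PySpark sketch of the kind of lake-to-warehouse pipeline described above; the bucket, JDBC endpoint, table and column names are all hypothetical, and the exact connector would depend on the warehouse in use.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-dwh-etl").getOrCreate()

# 1. Read raw order events that landed in the data lake as Parquet.
orders = spark.read.parquet("s3://example-data-lake/raw/orders/")

# 2. Transform: keep completed orders and aggregate revenue per day.
daily_revenue = (
    orders.filter(F.col("status") == "COMPLETED")
          .groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"))
)

# 3. Load the curated result into the warehouse over JDBC.
(daily_revenue.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dwh.example.com:5432/analytics")
    .option("dbtable", "reporting.daily_revenue")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("overwrite")
    .save())
```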

This is where the data lakehouse serves as a single platform merging the best of the data warehouse and the data lake.

Single Platform for varied Analytics with Data lakehouse architecture

In this context, the data lakehouse helps set up a single platform to support multiple analytics and BI use cases. The layers and salient components of the data lakehouse architecture are described below.


Data lake

The data lake is the starting point for this concept. It sets up the low-cost storage facility for bringing in raw data, covering structured, unstructured and semi-structured data.
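
As a rough illustration, landing raw data in a cloud data lake can be as simple as copying files of any shape into object storage, with no schema declared up front. The sketch below uses the AWS boto3 client; the bucket and file names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"

# Structured: a CSV export from an operational database.
s3.upload_file("exports/customers.csv", bucket,
               "raw/customers/customers.csv")

# Semi-structured: JSON clickstream events.
s3.upload_file("events/clicks-2022-03-18.json", bucket,
               "raw/clickstream/2022-03-18.json")

# Unstructured: an image attached to a support ticket.
s3.upload_file("attachments/ticket-4711.png", bucket,
               "raw/attachments/ticket-4711.png")
```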

Structured transactional layer

The structured transactional layer strengthens data management by supporting ACID-compliant (atomic, consistent, isolated, durable) transactions on big data workloads. With this layer, you get quicker deletes and updates as well as schema enforcement. For instance, the open-source table format Delta Lake provides such a metadata layer for working with data lakes. Sitting on top of the data lake, the metadata layer provides the structure you need to govern and manage it.
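
Here is a minimal sketch of how this layer is used with the open-source Delta Lake format mentioned above, assuming the delta-spark package is installed; the paths, columns and values are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder.appName("delta-transactions")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Every write to a Delta table is an ACID transaction on data-lake files,
# with schema enforcement applied on write.
events = spark.read.json("s3://example-data-lake/raw/clickstream/")
(events.write.format("delta")
       .mode("append")
       .save("s3://example-data-lake/delta/clickstream/"))

# Deletes and updates run directly against the files in the lake.
table = DeltaTable.forPath(spark,
                           "s3://example-data-lake/delta/clickstream/")
table.delete("event_type = 'test'")
table.update(condition="country = 'UK'", set={"country": "'GB'"})
```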

High-performance query engine

The high-performance query engine is the layer meant to bolster performance; Databricks’ Delta Engine, for instance, serves this purpose. The query engine is built to accelerate queries and enable SQL directly on data lakes.
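
A minimal sketch of what this looks like from the user’s side, using plain Spark SQL as the query engine over files already in the lake (Delta Engine plays the same role with additional acceleration); the table location and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sql").getOrCreate()

# Expose Parquet files in the lake as a SQL table, without copying them
# into a separate warehouse, and query them in place.
spark.sql("""
    CREATE TABLE IF NOT EXISTS clickstream
    USING PARQUET
    LOCATION 's3://example-data-lake/curated/clickstream/'
""")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    WHERE event_type = 'page_view'
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```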

All analytics workloads

With a data lakehouse, you can run all your analytics workloads. You can use the single data repository to facilitate diverse workloads encompassing machine learning, data science, SQL and analytics.
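
As a sketch of what “one repository, many workloads” can look like in practice, the same hypothetical orders table below serves both a BI-style aggregation and a feature pull for data science; the path and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shared-workloads").getOrCreate()
orders = spark.read.parquet("s3://example-data-lake/curated/orders/")

# BI / reporting workload: revenue per country.
orders.groupBy("country").agg(F.sum("amount").alias("revenue")).show()

# Data science workload: per-customer features pulled into pandas for
# model training, straight from the same table.
features = (orders.groupBy("customer_id")
                  .agg(F.count("*").alias("order_count"),
                       F.avg("amount").alias("avg_order_value"))
                  .toPandas())
```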

Benefits of Data lakehouse

To start with, Data lakehouse democratizes data. Enterprises also gain flexibility, cost savings, scalability and increased productivity by embracing the data lakehouse approach. The features captured below help acquire cost and productivity gains.

  • Streamlines the complete data engineering architecture
  • Acts as common staging tier for all analytics use cases and applications
  • Decouples storage from compute resources
  • Accommodates and provides access to various data types including audio, video, images, and text
  • Uses standardized and open data formats like Parquet, providing direct data access to data science and machine learning
  • Enables fast querying of massive structured and unstructured data
  • Enables direct connect between data and analysis tools
  • Supports real-time data applications, eliminating the need to have a separate system for real-time reports
  • Leverages cost-effective cloud-based storage such as Amazon S3, Google Cloud Storage and Azure Blob Storage
  • Allows use of different query engines like Presto or Spark based on the type of data that needs to be queried
  • Lets machine learning tools like PyTorch, TensorFlow and pandas access open formats such as Parquet directly (see the sketch after this list)
  • Supports DW schema architectures including snowflake/star schemas
  • Supports a wide range of use cases on one platform
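
To illustrate the direct-access point from the list above, here is a minimal sketch in which pandas (backed by pyarrow, with s3fs assumed for the S3 path) reads the lakehouse’s Parquet files without any warehouse export, and the result feeds straight into PyTorch; the path and column names are hypothetical.

```python
import pandas as pd
import torch

# Read feature columns straight from Parquet files in the lake.
df = pd.read_parquet(
    "s3://example-data-lake/curated/customer_features/",
    columns=["order_count", "avg_order_value", "churned"],
)

# Hand the same data to an ML framework as tensors.
features = torch.tensor(df[["order_count", "avg_order_value"]].values,
                        dtype=torch.float32)
labels = torch.tensor(df["churned"].values, dtype=torch.float32)
```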

With the promise of a forward-looking data architecture, what remains to be seen is how the data lakehouse performs as an open platform serving all enterprise analytics needs, and whether it meets enterprise expectations.

This blog was originally posted on saksoft.com.

