Unlocking the power of Delta Lakehouse in Kubernetes is a journey through the realms of data lakes and container orchestration. In this blog post, we’ll explore the seamless integration of Delta Lakehouse into Kubernetes, unraveling the potential it holds for data-driven applications and workflows. Whether you’re a data engineer, developer, or simply curious about the intersection of big data and containerization, fasten your seatbelts as we dive into the fascinating world of Delta Lakehouse in K8S.
Delta Lakehouse, built on top of Apache Spark, brings reliability and performance to data lakes by providing ACID transactions and scalable metadata handling. In the context of Kubernetes (K8S), this technology brings a new dimension to managing and processing large-scale data within containerized environments.
Why Delta Lakehouse in Kubernetes?
Kubernetes, commonly known as K8s, has emerged as the industry standard for container orchestration. It offers a robust platform for deploying, managing, and scaling containerized applications. Its dynamic capabilities make it particularly well-suited for data lakes, allowing organizations to efficiently handle evolving data processing requirements. By harnessing Kubernetes, businesses can take advantage of automatic scaling, fault tolerance, streamlined resource allocation, and simplified deployment of data processing workloads.
Understanding Data Lakehouse
Before diving into the world of data lakehouses in Kubernetes, let’s quickly recap what a data lakehouse is. A data lakehouse is a unified data storage architecture that combines the strengths of data lakes and data warehouses. It provides a single repository for storing structured, semi-structured, and unstructured data, making it ideal for handling diverse data types and formats. On top of that, a lakehouse adds schema enforcement, ACID transactions, and support for both batch and real-time analytics, allowing organizations to run complex queries and perform advanced analytics on large datasets without compromising performance.
Setting Up Delta Lakehouse in Kubernetes
To establish Delta Lakehouse in Kubernetes, utilizing the Spark Operator streamlines the deployment and management of Spark applications interacting with Delta Lake. After installing the Spark Operator, define a SparkApplication custom resource specifying key parameters. Submit your Delta Lake application as a Spark job, prompting the Spark Operator to deploy and manage the Spark cluster dynamically. This automated process includes monitoring the application’s status, adjusting resources based on workload changes, and ensuring optimal efficiency. The integration provides a robust, scalable, and containerized solution for handling significant data processing tasks within Kubernetes. Subsequent sections will delve into specific configurations and practical use cases to maximize the potential of Delta Lakehouse in the Kubernetes environment.
Deploying Spark jobs in K8s
To install Spark applications in K8s, we use the Spark Operator. The Spark Operator is a K8s-native tool that helps us run Spark applications on K8s clusters. It simplifies the deployment and management of Spark applications by providing a custom resource definition (CRD) that lets you define Spark applications as Kubernetes resources. This is the approach we used to deploy the Spark application for our project.
A walkthrough for running a Delta Lakehouse Spark application on Kubernetes.
We’re using Delta Lakehouse in Kubernetes, which combines the best parts of a data warehouse and a data lake. Delta Lakehouse can handle both structured and unstructured data. In this blog, we’re trying to set up Delta Lakehouse with the Spark Operator.
Here is a step-by-step guide to setting up Delta Lake within a Spark application on Kubernetes.
Step 1: Create custom Docker images that include the necessary packages to run a Spark application with Delta Lake support.
Why might you want to create your own custom Docker image?
The main reason for creating a custom Docker image, instead of using the default Spark Docker image, is to include the additional packages the code needs. We ran into issues when adding packages and dependencies directly to the deps section of the YAML file; the typical failure is a “ModuleNotFoundError” raised when a required package is missing at import time.
The packages used for creating Docker images in our project are as follows, but keep in mind that they may vary depending on your specific requirements.
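As a rough sketch, a custom image for a PySpark job with Delta Lake support might look like the following; the base image, package names, and versions below are illustrative assumptions, not the exact set we used, so adjust them to your Spark version and workload.

```dockerfile
# Illustrative only: base image, packages, and versions are assumptions.
FROM apache/spark-py:v3.3.1

USER root

# Python packages needed by the Delta Lake job (adjust to your requirements)
RUN pip install --no-cache-dir \
    delta-spark==2.3.0 \
    azure-storage-blob

# Return to the non-root Spark user expected by the Spark Operator
USER 185
```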
Step 2: Execute the following commands to add a Helm chart repository.
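The commands typically look like the following; the repository URL is the one commonly documented for spark-on-k8s-operator and may differ for newer operator releases (the project has since moved under the Kubeflow organization).

```bash
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
```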
Then install the chart.
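An install command along these lines is typical; the release name and namespace here are our illustrative choices.

```bash
# Installs the Spark Operator into its own namespace
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace
```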
Now the infrastructure is ready. Let’s run the Spark job.
Step 3: Create a SparkApplication manifest in whatever format suits your job. Here we start from spark-pi.yaml, which is readily available in the spark-on-k8s-operator GitHub repository, and modify it for our use case.
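A trimmed sketch of such a manifest, modeled on spark-pi.yaml, is shown below; the image, application file path, service account, and resource values are placeholders rather than our actual production manifest.

```yaml
# Sketch of a SparkApplication manifest; values are illustrative placeholders.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: delta-lake-job
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "<your-registry>/spark-delta:latest"   # custom image from Step 1
  mainApplicationFile: "local:///opt/spark/work-dir/delta_job.py"
  sparkVersion: "3.3.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark-operator-spark   # service account created by the Helm chart
  executor:
    instances: 2
    cores: 1
    memory: "1g"
```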
Azure Blob Storage is a cloud-based object storage service provided by Microsoft Azure. It is designed to store and manage large amounts of unstructured data, such as text or binary data, like documents, images, and videos. Azure Blob Storage can be accessed using REST APIs or Azure SDKs, and it can be used to build applications that require scalable and durable storage for unstructured data.
Configuration (Azure Blob storage)
Here are the steps to configure Delta Lake on Azure Blob storage.
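A minimal sketch of this configuration is below, following the pattern documented for Delta Lake on Azure Blob Storage; it assumes the Delta Lake and hadoop-azure JARs are already on the classpath (for example, baked into the custom image from Step 1), and the storage account, container, and access key are placeholders.

```python
# Illustrative Delta Lake on Azure Blob Storage (wasbs://) configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-blob")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Storage account credentials for the WASB driver
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<storage-account-access-key>")

# Read and write Delta tables with a wasbs:// path
path = "wasbs://<container>@<storage-account>.blob.core.windows.net/delta/events"
spark.range(5).write.format("delta").mode("overwrite").save(path)
spark.read.format("delta").load(path).show()
```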
Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
Configuration (ADLS Gen2)
Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen2.
Include the JAR of the Maven artifact hadoop-azure-datalake in the classpath. See the requirements for version details. In addition, you may also have to include JARs for Maven artifacts hadoop-azure and wildfly-openssl.
Set up Azure Data Lake Storage Gen2 credentials.
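The sketch below covers both steps, pulling in the Maven artifacts and configuring an OAuth service principal for ABFS, following the pattern documented for Delta Lake on ADLS Gen2; the artifact versions, storage account, and credentials are placeholders that must match your environment.

```python
# Illustrative ADLS Gen2 (abfss://) setup with a service principal.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-adls-gen2")
    # Maven artifacts mentioned above; match versions to your Hadoop/Spark build
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.3.0,"
            "org.apache.hadoop:hadoop-azure:3.3.4,"
            "org.apache.hadoop:hadoop-azure-datalake:3.3.4")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
               "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               "<client-secret>")
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")
```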
Usage (ADLS Gen2)
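With the configuration in place, usage mirrors the standard Delta Lake read/write API; the container and path below are placeholders, and the snippet reuses the session configured above.

```python
# Write and read a Delta table on ADLS Gen2 using an abfss:// path
path = "abfss://<container>@<storage-account>.dfs.core.windows.net/delta/events"

spark.range(10).write.format("delta").mode("overwrite").save(path)
spark.read.format("delta").load(path).show()
```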
Difficulties encountered while running Spark in AKS
When partitioning the data: special characters cannot be used in directory names on Azure Storage, so partition column values must avoid them.
A file-open error that prevents reading the Delta log can occur when foreachBatch is used in PySpark on Azure.
SparkPodSpec
The SparkPodSpec defines common elements that can be customized for a Spark driver or executor pod. In our manifest, we customized the driver and executors according to our resource requirements and computation time.
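An illustrative excerpt of the driver and executor sections of a SparkApplication manifest is shown below; the resource values and labels are placeholders, not our production settings.

```yaml
# Excerpt of a SparkApplication spec showing driver/executor customization
spec:
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: 3.3.1
    serviceAccount: spark-operator-spark
  executor:
    instances: 3
    cores: 2
    memory: "4g"
    labels:
      version: 3.3.1
```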
Conclusion
In conclusion, the strategic choice of Delta Lakehouse in Kubernetes aligns with industry standards, capitalizing on Kubernetes’ automatic scaling, fault tolerance, and resource efficiency. Throughout the setup process, the Spark Operator emerged as a pivotal tool, streamlining the deployment and management of Spark applications interacting with Delta Lake.
In summary, the synergy between Delta Lakehouse and Kubernetes holds immense potential for advancing data-driven applications and workflows. This convergence, at the intersection of big data and containerization, stands poised to shape the future of efficient and scalable data processing.