What is an Open Data Lake?

QuboleTechnologies

@QuboleTechnologies

June 30, 2020

Data Science & AI Community

A data lake is a system or repository that stores data in its raw format, as well as transformed, trusted datasets, and provides both programmatic and SQL-based access to this data for diverse analytics tasks such as machine learning, data exploration, and interactive analytics.

The data stored in a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

A data lake in which data is stored in an open format and accessed through open standards-based interfaces is defined as an Open Data Lake. This adherence to an open philosophy, aimed at preventing vendor lock-in, permeates every aspect of the system, including data storage, data management, data processing, operations, data access, governance, and security.

We define an open format as a format that is based on an underlying open standard, developed and shared through a publicly visible and community-driven process without vendor-specific proprietary extensions. For example, an Open Data Format is a platform-independent, machine-readable data format such as ORC or Parquet, whose specification is published to the community, such that any organization can create tools and applications to read data in the format.

A typical data lake has the following capabilities:

Data Ingestion and Storage
Data Processing and Support for Continuous Data Engineering
Data Access and Consumption
Data Governance – Discoverability, Security and Compliance
Infrastructure and Operations

In the following sections, we will describe openness requirements for each capability.

Data Ingestion and Storage

An Open Data Lake ingests data from sources such as applications, databases, data warehouses, and real-time streams. It formats and stores the data in an open data format, such as ORC or Parquet, that is platform-independent, machine-readable, optimized for fast access and analytics, and made available to consumers without restrictions that would impede the re-use of that information.

An Open Data Lake supports both pull-based and push-based ingestion of data: pull-based ingestion through batch data pipelines and push-based ingestion through stream processing. For both types of ingestion, it supports open standards such as SQL and open source frameworks such as Apache Spark for authoring data transformations. For batch data pipelines, it supports row-level inserts and updates (UPSERT) to datasets in the lake. Upsert capability with snapshot isolation (and, more generally, ACID semantics) makes it far simpler to apply changes to existing data than rewriting data partitions or entire datasets.
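The upsert behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular engine implements it: the dataset is an in-memory mapping keyed by a hypothetical `id` field, and the names `Table`, `upsert`, and `snapshot` are invented for this example. Each upsert publishes a new version instead of modifying the old one, so a reader holding an earlier snapshot keeps seeing consistent data.

```python
from types import MappingProxyType


class Table:
    """Toy versioned table: every upsert publishes a new snapshot (hypothetical API)."""

    def __init__(self):
        self._versions = [MappingProxyType({})]  # version 0 is the empty table

    def snapshot(self):
        """Return the latest committed version as a read-only view."""
        return self._versions[-1]

    def upsert(self, rows, key="id"):
        """Insert new rows and update existing ones, matched on `key`."""
        merged = dict(self._versions[-1])  # copy-on-write: old versions stay untouched
        for row in rows:
            # Build a fresh row dict so rows in earlier versions are never mutated.
            merged[row[key]] = {**merged.get(row[key], {}), **row}
        self._versions.append(MappingProxyType(merged))
        return len(self._versions) - 1  # new version number


table = Table()
table.upsert([{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}])
before = table.snapshot()  # a reader pins this snapshot

table.upsert([{"id": 2, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}])
after = table.snapshot()

print(before[2]["city"], after[2]["city"], sorted(after))
```

The reader that pinned `before` still sees the old value for row 2, while new readers see the update and the inserted row. Only the changed rows were supplied; nothing had to be rewritten by hand.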

The ingest capability of an Open Data Lake ensures zero data loss with exactly-once or at-least-once writes, handles schema variability, writes in the most optimized data format into the right partitions, and provides the ability to re-ingest data when needed.
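One common way to get effectively exactly-once results on top of at-least-once delivery is to make the write idempotent. The sketch below assumes each record carries a unique `event_id` and an `event_date` used as the partition key; these field names, the `ingest` function, and the in-memory `partitions` store are hypothetical stand-ins for a real lake writer.

```python
from collections import defaultdict


def ingest(records, partitions, seen_ids):
    """Idempotently write records into date partitions.

    records    -- iterable of dicts; each needs `event_id` and `event_date`
    partitions -- dict mapping partition key -> list of stored rows
    seen_ids   -- set of event_ids already written (the dedup ledger)
    """
    written = 0
    for rec in records:
        if rec["event_id"] in seen_ids:
            continue  # duplicate from a retry: skip, so replays are harmless
        # Schema variability: keep unknown fields instead of rejecting the record.
        partitions[rec["event_date"]].append(dict(rec))
        seen_ids.add(rec["event_id"])
        written += 1
    return written


partitions = defaultdict(list)
seen = set()
batch = [
    {"event_id": "a1", "event_date": "2020-06-29", "clicks": 3},
    {"event_id": "a2", "event_date": "2020-06-30", "clicks": 5, "referrer": "ad"},
]
first = ingest(batch, partitions, seen)
replayed = ingest(batch, partitions, seen)  # same batch delivered again
print(first, replayed, sorted(partitions))
```

Because the replayed batch writes nothing, the same function also covers re-ingestion: a source can be replayed after a failure without creating duplicates.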

Data Processing and Support for Continuous Data Engineering

An Open Data Lake stores the raw data from various data sources in a standardized open format. However, use cases such as data exploration, interactive analytics, and machine learning require that the raw data be processed to create use-case-driven trusted datasets. For data exploration and machine learning use cases, users continually refine datasets for their analysis needs. As a result, every data lake implementation must enable users to iterate between data engineering and use cases such as interactive analytics and machine learning. We call this “Continuous Data Engineering”.

Continuous Data Engineering involves the interactive ability to author, monitor, and debug data pipelines. In an Open Data Lake, these pipelines are authored using standard interfaces and open source frameworks such as SQL, Python, Apache Spark, and/or Apache Hive.

Data Access and Consumption

The most visible outcome of a data lake is the range of use cases it enables. Whether the use case is data exploration, interactive analytics, or machine learning, access to data is vital. Data can be accessed through SQL or through programming languages such as Python, R, and Scala. While SQL is the norm for interactive analysis, programming languages are used for more advanced applications such as machine learning and deep learning.

An Open Data Lake supports data access through a standards-based implementation of SQL with no proprietary extensions. It enables external tools to access that data through standards such as ODBC and JDBC. The Open Data Lake also supports programmatic access to data via standard programming languages such as R, Python, and Scala, and standard libraries for numerical computation and machine learning such as TensorFlow, Apache Spark MLlib, MXNet, Keras, and scikit-learn.
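The two access paths can be illustrated side by side. In this sketch, Python's built-in `sqlite3` module stands in for a lake query engine, and the Python DB-API plays a role loosely analogous to ODBC or JDBC; the `trips` table and its columns are made up for the example. The same average is computed once in SQL and once in ordinary Python code.

```python
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips (city, fare) VALUES (?, ?)",
    [("Pune", 120.0), ("Pune", 80.0), ("Delhi", 200.0)],
)

# SQL access: an aggregate written in plain, portable SQL.
sql_result = dict(
    conn.execute("SELECT city, AVG(fare) FROM trips GROUP BY city ORDER BY city")
)

# Programmatic access: pull the rows and compute the same figure in Python.
fares_by_city = {}
for city, fare in conn.execute("SELECT city, fare FROM trips"):
    fares_by_city.setdefault(city, []).append(fare)
py_result = {city: mean(fares) for city, fares in fares_by_city.items()}

print(sql_result == py_result)
conn.close()
```

The query avoids engine-specific syntax, so it should carry over to other SQL engines with little or no change, which is the point of avoiding proprietary extensions.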

Data Governance – Discoverability, Security and Compliance

When data ingestion and data access are implemented well, data can be made widely available to users in a democratized fashion. When multiple teams start accessing data, data architects need to exercise oversight for governance, security, and compliance purposes.

Data Discovery

Data itself is hard to find and comprehend and not always trustworthy. Users need the ability to discover and profile datasets for integrity before they can trust them for their use case. A data catalog enriches metadata through different mechanisms, uses it to document datasets, and supports a search interface to aid discovery.

Since the first step is to discover the required datasets, it’s essential to surface metadata to end-users for exploration purposes, to see where the data resides and what it contains, and to determine if it is useful for answering a particular question. Discovery includes data profiling capabilities that support interactive previews of datasets to shine a light on formatting, standardization, labels, data shape, and so on.
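As a rough illustration of what data profiling surfaces, the sketch below summarizes each column of a small dataset. The `profile` function and the sample rows are hypothetical; a real data catalog would compute similar statistics at scale and store them as metadata.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: non-null count, nulls, distinct values, observed types."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]  # a missing field counts as null
        present = [v for v in values if v is not None]
        report[name] = {
            "count": len(present),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


sample = [
    {"user": "u1", "age": 34, "country": "IN"},
    {"user": "u2", "age": None, "country": "IN"},
    {"user": "u3", "age": "41"},  # age stored as text: a formatting problem
]
for column, stats in profile(sample).items():
    print(column, stats)
```

Even this small report exposes the kinds of issues mentioned above: a column with mixed types, missing values, and a field that is absent from some records.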

An Open Data Lake provides an open metadata repository. As an example, the Apache Hive metadata repository (the Hive Metastore) is an open metadata repository that prevents vendor lock-in for metadata.

Security

Increasing accessibility to the data requires data lakes to support strong access control and security features on the data. An Open Data Lake does this through non-proprietary security and access control APIs. As an example, deep integration with open source frameworks such as Apache Ranger and Apache Sentry can facilitate granular security at the table, row, and column level. This enables administrators to grant permissions against user roles already defined in enterprise directories such as Active Directory. By basing access control on open source frameworks, the Open Data Lake avoids vendor lock-in through proprietary security implementations.
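Row-level and column-level security can be pictured as a filter applied between the stored table and the reader. The sketch below is purely illustrative: the `POLICIES` table, the role names, and `read_table` are invented for this example and do not reflect the Apache Ranger or Apache Sentry APIs.

```python
# Hypothetical policy table: role -> allowed columns and a row filter.
POLICIES = {
    "analyst": {
        "columns": {"order_id", "region", "amount"},
        "row_filter": lambda row: row["region"] == "APAC",
    },
    "admin": {
        "columns": {"order_id", "region", "amount", "customer_email"},
        "row_filter": lambda row: True,
    },
}


def read_table(rows, role):
    """Return only the rows and columns the role is allowed to see."""
    policy = POLICIES.get(role)
    if policy is None:
        # No policy means no table-level access at all.
        raise PermissionError(f"role {role!r} has no access to this table")
    return [
        {col: val for col, val in row.items() if col in policy["columns"]}
        for row in rows
        if policy["row_filter"](row)
    ]


orders = [
    {"order_id": 1, "region": "APAC", "amount": 250, "customer_email": "a@example.com"},
    {"order_id": 2, "region": "EMEA", "amount": 90, "customer_email": "b@example.com"},
]
print(read_table(orders, "analyst"))
```

Here the analyst role sees only APAC orders and never the `customer_email` column, while a role with no policy is denied outright.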

Compliance

New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around the “Right to Erasure” and the “Right to Be Forgotten”. These govern consumers’ rights over their data and carry stiff financial penalties for non-compliance (under GDPR, as much as 4% of annual global turnover), so they must not be overlooked. The ability to delete specific subsets of data without disrupting a data management process is therefore essential. An Open Data Lake supports this ability on open formats and open metadata repositories. In this way, it enables a vendor-agnostic solution to compliance needs.
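Deleting one person's data without disturbing the rest of the dataset can be sketched as a targeted rewrite. The example below assumes data is organized into date partitions and that each row carries a `user_id`; the `erase_subject` function and the in-memory `lake` are hypothetical. Only partitions that actually contain matching rows are rewritten.

```python
def erase_subject(partitions, subject_id, key="user_id"):
    """Remove every row belonging to `subject_id`; return the partitions rewritten."""
    rewritten = []
    for name, rows in partitions.items():
        kept = [row for row in rows if row.get(key) != subject_id]
        if len(kept) != len(rows):
            partitions[name] = kept  # rewrite only partitions that held matching rows
            rewritten.append(name)
    return rewritten


lake = {
    "2020-06-29": [{"user_id": "u1", "page": "/home"}, {"user_id": "u2", "page": "/buy"}],
    "2020-06-30": [{"user_id": "u2", "page": "/home"}],
}
touched = erase_subject(lake, "u1")
print(touched, lake["2020-06-29"])
```

The returned list of rewritten partitions doubles as a simple audit trail of what the erasure request changed.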

Infrastructure and Operations

Whether the data lake is deployed in the cloud or on-premises, each environment, and each cloud provider in particular, has its own way to provision, configure, monitor, and manage the data lake and the resources it needs.

An Open Data Lake is cloud-agnostic and portable across any cloud-native environment, including public and private clouds. This enables administrators to leverage the benefits of both public and private clouds from the perspectives of economics, security, governance, and agility.

Conclusion

The increase in the volume, velocity, and variety of data, combined with new types of analytics and machine learning, is creating the need for an open data lake architecture. Across our 200+ customers, including market leaders like Expedia, Disney, Lyft, and Adobe, we find that the Open Data Lake is becoming a common feature alongside the Data Warehouse. While the Data Warehouse has been designed and optimized for SQL analytics, the need for an open, simple, and secure data lake platform that can support new types of analytics and machine learning is driving Open Data Lake adoption. Unlike the Data Warehouse’s world of proprietary formats, proprietary SQL extensions, proprietary metadata repositories, and lack of programmatic access to data, the Open Data Lake ensures no vendor lock-in while supporting a diverse range of analytics. Open Data Lakes provide a robust and future-proof data management paradigm to support a wide range of data processing needs, including data exploration, interactive analytics, and machine learning.

P.S. This article was first published on https://www.qubole.com/
