Why Self-Service Access to Data Stored in Data Lakes Is Such a Challenge for Today’s Enterprises (Part 1)


QuboleTechnologies

August 31, 2019

Big Data Analytics

Data sources of every kind, from the sensors in cars and machinery to the devices we carry around in our pockets, constantly record and transmit data, adding up to an estimated 2.5 quintillion bytes per day. With this influx of information, enterprises are scrambling to capture all of the right data points, organize and process that data, and convert it into actionable insights. Yet in the hustle to transform raw data into usable information, we tend to forget how sophisticated and challenging the process between the input and output has become.

We’ve seen the infrastructure to process data become far more complex, in part due to the seismic shift in the diversity of data that businesses collect today and the variety of analyses to which data is subjected. Data lakes now coexist with traditional data warehouses — the result of relatively simple methods of decades past morphing into an elaborate process involving many different components, all of which interact with one another.

This patchwork infrastructure creates its own set of challenges. While self-service access to data stored in data warehouses has become a reality with the evolution of business intelligence platforms, it remains challenging for enterprises to deliver the same type of access to data stored in data lakes. Today, providing self-service access to data stored in data lakes means mastering open source software, data operations, infrastructure management, security, and supporting the variety of ways different data users within the enterprise access data.

Business leaders and data teams now need to consider many moving parts when giving users self-service access to data lakes, and to think about their data processes from a holistic perspective. Neglecting any of the puzzle pieces can have a lasting effect far beyond the current and future state of your infrastructure — it can impact your company’s overall productivity and your bottom line.

Resolving Data Operations Challenges

Data teams today face an endless flood of data sources, and are finding it impossible to keep up with the operational demands of such an extensive data intake. The proliferation of structured, unstructured, and semi-structured data formats requires teams to continuously assess data quality prior to publishing data sets for enterprise data consumers. Teams expend significant time and resources preparing and ingesting data generated from sources within the company’s control as well as data provided by third-party vendors.

Not only is the quality of data critical, but delivering that data in a timely manner to analysts, data scientists, and software development teams is equally important. Data delayed is data denied. As a result, data teams must be able to measure SLAs and incorporate early fault detection, fault remediation, and high-availability practices into data operations. Furthermore, any industrial-grade data operations practice must involve tracking and managing the costs of operating the data lake infrastructure as well as driving visibility into how costs are allocated across the organization and different data projects.
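One way to make "measuring SLAs" concrete is to compare when each data set was due with when it actually landed. The following is a minimal sketch in Python under stated assumptions: the `Delivery` record, the `sla_report` function, the grace period, and the sample data sets are all hypothetical and illustrative, not part of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Delivery:
    """One published data set: when it was due and when it actually landed."""
    dataset: str
    due: datetime
    delivered: datetime


def sla_report(deliveries, grace=timedelta(0)):
    """Return (fraction delivered on time, list of late deliveries).

    A delivery counts as late when it lands after its due time plus
    the grace period. An empty input is treated as fully compliant.
    """
    late = [d for d in deliveries if d.delivered > d.due + grace]
    if not deliveries:
        return 1.0, late
    return 1 - len(late) / len(deliveries), late


# Hypothetical sample: four data sets all due at 06:00.
deliveries = [
    Delivery("orders", datetime(2019, 8, 1, 6), datetime(2019, 8, 1, 5, 40)),
    Delivery("clicks", datetime(2019, 8, 1, 6), datetime(2019, 8, 1, 7, 15)),
    Delivery("sensors", datetime(2019, 8, 1, 6), datetime(2019, 8, 1, 6, 5)),
    Delivery("billing", datetime(2019, 8, 1, 6), datetime(2019, 8, 1, 6, 0)),
]

ratio, late = sla_report(deliveries, grace=timedelta(minutes=10))
print(f"on-time ratio: {ratio:.2f}")
for d in late:
    print(f"late: {d.dataset} by {d.delivered - d.due}")
```

The list of late deliveries is returned alongside the ratio so that the same check can feed both a dashboard and an early fault alert. A real practice would read these timestamps from the pipeline scheduler rather than hard-code them.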

Simplifying Infrastructure Management

Unlike in previous decades, when data storage and data processing were tightly coupled, the emergence of standard formats has decoupled storage from specific analytical processing engines. Different analytical engines can now process the same data set without replicating the data into a separate repository for each type of processing.

For instance, many companies now leverage more than one of the leading big data engines or frameworks: Presto, Apache Spark, Hive, Hadoop, and TensorFlow. Qubole’s recently published Big Data Activation Report uncovered 162 percent growth in usage across Spark, Presto, and Hadoop/Hive in the span of one year (from 2017 to 2018). What’s noteworthy is that this growth was not driven by a single engine: Presto usage grew by 420 percent, Spark by 298 percent, and the remainder stemmed from growth of Hadoop/Hive usage. Data teams use these technologies to conduct different types of analysis on the same data set stored in Parquet, ORC, JSON, or other open formats. This has simplified data pipelines, since data no longer needs to be replicated into many different monolithic systems (data warehouses, graph databases, machine learning systems, etc.), but it has also made managing the data infrastructure much more complex.

Data teams are also experiencing increasingly diverse use cases driven by an ever-increasing number of users within the enterprise seeking access to a growing variety of data. This scenario places an onus on the infrastructure to continue to evolve to meet these growing and changing requirements. Predicting data demands on the infrastructure has become impossible, which leaves the data team playing a never-ending game of catch-up and missing SLAs.

Securing Enterprise Data

Data security remains difficult for enterprises to successfully execute — a problem compounded by the breadth of protection that today’s vast volumes of data require. Regardless of how sophisticated the security policies and controls being implemented are, organizations not in the big data business will struggle to keep pace with the costs and resources that a comprehensive security strategy demands. The proliferation of security tools and frameworks alone calls for a dedicated staff and the resources to provide constant security testing, patching, penetration testing, and vulnerability assessments.

Extensive oversight of access controls is also crucial for identifying security vulnerabilities and preventing breaches. Your data administrators need the means to control access across all processing engines and frameworks not only to ensure privacy, but also to conduct auditing and guarantee compliance. Encryption is yet another essential component of your security infrastructure to help safeguard the privacy of your data assets at rest and in transit. And with so many tools and technologies living in the cloud, it’s vital that your team employ and maintain appropriate encryption methodologies to protect against security infiltrations.
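To illustrate the idea of controlling access across engines while keeping an audit trail, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `POLICIES` table, the role and data set names, and the `check_access` function are hypothetical and do not correspond to the API of any specific engine or security product. The point is only that one shared decision function can serve every engine and record each decision for later auditing.

```python
from datetime import datetime, timezone

# Hypothetical policy table: (role, data set) -> set of allowed actions.
POLICIES = {
    ("analyst", "sales"): {"read"},
    ("engineer", "sales"): {"read", "write"},
}

# In-memory audit trail; a real deployment would persist this somewhere durable.
audit_log = []


def check_access(user, role, dataset, action, engine):
    """Decide whether a request is allowed and record the decision.

    The same check is applied regardless of which engine issues the
    request, so the policy is defined once rather than per engine.
    Unknown (role, data set) pairs are denied by default.
    """
    allowed = action in POLICIES.get((role, dataset), set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "action": action,
        "engine": engine,
        "allowed": allowed,
    })
    return allowed


# The same analyst is checked identically whether the query comes from
# Presto or Spark (engine names here are just labels in the audit record).
read_ok = check_access("ana", "analyst", "sales", "read", "presto")
write_ok = check_access("ana", "analyst", "sales", "write", "spark")
print(read_ok, write_ok, len(audit_log))
```

Denied requests are logged as well as granted ones, since failed attempts are often the more useful signal when auditing for compliance.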

Supporting End-User Tools and Interfaces

Modern enterprises depend on many tools and technologies, each with its own interface. This problem is compounded by the fact that different tools are designed or optimized for specific users, whether data scientists, data engineers, data analysts, or administrators. Data scientists are concerned with having tools that enable machine learning, provide end-to-end visibility of the data pipeline, and automate the mundane tasks of scaling the data infrastructure. Data engineers require tools to optimize data processing tasks and streamline production of data pipelines, while data analysts need a way to discover, query, and visualize all data formats. Data administrators, on the other hand, are focused on configuring, managing, and governing data and infrastructure resources.

Building self-service access to a data lake means that all of these tools and interfaces must work well with the data lake infrastructure. It becomes the data team’s burden to ensure that the interoperability between these tools and the data lake infrastructure meets the needs of the people using the tools. For example, the latency of ODBC/JDBC connectivity with the BI tool of choice is important for data analysts, while data engineers and administrators must be able to expose metadata about their workloads to the tools they use.

Beyond managing the interactions between the data lake infrastructure and related tools, data teams must also set up and operate the tools that diverse data users need. Data scientists might be clamoring for a Jupyter- or Zeppelin-based notebook, while an analyst wants to use Looker or Tableau. The diversity of tools means that the data team now has to operate and manage both the Tableau implementation and the Jupyter notebook service.

Embracing Open Source Software

As companies expand their big data initiatives — across machine learning, artificial intelligence, business intelligence, ETL (Extract, Transform, Load), and future workload types — they must likewise expand their toolkit, as some tools will perform optimally for certain workloads but less so for others. For instance, Presto performs best for interactive analyses and data discovery while Apache Spark works well processing memory-intensive workflows for the purpose of creating data pipelines or implementing machine learning algorithms.

Presto and Spark, along with many other open source big data tools (Hive, Hadoop, TensorFlow, Airflow, etc.), deliver unique and powerful crowd-sourced capabilities that companies can leverage for their big data initiatives. And yet these open source options require a greater commitment than one unfamiliar with technology stacks might believe.

Any company interested in or currently using open source engines must be constantly plugged into the open source community. Each open source tool has its own group of members who actively enhance the technology, so it’s critical that you have the means to monitor changes and up-and-coming advancements, determine their value for your environments, and decide on the right timing and process for upgrades or patches. For every company leveraging big data, this problem is amplified by the use of multiple open source tools and frameworks. A recent Qubole report found that 76 percent of enterprises actively use at least three engines for their big data workloads, which requires you to juggle community activity, maintenance, and performance tuning across all of those workload types.

Closing Thoughts

As a result of all of these changes, enabling self-service access to data stored in data lakes has become increasingly complex. Delivering such a sophisticated infrastructure to data users within the enterprise places an enormous burden on data teams. Consequently, many installations fail: 85 percent of all big data projects, in fact. This failure holds companies back from effectively using data, the most important asset that they have today. In the subsequent blog posts of this series, I will outline how data teams can get ahead of this game and provide adequate infrastructure and support to their data users and, in the process, take steps toward becoming a data-driven company.
