Cloud Data Lakes – Best Practices

July 17, 2020


BI tools have been the go-to for data analysts who help businesses track top-line, bottom-line and customer experience metrics. BI tools analyze small sets of relational data (a few terabytes) in a data warehouse, and their queries require only small data scans (a few gigabytes) to execute.

But businesses are now looking beyond BI to interactive, streaming and clickstream analytics, machine learning and deep learning in order to gain a data-led advantage. For these types of analytics applications, data lakes are the preferred option. Data lakes can ingest data of any volume, variety and velocity, and stage and catalog it centrally. The data is then made available to a variety of analytics applications, at any scale, in a cost-efficient manner.

Let’s look at best practices in setting up and managing data lakes across three dimensions –

  1. Data ingestion
  2. Data layout
  3. Data governance

Cloud Data Lake – Data Ingestion Best Practices

Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write data exactly once or at least once. It must also handle schema variability, write data in the most optimized format into the right partitions, and provide the ability to re-ingest data when needed.

  • Batch Data Ingestion: For batch ingestion of transactional data, the data lake must support UPSERT – row-level inserts and updates – to datasets in the lake. Upsert capability with snapshot isolation and ACID semantics simplifies the task, as opposed to rewriting data partitions or entire datasets; a sketch of such an upsert appears after the next list. ACID semantics ensure that concurrent writes and reads on the data lake do not compromise data integrity or degrade read performance.
  • Streaming Data Ingestion: For streaming data, the data lake must guarantee that data is written exactly once or at least once. A recommended combination is Spark Structured Streaming consuming data that arrives at variable velocity from message queues such as Kafka and Amazon Kinesis (a minimal sketch follows this list). A data lake solution for stream processing should integrate with the schema registry of the message queue, and it must support replay so that, as stream-processing logic evolves with the business, outdated events can be re-processed or reinstated.
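
To make the streaming path concrete, here is a minimal PySpark sketch of Spark Structured Streaming reading from Kafka, flattening the JSON payload and writing partitioned Parquet with checkpointing – the mechanism behind end-to-end exactly-once writes to a file sink. The broker address, topic name, event schema and S3 paths are illustrative assumptions.

```python
# Minimal Structured Streaming sketch: Kafka -> flattened, partitioned Parquet.
# Requires the spark-sql-kafka package; names, schema and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_date, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("stream-ingest-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "clickstream")
       .load())

# Flatten the semi-structured payload into typed columns for a columnar sink.
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withColumn("dt", to_date(col("event_time"))))

# Checkpointing plus the file sink's commit protocol gives exactly-once output;
# daily partitions keep downstream scans small.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://lake/staged/clickstream/")
         .option("checkpointLocation", "s3://lake/checkpoints/clickstream/")
         .partitionBy("dt")
         .trigger(processingTime="1 minute")
         .start())
```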

Apart from batch and stream ingestion modes, data lakes must also provide for

  • Source-to-target schema conversion – intelligently detect the source schema, create logical tables on the fly, and flatten semi-structured JSON, XML or CSV into columnar file formats.
  • Monitoring data movement – connect pipelines and the underlying infrastructure to rich monitoring and alerting tools such as Datadog, Prometheus and SignalFx to shorten time to recovery after a failure.
  • Keeping data fresh – data restatement and row-level inserts and updates via UPSERT keep datasets current (see the sketch after this list).
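
The batch upsert and "keeping data fresh" points above can be illustrated with a MERGE statement, which ACID table formats such as Delta Lake, Apache Hudi and Apache Iceberg expose through Spark SQL. This is a minimal sketch rather than a definitive implementation; the database, table and key column names (lake.customers, customers_updates, customer_id) are assumptions for illustration.

```python
# Minimal PySpark sketch of a row-level upsert into an ACID table format
# (e.g. Delta Lake or Apache Iceberg); table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-upsert-sketch").getOrCreate()

# Incoming batch of new and changed rows, landed as Parquet by an upstream job.
updates = spark.read.parquet("s3://lake/raw/customers_updates/")
updates.createOrReplaceTempView("customers_updates")

# MERGE gives insert-or-update semantics with snapshot isolation,
# instead of rewriting whole partitions or datasets.
spark.sql("""
    MERGE INTO lake.customers AS target
    USING customers_updates AS source
      ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```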

Cloud Data Lake – Data Layout Best Practices

Data generation and data collection across semi-structured and unstructured formats is both bursty and continuous. Inspecting, exploring and analyzing these datasets in their raw form is tedious, because analytical engines must scan the entire dataset across many files. We recommend five ways to reduce the data scanned and lower query overheads (a brief sketch of the first three follows the list) –

  • Columnar data formats for read analytics – use open-source columnar formats such as ORC and Parquet to reduce data scans and to avoid queries that have to parse JSON with json_parse and json_extract.
  • Partition data – partition by time, geography or line of business to reduce data scans, and tune partition granularity to the dataset under consideration (by hour vs. by second).
  • Compaction to chunk up small files – asynchronously compact small files into larger ones to reduce network overheads.
  • Perform stats-based cost-based optimization – collect dataset statistics such as file sizes, row counts and histograms of values to optimize queries with join reordering.
  • Use Z-order indexed materialized views for cost-based optimization – a Z-order index serves queries that filter on multiple columns in any combination, not just data sorted on a single column.
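
Below is a minimal PySpark sketch of the first three practices – columnar formats, partitioning and compaction. The paths, the event_time field, the example partition and the target file count are illustrative assumptions; compaction would normally run as a separate, asynchronous job.

```python
# Minimal sketch of layout practices: columnar files, coarse partitions,
# and compaction of small files. Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("layout-sketch").getOrCreate()

# Convert raw JSON events once into day-partitioned Parquet for read analytics.
raw = spark.read.json("s3://lake/raw/clickstream/")
(raw.withColumn("dt", to_date(col("event_time")))
    .write.mode("append")
    .partitionBy("dt")                 # partition by day, not by second
    .parquet("s3://lake/curated/clickstream/"))

# Compaction: rewrite one day's partition into a few large files to cut
# per-file and network overheads for downstream scans.
day = spark.read.parquet("s3://lake/curated/clickstream/dt=2020-07-01/")
(day.repartition(8)                    # target a small number of large files
    .write.mode("overwrite")
    .parquet("s3://lake/compacted/clickstream/dt=2020-07-01/"))
```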

Managed data lakes can deliver autonomous data management capabilities to operationalize the aforementioned data layout strategy.

Cloud Data Lake – Data Governance Best Practices

With data lakes, multiple teams will start accessing data. There needs to be a strong focus on oversight, regulatory compliance and role-based access control along with delivering meaningful experiences. A single interface for configuration management, auditing, obtaining job reports and exercising cost control is key. Here are three recommendations for data governance –

Discover Your Data

A data catalog helps users discover datasets and profile them for integrity; it enriches metadata through different mechanisms, documents datasets, and supports a search interface (a brief sketch follows the list below).

  • Use crawlers and classifiers to catalog data. Automatically adding descriptions of how data, especially unstructured data, came in, and keeping the metadata and data in sync, speeds up the end-to-end cycle from discovery to consumption.
  • Data dictionary and lineage. Data dictionaries contain table and column descriptions, the most frequent users, usage statistics and canonical queries for a specific table. Data lineage lets users trust data for business use by showing a life cycle map of all the modifications the data has undergone since its origin.
  • Metadata management. Answering a question such as customer churn typically requires wrangling new and disparate datasets. It is essential to surface a data dictionary to end users for exploration, so they can see where the data resides and what it contains, and determine whether it is useful for answering a particular question.
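
As a small illustration of surfacing a data dictionary, the sketch below attaches a table-level description in the metastore and then lists tables and columns through the Spark catalog API. The database and table names and the description text are assumptions; in practice this metadata would usually be maintained by a dedicated catalog service rather than ad-hoc code.

```python
# Minimal sketch of exposing catalog metadata to end users with PySpark;
# database/table names and descriptions are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Attach a human-readable description to a dataset in the metastore.
spark.sql("""
    ALTER TABLE lake.customers
    SET TBLPROPERTIES ('comment' = 'One row per customer; refreshed daily by the CRM upsert job')
""")

# A lightweight data dictionary view: tables in a database and their columns.
for table in spark.catalog.listTables("lake"):
    print(table.name, "-", table.description)
    for column in spark.catalog.listColumns(table.name, "lake"):
        print("   ", column.name, column.dataType, column.description or "")
```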

Regulatory And Compliance Needs

New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around the Right to Erasure and the Right to Be Forgotten. Therefore, the ability to delete specific subsets of data without disrupting a data management process is essential. In addition to the throughput of DELETE itself, you need support for special handling of PCI/PII data and for auditability (a minimal sketch of such a row-level delete follows).
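
Here is a minimal sketch of such a targeted, row-level delete, assuming an ACID table format (Delta Lake, Hudi or Iceberg) that supports DELETE through Spark SQL; the table names, the customer_id key and the audit table are illustrative assumptions.

```python
# Minimal right-to-erasure sketch on an ACID table format that supports
# row-level DELETE; table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("erasure-sketch").getOrCreate()

# Remove a single data subject without rewriting or disrupting the whole dataset.
spark.sql("DELETE FROM lake.customer_events WHERE customer_id = 'c-12345'")

# Record the action for auditability (an illustrative audit table).
spark.sql("""
    INSERT INTO lake.erasure_audit
    VALUES ('c-12345', 'customer_events', current_timestamp())
""")
```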

Permissioning And Financial Governance

Using the Apache Ranger open-source framework, which provides granular table-, row- and column-level access control, architects can grant permissions against user roles already defined in the identity and access management (IAM) solutions of cloud service providers (an illustrative policy sketch follows). With wide-ranging usage, monitoring and audit capabilities are essential to detect access violations and flag adversarial queries. To give P&L owners and architects a bird's-eye view of usage, they also need cost attribution and exploration capabilities at the cluster, job and user level from a single interface.
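
To illustrate the permissioning model, the sketch below creates a column-level policy through Ranger's admin REST API from Python. The endpoint path follows Ranger's public v2 API, but the service name, resource values, group, host and credentials are illustrative assumptions, and the exact policy fields can vary across Ranger versions and deployments.

```python
# Illustrative sketch of creating a column-level Apache Ranger policy via the
# Ranger admin REST API; endpoint, service and field values are assumptions.
import requests

policy = {
    "service": "hive_lake",                  # illustrative Ranger service name
    "name": "analysts-customers-read",
    "resources": {
        "database": {"values": ["lake"]},
        "table": {"values": ["customers"]},
        "column": {"values": ["customer_id", "segment"]},  # exclude PII columns
    },
    "policyItems": [{
        "groups": ["analysts"],              # role defined in the identity/IAM layer
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(
    "https://ranger-admin.example.com/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "password"),              # use proper credential management
)
resp.raise_for_status()
```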

Conclusion

These data lake best practices can help you build a sustainable advantage from the data you collect. A cloud data lake breaks down data silos and supports multiple analytics workloads, at scale and at lower cost.

P.S – This article was first published on https://www.qubole.com/



