The Key to Building Data Pipelines for Machine Learning: Support for Multiple Engines

May 15, 2020


As a consumer of goods and services, you experience the results of machine learning (ML) whenever the institutions you rely on use ML processes to run their operations. You may receive a text message from a bank requiring verification after the bank has paused a credit card transaction. Or, an online travel site may send you an email that offers personalized accommodations for your next personal or business trip.

The work that happens behind the scenes to make these experiences possible is easy to overlook and difficult to fully appreciate. An important portion of that work is done by the data engineering teams that build the data pipelines to help train and deploy those ML models. Once focused on building pipelines to support traditional data warehouses, today’s data engineering teams now build more technically demanding continuous data pipelines that feed applications with artificial intelligence and ML algorithms. These data pipelines must be cost-effective, fast, and reliable regardless of the type of workload and use case.

Big Data Engines For ML

Due to the diversity of data sources and the volume of data that needs to be processed, traditional data processing tools fail to meet the performance and reliability requirements of modern machine learning applications. The need to build reliable pipelines for these workloads, coupled with advances in distributed high-performance computing, has given rise to big data processing engines such as Hadoop.

Let’s quickly review the different engines and frameworks often used in data engineering aimed at supporting ML efforts:

Apache Hadoop/Hive


Hive is an Apache open-source project built on top of Hadoop for querying, summarizing, and analyzing large data sets using a SQL-like interface. Apache Hive is used mostly for batch processing of large ETL jobs and batch SQL queries on very large data sets, as well as for exploration of large volumes of structured, semi-structured, and unstructured data.
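To make that usage pattern concrete, here is a minimal sketch of submitting a batch Hive query from Python through the PyHive client. It assumes a reachable HiveServer2 endpoint; the host, username, and the web_logs table are hypothetical placeholders.

```python
# A hedged sketch of a batch Hive query via PyHive (pip install pyhive).
# The host, username, and table name below are hypothetical.
from pyhive import hive

# Connect to a HiveServer2 endpoint (assumed to be listening on port 10000).
conn = hive.connect(host="hive.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# A typical Hive workload: a batch aggregation over a very large table.
cursor.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM web_logs
    GROUP BY event_date
""")

for event_date, events in cursor.fetchall():
    print(event_date, events)

conn.close()
```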

Apache Spark


Spark is a general-purpose open-source computational engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Spark is also distributed, scalable, fault tolerant, flexible, and fast.
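As an illustration of that programming model, the sketch below reads raw JSON, derives a column, aggregates, and writes Parquet with PySpark; all paths and column names are invented for the example.

```python
# A minimal PySpark ETL sketch; all paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Read raw data, reshape it, and write a curated output table.
orders = spark.read.json("hdfs:///raw/orders")

daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("created_at"))  # derived column
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("hdfs:///curated/daily_revenue")
```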

Presto


Presto is an open-source SQL query engine developed by Facebook. Presto is used for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was built to provide SQL query capabilities against disparate data sources, which lets users combine data from multiple sources in a single query.
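The snippet below sketches that cross-source capability: one Presto query joining a Hive table to a MySQL table. It assumes a reachable Presto coordinator and uses PyHive's Presto client; the catalog, schema, and table names are hypothetical.

```python
# A hedged sketch of a federated Presto query via PyHive's Presto client.
# The coordinator host and the catalog/schema/table names are hypothetical.
from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080)
cursor = conn.cursor()

# One query that combines two different data sources in a single join.
cursor.execute("""
    SELECT c.segment, SUM(o.amount) AS total_spend
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.segment
""")

print(cursor.fetchall())
```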

Airflow


Although technically not a big data engine, Airflow is an open-source tool to programmatically author, schedule, and monitor data workflows. With Airflow, users can author workflows as directed acyclic graphs (DAGs) of tasks. A DAG is the set of tasks needed to complete a pipeline, organized to reflect their relationships and interdependencies.
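A minimal DAG might look like the sketch below (Airflow 2.x imports assumed); the task commands and schedule are placeholders, not a real pipeline.

```python
# A minimal Airflow DAG sketch (Airflow 2.x imports); the commands and
# schedule below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The >> operator declares the DAG edges: extract -> transform -> load.
    extract >> transform >> load
```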

Leveraging Popular Engines/Frameworks For Data Engineering

While the typical steps of data engineering may be standardized across industries — data exploration, building pipelines, data orchestration, and delivery — organizations have various workload types and data engineering needs. That’s where support for multiple engines and workloads becomes essential. Let’s talk about which engines are most effective for each stage of the data engineering cycle:

Data Exploration

Data engineering always starts with data exploration. Data exploration involves inspecting the data to understand its characteristics and what it represents. The insights gained during this stage shape the amount and type of work that data scientists will conduct during their data preparation phase.

Hadoop/Hive is an excellent choice for exploring larger unstructured data sets because of its inexpensive storage and its SQL compatibility. Spark works well for data sets that require a programmatic approach, such as the file formats widely used in healthcare insurance processing. Presto, on the other hand, provides a quick and easy way to access data from a variety of sources using the industry-standard SQL query language.
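For a concrete picture of this stage, the sketch below runs a first pass over a raw file with PySpark; the input path and the claim_type column are hypothetical.

```python
# A first-pass data exploration sketch in PySpark; the path and the
# claim_type column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explore").getOrCreate()

claims = spark.read.option("header", True).csv("hdfs:///raw/claims")

claims.printSchema()                          # inspect the inferred structure
claims.describe().show()                      # basic summary statistics
claims.groupBy("claim_type").count().show()   # distribution of a key field
```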

Building Data Pipelines

Data pipelines carry and process data from data sources to the business intelligence (BI) and ML applications that take advantage of it. These pipelines consist of multiple steps: reading data, moving it from one system to the next, reformatting it, joining it with other data sources, and adding derived columns (feature engineering).
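Those steps map naturally onto a handful of DataFrame operations. The sketch below (PySpark, with hypothetical paths and columns) reads two sources, joins them, and adds derived feature columns before writing the result.

```python
# A sketch of the pipeline steps above in PySpark: read, join, derive
# features, write. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline").getOrCreate()

orders = spark.read.parquet("hdfs:///raw/orders")          # read
customers = spark.read.parquet("hdfs:///raw/customers")    # read

features = (
    orders
    .join(customers, "customer_id")                        # join sources
    .withColumn("order_dow", F.dayofweek("order_date"))    # derived column
    .withColumn(                                           # feature engineering
        "is_weekend",
        F.col("order_dow").isin(1, 7).cast("int"),
    )
)

features.write.mode("overwrite").parquet("hdfs:///features/orders")
```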

When the persistence of large data sets is important, Hive offers diverse computational techniques and is cost-effective. Alternatively, Spark offers an in-memory computational engine and may be the better choice if processing speed is critical. Spark also offers the best facilities for near-real-time data streaming, allowing engineers to write streaming jobs the same way they write batch jobs. However, micro-batching with Hive may be a workable and more economical option.
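The sketch below shows what that symmetry looks like with Spark Structured Streaming: the transformation is ordinary DataFrame code, and only the read/write endpoints change. The broker, topic, and paths are hypothetical, and the Kafka source additionally requires the spark-sql-kafka connector package on the classpath.

```python
# A Spark Structured Streaming sketch: a streaming job written like a
# batch job. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream").getOrCreate()

events = (
    spark.readStream                       # only change from batch: readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.com:9092")
    .option("subscribe", "events")
    .load()
)

# The transformation itself is ordinary DataFrame code.
parsed = events.select(F.col("value").cast("string").alias("raw_event"))

query = (
    parsed.writeStream                     # ...and writeStream instead of write
    .format("parquet")
    .option("path", "hdfs:///streams/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```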

Orchestrating Data Pipelines

Orchestration of data pipelines refers to the sequencing, coordination, scheduling, and management of complex data pipelines from diverse sources, with the aim of delivering data sets that are ready for consumption by business intelligence applications, data science ML models, or both.

Airflow integrates natively with big data systems such as Hive, Presto, and Spark, making it an ideal framework for orchestrating jobs that run on any of these engines. As a result, Airflow works well with workloads that follow the batch processing model.
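As a sketch of that integration, the DAG below chains a Hive task to a Spark job using Airflow's provider operators (the apache-airflow-providers-apache-hive and apache-airflow-providers-apache-spark packages are assumed installed); the query, default connections, and application path are hypothetical.

```python
# A hedged sketch of Airflow orchestrating Hive and Spark tasks via the
# Apache provider operators; the HQL and application path are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.providers.apache.spark.operators.spark_submit import (
    SparkSubmitOperator,
)

with DAG(
    dag_id="ml_feature_pipeline",
    start_date=datetime(2020, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Batch ETL step running on Hive.
    build_staging = HiveOperator(
        task_id="build_staging",
        hql="INSERT OVERWRITE TABLE staging.events SELECT * FROM raw.events",
    )

    # Feature computation step submitted to Spark.
    compute_features = SparkSubmitOperator(
        task_id="compute_features",
        application="/jobs/compute_features.py",
    )

    build_staging >> compute_features
```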

Delivering Data Sets

Qubole’s multi-engine platform allows data engineers to build, update, and refine data pipelines in order to reliably and cost-effectively deliver data sets on predefined schedules or on demand. Qubole provides the ability to publish data through notebooks or templates and deliver the data to downstream advanced analytics and ML applications.

At Qubole we believe your use case should determine the delivery engine. For example, if the information is going to be delivered as a dashboard, or the intention is to probe the resulting data sets with low-latency SQL queries, then Presto would be the optimal choice. With Presto, queries run faster than with Spark because there is no provision for mid-query fault tolerance. Spark, on the other hand, supports mid-query fault tolerance and will recover in case of a failure — but the bookkeeping required to plan for failure slows Spark’s queries, a cost that is paid even when no failure occurs.

Why Use Multiple Engines?

When building a house, you choose different tools for different tasks — because it is impossible to build a house using only one tool. Similarly, when building data pipelines, you should choose the optimal big data engine by considering your specific use case and the specific business needs of your company or department.

Building data pipelines calls for a multi-engine platform with the ability to autoscale. The table below shows how Qubole customers apply different engines to fulfill different stages of the data engineering function when building ML data pipelines:

[Table: Most Common Engine and Framework Usage Patterns]




Qubole Technologies
