
DStreams vs. DataFrames: Two Flavors of Spark Streaming

April 28, 2020


This post is a guest publication written by Yaroslav Tkachenko, a Software Architect at Activision.

Apache Spark is one of the most popular and powerful large-scale data processing frameworks. It was created as an alternative to Hadoop’s MapReduce framework for batch workloads, but now it also supports SQL, machine learning, and stream processing. Today I want to focus on Spark Streaming and show a few options available for stream processing.

Stream data processing is used when dynamic data is generated continuously, and it is often found in big data use cases. In most instances, data is processed in near-real time, one record at a time, and the insights derived from the data are used to provide alerts, render dashboards, and feed machine learning models that can react quickly to new trends within the data.

DStreams vs. DataFrames

Spark Streaming went alpha with Spark 0.7.0. It's based on the idea of discretized streams, or DStreams. Each DStream is represented as a sequence of RDDs, so it's easy to use if you're coming from low-level RDD-backed batch workloads. DStreams have seen a lot of improvements since then, but various challenges remain, primarily because it's a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL's code and memory optimizations. Structured Streaming also gives very powerful abstractions like the Dataset/DataFrame APIs as well as SQL. No more dealing with RDDs directly!
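
To give a sense of that API, here is a minimal sketch of a Structured Streaming query that consumes from Kafka and maintains a running record count. It assumes the spark-sql-kafka-0-10 package is on the classpath; the broker address and the "telemetry" topic are placeholders.

// Structured Streaming: Kafka source, running count, console sink
val telemetry = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "telemetry")                    // placeholder topic
  .load()

// Kafka values arrive as binary; cast to string before processing
val counts = telemetry
  .selectExpr("CAST(value AS STRING) AS json")
  .groupBy()
  .count()

// Spark maintains the aggregate across micro-batches for us
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()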

Both Structured Streaming and Streaming with DStreams use micro-batching. The biggest differences are latency and message delivery guarantees: Structured Streaming offers exactly-once delivery with 100+ millisecond latency, whereas the Streaming with DStreams approach only guarantees at-least-once delivery but can provide millisecond latencies.

I personally prefer Spark Structured Streaming for simple use cases, but Spark Streaming with DStreams is really good for more complicated topologies because of its flexibility. That's why, below, I want to show how to use Streaming with DStreams and Streaming with DataFrames (which is typically used with Spark Structured Streaming) for consuming and processing data from Apache Kafka. I'm going to use Scala, Apache Spark 2.3, and Apache Kafka 2.0.

Also, for the sake of the example, I will run my jobs using Apache Zeppelin notebooks provided by Qubole. Qubole is a data platform that I use daily. It manages Hadoop and Spark clusters, makes it easy to run ad hoc Hive and Presto queries, and also provides managed Zeppelin notebooks that I happily use. With Qubole I don't need to think much about configuring and tuning Spark and Zeppelin; it's just handled for me.

The actual use case I have is very straightforward:

  • Some sort of telemetry is written to Kafka: small JSON messages with metadata and arbitrary key/value pairs
  • I want to connect to Kafka, consume, and deserialize those messages
  • Then apply transformations if needed
  • Collect some aggregations
  • Finally, I'm interested in anomalies and generally bad data: since I don't control the producer, I want to catch things like NULLs, empty strings, and possibly incorrect dates and other values with specific formats
  • The job should run for some time, then automatically terminate. Typically, Spark Streaming jobs run continuously, but sometimes it might be useful to run one ad hoc for analysis/debugging (or, as in my example, simply because it's so easy to run a Spark job in a notebook)

Streaming With DStreams

In this approach we use DStreams, where each DStream is simply a sequence of RDDs.
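
A minimal sketch of such a job could look like the following. It assumes a Zeppelin-style notebook where sc (the SparkContext) already exists and the spark-streaming-kafka-0-10 package is on the classpath; the broker address, the "telemetry" topic, the consumer group, and the "name" field are all placeholders.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.json4s._
import org.json4s.jackson.JsonMethods.parseOpt

val ssc = new StreamingContext(sc, Seconds(5)) // 5-second micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "telemetry-dstreams",                 // hypothetical consumer group
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("telemetry"), kafkaParams) // placeholder topic
)

// Low-level, RDD-style processing: deserialize each JSON message and flag
// records whose (hypothetical) "name" field is missing, NULL, or empty
stream.map(_.value).foreachRDD { rdd =>
  val total = rdd.count()
  val bad = rdd.filter { json =>
    parseOpt(json).map(_ \ "name") match {
      case Some(JString(s)) => s.trim.isEmpty // empty string
      case _                => true           // unparsable, missing, or NULL
    }
  }.count()
  println(s"Batch: $total records, $bad suspicious")
}

ssc.start()
ssc.awaitTerminationOrTimeout(60 * 1000) // run for a minute, then terminate
ssc.stop(stopSparkContext = false)

The awaitTerminationOrTimeout call is what gives us the run-for-a-while-then-terminate behavior from the requirements above.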

Streaming With DataFrames

Now we can try to combine Streaming with the DataFrames API to get the best of both worlds!
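
Here is a sketch of the hybrid approach, reusing the Kafka setup from the previous snippet but replacing the low-level RDD processing: each micro-batch RDD is turned into a DataFrame, so aggregations and bad-data checks become plain SQL (spark is the SparkSession provided by the notebook; the view and field names are still placeholders).

import spark.implicits._ // provides the Encoder needed by createDataset

stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Let Spark SQL infer a schema from the raw JSON strings
    val df = spark.read.json(spark.createDataset(rdd))
    df.createOrReplaceTempView("telemetry")

    // Aggregations and data-quality checks expressed as SQL per micro-batch
    spark.sql(
      """SELECT count(*) AS total,
        |       count(CASE WHEN name IS NULL OR trim(name) = '' THEN 1 END) AS bad
        |FROM telemetry""".stripMargin
    ).show()
  }
}

ssc.start()
ssc.awaitTerminationOrTimeout(60 * 1000)
ssc.stop(stopSparkContext = false)

Note that inferring the schema on every micro-batch is convenient for exploration but adds overhead; for a production job you would declare the schema explicitly.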

Conclusion

Which approach is better? Since a DStream is just a sequence of RDDs, it's typically used for low-level transformations and processing. Adding the DataFrames API on top of that provides very powerful abstractions like SQL, but requires a bit more configuration. And if you have a simple use case, Spark Structured Streaming might be a better solution in general!

