Chapter 1

Chapter 1 introduces streaming data, highlighting its significance and the differences between real-time and streaming data systems. It explains the continuous flow of data from various sources, the importance of stream processing for real-time analytics, and the benefits and challenges of implementing streaming data architectures. The chapter also discusses the architectural blueprint for streaming systems and provides examples of real-world applications such as Lyft and YouTube.

Uploaded by

rajendranmani.p

Chapter 1. Introducing streaming data

This chapter covers
• Differences between real-time and streaming data systems
• Why streaming data is important
• The architectural blueprint
• Security for streaming data systems

Overview:

Data is flowing everywhere around us, through phones, credit cards, sensor-equipped buildings,
vending machines, thermostats, trains, buses, planes, posts to social media, digital pictures and
video—and the list goes on.

What is Streaming?
The term "streaming" is used to describe continuous, never-ending data streams
with no beginning or end, that provide a constant feed of data that can be
utilized/acted upon without needing to be downloaded first.

Similarly, data streams are generated by all types of sources, in various formats and
volumes. From applications, networking devices, and server log files, to website
activity, banking transactions, and location data, they can all be aggregated to
seamlessly gather real-time information and analytics from a single source of truth.

What is data streaming?

Data streaming is the process of continuously collecting data as it's generated
and moving it to a destination. This data is usually handled by stream processing
software to analyze, store, and act on this information. Data streaming combined
with stream processing produces real-time intelligence.

Also known as event stream processing, data streaming is the continuous flow of
data generated by various sources. With stream processing technology, data streams
can be processed, stored, analyzed, and acted upon as they are generated, in real
time.

How Streaming Data Works

In previous years, legacy infrastructure was much more structured because it only
had a handful of sources that generated data. The entire system could be
architected in a way to specify and unify the data and data structures. With the
advent of stream processing systems, the way we process data has changed
significantly to keep up with modern requirements.

Overview of Stream Data Processing


Today's data is generated by a practically unlimited number of sources: IoT sensors,
servers, security logs, applications, and internal or external systems. It's almost
impossible to regulate structure or data integrity, or to control the volume or
velocity of the data generated.

While traditional solutions are built to ingest, process, and structure data before it
can be acted upon, streaming data architecture adds the ability to consume, persist
to storage, enrich, and analyze data in motion.

Requirements:

As such, applications working with data streams will always require two main
functions: storage and processing. Storage must be able to record large streams of
data in a way that is sequential and consistent. Processing must be able to interact
with storage, consume, analyze and run computation on the data.
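
The two functions above can be sketched in a few lines. This is a minimal
illustration, assuming an in-memory list stands in for durable storage; the names
AppendOnlyLog and running_total are hypothetical and do not come from any
particular streaming platform.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class AppendOnlyLog:
    """Storage: records events sequentially; each event gets a stable offset."""
    _events: List[Any] = field(default_factory=list)

    def append(self, event: Any) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> Iterator[Tuple[int, Any]]:
        """Yield (offset, event) pairs in the order they were written."""
        for i in range(offset, len(self._events)):
            yield i, self._events[i]


def running_total(log: AppendOnlyLog, start: int = 0) -> Tuple[int, float]:
    """Processing: consume from storage and compute over what was read.

    Returns the next offset to read from and the sum of the amounts seen.
    """
    total = 0.0
    next_offset = start
    for offset, event in log.read_from(start):
        total += event["amount"]
        next_offset = offset + 1
    return next_offset, total


# Hypothetical transactions written to storage, then processed.
log = AppendOnlyLog()
for amount in (10.0, 2.5, 7.5):
    log.append({"amount": amount})

next_offset, total = running_total(log)
print(next_offset, total)
```

Returning the next offset lets the processor resume where it left off, which is
one simple way to keep reads sequential and consistent.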

This also brings up additional challenges and considerations when working with
legacy databases or systems. Many platforms and tools are now available to help
companies build streaming data applications.

Data streams combine various sources and formats to create a comprehensive view
of operations. For instance, combining network, server, and application data can
monitor website health and quickly detect performance issues or outages.
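
As a rough sketch of combining sources, the snippet below merges three already
time-ordered streams into one view. The event tuples and the merge_streams helper
are assumptions made for illustration; real systems would read from live feeds.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (timestamp in seconds, source, message); the layout is an assumption.
Event = Tuple[float, str, str]


def merge_streams(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by timestamp."""
    return heapq.merge(*streams, key=lambda event: event[0])


# Hypothetical sample data from three sources.
network = [(1.0, "network", "latency 20 ms"), (4.0, "network", "latency 900 ms")]
server = [(2.0, "server", "cpu 35%"), (5.0, "server", "cpu 97%")]
application = [(3.0, "application", "HTTP 200"), (6.0, "application", "HTTP 503")]

combined = list(merge_streams(network, server, application))
sources_in_order = [source for _, source, _ in combined]
for event in combined:
    print(event)
```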

Examples
Some real-life examples of streaming data include use cases in every industry,
including real-time stock trades, up-to-the-minute retail inventory management,
social media feeds, multiplayer games, and ride-sharing apps.

For example, when a passenger requests a Lyft ride, real-time streams of data join
together to create a seamless user experience. From this data, the application
pieces together real-time location tracking, traffic conditions, and pricing to
match the rider with the best possible driver, calculate the fare, and estimate
time to destination based on both real-time and historical data.

In this sense, streaming data is the first step for any data-driven organization, fueling
big data ingestion, integration, and real-time analytics.

1.5. Stream Processing:

Streaming the data is only half the battle. You also need to process that data to
derive insights.

Stream processing software is configured to ingest the continual data flow down the
pipeline and analyze that data for patterns and trends. Stream processing may also
include data visualization for dashboards and other interfaces so that data
personnel may also monitor these streams.

Data streams and stream processing are combined to produce real-time or near
real-time insights. To accomplish this, stream processors need to offer low latency
so that analysis happens as quickly as data is received. A drop in performance by
the stream processor can lead to a backlog or to data points being missed,
threatening data integrity.

Stream processing software needs to scale and be highly available. It should handle
spikes in traffic and have redundancies to prevent software crashes. Crashes reduce
your data quality since the stream is not analyzed for however long the outage
persists.
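
To make "analyze that data for patterns and trends" concrete, here is a minimal
sketch of one common technique, a rolling average over the most recent readings.
The window size, the threshold of 300 ms, and the sample values are all
assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_average(values: Iterable[float], window: int) -> Iterator[Tuple[float, float]]:
    """Yield (value, average of the last `window` values) as each value arrives."""
    recent: Deque[float] = deque(maxlen=window)  # old readings drop off automatically
    for value in values:
        recent.append(value)
        yield value, sum(recent) / len(recent)


# Hypothetical response times in milliseconds, arriving one at a time.
response_times_ms = [100, 110, 90, 400, 420, 410]

# Flag each reading at which the rolling average exceeds the assumed threshold.
alerts = [
    value
    for value, average in rolling_average(response_times_ms, window=3)
    if average > 300
]
print(alerts)
```

Because each reading is handled as it arrives and only the last few are kept in
memory, the work per event stays small, which is what keeps latency low.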

Benefits of Data Streaming

Data streaming provides real-time insight by leveraging the latest internal and
external information to inform decision-making in day-to-day operations and overall
strategy.

Let's examine a few more benefits of data streaming.

Increase ROI

Real-time intelligence gives companies a competitive edge by enabling quick data
collection, analysis, and action. It enhances responsiveness to market trends,
customer needs, and business opportunities, making it a valuable distinguishing
feature in the fast-paced digitalized business environment.

Increase Customer Satisfaction

Responding quickly to customer complaints and providing resolutions improves a
company's reputation, leading to positive word-of-mouth advertising and online
reviews that attract new prospects and convert them into customers.

Reduce Losses

Data streaming not only supports customer retention but also prevents losses by
providing real-time intelligence on potential issues such as system outages, financial
downturns, and data breaches. This allows companies to proactively mitigate the
impact of these events.

Data Stream Challenges to Consider

Data streaming opens a world of possibilities, but it also comes with challenges to
keep in mind as you incorporate real-time data into your applications.

1. Availability

Data needs to be accessed and logged in a datastore for historical context. If you
can't view previous subscription periods, you may miss opportunities to offer
valuable products or services based on a customer's purchase history.

2. Timeliness

Data streams must be constantly updated to avoid stale information and ensure that
the user's actions in one tab are reflected across all tabs.

3. Scalability

To avoid data loss during spikes in volume or system outages, it's crucial to build
failsafes into your system and provision extra computing and storage resources.

4. Ordering

Recording a sequence of customer interactions in your CRM provides deeper
insights than just tracking individual web page visits. For example, you can see when
a person has downloaded related eBooks, viewed a product demo, and visited the
product page, giving you a clearer understanding of their interest in the product.
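
One way to preserve ordering when events arrive out of sequence is to hold early
arrivals until the gap is filled. The sketch below assumes every event carries a
sequence number starting at 1; the in_order helper and the sample interactions
are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def in_order(events: Iterable[Tuple[int, str]], first_seq: int = 1) -> Iterator[Tuple[int, str]]:
    """Release events in sequence order, holding back any that arrive early."""
    pending: Dict[int, str] = {}
    expected = first_seq
    for seq, action in events:
        pending[seq] = action
        # Emit every event that is now contiguous with what was already emitted.
        while expected in pending:
            yield expected, pending.pop(expected)
            expected += 1


# Hypothetical customer interactions, arriving out of order.
arrived = [
    (2, "viewed product demo"),
    (1, "downloaded eBook"),
    (3, "visited product page"),
]
ordered = list(in_order(arrived))
print(ordered)
```

Note that this simple version waits indefinitely for a missing sequence number;
a production system would typically add a timeout or a bound on the buffer.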

1.1. What is a real-time system?


Real-time systems and real-time computing have been around for decades, but with the advent of
the internet they have become very popular. Unfortunately, with this popularity has come
ambiguity and debate. What constitutes a real-time system?

Real-time systems are classified as hard, soft, and near. The definitions for hard and soft
real-time are based on Hermann Kopetz’s book Real-Time Systems (Springer, 2011). For near
real-time, we use the definition found in the Portland Pattern Repository’s Wiki
(http://c2.com/cgi/wiki?NearRealTime): “Denoting or relating to a data-processing system that
is slightly slower than real-time.” To help clear up the ambiguity, table 1.1 breaks out the
common classifications of real-time systems along with the prominent characteristics by which
they differ.

Table 1.1. Classification of real-time systems

Classification | Examples | Latency measured in | Tolerance for delay
Hard | Pacemaker, anti-lock brakes | Microseconds–milliseconds | None: total system failure, potential loss of life
Soft | Airline reservation system, online stock quotes, VoIP (Skype) | Milliseconds–seconds | Low: no system failure, no life at risk
Near | Skype video, home automation | Seconds–minutes | High: no system failure, no life at risk

You can identify hard real-time systems fairly easily. They are almost always found in embedded
systems and have very strict time requirements that, if missed, may result in total system failure.
The design and implementation of hard real-time systems are well studied in the literature.

Determining whether a system is soft or near real-time is harder, because the overlap in their
definitions often results in confusion. Here are three examples:

• Someone you are following on Twitter posts a tweet, and moments later you see the tweet
in your Twitter client.
• You are tracking flights around New York using the real-time Live Flight Tracking
service from FlightAware (http://flightaware.com/live/airport/KJFK).
• You are using the NASDAQ Real Time Quotes application
(www.nasdaq.com/quotes/real-time.aspx) to track your favorite stocks.

Although these systems are all quite different, figure 1.1 shows what they have in common.

Figure 1.1. A generic real-time system with consumers

In each of the examples, is it reasonable to conclude that the time delay may only last for
seconds, no life is at risk, and an occasional delay for minutes would not cause total system
failure? If someone posts a tweet, and you see it almost immediately, is that soft or near real-
time? What about watching live flight status or real-time stock quotes? Some of these can go
either way: what if there were a delay in the data due to slow Wi-Fi at the coffee shop or on the
plane? As you consider these examples, the line differentiating soft and near real-time becomes
blurry, at times disappears, is very subjective, and may often depend on the consumer of the data.

Now let’s change our examples by taking the consumer out of the picture and focusing on the
services at hand:
• A tweet is posted on Twitter.
• The Live Flight Tracking service from FlightAware is tracking flights.
• The NASDAQ Real Time Quotes application is tracking stock quotes.

We don’t know how these systems work internally, but the essence of what we are asking is
common to all of them. It can be stated as follows:

Is the process of receiving data all the way to the point where it is ready for consumption a soft or
near real-time process?

Graphically, this looks like figure 1.2.

Figure 1.2. A generic real-time system with no consumers

Does focusing on the data processing and taking the consumers of the data out of the picture
change your answer? For example, how would you classify the following?

• A tweet posted to Twitter
• A tweet posted by someone you follow, and your seeing it in your Twitter client

If you classified them differently, why? Was it due to the lag or perceived lag in seeing the tweet
in your Twitter client? After a while, the line between whether a system is soft or near real-time
becomes quite blurry. Often people settle on calling them real-time.

1.2. Differences between real-time and streaming systems

A system may be labeled soft or near real-time based on the perceived delay experienced by
consumers. We have seen, with simple examples, how the distinction between the types of
real-time system can be hard to discern. This can become a larger problem in systems that
involve more people in the conversation. Our goal here is to settle on a common language we can
use to describe these systems. When you look at the big picture, we are trying to use one term
to define two parts of a larger system. As illustrated in figure 1.3, the end result is that
the term breaks down, making it very difficult to communicate with others about these systems
because we don’t have a clear definition.

Figure 1.3. Real-time computation and consumption split apart


On the left-hand side of figure 1.3 we have the non-hard real-time service, or
the computation part of the system, and on the right-hand side we have the clients, called
the consumption side of the system.

DEFINITION: STREAMING DATA SYSTEM

In many scenarios, the computation part of the system is operating in a non-hard real-time
fashion, but the clients may not be consuming the data in real time due to network delays,
application design, or a client application that isn’t even running. Put another way, what we have
is a non-hard real-time service with clients that consume data when they need it. This is called
a streaming data system—a non-hard real-time system that makes its data available at the
moment a client application needs it. It’s neither soft nor near—it is streaming.
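
The definition can be illustrated with a small sketch: the service keeps
computing as data arrives, and a client asks for data only when it needs it.
The class name StreamingQuoteService, its methods, and the ticker symbols are
hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


class StreamingQuoteService:
    """Computation side: keeps the latest value per key as data arrives."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}

    def ingest(self, symbol: str, price: float) -> None:
        # Runs continuously, whether or not any client is connected.
        self._latest[symbol] = price

    def latest(self, symbol: str) -> Optional[float]:
        """Consumption side: called by a client at the moment it needs data."""
        return self._latest.get(symbol)


service = StreamingQuoteService()
service.ingest("ABC", 10.0)
service.ingest("ABC", 10.5)
service.ingest("XYZ", 99.0)

# The client was not listening to every update; it simply asks when it needs to.
quote = service.latest("ABC")
print(quote)
```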

Figure 1.4 shows the result of applying this definition to our example architecture from figure
1.3.

Figure 1.4. A first view of a streaming data system

The concept of streaming data eliminates the confusion of soft versus near and real-time versus
not real-time, allowing us to concentrate on designing systems that deliver the information a
client requests at the moment it is needed. Returning to our earlier examples from the
standpoint of streaming, see if you can split each one up and recognize the streaming data
service and the streaming client:

• Someone you are following on Twitter posts a tweet, and moments later you see the tweet
in your Twitter client.
• You are tracking flights around New York using the real-time Live Flight Tracking
service from FlightAware.
• You are using the NASDAQ Real Time Quotes application to track your favorite stocks.

From the streaming perspective, each one can be described as follows:

• Twitter: A streaming system that processes tweets and allows clients to request the
latest tweets at the moment they are needed; some may be seconds old, and others may be
hours old.
• FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
• NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.

You have to think about and focus on what data a service makes available, and how it makes
that data available, to clients at the moment they need it. Such a system is an in-the-moment
system: any system that delivers the data at the point in time when it is needed. Although we
don’t know how these systems work behind the scenes, we are going to learn to assemble systems
that use open source technologies to consume, process, and present data streams.

Next, consider the differences between stream processing and traditional batch processing.

Batch Processing vs Real-Time Streams

Batch processing requires data to be downloaded before it is analyzed and stored,
while stream processing continuously ingests and analyzes data. Stream processing
is preferred for its speed, especially when real-time intelligence is needed. Batch
processing is used in scenarios where immediate analysis is not necessary or when
working with legacy technologies like mainframes.

With the complexity of today's modern requirements, legacy batch data processing
has become insufficient for most use cases, as it can only process data as groups of
transactions collected over time. Modern organizations need to act on up-to-the-
millisecond data, before the data becomes stale. Being able to access data in real-
time comes with numerous advantages and use cases.
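
The contrast can be shown with a minimal sketch that computes the same total both
ways. The function names and the sample transactions are assumptions made for
illustration.

```python
from typing import Iterable, Iterator, List


def batch_total(transactions: List[float]) -> float:
    """Batch: the whole group must be collected before one answer is produced."""
    return sum(transactions)


def stream_totals(transactions: Iterable[float]) -> Iterator[float]:
    """Stream: an up-to-date answer is available after every transaction."""
    total = 0.0
    for amount in transactions:
        total += amount
        yield total


# Hypothetical transaction amounts.
transactions = [5.0, 15.0, 30.0]

batch_result = batch_total(transactions)
stream_results = list(stream_totals(transactions))
print(batch_result, stream_results)
```

Both approaches end at the same figure; the difference is that the streaming
version exposes every intermediate result as soon as it exists.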


Data Stream Examples

Data streams capture critical real-time data, such as location, stock prices, IT system
monitoring, fraud detection, retail inventory, sales, and customer activity.

The following companies use some of these data types to power their business
activity.

1. Lyft

Lyft requires real-time data to match riders with drivers accurately, displaying current
vehicle availability and prices based on distance, demand, and traffic conditions.
This data needs to be instantly available to set accurate user expectations.
After the rider selects a service level, Lyft uses additional GPS and traffic data to
match the best driver to the rider based on vehicle availability, distance, driver
status, and expected time of arrival.

Lyft uses location data from the driver's phone to track their progress, match them
with other ride requests, and provide real-time updates on traffic conditions. They
have optimized their processors to handle and aggregate these data streams for an
enhanced customer experience.


2. YouTube

YouTube processes and stores a massive amount of data every hour: according to Statista,
more than 500 hours of video are uploaded every minute.

YouTube must ensure high availability to support creators' content and provide real-
time data to viewers, including view counts, comments, subscribers, and other
metrics. YouTube supports live videos with real-time interaction between content
creators and viewers, requiring critical instant data transfer for uninterrupted
conversations.


1.3. The architectural blueprint


With an understanding of real-time and streaming systems we can now turn our attention to the
architectural blueprint. Throughout our journey we are going to follow an architectural blueprint
that will enable us to talk about all streaming systems in a generic way—our pattern
language. Figure 1.5 depicts the architecture.

Figure 1.5. The streaming data architectural blueprint


Although our architecture calls out the different tiers, remember these tiers are not hard and
rigid, as you may have seen in other architectures. We will call them tiers, but we will use
them more like LEGO pieces, allowing us to design the correct solution for the problem at hand.
Our tiers don’t prescribe a deployment scenario; in many cases they are distributed across
different physical locations.

Let’s take one of our examples and see how Twitter’s service maps to our architecture:

• Collection tier: When a user posts a tweet, it is collected by the Twitter services.
• Message queuing tier: Undoubtedly, Twitter runs data centers in locations across the
globe, and conceivably the collection of a tweet doesn’t happen in the same location as
the analysis of the tweet.
• Analysis tier: Although I’m sure a lot of processing is done to those 140 characters,
suffice it to say, at a minimum for our examples, Twitter needs to identify the followers
of a tweet.
• Long-term storage tier: Even though we’re not going to discuss this optional tier in
depth in this book, you can probably guess that tweets going back in time imply that
they’re stored in a persistent data store.
• In-memory data store tier: The tweets that are mere seconds old are most likely held in
an in-memory data store.
• Data access: All Twitter clients need to be connected to Twitter to access the service.
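
The tier mapping above can be sketched as a toy pipeline. This is only an
illustration of how the tiers hand data to one another, not a description of how
Twitter works internally; every name and data structure here is an assumption.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical follower graph used by the analysis tier.
followers: Dict[str, List[str]] = {"alice": ["bob", "carol"]}

message_queue: Deque[Dict[str, str]] = deque()   # message queuing tier
long_term_store: List[Dict[str, str]] = []       # long-term storage tier
in_memory_store: Dict[str, List[str]] = {}       # in-memory data store tier


def collect(author: str, text: str) -> None:
    """Collection tier: accept a post and hand it to the queue."""
    message_queue.append({"author": author, "text": text})


def analyze() -> None:
    """Analysis tier: drain the queue and fan each post out to followers."""
    while message_queue:
        post = message_queue.popleft()
        long_term_store.append(post)
        for follower in followers.get(post["author"], []):
            in_memory_store.setdefault(follower, []).append(post["text"])


def timeline(user: str) -> List[str]:
    """Data access tier: a client asks for its latest posts when it needs them."""
    return in_memory_store.get(user, [])


collect("alice", "hello, stream")
analyze()
bob_timeline = timeline("bob")
print(bob_timeline)
```

Keeping each tier behind its own function mirrors the LEGO-piece idea: any one
of them could be swapped out or moved to another location without changing the
others.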

As an exercise, try decomposing the other two examples to see how they fit our streaming
architecture:

• FlightAware: A streaming system that processes the most recent flight status data and
allows a client to request the latest data for particular airports or flights.
• NASDAQ Real Time Quotes: A streaming system that processes the price quotes of all
stocks and allows clients to request the latest quote for particular stocks.

1.4. Security for streaming systems


Security is important in many cases, but it can be overlaid on this architecture naturally. Figure
1.6 shows how security can be applied to this architecture.

Figure 1.6. The architectural blueprint with security identified

1.5. How do we scale?


From a high level, there are two common ways of scaling a service: vertically and horizontally.

Vertical scaling lets you increase the capacity of your existing hardware (physical or virtual) or
software by adding resources. A restaurant is a good example of the limitations of vertical
scaling. When you enter a restaurant, you may see a sign that tells you the maximum occupancy.
As more patrons come in, more tables may be set up and more chairs added to accommodate the
crowd—this is scaling vertically. But when the maximum capacity is reached, you can’t seat any
more customers. In the end, the capacity is limited by the size of the restaurant. In the computing
world, adding more memory, CPUs, or hard drives to your server are examples of vertical
scaling. But as with the restaurant, you’re limited by the maximum capacity of the system,
physical or virtual.

Horizontal scaling approaches the problem from a different angle. Instead of continuing to add
resources to a server, you add servers. A highway is a good example of horizontal scaling.
Imagine a two-lane highway that was originally constructed to handle 2,000 vehicles an hour.
Over time more homes and commercial buildings are built along the highway, resulting in a load
of 8,000 vehicles per hour. As you might imagine (and perhaps have experienced), the results are
terrible traffic jams during rush hour and overall unpleasant commutes. To alleviate these issues,
more lanes are added to the highway—now it is horizontally scaled and can handle the traffic.
But it would be even more efficient if it could expand (add lanes) and contract (remove lanes)
based on traffic demands. At an airport security checkpoint, when there are few travelers TSA
closes down screening lines, and when the volume increases they open lines up. If you’re hosting
your service with one of the major cloud providers (Google, AWS, Microsoft Azure), you may be
able to take advantage of this elasticity—a feature they often call auto-scaling. The basic idea is
that as demand for your service increases, servers are automatically added, and as demand
decreases, servers are removed.

In modern-day system design, our goal is to have horizontal scaling, but that doesn’t mean that
we won’t use vertical scaling too. Vertical scaling is often employed to determine an ideal
resource configuration for a service, and then the service is scaled out. From here on, when
the topic of scaling comes up, the focus will be on horizontal, not vertical, scaling.
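
The auto-scaling idea described above reduces to a small calculation: how many
servers does the current load need, within configured limits? The sketch below
is a simplified assumption of such a rule; the function name, the capacity
figure, and the limits are hypothetical and do not reflect any specific cloud
provider's API.

```python
import math


def desired_servers(
    requests_per_second: float,
    capacity_per_server: float,
    min_servers: int = 1,
    max_servers: int = 10,
) -> int:
    """Return how many servers the current load calls for, within the limits."""
    needed = math.ceil(requests_per_second / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


# Assume each server handles 500 requests per second.
low = desired_servers(150, capacity_per_server=500)     # quiet period
peak = desired_servers(2200, capacity_per_server=500)   # rush hour
capped = desired_servers(9000, capacity_per_server=500) # beyond the configured limit
print(low, peak, capped)
```

Like the highway that adds and removes lanes, the count rises with demand and
falls when demand drops, but never goes outside the configured bounds.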

Figure 1.7. Architectural blueprint with emphasis on the first tier

We’re going to take on the tiers one at a time, starting from the left with the collection tier. Don’t
let the lack of emphasis on the message queuing tier in figure 1.7 bother you—in certain cases
where it serves a collection role, I’ll talk about it and clear up any confusion. Now, on to our first
tier, the collection tier—our entry point for bringing data into our streaming, in-the-moment
system.
