Stream processing tools
Apache tools: Kafka, Spark, Storm, and Flink
Originally developed at LinkedIn as a message queue, Kafka was
open-sourced and donated to the Apache Software Foundation in
2011. It has since evolved into an open-source data-streaming
platform.
Kafka is a stream processor that integrates applications and data streams
via an API. Like other Apache Software Foundation projects, Kafka has been
popularized by giants like Uber and Netflix. Because it can process data
concurrently and move large amounts of it quickly, Kafka is used for
big data streams, such as Netflix’s big data ingestion platform.
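To make the messaging idea concrete, here is a minimal in-memory sketch in Python of how producers and consumers can share a stream. It does not use Kafka or any Kafka client library. The `Topic` class, its `publish` and `poll` methods, and the sample events are hypothetical names chosen for illustration only. The sketch assumes the stream is kept as an append-only log and that each consumer group tracks its own read position.

```python
from collections import defaultdict


class Topic:
    """Toy append-only message log: a stand-in for a broker topic, not Kafka itself."""

    def __init__(self):
        self._log = []                     # messages in arrival order
        self._offsets = defaultdict(int)   # next unread position per consumer group

    def publish(self, message):
        """Append a message and return its position (offset) in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return unread messages for a consumer group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


topic = Topic()
for event in ({"user": "a", "action": "play"}, {"user": "b", "action": "pause"}):
    topic.publish(event)

# Two independent consumer groups each read the full stream at their own pace.
analytics_batch = topic.poll("analytics")
billing_batch = topic.poll("billing", max_messages=1)
```

Because messages stay in the log after they are read, several applications can consume the same stream without interfering with one another. A real deployment would replace this toy class with a broker and its client API.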
Apache Kafka can also be integrated with Apache Hive, a warehousing
solution, and with Hadoop for batch processing of the stored data. It can
also be used with Apache Spark, a big data processing engine. There are
other stream processing tools as well, such as Storm and Flink, which
handle distributed stream processing and mixed types of data processing.
Both can also be used as ETL tools or as batch processors integrated
with Hadoop.
Amazon tools: Kinesis Streams, Kinesis, and Firehose
Amazon Kinesis Streams is a scalable and customizable solution for processing and
analyzing data streams. Kinesis Streams provides a stream processor and also allows you to
build your own applications by using client libraries, connectors, and APIs.
Amazon Kinesis is a managed stream processing solution that you can use instead of building
your own. Besides managed processing and fine-tuning, Amazon Kinesis offers a wide range of
integrations with Apache services such as the Spark and Kafka tools mentioned earlier.
Amazon Firehose enables you to integrate data streams into existing BI tools, analytical
interfaces, or a warehouse. It can also help you fetch data and load it into Amazon's own
storage and warehousing solutions, such as S3 and the Redshift cloud warehouse.
The architecture you choose determines how many of these components your data passes
through. All of the tools above are available on the market, each for specific streaming needs.
Now let’s address the user-facing part that can serve as a real-time analytics interface.
Real-time analytical instruments
Azure Stream Analytics
Azure Stream Analytics is a stream processing platform by Microsoft paired
with its analytical interface Power BI. Both solutions are fully managed and
deployed in the cloud. Stream Analytics uses the Trill stream processor by Microsoft.
It integrates data and provides low-latency processing across multiple sources.
Power BI
Power BI is a general-purpose business intelligence tool that can be used
for both batch and real-time analytics. Microsoft's documentation contains
a guide to integrating Stream Analytics with Power BI. The integration lets
you connect data streams to Power BI and manage data by updating
dashboards and creating visuals via interactive elements and settings.
Google Cloud Stream Analytics
Google Cloud Stream Analytics offers similar stream processing
capabilities: the product includes a dedicated engine for data
ingestion, processing, and analysis. Data operations are handled
by three instruments:
• Pub/Sub is a messaging service and data ingestion tool, similar to Apache
Kafka. It allows you to publish event updates and move data into
staging storage.
• Dataflow is a tool for managing ingested data. Essentially, it's a data
transformation and formatting tool designed for both batch and stream
processing.
• BigQuery is a cloud warehousing solution by Google, which can be used as
storage for your data platform.
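The three roles above (ingestion, transformation, storage) can be sketched as a toy pipeline in plain Python. This sketch does not call Pub/Sub, Dataflow, or BigQuery. The names `publish`, `transform`, `run_pipeline`, `ingest_queue`, and `warehouse_table`, as well as the sample sensor messages, are assumptions made up for illustration.

```python
import json
from collections import deque

# Stage 1: ingestion buffer (the role a messaging service plays).
ingest_queue = deque()


def publish(raw_message: str) -> None:
    """Accept a raw JSON message and hold it in staging until it is processed."""
    ingest_queue.append(raw_message)


# Stage 2: transformation (the role a processing tool plays).
def transform(raw_message: str) -> dict:
    """Parse a raw message and normalize it: string ID, Fahrenheit to Celsius."""
    record = json.loads(raw_message)
    return {
        "sensor_id": str(record["id"]),
        "temperature_c": round((float(record["temp_f"]) - 32) * 5 / 9, 1),
    }


# Stage 3: storage (the role a warehouse table plays).
warehouse_table = []


def run_pipeline() -> int:
    """Drain the staging queue, transform each message, and load it into storage."""
    loaded = 0
    while ingest_queue:
        warehouse_table.append(transform(ingest_queue.popleft()))
        loaded += 1
    return loaded


publish('{"id": 7, "temp_f": 212}')
publish('{"id": 8, "temp_f": 32}')
rows_loaded = run_pipeline()
```

Keeping the stages separate is the point of the design: the ingestion buffer absorbs bursts of events, while the transformation step can change without touching how data arrives or where it is stored.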
Oracle Stream Analytics
Oracle Stream Analytics is a cloud-based platform that offers an all-in-one
solution for stream ingestion, processing, and visualization. The platform is
built on top of Apache Spark, so it’s compatible with other Apache
technologies. Data ingestion is done via Apache Kafka, while data
transformations, log capture, and real-time data integration can be done via
the dedicated Oracle GoldenGate package.
Other important considerations are visualization and data representation.
Oracle offers a dedicated BI tool, but provides little information about its
integration with the Stream Analytics platform. One possible reason is that
the platform has its own dedicated interface for working with tabular
real-time data and visualizing IoT streaming data.
IBM Streaming Analytics
IBM Streaming Analytics is available for building real-time analytical
applications. It’s powered by IBM Streams, a data platform for stream
processing, data ingestion/transformation, and analysis. Streams can be
deployed either in the cloud as part of IBM Cloud or on premises. Like all the
platforms mentioned above, IBM also supports Kafka as a messaging and
data ingestion instrument.