This demo shows users how to monitor Kafka streaming ETL deployments using Confluent Control Center.
The use case is a streaming pipeline built around live edits to real Wikipedia pages. The Wikimedia Foundation has IRC channels that publish edits to real wiki pages (e.g. #en.wikipedia, #en.wiktionary) in real time. Using Kafka Connect, a Kafka source connector kafka-connect-irc streams raw messages from these IRC channels, and a custom Kafka Connect transform kafka-connect-transform-wikiedit transforms these messages before they are written to Kafka. This demo uses KSQL for data enrichment, or you can optionally develop and run your own Kafka Streams application. A Kafka sink connector kafka-connect-elasticsearch then streams the data out of Kafka, applying another custom Kafka Connect transform called NullFilter. The data is materialized into Elasticsearch for analysis by Kibana.
Note: this is a Docker environment with all services running on one host. This demo is not to be used in production; it exists only to demonstrate the Confluent Platform. In production, Confluent Control Center should be deployed with a valid license and with its own dedicated metrics cluster, separate from the cluster with production traffic. Using a dedicated metrics cluster is more resilient because it continues to provide system health monitoring even if the production traffic cluster experiences issues.
- Since this repository uses submodules, git clone with the --recursive option:
$ git clone --recursive https://github.com/confluentinc/cp-demo
Otherwise, git clone and then use the submodule commands to initialize and update:
$ git clone https://github.com/confluentinc/cp-demo
$ cd cp-demo
$ git submodule init
Submodule 'kafka-connect-irc' (https://github.com/cjmatta/kafka-connect-irc) registered for path 'kafka-connect-irc'
Submodule 'kafka-connect-transform-wikiedit' (https://github.com/cjmatta/kafka-connect-transform-wikiedit) registered for path 'kafka-connect-transform-wikiedit'
$ git submodule update
- In the advanced Docker preferences settings, increase the memory available to Docker to at least 8GB (default is 2GB).
- From the cp-demo directory, run make clean all to build the IRC connector and the transformer that will parse the Wikipedia edit messages to data. These are saved to the connect-plugins path, which is a shared volume to the connect Docker container.
$ make clean all
...
$ ls connect-plugins
Note: If make has a FATAL error as shown below, it means this git repo was not cloned with the submodules. Please go back to step 1 above and correct this.
[FATAL] Non-readable POM /private/tmp/cp-demo/kafka-connect-irc/pom.xml: /private/tmp/cp-demo/kafka-connect-irc/pom.xml (No such file or directory)
- Start Docker Compose. It will take about 2 minutes for all containers to start and for Confluent Control Center GUI to be ready.
$ docker-compose up -d
- Verify that the Docker containers show "Up" state, except for the kafka-client container which is expected to have "Exit 0" state. If any containers are not up, verify in the advanced Docker preferences settings that the memory available to Docker is at least 8GB (default is 2GB).
$ docker-compose ps
Name                      Command                          State    Ports
------------------------------------------------------------------------------------------------------------------------------
cpdemo_connect_1          /etc/confluent/docker/run        Up       0.0.0.0:8083->8083/tcp, 9092/tcp
cpdemo_control-center_1   /etc/confluent/docker/run        Up       0.0.0.0:9021->9021/tcp
cpdemo_elasticsearch_1    /bin/bash bin/es-docker          Up       0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp
cpdemo_kafka-client_1     bash -c echo Waiting for K ...   Exit 0
cpdemo_kafka1_1           /etc/confluent/docker/run        Up       0.0.0.0:29092->29092/tcp, 0.0.0.0:9092->9092/tcp
cpdemo_kafka2_1           /etc/confluent/docker/run        Up       0.0.0.0:29093->29093/tcp, 9092/tcp, 0.0.0.0:9093->9093/tcp
cpdemo_kibana_1           /bin/sh -c /usr/local/bin/ ...   Up       0.0.0.0:5601->5601/tcp
cpdemo_ksql-cli_1         perl -e while(1){ sleep 99 ...   Up       0.0.0.0:9098->9098/tcp
cpdemo_schemaregistry_1   /etc/confluent/docker/run        Up       0.0.0.0:8081->8081/tcp
cpdemo_zookeeper_1        /etc/confluent/docker/run        Up       0.0.0.0:2181->2181/tcp, 2888/tcp, 3888/tcp
- Wait until Confluent Control Center is fully running. It is ready when the logs show the following event:
$ docker-compose logs -f control-center | grep -e HTTP
control-center_1 | [2017-09-06 16:37:33,133] INFO Started NetworkTrafficServerConnector@26a529dc{HTTP/1.1}{0.0.0.0:9021} (org.eclipse.jetty.server.NetworkTrafficServerConnector)
- Decide how you want to run the rest of the demo, with or without KSQL. There are two ways to run the demo because KSQL does not support Avro with Schema Registry at this time. When KSQL supports Avro with Schema Registry, we will collapse the workflows into one.
# With KSQL: data streams from Wikipedia IRC to KSQL to Elasticsearch. The Kafka source and sink connectors use JSON
$ export DEMOPATH=scripts_ksql
# Without KSQL: data streams straight through Kafka from Wikipedia IRC to Elasticsearch. The Kafka source and sink connectors use Avro with Confluent Schema Registry
$ export DEMOPATH=scripts_pipeline
- Set up the cluster and connectors
$ ./$DEMOPATH/setup.sh
- Use Google Chrome to view the Confluent Control Center GUI at http://localhost:9021. Click on the top right button that shows the current date, and change Last 4 hours to Last 30 minutes.
- View the data in the Kibana dashboard at http://localhost:5601/app/kibana#/dashboard/Wikipedia
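The container state check from the setup steps can also be scripted. The following is a minimal sketch and not part of the cp-demo repository: check_container_states is a hypothetical helper that reads docker-compose ps output on stdin and prints every container whose state differs from what this demo expects (Up for all containers except kafka-client, which should be Exit 0). It assumes the column layout shown in the sample output above.

```shell
# Hypothetical helper, not part of cp-demo.
# Reads `docker-compose ps` output on stdin, prints the name of each container
# in an unexpected state, and returns non-zero if any were found.
check_container_states() {
  awk '
    # Skip the header row, the dashed rule, and blank lines.
    /^Name[ \t]/ || /^-+$/ || NF == 0 { next }
    {
      if ($1 ~ /kafka-client/) {
        # kafka-client runs once and is expected to exit cleanly.
        ok = ($0 ~ /Exit 0( |$)/)
      } else {
        # Every other container is expected to be running.
        ok = ($0 ~ / Up( |$)/)
      }
      if (!ok) { print $1; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

A possible invocation is `docker-compose ps | check_container_states`. No output and a zero exit status would mean every container is in its expected state.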
- Monitoring --> System Health: Confluent Control Center landing page shows the overall system health of a given Kafka cluster. For capacity planning activities, view cluster utilization:
- CPU: look at network and thread pool usage, produce and fetch request latencies
- Network utilization: look at throughput per broker or per cluster
- Disk utilization: look at disk space used by all log segments, per broker
- Management --> Kafka Connect: Confluent Control Center uses the Kafka Connect API to manage Kafka connectors. Kafka Connect Sources tab shows the connector wikipedia-irc. Click "Edit" to see the details of the connector configuration and custom transforms.
- Kafka Connect Sinks tab shows the connector elasticsearch-ksql (or elasticsearch-pipeline if you are running without KSQL). Click "Edit" to see the details of the connector configuration and custom transforms.
- Monitoring --> Data Streams --> Message Delivery: The Kafka Connect sink connector has a corresponding consumer group connect-elasticsearch-ksql consuming from the configured Kafka topic. This consumer group will be in the consumer group statistics in the stream monitoring charts.
- Management --> Topics --> Topic Information: For a given topic, click on the three dots "..." next to the topic name and click on "View details". View which brokers are leaders for which partitions and the number of consumer groups currently consuming from this topic. Click on the boxed consumer group count to select a consumer group for which to monitor its data streams and jump to it.
- Monitoring --> Data Streams --> Message Delivery: hover over any chart to see number of messages and average latency within a minute time interval.
- Monitoring --> System Health: to identify bottlenecks, you can see a breakdown of produce and fetch latencies through the entire request lifecycle. Click on the line graph in the "Request latency" chart. The request latency values can be shown at the median, 95th, 99th, or 99.9th percentile. Depending on where the bottlenecks are, you can tune your brokers and clients appropriately.
If you ran the demo with KSQL, i.e. DEMOPATH=scripts_ksql, then there are additional things you can look at. If you did not run the demo with KSQL, skip this section.
- Run KSQL CLI to get more information on the queries, streams, and tables.
$ docker-compose exec ksql-cli ksql-cli remote http://localhost:8080
...
ksql> show queries;
ksql> describe wikipediabot;
ksql> select * from wikipediabot limit 3;
ksql> describe en_wikipedia_gt_1;
ksql> select * from en_wikipedia_gt_1 limit 3;
- Monitoring --> Data Streams --> Message Delivery: all KSQL queries are materialized in Confluent Control Center as consumer groups with names that start with ksql_query_. To correlate these consumer groups to the actual KSQL query, note the query number and query string in the output of:
$ docker-compose exec ksql-cli ksql-cli remote http://localhost:8080 --exec "show queries;"
- Monitoring --> Data Streams --> Message Delivery: graphs for consumer groups EN_WIKIPEDIA_GT_1_COUNTS-consumer and ksql_query_5 display data at intervals instead of smoothly like the other consumer groups. This is because Confluent Control Center displays data based on message timestamps, and this particular stream of data is a tumbling window with a window size of 5 minutes. All of its message timestamps are therefore set to the beginning of each 5-minute window, which is why the latency for these streams appears to be high. Kafka streaming tumbling windows are working as designed and Confluent Control Center is reporting them accurately.
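The timestamp effect described for the tumbling window can be worked through with simple arithmetic. This is only an illustrative sketch: window_start and apparent_latency_ms are hypothetical helpers, and the timestamps are made-up values in milliseconds. They assume, as stated above, that each windowed message carries the timestamp of the start of its window.

```shell
# Hypothetical helpers for illustration only.

# Print the start of the tumbling window that contains timestamp $1,
# for a window size of $2 (both in milliseconds).
window_start() {
  echo $(( $1 - $1 % $2 ))
}

# Print how old a message appears if it is consumed at time $1 but is
# stamped with the start of its window (window size $2, in milliseconds).
apparent_latency_ms() {
  echo $(( $1 - $(window_start "$1" "$2") ))
}
```

For example, with a 5-minute window (300000 ms), `window_start 650000 300000` gives 600000, so a message consumed at 650000 appears 50000 ms late even if it was delivered immediately.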
Control Center shows which consumers in a consumer group are consuming from which partitions and on which brokers those partitions reside. Control Center updates as consumer rebalances occur in a consumer group.
- If your consumer group app is not running, start consuming from topic wikipedia.parsed with a new consumer group app with one consumer consumer_app_1. It will run in the background.
$ ./$DEMOPATH/start_consumer_app.sh 1
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Click on the box "View Details" above the bar graph to drill down into consumer group details. This consumer group app has a single consumer consumer_app_1 consuming all of the partitions in the topic wikipedia.parsed. The first bar may be red because the consumer started in the middle of a time window and did not receive all messages produced during that window. This does not mean messages were lost.
- Add a second consumer consumer_app_2 to the existing consumer group app.
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice that the consumers consumer_app_1 and consumer_app_2 now share consumption of the partitions in the topic wikipedia.parsed. When the second consumer was added, that bar may be red for both consumers because a consumer rebalance occurred during that time window. This does not mean messages were lost, as you can confirm at the consumer group level.
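The partition assignment that Control Center shows after a rebalance can be cross-checked from a terminal with the kafka-consumer-groups tool and its --describe option. The sketch below is an assumption-laden illustration: partitions_per_client is a hypothetical helper that counts partitions per CLIENT-ID in the --describe output, and it assumes the consumers' client IDs match the consumer names used in this demo.

```shell
# Hypothetical helper, not part of cp-demo.
# Reads `kafka-consumer-groups --describe` output on stdin and prints
# "<client-id> <number of assigned partitions>" for each client.
partitions_per_client() {
  awk '
    # Find the header row and remember which column holds CLIENT-ID.
    !hdr {
      for (i = 1; i <= NF; i++) if ($i == "CLIENT-ID") col = i
      if (col) hdr = 1
      next
    }
    # Count one partition per data row for the client that owns it.
    NF >= col { count[$col]++ }
    END { for (c in count) print c, count[c] }
  ' | sort
}
```

A possible invocation, assuming the same broker address used elsewhere in this demo, is `docker-compose exec kafka1 kafka-consumer-groups --describe --group app --bootstrap-server kafka1:9092 | partitions_per_client`. With two consumers sharing the topic, each client should be listed with its share of the partitions.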
Streams monitoring in Control Center can highlight consumers that are slow to keep up with the producers. This is critical to monitor for real-time applications where consumers should consume produced messages with as low latency as possible. To simulate a slow consumer, we will use Kafka's quota feature to rate-limit consumption from the broker side, for just one of two consumers in a consumer group.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Add a consumption quota for one of the consumers in the consumer group app.
$ ./$DEMOPATH/throttle_consumer.sh 1 add
Note: you are running a Docker demo environment with all services running on one host, which you would never do in production. Depending on your system resource availability, sometimes applying the quota may stall the consumer (KAFKA-5871), thus you may need to adjust the quota rate. See the ./$DEMOPATH/throttle_consumer.sh script for syntax on modifying the quota rate.
- If consumer group app does not increase latency, decrease the quota rate
- If consumer group app seems to stall, increase the quota rate
- View the details of the consumer group app again. consumer_app_1 now shows high latency, and consumer_app_2 shows normal latency.
- In the System Health dashboard, you see that the fetch request latency has likewise increased. This is because the broker that has the partition that consumer_app_1 is consuming from is taking longer to service requests.
- Click on the fetch request latency line graph to see a breakdown of produce and fetch latencies through the entire request lifecycle. The middle number does not necessarily equal the sum of the percentiles of individual segments because it is the total percentile latency.
- Remove the consumption quota for the consumer. Latency for consumer_app_1 recovers to steady state values.
$ ./$DEMOPATH/throttle_consumer.sh 1 delete
Streams monitoring in Control Center can highlight consumers that are over consuming some messages, which is an indication that consumers are processing a set of messages more than once. This may happen intentionally, for example when an application with a software bug consumed and processed Kafka messages incorrectly, got a fix, and then reprocesses the previous messages correctly. This may also happen unintentionally if an application crashes before committing processed messages. To simulate over consumption, we will use Kafka's consumer offset reset tool to set the offset of the consumer group app to an earlier offset, thereby forcing the consumer group to reconsume messages it has previously read.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running and were never started, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers gracefully with kill -15, so the consumers follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_graceful.sh
- Wait for 2 minutes to let messages continue to be written to the topics for a while, without being consumed by the consumer group app. Notice the red bar which highlights that during the time window when the consumer group was stopped, there were some messages produced but not consumed. These messages are not missing, they are just not consumed because the consumer group stopped.
- Reset the offset of the consumer group app by shifting 200 offsets backwards. The offset reset tool must be run when the consumer is completely stopped. Offset values in output shown below will vary.
$ docker-compose exec kafka1 kafka-consumer-groups --reset-offsets --group app --shift-by -200 --bootstrap-server kafka1:9092 --all-topics --execute
TOPIC PARTITION NEW-OFFSET
wikipedia.parsed 1 4071
wikipedia.parsed 0 7944
- Restart consuming from topic wikipedia.parsed with the consumer group app with two consumers.
$ ./$DEMOPATH/start_consumer_app.sh 1
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice several things:
- Even though the consumer group app was not running for some of this time, all messages are shown as delivered. This is because all bars are time windows relative to produce timestamp.
- For some time intervals, the bars are red and the consumption line is above expected consumption because some messages were consumed twice due to rewinding offsets.
- The latency peaks and then gradually decreases, because this is also relative to the produce timestamp.
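The effect of --shift-by on a single partition can be sketched as arithmetic. This is an illustration under stated assumptions rather than the tool's actual code: shifted_offset is a hypothetical helper, the offsets in the example are made up, and it assumes the reset tool keeps the resulting offset within the partition's earliest and latest offsets.

```shell
# Hypothetical helper for illustration only.
# Arguments: current committed offset, shift amount (negative rewinds),
# earliest available offset, latest offset of the partition.
shifted_offset() {
  target=$(( $1 + $2 ))
  # Assumed behavior: the result never falls outside the partition's range.
  if [ "$target" -lt "$3" ]; then target=$3; fi
  if [ "$target" -gt "$4" ]; then target=$4; fi
  echo "$target"
}
```

For example, `shifted_offset 4271 -200 0 5000` gives 4271 - 200 = 4071, so the 200 messages between offsets 4071 and 4271 are read a second time after the restart. This is what pushes the consumption line above expected consumption.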
Streams monitoring in Control Center can highlight consumers that are under consuming some messages. This may happen intentionally when consumers stop and restart and operators change the consumer offsets to the latest offset. This avoids the delay of first processing the messages that were produced while the consumers were stopped, which matters for real-time applications. This may also happen unintentionally if a consumer is offline for longer than the log retention period, or if a producer is configured for acks=0 and a broker suddenly fails before having a chance to replicate data to other brokers. To simulate under consumption, we will use Kafka's consumer offset reset tool to set the offset of the consumer group app to the latest offset, thereby skipping messages that will never be read.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running and were never started, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers ungracefully with kill -9, so the consumers do not follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_ungraceful.sh
- Wait for 2 minutes to let messages continue to be written to the topics for a while, without being consumed by the consumer group app. Notice the red bar which highlights that during the time window when the consumer group was stopped, there were some messages produced but not consumed. These messages are not missing, they are just not consumed because the consumer group stopped.
- Wait for another few minutes and notice that the bar graph changes and there is a herringbone pattern to indicate that perhaps the consumer group stopped ungracefully.
- Reset the offset of the consumer group app by setting it to latest offset. The offset reset tool must be run when the consumer is completely stopped. Offset values in output shown below will vary.
$ docker-compose exec kafka1 kafka-consumer-groups --reset-offsets --group app --to-latest --bootstrap-server kafka1:9092 --all-topics --execute
TOPIC PARTITION NEW-OFFSET
wikipedia.parsed 1 8601
wikipedia.parsed 0 15135
- Restart consuming from topic wikipedia.parsed with the consumer group app with two consumers.
$ ./$DEMOPATH/start_consumer_app.sh 1
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice that during the time period that the consumer group app was not running, no produced messages are shown as delivered.
To simulate a failed broker, stop the Docker container running one of the two Kafka brokers.
- Stop the Docker container running Kafka broker 2.
$ docker-compose stop kafka2
- After a few minutes, observe that System Health shows the broker count has gone down from 2 to 1, and there are many under replicated partitions.
- View topic details to see that there are out of sync replicas on broker 2.
- Restart the Docker container running Kafka broker 2.
$ docker-compose start kafka2
- After about a minute, observe the System Health view in Confluent Control Center. The broker count has recovered to 2, and the topic partitions are back to reporting no under replicated partitions.
- Click on the broker count 2 inside the circle to view when the broker counts changed.
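What "under replicated" means in the steps above can be made concrete by comparing each partition's replica list with its in-sync replica (Isr) list. The following is a minimal sketch: under_replicated is a hypothetical helper that reads kafka-topics --describe output on stdin and prints the topic and partition wherever the Isr list is shorter than the Replicas list. The kafka-topics tool also offers an --under-replicated-partitions option that filters for this directly.

```shell
# Hypothetical helper, not part of cp-demo.
# Reads `kafka-topics --describe` output on stdin and prints
# "<topic> <partition>" for each partition with fewer in-sync replicas
# than assigned replicas.
under_replicated() {
  awk '
    # Only per-partition rows carry both a Partition: and an Isr: label.
    /Partition:/ && /Isr:/ {
      topic = part = reps = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part  = $(i + 1)
        if ($i == "Replicas:")  reps  = $(i + 1)
        if ($i == "Isr:")       isr   = $(i + 1)
      }
      # split() returns the number of comma-separated broker IDs.
      if (split(isr, a, ",") < split(reps, b, ",")) print topic, part
    }
  '
}
```

A possible invocation while broker 2 is stopped, assuming ZooKeeper is reachable from the broker container as zookeeper:2181, is `docker-compose exec kafka1 kafka-topics --zookeeper zookeeper:2181 --describe | under_replicated`.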
There are many types of Control Center alerts and many ways to configure them. Use the Alerts management page to define triggers and actions, or click on a streams monitoring graph for consumer groups or topics to set up alerts from there.
- This demo already has pre-configured triggers and actions. View the Alerts "Overview" screen, and click "Edit" to see configuration details.
- The trigger Under Replicated Partitions happens when a broker reports non-zero under replicated partitions, and it causes an action Email Administrator.
- The trigger Consumption Difference happens when consumption difference for the Elasticsearch connector consumer group is greater than 0, and it causes an action Email Administrator.
- If you followed the steps in the failed broker section, view the Alert history to see that the trigger Under Replicated Partitions happened and caused an alert when you stopped broker 2.
- You can also trigger the Consumption Difference trigger. In the Kafka Connect -> Sinks screen, edit the running Elasticsearch sink connector.
- Pause the Elasticsearch sink connector by pressing the pause icon in the top left. This will stop consumption for the related consumer group.
- View the Alert history to see that this trigger happened and caused an alert.
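Pausing the sink connector can also be done outside the GUI, since Kafka Connect exposes pause, resume, and status endpoints in its REST API. The sketch below is an assumption, not something this demo ships: the helper names are hypothetical, and it assumes the Connect REST API is reachable without authentication at http://localhost:8083, the port mapped for the connect container.

```shell
# Hypothetical helpers, not part of cp-demo.
# Assumed base URL of the Kafka Connect REST API; override if it differs.
CONNECT_URL=${CONNECT_URL:-http://localhost:8083}

# Build the URL for an action (pause, resume, status) on a connector.
connector_url() {
  echo "$CONNECT_URL/connectors/$1/$2"
}

# Pause and resume use PUT; status is a plain GET that returns JSON.
pause_connector()  { curl -s -X PUT "$(connector_url "$1" pause)"; }
resume_connector() { curl -s -X PUT "$(connector_url "$1" resume)"; }
connector_status() { curl -s "$(connector_url "$1" status)"; }
```

For example, `pause_connector elasticsearch-ksql` followed by `connector_status elasticsearch-ksql` would pause the sink and show its state, and `resume_connector elasticsearch-ksql` would restart consumption.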
- Viewing topic data: if you want to watch the live messages from the wikipedia.parsed topic:
$ ./$DEMOPATH/listen_wikipedia.parsed.sh
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers gracefully with kill -15, so the consumers follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_graceful.sh
- Stop the Docker demo, destroy all components and clear all Docker volumes.
$ ./$DEMOPATH/reset_demo.sh