Description
There appears to be a race condition of some kind in `KafkaConsumer.subscribe(pattern='some pattern')`.
Normally the call works fine: the consumer picks up matching topics, assigns partitions to group members, etc.
However, once in a blue moon I've observed that the consumer finds the matching topic but never successfully assigns the topic partitions to the group members. Once it's in this state, it will call `poll()`
for hours without returning messages, because the consumer thinks it has no assigned partitions, and because the consumer's subscription already contains the topic, there's never a change that triggers a rebalance.
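For context, the consumer usage is essentially the loop below. The bootstrap address, group id, and pattern are placeholders for my real config, which I can't share:

```python
from kafka import KafkaConsumer

# Placeholder config; the real service supplies its own brokers, group id, and pattern.
consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    group_id='my-group',
)
consumer.subscribe(pattern='^my-topic.*')

while True:
    # In the failure case this returns {} for hours: the consumer believes it
    # has no assigned partitions, so there is nothing to fetch.
    records = consumer.poll(timeout_ms=1000)
    for tp, messages in records.items():
        for message in messages:
            print(tp, message.offset, message.value)
```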
I'm embarrassed to say that I've spent 40+ hours over the past two weeks trying to figure this out as we hit it in production, but all I've managed to do is isolate a semi-consistently reproducible example. Unfortunately, that requires running a service that has a `KafkaConsumer` instance plus a bunch of associated docker containers, so I can't make this setup public. The wrapper service does use `gevent`, which I'm not very familiar with, but I disabled all the service's other greenlets, so I don't think that should affect this at all.
Every time I try to isolate it down to a simple `kafka-python` script, I cannot reproduce it. But after spending hours stepping through the code, I'm reasonably certain it's a race condition in `kafka-python` and not in the wrapper service.
Here's what I know:
- The issue doesn't show up the first time I run the service. If I kill the service (without calling `KafkaConsumer.close()`) and then restart it before the group coordinator evicts the consumer from the group, then I trigger the issue. If I then kill it, wait until I know the group coordinator has evicted all consumers, and then re-run it, it works fine. Unfortunately, I have no idea if this behavior is related to the root cause, or just a trigger that makes the docker kafka container busy enough that it slows down its response times.
- In the failure case, calling `KafkaConsumer.subscription()` returns the expected topic name, but calling `KafkaConsumer.assignment()` returns an empty set (see the sketch after the Wireshark traces below).
- In the failure case, I can see that the cluster metadata object has both the topic and the list of partitions, so the cluster metadata is getting correctly updated; it's just not making it into the group assignments.
- `SubscriptionState.change_subscription()` has a check that short-circuits the group rebalance if the previous/current topic subscriptions are equal. If I comment out the `return` in that short-circuit check, the group rebalances properly and the problem disappears.
- Tracing the TCP calls in Wireshark, I see the following:
Success case:
1. Metadata v1 Request
2. Metadata v2 Response
3. GroupCoordinator v0 Request
4. GroupCoordinator v0 Response
5. JoinGroup v0 Request - protocol member metadata is all 0's
6. JoinGroup v0 Response - protocol member metadata is all 0's
7. SyncGroup v0 Request - member assignment is all 0's
8. SyncGroup v0 Response - member assignment is all 0's
9. JoinGroup v0 Request - protocol member metadata has data (note this is a second generation of the group)
10. JoinGroup v0 Response - protocol member metadata has data
11. SyncGroup v0 Request - member assignment has data
12. SyncGroup v0 Response - member assignment has data
13. From here on it's the expected behavior of polling the assigned partitions, with the occasional Metadata Request/Response when the metadata refresh timeout kicks in.
Failure case:
1. Metadata v1 Request
2. Metadata v2 Response
3. GroupCoordinator v0 Request
4. GroupCoordinator v0 Response
5. JoinGroup v0 Request - protocol member metadata is all 0's
6. JoinGroup v0 Response - protocol member metadata is all 0's
7. SyncGroup v0 Request - member assignment is all 0's
(Here is the problem: we never trigger a second JoinGroup v0 Request that contains the partition data.)
8. From here on there are no requests other than the Metadata Request/Response when the metadata refresh timeout kicks in.
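To make the stuck state concrete, here is a sketch of how it can be detected from application code, plus one possible (untested) application-level workaround based on the short-circuit observation above. The pattern/topic names are placeholders, and this is only a sketch, not a verified fix:

```python
# Sketch only: detect the stuck state where the subscription knows the topic
# but no partitions were ever assigned.
subscription = consumer.subscription()  # -> {'my-topic'} even when stuck
assignment = consumer.assignment()      # -> set() in the failure case

if subscription and not assignment:
    # Untested idea for a workaround: force the subscription to actually
    # change so SubscriptionState.change_subscription() doesn't take the
    # short-circuit return, which should trigger a fresh JoinGroup/SyncGroup.
    consumer.unsubscribe()
    consumer.subscribe(pattern='^my-topic.*')
```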
Setup:
- Single Kafka broker, version 0.10.2.1, running on docker.
- Single instance of the consumer, so it always elects itself as the leader and consumes all partitions for the topic.
- To keep things simple, my topic has only one partition. However, this race condition might be partition-count agnostic, meaning that a consumer might be working perfectly, and then when we expand the number of partitions it might not pick up that the partitions changed (see the sketch below).
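If it is indeed partition-count agnostic, something like the following check could, in principle, surface the same symptom after a partition expansion. This is a sketch against the public `KafkaConsumer` API; 'my-topic' is a placeholder, and the comparison only makes sense here because this single consumer should own every partition:

```python
# Sketch: compare what cluster metadata reports for the topic with what this
# (sole) group member actually has assigned.
metadata_partitions = consumer.partitions_for_topic('my-topic') or set()
assigned_partitions = {tp.partition for tp in consumer.assignment()
                       if tp.topic == 'my-topic'}

if metadata_partitions != assigned_partitions:
    print('metadata reports partitions %s but only %s are assigned' % (
        sorted(metadata_partitions), sorted(assigned_partitions)))
```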
After spending a lot of time poking through this code, I understand why the consumer is stuck once this happens, but I don't understand how it gets into this state in the first place.