Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- 1.5.0.Final
- None
- False
- False
- Undefined
Description
We recently experienced Debezium failing to process messages. Prior to the issue, this was seen in the logs:
[2021-07-12 05:58:46,899] INFO [Worker clientId=connect-1, groupId=xxx-kafka-connect-debezium] Member connect-1-c8c1fc2b-4688-4a0f-982a-64d4fd14225a sending LeaveGroup request to coordinator kafka-xyzxyz-3.dev-xyzxyz.internal:9092 (id: 2147483644 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
Debezium then continued processing messages until at least 10:15, and from 13:02 we started seeing the following repeating in the logs (there were definitely messages available in our source tables between these times):
[2021-07-12 13:02:59,002] INFO [Consumer clientId=consumer-xxx-kafka-connect-debezium-1, groupId=xxx-kafka-connect-debezium] Node 1 was unable to process the fetch request with (sessionId=1113180491, epoch=49579): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
We have a Kubernetes cluster running 2 Kafka Connect pods, and both were failing to pick up messages. Killing the pods and letting Kubernetes restart them resolved the issue, but we would like to understand why it may have happened and take steps to mitigate it in future. Please advise whether the above points to a known issue or to a needed configuration tweak.
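For reference, the two tunables named in the first log message can be overridden for the Connect consumers via `consumer.`-prefixed properties in the worker configuration. A sketch of what we could try (the values below are illustrative placeholders, not settings we currently run):

```properties
# Kafka Connect worker config overrides for the embedded consumers.
# Give the poll loop more headroom before the consumer is kicked from the group:
consumer.max.poll.interval.ms=600000
# ...or reduce the amount of work done per poll() call:
consumer.max.poll.records=250
```

We would appreciate guidance on whether tuning these is the right direction, or whether the FETCH_SESSION_ID_NOT_FOUND loop indicates something else entirely.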