Type: Bug
Resolution: Obsolete
Priority: Major
Affects Version: 2.1.0.GA
The customer is hitting the following KafkaConnect error after an automatic upgrade on OpenShift 4:
2022-04-07 20:47:57,406 ERROR [Worker clientId=connect-1, groupId=ircc-connect-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder) [DistributedHerder-connect-1-1] org.apache.kafka.common.config.ConfigException: Topic 'connect-cluster-offsets' supplied via the 'offset.storage.topic' property is required to have 'cleanup.policy=compact' to guarantee consistency and durability of source connector offsets, but found the topic currently has 'cleanup.policy=delete'. Continuing would likely result in eventually losing source connector offsets and problems restarting this Connect cluster in the future. Change the 'offset.storage.topic' property in the Connect worker configurations to use a topic with 'cleanup.policy=compact'.
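The quickest way to confirm what the error is complaining about is to read the topic configuration directly from Kafka, bypassing the KafkaTopic resource. This is only a sketch: the broker pod name, kafka container name, and plain listener port are assumptions based on a default my-cluster deployment.
$ kubectl exec -it my-cluster-kafka-0 -c kafka -- bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name connect-cluster-offsets --describe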
Everything works fine when KafkaConnect is deployed after the Kafka cluster is up and running.
$ kubectl get kt | grep connect-cluster
connect-cluster-configs my-cluster 1 3 True
connect-cluster-offsets my-cluster 25 3 True
connect-cluster-status my-cluster 5 3 True
$ kubectl get kt connect-cluster-offsets -o yaml | yq eval ".spec" -
config:
  cleanup.policy: compact
partitions: 25
replicas: 3
topicName: connect-cluster-offsets
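For comparison, this is roughly the KafkaTopic we would expect for the offsets topic if it were declared explicitly; the apiVersion and cluster label below are assumptions based on a standard Strimzi v1beta2 setup, not taken from the customer environment.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-cluster-offsets
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 25
  replicas: 3
  config:
    cleanup.policy: compact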
Instead, this is what happens when Kafka and KafkaConnect are reconciled concurrently and you manually delete all topic resources.
$ kubectl delete po --all && kubectl delete kt --all
...
$ kubectl get kt | grep connect-cluster
connect-cluster-configs my-cluster 3 3
connect-cluster-offsets my-cluster 3 3
connect-cluster-status my-cluster 3 3
$ kubectl get kt connect-cluster-offsets -o yaml | yq eval ".spec" -
config: {}
partitions: 3
replicas: 3
topicName: connect-cluster-offsets
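A naive recovery attempt would be to patch the expected config back onto the KafkaTopic; the command below is only a sketch and has not been verified to take effect while the operator is in this state.
$ kubectl patch kt connect-cluster-offsets --type merge -p '{"spec":{"config":{"cleanup.policy":"compact"}}}'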
At this point, we can check the TopicOperator log to see, for example, what happened to our connect-cluster-offsets topic.
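Assuming the default my-cluster naming, the log can be retrieved from the topic-operator container of the Entity Operator with something like:
$ kubectl logs deployment/my-cluster-entity-operator -c topic-operator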
Initially, the topic is only present in Kafka (left over from the previous deployment), so the TopicOperator needs to create the corresponding KafkaTopic resource in Kubernetes.
2022-04-29 10:52:21,07830 INFO [vert.x-eventloop-thread-1] TopicOperator:576 - Reconciliation #100(initial kafka connect-cluster-offsets) KafkaTopic(test/connect-cluster-offsets): Reconciling topic connect-cluster-offsets, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
Then we have lots of invalid state store errors, which I think are responsible for the lost topic configuration.
2022-04-29 10:54:30,63425 INFO [vert.x-eventloop-thread-1] TopicOperator:576 - Reconciliation #735(periodic -connect-cluster-offsets) KafkaTopic(test/connect-cluster-offsets): Reconciling topic connect-cluster-offsets, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
2022-04-29 10:54:38,37543 ERROR [vert.x-eventloop-thread-0] K8sTopicWatcher:69 - Reconciliation #943(kube +connect-cluster-offsets) KafkaTopic(test/connect-cluster-offsets): Failure processing KafkaTopic watch event ADDED on resource connect-cluster-offsets with labels {strimzi.io/cluster=my-cluster}: The state store, topic-store, may have migrated to another instance.
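The state store mentioned in the error is backed by the TopicOperator's own internal topics in Kafka; whether they survived the mass deletion can be checked with something like the following (pod name and listener port assumed as before):
$ kubectl exec -it my-cluster-kafka-0 -c kafka -- bin/kafka-topics.sh --bootstrap-server localhost:9092 --list | grep strimzi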
Finally, the topic is created, but with the wrong configuration.
2022-04-29 10:54:38,37525 INFO [kubernetes-ops-pool-11] CrdOperator:113 - Reconciliation #926(kube +connect-cluster-offsets) KafkaTopic(test/connect-cluster-offsets): Status of KafkaTopic connect-cluster-offsets in namespace test has been updated
After that, the TopicOperator no longer works at all, as it is stuck with an invalid state store (restarting the pod does not seem to help).
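For completeness, restarting the TopicOperator amounts to rolling the Entity Operator deployment (name assumed from the default my-cluster naming), e.g. as below; as noted above, the state store error comes back as soon as the pod is up again.
$ kubectl rollout restart deployment/my-cluster-entity-operator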