Type: Bug
Priority: Major
Resolution: Done
Testing with a newly created `2x` instance, I see the KafkaTopicPartitionReplicaSpreadMax alert firing for the canary topic.
Checking the partition assignment, I do indeed see some partitions with replicas co-located in the same availability zone.
./bin/kafka-topics.sh --topic __redhat_strimzi_canary --describe --bootstrap-server localhost:9096
Topic: __redhat_strimzi_canary TopicId: T-cSuR5_Te6nEYMec3xSjg PartitionCount: 6 ReplicationFactor: 3 Configs: min.insync.replicas=2,cleanup.policy=delete,segment.bytes=16384,retention.ms=600000,message.format.version=3.0-IV1,max.message.bytes=1048588
Topic: __redhat_strimzi_canary Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: __redhat_strimzi_canary Partition: 1 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: __redhat_strimzi_canary Partition: 2 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4
Topic: __redhat_strimzi_canary Partition: 3 Leader: 3 Replicas: 3,4,5 Isr: 3,4,5
Topic: __redhat_strimzi_canary Partition: 4 Leader: 4 Replicas: 4,5,0 Isr: 4,5,0
Topic: __redhat_strimzi_canary Partition: 5 Leader: 5 Replicas: 5,0,1 Isr: 5,0,1
For instance, partition 2 has replicas on brokers 2, 3, and 4, but brokers 2 and 3 are both in us-east-1b:
oc logs kafka-instance-kafka-2 | grep -E '^broker.rack'
broker.rack=us-east-1b
oc logs kafka-instance-kafka-3 | grep -E '^broker.rack'
broker.rack=us-east-1b
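The same check can be scripted across all six brokers (a rough sketch, assuming the pod naming used above; the resulting broker-to-rack map is then compared against the replica lists by hand):

# Print each broker's rack (pod names assumed from the session above)
for i in 0 1 2 3 4 5; do
  rack=$(oc logs kafka-instance-kafka-$i | grep -E '^broker.rack' | cut -d= -f2)
  echo "broker $i -> $rack"
done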
The canary topic was created:
I0610 13:03:45.234567 1 topic.go:162] The canary topic __redhat_strimzi_canary was created
I0610 13:03:45.234593 1 consumer.go:135] Waiting consumer group to be up and running
The brokers seem to have been up before that.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-0 | grep "Startup complete"
2022-06-10T13:03:24Z INFO [main] [GroupCoordinator] [GroupCoordinator 0]: Startup complete.
2022-06-10T13:03:24Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=0] Startup complete.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-1 | grep "Startup complete"
2022-06-10T13:03:25Z INFO [main] [GroupCoordinator] [GroupCoordinator 1]: Startup complete.
2022-06-10T13:03:25Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=1] Startup complete.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-2 | grep "Startup complete"
2022-06-10T13:03:21Z INFO [main] [GroupCoordinator] [GroupCoordinator 2]: Startup complete.
2022-06-10T13:03:21Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=2] Startup complete.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-3 | grep "Startup complete"
2022-06-10T13:03:27Z INFO [main] [GroupCoordinator] [GroupCoordinator 3]: Startup complete.
2022-06-10T13:03:27Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=3] Startup complete.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-4 | grep "Startup complete"
2022-06-10T13:03:30Z INFO [main] [GroupCoordinator] [GroupCoordinator 4]: Startup complete.
2022-06-10T13:03:30Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=4] Startup complete.
kwall@Oslo kas-installer % oc logs kafka-instance-kafka-5 | grep "Startup complete"
2022-06-10T13:03:34Z INFO [main] [GroupCoordinator] [GroupCoordinator 5]: Startup complete.
2022-06-10T13:03:35Z INFO [main] [TransactionCoordinator] [TransactionCoordinator id=5] Startup complete.
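For reference, the per-broker check above can be condensed into one loop (pod names as above; grep -m1 stops at the first match per broker):

for i in 0 1 2 3 4 5; do
  echo "== kafka-instance-kafka-$i"
  oc logs kafka-instance-kafka-$i | grep -m1 'Startup complete'
done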
IMPACT
As the canary is used only for probing/metrics purposes, there should be no direct customer impact. However:
- the generated alert will create noise for the SREs handling alerts.
- the latency measurements will be incorrect in some cases (the canary produces with acks=all, so it probes all replication paths).
- some kinds of inter-AZ network partitions could go undetected.
Once the Cruise Control MVP is delivered, it will be able to rectify an instance in this state. However, it is not clear why the instance got into this state in the first place: Kafka's replica assignment is expected to honor broker.rack and spread each partition's replicas across racks.
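Until then, an affected partition could in principle be repaired by hand with a partition reassignment. A sketch only: the target replica list [2,4,5] below is hypothetical and would have to be chosen so that all three replicas land in distinct AZs.

# Hypothetical remediation: move partition 2's replica off the duplicated rack
cat > reassign.json <<'EOF'
{"version":1,"partitions":[
  {"topic":"__redhat_strimzi_canary","partition":2,"replicas":[2,4,5]}
]}
EOF
./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9096 \
  --reassignment-json-file reassign.json --execute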