MGDSTRM-8813

Canary topic of newly provisioned Kafka instance may ignore rack awareness - 6 brokers - KafkaTopicPartitionReplicaSpreadMax firing

    Sprint: MK - Sprint 221

      Testing with a newly created `2x` instance, I see the KafkaTopicPartitionReplicaSpreadMax alert firing for the canary topic.

      Checking the partition assignment, I do indeed see some partitions with replicas co-located in the same availability zone.

      ./bin/kafka-topics.sh --topic __redhat_strimzi_canary --describe --bootstrap-server localhost:9096
      Topic: __redhat_strimzi_canary    TopicId: T-cSuR5_Te6nEYMec3xSjg    PartitionCount: 6    ReplicationFactor: 3    Configs: min.insync.replicas=2,cleanup.policy=delete,segment.bytes=16384,retention.ms=600000,message.format.version=3.0-IV1,max.message.bytes=1048588
          Topic: __redhat_strimzi_canary    Partition: 0    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2
          Topic: __redhat_strimzi_canary    Partition: 1    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3
          Topic: __redhat_strimzi_canary    Partition: 2    Leader: 2    Replicas: 2,3,4    Isr: 2,3,4
          Topic: __redhat_strimzi_canary    Partition: 3    Leader: 3    Replicas: 3,4,5    Isr: 3,4,5
          Topic: __redhat_strimzi_canary    Partition: 4    Leader: 4    Replicas: 4,5,0    Isr: 4,5,0
          Topic: __redhat_strimzi_canary    Partition: 5    Leader: 5    Replicas: 5,0,1    Isr: 5,0,1


      For instance, partition 2 has replicas 2,3,4, but brokers 2 and 3 are both in us-east-1b.

      oc logs kafka-instance-kafka-2 | grep -E ^broker.rack
      broker.rack=us-east-1b

      oc logs kafka-instance-kafka-3 | grep -E ^broker.rack
      broker.rack=us-east-1b
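
      The same check across all six brokers can be done in one pass with a quick loop (pod names as above):

      for i in 0 1 2 3 4 5; do
        # print each broker's rack so co-located replicas can be spotted at a glance
        printf 'kafka-instance-kafka-%d: ' "$i"
        oc logs "kafka-instance-kafka-$i" | grep -m 1 -E '^broker.rack'
      done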

      The canary topic was created:

      I0610 13:03:45.234567       1 topic.go:162] The canary topic __redhat_strimzi_canary was created
      I0610 13:03:45.234593       1 consumer.go:135] Waiting consumer group to be up and running

      All six brokers appear to have completed startup before that.

      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-0 | grep "Startup complete"
      2022-06-10T13:03:24Z INFO  [main] [GroupCoordinator] [GroupCoordinator 0]: Startup complete.
      2022-06-10T13:03:24Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=0] Startup complete.
      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-1 | grep "Startup complete"
      2022-06-10T13:03:25Z INFO  [main] [GroupCoordinator] [GroupCoordinator 1]: Startup complete.
      2022-06-10T13:03:25Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=1] Startup complete.
      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-2 | grep "Startup complete"
      2022-06-10T13:03:21Z INFO  [main] [GroupCoordinator] [GroupCoordinator 2]: Startup complete.
      2022-06-10T13:03:21Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=2] Startup complete.
      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-3 | grep "Startup complete"
      2022-06-10T13:03:27Z INFO  [main] [GroupCoordinator] [GroupCoordinator 3]: Startup complete.
      2022-06-10T13:03:27Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=3] Startup complete.

      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-4 | grep "Startup complete"
      2022-06-10T13:03:30Z INFO  [main] [GroupCoordinator] [GroupCoordinator 4]: Startup complete.
      2022-06-10T13:03:30Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=4] Startup complete.
      kwall@Oslo kas-installer % oc logs kafka-instance-kafka-5 | grep "Startup complete"
      2022-06-10T13:03:34Z INFO  [main] [GroupCoordinator] [GroupCoordinator 5]: Startup complete.
      2022-06-10T13:03:35Z INFO  [main] [TransactionCoordinator] [TransactionCoordinator id=5] Startup complete.
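
      For reference, the same startup check in a single pass:

      for i in 0 1 2 3 4 5; do
        # first "Startup complete" line per broker
        oc logs "kafka-instance-kafka-$i" | grep -m 1 "Startup complete"
      done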


      IMPACT

      As the canary is used only for probing/metrics purposes, there should be no direct customer impact. However:

      • the generated alert will create noise for the SREs on call.
      • the latency measurements will be incorrect in some cases (the canary uses acks=all, so it probes all replication paths).
      • some kinds of (inter-AZ) network partitions could go undetected.

      Once the Cruise Control MVP is delivered, it will rectify an instance in this state. However, it is not clear why the instance got into this state in the first place: Kafka should honor broker.rack when assigning replicas at topic creation.
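
      Until then, an instance in this state could in principle be repaired manually with kafka-reassign-partitions.sh. A minimal sketch for partition 2, assuming brokers 4 and 5 are in AZs other than us-east-1b (to be confirmed with the rack check above; the target replica set [2,4,5] is illustrative only). With a reassignment file such as /tmp/fix-canary.json:

      {
        "version": 1,
        "partitions": [
          { "topic": "__redhat_strimzi_canary", "partition": 2, "replicas": [2, 4, 5] }
        ]
      }

      ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9096 \
        --reassignment-json-file /tmp/fix-canary.json --execute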


      Assignee/Reporter: Keith Wall (keithbwall)
      Component: Kafka Integrations