AMQ Streams / ENTMQST-6341

Topic Operator replication factor changes seem to conflict with Cruise Control rebalancing

    • Type: Bug
    • Resolution: Done-Errata
    • 2.8.0.GA

      The Topic Operator (TO) feature for changing the replication factor (RF) of topics seems to conflict with ongoing rebalances done through Cruise Control. The steps to reproduce seem to be the following:

      • Start with a Kafka cluster with 3 brokers
      • Create a topic with replication factor 3 and 100 partitions
      • Fill the topic with a significant amount of data (in my case many GBs) and keep the brokers under load so that they stay busy
      • Scale up your Kafka cluster to 4 or 5 nodes and trigger the rebalance to add the new broker(s)
      • While the rebalance is ongoing, observe the TO log. You should see the following errors:
         
          2024-09-23 18:31:39,67375 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #9807] Reconciling batch of 1 topics
          2024-09-23 18:31:39,71644 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:31:39,89126 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Another task is executing, Topics: [load-test]
          2024-09-23 18:31:40,07910 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #9807] Batch reconciliation completed
          2024-09-23 18:31:40,18186 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #9808] Reconciling batch of 1 topics
          2024-09-23 18:31:40,23586 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:31:40,26293 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Another task is executing, Topics: [load-test]
          2024-09-23 18:31:40,27479 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #9808] Batch reconciliation completed
          2024-09-23 18:33:41,58528 INFO  [-1443900956-pool-3-thread-11] TopicEventHandler:52 - Triggering periodic reconciliation of KafkaTopic resources for namespace myproject
          2024-09-23 18:33:41,59522 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #11015] Reconciling batch of 1 topics
          2024-09-23 18:33:41,65075 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:33:41,71033 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Another task is executing, Topics: [load-test]
          2024-09-23 18:33:41,73336 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #11015] Batch reconciliation completed
          2024-09-23 18:35:45,58174 INFO  [-1443900956-pool-3-thread-12] TopicEventHandler:52 - Triggering periodic reconciliation of KafkaTopic resources for namespace myproject
          2024-09-23 18:35:45,66834 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #12246] Reconciling batch of 1 topics
          2024-09-23 18:35:45,71518 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:35:45,89525 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Error processing POST request '/topic_configuration' due to: 'java.lang.IllegalStateException: All topics matching given pattern already have target replication factor. Requested topic pattern by replication factor: {3=load-test}.'., Topics: [load-test]
          2024-09-23 18:35:45,89570 INFO  [LoopRunnable-0] CruiseControlHandler:190 - Replicas change completed or reverted, Topics: [load-test]
          2024-09-23 18:35:46,01890 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #12246] Batch reconciliation completed
          2024-09-23 18:35:46,11944 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #12247] Reconciling batch of 1 topics
          2024-09-23 18:35:46,15141 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #12247] Batch reconciliation completed
          2024-09-23 18:36:26,01802 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #12643] Reconciling batch of 1 topics
          2024-09-23 18:36:26,13627 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #12643] Batch reconciliation completed
          2024-09-23 18:36:26,23744 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #12644] Reconciling batch of 1 topics
          2024-09-23 18:36:26,32027 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #12644] Batch reconciliation completed
          2024-09-23 18:37:49,58071 INFO  [-1443900956-pool-3-thread-13] TopicEventHandler:52 - Triggering periodic reconciliation of KafkaTopic resources for namespace myproject
          2024-09-23 18:37:49,63800 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #13472] Reconciling batch of 1 topics
          2024-09-23 18:37:49,69097 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:37:49,74838 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Another task is executing, Topics: [load-test]
          2024-09-23 18:37:49,81402 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #13472] Batch reconciliation completed
          2024-09-23 18:37:49,91563 INFO  [LoopRunnable-0] LoopRunnable:216 - [Batch #13473] Reconciling batch of 1 topics
          2024-09-23 18:37:49,97352 INFO  [LoopRunnable-0] CruiseControlHandler:160 - Replicas change pending, Topics: [load-test]
          2024-09-23 18:37:49,99809 ERROR [LoopRunnable-0] CruiseControlHandler:199 - Replicas change failed, Request failed (500), Another task is executing, Topics: [load-test]
          2024-09-23 18:37:50,02118 INFO  [LoopRunnable-0] LoopRunnable:225 - [Batch #13473] Batch reconciliation completed
          

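         (Notice the repeated "Another task is executing" errors, and the "already have target replication factor" error at 18:35:45.)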

      • In Cruise Control, you will see that the TO tries to trigger RF changes even though the replication factor of the topic is unchanged.
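
      To quantify how often each failure occurs while the rebalance is running, the TO log can be summarized with a small script. The following is only a sketch: the regular expression is derived from the `CruiseControlHandler` lines in the excerpt above, and the category names (`busy`, `already_at_target`, `other`) and function names are made up for illustration and are not part of the Topic Operator.

```python
import re
from collections import Counter

# Matches the "Replicas change failed" lines as they appear in the log excerpt.
# The layout is taken from that excerpt only; other TO versions may log differently.
FAILURE = re.compile(
    r"ERROR\s+\[[^\]]+\]\s+CruiseControlHandler:\d+ - Replicas change failed, "
    r"Request failed \((?P<status>\d+)\), (?P<reason>.*), "
    r"Topics: \[(?P<topics>[^\]]*)\]\s*$"
)


def classify(reason: str) -> str:
    """Map a failure reason to a hypothetical category name."""
    if reason.startswith("Another task is executing"):
        return "busy"
    if "already have target replication factor" in reason:
        return "already_at_target"
    return "other"


def summarize(log_text: str) -> Counter:
    """Count failed replica-change requests per category."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            counts[classify(match.group("reason"))] += 1
    return counts
```

      Calling `summarize(open("to.log").read())` (with `to.log` being a hypothetical file holding the TO output) returns a `Counter` keyed by category, which makes it easier to see whether the requests are only rejected as "busy" or also reach Cruise Control's replication factor check.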

      This does not seem to cause any visible issues in the scenario above, as Cruise Control rejects the task from the TO. But one has to wonder:

      • Could a race condition cause problems if the TO thinks the topic has the wrong RF just as the competing rebalance finishes?
      • What if the competing rebalance was done manually and the TO used Cruise Control to fight against it?
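
      One way to reason about the two questions above is to sketch the decision the TO would ideally make before calling Cruise Control. This is a hypothetical illustration, not the Topic Operator's actual logic or the fix applied for this issue. It assumes that the JSON returned by Cruise Control's `state` endpoint (with `substates=executor`) carries an `ExecutorState.state` field whose idle value is `NO_TASK_IN_PROGRESS`; the function names and the `action` values are invented for this example.

```python
from typing import Any, Dict, Mapping

# Assumption: idle value of ExecutorState.state in Cruise Control's JSON state response.
IDLE_STATE = "NO_TASK_IN_PROGRESS"


def executor_is_idle(state_response: Mapping[str, Any]) -> bool:
    """Return True only if the response explicitly reports an idle executor."""
    executor = state_response.get("ExecutorState", {})
    return executor.get("state") == IDLE_STATE


def plan_rf_change(
    desired_rf: Mapping[str, int],
    current_rf: Mapping[str, int],
    state_response: Mapping[str, Any],
) -> Dict[str, Any]:
    """Decide whether to submit, defer, or skip a replication factor change."""
    # Only topics whose observed RF differs from the desired RF need a change.
    pending = {t: rf for t, rf in desired_rf.items() if current_rf.get(t) != rf}
    if not pending:
        return {"action": "none", "topics": {}}
    # Another execution (for example a rebalance) is running: retry later.
    if not executor_is_idle(state_response):
        return {"action": "defer", "topics": pending}
    return {"action": "submit", "topics": pending}
```

      A check like this is still check-then-act, so it narrows the race window rather than closing it: Cruise Control can start another task between the state query and the request, and the request must still handle a rejection. It also says nothing about who started the competing rebalance, which is the concern behind the second question.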

      Created by Strimzi#10630.

              rhn-support-fvaleri Federico Valeri
              scholzj Jakub Scholz
              Maros Orsak Maros Orsak