MGDSTRM-9196 (Managed Service - Streams)

Reduce time taken for followers to catch up


    • MGDSRVS-48 - Be able to sustain an external paying customer in production

      WHAT

      Kafka has the notion of leader and follower brokers. For each topic partition, one broker is the leader and, because the replication factor is 3, two other brokers are followers. Followers replicate data by fetching it from the leader using replica fetcher threads. If the number of fetcher threads is inadequate, fetch work queues up and the replicas may fall out of sync. This affects the customer's ability to produce messages to their Kafka instance.

      RHOSAK currently uses the default value of num.replica.fetchers (1 thread per source broker). This might be a bottleneck for some use cases.

      It is also worth checking how other service providers set this config. For example, MSK increased this value to 2.
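
      A back-of-the-envelope model can help frame the bottleneck before any benchmark is run. The sketch below is an assumption for illustration, not a measurement: it treats catch-up as draining a backlog at the follower's fetch rate while the leader keeps receiving new data. The function name and all three parameters are hypothetical, and how the fetch rate changes with num.replica.fetchers is exactly what the benchmark has to determine.

```python
import math


def catch_up_seconds(backlog_bytes: float,
                     fetch_bytes_per_s: float,
                     produce_bytes_per_s: float) -> float:
    """Estimate how long a follower needs to reach the leader's log end.

    Simplified model (assumption): the follower fetches at a constant rate
    while producers keep appending to the leader at a constant rate.
    """
    net_rate = fetch_bytes_per_s - produce_bytes_per_s
    if net_rate <= 0:
        # The follower never closes the gap, so it stays out of sync.
        return math.inf
    return backlog_bytes / net_rate


# Example: 100 GB backlog, 60 MB/s fetch rate, 10 MB/s ongoing produce rate.
estimate = catch_up_seconds(100 * 10**9, 60 * 10**6, 10 * 10**6)
print(estimate)  # 2000.0 seconds
```

      If the fetch rate does not exceed the produce rate, the model returns infinity, which corresponds to the case above where the replica cannot get back in sync.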

      WHY

      Improve follower fetch performance.

      HOW

      1. Run benchmark tests to find the best configuration for RHOSAK. One suggestion is to create large partitions with a replication factor of 1 and feed in some data. After that, increase the replication factor and measure how long the newly added replicas take to catch up with the leaders.
      Note: While increasing the replication factor, keep producing and consuming records to and from the brokers, to mimic a real production environment.

      2. Update the service to use the new number of threads.
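
      The benchmark in step 1 could be sketched as follows. This is a minimal illustration under stated assumptions, not the team's actual tooling: build_reassignment and time_until are hypothetical helpers, and the topic name and broker ids are made up. The first helper produces a plan in the JSON format accepted by kafka-reassign-partitions.sh (via --reassignment-json-file) to raise the replication factor; the second times how long a caller-supplied check (for example, "all new replicas are in the ISR") takes to become true.

```python
import json
import time
from typing import Callable, Dict, List


def build_reassignment(topic: str,
                       current: Dict[int, List[int]],
                       brokers: List[int],
                       target_rf: int) -> dict:
    """Build a reassignment plan that grows each partition to target_rf replicas."""
    if target_rf > len(brokers):
        raise ValueError("target replication factor exceeds broker count")
    partitions = []
    for partition in sorted(current):
        # Keep existing replicas first so the current leader is unchanged.
        replicas = list(current[partition])
        candidates = [b for b in brokers if b not in replicas]
        # Rotate candidates by partition number to spread new replicas around.
        offset = partition % len(candidates) if candidates else 0
        rotated = candidates[offset:] + candidates[:offset]
        replicas.extend(rotated[:max(0, target_rf - len(replicas))])
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


def time_until(condition: Callable[[], bool],
               poll_s: float = 1.0,
               timeout_s: float = 3600.0) -> float:
    """Poll condition() until it is true and return the elapsed seconds."""
    start = time.monotonic()
    while not condition():
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("replicas did not catch up in time")
        time.sleep(poll_s)
    return time.monotonic() - start


# Hypothetical topic "bench" with 3 partitions, each on a single broker.
plan = build_reassignment("bench", {0: [0], 1: [1], 2: [2]}, [0, 1, 2], 3)
print(json.dumps(plan, indent=2))
```

      The condition passed to time_until is deliberately left to the caller, since how the in-sync state is observed (CLI output, metrics, or an admin client) depends on the test environment. Repeating the run with different num.replica.fetchers values, under the same producer and consumer load, gives comparable catch-up times.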

       

              rh-ee-robeyoun Robert Young
              lukchen@redhat.com Luke Chen
              Kafka Integrations