Managed Service - Streams / MGDSTRM-8857

AlterIsr request is not retried after a retriable error is returned in Kafka v3.0.1


      Summary of the issue:

      The AlterIsr request is sent by the partition leader: followers fetch data from the leader, so only the leader knows whether each follower has caught up. The partition leader sends the AlterIsr request to the controller, and the controller then forwards the updated leader and ISR to all brokers. In Kafka 3.0.1 we found a bug that prevents the AlterIsr request from being retried when the returned error is retriable. As a result, the ISR change is never propagated to all brokers.

       

      Here's the reason:

      1. When preparing the AlterIsr request, we set the "handleAlterIsrResponse" method as the callback and submit the request.
      2. In submit, we check whether there is already an unsent ISR update for this partition. If there is, we return false and log an error. On the first send there is none, so the update is enqueued and the request is sent out.
      3. When the response returns, we reach the root cause. A try/finally block makes sure the partition's entry is removed from `unsentIsrUpdates`. But before the entry is removed, we invoke the callback, which is the "handleAlterIsrResponse" from point 1. The callback checks whether the error is retriable and, if so, retries.
      4. The retry brings us back to point 2: we check whether there is an unsent ISR update for this partition, and this time there is one, because of the race condition in point 3 (the finally block might not have run yet). So we assume a request is already in flight and do nothing.
      5. In the end, the retried AlterIsr request is never sent.
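The sequence above can be modelled in a few lines. The following is a minimal sketch, not Kafka's actual code: the class name, the `submit`/`handleResponse` signatures, the counters, and the single-threaded synchronous retry are all assumptions made for illustration. Only the `unsentIsrUpdates` name and the try/finally ordering come from the description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical, simplified model of the flow in points 1-5; not Kafka's real classes.
class AlterIsrRetryBugSketch {
    // Partitions that currently have an ISR update queued or in flight.
    private final Map<String, Consumer<Boolean>> unsentIsrUpdates = new ConcurrentHashMap<>();
    int acceptedSubmits = 0;
    int rejectedSubmits = 0;

    // Point 2: refuse a new update while one is still tracked for the partition.
    boolean submit(String partition, Consumer<Boolean> callback) {
        if (unsentIsrUpdates.putIfAbsent(partition, callback) != null) {
            rejectedSubmits++;
            return false;
        }
        acceptedSubmits++;
        return true;
    }

    // Point 1: the callback resubmits the update when the error is retriable.
    void sendAlterIsr(String partition) {
        submit(partition, retriable -> {
            if (retriable) {
                sendAlterIsr(partition);
            }
        });
    }

    // Point 3: the callback runs inside try, the entry is removed only in finally.
    void handleResponse(String partition, boolean retriableError) {
        Consumer<Boolean> callback = unsentIsrUpdates.get(partition);
        try {
            callback.accept(retriableError); // retry happens while the entry still exists
        } finally {
            unsentIsrUpdates.remove(partition);
        }
    }

    boolean hasPending(String partition) {
        return unsentIsrUpdates.containsKey(partition);
    }

    public static void main(String[] args) {
        AlterIsrRetryBugSketch manager = new AlterIsrRetryBugSketch();
        manager.sendAlterIsr("topic-0");           // first send is accepted
        manager.handleResponse("topic-0", true);   // retriable error: retry is rejected
        System.out.println("accepted=" + manager.acceptedSubmits
                + " rejected=" + manager.rejectedSubmits
                + " pending=" + manager.hasPending("topic-0"));
    }
}
```

In this model the retry is rejected because the entry is still present when the callback runs, and once the finally block removes it nothing is left to resend the update.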

       

      Because upstream Kafka will not release v3.0.2, we may need to either patch this ourselves or upgrade Kafka.
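If we patch this ourselves, one possible direction is to clear the partition's entry before invoking the callback, so that a retry issued from the callback is accepted. The sketch below only illustrates that ordering. It is an assumption on our side, not the upstream change, and every name in it except `unsentIsrUpdates` is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical model of a reordered response handler; not Kafka's real classes.
class AlterIsrRetryFixSketch {
    private final Map<String, Consumer<Boolean>> unsentIsrUpdates = new ConcurrentHashMap<>();
    int acceptedSubmits = 0;
    int rejectedSubmits = 0;

    // Same check as before: only one tracked update per partition.
    boolean submit(String partition, Consumer<Boolean> callback) {
        if (unsentIsrUpdates.putIfAbsent(partition, callback) != null) {
            rejectedSubmits++;
            return false;
        }
        acceptedSubmits++;
        return true;
    }

    // The callback resubmits the update when the error is retriable.
    void sendAlterIsr(String partition) {
        submit(partition, retriable -> {
            if (retriable) {
                sendAlterIsr(partition);
            }
        });
    }

    // Assumed fix: remove the entry first, then run the callback.
    void handleResponse(String partition, boolean retriableError) {
        Consumer<Boolean> callback = unsentIsrUpdates.remove(partition);
        if (callback != null) {
            callback.accept(retriableError); // a retry now finds no stale entry
        }
    }

    boolean hasPending(String partition) {
        return unsentIsrUpdates.containsKey(partition);
    }

    public static void main(String[] args) {
        AlterIsrRetryFixSketch manager = new AlterIsrRetryFixSketch();
        manager.sendAlterIsr("topic-0");           // first send is accepted
        manager.handleResponse("topic-0", true);   // retriable error: retry is accepted
        manager.handleResponse("topic-0", false);  // retry succeeds, nothing left pending
        System.out.println("accepted=" + manager.acceptedSubmits
                + " rejected=" + manager.rejectedSubmits
                + " pending=" + manager.hasPending("topic-0"));
    }
}
```

Whether this ordering is safe in the real broker depends on threading and locking details that the sketch does not model, so it should be checked against the actual code before patching.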

            Unassigned Unassigned
            lukchen@redhat.com Luke Chen
            Kafka Integrations