-
Bug
-
Resolution: Done
-
Major
-
kafka-3.0.1
-
None
-
False
-
False
-
Yes
Summary of the issue:
The AlterIsr request is sent by the partition leader (followers fetch data from the leader, so only the leader knows whether the followers have caught up). The partition leader sends the AlterIsr request to the controller, and the controller then forwards the updated leader and ISR to all brokers. In Kafka 3.0.1, however, we found a bug: the AlterIsr request is not retried when the returned error is retriable, so the ISR change is never propagated to all brokers.
Here's the reason:
1. When preparing the AlterIsr request, we set the "handleAlterIsrResponse" method as the callback and submit the update.
2. In submit, we check whether there is already an unsent ISR update for this partition. If there is, we return false and log an error. On the first send there is none, so the update is enqueued and the request is sent.
3. When the response is returned, we hit the root cause. We use try/finally to make sure the entry in `unsentIsrUpdates` is removed. But before it is removed, we invoke the callback, which is the "handleAlterIsrResponse" from point 1. The callback checks whether the error is retriable and, if so, retries.
4. The retry takes us back to point 2, where we check whether there is an unsent ISR update for this partition. There still is one, because of the race condition in point 3: the finally block might not have run yet. So we assume a request is in flight and do nothing.
5. In the end, the AlterIsr request is never re-sent.
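The sequence above can be reproduced with a small standalone sketch. This is not Kafka code: the class, `submit`, `handleResponseBuggy`, `handleResponseReordered` and `run` are hypothetical names, and only `unsentIsrUpdates` is taken from the description. It is also simplified to a single thread, where the callback runs synchronously and the stale entry is therefore always still present during the retry. The reordered handler is just one way to show the effect of the ordering; it is not claimed to be the upstream fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class AlterIsrRetrySketch {
    // Stand-in for the unsentIsrUpdates map: partition -> callback of the pending update.
    static final Map<String, Consumer<Boolean>> unsentIsrUpdates = new ConcurrentHashMap<>();
    static int requestsSent = 0;

    // Point 2: refuse a new update while one is still registered for the partition.
    static boolean submit(String partition, Consumer<Boolean> callback) {
        if (unsentIsrUpdates.putIfAbsent(partition, callback) != null) {
            return false; // looks like an in-flight request, so nothing is sent
        }
        requestsSent++; // stands in for actually sending the request
        return true;
    }

    // Point 1: the callback retries by submitting again when the error is retriable.
    static Consumer<Boolean> retryingCallback(String partition) {
        return retriable -> {
            if (retriable) {
                submit(partition, ignored -> { }); // the retry's own callback is a no-op here
            }
        };
    }

    // Point 3 as described: the callback runs first, the entry is removed in finally.
    static void handleResponseBuggy(String partition, boolean retriable) {
        Consumer<Boolean> callback = unsentIsrUpdates.get(partition);
        try {
            callback.accept(retriable); // retry hits the stale entry (point 4)
        } finally {
            unsentIsrUpdates.remove(partition);
        }
    }

    // Illustrative alternative ordering: clear the entry before invoking the callback.
    static void handleResponseReordered(String partition, boolean retriable) {
        Consumer<Boolean> callback = unsentIsrUpdates.remove(partition);
        callback.accept(retriable);
    }

    // Sends one update, answers it with a retriable error, returns how many requests went out.
    static int run(boolean reordered) {
        unsentIsrUpdates.clear();
        requestsSent = 0;
        String partition = "topic-0";
        submit(partition, retryingCallback(partition));
        if (reordered) {
            handleResponseReordered(partition, true);
        } else {
            handleResponseBuggy(partition, true);
        }
        return requestsSent;
    }

    public static void main(String[] args) {
        System.out.println("buggy ordering, requests sent: " + run(false));
        System.out.println("reordered, requests sent: " + run(true));
    }
}
```

With the buggy ordering only the original request is counted, because the retry is rejected in `submit`. With the entry cleared first, the retry is accepted and a second request goes out.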
Because upstream Kafka will not release v3.0.2, we need to either patch this ourselves or upgrade Kafka.
- blocks
-
MGDSTRM-8910 Once service is upgraded to Kafka 3.2.0 update SOPs to remove workarounds for ISR updates failing
- Closed
- is blocked by
-
MGDSTRM-8673 Upgrade service from Kafka 3.0.x to 3.2.x
- Closed
- is related to
-
MGDSTRM-9017 [track upstream KAFKA-14010] alterISR request won't retry when receiving retriable error
- Closed
-
MGDSTRM-8860 Update SOPs so that SREs know how to work around MGDSTRM-8857
- Closed