-
Task
-
Resolution: Done
-
Critical
-
MK - Sprint 221
WHAT
MGDSTRM-8857 is known to be one of the causes of UnderReplicatedPartitions or UnderMinIsrPartitionCount conditions. We need to document the recovery procedure.
Update the SOPs so that SREs know to try:
- Using kafka-reassign-partitions.sh to move the partition leadership (in the case where two replicas are in sync).
- Restarting the broker that is the leader for the partition concerned:
  - First, using the Strimzi annotation.
  - As a fallback, using {{oc delete pod}} (only if Strimzi won't action the roll).
We need to phrase this in such a way that SREs only take this path when this defect is suspected. We don't want SREs to be conditioned into 'turning it off and on again' as a cure-all.
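As a rough illustration of the first option, the sketch below builds a reassignment file for kafka-reassign-partitions.sh that keeps the replica set unchanged but lists a different in-sync replica first, so that it becomes the preferred leader. The function name {{move_leader_plan}}, the topic name and the broker IDs are hypothetical, and the replica, ISR and leader values are assumed to have been read beforehand from a topic describe. This is not the accepted SOP wording, only a sketch of the idea.

```python
import json


def move_leader_plan(topic, partition, replicas, isr, current_leader):
    """Return a reassignment plan (hypothetical helper, not part of Kafka).

    The replica set is unchanged; only the order differs, so that another
    in-sync replica is listed first and becomes the preferred leader.
    """
    # Only a replica that is in sync and is not the current leader qualifies.
    candidates = [b for b in replicas if b in isr and b != current_leader]
    if not candidates:
        raise ValueError("no other in-sync replica to take leadership")
    new_leader = candidates[0]
    reordered = [new_leader] + [b for b in replicas if b != new_leader]
    # Layout expected in the file passed via --reassignment-json-file.
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": reordered}
        ],
    }


if __name__ == "__main__":
    # Assumed example values: three replicas, two of them in sync, leader 0.
    plan = move_leader_plan("my-topic", 0, replicas=[0, 1, 2], isr=[0, 1], current_leader=0)
    print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and passed to kafka-reassign-partitions.sh with {{--reassignment-json-file}} and {{--execute}}. Reordering replicas changes the preferred leader; depending on cluster settings, a preferred leader election may still be needed before leadership actually moves, so the SOP should have SREs verify the leader afterwards.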
WHY
Allow SREs to work around MGDSTRM-8857 and return Managed Kafka to normal service.
HOW
DONE
Include the following where applicable:
- SOPs updated, reviewed and accepted by RTS and SREs.
- is related to: MGDSTRM-8910 Once service is upgraded to Kafka 3.2.0 update SOPs to remove workarounds for ISR updates failing (Closed)
- relates to: MGDSTRM-8857 alterIsr request will not be retried after error returned in Kafka v3.0.1 (Closed)