[ENTMQST-5669] Should manual rolling update failure fail the whole reconciliation? - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: 2.8.0.GA
Affects Version/s: None
Component/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False
Target Release: 2.8.0.GA

Currently, when a pod is annotated for a manual rolling update and that update fails (for example, because topics or the other controllers are not in sync), the whole reconciliation fails as well. In some (rare) situations, this can cause issues. Imagine the following scenario:

1. Due to a storage issue, one of the KRaft controller nodes is deleted, including its PVC and PV.
2. At the same time, another KRaft controller pod is annotated for a manual rolling update (either by the user or possibly by Drain Cleaner).
3. The StrimziPodSet controller restarts the failed pod, but without the PVC/PV the pod stays in the Pending state.
4. The next periodic reconciliation starts and tries to roll the annotated controller pod. This fails because, with another controller in the Pending state, rolling the pod would break quorum. As a result, the whole reconciliation fails as well.
5. However, the PVC creation step runs only after the manual rolling update, so the PVC is never recreated and the Pending controller pod remains pending.
6. When the next reconciliation happens, the same sequence repeats, because these two events block each other.

We should try to avoid this. When the manual rolling update fails, we should suppress the error: log a warning but proceed with the reconciliation. That should give the operator the chance to fix issues (e.g. recreate PVCs). I do not think this should cause any problems. The manual rolling update annotation should either survive until the next reconciliation, or the pod will be rolled in the regular rolling update. I cannot think of any scenario where this would be an issue, but this should be triaged so that we can see whether anyone else sees any problems.
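
The proposed behavior can be illustrated with a minimal Java sketch. It is not taken from the Strimzi code base: the class `ReconcileSketch`, the method `reconcile`, and the use of plain `Runnable` steps are all hypothetical and stand in for the operator's real reconciliation pipeline. The sketch only shows the idea of turning a manual rolling update failure into a warning so that later steps, such as PVC creation, still run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Strimzi code base.
final class ReconcileSketch {

    /**
     * Runs the manual rolling update first, then the remaining steps in order.
     * A failure of the manual rolling update is recorded as a warning instead
     * of being rethrown, so the later steps still get a chance to run.
     * Failures in the remaining steps are not suppressed.
     */
    static List<String> reconcile(Runnable manualRollingUpdate, List<Runnable> remainingSteps) {
        List<String> warnings = new ArrayList<>();
        try {
            manualRollingUpdate.run();
        } catch (RuntimeException e) {
            // Suppress the error: keep a warning and continue the reconciliation.
            warnings.add("Manual rolling update failed, continuing reconciliation: " + e.getMessage());
        }
        for (Runnable step : remainingSteps) {
            step.run(); // an exception here still fails the whole reconciliation
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        // Stand-in for a roll that is refused because it would break quorum.
        Runnable failingRoll = () -> {
            throw new IllegalStateException("rolling the pod would break quorum");
        };
        // Stand-in for the PVC creation step that runs after the manual roll.
        Runnable recreatePvcs = () -> executed.add("PVCs reconciled");

        List<String> warnings = reconcile(failingRoll, List.of(recreatePvcs));
        System.out.println(warnings);
        System.out.println(executed);
    }
}
```

In this sketch the PVC step runs even though the roll was refused, which is what would break the deadlock described in the scenario. The annotation itself is left untouched, so the roll can be retried in a later reconciliation.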

Created by Strimzi#9654

links to

RHSA-2024:142550 Streams for Apache Kafka 2.8.0 release and security update

Assignee: Unassigned

Reporter: Jakub Scholz

Tester: Lukas Kral

Votes: 0

Watchers: 3

Created: 2024/02/07 1:14 PM

Updated: 2024/11/13 4:21 PM

Resolved: 2024/09/10 10:41 PM
