- Bug
- Resolution: Unresolved
- Critical
Important
Related slack thread ,
The issue is that non-master pods get stuck in the Terminating state:
pod/ovsdbserver-nb-1 1/1 Terminating 0 4m31s
pod/ovsdbserver-nb-2 1/1 Terminating 0 4m31s
pod/ovsdbserver-sb-1 1/1 Terminating 0 4m31s
They will likely be removed once the termination grace period, currently set to 5 minutes, is over. These are just symptoms, not the actual issue.
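To confirm which pods are affected while reproducing, the listing above can be filtered on its status column. This is a minimal sketch, assuming the column layout shown in the listing (name, ready count, status, ...); the function name `terminating_pods` is hypothetical and not part of the operator or the test suite.

```shell
# Print the names of pods whose STATUS column reads "Terminating".
# Reads `oc get pods`-style lines on stdin.
# Assumption: STATUS is the third whitespace-separated column, as in the listing above.
terminating_pods() {
    awk '$3 == "Terminating" { print $1 }'
}

# Possible usage against a live namespace (not executed here):
#   oc get pods -n "$NAMESPACE" --no-headers | terminating_pods
```

Piping the three lines from the listing above through `terminating_pods` would print all three pod names, since each has Terminating in the third column.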
As part of the test we delete pods using "oc delete pods -n $NAMESPACE -l service=ovsdbserver-nb"
It could be that ovsdbserver-nb-0 and ovsdbserver-sb-0 are deleted first, leaving the other pods no time to run the cluster/leave command, so they get stuck in the Terminating state.
Some warning events were seen:
4m12s Warning RecreatingFailedPod statefulset/ovsdbserver-nb StatefulSet ovn-kuttl-tests/ovsdbserver-nb is recreating failed Pod ovsdbserver-nb-0
6m46s Warning FailedUpdate statefulset/ovsdbserver-nb update Pod ovsdbserver-nb-0 in StatefulSet ovsdbserver-nb failed error: Could not update claim ovndbcluster-nb-sample-etc-ovn-ovsdbserver-nb-0 for delete policy ownerRefs: Operation cannot be fulfilled on persistentvolumeclaims "ovndbcluster-nb-sample-etc-ovn-ovsdbserver-nb-0": the object has been modified; please apply your changes to the latest version and try again
6m45s Warning FailedUpdate statefulset/ovsdbserver-sb update Pod ovsdbserver-sb-0 in StatefulSet ovsdbserver-sb failed error: Could not update claim ovndbcluster-sb-sample-etc-ovn-ovsdbserver-sb-0 for delete policy ownerRefs: Operation cannot be fulfilled on persistentvolumeclaims "ovndbcluster-sb-sample-etc-ovn-ovsdbserver-sb-0": the object has been modified; please apply your changes to the latest version and try again
The goal of this ticket is to identify the cause and fix it. One option may be to add a dummy preStop hook (perhaps just a sleep) for pod-0 as well, so that it does not terminate immediately.
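The preStop idea could look roughly like the sketch below. It is only an illustration under stated assumptions, not the operator's actual hook: the function names, the `PRESTOP_DELAY` variable and its 10-second default, and the `CTL_FILE` and `DB_NAME` variables are all hypothetical. The one real command used is `ovs-appctl ... cluster/leave`.

```shell
#!/bin/sh
# Hypothetical preStop sketch: pod-0 only waits, other pods leave the cluster.

# Extract the StatefulSet ordinal from a pod name such as "ovsdbserver-nb-1".
pod_ordinal() {
    echo "${1##*-}"
}

# Decide what the hook should do for a given pod name.
prestop_action() {
    if [ "$(pod_ordinal "$1")" = "0" ]; then
        echo "sleep"
    else
        echo "leave"
    fi
}

# Run the chosen action.
# Assumptions: CTL_FILE is the ovsdb-server control socket path and DB_NAME is
# the database name; both must be set by the caller. PRESTOP_DELAY is a
# made-up knob giving pod-0's delay in seconds.
run_prestop() {
    case "$(prestop_action "$1")" in
        sleep) sleep "${PRESTOP_DELAY:-10}" ;;
        leave) ovs-appctl -t "$CTL_FILE" cluster/leave "$DB_NAME" ;;
    esac
}
```

A hook script would then end with something like `run_prestop "$(hostname)"`. Whether a plain delay on pod-0 is enough for the other members to finish leaving is exactly what this ticket needs to confirm.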
UPD: Note that a workaround that bumped the kuttl timeout has landed: https://github.com/openstack-k8s-operators/ovn-operator/pull/356. We'll need to revert it in the scope of this issue before closing it (after confirming the gate is stable).
- is caused by FDP-662 Multiple cluster/leave calls can result in a leaderless cluster after a downed member returns (Closed)
- links to