-
Bug
-
Resolution: Done
-
Major
-
None
-
4.10.z
-
None
-
Rejected
-
False
-
Description of problem:
When a pod is stuck in Terminating and the node it belonged to no longer exists, ovnkube-master becomes stuck in a loop.
The exact cause of how the pod ended up in this state is unknown; however, a likely sequence is:
- A pod was given a finalizer of 'test'
- The pod was deleted but the finalizer 'test' was not removed
- The node the pod was running on was cordoned
- The node was deleted via machine/node health checks
- The pod is now stuck in Terminating and the node does not exist. (This is the state the cluster was in)
OVN then iterates over all pods to configure them. However, the error handling in the for loop returns as soon as any error occurs, instead of processing the remaining pods and collecting all the errors.
As a result, a single pod prevented the entire OpenShift control plane from coming up.
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/pods.go#L44
To summarise, line 44 should not return an error until the end of the for loop.
However, Tim Rozet has indicated that some refactoring may also make this code block redundant.
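The suggested behaviour could look roughly like the sketch below. It is not the ovn-kubernetes code: `pod`, `configurePod`, and `configureAll` are hypothetical names used only for illustration, and it assumes Go 1.20 or later for `errors.Join`. The loop records each failure and carries on, returning the combined error only after every pod has been visited.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical stand-in for the real Kubernetes pod object.
type pod struct {
	Name     string
	NodeName string
}

// configurePod is a hypothetical per-pod setup step. It fails when the
// node the pod was scheduled on is no longer known.
func configurePod(p pod, nodes map[string]bool) error {
	if !nodes[p.NodeName] {
		return fmt.Errorf("pod %s: node %s not found", p.Name, p.NodeName)
	}
	return nil
}

// configureAll visits every pod. A failure is recorded and the loop
// continues; the joined error is returned only after the loop ends.
func configureAll(pods []pod, nodes map[string]bool) ([]string, error) {
	var configured []string
	var errs []error
	for _, p := range pods {
		if err := configurePod(p, nodes); err != nil {
			errs = append(errs, err)
			continue
		}
		configured = append(configured, p.Name)
	}
	// errors.Join returns nil when errs is empty.
	return configured, errors.Join(errs...)
}

func main() {
	nodes := map[string]bool{"node-a": true}
	pods := []pod{
		{Name: "stuck", NodeName: "node-gone"}, // node was deleted
		{Name: "healthy-1", NodeName: "node-a"},
		{Name: "healthy-2", NodeName: "node-a"},
	}
	configured, err := configureAll(pods, nodes)
	fmt.Println("configured:", configured)
	fmt.Println("errors:", err)
}
```

With this shape, the stuck pod still surfaces as an error, but the pods listed after it are configured rather than skipped.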
- impacts account
-
OCPBUGS-15788 OVN silently failing in case of a running pod
- Closed
-
OCPBUGS-24343 OVN silently failing in case of a running pod
- Closed
- is cloned by
-
OCPBUGS-4554 [4.12] OVN silently failing in case of a stuck pod
- Closed
- is depended on by
-
OCPBUGS-3739 Pod stuck in containerCreating state when the node on which it is running is Terminated
- Closed
-
OCPBUGS-4554 [4.12] OVN silently failing in case of a stuck pod
- Closed
- links to