-
Bug
-
Resolution: Not a Bug
-
Normal
-
None
-
4.13.z, 4.14.z, 4.15.z
-
None
-
Critical
-
No
-
Rasputin OLM Sprint 252
-
1
-
Rejected
-
False
-
Description of problem:
The certified-operators catalog source pod crashed, but OLM did not remove it automatically because its `status.phase` still reported `Running`.
jiazha-mac:~ jiazha$ oc get pods
NAME                                   READY   STATUS             RESTARTS        AGE
...
certified-operators-jxnpp              0/1     CrashLoopBackOff   35 (47s ago)    5d14h
community-operators-n55vz              1/1     Running            1               5d14h
marketplace-operator-fc999f7db-p8wgs   1/1     Running            2 (154m ago)    5d15h
redhat-marketplace-45mcm               1/1     Running            1               5d14h
redhat-operators-mpvzm                 1/1     Running            1               5d14h
jiazha-mac:~ jiazha$ omg get pods certified-operators-jxnpp -o yaml |grep phase
f:phase: {}
phase: Running
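So `pod.Status.Phase` should not be the only signal used to judge the pod's health: as shown above, a pod in `CrashLoopBackOff` still reports the phase `Running`, and the current check only acts on `Failed` or `Unknown`: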
// podFailed checks whether the pod status is in a failed or unknown state, and deletes the pod if so.
func (c *GrpcRegistryReconciler) podFailed(pod *corev1.Pod) (bool, error) {
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
		logrus.WithField("UpdatePod", pod.GetName()).Infof("catalog polling result: update pod %s failed to start", pod.GetName())
		err := c.removePods([]*corev1.Pod{pod}, pod.GetNamespace())
		if err != nil {
			return true, errors.Wrapf(err, "error deleting failed catalog polling pod: %s", pod.GetName())
		}
		return true, nil
	}
	return false, nil
}
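For illustration only, the kind of check this report is asking for could look at container state rather than the pod phase. The sketch below is not OLM code and not a proposed patch: the struct types are simplified local stand-ins for the corresponding `k8s.io/api/core/v1` types (reduced to the fields used here so the example runs without dependencies), and `crashLooping` and the container name are hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for k8s.io/api/core/v1 types (assumption: only the
// fields needed for this sketch are reproduced).
type ContainerStateWaiting struct{ Reason string }

type ContainerState struct{ Waiting *ContainerStateWaiting }

type ContainerStatus struct {
	Name         string
	RestartCount int32
	State        ContainerState
}

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// crashLooping (hypothetical helper) reports whether any container in the pod
// is currently waiting with reason CrashLoopBackOff, regardless of pod phase.
func crashLooping(status PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	// Mirrors the situation in this report: phase is Running, yet the
	// container is in CrashLoopBackOff after 35 restarts.
	status := PodStatus{
		Phase: "Running",
		ContainerStatuses: []ContainerStatus{{
			Name:         "example-container", // hypothetical name
			RestartCount: 35,
			State:        ContainerState{Waiting: &ContainerStateWaiting{Reason: "CrashLoopBackOff"}},
		}},
	}
	fmt.Printf("phase=%s crashLooping=%v\n", status.Phase, crashLooping(status))
}
```

A phase-only check treats this pod as healthy, while the container-state check flags it. Whether OLM should delete such a pod automatically is a separate design question that this sketch does not settle.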
Version-Release number of selected component (if applicable):
4.15.2
How reproducible:
always
Steps to Reproduce:
1. Install an OCP cluster.
2. Make a catalog source pod crash.
3. Check whether the crashed pod is removed.
Actual results:
The crashed pod is still there.
Expected results:
The crashed pod should be removed automatically.
Additional info:
The must-gather log: https://drive.google.com/file/d/16_tFq5QuJyc_n8xkDFyK83TdTkrsVFQe/view?usp=drive_link
- is related to
-
OCPBUGS-31391 The certified operator crash due to computed digest is different from the cache digest
-
- Closed
-