OpenShift Bugs / OCPBUGS-31440

Crashed catalog source pods cannot be removed


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • Affects Versions: 4.13.z, 4.14.z, 4.15.z
    • Component: OLM
    • Severity: Critical
    • Sprint: Rasputin OLM Sprint 252

      Description of problem:

      The certified-operators catalog source pod crashed, but OLM failed to remove it automatically because its `status.phase` still reports `Running`.

      jiazha-mac:~ jiazha$ oc get pods 
      NAME                                                              READY   STATUS             RESTARTS       AGE
      ...
      certified-operators-jxnpp                                         0/1     CrashLoopBackOff   35 (47s ago)   5d14h
      community-operators-n55vz                                         1/1     Running            1              5d14h
      marketplace-operator-fc999f7db-p8wgs                              1/1     Running            2 (154m ago)   5d15h
      redhat-marketplace-45mcm                                          1/1     Running            1              5d14h
      redhat-operators-mpvzm                                            1/1     Running            1              5d14h
       
      jiazha-mac:~ jiazha$ omg get pods certified-operators-jxnpp  -o yaml |grep phase
              f:phase: {}
        phase: Running
      
      

      So we should NOT use `pod.Status.Phase` to judge the pod status: a pod whose container is stuck in `CrashLoopBackOff` still reports the `Running` phase, so the check below never fires.

      https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/pkg/controller/registry/reconciler/grpc.go#L613 

      // podFailed checks whether the pod status is in a failed or unknown state, and deletes the pod if so.
      func (c *GrpcRegistryReconciler) podFailed(pod *corev1.Pod) (bool, error) {
          if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodUnknown {
              logrus.WithField("UpdatePod", pod.GetName()).Infof("catalog polling result: update pod %s failed to start", pod.GetName())
              err := c.removePods([]*corev1.Pod{pod}, pod.GetNamespace())
              if err != nil {
                  return true, errors.Wrapf(err, "error deleting failed catalog polling pod: %s", pod.GetName())
              }
              return true, nil
          }
          return false, nil
      } 
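One possible direction is to inspect the per-container waiting state for a `CrashLoopBackOff` reason instead of relying on the pod-level phase. The sketch below uses simplified local stand-ins for the `corev1` types (hypothetical names, for illustration only; an actual fix would operate on `*corev1.Pod` and its `ContainerStatuses`):

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types involved (illustration only;
// the real types live in k8s.io/api/core/v1).
type ContainerStateWaiting struct{ Reason string }

type ContainerState struct{ Waiting *ContainerStateWaiting }

type ContainerStatus struct{ State ContainerState }

type PodStatus struct {
	Phase             string
	ContainerStatuses []ContainerStatus
}

// podCrashed reports whether any container is stuck in CrashLoopBackOff,
// regardless of the pod-level Phase (which stays "Running" in that case).
func podCrashed(status PodStatus) bool {
	for _, cs := range status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	crashed := PodStatus{
		Phase: "Running", // matches the observed `phase: Running` above
		ContainerStatuses: []ContainerStatus{
			{State: ContainerState{Waiting: &ContainerStateWaiting{Reason: "CrashLoopBackOff"}}},
		},
	}
	healthy := PodStatus{Phase: "Running"}
	fmt.Println(podCrashed(crashed), podCrashed(healthy)) // true false
}
```

A check like this would catch the `certified-operators-jxnpp` pod shown above, whose container is in `CrashLoopBackOff` while the pod phase remains `Running`.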

      Version-Release number of selected component (if applicable):

          4.15.2

      How reproducible:

          always

      Steps to Reproduce:

          1. Install an OCP cluster.
          2. Make a catalog source pod crash.
          3. Check whether the crashed pod is removed.
          

      Actual results:

      The crashed pod is still there.

          

      Expected results:

      The crashed pod should be removed automatically. 

          

      Additional info:

      The must-gather log: https://drive.google.com/file/d/16_tFq5QuJyc_n8xkDFyK83TdTkrsVFQe/view?usp=drive_link  

          

              Assignee: Bryce Palmer (rh-ee-bpalmer)
              Reporter: Jian Zhang (rhn-support-jiazha)