OpenShift Bugs / OCPBUGS-4433

operator pods in openshift-marketplace namespace are repeatedly created and killed

      Description of problem:

      In this job, go to Events and filter on Namespace=openshift-marketplace, Message="Startup probe", and Reason="Kill"; you will see several of the redhat-operators and community-operators pods that were killed.
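
      On a live cluster, a roughly equivalent query can be run with oc; this is a sketch that assumes the kubelet's standard Killing and Unhealthy event reasons rather than the exact filter strings above:

      # Kill events for pods in the marketplace namespace.
      $ oc -n openshift-marketplace get events --field-selector reason=Killing
      # Startup probe failures are reported as Unhealthy events.
      $ oc -n openshift-marketplace get events --field-selector reason=Unhealthy | grep "Startup probe"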

      These are the time ranges for the initial installation and the e2e test:

      01:02:47 Running step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial.
      01:51:17 Step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial succeeded after 48m30s.
      
      01:51:17 Running step e2e-gcp-ovn-upgrade-openshift-e2e-test.
      04:14:08 Step e2e-gcp-ovn-upgrade-openshift-e2e-test failed after 2h22m50s.
      

      These are the pods that stayed until the end of the test job (taken by looking at KAAS on this job):

      $ oc -n openshift-marketplace get po
      NAME                                                              READY   STATUS      RESTARTS
      35bf84980eae616415d313c531c679e41675267a04ae718fdcc880c71598nhk   0/1     Completed   0
      certified-operators-l5qlh                                         1/1     Running     0
      community-operators-xwx2z                                         1/1     Running     0         2:57:50 - 3:53:25
      marketplace-operator-66b7b88fc5-mlpsk                             1/1     Running     0
      redhat-marketplace-xssx8                                          1/1     Running     0
      redhat-operators-82xjt                                            0/1     Running     0         4:16:56 - 4:17:27
      redhat-operators-wkfxj                                            1/1     Running     0         3:03:45 - 3:04:26
      

      Looking at community-operators-xwx2z, you can see it was around from 2:57:50 - 3:53:25.
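
      On a live cluster, the pod start times can be read directly from metadata (a sketch using standard custom columns; the end times would come from the corresponding kill events):

      $ oc -n openshift-marketplace get po \
          -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp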

      During that time, you can see in the events file that these pods were started and killed:

      community-operators-dczkb 3:07:22 - 3:07:55
      community-operators-jfk9b 3:17:40 - 3:18:12
      community-operators-sm5fc 3:28:31 - 3:28:53
      community-operators-gf6n7 3:41:42 - 3:42:04
      community-operators-2jq2g 3:52:06 - 3:52:38
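
      Those start/kill times can be pulled out of the CI events artifact with jq; a minimal sketch, assuming an events.json file in the standard Kubernetes EventList format:

      $ jq -r '.items[]
          | select(.metadata.namespace == "openshift-marketplace")
          | select((.involvedObject.name // "") | startswith("community-operators"))
          | "\(.involvedObject.name)  \(.reason)  \(.firstTimestamp)"' events.json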
      

      The startupProbe for community-operators-xwx2z looks like:

          startupProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /
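
      The same probe command can be run by hand against one of the still-running catalog pods to confirm the gRPC endpoint is serving (a sketch reusing the pod name from above):

      $ oc -n openshift-marketplace exec community-operators-xwx2z -- grpc_health_probe -addr=:50051
      status: SERVING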
      

      With periodSeconds: 10 and failureThreshold: 15, the kubelet would only kill a container after roughly 150 seconds of consecutive startup probe failures. Each of the community-operators pods above existed for only about 20-35 seconds, far fewer than 15 probe periods, so the pods are probably being killed by some operator rather than by the kubelet.

      The repeated scheduling and killing seems unnecessary.
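
      One way to confirm which controller is creating and deleting these pods (a suggestion, not something done in this report) is to look at the owner references of a catalog pod and at the events recorded against one of the short-lived pods:

      # Owner references of the long-lived catalog pod.
      $ oc -n openshift-marketplace get po community-operators-xwx2z \
          -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
      # Events recorded against one of the short-lived pods.
      $ oc -n openshift-marketplace get events --field-selector involvedObject.name=community-operators-dczkb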

      Version-Release number of selected component (if applicable):

      4.13
      

      How reproducible:

      seems to happen often
      

      Steps to Reproduce:

      1. Run the periodic jobs
      

      Actual results:

      community-operators and redhat-operators pods are repeatedly scheduled and killed
      

      Expected results:

      Only the expected number of catalog pods is created, and there is no further repeated scheduling and killing of additional pods.
      

      Additional info:

      
      
