-
Bug
-
Resolution: Won't Do
-
Undefined
-
None
-
4.13
-
None
-
False
-
Description of problem:
In this job, go to the events and filter on Namespace=openshift-marketplace, Message="Startup probe", and Reason="Kill" – you will see that several of the redhat-operators and community-operators pods were killed.
These are the time ranges for the initial installation and the e2e test:
01:02:47 Running step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial.
01:51:17 Step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial succeeded after 48m30s.
01:51:17 Running step e2e-gcp-ovn-upgrade-openshift-e2e-test.
04:14:08 Step e2e-gcp-ovn-upgrade-openshift-e2e-test failed after 2h22m50s.
These are the pods that stayed until the end of the test job (taken by looking at KAAS on this job):
$ oc -n openshift-marketplace get po
NAME                                                              READY   STATUS      RESTARTS
35bf84980eae616415d313c531c679e41675267a04ae718fdcc880c71598nhk   0/1     Completed   0
certified-operators-l5qlh                                         1/1     Running     0
community-operators-xwx2z                                         1/1     Running     0          2:57:50 - 3:53:25
marketplace-operator-66b7b88fc5-mlpsk                             1/1     Running     0
redhat-marketplace-xssx8                                          1/1     Running     0
redhat-operators-82xjt                                            0/1     Running     0          4:16:56 - 4:17:27
redhat-operators-wkfxj                                            1/1     Running     0          3:03:45 - 3:04:26
Looking at community-operators-xwx2z, you can see it was around from 2:57:50 - 3:53:25.
During that time, the events file shows that these pods were started and killed:
community-operators-dczkb   3:07:22 - 3:07:55
community-operators-jfk9b   3:17:40 - 3:18:12
community-operators-sm5fc   3:28:31 - 3:28:53
community-operators-gf6n7   3:41:42 - 3:42:04
community-operators-2jq2g   3:52:06 - 3:52:38
The startupProbe for community-operators-xwx2z looks like:
startupProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 15
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
terminationMessagePath: /
Since none of those community-operators pods lived long enough to accumulate 15 startup probe failures (at periodSeconds: 10, that would require roughly 150 seconds of consecutive failures), the pods are probably being killed by some operator rather than by the kubelet.
The repeated scheduling and killing seems unnecessary.
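To make the argument concrete, a small sketch below computes each killed pod's lifetime from the start/kill times quoted above and compares it against the time the kubelet would need before acting on the startup probe (failureThreshold × periodSeconds = 150s). The pod names and timestamps are taken from the events listed earlier; the 150-second budget is derived from the startupProbe spec shown above.

```python
from datetime import datetime

# Start/kill times for the short-lived community-operators pods,
# taken from the events file quoted above.
pods = {
    "community-operators-dczkb": ("3:07:22", "3:07:55"),
    "community-operators-jfk9b": ("3:17:40", "3:18:12"),
    "community-operators-sm5fc": ("3:28:31", "3:28:53"),
    "community-operators-gf6n7": ("3:41:42", "3:42:04"),
    "community-operators-2jq2g": ("3:52:06", "3:52:38"),
}

# The kubelet only restarts a container after failureThreshold
# consecutive probe failures, one every periodSeconds.
failure_threshold = 15
period_seconds = 10
probe_budget_s = failure_threshold * period_seconds  # 150 seconds

fmt = "%H:%M:%S"
for name, (start, end) in pods.items():
    lifetime = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Every pod lived far less than the 150s the startup probe needs
    # to trigger a kill, so the probe cannot be what killed them.
    print(f"{name}: alive {lifetime.seconds}s (budget {probe_budget_s}s)")
```

Each pod lived only 22–33 seconds, well under the 150-second probe budget, which supports the conclusion that something other than the startup probe is deleting them.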
Version-Release number of selected component (if applicable):
4.13
How reproducible:
seems to happen often
Steps to Reproduce:
1. Run the periodic jobs
2.
3.
Actual results:
community-operators and redhat-operators pods are repeatedly scheduled and killed
Expected results:
The correct number of pods is created, with no further repeated scheduling and killing of additional pods
Additional info: