OpenShift Bugs / OCPBUGS-4433

operator pods in openshift-marketplace namespace are repeatedly created and killed

      Description of problem:

      In this job, go to Events and filter on Namespace=openshift-marketplace, Message="Startup probe", and Reason="Kill"; you will see several of the redhat-operators and community-operators pods that were killed.
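
      On a live cluster, a roughly equivalent query can be run with oc; this is a sketch that assumes the kubelet's standard Killing and Unhealthy event reasons rather than the exact filter strings above:

      # Kill events for pods in the marketplace namespace.
      $ oc -n openshift-marketplace get events --field-selector reason=Killing
      # Startup probe failures are reported as Unhealthy events.
      $ oc -n openshift-marketplace get events --field-selector reason=Unhealthy | grep "Startup probe"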

      These are the time ranges for the initial installation and the e2e test:

      01:02:47 Running step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial.
      01:51:17 Step e2e-gcp-ovn-upgrade-ipi-install-install-stableinitial succeeded after 48m30s.
      
      01:51:17 Running step e2e-gcp-ovn-upgrade-openshift-e2e-test.
      04:14:08 Step e2e-gcp-ovn-upgrade-openshift-e2e-test failed after 2h22m50s.
      

      These are the pods that stayed until the end of the test job (taken by looking at KAAS on this job):

      $ oc -n openshift-marketplace get po
      NAME                                                              READY   STATUS      RESTARTS
      35bf84980eae616415d313c531c679e41675267a04ae718fdcc880c71598nhk   0/1     Completed   0
      certified-operators-l5qlh                                         1/1     Running     0
      community-operators-xwx2z                                         1/1     Running     0         2:57:50 - 3:53:25
      marketplace-operator-66b7b88fc5-mlpsk                             1/1     Running     0
      redhat-marketplace-xssx8                                          1/1     Running     0
      redhat-operators-82xjt                                            0/1     Running     0         4:16:56 - 4:17:27
      redhat-operators-wkfxj                                            1/1     Running     0         3:03:45 - 3:04:26
      

      Looking at community-operators-xwx2z, you can see it was around from 2:57:50 - 3:53:25.
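
      On a live cluster, the pod start times can be read directly from metadata (a sketch using standard custom columns; the end times would come from the corresponding kill events):

      $ oc -n openshift-marketplace get po \
          -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp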

      During that time, you can see in the events file that these pods were started and killed:

      community-operators-dczkb 3:07:22 - 3:07:55
      community-operators-jfk9b 3:17:40 - 3:18:12
      community-operators-sm5fc 3:28:31 - 3:28:53
      community-operators-gf6n7 3:41:42 - 3:42:04
      community-operators-2jq2g 3:52:06 - 3:52:38
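
      Those start/kill times can be pulled out of the CI events artifact with jq; a minimal sketch, assuming an events.json file in the standard Kubernetes EventList format:

      $ jq -r '.items[]
          | select(.metadata.namespace == "openshift-marketplace")
          | select((.involvedObject.name // "") | startswith("community-operators"))
          | "\(.involvedObject.name)  \(.reason)  \(.firstTimestamp)"' events.json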
      

      The startupProbe for community-operators-xwx2z looks like:

          startupProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /
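
      The same probe command can be run by hand against one of the still-running catalog pods to confirm the gRPC endpoint is serving (a sketch reusing the pod name from above):

      $ oc -n openshift-marketplace exec community-operators-xwx2z -- grpc_health_probe -addr=:50051
      status: SERVING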
      

      With periodSeconds: 10 and failureThreshold: 15, the kubelet would only kill a container after roughly 150 seconds of consecutive startup probe failures. Each of the community-operators pods above existed for only about 20-35 seconds, far fewer than 15 probe periods, so the pods are probably being killed by some operator rather than by the kubelet.

      The repeated scheduling and killing seems unnecessary.
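
      One way to confirm which controller is creating and deleting these pods (a suggestion, not something done in this report) is to look at the owner references of a catalog pod and at the events recorded against one of the short-lived pods:

      # Owner references of the long-lived catalog pod.
      $ oc -n openshift-marketplace get po community-operators-xwx2z \
          -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
      # Events recorded against one of the short-lived pods.
      $ oc -n openshift-marketplace get events --field-selector involvedObject.name=community-operators-dczkb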

      Version-Release number of selected component (if applicable):

      4.13
      

      How reproducible:

      seems to happen often
      

      Steps to Reproduce:

      1. Run the periodic jobs
      

      Actual results:

      community-operators and redhat-operators pods are repeatedly scheduled and killed
      

      Expected results:

      Only the expected number of catalog pods is created, and there is no further repeated scheduling and killing of additional pods.
      

      Additional info:

      
      
