OpenShift Bugs / OCPBUGS-672

redhat-operators pods are failing regularly because the startup probe times out, which in turn increases CPU/memory usage on the master nodes


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 4.11
    • Component: OLM
    • Sprint: Windu 231, X-Files 232

      Description of problem:

      The redhat-operators pod, part of the openshift-marketplace namespace, is failing regularly because the startup probe times out while connecting to the registry-server container in the same pod within 1 second, which in turn increases CPU/memory usage on the master nodes:
      
      62m         Normal    Scheduled                pod/redhat-operators-zb4j7                         Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
      62m         Normal    AddedInterface           pod/redhat-operators-zb4j7                         Add eth0 [10.129.1.112/23] from ovn-kubernetes
      62m         Normal    Pulling                  pod/redhat-operators-zb4j7                         Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
      62m         Normal    Pulled                   pod/redhat-operators-zb4j7                         Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
      62m         Normal    Created                  pod/redhat-operators-zb4j7                         Created container registry-server
      62m         Normal    Started                  pod/redhat-operators-zb4j7                         Started container registry-server
      62m         Warning   Unhealthy                pod/redhat-operators-zb4j7                         Startup probe failed: timeout: failed to connect service ":50051" within 1s
      62m         Normal    Killing                  pod/redhat-operators-zb4j7                         Stopping container registry-server
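
      The same health check the kubelet runs can be exercised by hand to see how long the registry server takes to start answering. A minimal sketch, assuming the grpc_health_probe binary is available in the registry-server container (as the probe configuration below suggests); the pod name is taken from the events above and will differ on a live cluster:

        # Run the gRPC health check manually with a more generous connect timeout
        oc exec -n openshift-marketplace redhat-operators-zb4j7 -c registry-server -- \
          grpc_health_probe -addr=:50051 -connect-timeout=5s

        # Watch the probe-related events for the pod
        oc get events -n openshift-marketplace --field-selector involvedObject.name=redhat-operators-zb4j7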
      
      
      Increasing the probe's timeout might fix the problem. For reference, the liveness and readiness probes on the same registry-server container (both already use timeoutSeconds: 5, while the startup probe gives up after 1s):
        livenessProbe:
          exec:
            command:
            - grpc_health_probe
            - -addr=:50051
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: registry-server
        ports:
        - containerPort: 50051
          name: grpc
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - grpc_health_probe
            - -addr=:50051
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
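
      For comparison, a startupProbe with a more generous timeout could look like the sketch below. The field values (failureThreshold, periodSeconds, timeoutSeconds) are illustrative assumptions, not the values currently shipped by OLM:

        startupProbe:
          exec:
            command:
            - grpc_health_probe
            - -addr=:50051
          failureThreshold: 15   # illustrative: allow several retries before the container is killed
          periodSeconds: 10
          timeoutSeconds: 5      # the current startup probe gives up after 1s, per the event above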

      Version-Release number of selected component (if applicable):

      4.11.0-0.nightly-2022-08-26-162248

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install an OSD cluster using the 4.11.0-0.nightly-2022-08-26-162248 payload
      2. Inspect the redhat-operators pod in the openshift-marketplace namespace
      3. Observe the resource usage (CPU and memory) of the pod (example commands below)
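
      Example commands for steps 2 and 3 (a sketch; the pod name suffix is generated, so look it up first):

        # Find the redhat-operators catalog pod(s)
        oc get pods -n openshift-marketplace | grep redhat-operators

        # Inspect the pod spec, including the probe configuration
        oc get pod <redhat-operators-pod> -n openshift-marketplace -o yaml

        # Observe CPU and memory usage of the marketplace pods
        oc adm top pods -n openshift-marketplace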
      

      Actual results:

      The redhat-operators pod fails regularly during startup, leading to recurring increases in CPU and memory usage on the master nodes.

      Expected results:

      The redhat-operators startup probe succeeds and there are no resource spikes on the master nodes.

      Additional info:

      Attached CPU, memory, and event traces.

       

        1. marketplace-cpu-usage.png (118 kB, Naga Ravi Chaitanya Elluri)
        2. marketplace-memory-usage.png (81 kB, Naga Ravi Chaitanya Elluri)
        3. openshift-marketplace.events (82 kB, Naga Ravi Chaitanya Elluri)
        4. redhat-operator.spec (5 kB, Naga Ravi Chaitanya Elluri)

              Assignee: Daniel Franz (rh-ee-dfranz)
              Reporter: Naga Ravi Chaitanya Elluri (nelluri)
              QA Contact: bruno andrade
