OpenShift Bugs / OCPBUGS-58070

High latency etcd disk writes due to openshift-marketplace pods/OLM


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Version: 4.14.z
    • Component: OLM
    • Quality / Stability / Reliability
    • Severity: Critical
    • Rejected
    • Sprint: Mewtwo Sprint 273

      Description of problem:

       

       

      Platform: ARO
      OCP Version: 4.16.37

      ARO SRE have found that openshift-marketplace pods running on a master node appear to cause disk I/O contention. The following symptoms were observed in one customer cluster:

      - On master-2, a few openshift-marketplace pods were spiking CPU usage, noticeably higher than kube-apiserver or the other usual top CPU consumers.
      - On master-2, etcd request latencies are as high as > 1 s to ~9 s.
      - On master-2, the VM disk queue length and I/O bandwidth are higher than average.

      We suspect this may be a regression of the fix for
      [OCPBUGS-48697] OLMv0: excessive catalog source snapshots cause severe performance regression [openshift-4.15.z] - Red Hat Issue Tracker

      Please investigate or help us rule this out (a Prometheus query sketch that can surface these symptoms follows below). The SRE team needs OLM expertise to confirm whether this bug exists in the customer's cluster.
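      For reference, a minimal sketch of how these symptoms could be pulled from the in-cluster Prometheus. It assumes a reachable prometheus-k8s route and a bearer token with monitoring access (both placeholders, e.g. from oc whoami -t); the metric names are the stock cAdvisor/etcd metrics, nothing OLM-specific, and label sets may differ per cluster.

# Minimal sketch: correlate openshift-marketplace CPU usage with etcd latency
# and disk fsync times on the masters. PROM_URL and TOKEN are placeholders
# (hypothetical environment variables), not part of the bug report.
import os
import requests

PROM_URL = os.environ["PROM_URL"]  # e.g. the prometheus-k8s route of the cluster
TOKEN = os.environ["TOKEN"]        # e.g. output of: oc whoami -t

QUERIES = {
    # CPU (cores) used per openshift-marketplace pod.
    "marketplace_cpu": 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-marketplace"}[5m]))',
    # p99 etcd WAL fsync latency per member; sustained values far above ~10ms
    # usually point at disk contention.
    "etcd_wal_fsync_p99": 'histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))',
    # p99 etcd unary gRPC handling time, the signal behind etcdGRPCRequestsSlow.
    "etcd_grpc_p99": 'histogram_quantile(0.99, sum by (instance, le) (rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_type="unary"}[5m])))',
}

def instant_query(expr):
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": expr},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # demo only; point at the cluster CA bundle in real use
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(f"== {name} ==")
        for series in instant_query(expr):
            labels = {k: v for k, v in series["metric"].items() if k != "__name__"}
            print(f"  {labels}: {float(series['value'][1]):.3f}")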
      

       

      Version-Release number of selected component (if applicable):

          OCP 4.16.37 (ARO)

      How reproducible:

      100%

      Steps to Reproduce:

          1. Install an ARO cluster with version 4.16.37.
          2. Wait for some time; optionally install operators and put average load on etcd, or do anything else that simulates realistic cluster and OLM usage.
          3. In the OpenShift console, go to the Alerting page and observe the etcdGRPCRequestsSlow alert.
          4. The etcdGRPCRequestsSlow alert should be flipping between pending and inactive, or firing.
          5. Grab the alert's query, run it against Prometheus (see the sketch after these steps), and observe the etcd latencies.
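          For step 5, a minimal sketch (same placeholder PROM_URL/TOKEN assumptions as in the description) that looks up the etcdGRPCRequestsSlow rule via the Prometheus rules API, prints its state, and re-runs its expression so the etcd latencies behind the alert can be inspected directly.

# Minimal sketch for step 5: fetch the etcdGRPCRequestsSlow alerting rule and
# evaluate its expression. PROM_URL and TOKEN are placeholders as above.
import os
import requests

PROM_URL = os.environ["PROM_URL"]
TOKEN = os.environ["TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def find_rule(name):
    # Walk all rule groups exposed by Prometheus and return the first rule
    # whose name matches (alerting rules carry "state" and "query" fields).
    data = requests.get(f"{PROM_URL}/api/v1/rules", headers=HEADERS, verify=False).json()
    for group in data["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("name") == name:
                return rule
    return None

rule = find_rule("etcdGRPCRequestsSlow")
if rule is None:
    print("etcdGRPCRequestsSlow rule not found")
else:
    print("state:", rule["state"])  # inactive / pending / firing
    print("expr: ", rule["query"])
    # If the expression gates on a threshold (as alert expressions typically do),
    # only the instances currently violating it are returned here.
    result = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": rule["query"]},
        headers=HEADERS,
        verify=False,  # demo only
    ).json()["data"]["result"]
    for series in result:
        print(series["metric"], "->", series["value"][1], "s")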

      Actual results:

    etcd request latencies are > 1 s and can even reach 5 s to 9 s.

      Expected results:

    etcd latencies should be < 1 s, and the etcdGRPCRequestsSlow alert should be neither pending nor firing.

      Additional info:

      Must-gather (MG) link: https://attachments.access.redhat.com/hydra/rest/cases/04179961/attachments/08b1bb49-3b20-4cbc-b212-94fd3facb1f5?usePresignedUrl=true

              Assignee: Jordan Keister (rh-ee-jkeister)
              Reporter: Jose Gavine Cueto (jcueto@redhat.com)
              QA Contact: Kui Wang