Description of problem:
In OCP 4.12, OLM catalog images serve data from a file-based catalog (FBC). During startup, the server reads the FBC and generates a cache on the filesystem so that queries are fast and baseline memory stays low. However, compared to the older sqlite-based servers, startup takes longer and uses more memory because of YAML/JSON unmarshaling and re-structuring of the data to match OLM's gRPC API. As a result, catalog pods allocate memory beyond their requests and fail health probes that expect the pod to become healthy much sooner.
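For context, a minimal sketch of how the registry container in an FBC-based catalog pod is typically launched. The binary path, catalog directory, and cache directory below are assumptions for illustration, not values extracted from the actual index image; the relevant point is that the cache is built under --cache-dir at startup.

containers:
- name: registry-server
  image: registry.redhat.io/redhat/redhat-operator-index:v4.12
  command: ["/bin/opm"]        # assumed binary path
  args:
  - serve
  - /configs                   # assumed location of the file-based catalog in the image
  - --cache-dir=/tmp/cache     # cache is (re)built here on startup, driving the memory spike
  ports:
  - containerPort: 50051       # OLM registry gRPC port
    name: grpc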
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Simultaneously create catalog sources in the openshift-marketplace namespace using the images "registry.redhat.io/redhat/redhat-operator-index:v4.12" and "registry.redhat.io/redhat/redhat-operator-index:v4.10" (example manifests are sketched below).
2. Observe the catalog source status and note that startup time with the 4.12 image is noticeably longer than with the 4.10 image (which is sqlite-based).
3. Observe memory usage of the catalog pods and note that the 4.12 catalog pod has a large memory spike during startup that the 4.10 pod does not have.
   - Prometheus query: sum (container_memory_rss{namespace='openshift-marketplace',container="",pod!="",pod=~".*-catalog.*"}) by (pod)
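Example CatalogSource manifests for step 1. The metadata names and display names are placeholders chosen for illustration:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators-v4-12   # placeholder name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.12
  displayName: Red Hat Operators v4.12
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators-v4-10   # placeholder name
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.10
  displayName: Red Hat Operators v4.10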
Actual results:
There is a startup time and memory spike regression in the 4.12 images.
Expected results:
There is no (or at least a very limited) startup time and memory spike regression in the 4.12 images.
Additional info:
In standard OCP clusters, OLM has already resolved the health probe issue by introducing a startupProbe that accounts for the potentially longer startup time. However, Hypershift uses custom catalog pod specs and did not inherit this part of the fix (a sketch of such a probe follows).
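For reference, a sketch of the kind of startupProbe OLM applies to registry pods on standard clusters. The failureThreshold and periodSeconds values are illustrative assumptions, not the exact values used by OLM:

startupProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 15   # illustrative; allows roughly 2.5 minutes for cache generation
  periodSeconds: 10      # illustrative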
Issue links:
- Is blocked by: OPRUN-2726 "downstream opm server change to use pre-existent cache" (Closed)
- Is related to: OCPBUGS-5217 "[IBM Z]: Openshift-marketplace pods are in CrashLoopBackOff state" (Closed)