OpenShift Bugs / OCPBUGS-52422

OPM no longer prunes metadata from non-channel heads


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.13.z, 4.12.z, 4.11.z, 4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.z
    • Component/s: OLM
    • Severity: Important

      Description of problem:

      When the OLM team introduced the SQLite database and catalog pods as a replacement for the legacy AppRegistry feature in Quay, one of the reasons was that the AppRegistry API was not scaling to the amount of content being served and the number of requests being made.
      
      To help solve this problem, the OLM team designed APIs for bundle images and catalog images, where catalog images would be lightweight indexes over all of the available bundles. Put another way, bundle contents no longer had to be included in the catalog-level APIs.
      
      As part of this effort, however, catalogs still included bundle objects for each channel head. This is important for on-cluster OLM to be able to serve the packagemanifests API, which provides catalog discoverability information at the package and channel level.
      
      Because the packagemanifests API did not provide information about individual bundles, the OLM team specifically designed opm to prune these objects out of the catalog once they were no longer necessary to serve the packagemanifests API. Specifically, the design was to remove a bundle's bundle objects as soon as that bundle is replaced by a new channel head.
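      
      For reference, which bundles in a catalog still carry these objects can be checked with `opm render` piped through `jq`; the command below is only a sketch, and the catalog image reference is illustrative:
      
          # Show, for every bundle in the catalog, how many olm.bundle.object
          # properties it carries (once pruning works as designed, only channel
          # heads should have non-zero counts).
          opm render registry.redhat.io/redhat/redhat-operator-index:v4.18 \
            | jq 'select(.schema == "olm.bundle")
                  | {name, objects: ([.properties[] | select(.type == "olm.bundle.object")] | length)}'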
      
      It has come to our attention that this pruning mechanism was accidentally disabled when bundles are added via opm's replaces mode. This bug was introduced in https://github.com/operator-framework/operator-registry/pull/571.
      
      Any catalog pipeline that is using a version of opm that contains that PR is now skipping the prune step for replaces-mode bundle additions (note: semver and semver-skippatch additions still trigger a database-wide prune step).
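      
      For context, the add mode is selected with the `--mode` flag on `opm index|registry add` (replaces is the default). A minimal sketch, with illustrative bundle image references:
      
          # replaces mode (the default): affected by this bug, the prune step is skipped
          opm registry add -b quay.io/example/my-operator-bundle:v1.1.0 --mode=replaces
      
          # semver and semver-skippatch additions still trigger a database-wide prune
          opm registry add -b quay.io/example/my-operator-bundle:v1.1.0 --mode=semver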
      
      Up until direct FBC management was implemented by Konflux, the redhat-operators catalog pipeline supported _only_ replaces mode, which means that the redhat-operators catalog has dramatically increased in size in the intervening years.
      
      The community catalog pipelines, however, implemented support for semver and semver-skippatch mode very quickly after opm added support. The first community operator opted in in Dec 2020: https://github.com/redhat-openshift-ecosystem/community-operators-prod/commit/870c7b89ac18d0577ed990296a81cd48086de908
      
      The certified catalog pipeline had been exclusively using replaces mode additions until July 2024, when the vpc-operator switched to semver mode: https://github.com/redhat-openshift-ecosystem/certified-operators/pull/4232
      
      So as of today:
      - The community catalogs are likely seeing enough activity on semver mode additions that they are being periodically pruned
      - The certified catalog has only one semver mode bundle, so that catalog is only pruned as often as that package is being changed.
      - The redhat-operators catalog is not being pruned
      
      Over the years that this bug has been quietly allowing catalogs to grow faster than expected, the OLM team has been dealing with many performance-related issues:
      - FBC-based catalog pod startup consumed so much memory and CPU that the pods were unable to complete their startup sequence before liveness/readiness probes failed. The first attempt to resolve this was to add a startupProbe configuration to catalog pods (https://github.com/operator-framework/operator-lifecycle-manager/pull/2791)
      - Even with the startup probe, the catalog pods still required too much memory and CPU at startup, so the OLM team quickly implemented a caching feature that reduced (but did not eliminate) the startup overhead. The tradeoff was that the cache needed to be included in the image, roughly doubling the image size.
      - Hypershift needed to implement a complex solution for running OLM's default catalogs on the management host to avoid saddling customers' guest clusters with resource intensive catalog pods.
      - It became clear to the OLM team that there were major issues with the actual size of the catalog contents. Downloading, parsing, rendering, and querying catalogs had noticeably poor performance. Using pprof to analyze where the most CPU and memory allocations were coming from, it became clear that the `olm.bundle.object` properties were the primary cause of the performance issues. So the OLM team replaced that property with `olm.csv.metadata`, which led to a dramatic improvement: it reduced the size of the metadata stored on a bundle and dramatically increased performance (a 98x improvement on an 8-core machine). A sketch for gauging the relative size of these two property types follows this list.
      - Yet _still_, the fact that bundle pruning had stopped happening was not caught by the OLM team or by QE.
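      
      As a rough way to gauge how much of a rendered catalog's size comes from `olm.bundle.object` versus `olm.csv.metadata` properties, something along these lines can be used (catalog image reference is illustrative, and the jq expression is only a sketch, not a supported tool):
      
          # Sum the serialized size of each metadata property type across all bundles.
          opm render registry.redhat.io/redhat/redhat-operator-index:v4.18 \
            | jq -s '[ .[] | select(.schema == "olm.bundle") | .properties[]
                       | select(.type == "olm.bundle.object" or .type == "olm.csv.metadata") ]
                     | group_by(.type)
                     | map({type: .[0].type, bytes: (map(.value | tostring | length) | add)})'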

      Version-Release number of selected component (if applicable):

      v4.18    

      How reproducible:

      Always, when using `replaces` mode with `opm index|registry add`    

      Steps to Reproduce:

          1. Create and push two bundle images A and B, where B replaces A, and both are in the same channel
          2. Run `opm registry add -b A`
          3. Run `opm registry add -b B`
          4. Run `opm render bundles.db | jq 'select(.schema == "olm.bundle" and ([.properties[] | select(.type == "olm.bundle.object")] | length) > 0).name'`

      Actual results:

      Output of opm render command is: 
          "A"
          "B"

      Expected results:

      Output of opm render command is:
          "B"

      Additional info:

      I did some analysis to show the huge opportunity cost we are paying (and customers are paying) by allowing the catalog metadata to accrue without pruning.
      
      https://docs.google.com/spreadsheets/d/1k9cF9w_yxIRErZUSh5gLtFbDmwgfFbzXb_ytXb2LP5Y/edit?gid=0#gid=0
