-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.13.z, 4.12.z, 4.11.z, 4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.z
-
None
-
Important
-
Yes
-
Rejected
-
False
-
Description of problem:
When the OLM team introduced the sqlite database and catalog pods as a replacement for the legacy AppRegistry feature in Quay, one of the reasons was that the AppRegistry API was not scaling to the amount of content being served and the number of requests being made. To help solve this problem, the OLM team designed APIs for bundle images and catalog images, where catalog images would be lightweight indexes over all of the available bundles. Put another way, bundle contents no longer had to be included in the catalog-level APIs. As part of this effort, however, the catalog did still include bundle objects for each of the channel heads, which is important for on-cluster OLM to be able to serve the packagemanifests API, which provides catalog discoverability information at the package and channel level. Because the packagemanifests API does not provide information about individual bundles, the OLM team specifically designed opm to prune these objects out of the catalog when they were no longer necessary to fulfill the needs of the packagemanifests API. Specifically, the design was to remove the bundle objects for a bundle when that bundle is replaced by a new bundle as the new channel head.
It has come to our attention that this pruning mechanism was accidentally disabled when bundles are added via opm's replaces mode. This bug was introduced in https://github.com/operator-framework/operator-registry/pull/571. Any catalog pipeline that is using a version of opm that contains that PR is now skipping the prune step for replaces mode bundle additions (note: semver and semver-skippatch additions still trigger a database-wide prune step).
Until direct FBC management was implemented by Konflux, the redhat-operators catalog pipeline supported _only_ replaces mode, which means that the redhat-operators catalog has dramatically increased in size in the intervening years. The community catalog pipelines, however, implemented support for semver and semver-skippatch mode very quickly after opm added support; the first community operator opted in in Dec 2020: https://github.com/redhat-openshift-ecosystem/community-operators-prod/commit/870c7b89ac18d0577ed990296a81cd48086de908. The certified catalog pipeline had been exclusively using replaces mode additions until July 2024, when the vpc-operator switched to semver mode: https://github.com/redhat-openshift-ecosystem/certified-operators/pull/4232
So as of today:
- The community catalogs are likely seeing enough activity on semver mode additions that they are being periodically pruned.
- The certified catalog has only one semver mode bundle, so that catalog is only pruned as often as that package changes.
- The redhat-operators catalog is not being pruned.
Over the years that this bug has been quietly allowing catalogs to grow faster than expected, the OLM team has been dealing with many performance-related issues:
- FBC-based catalog pod startup required so much memory and CPU that the pods were unable to complete their startup sequence before their liveness/readiness probes failed. The first attempt to resolve this issue was to add a startupProbe configuration to catalog pods (https://github.com/operator-framework/operator-lifecycle-manager/pull/2791).
- Even with the startup probe, the catalog pods were still requiring too much memory and CPU at startup, so the OLM team quickly implemented a caching feature that reduced (but did not eliminate) the memory and CPU overhead of catalog pod startup. The tradeoff was that the cache needed to be included in the image, roughly doubling the image size.
- Hypershift needed to implement a complex solution for running OLM's default catalogs on the management host to avoid saddling customers' guest clusters with resource-intensive catalog pods.
- It became clear to the OLM team that there were major issues with the actual size of the catalog contents: downloading, parsing, rendering, and querying catalogs all had noticeably poor performance. Using pprof to analyze where the most CPU and memory allocations were coming from, it became clear that the `olm.bundle.object` properties were the primary cause of the performance issues. So the OLM team replaced that property with `olm.csv.metadata`, which led to a dramatic improvement: it reduced the size of the metadata stored on a bundle and dramatically increased performance (a 98x improvement on an 8-core machine).
- Yet _still_, the fact that bundle pruning had stopped happening was not caught by the OLM team or by QE.
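As an illustration of how the missing prune shows up in a rendered catalog, the sketch below counts, per package, how many bundles still carry `olm.bundle.object` (or the newer `olm.csv.metadata`) properties; with pruning working as designed, only channel heads should carry them. The catalog image reference is an example, not part of this report:

```
# Illustrative audit (the catalog reference is an assumption; any catalog image
# or sqlite database that `opm render` accepts works the same way). Counts, per
# package, the bundles that still carry channel-head metadata properties.
opm render registry.redhat.io/redhat/redhat-operator-index:v4.18 \
  | jq -r 'select(.schema == "olm.bundle")
           | select(any(.properties[]; .type == "olm.bundle.object" or .type == "olm.csv.metadata"))
           | .package' \
  | sort | uniq -c | sort -rn
```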
Version-Release number of selected component (if applicable):
v4.18
How reproducible:
Always, when using `replaces` mode with `opm index|registry add`
Steps to Reproduce:
1. Create and push two bundle images A and B, where B replaces A, and both are in the same channel
2. Run `opm registry add -b A`
3. Run `opm registry add -b B`
4. Run `opm render bundles.db | jq 'select(.schema == "olm.bundle" and ([.properties[] | select(.type == "olm.bundle.object")] | length) > 0).name'`
Actual results:
Output of opm render command is: "A" "B"
Expected results:
Output of opm render command is: "B"
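For comparison, since semver and semver-skippatch additions still trigger the database-wide prune (per the description above), repeating the same two additions in semver mode should leave only "B" carrying bundle objects. A minimal sketch; the bundle image references and database path are placeholders:

```
# Hypothetical cross-check: the same additions via --mode=semver still prune,
# so only the channel head ("B") should retain olm.bundle.object properties.
opm registry add -d bundles.db --mode=semver -b <bundle-image-A>
opm registry add -d bundles.db --mode=semver -b <bundle-image-B>
opm render bundles.db \
  | jq 'select(.schema == "olm.bundle"
               and ([.properties[] | select(.type == "olm.bundle.object")] | length) > 0).name'
# expected output: "B"
```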
Additional info:
I did some analysis to show the huge opportunity cost we are paying (and customers are paying) by allowing the catalog metadata to accrue without pruning. https://docs.google.com/spreadsheets/d/1k9cF9w_yxIRErZUSh5gLtFbDmwgfFbzXb_ytXb2LP5Y/edit?gid=0#gid=0
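To get a rough local sense of the same cost without the spreadsheet, one can measure how many bytes of a rendered catalog are attributable to `olm.bundle.object` payloads versus the catalog as a whole. A sketch, assuming a catalog image or database reference of your choosing:

```
# Rough size estimate (catalog reference is a placeholder): bytes of
# base64-encoded manifest data held in olm.bundle.object properties,
# compared with the overall size of the rendered catalog.
opm render <catalog-image-or-db> > rendered.json
jq -r 'select(.schema == "olm.bundle")
       | .properties[]
       | select(.type == "olm.bundle.object")
       | .value.data' rendered.json | wc -c
wc -c rendered.json
```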
causes:
- OCPBUGS-33094 redhat-operator needs a lot of CPU resources every 15 minutes (New)
- OCPBUGS-4600 FBC catalog server has high startup time and initial memory usage (Closed)
- OCPBUGS-38944 The redhat-operators catalogsource pod is using high CPU (Closed)
- OCPBUGS-672 Redhat-operators are failing regularly due to startup probe timing out which in turn increases CPU/Mem usage on Master nodes (Closed)
- OCPBUGS-36421 redhat-operators pod experiencing unusually high CPU utilization (Closed)
- OCPBUGS-50952 redhat-operators pod consuming lot of the master node's CPU (Closed)
- OCPBUGS-42005 Catalog Operator pods CPU spike every few minutes (Closed)
- OPECO-2575 Investigate opm memory usage for hypershift (Closed)
is related to:
- OCPBUGS-48468 OLMv0: excessive catalog source snapshots cause severe performance regression (Verified)
- OCPBUGS-11552 oc-mirror generated file-based catalogs crashloop (Closed)
- OCPBUGS-643 catsrc is not ready due to "compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied" (Closed)
- OCPBUGS-27234 Catalog pod health probes have significant delay, reaching timeout (Closed)
- OCPBUGS-37667 IBM Operator Index Image fails with "cache requires rebuild: cache reports digest as xxx, but computed digest is yyy" (Closed)
- OCPBUGS-31427 The opm way should support all custom index image (Closed)
- OCPBUGS-34488 The index image created by the latest opm will lead the Pod CrashLoopBackOff (Closed)
- OCPBUGS-31391 The certified operator crash due to computed digest is different from the cache digest (Closed)
links to: