-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.15.z
Description of problem:
If a cluster is put into hibernation via ACM/Hive and, while the cluster is asleep, the digest of any of the four catalogs that ship with OCP changes (i.e. a new bundle is added to the catalog tag, 4.15 in this case), then when the cluster is woken up the pods for the updated catalog(s) go into CrashLoopBackOff and leave the cluster in an unusable state. By unusable, I mean that no other operator subscriptions or catalog sources can be applied to the cluster.
Version-Release number of selected component (if applicable):
OCP 4.15
How reproducible:
See below
Steps to Reproduce:
1. Create a 4.15 cluster with ACM/Hive.
2. Wait for it to become healthy.
3. Put the cluster into hibernation.
4. While the cluster is asleep, add a new bundle to an existing catalog (yes, I understand catalogs are immutable, and this would result in a new catalog) such that the digest changes.
5. Wake the cluster up via ACM/Hive.
6. Either of the following (both yield similar logs/results):
   a. Note that the pods in the `openshift-marketplace` namespace are in CrashLoopBackOff state.
   b. Create a new subscription and note that this fails, since the catalogs are unhealthy.
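The check in step 6 (catalog pods stuck in CrashLoopBackOff) can be scripted. The following is a minimal sketch, assuming the pod list has been saved with `oc get pods -n openshift-marketplace -o json`; the function name and the sample pod and container names are illustrative and not taken from the affected cluster.

```python
from typing import Any, Dict, List, Tuple


def crashlooping_containers(pod_list: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pod name, container name) pairs waiting in CrashLoopBackOff.

    Expects the parsed JSON of a Kubernetes PodList, e.g. the output of
    `oc get pods -n openshift-marketplace -o json`.
    """
    found = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # `state.waiting` is only present while the container is waiting.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((pod_name, cs.get("name", "<unknown>")))
    return found


# Illustrative input only: pod and container names are made up for this sketch.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"name": "certified-operators-abcde"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "registry-server",
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "marketplace-operator-xyz12"},
            "status": {
                "containerStatuses": [
                    {"name": "marketplace-operator", "state": {"running": {}}}
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    for pod_name, container in crashlooping_containers(SAMPLE_POD_LIST):
        print(f"{pod_name}/{container} is in CrashLoopBackOff")
```

On a real cluster the input would come from `json.load()` on the saved command output rather than from the inline sample.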
Actual results:
New catalogs, subscriptions, operators can't be applied to the cluster.
Expected results:
I'd expect the catalogs to be healthy when a cluster wakes up, regardless of whether their digests differ from when the cluster went to sleep.
Additional info:
Code in question (i.e. the code throwing the error): https://github.com/operator-framework/operator-registry/blob/master/pkg/cache/json.go#L181-L194
Log from a certified-operator pod (it could be any catalog pod; we have examples of marketplace as well):
time="2024-04-02T01:01:22Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-04-02T01:01:22Z" level=fatal msg="cache requires rebuild: cache reports digest as \"2e210f20d7ad085a\", but computed digest is \"9d0c54855f748780\""
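To triage many catalog pods at once, the two digests can be pulled out of the fatal log line. This is a minimal sketch, assuming the message format is exactly the one shown in the log above; the function name and regular expression are illustrative and are not part of operator-registry.

```python
import re
from typing import Optional, Tuple

# Matches the fatal message from the log above. The quotes around each digest
# may or may not be backslash-escaped, depending on how the log was captured.
_DIGEST_MISMATCH = re.compile(
    r'cache reports digest as \\?"([0-9a-f]+)\\?", '
    r'but computed digest is \\?"([0-9a-f]+)\\?"'
)


def parse_digest_mismatch(line: str) -> Optional[Tuple[str, str]]:
    """Return (cached digest, computed digest), or None if the line does not match."""
    match = _DIGEST_MISMATCH.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# The fatal line copied from the pod log in this report.
LOG_LINE = r'time="2024-04-02T01:01:22Z" level=fatal msg="cache requires rebuild: cache reports digest as \"2e210f20d7ad085a\", but computed digest is \"9d0c54855f748780\""'

if __name__ == "__main__":
    result = parse_digest_mismatch(LOG_LINE)
    if result is not None:
        cached, computed = result
        print(f"cached={cached} computed={computed}")
```

A line that does not carry the mismatch message, such as the `starting pprof endpoint` line, yields `None`.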
Custom Catalog Subscription:
{"apiVersion":"operators.coreos.com/v1alpha1","kind":"Subscription","metadata":{"creationTimestamp":"2024-04-01T22:35:35Z","generation":1,"labels":{"operators.coreos.com/nginx-ingress-operator.nginx-ingress-operator":""},"managedFields":[{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:labels":{".":{},"f:operators.coreos.com/nginx-ingress-operator.nginx-ingress-operator":{}}}},"manager":"Go-http-client","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:channel":{},"f:name":{},"f:source":{},"f:sourceNamespace":{}}},"manager":"preflight","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:catalogHealth":{},"f:conditions":{},"f:lastUpdated":{}}},"manager":"catalog","operation":"Update","subresource":"status","time":"2024-04-01T22:38:35Z"}],"name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"40266","uid":"b5ee2e64-c43e-4062-bbb8-4a1f68518753"},"spec":{"channel":"alpha","name":"nginx-ingress-operator","source":"nginx-ingress-operator","sourceNamespace":"nginx-ingress-operator"},"status":{"catalogHealth":[{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"39191","uid":"a1e95a54-2302-4cd1-9ad3-1c352c8f1379"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"certified-operators","namespace":"openshift-marketplace","resourceVersion":"38032","uid":"5180bff4-d2e2-45e3-a24e-bb37826feef5"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"community-operators","namespace":"openshift-marketplace","resourceVersion":"38073","uid":"8e2e195a-cb09-42ac-b931-666f798ab68f"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"redhat-marketplace","namespace":"openshift-marketplace","resourceVersion":"38078","uid":"11781d2c-03c0-48bb-b29a-06b9a5e1990f"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"},{"catalogSourceRef":{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","name":"redhat-operators","namespace":"openshift-marketplace","resourceVersion":"38079","uid":"3ea5a9a4-039b-4646-9f47-484e13589e83"},"healthy":true,"lastUpdated":"2024-04-01T22:36:10Z"}],"conditions":[{"message":"[failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.30.209.124:50051: connect: connection refused\", failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.30.243.69:50051: connect: connection refused\"]","reason":"ErrorPreventedResolution","status":"True","type":"ResolutionFailed"},{"lastTransitionTime":"2024-04-01T22:36:10Z","message":"all available catalogsources are healthy","reason":"AllCatalogSourcesHealthy","status":"False","type":"CatalogSourcesUnhealthy"}],"lastUpdated":"2024-04-01T22:38:34Z"}}
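In the Subscription status above, every `catalogHealth` entry reports `healthy: true`, while the `ResolutionFailed` condition names two catalog sources whose gRPC endpoints refuse connections. The sketch below extracts both views from a Subscription object so the discrepancy can be compared directly. It assumes the condition message keeps the `from source <name>/<namespace>:` wording seen above; the function names are illustrative, and the sample is an abbreviated copy of the status above with the error text shortened.

```python
import re
from typing import Any, Dict, List, Tuple

# Pulls "<name>/<namespace>" out of "...from source <name>/<namespace>: ...".
_SOURCE = re.compile(r"from source ([^/\s]+)/([^:\s]+):")


def failed_sources(subscription: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (name, namespace) of catalog sources named in a true ResolutionFailed condition."""
    failed = []
    for cond in subscription.get("status", {}).get("conditions", []):
        if cond.get("type") == "ResolutionFailed" and cond.get("status") == "True":
            failed.extend(_SOURCE.findall(cond.get("message", "")))
    return failed


def reported_healthy(subscription: Dict[str, Any]) -> List[str]:
    """Return names of catalog sources that catalogHealth marks as healthy."""
    health = subscription.get("status", {}).get("catalogHealth", [])
    return [h["catalogSourceRef"]["name"] for h in health if h.get("healthy")]


# Abbreviated from the Subscription in this report; error text is shortened.
SAMPLE_SUBSCRIPTION = {
    "status": {
        "catalogHealth": [
            {"catalogSourceRef": {"name": "certified-operators"}, "healthy": True},
            {"catalogSourceRef": {"name": "redhat-marketplace"}, "healthy": True},
        ],
        "conditions": [
            {
                "type": "ResolutionFailed",
                "status": "True",
                "message": (
                    "[failed to populate resolver cache from source "
                    "certified-operators/openshift-marketplace: failed to list bundles: "
                    "rpc error: code = Unavailable, "
                    "failed to populate resolver cache from source "
                    "redhat-marketplace/openshift-marketplace: failed to list bundles: "
                    "rpc error: code = Unavailable]"
                ),
            },
            {"type": "CatalogSourcesUnhealthy", "status": "False", "message": ""},
        ],
    }
}

if __name__ == "__main__":
    print("failing:", failed_sources(SAMPLE_SUBSCRIPTION))
    print("reported healthy:", reported_healthy(SAMPLE_SUBSCRIPTION))
```

Any source that appears in both lists is reported healthy by `catalogHealth` yet is not actually serving bundles.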
Custom CatalogSource:
{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"creationTimestamp":"2024-04-01T22:35:35Z","generation":1,"managedFields":[{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{".":{},"f:displayName":{},"f:icon":{".":{},"f:base64data":{},"f:mediatype":{}},"f:image":{},"f:secrets":{},"f:sourceType":{}}},"manager":"preflight","operation":"Update","time":"2024-04-01T22:35:35Z"},{"apiVersion":"operators.coreos.com/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:connectionState":{".":{},"f:address":{},"f:lastConnect":{},"f:lastObservedState":{}},"f:registryService":{".":{},"f:createdAt":{},"f:port":{},"f:protocol":{},"f:serviceName":{},"f:serviceNamespace":{}}}},"manager":"catalog","operation":"Update","subresource":"status","time":"2024-04-01T22:36:03Z"}],"name":"nginx-ingress-operator","namespace":"nginx-ingress-operator","resourceVersion":"39191","uid":"a1e95a54-2302-4cd1-9ad3-1c352c8f1379"},"spec":{"displayName":"nginx-ingress-operator","icon":{"base64data":"","mediatype":""},"image":"quay.io/operator-pipeline-prod/nginx-ingress-operator-index:v4.16-36a87cabd459f7be3258a7e60ef53751ea737de4","secrets":["registry-auth-keys"],"sourceType":"grpc"},"status":{"connectionState":{"address":"nginx-ingress-operator.nginx-ingress-operator.svc:50051","lastConnect":"2024-04-01T22:36:03Z","lastObservedState":"READY"},"registryService":{"createdAt":"2024-04-01T22:35:37Z","port":"50051","protocol":"grpc","serviceName":"nginx-ingress-operator","serviceNamespace":"nginx-ingress-operator"}}}
Must-gather from Prow (may be from a different operator's test run than the one shown above):
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-redhat-openshift-ecosystem-certified-operators-prod-ocp-4.15-preflight-prod-claim/1774927027938267136/
Slack Discussion:
https://redhat-internal.slack.com/archives/C3VS0LV41/p1711751930041179
- is cloned by
-
OCPBUGS-31842 [release-4.15] OLM: Catalog Pods CrashLoopBackOff after Cluster `WakesUp` from Hibernating
- Closed
- is depended on by
-
OCPBUGS-31842 [release-4.15] OLM: Catalog Pods CrashLoopBackOff after Cluster `WakesUp` from Hibernating
- Closed
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update