Description of problem:
The pod of a catalogsource without registryPoll was not recreated during a node failure.
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 123m
community-operators-8mxh6 1/1 Running 0 123m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m
qe-app-registry-5jxlx 1/1 Running 0 106m
redhat-marketplace-4bgv9 1/1 Running 0 123m
redhat-operators-ww5tb 1/1 Running 0 123m
test-2xvt8 1/1 Terminating 0 12m
jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-2xvt8 1/1 Running 0 7m6s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none>
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 116m v1.30.2+421e90e
Version-Release number of selected component (if applicable):
Cluster version is 4.17.0-0.nightly-2024-07-07-131215
How reproducible:
Always
Steps to Reproduce:
1. Create a catalogsource without the registryPoll configuration.
jiazha-mac:~ jiazha$ cat cs-32183.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test
  namespace: openshift-marketplace
spec:
  displayName: Test Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.16
  publisher: OpenShift QE
  sourceType: grpc
jiazha-mac:~ jiazha$ oc create -f cs-32183.yaml
catalogsource.operators.coreos.com/test created
jiazha-mac:~ jiazha$ oc get pods test-2xvt8 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-2xvt8 1/1 Running 0 3m18s 10.129.2.26 qe-daily-417-0708-cv2p6-worker-westus-gcrrc <none> <none>
2. Stop the node.
jiazha-mac:~ jiazha$ oc debug node/qe-daily-417-0708-cv2p6-worker-westus-gcrrc
Temporary namespace openshift-debug-q4d5k is created for debugging node...
Starting pod/qe-daily-417-0708-cv2p6-worker-westus-gcrrc-debug-v665f ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.5
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# systemctl stop kubelet; sleep 600; systemctl start kubelet
Removing debug pod ...
Temporary namespace openshift-debug-q4d5k was removed.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc NotReady worker 115m v1.30.2+421e90e
3. Check whether this catalogsource's pod is recreated.
Actual results:
No new pod was generated.
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 123m
community-operators-8mxh6 1/1 Running 0 123m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (117m ago) 136m
qe-app-registry-5jxlx 1/1 Running 0 106m
redhat-marketplace-4bgv9 1/1 Running 0 123m
redhat-operators-ww5tb 1/1 Running 0 123m
test-2xvt8 1/1 Terminating 0 12m
Once the node recovered, a new pod was generated.
jiazha-mac:~ jiazha$ oc get node qe-daily-417-0708-cv2p6-worker-westus-gcrrc
NAME STATUS ROLES AGE VERSION
qe-daily-417-0708-cv2p6-worker-westus-gcrrc Ready worker 127m v1.30.2+421e90e
jiazha-mac:~ jiazha$ oc get pods
NAME READY STATUS RESTARTS AGE
certified-operators-rcs64 1/1 Running 0 127m
community-operators-8mxh6 1/1 Running 0 127m
marketplace-operator-769fbb9898-czsfn 1/1 Running 4 (121m ago) 140m
qe-app-registry-5jxlx 1/1 Running 0 109m
redhat-marketplace-4bgv9 1/1 Running 0 127m
redhat-operators-ww5tb 1/1 Running 0 127m
test-wqxvg 1/1 Running 0 27s
Expected results:
During the node failure, a new catalog source pod should be generated.
Additional info:
Hi Team,
After further investigation of the operator-lifecycle-manager source code, we found the cause.
- The commit [1] tries to fix this issue by adding a "force delete dead pod" step to the ensurePod() function.
- ensurePod() is called by EnsureRegistryServer() [2].
- However, syncRegistryServer() returns immediately without calling EnsureRegistryServer() if the catalog has no registryPoll [3].
- No registryPoll is defined in the catalogsource that is generated when building a catalog image by following the documentation [4].
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
- So the catalog pod created by this catalogsource cannot be recovered.
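The control flow described in the list above can be sketched roughly as follows. This is only a simplified model for illustration, not the actual OLM code: the catalogSource struct, its fields, and the function bodies are hypothetical, and only the call relationship between syncRegistryServer() and EnsureRegistryServer() is taken from the analysis above.

```go
package main

import "fmt"

// catalogSource is a hypothetical, stripped-down stand-in for the real
// CatalogSource type. It only carries what this sketch needs.
type catalogSource struct {
	name            string
	hasRegistryPoll bool // true when spec.updateStrategy.registryPoll is set
	podOnDeadNode   bool // true when the catalog pod is stuck on a NotReady node
}

// ensureRegistryServer models the step that contains the dead-pod cleanup
// added by commit [1]. It reports whether a replacement pod was created.
func ensureRegistryServer(cs *catalogSource) bool {
	if cs.podOnDeadNode {
		// Model of "force delete the dead pod, then create a new one".
		cs.podOnDeadNode = false
		return true
	}
	return false
}

// syncRegistryServer models the early return described in [3]: without
// registryPoll, ensureRegistryServer is never reached.
func syncRegistryServer(cs *catalogSource) bool {
	if !cs.hasRegistryPoll {
		return false
	}
	return ensureRegistryServer(cs)
}

func main() {
	noPoll := &catalogSource{name: "test", podOnDeadNode: true}
	withPoll := &catalogSource{name: "test-poll", hasRegistryPoll: true, podOnDeadNode: true}

	fmt.Println(noPoll.name, "recreated:", syncRegistryServer(noPoll))
	fmt.Println(withPoll.name, "recreated:", syncRegistryServer(withPoll))
}
```

In this model the catalogsource without registryPoll keeps its dead pod, while the one with registryPoll gets a replacement, which matches the behavior observed in the reproduction steps.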
We also verified that the catalog pod is recreated on another node if the registryPoll configuration is added to the catalogsource as follows (the lines marked with <==).
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  image: quay-server.bastion.tokyo.com:5000/redhat/redhat-operator-index-logging:logging-vstable-5.8-v5.8.5
  sourceType: grpc
  updateStrategy: <==
    registryPoll: <==
      interval: 10m <==
registryPoll is not a mandatory field for a catalogsource.
So fixing the issue inside EnsureRegistryServer(), as commit [1] does, does not cover catalogsources without registryPoll.
[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/3201/files
[2] https://github.com/joelanford/operator-lifecycle-manager/blob/82f499723e52e85f28653af0610b6e7feff096cf/pkg/controller/registry/reconciler/grpc.go#L290
[3] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/catalog/operator.go#L1009
[4] https://docs.openshift.com/container-platform/4.16/operators/admin/olm-managing-custom-catalogs.html
- depends on: OCPBUGS-36661 OLM catalogsource pods do not recover from node failure when registryPoll is none (Verified)
- is cloned by: OCPBUGS-41217 OLM catalogsource pods do not recover from node failure when registryPoll is none (Closed)
- is depended on by: OCPBUGS-41217 OLM catalogsource pods do not recover from node failure when registryPoll is none (Closed)
- links to: RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update