Type: Bug
Resolution: Done-Errata
Priority: Major
Affects Version: 4.16
Impact: Quality / Stability / Reliability
Severity: Important
Release Blocker: Rejected
Sprints: MCO Sprint 259, MCO Sprint 264, MCO Sprint 265
Description of problem:
When OCB (on-cluster build) is enabled in a pool, the pool is paused, and a new MC is created, the pool reports an updated=true status, but it should report updated=false. If we then remove the MC (restoring the original configuration) and unpause the pool, the latest image (the one built with the MC) is applied anyway, and the configuration on the nodes is not consistent with the rendered MC.
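The mismatch can be observed by comparing the Updated condition the pool reports with the image a node actually booted. A minimal sketch of that check, using a standard jsonpath condition filter and the first worker node (the same node-selection command used in the reproduction below):

$ oc get mcp worker -ojsonpath='{.status.conditions[?(@.type=="Updated")].status}'
$ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status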
Version-Release number of selected component (if applicable):
IPI on AWS, version:

NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-05-31-062415   True        False         54m     Cluster version is 4.16.0-0.nightly-2024-05-31-062415
How reproducible:
Always
Steps to Reproduce:
1. Enable OCB in the worker pool, for example:

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
EOF

2. Wait for the build to finish and to be applied to the nodes (a way to follow the build is sketched after these steps).

3. Pause the worker pool:

$ oc patch mcp worker --type merge -p '{"spec":{"paused": true}}'
machineconfigpool.machineconfiguration.openshift.io/worker patched

4. Create a new MC:

oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config-1
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/test-file-1.test
EOF

5. Wait for the build to finish.

6. Remove the MC created in step 4.

7. Unpause the pool.
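For steps 2 and 5, the build can be followed by re-checking the build pod and the MachineOSBuild status until SUCCEEDED turns True (a minimal sketch; the resource names match the ones used above):

$ oc get pods -n openshift-machine-config-operator | grep build
$ oc get machineosbuild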
Actual results:
After step 4 (creating an MC in a paused pool), the worker MCP status is set to updated=false, then a new build pod is created and the image is built and pushed, but it is not applied to the pool because the pool is paused.

$ oc get pods -n openshift-machine-config-operator | grep build
build-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f   2/2     Running   0          26s
machine-os-builder-66fbf48666-s55qv                      1/1     Running   0          4m55s

$ oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   False     False      False      3              0                   0                      0                      68m

Nevertheless, once the build pod finishes building the image, the worker pool's status is set to updated=true, even though the pool is paused and the image was not applied.

$ oc get pods -n openshift-machine-config-operator | grep build
machine-os-builder-66fbf48666-s55qv   1/1     Running   0          8m11s

$ oc get machineosbuilds worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89

$ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567   <<-- the new image was not applied
                   Digest: sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
                  Version: 416.94.202405301601-0 (2024-05-31T08:44:45Z)

$ oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f   True      False      False      3              3                   3                      0                      72m   <<-- but the status is updated=true

If we remove the MC (restoring the original configuration) and we unpause the pool, we can see that the nodes apply the new image even though they shouldn't.

We can see in the machine-config-daemon logs:

2024-05-31T09:13:32.310491300+00:00 stderr F I0531 09:13:32.310479    2350 daemon.go:2443] Performing layered OS update
2024-05-31T09:13:32.324502561+00:00 stderr F I0531 09:13:32.324430    2350 update.go:2632] Adding SIGTERM protection
2024-05-31T09:13:32.342739023+00:00 stderr F I0531 09:13:32.342694    2350 update.go:845] Checking Reconcilable for config rendered-worker-7f9a9285ae9428f69f9d927b2075af79 to rendered-worker-7f9a9285ae9428f69f9d927b2075af79
2024-05-31T09:13:32.378558402+00:00 stderr F I0531 09:13:32.378507    2350 update.go:2610] Update prepared; requesting cordon and drain via annotation to controller
2024-05-31T09:15:02.403077130+00:00 stderr F I0531 09:15:02.403025    2350 update.go:2610] drain complete
2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404570    2350 drain.go:114] Successful drain took 90.024576284 seconds
2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404589    2350 rpm-ostree.go:308] Running captured: rpm-ostree --version
2024-05-31T09:15:02.416771084+00:00 stderr F E0531 09:15:02.416736    2350 rpm-ostree.go:276] Merged secret file does not exist; defaulting to cluster pull secret
2024-05-31T09:15:02.416771084+00:00 stderr F I0531 09:15:02.416764    2350 rpm-ostree.go:263] Linking ostree authfile to /var/lib/kubelet/config.json
2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416790    2350 rpm-ostree.go:243] Executing rebase to quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416796    2350 update.go:2595] Running: rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
2024-05-31T09:15:02.498820482+00:00 stdout F Pulling manifest: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89

We can see that the image applied is not consistent with the rendered machineconfig that the pool is using:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6281f1cfe35fa581444ea78e0dca347d   True      False      False      3              3                   3                      0                      101m
worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   True      False      False      2              2                   2                      0                      101m

$ oc get machineosbuild
NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder   False      False      True        False         False
worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder   False      False      True        False         False

$ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567

$ oc get machineosbuild worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89

$ oc debug node/ip-10-0-0-214.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-0-214us-east-2computeinternal-debug-559wq ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                   Digest: sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                  Version: 416.94.202405301601-0 (2024-05-31T08:58:19Z)

After we unpause the pool, the image applied (the one built for the removed rendered-worker-558d0099... config) is not the one that belongs to the rendered MC the pool is using (rendered-worker-7f9a9285...).
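A compact way to cross-check the inconsistency above is to resolve the pool's current rendered config to its MachineOSBuild image and compare it with the node's booted deployment. A minimal sketch, assuming the MachineOSBuild naming pattern worker-<rendered-config>-builder seen above (the grep just isolates the deployed image line):

$ oc get mcp worker -ojsonpath='{.spec.configuration.name}'
$ oc get machineosbuild worker-$(oc get mcp worker -ojsonpath='{.spec.configuration.name}')-builder -ojsonpath='{.status.finalImagePullspec}'
$ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status | grep ostree-unverified-registry

The image pullspec resolved from the pool's rendered config should match the node's deployment; in this bug it does not.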
Expected results:
When a pool is paused and there are pending MCs to be applied, its status should always be updated=false and updating=false. When we restore the initial configuration and we unpause the pool, no image should be applied.
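In other words, while the pool is paused with a pending MC, a check like the following (a sketch using standard jsonpath condition filters) should print "False False":

$ oc get mcp worker -ojsonpath='{.status.conditions[?(@.type=="Updated")].status} {.status.conditions[?(@.type=="Updating")].status}'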
Additional info:
Links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update