Bug
Resolution: Done-Errata
Major
4.16
Quality / Stability / Reliability
False
1
Important
Rejected
MCO Sprint 259, MCO Sprint 264, MCO Sprint 265
3
Description of problem:
When OCB is enabled in a pool, we pause the pool, and then we create a new MC, the pool reports an updated=true status, but it should report updated=false.
If we then remove the MC (restoring the original configuration) and unpause the pool, the latest image (the one built from the MC) is applied anyway, and the configuration on the nodes is not consistent with the rendered MC.
Version-Release number of selected component (if applicable):
IPI on AWS, version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.16.0-0.nightly-2024-05-31-062415 True False 54m Cluster version is 4.16.0-0.nightly-2024-05-31-062415
How reproducible:
Always
Steps to Reproduce:
1. Enable OCB in the worker pool, for example:
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1alpha1
kind: MachineOSConfig
metadata:
  name: worker
spec:
  machineConfigPool:
    name: worker
  buildInputs:
    imageBuilder:
      imageBuilderType: PodImageBuilder
    baseImagePullSecret:
      name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
    renderedImagePushSecret:
      name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
    renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
EOF
2. Wait for the build to finish and to be applied to the nodes
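One way to follow the build and the rollout (any equivalent check works):
$ oc get machineosbuild -w
$ oc get mcp worker -w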
3. Pause the worker pool
$ oc patch mcp worker --type merge -p '{"spec":{"paused": true}}'
machineconfigpool.machineconfiguration.openshift.io/worker patched
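Optionally, confirm that the pool is paused:
$ oc get mcp worker -ojsonpath='{.spec.paused}'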
4. Create a new MC
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config-1
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/test-file-1.test
EOF
5. Wait for the build to finish
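For example (the Succeeded condition name is inferred from the MachineOSBuild status columns shown below; adjust if it differs):
$ oc wait machineosbuild --all --for=condition=Succeeded --timeout=15m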
6. Remove the MC created in step 4
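For example, using the name from step 4:
$ oc delete mc test-machine-config-1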
7. Unpause the pool
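For example, mirroring the pause command from step 3:
$ oc patch mcp worker --type merge -p '{"spec":{"paused": false}}'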
Actual results:
After step 4 (we create an MC in a paused pool), the worker MCP status is set to updated=false; then a new build pod is created, the image is built and pushed, but it is not applied to the pool because the pool is paused.
$ oc get pods -n openshift-machine-config-operator | grep build
build-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f 2/2 Running 0 26s
machine-os-builder-66fbf48666-s55qv 1/1 Running 0 4m55s
$ oc get mcp worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-7f9a9285ae9428f69f9d927b2075af79 False False False 3 0 0 0 68m
Nevertheless, once the build pod finishes building the image, the worker pool's status is set to "updated=true", even though the pool is paused and the image was not applied.
$ oc get pods -n openshift-machine-config-operator | grep build
machine-os-builder-66fbf48666-s55qv 1/1 Running 0 8m11s
$ oc get machineosbuilds worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
$ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567 <<--- the new image was not applied
Digest: sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
Version: 416.94.202405301601-0 (2024-05-31T08:44:45Z)
$ oc get mcp worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f True False False 3 3 3 0 72m <<-- but the status is updated=true
If we remove the MC (restoring the original configuration) and we unpause the pool, we can see that the nodes apply the new image even though they shouldn't.
We can see this in the MCD logs:
2024-05-31T09:13:32.310491300+00:00 stderr F I0531 09:13:32.310479 2350 daemon.go:2443] Performing layered OS update
2024-05-31T09:13:32.324502561+00:00 stderr F I0531 09:13:32.324430 2350 update.go:2632] Adding SIGTERM protection
2024-05-31T09:13:32.342739023+00:00 stderr F I0531 09:13:32.342694 2350 update.go:845] Checking Reconcilable for config rendered-worker-7f9a9285ae9428f69f9d927b2075af79 to rendered-worker-7f9a9285ae9428f69f9d927b2075af79
2024-05-31T09:13:32.378558402+00:00 stderr F I0531 09:13:32.378507 2350 update.go:2610] Update prepared; requesting cordon and drain via annotation to controller
2024-05-31T09:15:02.403077130+00:00 stderr F I0531 09:15:02.403025 2350 update.go:2610] drain complete
2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404570 2350 drain.go:114] Successful drain took 90.024576284 seconds
2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404589 2350 rpm-ostree.go:308] Running captured: rpm-ostree --version
2024-05-31T09:15:02.416771084+00:00 stderr F E0531 09:15:02.416736 2350 rpm-ostree.go:276] Merged secret file does not exist; defaulting to cluster pull secret
2024-05-31T09:15:02.416771084+00:00 stderr F I0531 09:15:02.416764 2350 rpm-ostree.go:263] Linking ostree authfile to /var/lib/kubelet/config.json
2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416790 2350 rpm-ostree.go:243] Executing rebase to quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416796 2350 update.go:2595] Running: rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
2024-05-31T09:15:02.498820482+00:00 stdout F Pulling manifest: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
We can see that the image applied is not consistent with the rendered machineconfig that the pool is using:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-6281f1cfe35fa581444ea78e0dca347d True False False 3 3 3 0 101m
worker rendered-worker-7f9a9285ae9428f69f9d927b2075af79 True False False 2 2 2 0 101m
$ oc get machineosbuild
NAME PREPARED BUILDING SUCCEEDED INTERRUPTED FAILED
worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder False False True False False
worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder False False True False False
$ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
$ oc get machineosbuild worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
$ oc debug node/ip-10-0-0-214.us-east-2.compute.internal -- chroot /host rpm-ostree status
Starting pod/ip-10-0-0-214us-east-2computeinternal-debug-559wq ...
To use host binaries, run `chroot /host`
State: idle
Deployments:
* ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
Digest: sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
Version: 416.94.202405301601-0 (2024-05-31T08:58:19Z)
After we unpause the pool, the image applied is not the one that belongs to the rendered MC that the pool is using.
Expected results:
When a pool is paused and there are pending MCs to be applied, its status should always be updated=false and updating=false.
When we restore the initial configuration and we unpause the pool, no new image should be applied.
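A consistency check along these lines (reusing the commands shown above; the MachineOSBuild naming pattern worker-<rendered-config>-builder is taken from the outputs in this report) should show the node booted into the image built for the pool's current rendered config:
# Rendered config currently targeted by the pool
$ RENDERED=$(oc get mcp worker -ojsonpath='{.spec.configuration.name}')
# Image built for that rendered config
$ oc get machineosbuild worker-${RENDERED}-builder -ojsonpath='{.status.finalImagePullspec}'
# Image actually booted on a worker node; the digests should match
$ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status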
Additional info:
Links to: RHEA-2024:6122 (OpenShift Container Platform 4.18.z bug fix update)