OpenShift Bugs / OCPBUGS-34725

When using OCB, pools do not behave properly when they are paused

      Description of problem:

      When OCB is enabled in a pool, we pause the pool, and we create a new MC, the pool reports an updated=true status, but it should report updated=false.
      
      
      If we then remove the MC (restoring the original configuration) and unpause the pool, the latest image (the one built with the MC) is applied anyway, and the configuration on the nodes is not consistent with the rendered MC that the pool is using.
      
          

      Version-Release number of selected component (if applicable):

      IPI on AWS, version:
      
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-05-31-062415   True        False         54m     Cluster version is 4.16.0-0.nightly-2024-05-31-062415
      
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Enable OCB in the worker pool, for example:
      
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: worker
      spec:
        machineConfigPool:
          name: worker
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
      EOF
      
          2. Wait for the build to finish and to be applied to the nodes
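
      For example, one way to track this (reusing the commands shown later in this report) is to watch the build pod and the MachineOSBuild resource, and then confirm the deployed image on one of the worker nodes:

      $ oc get pods -n openshift-machine-config-operator | grep build
      $ oc get machineosbuild
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status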
      
          3. Pause the worker pool 
      
      $ oc patch mcp worker --type merge -p '{"spec":{"paused": true}}'
      machineconfigpool.machineconfiguration.openshift.io/worker patched
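
      We can double-check that the pool is paused; the following should print true:

      $ oc get mcp worker -ojsonpath='{.spec.paused}'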
      
      
          4. Create a new MC
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: test-machine-config-1
      spec:
        config:
          ignition:
            version: 3.1.0
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf-8;base64,dGVzdA==
              filesystem: root
              mode: 420
              path: /etc/test-file-1.test
      EOF
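
      Even though the pool is paused, a new rendered MachineConfig is generated and a new build is triggered; a quick way to see it:

      $ oc get mc | grep rendered-worker
      $ oc get pods -n openshift-machine-config-operator | grep build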
      
      
          5. Wait for the build to finish
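
      For example, the build can be considered finished when the MachineOSBuild reports SUCCEEDED=True and the build pod is gone:

      $ oc get machineosbuild
      $ oc get pods -n openshift-machine-config-operator | grep build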
      
          6. Remove the MC created in step 4
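
      For example, using the name from step 4:

      $ oc delete mc test-machine-config-1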
      
          7. Unpause the pool
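
      For example:

      $ oc patch mcp worker --type merge -p '{"spec":{"paused": false}}'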
      
          

      Actual results:

      
      After step 4 (creating an MC in a paused pool), the worker MCP status is set to updated=false, then a new build pod is created and the image is built and pushed, but it is not applied to the pool because the pool is paused.
      
      $ oc get pods -n openshift-machine-config-operator | grep build
      build-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f           2/2     Running   0             26s
      machine-os-builder-66fbf48666-s55qv                              1/1     Running   0             4m55s
      $ oc get mcp worker
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   False     False      False      3              0                   0                     0                      68m
      
      
      Nevertheless, once the build pod finishes building the image, the worker pool's status is set to "updated=true", even though the pool is paused and the image was not applied.
      
      $ oc get pods -n openshift-machine-config-operator | grep build
      machine-os-builder-66fbf48666-s55qv                              1/1     Running   0             8m11s
      
      $ oc get machineosbuilds worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status
      State: idle
      Deployments:
      * ostree-unverified-registry:quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567   <<--- the new image was not applied
                         Digest: sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
                        Version: 416.94.202405301601-0 (2024-05-31T08:44:45Z)
      
      $ oc get mcp worker
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f   True      False      False      3              3                   3                     0                      72m   <<-- but the status is updated=true
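
      To cross-check the mismatch, we can also grep the per-node MCO annotations (a quick sketch):

      $ oc get nodes -l node-role.kubernetes.io/worker -o yaml | grep -E 'machineconfiguration.openshift.io/(currentConfig|desiredConfig|state)'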
      
      
      
      If we remove the MC (restoring the original configuration) and unpause the pool, we can see that the nodes apply the new image even though they shouldn't.
      
      We can see it in the machine-config-daemon logs:
      
      
      2024-05-31T09:13:32.310491300+00:00 stderr F I0531 09:13:32.310479    2350 daemon.go:2443] Performing layered OS update
      2024-05-31T09:13:32.324502561+00:00 stderr F I0531 09:13:32.324430    2350 update.go:2632] Adding SIGTERM protection
      2024-05-31T09:13:32.342739023+00:00 stderr F I0531 09:13:32.342694    2350 update.go:845] Checking Reconcilable for config rendered-worker-7f9a9285ae9428f69f9d927b2075af79 to rendered-worker-7f9a9285ae9428f69f9d927b2075af79
      2024-05-31T09:13:32.378558402+00:00 stderr F I0531 09:13:32.378507    2350 update.go:2610] Update prepared; requesting cordon and drain via annotation to controller
      2024-05-31T09:15:02.403077130+00:00 stderr F I0531 09:15:02.403025    2350 update.go:2610] drain complete
      2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404570    2350 drain.go:114] Successful drain took 90.024576284 seconds
      2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404589    2350 rpm-ostree.go:308] Running captured: rpm-ostree --version
      2024-05-31T09:15:02.416771084+00:00 stderr F E0531 09:15:02.416736    2350 rpm-ostree.go:276] Merged secret file does not exist; defaulting to cluster pull secret
      2024-05-31T09:15:02.416771084+00:00 stderr F I0531 09:15:02.416764    2350 rpm-ostree.go:263] Linking ostree authfile to /var/lib/kubelet/config.json
      2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416790    2350 rpm-ostree.go:243] Executing rebase to quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416796    2350 update.go:2595] Running: rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      2024-05-31T09:15:02.498820482+00:00 stdout F Pulling manifest: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      
      We can see that the image applied is not consistent with the rendered MachineConfig that the pool is using:
      
      
      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-6281f1cfe35fa581444ea78e0dca347d   True      False      False      3              3                   3                     0                      101m
      worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   True      False      False      2              2                   2                     0                      101m
      
      
      $ oc get machineosbuild
      NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
      worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder   False      False      True        False         False
      worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder   False      False      True        False         False
      
      $ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
       $ oc get machineosbuild worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      $ oc debug node/ip-10-0-0-214.us-east-2.compute.internal -- chroot /host rpm-ostree status
      Starting pod/ip-10-0-0-214us-east-2computeinternal-debug-559wq ...
      To use host binaries, run `chroot /host`
      State: idle
      Deployments:
      * ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                         Digest: sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                        Version: 416.94.202405301601-0 (2024-05-31T08:58:19Z)
      
      The image applied is not the one that belongs to the right MC after we unpause the pool.
      
          

      Expected results:

      When a pool is paused and there are pending MCs to be applied, its status should always be updated=false and updating=false.
      
      When we restore the initial configuration and we unpause the pool, no new image should be applied.
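
      As a quick way to verify this, after step 7 the image deployed on the worker nodes should still be the one built for rendered-worker-7f9a9285ae9428f69f9d927b2075af79 (the original configuration); for example, the digests reported by these two commands should match:

      $ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status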
      
          

      Additional info:

          

              Assignee: Urvashi Mohnani (umohnani)
              Reporter: Sergio Regidor de la Rosa (sregidor@redhat.com)