OpenShift Bugs / OCPBUGS-34725

When using OCB, pools do not behave properly when they are paused

      Description of problem:

      When OCB is enabled in a pool, we pause the pool, and we create a new MC, the pool reports an updated=true status, but it should report updated=false.
      
      
      If we then remove the MC (restoring the original configuration) and unpause the pool, the latest image (the one built with the MC) is applied anyway, and the configuration on the nodes is not consistent with the rendered MC that the pool is using.
      
          

      Version-Release number of selected component (if applicable):

      IPI on AWS, version:
      
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-05-31-062415   True        False         54m     Cluster version is 4.16.0-0.nightly-2024-05-31-062415
      
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Enable OCB in the worker pool, for example:
      
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: worker
      spec:
        machineConfigPool:
          name: worker
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
      EOF
      
          2. Wait for the build to finish and to be applied to the nodes
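
      For example, one way to track this (reusing the commands shown later in this report) is to watch the build pod and the MachineOSBuild resource, and then confirm the deployed image on one of the worker nodes:

      $ oc get pods -n openshift-machine-config-operator | grep build
      $ oc get machineosbuild
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status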
      
          3. Pause the worker pool 
      
      $ oc patch mcp worker --type merge -p '{"spec":{"paused": true}}'
      machineconfigpool.machineconfiguration.openshift.io/worker patched
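
      We can double-check that the pool is paused; the following should print true:

      $ oc get mcp worker -ojsonpath='{.spec.paused}'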
      
      
          4. Create a new MC
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: test-machine-config-1
      spec:
        config:
          ignition:
            version: 3.1.0
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf-8;base64,dGVzdA==
              filesystem: root
              mode: 420
              path: /etc/test-file-1.test
      EOF
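
      Even though the pool is paused, a new rendered MachineConfig is generated and a new build is triggered; a quick way to see it:

      $ oc get mc | grep rendered-worker
      $ oc get pods -n openshift-machine-config-operator | grep build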
      
      
          5. Wait for the build to finish
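
      For example, the build can be considered finished when the MachineOSBuild reports SUCCEEDED=True and the build pod is gone:

      $ oc get machineosbuild
      $ oc get pods -n openshift-machine-config-operator | grep build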
      
          6. Remove the MC created in step 4
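
      For example, using the name from step 4:

      $ oc delete mc test-machine-config-1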
      
          7. Unpause the pool
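
      For example:

      $ oc patch mcp worker --type merge -p '{"spec":{"paused": false}}'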
      
          

      Actual results:

      
      After step 4 (creating an MC in a paused pool), the worker MCP status is set to updated=false, then a new build pod is created and the image is built and pushed, but it is not applied to the pool because the pool is paused.
      
      $ oc get pods -n openshift-machine-config-operator | grep build
      build-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f           2/2     Running   0             26s
      machine-os-builder-66fbf48666-s55qv                              1/1     Running   0             4m55s
      $ oc get mcp worker
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   False     False      False      3              0                   0                     0                      68m
      
      
      Nevertheless, once the build pod finishes building the image, the worker pool's status is set to "updated=true", even though the pool is paused and the image was not applied.
      
      $ oc get pods -n openshift-machine-config-operator | grep build
      machine-os-builder-66fbf48666-s55qv                              1/1     Running   0             8m11s
      
      $ oc get machineosbuilds worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status
      State: idle
      Deployments:
      * ostree-unverified-registry:quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567   <<--- the new image was not applied
                         Digest: sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
                        Version: 416.94.202405301601-0 (2024-05-31T08:44:45Z)
      
      $ oc get mcp worker
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      worker   rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f   True      False      False      3              3                   3                     0                      72m   <<-- but the status is updated=true
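
      To cross-check the mismatch, we can also grep the per-node MCO annotations (a quick sketch):

      $ oc get nodes -l node-role.kubernetes.io/worker -o yaml | grep -E 'machineconfiguration.openshift.io/(currentConfig|desiredConfig|state)'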
      
      
      
      If we remove the MC (restoring the original configuration) and unpause the pool, we can see that the nodes apply the new image even though they shouldn't.
      
      We can see it in the machine-config-daemon logs:
      
      
      2024-05-31T09:13:32.310491300+00:00 stderr F I0531 09:13:32.310479    2350 daemon.go:2443] Performing layered OS update
      2024-05-31T09:13:32.324502561+00:00 stderr F I0531 09:13:32.324430    2350 update.go:2632] Adding SIGTERM protection
      2024-05-31T09:13:32.342739023+00:00 stderr F I0531 09:13:32.342694    2350 update.go:845] Checking Reconcilable for config rendered-worker-7f9a9285ae9428f69f9d927b2075af79 to rendered-worker-7f9a9285ae9428f69f9d927b2075af79
      2024-05-31T09:13:32.378558402+00:00 stderr F I0531 09:13:32.378507    2350 update.go:2610] Update prepared; requesting cordon and drain via annotation to controller
      2024-05-31T09:15:02.403077130+00:00 stderr F I0531 09:15:02.403025    2350 update.go:2610] drain complete
      2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404570    2350 drain.go:114] Successful drain took 90.024576284 seconds
      2024-05-31T09:15:02.404598371+00:00 stderr F I0531 09:15:02.404589    2350 rpm-ostree.go:308] Running captured: rpm-ostree --version
      2024-05-31T09:15:02.416771084+00:00 stderr F E0531 09:15:02.416736    2350 rpm-ostree.go:276] Merged secret file does not exist; defaulting to cluster pull secret
      2024-05-31T09:15:02.416771084+00:00 stderr F I0531 09:15:02.416764    2350 rpm-ostree.go:263] Linking ostree authfile to /var/lib/kubelet/config.json
      2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416790    2350 rpm-ostree.go:243] Executing rebase to quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      2024-05-31T09:15:02.416798545+00:00 stderr F I0531 09:15:02.416796    2350 update.go:2595] Running: rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      2024-05-31T09:15:02.498820482+00:00 stdout F Pulling manifest: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      
      We can see that the image applied is not consistent with the rendered MachineConfig that the pool is using:
      
      
      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-6281f1cfe35fa581444ea78e0dca347d   True      False      False      3              3                   3                     0                      101m
      worker   rendered-worker-7f9a9285ae9428f69f9d927b2075af79   True      False      False      2              2                   2                     0                      101m
      
      
      $ oc get machineosbuild
      NAME                                                              PREPARED   BUILDING   SUCCEEDED   INTERRUPTED   FAILED
      worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder   False      False      True        False         False
      worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder   False      False      True        False         False
      
      $ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:1adb95f5e0e1afd482aefbf0482e3df8c9f358622ec5ec081476ac1690001567
       $ oc get machineosbuild worker-rendered-worker-558d0099ef37dc923293bc2bfd2f0d7f-builder -ojsonpath='{.status.finalImagePullspec}'
      quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
      
      $ oc debug node/ip-10-0-0-214.us-east-2.compute.internal -- chroot /host rpm-ostree status
      Starting pod/ip-10-0-0-214us-east-2computeinternal-debug-559wq ...
      To use host binaries, run `chroot /host`
      State: idle
      Deployments:
      * ostree-unverified-registry:quay.io/mcoqe/layering@sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                         Digest: sha256:f93af463c752ae4205ffb2aab52fe05dbc6c5b62914a22cb20bf6274001c8e89
                        Version: 416.94.202405301601-0 (2024-05-31T08:58:19Z)
      
      The image applied is not the one that belongs to the right MC after we unpause the pool.
      
          

      Expected results:

      When a pool is paused and there are pending MCs to be applied, its status should always be updated=false and updating=false.
      
      When we restore the initial configuration and we unpause the pool, no new image should be applied.
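
      As a quick way to verify this, after step 7 the image deployed on the worker nodes should still be the one built for rendered-worker-7f9a9285ae9428f69f9d927b2075af79 (the original configuration); for example, the digests reported by these two commands should match:

      $ oc get machineosbuild worker-rendered-worker-7f9a9285ae9428f69f9d927b2075af79-builder -ojsonpath='{.status.finalImagePullspec}'
      $ oc debug -q node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host rpm-ostree status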
      
          

      Additional info:

          

              Assignee: Urvashi Mohnani (umohnani)
              Reporter: Sergio Regidor de la Rosa (sregidor@redhat.com)