OCPBUGS-18414

In OCB pools, when a config drift happens and it is fixed, the pool is degraded with error: "Old and new refs are equal"


      Description of problem:

      In pools with On-Cluster Build (OCB) enabled, when a config drift happens because a file's content has been manually changed, the MCP goes degraded (this is expected):
      
        - lastTransitionTime: "2023-08-31T11:34:33Z"
          message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
            "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
            content mismatch for file \"/etc/mco-test-file\""'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
      
      
      If we fix this drift by restoring the original file's content, the MCP stays degraded, now with this message:
      
          - lastTransitionTime: "2023-08-31T12:24:47Z"
            message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
              reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
              : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
              error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
              exit status 1"'
            reason: 1 nodes are reporting degraded status on sync
            status: "True"
            type: NodeDegraded
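      
      For context, the "Old and new refs are equal" message is rpm-ostree refusing to rebase to the image reference that is already deployed on the node. A hedged way to confirm this from a debug shell on the affected node (after `chroot /host`):
      
      sh-5.1# rpm-ostree status --booted
      
      The booted deployment should show the same ostree-unverified-registry:quay.io/xxx/xxx@sha256:... reference that the MCD is trying to rebase to.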
      
      
      
      

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2023-08-30-191617   True        False         4h18m   Error while reconciling 4.14.0-0.nightly-2023-08-30-191617: the cluster operator monitoring is not available
      
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Enable the OCB functionality for the worker pool
      $ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
      
      (Create the necessary ConfigMaps and Secrets for the OCB functionality to work.)
      
      Wait until the new image is built and the nodes are updated.
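      
      For example, one way to wait for that rollout (a sketch; the 30m timeout is arbitrary):
      $ oc wait mcp/worker --for=condition=Updated --timeout=30m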
      
      2. Create an MC to deploy a new file
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: mco-drift-test-file
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
            - contents:
                source: data:,MCO%20test%20file%0A
              path: /etc/mco-test-file
      
      Wait until the new MC is deployed.
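      
      For example, assuming the manifest above is saved as mco-drift-test-file.yaml (the file name is illustrative):
      $ oc apply -f mco-drift-test-file.yaml
      $ oc wait mcp/worker --for=condition=Updated --timeout=20m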
      
      3. Modify the content of the file /etc/mco-test-file, making a backup first
      
      $ oc debug  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}")
      Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
      Starting pod/sregidor-sr2-2gb5z-worker-a-q7wcbcopenshift-qeinternal-debug-sv85v ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.128.9
      If you don't see a command prompt, try pressing enter.
      sh-4.4# chroot /host
      sh-5.1# cd /etc
      sh-5.1# cat mco-test-file 
      MCO test file
      sh-5.1# cp mco-test-file mco-test-file-back
      sh-5.1# echo -n "1" >> mco-test-file
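      
      The same drift can also be introduced non-interactively; a sketch using a single oc debug invocation against the same first worker node:
      $ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- \
          chroot /host sh -c 'cp /etc/mco-test-file /etc/mco-test-file-back && echo -n "1" >> /etc/mco-test-file'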
      
      
      4. Wait until the MCP reports the config drift issue
      
      $ oc get mcp worker -o yaml
      ....
        - lastTransitionTime: "2023-08-31T11:34:33Z"
          message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting:
            "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................:
            content mismatch for file \"/etc/mco-test-file\""'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
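      
      To watch just that condition, the standard oc jsonpath filter syntax can be used:
      $ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'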
      
      
      5. Restore the backup that we made in step 3
      sh-5.1# cp mco-test-file-back mco-test-file
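      
      After restoring the file, the machine-config-daemon on that node picks up the change on its next sync; its logs can be followed with something like this (assuming the standard MCD pod label and container name):
      $ oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -c machine-config-daemon -f --tail=20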
      
      
      

      Actual results:

      The worker pool is degraded with this message:
      
          - lastTransitionTime: "2023-08-31T12:24:47Z"
            message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is
              reporting: "failed to update OS to quay.io/xxx/xxx@sha256:.......
              : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........:
              error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n:
              exit status 1"'
            reason: 1 nodes are reporting degraded status on sync
            status: "True"
            type: NodeDegraded
      
      
      

      Expected results:

      The worker pool (MCP) should stop being degraded once the original file content is restored.
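      
      For example, a sketch of how the recovery could be verified (timeouts are arbitrary):
      $ oc wait mcp/worker --for=condition=Degraded=False --timeout=10m
      $ oc wait mcp/worker --for=condition=Updated --timeout=10m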
      
      
      

      Additional info:

      There is a link to the must-gather file in the first comment of this issue.
      

            Assignee: Dalia Khater (dkhater@redhat.com)
            Reporter: Sergio Regidor de la Rosa (sregidor@redhat.com)