OCPBUGS-43896

In OCB/OCL, no alert is raised when the reboot is broken, and the pool becomes degraded with the wrong message


    • Moderate
    • None
    • False

      Description of problem:

      When the reboot process is broken, an MCDRebootError alert should be raised. However, the alert is not raised, and the MCP is degraded with the wrong message:
      
      E1028 17:22:38.515751   45330 writer.go:226] Marking Degraded due to: failed to update OS to quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3: error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3: error: Old and new refs are equal: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3
      
      Even after the reboot process is fixed, the node does not recover and remains stuck reporting the "Old and new refs are equal" error.
      
      
          

      Version-Release number of selected component (if applicable):

      IPI on AWS:
      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-0.nightly-2024-10-28-052434   True        False         8h      Error while reconciling 4.18.0-0.nightly-2024-10-28-052434: an unknown error has occurred: MultipleErrors
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Enable OCL
          2. Break the reboot
      
      $ oc debug  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host sh -c "mount -o remount,rw /usr; mv /usr/bin/systemd-run /usr/bin/systemd-run2"
      Starting pod/sregidor-ver1-w48rv-worker-a-rln2vcopenshift-qeinternal-debug ...
      To use host binaries, run `chroot /host`
      
          3. Wait for an MCDRebootError alert to be raised and check that the MCP is degraded with the message: "reboot command failed, something is seriously wrong"
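
Step 2 and its reversal can be wrapped in a small helper, which makes it easier to exercise the recovery case mentioned in the description. This is only a sketch: the function names are hypothetical, and the "fix" command is simply the inverse of the `mv` used in step 2.

```shell
# Hypothetical helpers; not part of the original report.

# Prints the name of the first worker node (same selector as step 2).
first_worker() {
    oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}'
}

# Prints the host command for an action:
# "break" hides systemd-run, "fix" moves it back.
reboot_host_cmd() {
    case "$1" in
        break) echo "mount -o remount,rw /usr; mv /usr/bin/systemd-run /usr/bin/systemd-run2" ;;
        fix)   echo "mount -o remount,rw /usr; mv /usr/bin/systemd-run2 /usr/bin/systemd-run" ;;
        *)     echo "usage: reboot_host_cmd break|fix" >&2; return 1 ;;
    esac
}

# Runs the chosen action on the first worker node through a debug pod.
toggle_reboot() {
    cmd="$(reboot_host_cmd "$1")" || return 1
    oc debug "node/$(first_worker)" -- chroot /host sh -c "$cmd"
}
```

With these defined, `toggle_reboot break` reproduces step 2 and `toggle_reboot fix` restores the binary.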
          

      Actual results:

      
         The MCDRebootError alert is not raised, and the MCP is degraded with the wrong message:
      
        - lastTransitionTime: "2024-10-28T16:40:43Z"
          message: 'Node ip-10-0-51-0.us-east-2.compute.internal is reporting: "failed to
            update OS to quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3:
            error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3:
            error: Old and new refs are equal: ostree-unverified-registry:quay.io/mcoqe/layering@sha256:c56f19230be27cbc595d9467bcbc227858e097964ac5e5e7e74c5242aaca61e3\n:
            exit status 1"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
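
The condition above can be pulled out of the pool status and compared against the message expected in step 3. The following is a minimal sketch; the function names are hypothetical and the pool name defaults to `worker` as an assumption.

```shell
# Expected MCP message when the reboot command fails (taken from step 3).
EXPECTED_MSG="reboot command failed, something is seriously wrong"

# Returns 0 if the text read from stdin contains the expected message.
has_expected_reboot_message() {
    grep -qF -- "$EXPECTED_MSG"
}

# Prints the message of the NodeDegraded condition for a pool (default: worker).
node_degraded_message() {
    pool="${1:-worker}"
    oc get mcp "$pool" \
        -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'
}
```

For example, `node_degraded_message worker | has_expected_reboot_message || echo "unexpected degraded message"` would flag the behavior reported here.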
      
          

      Expected results:

         The MCDRebootError alert should be raised and the MCP should be degraded with the message "reboot command failed, something is seriously wrong".
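
One possible way to check for the alert from the command line is to query the in-cluster Prometheus alerts endpoint. This sketch assumes the default monitoring stack (pod `prometheus-k8s-0` in the `openshift-monitoring` namespace, with `curl` available in the `prometheus` container); the function names are hypothetical.

```shell
# Assumption: default monitoring stack, pod prometheus-k8s-0 in openshift-monitoring.
# Prints the JSON returned by the Prometheus /api/v1/alerts endpoint.
fetch_alerts() {
    oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
        curl -s http://localhost:9090/api/v1/alerts
}

# Returns 0 if the alerts JSON read from stdin mentions the given alert name.
# Plain-text match: it does not distinguish pending from firing alerts.
has_alert() {
    grep -q "\"alertname\":[[:space:]]*\"$1\""
}
```

For example, `fetch_alerts | has_alert MCDRebootError && echo "alert present"` prints nothing in the failing scenario described in this report.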
          

      Additional info:

          If OCL is disabled this functionality works as expected.
          

              team-mco Team MCO
              sregidor@redhat.com Sergio Regidor de la Rosa
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa