OCPBUGS-59201

[4.20] Nodes born on 4.1/4.2 will not be able to upgrade to 4.19 due to composefs + grub2-probe incompatibility


    • Quality / Stability / Reliability
    • False
    • 5
    • Moderate
    • No
    • Rejected

      The original problem described here was that 4.1/4.2 bootimages no longer work with composefs: we didn't have static GRUB configs back then, so nodes installed from those bootimages run grub2-mkconfig and thus grub2-probe (which breaks on composefs: https://github.com/ostreedev/ostree/issues/3198#issuecomment-2828935716). We can require bootimage updates for this. But the point remains that there are fully updated nodes out there still using grub2-mkconfig that will fail when they upgrade to 4.19. See https://issues.redhat.com/browse/OCPBUGS-52485?focusedId=27145454&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-27145454.

      Original bug description follows.

      Description of problem:

      When we try to scale up a node using a 4.1 boot image in 4.19, the node is added to the cluster, but the pool becomes degraded and shows this error:
      
        - lastTransitionTime: "2025-03-06T12:59:31Z"
          message: 'Node ip-10-0-16-142.ec2.internal is reporting: "failed to remove rollback:
            error running rpm-ostree cleanup -r: error: cleanup: GDBus.Error:org.projectatomic.rpmostreed.Error.Failed:
            Bootloader write config: grub2-mkconfig: Child process exited with code 1\n:
            exit status 1"'
          reason: 1 nodes are reporting degraded status on sync
          status: "True"
          type: NodeDegraded
        - lastTransitionTime: "2025-03-06T12:59:31Z"
          message: 'Node ip-10-0-16-142.ec2.internal is reporting: "failed to remove rollback:
            error running rpm-ostree cleanup -r: error: cleanup: GDBus.Error:org.projectatomic.rpmostreed.Error.Failed:
            Bootloader write config: grub2-mkconfig: Child process exited with code 1\n:
            exit status 1"'
          reason: ""
          status: "True"
          type: Degraded
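To spot this failure programmatically, the conditions above can be scanned for the grub2-mkconfig signature. The following is a minimal sketch, assuming the conditions have already been loaded into Python dictionaries; the function name `find_grub2_mkconfig_failures` and the second sample entry are hypothetical and only serve as illustration.

```python
import re

# Substring taken from the condition message quoted in this report.
FAILURE_SIGNATURE = "Bootloader write config: grub2-mkconfig"
# The node name is the token between "Node" and "is reporting".
NODE_PATTERN = re.compile(r"Node (\S+) is reporting")


def find_grub2_mkconfig_failures(conditions):
    """Return (condition type, node name) pairs for active matching conditions."""
    hits = []
    for cond in conditions:
        if cond.get("status") != "True":
            continue
        message = cond.get("message", "")
        if FAILURE_SIGNATURE not in message:
            continue
        match = NODE_PATTERN.search(message)
        hits.append((cond.get("type"), match.group(1) if match else None))
    return hits


sample_conditions = [
    {
        "type": "NodeDegraded",
        "status": "True",
        "message": (
            'Node ip-10-0-16-142.ec2.internal is reporting: "failed to remove rollback: '
            "error running rpm-ostree cleanup -r: error: cleanup: "
            "GDBus.Error:org.projectatomic.rpmostreed.Error.Failed: "
            "Bootloader write config: grub2-mkconfig: Child process exited with code 1"
            '\\n: exit status 1"'
        ),
    },
    # Hypothetical healthy condition, added only for contrast.
    {"type": "Degraded", "status": "False", "message": ""},
]

print(find_grub2_mkconfig_failures(sample_conditions))
```

Matching on a fixed substring keeps the check simple; it would need adjusting if the error wording changes between releases.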
      
      
          

      Version-Release number of selected component (if applicable):

      IPI on AWS version 4.19.0-0.nightly-2025-03-05-160850
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Create a machineset using a 4.1 boot image
          2. Scale up the machineset to create a new node 
          
      If more details are needed, we can have a look at this test case: https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-63894
      
      
          

      Actual results:

      A new node is created and joins the cluster, but the MCP is degraded, reporting the error mentioned above.
          

      Expected results:

      No degradation should happen
          

      Additional info:

      We were not able to reproduce it using 4.3 boot images, but we could reproduce it with 4.2 boot images.
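The observation above can be written as a small version check. This is only a sketch of what this report observed (4.1 and 4.2 reproduce, 4.3 does not); the names `FIRST_UNAFFECTED` and `boot_image_reproduces` are hypothetical, and treating every version at or above 4.3 as unaffected is an assumption that goes beyond what was tested here.

```python
# First boot image version that did not reproduce the problem in this report.
# Treating everything at or above it as unaffected is an assumption.
FIRST_UNAFFECTED = (4, 3)


def parse_major_minor(version):
    """Turn a string like '4.2' or '4.2.18' into a (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def boot_image_reproduces(version):
    """Return True if the boot image version is older than FIRST_UNAFFECTED."""
    return parse_major_minor(version) < FIRST_UNAFFECTED


for v in ("4.1", "4.2", "4.3"):
    print(v, boot_image_reproduces(v))
```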
      
      We can find this error in the node's journal logs. It seems to be related to the new composefs change.
      
      Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[3455]: Process [pid: 11278 uid: 0 unit: crio-b01113eb0030cac9918424bbb829041651d6d4c354274836778652e2edb06b02.scope] connected to transaction progress
      Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[3455]: bootfs is sufficient for calculated new size: 0 bytes
      Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[11291]: /usr/sbin/grub2-probe: error: failed to get canonical path of `composefs'.
      Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[3455]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos failed: Bootloader write config: grub2-mkconfig: Child process exited with code 1
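When triaging similar reports, the grub2-probe line is the one that identifies the composefs incompatibility. The sketch below filters journal text for it; it assumes the journal has already been exported as plain lines, and the helper name `composefs_probe_errors` is hypothetical.

```python
import re

# Matches the grub2-probe failure quoted in the journal excerpt.
# The quote characters around "composefs" are matched loosely with "."
# because the log uses a backtick and an apostrophe.
PROBE_ERROR = re.compile(
    r"grub2-probe: error: failed to get canonical path of .composefs."
)


def composefs_probe_errors(journal_lines):
    """Return the journal lines that contain the grub2-probe composefs error."""
    return [line for line in journal_lines if PROBE_ERROR.search(line)]


sample_journal = [
    "Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[3455]: "
    "bootfs is sufficient for calculated new size: 0 bytes",
    "Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[11291]: "
    "/usr/sbin/grub2-probe: error: failed to get canonical path of `composefs'.",
    "Thu 2025-03-06 13:21:28 UTC localhost.localdomain rpm-ostreed.service[3455]: "
    "Txn Cleanup on /org/projectatomic/rpmostree1/rhcos failed: Bootloader write "
    "config: grub2-mkconfig: Child process exited with code 1",
]

for line in composefs_probe_errors(sample_journal):
    print(line)
```

The grub2-mkconfig failure on the following journal line is the downstream symptom, so the filter targets the grub2-probe line rather than the transaction error.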
      
      
      
          

              Unassigned
              sregidor@redhat.com Sergio Regidor de la Rosa
              Michael Nguyen