OCPBUGS-20418: /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup


      Description of problem:
      New machines got stuck in the Provisioned state when the customer tried to scale up the machineset.
      ~~~
      NAME PHASE TYPE REGION ZONE AGE
      ocp4-ftf8t-worker-2-wn6lp Provisioned 44m
      ocp4-ftf8t-worker-redhat-x78s5 Provisioned 44m
      ~~~
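      For reference, a listing like the one above can presumably be obtained with the command below (assuming the default openshift-machine-api namespace used by IPI clusters):
      ~~~
      # Assumed command for the listing above; Machine objects live in
      # the openshift-machine-api namespace.
      $ oc get machines -n openshift-machine-api
      ~~~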

      Upon checking the journalctl logs from these VMs, we noticed that image pulls were failing with "no space left on device" errors.

      To troubleshoot the issue further, we had to break the root password in order to log in and investigate.

      Once the root password was broken, we logged in to the system and checked the journalctl logs for failures.
      We could see "no space left on device" errors for image pulls. In the df -h output, /dev/sda4 (/dev/mapper/coreos-luks-root-nocrypt), which is mounted on /sysroot, was 100% full.
      Because images fail to pull, machine-config-daemon-firstboot.service never completes, which prevents the node from being updated to 4.12 and from joining the cluster.
      The rest of the errors were side effects of the "no space left on device" error.
      We could see that /dev/sda4 was correctly partitioned to 120 GiB; the partition scheme matched that of a working system.
      The filesystem, however, was only 2.8 GiB instead of 120 GiB.
      We manually extended the filesystem for / (xfs_growfs /), after which the / mount was resized to 120 GiB.
      The node was rebooted once this step was performed, and the system came up fine on 4.12 RHCOS.
      We waited a while for the node to come up with kubelet and crio running, approved the pending certificates, and the node is now part of the cluster.
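      For reference, a minimal sketch of the diagnostics and manual workaround described above, assuming the root filesystem is XFS (the RHCOS default); device names will differ per node, and /sysroot is normally mounted read-only on RHCOS, so it may need a remount first:
      ~~~
      # Compare the partition size against the filesystem size.
      $ lsblk /dev/sda4
      $ df -h /sysroot

      # /sysroot is mounted read-only on RHCOS; remount it writable first.
      $ sudo mount -o remount,rw /sysroot

      # Grow the XFS filesystem to fill the partition (equivalent to the
      # "xfs_growfs /" used above, since / and /sysroot share the same filesystem).
      $ sudo xfs_growfs /sysroot
      ~~~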

      Later, while checking the logs for RCA, we observed the errors below, which might help determine why the /sysroot mountpoint was not resized.
      ~~~
      $ grep -i growfs sos_commands/logs/journalctl_no-pager_-since_-3days
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory <---
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
      ~~~
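      To confirm on an affected node whether the growfs unit was ever loaded and executed, something along these lines can be checked (note that the ignition-ostree-* units normally run only in the initramfs, so they may legitimately be absent from the real root):
      ~~~
      # Is a unit file present, and what is its last recorded state?
      $ systemctl status ignition-ostree-growfs.service
      $ systemctl cat ignition-ostree-growfs.service

      # Journal entries from the unit itself, if the initramfs journal was kept.
      $ journalctl -u ignition-ostree-growfs.service --no-pager
      ~~~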

      Version-Release number of selected component (if applicable):
      OCP 4.12.18.
      IPI installation on RHV.

      How reproducible:
      Not able to reproduce the issue.

      Steps to Reproduce:

      N/A. The issue could not be reproduced on demand; in the customer's environment it occurred on every machineset scale-up after the upgrade to 4.12.18.

      Actual results:
      The /sysroot mountpoint was not resized to the actual size of the /dev/sda4 partition, which prevented machine-config-daemon-firstboot.service from completing and left the node stuck at RHCOS version 4.6.

      As a workaround, the customer currently has to manually resize the /sysroot mountpoint every time a new node is added to the cluster.

      Expected results:
      The /sysroot mountpoint should be resized automatically as part of the ignition-ostree-growfs.sh script.
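      For context, a very simplified sketch of what the automatic first-boot growfs step is expected to do; the real ignition-ostree-growfs.sh shipped in the RHCOS initramfs additionally handles LUKS, multipath and other layouts, so this is illustrative only:
      ~~~
      #!/bin/bash
      # Illustrative sketch only: grow the partition backing /sysroot, then the
      # XFS filesystem on it, which is roughly what the first-boot growfs step
      # should have done automatically.
      set -euo pipefail

      src=$(findmnt -nvr -o SOURCE /sysroot)   # e.g. /dev/sda4
      disk=/dev/$(lsblk -no PKNAME "$src")     # parent disk, e.g. /dev/sda
      partnum=${src##*[!0-9]}                  # trailing digits, e.g. 4

      growpart "$disk" "$partnum" || true      # exits non-zero when no change is needed
      xfs_growfs /sysroot                      # grow the filesystem to the partition size
      ~~~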

      Additional info:
      The customer recently migrated from the old storage domain to a new one on RHV, in case that is relevant. However, they performed successful machineset scale-up tests with the new storage domain on OCP 4.11.33 (before upgrading OCP).
      They started facing the issue with all machinesets (new and existing) only after upgrading to OCP 4.12.18.

            [OCPBUGS-20418] /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.2 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:6837


            Sergio Regidor de la Rosa added a comment - edited

            Using IPI on AWS 4.14.0-0.nightly-2023-11-03-110310

            We have executed the following test cases post-merge

             

            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-LongDuration-High-63894-Scaleup using 4.1 cloud image[Disruptive] [Serial]"
            
            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-High-52822-Create new config resources with 2.2.0 ignition boot image nodes [Disruptive] [Serial]"
            
            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-High-65923-SSH key in scaled clusters [Disruptive] [Serial]"

             

            The rest of the verification was executed pre-merge.

             

            Since it has been verified both pre-merge and post-merge, we can move the status to Verified.

            Nevertheless, because of the comment above saying that we should not move it to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), we will wait for the proper release notes to be provided before changing the status.


            OpenShift Jira Bot added a comment -

            Hi walters@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type("Bug Fix" or "No Doc Update") and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.


            Sergio Regidor de la Rosa added a comment - Pre-merge verified in: https://github.com/openshift/machine-config-operator/pull/3967#issuecomment-1764422525

            Scott Dodson added a comment -

            ACK, sounds good to me. I've kicked off payload testing using the 4.14 blocking suite on the PR in the meantime.


            Colin Walters added a comment -

            This required a highly nontrivial change to the systemd units, and I think it should have some "soak time" in 4.15 before we ship it in 4.14. I'd say at least 2 weeks.

