OpenShift Bugs · OCPBUGS-15087

/sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup


Details

    • Moderate
    • No
    • Sprint 239 - Update&Remoting, Sprint 242 - Update&Remoting, Sprint 243 - Update&Remoting
    • 3
    • Rejected
    • False
      * Previously, older {op-system} boot images contained a race condition between services on boot that prevented the node from running the `rhcos-growpart` command before it pulled images, which in turn prevented the node from starting up. As a result, node scaling could fail on clusters that use old boot images because the node determined there was no room left on the disk. With this update, mitigations were added to the Machine Config Operator to enforce stricter ordering of services so that nodes boot correctly.
      +
      [NOTE]
      ====
      In these situations, updating to newer boot images prevents similar issues from occurring.
      ====
      +
      (https://issues.redhat.com/browse/OCPBUGS-15087[*OCPBUGS-15087*])
    • Bug Fix
    • Done

    Description

      Description of problem:
      New machines got stuck in Provisioned state when the customer tried to scale the machineset.
      ~~~
      NAME                             PHASE         TYPE   REGION   ZONE   AGE
      ocp4-ftf8t-worker-2-wn6lp        Provisioned                          44m
      ocp4-ftf8t-worker-redhat-x78s5   Provisioned                          44m
      ~~~

      Upon checking the journalctl logs from these VMs, we noticed that image pulls were failing with "no space left on device" errors.

      To troubleshoot the issue further, we had to reset the root password in order to log in.

      Once the root password was reset, we logged in to the system and checked the journalctl logs for failure errors.
      We could see "no space left on device" errors for image pulls. Checking the df -h output, we could see that /dev/sda4 (/dev/mapper/coreos-luks-root-nocrypt), which is mounted on /sysroot, was 100% full.
      Because images failed to pull, the machine-config-daemon-firstboot.service could not complete. This prevented the node from updating to 4.12 and from joining the cluster.
      The rest of the errors were side effects of the "no space left on device" error.
      We could see that /dev/sda4 was correctly partitioned to 120 GiB. We compared it to a working system and the partition scheme matched.
      However, the filesystem was only 2.8 GiB instead of 120 GiB.
      We manually extended the root filesystem (xfs_growfs /), after which the / mount was resized to 120 GiB.
      The node was rebooted once this step was performed, and the system came up fine with RHCOS 4.12.
      We waited for the node to come up with kubelet and crio running, approved the pending certificates, and the node is now part of the cluster.
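The manual check above boils down to comparing the partition size with the filesystem size. A minimal sketch of that decision, with placeholder byte values standing in for what `lsblk -bno SIZE /dev/sda4` and `df -B1 --output=size /sysroot` would report on the affected node:

```shell
# Illustrative sketch only: the sizes below stand in for values that would
# come from `lsblk -bno SIZE /dev/sda4` and `df -B1 --output=size /sysroot`.
part_size=$((120 * 1024 * 1024 * 1024))  # partition size: 120 GiB
fs_size=$((3 * 1024 * 1024 * 1024))      # filesystem size observed: ~2.8 GiB

if [ "$fs_size" -lt "$part_size" ]; then
    # On the real node this is where the manual workaround was run:
    #   xfs_growfs /
    echo "filesystem smaller than partition: grow needed"
else
    echo "filesystem already matches partition size"
fi
```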

      Later, while checking the logs for RCA, we observed the errors below, which might help determine why the /sysroot mountpoint was not resized.
      ~~~
      $ grep -i growfs sos_commands/logs/journalctl_no-pager_-since_-3days
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory <---
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
      ~~~
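The same diagnostic grep (with the missing `-i` flag restored) can be exercised against a captured journal excerpt; the log lines here are copied verbatim from the output above:

```shell
# Recreate the diagnostic grep against the journal excerpt quoted above.
cat > journal_excerpt.txt <<'EOF'
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
EOF

# -i: case-insensitive match, as in the original sosreport command
grep -i growfs journal_excerpt.txt
```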

      Version-Release number of selected component (if applicable):
      OCP 4.12.18.
      IPI installation on RHV.

      How reproducible:
      Not able to reproduce the issue.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:
      The /sysroot mountpoint was not resized to the actual size of the /dev/sda4 partition, which prevented the machine-config-daemon-firstboot.service from completing, and the node was stuck at RHCOS version 4.6.

      As a workaround, the customer currently has to manually resize the /sysroot mountpoint every time they add a new node to the cluster.

      Expected results:
      The /sysroot mountpoint should be automatically resized as part of the ignition-ostree-growfs.sh script.
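The journal showed ignition-ostree-growfs.service failing to load, and the eventual fix was stricter ordering of services. As an illustration only (the unit names and drop-in path here are assumptions, not the actual Machine Config Operator change), a systemd drop-in that forces the image-pulling firstboot service to wait for filesystem growth could look like:

```ini
# Hypothetical drop-in, e.g.
# /etc/systemd/system/machine-config-daemon-firstboot.service.d/10-wait-for-growfs.conf
# Illustrative only: the real mitigation landed in the Machine Config Operator,
# and the unit names used here are assumptions.
[Unit]
# Do not start the firstboot service (which pulls images) until the
# root filesystem has been grown to the full partition size.
After=rhcos-growpart.service
Requires=rhcos-growpart.service
```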

      Additional info:
      The customer recently migrated from an old storage domain to a new one on RHV, in case that is relevant. However, they performed successful machineset scale-up tests with the new storage domain on OCP 4.11.33 (before upgrading OCP).
      They started facing issues with all the machinesets (new and existing) only after they upgraded OCP to 4.12.18.

      Attachments

        Issue Links

          Activity

            People

              walters@redhat.com Colin Walters
              rhn-support-suagarwa Sumit Agarwal (Inactive)
              Sergio Regidor de la Rosa
              Votes:
              1 Vote for this issue
              Watchers:
              27 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: