OCPBUGS-20418: /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup


      Description of problem:
      New machines got stuck in the Provisioned state when the customer tried to scale up the machineset.
      ~~~
      NAME PHASE TYPE REGION ZONE AGE
      ocp4-ftf8t-worker-2-wn6lp Provisioned 44m
      ocp4-ftf8t-worker-redhat-x78s5 Provisioned 44m
      ~~~
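      For reference, a listing like the one above can presumably be obtained with the command below (assuming the default openshift-machine-api namespace used by IPI clusters):
      ~~~
      # Assumed command for the listing above; Machine objects live in
      # the openshift-machine-api namespace.
      $ oc get machines -n openshift-machine-api
      ~~~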

      Upon checking the journalctl logs from these VMs, we noticed that image pulls were failing with "no space left on device" errors.

      To troubleshoot the issue further, we had to break the root password in order to log in and investigate.

      Once the root password was broken, we logged in to the system and checked the journalctl logs for failures.
      We could see "no space left on device" errors for image pulls. In the df -h output, /dev/sda4 (/dev/mapper/coreos-luks-root-nocrypt), which is mounted on /sysroot, was 100% full.
      Because images fail to pull, machine-config-daemon-firstboot.service never completes, which prevents the node from being updated to 4.12 and from joining the cluster.
      The rest of the errors were side effects of the "no space left on device" error.
      We could see that /dev/sda4 was correctly partitioned to 120 GiB; the partition scheme matched that of a working system.
      The filesystem, however, was only 2.8 GiB instead of 120 GiB.
      We manually extended the filesystem for / (xfs_growfs /), after which the / mount was resized to 120 GiB.
      The node was rebooted once this step was performed, and the system came up fine on 4.12 RHCOS.
      We waited a while for the node to come up with kubelet and crio running, approved the pending certificates, and the node is now part of the cluster.
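      For reference, a minimal sketch of the diagnostics and manual workaround described above, assuming the root filesystem is XFS (the RHCOS default); device names will differ per node, and /sysroot is normally mounted read-only on RHCOS, so it may need a remount first:
      ~~~
      # Compare the partition size against the filesystem size.
      $ lsblk /dev/sda4
      $ df -h /sysroot

      # /sysroot is mounted read-only on RHCOS; remount it writable first.
      $ sudo mount -o remount,rw /sysroot

      # Grow the XFS filesystem to fill the partition (equivalent to the
      # "xfs_growfs /" used above, since / and /sysroot share the same filesystem).
      $ sudo xfs_growfs /sysroot
      ~~~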

      Later, while checking the logs for RCA, we observed the errors below, which might help determine why the /sysroot mountpoint was not resized.
      ~~~
      $ grep -i growfs sos_commands/logs/journalctl_no-pager_-since_-3days
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory <---
      Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
      ~~~
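      To confirm on an affected node whether the growfs unit was ever loaded and executed, something along these lines can be checked (note that the ignition-ostree-* units normally run only in the initramfs, so they may legitimately be absent from the real root):
      ~~~
      # Is a unit file present, and what is its last recorded state?
      $ systemctl status ignition-ostree-growfs.service
      $ systemctl cat ignition-ostree-growfs.service

      # Journal entries from the unit itself, if the initramfs journal was kept.
      $ journalctl -u ignition-ostree-growfs.service --no-pager
      ~~~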

      Version-Release number of selected component (if applicable):
      OCP 4.12.18.
      IPI installation on RHV.

      How reproducible:
      Not able to reproduce the issue.

      Steps to Reproduce:

      N/A. The issue could not be reproduced on demand; in the customer's environment it occurred on every machineset scale-up after the upgrade to 4.12.18.

      Actual results:
      The /sysroot mountpoint was not resized to the actual size of the /dev/sda4 partition, which prevented machine-config-daemon-firstboot.service from completing and left the node stuck at RHCOS version 4.6.

      As a workaround, the customer currently has to manually resize the /sysroot mountpoint every time a new node is added to the cluster.

      Expected results:
      The /sysroot mountpoint should be resized automatically as part of the ignition-ostree-growfs.sh script.
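      For context, a very simplified sketch of what the automatic first-boot growfs step is expected to do; the real ignition-ostree-growfs.sh shipped in the RHCOS initramfs additionally handles LUKS, multipath and other layouts, so this is illustrative only:
      ~~~
      #!/bin/bash
      # Illustrative sketch only: grow the partition backing /sysroot, then the
      # XFS filesystem on it, which is roughly what the first-boot growfs step
      # should have done automatically.
      set -euo pipefail

      src=$(findmnt -nvr -o SOURCE /sysroot)   # e.g. /dev/sda4
      disk=/dev/$(lsblk -no PKNAME "$src")     # parent disk, e.g. /dev/sda
      partnum=${src##*[!0-9]}                  # trailing digits, e.g. 4

      growpart "$disk" "$partnum" || true      # exits non-zero when no change is needed
      xfs_growfs /sysroot                      # grow the filesystem to the partition size
      ~~~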

      Additional info:
      The customer recently migrated from the old storage domain to a new one on RHV, in case that is relevant. However, they performed successful machineset scale-up tests with the new storage domain on OCP 4.11.33 (before upgrading OCP).
      They started facing the issue with all machinesets (new and existing) only after upgrading to OCP 4.12.18.

            [OCPBUGS-20418] /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.14.2 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:6837


            Sergio Regidor de la Rosa added a comment - edited

            Using IPI on AWS 4.14.0-0.nightly-2023-11-03-110310

            We have executed the following test cases post-merge

             

            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-LongDuration-High-63894-Scaleup using 4.1 cloud image[Disruptive] [Serial]"
            
            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-High-52822-Create new config resources with 2.2.0 ignition boot image nodes [Disruptive] [Serial]"
            
            "[sig-mco] MCO scale Author:sregidor-NonHyperShiftHOST-NonPreRelease-Longduration-High-65923-SSH key in scaled clusters [Disruptive] [Serial]"

             

            The rest of the verification was executed pre-merge.

             

            Since it has been verified both pre-merge and post-merge, we can move the status to Verified.

            Nevertheless, because of the comment above saying that we should not move it to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), we will wait for the proper release notes to be provided before changing the status.


            OpenShift Jira Bot added a comment -

            Hi walters@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type("Bug Fix" or "No Doc Update") and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.


            Sergio Regidor de la Rosa added a comment - Pre-merge verified in: https://github.com/openshift/machine-config-operator/pull/3967#issuecomment-1764422525

            Scott Dodson added a comment -

            ACK, sounds good to me. I've kicked off payload testing using the 4.14 blocking suite on the PR in the meantime.


            Colin Walters added a comment -

            This required a highly nontrivial change to the systemd units, and I think it should have some "soak time" in 4.15 before we ship it in 4.14. I'd say at least 2 weeks.

