Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.12.z
Affects Version/s: 4.13, 4.12, 4.14
Component/s: Machine Config Operator
Labels:

Severity:
Moderate
Regression:
No
Sprint:
MCO Sprint 251
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Target Version:

4.12.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-20418~~. The following is the description of the original issue:
—
Description of problem:
New machines got stuck in Provisioned state when the customer tried to scale the machineset.
~~~
NAME PHASE TYPE REGION ZONE AGE
ocp4-ftf8t-worker-2-wn6lp Provisioned 44m
ocp4-ftf8t-worker-redhat-x78s5 Provisioned 44m
~~~

Upon checking the journalctl logs from these VMs, we noticed that it was failing with "no space left on the device" errors while pulling images.

To troubleshoot the issue further we had to break root password in order to login and check the issue further.

Once root password was broken, we logged in to the system and check journalctl logs for failure errors.
We could see "no space left of device" for image pulls. Checking df -h output we could see /dev/sda4 (/dev/mapper/coreos-luks-root-nocrypt) which is mounted on /sysroot was 100% full.
As image would fail to get pulled, the machine-config-daemon-firstboot.service will not get completed. This would not allow us to get the node to 4.12, nor be part of the cluster.
The rest of the errors were side effect of the "no space left on device" error.
We could see that the /dev/sda4 was correctly partitioned to 120Gib. We compared to the working system and partition scheme matched.
The filesystem was only of 2.8 Gib instead of 120 Gib.
We manually extended the filesystem for / (xfs_growfs /) after which / mount was resized to 120Gib.
The node got rebooted once this step was performed and system came up fine with 4.12 Red Hat Coreos.
We waited for a while for the node to come up with kubelet and crio running, approved the certs and now the node is part of the cluster.

Later while checking the logs for RCA, we observed below errors from the logs which might help in determining why the sysroot mountpoint was not resized.
~~~
$ grep ~~i growfs sos_commands/logs/journalctl_no-pager_~~-since_-3days
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory <---
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
~~~

Version-Release number of selected component (if applicable):
OCP 4.12.18.
IPI installation on RHV.

How reproducible:
Not able to reproduce the issue.

Steps to Reproduce:

1.
2.
3.

Actual results:
The /sysroot mountpoint was not resized to the actual size of the /dev/sda4 partition which further prevented the machine-config-daemon-firstboot.service from completing and the node was stuck at RHCOS version 4.6.

Currently the customer has to manually resize the /sysroot mountpoint everytime he adds a new node in the cluster as a workaround.

Expected results:
The /sysroot mountpoint should be automatically resized as a part of ignition-ostree-growfs.sh script.

Additional info:
The customer has recently migrated from old storagedomain to a new one on RHV if that matters? However they performed successful machineset scaleup tests with the new storagedomain on OCP 4.11.33 (before upgrading OCP).
They started facing issue with all the machinesets (new/existing) only after they upgraded the OCP version to 4.12.18.

clones

OCPBUGS-23536 /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup

Closed

depends on

OCPBUGS-23536 /sysroot mountpoint failed to resize automatically on new nodes during machineset scaleup

Closed

links to

openshift/machine-config-operator#4268: [release-4.12] OCPBUGS-30823: Introduce kubelet-dependencies.target and firstboot-osupdate.target

RHBA-2024:1572 OpenShift Container Platform 4.12.z bug fix update

Assignee:: Sinny Kumari

Reporter:: OpenShift Prow Bot

QA Contact:: Sergio Regidor de la Rosa

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Created:: 2024/03/12 3:29 PM

Updated:: 2024/04/03 6:57 AM

Resolved:: 2024/04/03 6:57 AM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates