-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.16.0
This is a clone of bug OCPBUGS-35211, so that the fix can be backported to 4.16.
------
Description of problem:
The ACM perf/scale hub OCP cluster has 3 bare-metal nodes, each with a 480GB installation disk. The metal3 pod writes so much log data that the node goes into disk pressure and starts evicting pods, which stops ACM from provisioning clusters. Below are the log sizes of the metal3 pod:

# du -h -d 1 /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
4.0K /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/machine-os-images
276M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-httpd
181M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic
384G /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
77M /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic-inspector
385G /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83

# ls -l -h /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
total 384G
-rw-------. 1 root root 203G Jun 10 12:44 0.log
-rw-r--r--. 1 root root 6.5G Jun 10 09:05 0.log.20240610-084807.gz
-rw-r--r--. 1 root root 8.1G Jun 10 09:27 0.log.20240610-090606.gz
-rw-------. 1 root root 167G Jun 10 09:27 0.log.20240610-092755
The logs are too large to attach. Please contact me if you need access to the cluster to check.
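The per-container breakdown above can be reproduced with a small helper. This is a minimal sketch, assuming pod logs live under /var/log/pods on the node (the /sysroot/ostree/deploy/rhcos prefix in the output above applies when inspecting the host's ostree deploy root from a debug shell); `pod_log_usage` is a hypothetical helper name, not part of any OpenShift tooling.

```shell
# Sketch: report per-container log usage for a pod log directory, largest first.
# Argument (optional): the log root to inspect; defaults to /var/log/pods.
pod_log_usage() {
  local log_root="${1:-/var/log/pods}"
  # -d 1 limits du to one level, giving one line per container log directory;
  # sort -rh orders the human-readable sizes largest first.
  du -h -d 1 "$log_root" 2>/dev/null | sort -rh
}
```

Running it against the metal3 pod directory would have surfaced the 384G metal3-ramdisk-logs directory at the top of the list.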
Version-Release number of selected component (if applicable):
4.16.0-rc4 has the issue; 4.16.0-rc3 does not.
How reproducible:
Steps to Reproduce:
1. Install the latest ACM 2.11.0 build on OCP 4.16.0-rc4 and deploy 3500 SNOs on bare-metal hosts.
Actual results:
ACM stops deploying the remaining SNOs after 1913 SNOs are deployed because the ACM pods are being evicted.
Expected results:
3500 SNOs are deployed.
Additional info:
- clones
-
OCPBUGS-35211 metal3 pod produces too much logs and eats up the node disk space
- Closed
- depends on
-
OCPBUGS-35211 metal3 pod produces too much logs and eats up the node disk space
- Closed
- is duplicated by
-
OCPBUGS-35741 metal3 pod produces too much logs and eats up the node disk space
- Closed
- links to
-
RHSA-2024:0041 OpenShift Container Platform 4.16.0 bug fix and security update