OCPBUGS-35503

[4.16] metal3 pod produces too many logs and eats up the node disk space


    • Critical
    • Release Note Not Required
    • In Progress

      This is a clone of bug OCPBUGS-35211, so that the fix can be backported to 4.16.
      ------
      Description of problem:

      The ACM perf/scale hub OCP cluster has 3 bare-metal nodes, each with a 480GB installation disk. The metal3 pod uses too much disk space for logs, which puts the node under disk pressure and starts evicting pods, which in turn makes ACM stop provisioning clusters.
      Below are the log sizes of the metal3 pod:
      # du -h -d 1 /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
      4.0K	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/machine-os-images
      276M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-httpd
      181M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic
      384G	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
      77M	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ironic-inspector
      385G	/sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83
      
      # ls -l -h /sysroot/ostree/deploy/rhcos/var/log/pods/openshift-machine-api_metal3-9df7c7576-9t7dd_7c72c6d6-168d-4c8e-a3c3-3ce8c0518b83/metal3-ramdisk-logs
      total 384G
      -rw-------. 1 root root 203G Jun 10 12:44 0.log
      -rw-r--r--. 1 root root 6.5G Jun 10 09:05 0.log.20240610-084807.gz
      -rw-r--r--. 1 root root 8.1G Jun 10 09:27 0.log.20240610-090606.gz
      -rw-------. 1 root root 167G Jun 10 09:27 0.log.20240610-092755

      The logs are too large to attach. Please contact me if you need access to the cluster to investigate.
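
      For reference, a minimal way to spot the runaway log files on a hub node (a sketch only; <node-name> is a placeholder and the path layout is assumed to match the output above):

      # oc debug node/<node-name>
      # chroot /host
      # find /var/log/pods/openshift-machine-api_metal3-* -type f -size +1G -exec ls -lh {} \;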


      Version-Release number of selected component (if applicable):

      The issue occurs on 4.16.0-rc4; 4.16.0-rc3 does not have the issue.

      How reproducible:

       

      Steps to Reproduce:

      1. Install the latest ACM 2.11.0 build on OCP 4.16.0-rc4 and deploy 3500 SNOs on bare-metal hosts.
      

      Actual results:

      ACM stops deploying the remaining SNOs after 1913 SNOs are deployed because the ACM pods are being evicted.
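
      For context, a quick way to confirm the disk-pressure evictions on the hub (a minimal sketch; assumes cluster-admin access to the hub cluster, <node-name> is a placeholder):

      # oc get events -A --field-selector reason=Evicted
      # oc describe node <node-name> | grep -i pressure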

      Expected results:

      All 3500 SNOs are deployed.

      Additional info:

       

            rh-ee-masghar Mahnoor Asghar
            rhn-support-txue Ting Xue
            Jad Haj Yahya