OpenShift Bugs / OCPBUGS-75226

metal3-machine-os-downloader init container fails with libguestfs read-only filesystem error, preventing worker node provisioning


      Summary

      The metal3 pod's metal3-machine-os-downloader init container fails due to libguestfs attempting to create temporary directories in /tmp, which is read-only due to readOnlyRootFilesystem: true in the pod's security context. This prevents worker node provisioning and causes cluster installation to fail.

      Environment

      Component

      • Component: Bare Metal Hardware Provisioning
      • Affected Resource: deployment/metal3 in openshift-machine-api namespace
      • Affected Container: metal3-machine-os-downloader (init container)

      Description

      During OpenShift 4.21 baremetal IPI installation, the metal3 pod fails to start because the metal3-machine-os-downloader init container crashes repeatedly. The container uses libguestfs tools (virt-filesystems, virt-ls) to inspect RHCOS images, but libguestfs attempts to create temporary directories in /tmp, which is read-only.

      Root Cause

      1. The pod's security context has readOnlyRootFilesystem: true (security best practice)
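      2. The init container script assigns TMPDIR=/shared/tmp/tmp.XXXXX
      3. libguestfs nevertheless creates its temporary directory under /tmp. Per the libguestfs documentation, it uses LIBGUESTFS_TMPDIR if set, then TMPDIR, then /tmp, so the TMPDIR assignment is apparently not reaching the virt-filesystems and virt-ls processes (for example, because it is not exported); the exact reason has not been confirmed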
      4. libguestfs fails with: libguestfs: error: /tmp/libguestfsXXXXX: cannot create temporary directory: Read-only file system
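One thing worth checking here, offered as an assumption rather than a confirmed diagnosis: the trace shows TMPDIR as a plain shell assignment, and an unexported shell variable is not inherited by child processes such as virt-filesystems. The sketch below shows that shell behavior with a hypothetical variable name, DEMO_TMPDIR, standing in for TMPDIR; it does not call libguestfs.

```shell
# DEMO_TMPDIR is a hypothetical stand-in for TMPDIR in the init container script.
unset DEMO_TMPDIR

# Plain assignment: visible in this shell only, not in child processes.
DEMO_TMPDIR=/shared/tmp/demo
unexported=$(sh -c 'printf "%s" "${DEMO_TMPDIR:-unset}"')

# After export, the variable becomes part of the child environment.
export DEMO_TMPDIR
exported=$(sh -c 'printf "%s" "${DEMO_TMPDIR:-unset}"')

echo "child without export: $unexported"
echo "child with export:    $exported"
```

If the container image does not already define TMPDIR in its environment, the same effect would explain why the virt-* tools fall back to /tmp.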

      Impact

      • Critical: Worker nodes cannot be provisioned
      • Worker BareMetalHosts remain unprovisioned (stuck in empty state)
      • Worker machines are stuck in Provisioning phase with error: "No available BareMetalHost found"
      • Cluster operators are degraded:
        • machine-api: Waiting for minimum worker replica count (2) not yet met
        • ingress: Router pods cannot be scheduled (no worker nodes, master nodes have taints)
        • authentication: Depends on ingress
        • console: Depends on ingress
      • Cluster installation cannot complete

      Steps to Reproduce

      1. Deploy OpenShift 4.21 baremetal IPI cluster
      2. Wait for bootstrap to complete
      3. Observe metal3 pod in openshift-machine-api namespace
      4. Pod will be in Init:CrashLoopBackOff state
      5. Check logs of metal3-machine-os-downloader init container

      Actual Results

      Pod Status

      NAME                                          READY   STATUS                  RESTARTS   AGE
      metal3-846bdc8b76-n7l4r                       0/5     Init:CrashLoopBackOff   22         97m
      

      Error Logs

      + mkdir -p /shared/tmp
      ++ mktemp -d -p /shared/tmp
      + TMPDIR=/shared/tmp/tmp.6DGdhqZTRe
      + trap 'rm -fr /shared/tmp/tmp.6DGdhqZTRe' EXIT
      + cd /shared/tmp/tmp.6DGdhqZTRe
      ...
      ++ LIBGUESTFS_BACKEND=direct
      ++ virt-filesystems -a rhcos-9.6.20251212-1-openstack.x86_64.qcow2 -l
      ++ cut -f1 '-d '
      ++ grep boot
      libguestfs: error: /tmp/libguestfsT8x4kP: cannot create temporary directory: Read-only file system
      + BOOT_DISK=
      ++ LIBGUESTFS_BACKEND=direct
      ++ virt-ls -a rhcos-9.6.20251212-1-openstack.x86_64.qcow2 -m '' /boot/loader/entries
      libguestfs: error: /tmp/libguestfsUkHROL: cannot create temporary directory: Read-only file system
      + BOOT_ENTRIES=
      + rm -fr /shared/tmp/tmp.o7pEYTrM8c
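To confirm which directory libguestfs tried to use, the failing paths can be pulled out of the container log. This is a minimal sketch; failed_tmpdirs is a hypothetical helper name, and it matches only the exact error text shown in the log above.

```shell
# Reads a container log on stdin and prints each path that libguestfs
# failed to create because the filesystem is read-only.
failed_tmpdirs() {
    sed -n 's/^libguestfs: error: \(.*\): cannot create temporary directory: Read-only file system$/\1/p'
}
```

It could be fed from the init container log, for example: oc logs -n openshift-machine-api <metal3-pod-name> -c metal3-machine-os-downloader | failed_tmpdirs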
      

      Pod Security Context

      securityContext:
        capabilities:
          drop:
          - ALL
        privileged: true
        readOnlyRootFilesystem: true  # <-- This causes the issue
      

      Environment Variables (Current)

      env:
      - name: RHCOS_IMAGE_URL
        value: "http://10.1.156.1/rhcos/images/rhcos-9.6.20251212-1-openstack.x86_64.qcow2.gz?sha256=..."
      - name: IP_OPTIONS
        value: "ip=dhcp,dhcp6"
      

      Worker Machines Status

      NAME                                    PHASE          TYPE   REGION   ZONE   AGE
      bm03-cnvqe2-rdu2-b942t-worker-0-64wm2   Provisioning                          96m
      bm03-cnvqe2-rdu2-b942t-worker-0-xb4zz   Provisioning                          96m
      bm03-cnvqe2-rdu2-b942t-worker-0-zngvh   Provisioning                          96m
      

      Worker BareMetalHosts Status

      NAME                                             STATE         CONSUMER   ONLINE   ERROR   AGE
      cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com  (empty)       (none)     true             128m
      cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com  (empty)       (none)     true             128m
      cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com  (empty)       (none)     true             128m
      

      Cluster Operator Status

      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.21.0    False       True          True       98m     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route...
      ingress                                              False       True          True       97m     The "default" ingress controller reports Available=False...
      machine-api                                          False       True          True       97m     Operator is initializing
      

      Expected Results

      The metal3-machine-os-downloader init container should successfully complete, allowing:

      • RHCOS images to be processed and cached
      • Worker BareMetalHosts to be inspected and provisioned
      • Worker nodes to join the cluster
      • Cluster installation to complete successfully

      Workaround Attempted

      Attempted to patch the deployment to add the LIBGUESTFS_TMPDIR environment variable to the init container:

      oc patch deployment metal3 -n openshift-machine-api --type='json' \
        -p='[{"op": "add", "path": "/spec/template/spec/initContainers/2/env/-", 
             "value": {"name": "LIBGUESTFS_TMPDIR", "value": "/shared/tmp"}}]'
      

      Result: The cluster-baremetal-operator reconciles the deployment and reverts the change, because the deployment is operator-managed.
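The patch above hard-codes the init container position (initContainers/2). A slightly more robust variant would look the position up by container name before building the patch. The helpers below (index_of, build_env_patch) are hypothetical names for illustration, and the operator would still revert the resulting change.

```shell
# Print the zero-based position of the first argument within the remaining arguments.
index_of() {
    local want="$1" i=0 name
    shift
    for name in "$@"; do
        if [ "$name" = "$want" ]; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Build a JSON patch that appends one env entry to the init container at the given index.
build_env_patch() {
    local index="$1" name="$2" value="$3"
    printf '[{"op":"add","path":"/spec/template/spec/initContainers/%s/env/-","value":{"name":"%s","value":"%s"}}]' \
        "$index" "$name" "$value"
}

# Usage against a live cluster (not executed here):
#   names=$(oc get deployment metal3 -n openshift-machine-api \
#       -o jsonpath='{.spec.template.spec.initContainers[*].name}')
#   idx=$(index_of metal3-machine-os-downloader $names)
#   oc patch deployment metal3 -n openshift-machine-api --type=json \
#       -p "$(build_env_patch "$idx" LIBGUESTFS_TMPDIR /shared/tmp)"
```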

      Proposed Solution

      The metal3 deployment managed by the cluster-baremetal-operator should be updated to:

      1. Option 1 (Recommended): Add the LIBGUESTFS_TMPDIR environment variable to the metal3-machine-os-downloader init container:
        env:
        - name: LIBGUESTFS_TMPDIR
          value: "/shared/tmp"

      2. Option 2: Mount a writable emptyDir volume at /tmp. The root filesystem itself stays read-only, so readOnlyRootFilesystem: true can be kept
      3. Option 3: Update the init container script to set and export LIBGUESTFS_TMPDIR before calling the libguestfs tools
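Option 3 could look roughly like the sketch below, which mirrors the mkdir/mktemp/trap sequence from the trace and additionally exports both variables. setup_guestfs_tmpdir is a hypothetical helper, not the actual downloader script, and mktemp -p assumes GNU coreutils as used in the trace.

```shell
# Create a private scratch directory under a writable base directory and
# point both TMPDIR and LIBGUESTFS_TMPDIR at it for all child processes.
setup_guestfs_tmpdir() {
    local base="$1" dir
    mkdir -p "$base" || return 1
    dir=$(mktemp -d -p "$base") || return 1
    # Export so that virt-filesystems, virt-ls and other children inherit them.
    export TMPDIR="$dir"
    export LIBGUESTFS_TMPDIR="$dir"
    # Remove the scratch directory when the script exits, as the original trap does.
    trap "rm -fr '$dir'" EXIT
}
```

In the init container this would be called as setup_guestfs_tmpdir /shared/tmp before the first libguestfs command.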

      Additional Information

      Related Bugs

      • Bugzilla 1043249: libguestfs fails to create appliance from /tmp (general libguestfs issue)
      • Bugzilla 1916649: libguestfs/libvirt container socket issues when /tmp is not shared

      Technical Details

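      • libguestfs behavior: per the libguestfs documentation, the temporary directory is taken from LIBGUESTFS_TMPDIR if set, then TMPDIR, then /tmp; in this container libguestfs ended up using /tmp even though the script assigns TMPDIR
      • libguestfs solution: setting the LIBGUESTFS_TMPDIR environment variable explicitly points libguestfs at a writable directory, independent of TMPDIR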
      • Pod security: readOnlyRootFilesystem: true is a security best practice and should be maintained
      • Volume mounts: The pod already has /shared mounted as a writable volume, which is the appropriate location for temporary files
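A quick way to compare the two locations from inside the container is to probe each one by creating and removing a directory. This is a minimal sketch; is_writable_dir is a hypothetical helper name.

```shell
# Return success if a scratch directory can be created (and removed) inside "$1".
is_writable_dir() {
    local probe
    probe=$(mktemp -d "$1/probe.XXXXXX" 2>/dev/null) || return 1
    rmdir "$probe"
}

# Report the state of the default location and of the shared volume.
for d in /tmp /shared/tmp; do
    if is_writable_dir "$d"; then
        echo "$d: writable"
    else
        echo "$d: not writable"
    fi
done
```

With readOnlyRootFilesystem: true and no volume mounted at /tmp, the first probe would be expected to fail, while /shared/tmp should succeed once the script has created it.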

      Cluster Configuration

      • Provisioning Network: Managed (172.22.0.0/24)
      • Provisioning IP: 172.22.0.3
      • Control Plane: 3 master nodes (provisioned successfully)
      • Workers: 3 worker nodes (stuck in provisioning)

      Commands for Verification

      # Check metal3 pod status
      oc get pods -n openshift-machine-api -l k8s-app=metal3
      
      # Check init container logs
      oc logs -n openshift-machine-api <metal3-pod-name> -c metal3-machine-os-downloader
      
      # Check worker machines
      oc get machines -n openshift-machine-api
      
      # Check worker BareMetalHosts
      oc get baremetalhosts -n openshift-machine-api
      
      # Check cluster operators
      oc get clusteroperator machine-api baremetal ingress authentication
      

      References

              rpittau@redhat.com Riccardo Pittau
              lbednar@redhat.com Lukas Bednar
              Jad Haj Yahya Jad Haj Yahya