-
Bug
-
Resolution: Duplicate
-
Critical
-
4.21.0
Summary
The metal3 pod's metal3-machine-os-downloader init container fails due to libguestfs attempting to create temporary directories in /tmp, which is read-only due to readOnlyRootFilesystem: true in the pod's security context. This prevents worker node provisioning and causes cluster installation to fail.
Environment
- OpenShift Version: 4.21.0
- Kubernetes Version: v1.34.2
- Cluster: bm03-cnvqe2-rdu2.cnvqe2.lab.eng.rdu2.redhat.com
- Installation Method: Bare Metal IPI
- Jenkins Job: infra-deploy-ocp-bare-metal-cluster-cnv-4.21/139
- Job URL: https://jenkins-csb-cnvqe-main.dno.corp.redhat.com/job/infra-deploy-ocp-bare-metal-cluster-cnv-4.21/139/consoleFull
Component
- Component: Bare Metal Hardware Provisioning
- Affected Resource: deployment/metal3 in openshift-machine-api namespace
- Affected Container: metal3-machine-os-downloader (init container)
Description
During OpenShift 4.21 baremetal IPI installation, the metal3 pod fails to start because the metal3-machine-os-downloader init container crashes repeatedly. The container uses libguestfs tools (virt-filesystems, virt-ls) to inspect RHCOS images, but libguestfs attempts to create temporary directories in /tmp, which is read-only.
Root Cause
- The pod's security context has readOnlyRootFilesystem: true (security best practice)
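- The init container script sets TMPDIR=/shared/tmp/tmp.XXXXX
- However, the libguestfs tools still create their temporary directory under /tmp. The trace shows TMPDIR set by a plain shell assignment with no export, so the value may never reach the child processes; the libguestfs documentation gives the lookup order as LIBGUESTFS_TMPDIR, then TMPDIR, then /tmp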
- libguestfs fails with: libguestfs: error: /tmp/libguestfsXXXXX: cannot create temporary directory: Read-only file system
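One detail in the trace may be worth ruling out: `+ TMPDIR=...` appears as a plain assignment, and a shell variable that is not exported is invisible to child processes such as virt-filesystems. The sketch below illustrates that general shell behavior only. DEMO_TMPDIR and its value are hypothetical names for the demonstration and are not taken from the downloader script.

```shell
#!/bin/sh
# DEMO_TMPDIR is a hypothetical variable used only for this demonstration.
unset DEMO_TMPDIR

# Plain assignment: visible in this shell, but not passed to child processes.
DEMO_TMPDIR=/shared/tmp/demo
plain=$(sh -c 'printf "%s" "${DEMO_TMPDIR:-unset}"')

# After export, child processes inherit the value.
export DEMO_TMPDIR
exported=$(sh -c 'printf "%s" "${DEMO_TMPDIR:-unset}"')

echo "child without export: $plain"
echo "child with export:    $exported"
```

If the downloader script behaves the same way, exporting the variable (or setting LIBGUESTFS_TMPDIR in the container environment) would be enough for the tools to see it.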
Impact
- Critical: Worker nodes cannot be provisioned
- Worker BareMetalHosts remain unprovisioned (stuck in empty state)
- Worker machines are stuck in Provisioning phase with error: "No available BareMetalHost found"
- Cluster operators are degraded:
  - machine-api: Waiting for minimum worker replica count (2) not yet met
  - ingress: Router pods cannot be scheduled (no worker nodes, master nodes have taints)
  - authentication: Depends on ingress
  - console: Depends on ingress
- Cluster installation cannot complete
Steps to Reproduce
- Deploy OpenShift 4.21 baremetal IPI cluster
- Wait for bootstrap to complete
- Observe metal3 pod in openshift-machine-api namespace
- Pod will be in Init:CrashLoopBackOff state
- Check logs of metal3-machine-os-downloader init container
Actual Results
Pod Status
NAME                      READY   STATUS                  RESTARTS   AGE
metal3-846bdc8b76-n7l4r   0/5     Init:CrashLoopBackOff   22         97m
Error Logs
+ mkdir -p /shared/tmp
++ mktemp -d -p /shared/tmp
+ TMPDIR=/shared/tmp/tmp.6DGdhqZTRe
+ trap 'rm -fr /shared/tmp/tmp.6DGdhqZTRe' EXIT
+ cd /shared/tmp/tmp.6DGdhqZTRe
...
++ LIBGUESTFS_BACKEND=direct
++ virt-filesystems -a rhcos-9.6.20251212-1-openstack.x86_64.qcow2 -l
++ cut -f1 '-d '
++ grep boot
libguestfs: error: /tmp/libguestfsT8x4kP: cannot create temporary directory: Read-only file system
+ BOOT_DISK=
++ LIBGUESTFS_BACKEND=direct
++ virt-ls -a rhcos-9.6.20251212-1-openstack.x86_64.qcow2 -m '' /boot/loader/entries
libguestfs: error: /tmp/libguestfsUkHROL: cannot create temporary directory: Read-only file system
+ BOOT_ENTRIES=
+ rm -fr /shared/tmp/tmp.o7pEYTrM8c
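The failure signature in these logs can be checked for mechanically, which may help when triaging several clusters. The helper below is a minimal sketch: the function name has_guestfs_tmp_error is hypothetical, and the pattern is derived only from the two error lines quoted in this report.

```shell
#!/bin/sh
# has_guestfs_tmp_error: hypothetical helper. Reads log text on stdin and
# succeeds if it contains the libguestfs read-only /tmp failure seen above.
has_guestfs_tmp_error() {
    grep -Eq 'libguestfs: error: /tmp/libguestfs[^:]*: cannot create temporary directory: Read-only file system'
}

# Example usage against a live cluster (the pod name is a placeholder):
#   oc logs -n openshift-machine-api <metal3-pod-name> \
#       -c metal3-machine-os-downloader | has_guestfs_tmp_error \
#       && echo "read-only /tmp failure present"
```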
Pod Security Context
securityContext:
  capabilities:
    drop:
    - ALL
  privileged: true
  readOnlyRootFilesystem: true  # <-- This causes the issue
Environment Variables (Current)
env:
- name: RHCOS_IMAGE_URL
  value: "http://10.1.156.1/rhcos/images/rhcos-9.6.20251212-1-openstack.x86_64.qcow2.gz?sha256=..."
- name: IP_OPTIONS
  value: "ip=dhcp,dhcp6"
Worker Machines Status
NAME                                    PHASE          TYPE   REGION   ZONE   AGE
bm03-cnvqe2-rdu2-b942t-worker-0-64wm2   Provisioning                          96m
bm03-cnvqe2-rdu2-b942t-worker-0-xb4zz   Provisioning                          96m
bm03-cnvqe2-rdu2-b942t-worker-0-zngvh   Provisioning                          96m
Worker BareMetalHosts Status
NAME                                             STATE     CONSUMER   ONLINE   ERROR   AGE
cnv-qe-infra-17.cnvqe2.lab.eng.rdu2.redhat.com   (empty)   (none)     true             128m
cnv-qe-infra-18.cnvqe2.lab.eng.rdu2.redhat.com   (empty)   (none)     true             128m
cnv-qe-infra-19.cnvqe2.lab.eng.rdu2.redhat.com   (empty)   (none)     true             128m
Cluster Operator Status
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.21.0    False       True          True       98m     OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route...
ingress                    False       True          True       97m     The "default" ingress controller reports Available=False...
machine-api                False       True          True       97m     Operator is initializing
Expected Results
The metal3-machine-os-downloader init container should successfully complete, allowing:
- RHCOS images to be processed and cached
- Worker BareMetalHosts to be inspected and provisioned
- Worker nodes to join the cluster
- Cluster installation to complete successfully
Workaround Attempted
Attempted to patch the deployment to add LIBGUESTFS_TMPDIR environment variable:
oc patch deployment metal3 -n openshift-machine-api --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/initContainers/2/env/-", "value": {"name": "LIBGUESTFS_TMPDIR", "value": "/shared/tmp"}}]'
Result: The baremetal operator reconciles and reverts the change, as the deployment is operator-managed.
Proposed Solution
The metal3 deployment managed by the cluster-baremetal-operator should be updated to:
- Option 1 (Recommended): Add LIBGUESTFS_TMPDIR environment variable to the metal3-machine-os-downloader init container:
  env:
  - name: LIBGUESTFS_TMPDIR
    value: "/shared/tmp"
- Option 2: Mount a writable emptyDir volume at /tmp. A mounted volume is separate from the container's root filesystem, so readOnlyRootFilesystem: true can stay in place
- Option 3: Update the init container script to set LIBGUESTFS_TMPDIR before calling libguestfs tools
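Option 3 could look roughly like the sketch below. It is not the actual downloader script: setup_guestfs_tmpdir is a hypothetical helper name, `$IMAGE` in the usage comment is a placeholder, and `mktemp -d -p` assumes GNU coreutils, as already used in the trace above.

```shell
#!/bin/sh
# setup_guestfs_tmpdir: hypothetical helper, not the actual downloader script.
# Creates a private scratch directory under the given writable base directory
# and exports it so that child processes (the libguestfs tools) inherit it.
setup_guestfs_tmpdir() {
    base=$1
    mkdir -p "$base" || return 1
    scratch=$(mktemp -d -p "$base") || return 1
    # LIBGUESTFS_TMPDIR is the variable libguestfs documents for this purpose;
    # TMPDIR is exported as well for any other tools the script runs.
    export LIBGUESTFS_TMPDIR="$scratch"
    export TMPDIR="$scratch"
}

# Intended use inside the init container, before any libguestfs call:
#   setup_guestfs_tmpdir /shared/tmp
#   LIBGUESTFS_BACKEND=direct virt-filesystems -a "$IMAGE" -l
```

Exporting both variables keeps the fix inside the script, so it would not depend on the deployment's env section.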
Additional Information
Related Bugs
- Bugzilla 1043249: libguestfs fails to create appliance from /tmp (general libguestfs issue)
- Bugzilla 1916649: libguestfs/libvirt container socket issues when /tmp is not shared
Technical Details
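- libguestfs behavior: the tools created their temporary directory in /tmp even though the script set TMPDIR. The libguestfs documentation states that TMPDIR is used when LIBGUESTFS_TMPDIR is unset, so the likely explanation is that TMPDIR was not exported to the tools, not that libguestfs ignores it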
- libguestfs solution: libguestfs provides LIBGUESTFS_TMPDIR environment variable specifically for this purpose
- Pod security: readOnlyRootFilesystem: true is a security best practice and should be maintained
- Volume mounts: The pod already has /shared mounted as a writable volume, which is the appropriate location for temporary files
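The guestfs(3) manual page describes the temporary directory lookup as LIBGUESTFS_TMPDIR first, then TMPDIR, then /tmp, so it may be worth confirming which value actually reaches the tools and whether that directory is writable. The sketch below mirrors that order in shell. Both function names are hypothetical, and treating an empty value like an unset one is an assumption of this sketch.

```shell
#!/bin/sh
# effective_guestfs_tmpdir: hypothetical helper that mirrors the documented
# lookup order (LIBGUESTFS_TMPDIR, then TMPDIR, then /tmp).
# Assumption: an empty value is treated like an unset one here.
effective_guestfs_tmpdir() {
    printf '%s\n' "${LIBGUESTFS_TMPDIR:-${TMPDIR:-/tmp}}"
}

# warn_if_unwritable: hypothetical helper; reports whether the resolved
# directory exists and can be written by the current user.
warn_if_unwritable() {
    dir=$(effective_guestfs_tmpdir)
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir is writable"
    else
        echo "warning: $dir is missing or not writable" >&2
        return 1
    fi
}
```

Note that the helper reads the current shell's variables, including ones that are not exported, so it should be run from a child process of the script (for example via `sh -c`) to reflect what the libguestfs tools would see.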
Cluster Configuration
- Provisioning Network: Managed (172.22.0.0/24)
- Provisioning IP: 172.22.0.3
- Control Plane: 3 master nodes (provisioned successfully)
- Workers: 3 worker nodes (stuck in provisioning)
Commands for Verification
# Check metal3 pod status
oc get pods -n openshift-machine-api -l k8s-app=metal3

# Check init container logs
oc logs -n openshift-machine-api <metal3-pod-name> -c metal3-machine-os-downloader

# Check worker machines
oc get machines -n openshift-machine-api

# Check worker BareMetalHosts
oc get baremetalhosts -n openshift-machine-api

# Check cluster operators
oc get clusteroperator machine-api baremetal ingress authentication
References
- Jenkins Job: https://jenkins-csb-cnvqe-main.dno.corp.redhat.com/job/infra-deploy-ocp-bare-metal-cluster-cnv-4.21/139/consoleFull
- Cluster: bm03-cnvqe2-rdu2.cnvqe2.lab.eng.rdu2.redhat.com
- libguestfs documentation: https://libguestfs.org/guestfs.3.html#environment-variables
- relates to: OCPBUGS-70157 metal3 pod fails due to unable to create directory in /shared/tmp
-
- Verified