OpenShift Bugs · OCPBUGS-62223

Worker node addition failed in OCI – CoreOS installer error (/dev/sda device busy)

      Description of problem:

      During worker node addition in OCI, the installation failed after 3 attempts. The error indicates that the CoreOS installer (coreos-installer) was unable to gain exclusive access to the target disk /dev/sda, as it was marked busy.
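The EBUSY failure means the kernel would not release the partition table because something still claims the disk. A minimal diagnostic sketch for narrowing that down from the host before retrying the install (the helper name and the /dev/sda target are illustrative, not part of the installer):

```shell
# check_disk_busy: list likely reasons a block device is held busy.
# Illustrative helper only; run from the host before retrying the install.
check_disk_busy() {
  local disk="$1"
  local name
  name=$(basename "$disk")
  # Mounted filesystems on the device or any of its partitions
  grep "^$disk" /proc/mounts || echo "no mounts on $disk"
  # Kernel holders (device-mapper, md, multipath) also keep the partition table busy
  if [ -d "/sys/class/block/$name/holders" ] && [ -n "$(ls -A "/sys/class/block/$name/holders" 2>/dev/null)" ]; then
    echo "holders of $name:"
    ls "/sys/class/block/$name/holders"
  else
    echo "no kernel holders for $name"
  fi
}

check_disk_busy /dev/sda
```

On OCI, iSCSI-attached volumes and device-mapper/multipath holders are common culprits; if a holder shows up here, releasing it before the coreos-installer retry may avoid the EBUSY.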

      Version-Release number of selected component (if applicable):

      4.19

      How reproducible:

      Always  

      Steps to Reproduce:

      1. Create the node ISO.
      2. Boot the created ISO in the OCI environment.
      3. Monitor the progress using the oc adm node-image monitor command and wait until the installer reaches the disk-writing stage.
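The steps above can be sketched end to end. This is a dry-run helper that only prints the commands; the MAC address and IP are placeholders, and the flag names follow the oc adm node-image usage shown elsewhere in this report:

```shell
# Print the reproduction flow as commands (dry run; nothing is executed
# against a cluster, since the MAC/IP values below are placeholders).
repro_commands() {
  cat <<'EOF'
# 1. Create the node ISO
oc adm node-image create --mac-address=<FAKE MAC ADDRESS>
# 2. Boot the generated ISO on the OCI instance (done from the OCI console)
# 3. Monitor progress until the disk-writing stage
oc adm node-image monitor --ip-addresses=10.0.17.156
EOF
}

repro_commands
```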

      Actual results:

      2025-09-23T19:11:22Z [node-image monitor] Node 10.0.17.156: Host 02-00-17-03-3c-0e: updated status from known to installing (Installation is in progress)
      2025-09-23T19:12:22Z [node-image monitor] Node 10.0.17.156: Host: 02-00-17-03-3c-0e, reached installation stage Failed: failed after 3 attempts, last error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/install-dir/worker-c9cd9447-3e67-434d-b720-aa44056ea61d.ign /dev/sda], Error exit status 1, LastOutput "Error: checking for exclusive access to /dev/sda
      2025-09-23T19:12:22Z [node-image monitor] time=2025-09-23T19:12:22Z level=info
      2025-09-23T19:12:22Z [node-image monitor] Caused by:
      2025-09-23T19:12:22Z [node-image monitor]     0: couldn't reread partition table: device is in use
      2025-09-23T19:12:22Z [node-image monitor]     1: EBUSY: Device or resource busy"
      2025-09-23T19:13:17Z [node-image monitor] Node 10.0.17.156: Uploaded logs for host 02-00-17-03-3c-0e cluster 4bb341d9-b582-436d-9920-4264912683af

      Expected results:

      The worker node should be added successfully.

      Additional info:

      Also tried adding the worker node with an extra block volume attached. However, after the image was written to the block volume and the node rebooted, it booted from the boot volume (the ISO) again instead of from the block volume, even though the expectation was that the node would boot from the block volume.
      
      2025-09-23T19:41:25Z [node-image monitor] Node 10.0.30.62: Host 02-00-17-01-cc-24: updated status from known to installing (Installation is in progress)
      2025-09-23T19:42:20Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 15%
      2025-09-23T19:42:25Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 27%
      2025-09-23T19:42:30Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 57%
      2025-09-23T19:42:35Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 71%
      2025-09-23T19:42:40Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 83%
      2025-09-23T19:42:45Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 95%
      2025-09-23T19:42:50Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Writing image to disk: 100%
      2025-09-23T19:43:00Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Waiting for control plane
      2025-09-23T19:43:05Z [node-image monitor] Node 10.0.30.62: Host: 02-00-17-01-cc-24, reached installation stage Rebooting
      2025-09-23T19:45:30Z [node-image monitor] Node 10.0.30.62: Error fetching status from assisted-service for node 10.0.30.62: Unable to retrieve cluster metadata from Agent Rest API: [GET /v2/clusters/{cluster_id}][404] v2GetClusterNotFound  &{Code:0xc000793510 Href:0xc000793540 ID:0xc00176610c Kind:0xc000793570 Reason:0xc0007935a0}
      2025-09-23T19:45:35Z [node-image monitor] Node 10.0.30.62: Error fetching status from assisted-service for node 10.0.30.62: Unable to retrieve cluster metadata from Agent Rest API: [GET /v2/clusters/{cluster_id}][

      Workaround:
      The following steps can be used as a workaround for adding worker nodes on OCP versions 4.18 and above, which use the Agent-Based Installer (ABI) mechanism for node addition.

      1. Prepare the node image:
        • Use the command: oc adm node-image create --mac-address=<FAKE MAC ADDRESS> --root-device-hint='deviceName:/dev/sdb'
      2. Create the node:
        • Proceed with creating the worker node, either via the Terraform module or manually.
      3. Attach the block volume (critical):
        • If using Terraform: attach the block volume to the created instance manually.
        • If creating the instance manually: the block volume can be attached during instance creation.
      4. When the node reboots:
        • Access the Cloud Shell: connect to the node’s cloud shell.
        • Modify the boot order:
          • Press 'e' in the GRUB menu to edit the kernel arguments.
          • Press 'Ctrl + C' to enter the GRUB command-line interface (CLI).
          • Type exit and press Enter to exit the GRUB CLI.
          • Type exit and press Enter again in the shell to exit.
          • Navigate to Boot Maintenance Manager > Boot Options > Change Boot Order.
          • Select BlockVolume2 and move it to the top of the list.
          • Commit the changes and exit. You may need to repeat these steps if the GRUB menu reappears.
      5. CSR approval (critical):
        • Monitor for pending CSRs: check the monitoring logs for the messages "First CSR Pending approval" and "Second CSR Pending approval".
        • Retrieve the CSRs: use the command: oc get csr
        • Approve each CSR: use the command: oc adm certificate approve <NAME> (replace <NAME> with the CSR name).
      6. Confirmation of node joining:
        • After completing all previous steps, you will see log messages similar to:
          2025-09-25T11:18:03Z [node-image monitor] Node 10.0.30.224: Node joined cluster
          2025-09-25T11:18:03Z [node-image monitor] Node 10.0.30.224: Node is Ready
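The CSR-approval step of the workaround can be scripted rather than approved by hand. A minimal sketch, assuming the standard oc get csr table where the condition ("Pending") is the last column; the helper name is made up for illustration:

```shell
# approve_pending_csrs: approve every CSR currently in Pending state.
# Sketch only; assumes the condition is the last column of `oc get csr` output.
approve_pending_csrs() {
  oc get csr --no-headers | awk '$NF == "Pending" { print $1 }' | \
  while read -r name; do
    oc adm certificate approve "$name"
  done
}

if command -v oc >/dev/null 2>&1; then
  approve_pending_csrs
else
  echo "oc not found; nothing approved"
fi
```

Run it twice: once for the first (client) CSR, and again for the second CSR that appears after the first is approved, matching the "First/Second CSR Pending approval" messages in the monitor log.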

              afasano@redhat.com Andrea Fasano
              rhn-support-mhans Manoj Hans
              Votes: 0
              Watchers: 7