OpenShift Bugs / OCPBUGS-77945

SNO bootstrap-in-place permafail since RHCOS 9.8.20260303-1 — operators never converge within timeout


      Description of problem

      SNO (Single Node OpenShift) bootstrap-in-place has been permafailing since the RHCOS update from 9.8.20260227-0 to 9.8.20260303-1. The bootstrap process times out waiting for cluster operators to converge. Two CI jobs are affected with 0% pass rate and 9+ consecutive failures since March 4, 2026.

      The bootstrap previously completed in ~3440s (within the 4200s timeout) but now consistently exceeds it. Since the failure is 100% (not intermittent), this appears to be a hard regression rather than a timing margin issue — suggesting something fundamentally changed in the bootstrap flow.

      Version-Release number

      4.22 — specifically nightlies using RHCOS 9.8.20260303-1

      How reproducible

      Always — 0% pass rate, 9+ consecutive failures on both affected jobs since March 4.

      Steps to Reproduce

      Build or use a 4.22 nightly payload containing RHCOS 9.8.20260303-1 (any nightly from 4.22.0-0.nightly-2026-03-04-024042 onward)

      Deploy SNO using bootstrap-in-place on bare metal

      Wait for bootstrap to complete

      Actual results

      Bootstrap times out after 4200 seconds (70 minutes). All cluster operators remain in "not available" state. The API server at 192.168.127.10:6443 is unreachable. Must-gather fails because the cluster never becomes accessible.

      Primary failure message:

      TimeoutExpired: Timeout of 4200 seconds expired waiting for all operators to get up
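      The timeout fires while waiting for every ClusterOperator to report Available=True. A minimal, self-contained sketch of that convergence check (the operator names and condition data below are illustrative; the real check lives in assisted-test-infra and queries the live cluster):

```python
# Sketch: decide whether a set of ClusterOperators has converged.
# The input mirrors the shape of ClusterOperator status conditions
# (as in `oc get clusteroperators -o json`), but the data is made up.

def unavailable_operators(operators):
    """Return names of operators whose Available condition is not True."""
    pending = []
    for op in operators:
        conditions = {c["type"]: c["status"] for c in op["conditions"]}
        if conditions.get("Available") != "True":
            pending.append(op["name"])
    return pending

# In the failing runs every operator stays "not available", so a list
# like this never empties and the 4200 s timeout eventually expires.
sample = [
    {"name": "etcd",
     "conditions": [{"type": "Available", "status": "True"}]},
    {"name": "kube-apiserver",
     "conditions": [{"type": "Available", "status": "False"}]},
]
print(unavailable_operators(sample))  # prints ['kube-apiserver']
```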
      

      Secondary cascading failure (~30% of runs): the previous timed-out run leaves the bare metal node in a dirty state, causing the next run to fail immediately with:

      mkfs.xfs: cannot open /dev/nvme0n1: Device or resource busy
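      The cascading failure could be guarded by a pre-check that refuses to format a device still present in the mount table. A hypothetical sketch (the /proc/mounts line format is standard; the pre-check itself is an assumption, not something the CI step currently does):

```python
def busy_devices(mounts_text, device):
    """Return (source, mountpoint) entries from /proc/mounts-style text
    that use `device` or one of its partitions
    (e.g. /dev/nvme0n1p4 for /dev/nvme0n1)."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith(device):
            hits.append((fields[0], fields[1]))
    return hits

# Illustrative mount table resembling the dirty state a timed-out run
# could leave behind on the bare metal node.
mounts = """\
/dev/nvme0n1p4 /var/lib/containers xfs rw 0 0
tmpfs /run tmpfs rw 0 0
"""
print(busy_devices(mounts, "/dev/nvme0n1"))
# prints [('/dev/nvme0n1p4', '/var/lib/containers')]
```

      A non-empty result would explain the "Device or resource busy" error and signal that the node needs cleanup (unmount/wipe) before the next run.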
      

      Expected results

      Bootstrap-in-place should complete successfully and all cluster operators should converge within the timeout period, as they did with RHCOS 9.8.20260227-0.

      Additional info

      Bisection

      • Last passing payload: 4.22.0-0.nightly-2026-03-03-150411
      • Last passing RHCOS: 9.8.20260227-0
      • First failing payload: 4.22.0-0.nightly-2026-03-04-024042
      • First failing RHCOS: 9.8.20260303-1

      Cross-version comparison

      The equivalent 4.21 job (e2e-metal-ovn-single-node-live-iso) passes at ~85% during the same period, confirming this is a 4.22-specific regression.

      Affected CI jobs

      • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-single-node-live-iso — 0.0% pass rate (was 42.9%, Δ -42.9%)
      • periodic-ci-openshift-release-main-nightly-4.22-metal-ovn-single-node-recert-cluster-rename — 0.0% pass rate (was 50.0%, Δ -50.0%)

      Both fail in the baremetalds-sno-setup step.

      PRs that landed between the last passing and first failing payload

      42 PRs landed in this window. Potentially relevant ones for SNO bootstrap:

      • cluster-update-keys #97 — "Third run at expanding the default ClusterImagePolicy for openshift component images" (OCPNODE-3978) — could slow bootstrap if signature verification is now required for more images during bootstrap-in-place, where all images are pulled on a single node
      • ironic-image #801 — updated requirements for ironic (bare metal provisioning)
      • baremetal-runtimecfg #386 — ART consistency update

      The RHCOS rebuild itself (9.8.20260227-0 to 9.8.20260303-1) would not appear as a PR in the payload diff.

      Possibly related: vSphere UPI also broken

      A separate but potentially related issue: the vSphere UPI serial job (e2e-vsphere-ovn-upi-serial) also went to 0% in the same payload window, with a bootstrap timeout and "crictl: command not found" errors on master nodes. This may indicate a broader RHCOS regression affecting multiple bootstrap paths.

      Timeout location

      The 4200s timeout is in openshift/assisted-test-infra at src/tests/test_bootstrap_in_place.py in the waiting_for_installation_completion method (timeout_seconds=70 * 60). However, increasing this timeout is unlikely to help given the 100% failure rate — the bootstrap appears fundamentally broken, not just slower.
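      The timeout itself follows a standard poll-until-deadline pattern. A self-contained sketch of that pattern (function and parameter names are hypothetical; only TimeoutExpired, the error message, and the 70 * 60 figure come from this report):

```python
import time

class TimeoutExpired(Exception):
    pass

def wait_until(predicate, timeout_seconds=70 * 60, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or the deadline passes."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if predicate():
            return
        sleep(interval)
    raise TimeoutExpired(
        f"Timeout of {timeout_seconds} seconds expired "
        "waiting for all operators to get up"
    )

# With a predicate that never succeeds (as in the failing runs), the only
# possible outcome is TimeoutExpired; raising the limit just delays it.
```

      This is why bumping timeout_seconds is unlikely to help: a 100% failure rate means the predicate never becomes true, not that it becomes true slightly too late.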

              jpoulin Jeremy Poulin
              openshift-trt-privileged Technical Release Team Openshift
              Tiago Bueno