Type: Bug
Resolution: Unresolved
Priority: Critical
Affects Version: 4.22
Description of problem
SNO (Single Node OpenShift) bootstrap-in-place has been permafailing since the RHCOS update from 9.8.20260227-0 to 9.8.20260303-1. The bootstrap process times out waiting for cluster operators to converge. Two CI jobs are affected with 0% pass rate and 9+ consecutive failures since March 4, 2026.
The bootstrap previously completed in ~3440s (within the 4200s timeout) but now consistently exceeds it. Since the failure is 100% (not intermittent), this appears to be a hard regression rather than a timing margin issue — suggesting something fundamentally changed in the bootstrap flow.
Version-Release number
4.22 — specifically nightlies using RHCOS 9.8.20260303-1
How reproducible
Always — 0% pass rate, 9+ consecutive failures on both affected jobs since March 4.
Steps to Reproduce
1. Build or use a 4.22 nightly payload containing RHCOS 9.8.20260303-1 (any nightly from 4.22.0-0.nightly-2026-03-04-024042 onward)
2. Deploy SNO using bootstrap-in-place on bare metal
3. Wait for bootstrap to complete
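To check whether a given nightly falls in the affected range, the build timestamp embedded in the payload name can be compared against the first failing payload. This is an illustrative sketch; `nightly_timestamp` and `likely_affected` are hypothetical helpers, not part of any OpenShift tooling.

```python
# Sketch: decide whether a 4.22 nightly payload was built at or after the
# first failing payload (4.22.0-0.nightly-2026-03-04-024042) by parsing the
# timestamp embedded in the payload name. Helper names are illustrative.
import re
from datetime import datetime

FIRST_FAILING = "4.22.0-0.nightly-2026-03-04-024042"

def nightly_timestamp(payload: str) -> datetime:
    """Extract the YYYY-MM-DD-HHMMSS build timestamp from a nightly payload name."""
    m = re.search(r"nightly-(\d{4}-\d{2}-\d{2}-\d{6})$", payload)
    if not m:
        raise ValueError(f"not a nightly payload name: {payload}")
    return datetime.strptime(m.group(1), "%Y-%m-%d-%H%M%S")

def likely_affected(payload: str) -> bool:
    """True if the payload was built at or after the first failing nightly."""
    return nightly_timestamp(payload) >= nightly_timestamp(FIRST_FAILING)
```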
Actual results
Bootstrap times out after 4200 seconds (70 minutes). All cluster operators remain in a "not available" state. The API server at 192.168.127.10:6443 is unreachable. Must-gather fails because the cluster never becomes accessible.
Primary failure message:
TimeoutExpired: Timeout of 4200 seconds expired waiting for all operators to get up
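The "API server unreachable" symptom can be confirmed independently of the harness with a plain TCP probe against the endpoint. `api_reachable` below is a hypothetical helper for illustration, not what the CI step actually runs.

```python
# Sketch: a minimal TCP reachability probe for the SNO API endpoint
# (192.168.127.10:6443 in the failing runs). Illustrative helper, not
# part of assisted-test-infra.
import socket

def api_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```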
Secondary cascading failure (~30% of runs): the previous timed-out run leaves the bare metal node in a dirty state, causing the next run to fail immediately with:
mkfs.xfs: cannot open /dev/nvme0n1: Device or resource busy
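One way to diagnose the "Device or resource busy" state is to list the kernel holders of the block device via sysfs; a leftover device-mapper or md target from the timed-out run would show up there. This is a sketch with a hypothetical `device_holders` helper; the `sysfs_root` parameter exists only so the logic can be exercised off-host — on the node you would call `device_holders("nvme0n1")`.

```python
# Sketch: list the kernel holders (dm-*, md*, ...) that keep a block device
# busy, by reading /sys/block/<dev>/holders. Illustrative helper; sysfs_root
# is parameterized only for off-host testing.
import os

def device_holders(dev: str, sysfs_root: str = "/sys/block") -> list[str]:
    """Return the names of devices holding `dev` open, or [] if none/unknown."""
    holders_dir = os.path.join(sysfs_root, dev, "holders")
    if not os.path.isdir(holders_dir):
        return []
    return sorted(os.listdir(holders_dir))
```

A non-empty result on the dirty node would explain why mkfs.xfs cannot open the device; tearing those holders down before reuse is a separate cleanup step.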
Expected results
Bootstrap-in-place should complete successfully and all cluster operators should converge within the timeout period, as they did with RHCOS 9.8.20260227-0.
Additional info
Bisection
| Detail | Value |
|---|---|
| Last passing payload | 4.22.0-0.nightly-2026-03-03-150411 |
| Last passing RHCOS | 9.8.20260227-0 |
| First failing payload | 4.22.0-0.nightly-2026-03-04-024042 |
| First failing RHCOS | 9.8.20260303-1 |
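The payload-level bisection summarized in the table follows the usual first-failure search over an ordered payload list. A generic sketch, where `is_failing` stands in for actually running the CI job against a payload (it is just a predicate here, not real tooling):

```python
# Sketch: binary search for the first failing payload in an ordered list,
# assuming failures are monotonic (every payload after the regression fails).
# `is_failing` is a stand-in predicate, not real CI tooling.
from typing import Callable, Optional, Sequence

def first_failing(payloads: Sequence[str],
                  is_failing: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest failing payload, or None if none fail."""
    lo, hi, answer = 0, len(payloads) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_failing(payloads[mid]):
            answer = payloads[mid]   # candidate; look earlier
            hi = mid - 1
        else:
            lo = mid + 1             # regression is later
    return answer
```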
Cross-version comparison
The equivalent 4.21 job (e2e-metal-ovn-single-node-live-iso) passes at ~85% during the same period, confirming this is a 4.22-specific regression.
Affected CI jobs
- periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-single-node-live-iso — 0.0% pass rate (was 42.9%, Δ -42.9%)
- periodic-ci-openshift-release-main-nightly-4.22-metal-ovn-single-node-recert-cluster-rename — 0.0% pass rate (was 50.0%, Δ -50.0%)
Both fail in the baremetalds-sno-setup step.
PRs that landed between the last passing and first failing payload
42 PRs landed in this window. Potentially relevant ones for SNO bootstrap:
- cluster-update-keys #97 — "Third run at expanding the default ClusterImagePolicy for openshift component images" (OCPNODE-3978) — could slow bootstrap if signature verification is now required for more images during bootstrap-in-place, where all images are pulled on a single node
- ironic-image #801 — updated requirements for ironic (bare metal provisioning)
- baremetal-runtimecfg #386 — ART consistency update
The RHCOS rebuild itself (9.8.20260227-0 to 9.8.20260303-1) would not appear as a PR in the payload diff.
Possibly related: vSphere UPI also broken
A separate but potentially related issue: the vSphere UPI serial job (e2e-vsphere-ovn-upi-serial) also dropped to 0% in the same payload window, failing with a bootstrap timeout and "crictl: command not found" on master nodes. This may indicate a broader RHCOS regression affecting multiple bootstrap paths.
Timeout location
The 4200s timeout is in openshift/assisted-test-infra at src/tests/test_bootstrap_in_place.py in the waiting_for_installation_completion method (timeout_seconds=70 * 60). However, increasing this timeout is unlikely to help given the 100% failure rate — the bootstrap appears fundamentally broken, not just slower.
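For reference, the wait-and-timeout shape described above can be sketched as a generic poll loop. This is an illustrative rewrite of that pattern, not the actual assisted-test-infra code; the condition, interval, and exception name are assumptions (only the 70 * 60 second timeout and the timeout message come from the report).

```python
# Sketch: poll a condition until it holds or a deadline passes, mirroring
# the timeout_seconds=70 * 60 wait in waiting_for_installation_completion.
# Illustrative only; not the actual assisted-test-infra implementation.
import time
from typing import Callable

class TimeoutExpired(Exception):
    pass

def wait_for(condition: Callable[[], bool],
             timeout_seconds: float = 70 * 60,
             interval: float = 30.0) -> None:
    """Raise TimeoutExpired if condition() does not become True in time."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if condition():
            return
        # Sleep until the next poll, but never past the deadline.
        time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
    raise TimeoutExpired(
        f"Timeout of {timeout_seconds:.0f} seconds expired "
        "waiting for all operators to get up"
    )
```

With a 100% failure rate, raising `timeout_seconds` only delays this exception; the condition itself never becomes true.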