OpenShift Bugs / OCPBUGS-77945

SNO bootstrap-in-place permafail since RHCOS 9.8.20260303-1 — operators never converge within timeout


      Description of problem

      SNO (Single Node OpenShift) bootstrap-in-place has been permafailing since the RHCOS update from 9.8.20260227-0 to 9.8.20260303-1. The bootstrap process times out waiting for cluster operators to converge. Two CI jobs are affected with 0% pass rate and 9+ consecutive failures since March 4, 2026.

      The bootstrap previously completed in ~3440s (within the 4200s timeout) but now consistently exceeds it. Since the failure is 100% (not intermittent), this appears to be a hard regression rather than a timing margin issue — suggesting something fundamentally changed in the bootstrap flow.

      Version-Release number

      4.22 — specifically nightlies using RHCOS 9.8.20260303-1

      How reproducible

      Always — 0% pass rate, 9+ consecutive failures on both affected jobs since March 4.

      Steps to Reproduce

      Build or use a 4.22 nightly payload containing RHCOS 9.8.20260303-1 (any nightly from 4.22.0-0.nightly-2026-03-04-024042 onward)

      Deploy SNO using bootstrap-in-place on bare metal

      Wait for bootstrap to complete

      Actual results

      Bootstrap times out after 4200 seconds (70 minutes). All cluster operators remain in "not available" state. The API server at 192.168.127.10:6443 is unreachable. Must-gather fails because the cluster never becomes accessible.

      Primary failure message:

      TimeoutExpired: Timeout of 4200 seconds expired waiting for all operators to get up
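      The timeout fires while waiting for every ClusterOperator to report Available=True. A minimal, self-contained sketch of that convergence check (the operator names and condition data below are illustrative; the real check lives in assisted-test-infra and queries the live cluster):

```python
# Sketch: decide whether a set of ClusterOperators has converged.
# The input mirrors the shape of ClusterOperator status conditions
# (as in `oc get clusteroperators -o json`), but the data is made up.

def unavailable_operators(operators):
    """Return names of operators whose Available condition is not True."""
    pending = []
    for op in operators:
        conditions = {c["type"]: c["status"] for c in op["conditions"]}
        if conditions.get("Available") != "True":
            pending.append(op["name"])
    return pending

# In the failing runs every operator stays "not available", so a list
# like this never empties and the 4200 s timeout eventually expires.
sample = [
    {"name": "etcd",
     "conditions": [{"type": "Available", "status": "True"}]},
    {"name": "kube-apiserver",
     "conditions": [{"type": "Available", "status": "False"}]},
]
print(unavailable_operators(sample))  # prints ['kube-apiserver']
```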
      

      Secondary cascading failure (~30% of runs): the previous timed-out run leaves the bare metal node in a dirty state, causing the next run to fail immediately with:

      mkfs.xfs: cannot open /dev/nvme0n1: Device or resource busy
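      The cascading failure could be guarded by a pre-check that refuses to format a device still present in the mount table. A hypothetical sketch (the /proc/mounts line format is standard; the pre-check itself is an assumption, not something the CI step currently does):

```python
def busy_devices(mounts_text, device):
    """Return (source, mountpoint) entries from /proc/mounts-style text
    that use `device` or one of its partitions
    (e.g. /dev/nvme0n1p4 for /dev/nvme0n1)."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith(device):
            hits.append((fields[0], fields[1]))
    return hits

# Illustrative mount table resembling the dirty state a timed-out run
# could leave behind on the bare metal node.
mounts = """\
/dev/nvme0n1p4 /var/lib/containers xfs rw 0 0
tmpfs /run tmpfs rw 0 0
"""
print(busy_devices(mounts, "/dev/nvme0n1"))
# prints [('/dev/nvme0n1p4', '/var/lib/containers')]
```

      A non-empty result would explain the "Device or resource busy" error and signal that the node needs cleanup (unmount/wipe) before the next run.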
      

      Expected results

      Bootstrap-in-place should complete successfully and all cluster operators should converge within the timeout period, as they did with RHCOS 9.8.20260227-0.

      Additional info

      Bisection

      • Last passing payload: 4.22.0-0.nightly-2026-03-03-150411
      • Last passing RHCOS: 9.8.20260227-0
      • First failing payload: 4.22.0-0.nightly-2026-03-04-024042
      • First failing RHCOS: 9.8.20260303-1

      Cross-version comparison

      The equivalent 4.21 job (e2e-metal-ovn-single-node-live-iso) passes at ~85% during the same period, confirming this is a 4.22-specific regression.

      Affected CI jobs

      • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-single-node-live-iso — 0.0% pass rate (was 42.9%, Δ -42.9%)
      • periodic-ci-openshift-release-main-nightly-4.22-metal-ovn-single-node-recert-cluster-rename — 0.0% pass rate (was 50.0%, Δ -50.0%)

      Both fail in the baremetalds-sno-setup step.

      PRs that landed between the last passing and first failing payload

      42 PRs landed in this window. Potentially relevant ones for SNO bootstrap:

      • cluster-update-keys #97 — "Third run at expanding the default ClusterImagePolicy for openshift component images" (OCPNODE-3978) — could slow bootstrap if signature verification is now required for more images during bootstrap-in-place, where all images are pulled on a single node
      • ironic-image #801 — updated requirements for ironic (bare metal provisioning)
      • baremetal-runtimecfg #386 — ART consistency update

      The RHCOS rebuild itself (9.8.20260227-0 to 9.8.20260303-1) would not appear as a PR in the payload diff.

      Possibly related: vSphere UPI also broken

      A separate but potentially related issue: the vSphere UPI serial job (e2e-vsphere-ovn-upi-serial) also went to 0% in the same payload window, with a bootstrap timeout and "crictl: command not found" errors on master nodes. This may indicate a broader RHCOS regression affecting multiple bootstrap paths.

      Timeout location

      The 4200s timeout is in openshift/assisted-test-infra at src/tests/test_bootstrap_in_place.py in the waiting_for_installation_completion method (timeout_seconds=70 * 60). However, increasing this timeout is unlikely to help given the 100% failure rate — the bootstrap appears fundamentally broken, not just slower.
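      The timeout itself follows a standard poll-until-deadline pattern. A self-contained sketch of that pattern (function and parameter names are hypothetical; only TimeoutExpired, the error message, and the 70 * 60 figure come from this report):

```python
import time

class TimeoutExpired(Exception):
    pass

def wait_until(predicate, timeout_seconds=70 * 60, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or the deadline passes."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if predicate():
            return
        sleep(interval)
    raise TimeoutExpired(
        f"Timeout of {timeout_seconds} seconds expired "
        "waiting for all operators to get up"
    )

# With a predicate that never succeeds (as in the failing runs), the only
# possible outcome is TimeoutExpired; raising the limit just delays it.
```

      This is why bumping timeout_seconds is unlikely to help: a 100% failure rate means the predicate never becomes true, not that it becomes true slightly too late.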

              jpoulin Jeremy Poulin
              openshift-trt-privileged Technical Release Team Openshift
              Tiago Bueno