Red Hat OpenStack Services on OpenShift / OSPRH-15883

Sporadic failures during provisioning of DataPlaneNodeSet


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • rhos-18.0.10 FR 3
    • rhos-18.0 FR 2 (Mar 2025)
    • openstack-ironic
    • None
    • 3
    • False
    • None
    • False
    • ?
    • openstack-ironic-21.4.5-18.0.20250519144814.9213ccd.el9ost
    • Impediment
    • rhos-ops-day1day2-hardprov
    • None
    • .Workflow operations persist through interruptions in connectivity

      This update solves an issue in the Bare Metal Provisioning service (ironic) that caused the deployment process to loop and time out because of interruptions in connectivity while the deployment agent was starting. The issue occurred because only one attempt was made to evaluate if a RAM drive was recently booted. When this issue occurred, the bare metal nodes would fail to clean, deploy, or perform other workflow actions.
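      Below is a minimal, illustrative Python sketch of the retry pattern the note above describes. It is not the actual ironic patch; the helper name, the window, and the interval values are assumptions. The idea is to keep probing "was this ramdisk freshly booted?" for a bounded window instead of deciding from a single attempt:

      import time

      FRESHNESS_WINDOW = 30   # seconds; hypothetical value
      PROBE_INTERVAL = 5      # seconds; hypothetical value

      def agent_recently_booted(probe):
          """Return True if any probe within the window reports a fresh agent.

          `probe` is a hypothetical callable that can raise on transient
          connectivity loss (for example, LACP fallback flapping the port).
          """
          deadline = time.monotonic() + FRESHNESS_WINDOW
          while time.monotonic() < deadline:
              try:
                  if probe():
                      return True
              except ConnectionError:
                  pass            # transient loss: keep probing until the deadline
              time.sleep(PROBE_INTERVAL)
          return False            # give up only after the whole window has elapsed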
    • Bug Fix
    • Done
    • This issue is exceptionally difficult to reproduce.

      Ultimately, the root cause is a short-lived, transient loss of connectivity, of the kind that can result from LACP fallback activating on a switch port: after the host boots and the network begins to come online, the port goes offline again for a short period. That window can be long enough for the step-retrieval call in the ironic-conductor process to time out instead of returning valid data. The conductor's retry logic would then retry, but its "is this a fresh/new agent" check would disqualify the newly booted agent, so the conductor never retrieved the agent's list of available steps and the overall deployment flow never progressed.

      Some lab environments have been able to reproduce this more reliably than others, specifically because the ramdisk in those environments starts up about 30 seconds after network connectivity is first established, which lines up with the switch's attempt at link validation and causes the fallback logic to engage. This happens where customers use bonded interfaces and then deploy over those bonded interfaces. A minimal sketch of the failure sequence follows.
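      The sketch below is illustrative Python, not ironic's real code; the class and function names are hypothetical. It models the sequence described above: a transient timeout during step retrieval triggers a retry, but a single-shot "fresh agent?" evaluation disqualifies the agent, so its deploy steps are never fetched.

      class TransientNetwork:
          """Stands in for an agent whose port drops on the first call (LACP fallback)."""
          def __init__(self):
              self.calls = 0

          def get_deploy_steps(self):
              self.calls += 1
              if self.calls == 1:
                  raise TimeoutError("connection timed out while the agent was starting")
              return ["deploy.write_image", "deploy.prepare_instance_boot"]

      def retrieve_steps(agent, looks_freshly_booted):
          # `looks_freshly_booted` stands in for the single-attempt evaluation:
          # once it answers False, the conductor never asks the agent for its
          # step list again, and the deploy loops until the overall timeout.
          for _attempt in (1, 2):
              try:
                  return agent.get_deploy_steps()
              except TimeoutError:
                  if not looks_freshly_booted():
                      return []   # agent disqualified: deployment stalls here
          return []

      # The one-shot evaluation ran during the connectivity gap, so it answered False.
      print(retrieve_steps(TransientNetwork(), looks_freshly_booted=lambda: False))  # -> []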

    • HardProv Sprint 4, HardProv Sprint 6, HardProv Sprint 7, HardProv Sprint 8
    • 4
    • Important

      While provisioning BareMetalHosts, we occasionally encounter situations where, after the "OpenStackDataPlaneNodeSet" resource is created, the BMH boots into RHCOS and successfully sends health checks to the OpenStack control plane.

      Nevertheless, the metal3 operator keeps waiting indefinitely for the BMH to finish provisioning.

       

      NAME                             STATE          CONSUMER            ONLINE   ERROR   AGE
      baremetalhost.metal3.io/srv12d   provisioning   dataplane-nodeset   true             3h16m
      NAME                                                              STATUS   MESSAGE
      openstackbaremetalset.baremetal.openstack.org/dataplane-nodeset   False    OpenStackBaremetalSet BMH provisioning in progress
      NAME                                                                  STATUS   MESSAGE
      openstackdataplanenodeset.dataplane.openstack.org/dataplane-nodeset   False    Setup started
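      One way to dig further is to look at the Ironic node underneath the stuck BareMetalHost. The snippet below is an illustrative openstacksdk sketch, assuming you can reach the Ironic API that metal3 drives; the clouds.yaml entry name "rhoso-ironic" is made up. It prints each node's provision state and last error:

      import openstack

      # Cloud name is hypothetical; point it at the Ironic endpoint metal3 uses.
      conn = openstack.connect(cloud="rhoso-ironic")
      for node in conn.baremetal.nodes(details=True):
          print(node.name, node.provision_state, node.last_error)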
      

       

      jkreger@redhat.com has done some initial debugging and found the following:

      appears that we're hitting a weird edge case which is going to require us to revisit the logic deep inside that interaction, because what appears to be happening, at a high level, is that we get derailed at the worst possible place due to something breaking connectivity-wise. Why, I have no clue, but I suspect it could be a race condition or competing networking on the ramdisk.

      More context: https://redhat-internal.slack.com/archives/C04HGQ5N51N/p1743084026742799

      Two must-gather archives (of separate incidents) are attached to this ticket.

              jasonparoly Jason Paroly
              rh-ee-jhensche Jack Henschel
              rhos-dfg-hardprov
              Votes: 0
              Watchers: 7
