Red Hat OpenStack Services on OpenShift / OSPRH-15883

Sporadic failures during provisioning of DataPlaneNodeSet


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • rhos-18.0.10 FR 3
    • rhos-18.0 FR 2 (Mar 2025)
    • openstack-ironic
    • None
    • 3
    • False
    • None
    • False
    • ?
    • openstack-ironic-21.4.5-18.0.20250519144814.9213ccd.el9ost
    • Impediment
    • rhos-ops-day1day2-hardprov
    • None
    • .Workflow operations persist through interruptions in connectivity

      This update solves an issue in the Bare Metal Provisioning service (ironic) that caused the deployment process to loop and time out because of interruptions in connectivity while the deployment agent was starting. The issue occurred because only one attempt was made to evaluate if a RAM drive was recently booted. When this issue occurred, the bare metal nodes would fail to clean, deploy, or perform other workflow actions.
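      Below is a minimal, illustrative Python sketch of the retry pattern the note above describes. It is not the actual ironic patch; the helper name, the window, and the interval values are assumptions. The idea is to keep probing "was this ramdisk freshly booted?" for a bounded window instead of deciding from a single attempt:

      import time

      FRESHNESS_WINDOW = 30   # seconds; hypothetical value
      PROBE_INTERVAL = 5      # seconds; hypothetical value

      def agent_recently_booted(probe):
          """Return True if any probe within the window reports a fresh agent.

          `probe` is a hypothetical callable that can raise on transient
          connectivity loss (for example, LACP fallback flapping the port).
          """
          deadline = time.monotonic() + FRESHNESS_WINDOW
          while time.monotonic() < deadline:
              try:
                  if probe():
                      return True
              except ConnectionError:
                  pass            # transient loss: keep probing until the deadline
              time.sleep(PROBE_INTERVAL)
          return False            # give up only after the whole window has elapsed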
    • Bug Fix
    • Done
    • This issue is exceptionally difficult to reproduce.

      Ultimately, the root cause is a short-lived, transient loss of connectivity, of the kind that can result from LACP fallback activating on a switch port: after the host boots and the network begins to come online, the port goes offline again for a short period. That window can be long enough for the step-retrieval call in the ironic-conductor process to time out instead of returning valid data. The conductor's retry logic would then retry, but its "is this a fresh/new agent" check would disqualify the newly booted agent, so the conductor never retrieved the agent's list of available steps and the overall deployment flow never progressed.

      Some lab environments have been able to reproduce this more reliably than others, specifically because the ramdisk in those environments starts up about 30 seconds after network connectivity is first established, which lines up with the switch's attempt at link validation and causes the fallback logic to engage. This happens where customers use bonded interfaces and then deploy over those bonded interfaces. A minimal sketch of the failure sequence follows.
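      The sketch below is illustrative Python, not ironic's real code; the class and function names are hypothetical. It models the sequence described above: a transient timeout during step retrieval triggers a retry, but a single-shot "fresh agent?" evaluation disqualifies the agent, so its deploy steps are never fetched.

      class TransientNetwork:
          """Stands in for an agent whose port drops on the first call (LACP fallback)."""
          def __init__(self):
              self.calls = 0

          def get_deploy_steps(self):
              self.calls += 1
              if self.calls == 1:
                  raise TimeoutError("connection timed out while the agent was starting")
              return ["deploy.write_image", "deploy.prepare_instance_boot"]

      def retrieve_steps(agent, looks_freshly_booted):
          # `looks_freshly_booted` stands in for the single-attempt evaluation:
          # once it answers False, the conductor never asks the agent for its
          # step list again, and the deploy loops until the overall timeout.
          for _attempt in (1, 2):
              try:
                  return agent.get_deploy_steps()
              except TimeoutError:
                  if not looks_freshly_booted():
                      return []   # agent disqualified: deployment stalls here
          return []

      # The one-shot evaluation ran during the connectivity gap, so it answered False.
      print(retrieve_steps(TransientNetwork(), looks_freshly_booted=lambda: False))  # -> []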

    • HardProv Sprint 4, HardProv Sprint 6, HardProv Sprint 7, HardProv Sprint 8
    • 4
    • Important

      While provisioning BareMetalHosts, we occasionally encounter situations where, after the "OpenStackDataPlaneNodeSet" resource is created, the BMH boots into RHCOS and successfully sends health checks to the OpenStack control plane.

      Nevertheless, the metal3 operator keeps waiting indefinitely for the BMH to finish provisioning.

       

      NAME                             STATE          CONSUMER            ONLINE   ERROR   AGE
      baremetalhost.metal3.io/srv12d   provisioning   dataplane-nodeset   true             3h16m
      NAME                                                              STATUS   MESSAGE
      openstackbaremetalset.baremetal.openstack.org/dataplane-nodeset   False    OpenStackBaremetalSet BMH provisioning in progress
      NAME                                                                  STATUS   MESSAGE
      openstackdataplanenodeset.dataplane.openstack.org/dataplane-nodeset   False    Setup started
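      One way to dig further is to look at the Ironic node underneath the stuck BareMetalHost. The snippet below is an illustrative openstacksdk sketch, assuming you can reach the Ironic API that metal3 drives; the clouds.yaml entry name "rhoso-ironic" is made up. It prints each node's provision state and last error:

      import openstack

      # Cloud name is hypothetical; point it at the Ironic endpoint metal3 uses.
      conn = openstack.connect(cloud="rhoso-ironic")
      for node in conn.baremetal.nodes(details=True):
          print(node.name, node.provision_state, node.last_error)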
      

       

      jkreger@redhat.com has done some initial debugging and found the following:

      appears that we're hitting a weird edge case which is going to require us to revisit the logic deep inside that interaction, because what appears to be happening, at a high level, is that we get derailed at the worst possible place due to something breaking connectivity-wise. Why, I have no clue, but I suspect it could be a race condition or competing networking on the ramdisk.

      More context: https://redhat-internal.slack.com/archives/C04HGQ5N51N/p1743084026742799

      Two must-gather archives (of separate incidents) are attached to this ticket.

              jasonparoly Jason Paroly
              rh-ee-jhensche Jack Henschel
              rhos-dfg-hardprov
              Votes: 0
              Watchers: 7
