OCPBUGS-55168

Investigate issues with servicing retries in Baremetal Operator


      Description of problem: We have previously seen evidence that the servicing retry logic in the Baremetal Operator (BMO) does not work as expected. One example is https://issues.redhat.com/browse/OCPBUGS-48789

      We worked around that issue by preventing the condition that leads to this state, but the retry logic itself should now be investigated so that it works as expected.

          Version-Release number of selected component (if applicable): 4.19
      
          

      How reproducible: often

      Steps to Reproduce:
      1. Introduce a condition likely to cause a servicing issue (e.g. attempt servicing on a two-worker cluster) without applying any workaround (e.g. pre-emptively powering off the node for 5 minutes so that health checks fail and the node is marked down).
      2. Attempt servicing.
      3. Watch for the ICC issue described in OCPBUGS-48789.
          

      Actual results: The node gets stuck in the servicing error state.

      Expected results: BMO should keep retrying. Once the networking machinery realises that the DNS and router pods on the nodes are down, servicing should get past the ICC issue without any workaround applied.
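
      To make the expected behaviour concrete, here is a minimal sketch of a retry loop with capped exponential backoff. It is not BMO's code (BMO is written in Go and is driven by its reconcile loop). The names `retry_servicing` and `TransientError` and all parameters are hypothetical and are used only to illustrate "keep retrying until the transient condition clears".

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def retry_servicing(
    attempt: Callable[[], None],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `attempt` until it succeeds and return the number of attempts made.

    Transient failures are retried with capped exponential backoff.
    With max_attempts=None the loop never gives up on a transient error.
    """
    attempts = 0
    delay = base_delay
    while True:
        attempts += 1
        try:
            attempt()
            return attempts
        except TransientError:
            # Give up only if the caller set an explicit attempt limit.
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

      In this sketch a servicing step that fails while the cluster is still routing to the down node would be retried after increasing delays. It succeeds once the condition clears and is never left in a terminal error state.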
      
          

      Additional info:

      
          

              janders@redhat.com Jacob Anders
              janders@redhat.com Jacob Anders
              Jad Haj Yahya Jad Haj Yahya