  1. OpenShift Bugs
  2. OCPBUGS-20067

WMCO does not wait for instance to reboot properly

    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.14.0
    • 4.14
    • Windows Containers
    • None
    • Moderate
    • No
    • 0
    • WINC - Sprint 243
    • 1
    • False
    • Previously, WMCO did not correctly detect when a Windows instance had finished rebooting. This led to occasional timing issues in which WMCO tried to interact with a node that was still rebooting, logged an error, and restarted node configuration. WMCO now waits for the instance to reboot fully before continuing, which resolves the issue.
    • Bug Fix

      This is a clone of issue OCPBUGS-17217. The following is the description of the original issue:

      Description of problem:

      WMCO does not correctly detect when a reboot is complete. Currently, it tries to open an SSH connection immediately after issuing the reboot request. If the reboot has not started yet, the instance still accepts SSH connections, so WMCO concludes that the reboot is done and proceeds with node configuration. Shortly afterwards the reboot actually begins, and WMCO errors out, has to re-initialize SSH, and restarts configuration.
      

      Version-Release number of selected component (if applicable):

      4.14 (and earlier versions, down to 4.10)

      How reproducible:

      Always

      Steps to Reproduce:

      1. Use a Windows image that does not already have the Containers feature enabled
      2. Have WMCO try to configure the instance as a node
      3. A timing error appears when the instance restarts after the Containers feature is enabled

      Actual results:

      WMCO checks whether the instance is reachable via SSH too soon and incorrectly assumes the reboot has already completed. Configuration then fails later, because WMCO cannot run PowerShell commands over SSH while the reboot is underway.

      Expected results:

      WMCO should check for reboot completion more thoroughly to avoid false positives.

      Additional info:

      Perhaps waiting for the node to be unreachable first, and then waiting for it to be reachable again could solve this?

      Thread with logs and discussion: https://redhat-internal.slack.com/archives/CM4ERHBJS/p1690925841359849

       

              jvaldes@redhat.com Jose Valdes
              openshift-crt-jira-prow OpenShift Prow Bot
              Aharon Rasouli Aharon Rasouli