Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.0
Affects Version/s: 4.14
Component/s: Windows Containers
Labels:
None

Severity:
Moderate
Regression:
No
Story Points:
3
Sprint:
WINC - Sprint 242, WINC - Sprint 243
sprint_count:
2
Blocked:
False
Blocked Reason:
None
Release Note Text:

Previously, WMCO did not reliably detect when a Windows instance had finished rebooting. This occasionally caused timing issues in which WMCO tried to interact with a node that was in the middle of a reboot, logged an error, and restarted node configuration. Now, WMCO waits until the instance has fully rebooted before continuing, which resolves the issue.
Release Note Type:
Bug Fix
Target Version:

4.15.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

WMCO does not properly determine when a reboot is complete. Currently, it tries to initialize an SSH connection immediately after issuing the reboot request. This can lead to a timing issue: if the reboot has not started yet, the initial SSH connection is still active, so WMCO concludes that the reboot is done and proceeds with node configuration. Shortly afterwards, the reboot actually gets underway, WMCO errors out, and it has to re-initialize SSH and restart configuration.

Version-Release number of selected component (if applicable):

4.14 and earlier, back to 4.10

How reproducible:

Always

Steps to Reproduce:

1. Use a Windows image that does not already have the Containers feature enabled
2. Have WMCO try to configure the instance as a node
3. A timing error appears when the instance restarts after the Containers feature is turned on

Actual results:

WMCO's check for whether the instance is reachable via SSH runs too soon and incorrectly concludes that the reboot has already completed. This leads to a configuration failure later, because WMCO cannot run PowerShell commands over SSH while the reboot is underway.

Expected results:

WMCO should check for reboot completion more thoroughly, so that a still-open pre-reboot connection is not mistaken for a finished reboot.

Additional info:

Perhaps waiting for the node to become unreachable first, and then waiting for it to become reachable again, could solve this?
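The two-phase wait suggested above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code from the linked pull request: `probe`, `waitUntil`, `waitForReboot`, `scriptedProbe`, the error values, and the demo constants are all hypothetical names introduced here. In a real operator the probe would presumably be something like an SSH dial with a short timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// probe reports whether the instance currently answers.
// Hypothetical stand-in for a real reachability check (e.g. an SSH dial).
type probe func() bool

var (
	errNeverWentDown = errors.New("instance never became unreachable")
	errNeverCameBack = errors.New("instance did not become reachable again")
)

// waitUntil polls check every interval until it returns want.
// It reports false if timeout elapses first.
func waitUntil(check probe, want bool, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() == want {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// waitForReboot treats a reboot as complete only after the instance has
// been observed unreachable and then reachable again.
func waitForReboot(check probe, interval, timeout time.Duration) error {
	// Phase 1: a still-open pre-reboot session must not count as "done",
	// so first wait for the instance to stop answering.
	if !waitUntil(check, false, interval, timeout) {
		return errNeverWentDown
	}
	// Phase 2: only now does a successful probe mean the reboot finished.
	if !waitUntil(check, true, interval, timeout) {
		return errNeverCameBack
	}
	return nil
}

// scriptedProbe replays a fixed, non-empty sequence of results and then
// keeps repeating the last one. Used only to simulate an instance.
func scriptedProbe(states ...bool) probe {
	i := 0
	return func() bool {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s
	}
}

// Short values so the simulation finishes quickly; real values would be
// on the order of seconds and minutes.
const (
	demoInterval = time.Millisecond
	demoTimeout  = 200 * time.Millisecond
)

func main() {
	// Simulated instance: reachable twice (old session still alive),
	// then down twice, then back up.
	check := scriptedProbe(true, true, false, false, true)
	err := waitForReboot(check, demoInterval, demoTimeout)
	fmt.Println("reboot wait finished, err =", err)
}
```

One limitation of this approach: if the instance goes down and comes back between two polls, phase 1 never observes the outage and times out. Comparing the instance's boot time before and after the request would be one way around that, at the cost of needing a working connection to read it.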

Thread with logs and discussion: https://redhat-internal.slack.com/archives/CM4ERHBJS/p1690925841359849

blocks

OCPBUGS-18554 error removing %s HNS network when cleaning up BYOH proxy nodes

Closed

OCPBUGS-20067 WMCO does not wait for instance to reboot properly

Closed

is blocked by

OCPBUGS-19502 Enable proxy removal test in CI

Closed

is cloned by

OCPBUGS-20067 WMCO does not wait for instance to reboot properly

Closed

links to

openshift/windows-machine-config-operator#1770: OCPBUGS-17217: Fix "WMCO does not wait for instance reboots properly"

RHBA-2023:120235 Red Hat OpenShift support for Windows Containers 10.15.0 product release

mentioned on

Merge request - Updated US source to: be1eb32 Merge pull request #1770 from saifshaikh48/reboot-wait


Assignee: Mohammad Shaikh (Inactive)

Reporter: Mohammad Shaikh (Inactive)

QA Contact: Aharon Rasouli

Votes: 0

Watchers: 6

Created: 2023/08/02 5:00 PM

Updated: 2024/02/27 3:17 PM

Resolved: 2024/02/27 3:17 PM
