Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.16.z
Component/s: Cloud Compute / Machine API Providers
Labels:
- pmr-ai

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

During an SDN to OVN limited live migration, we have seen some clusters get stuck deleting machines, which blocks the entire migration. The machine events show failures talking to STS. Deleting the machine-api-controllers pod, so that it is rescheduled, unblocks the machine deletion.

Version-Release number of selected component (if applicable):

How reproducible:

This should be easy to reproduce, provided you can cause a network failure for the machine-api-controllers pod.

Steps to Reproduce:

    1. Break network access for the machine-api-controllers pod
    2. Delete a machine
    3. Check the machine events for failures talking to STS

Actual results:

The machine deletion hangs indefinitely.

Expected results:

The failed connection is retried, and the machine is deleted once AWS credentials can be retrieved.

Additional info:

The machine events look like this:

Events:
  Type     Reason             Age                 From                           Message
  ----     ------             ----                ----                           -------
  Warning  FailedDelete       175m                awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com on 172.30.0.10:53: read udp 10.130.0.47:44112->172.30.0.10:53: i/o timeout
  Normal   DetectedUnhealthy  22m (x24 over 29m)  machinehealthcheck-controller  Machine openshift-machine-api/srep-worker-healthcheck/vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz/ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com has unhealthy node ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com
  Warning  FailedDelete       13m (x28 over 23h)  awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com: i/o timeout

Assignee: Theo Barber-Bany

Reporter: Josh Branham

Need Info From: None

Contributors: None

QA Contact: Huali Liu

Doc Contact: None

Votes: 0

Watchers: 5

Created: 2025/05/01 5:55 PM

Updated: 2025/07/13 1:27 PM
