-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.16.z
-
Quality / Stability / Reliability
-
False
Description of problem:
During an SDN to OVN limited-live migration, we have seen some clusters get stuck deleting machines, which blocks the entire migration. The machine events show failures talking to STS. Deleting the machine-api-controllers pod, which causes it to be rescheduled, unblocks the machine deletion.
Version-Release number of selected component (if applicable):
How reproducible:
Should be easy to reproduce, presuming you can cause a network failure for the controllers.
Steps to Reproduce:
1. Somehow break network access for the machine-api-controllers pod
2. Delete a machine
3. Check the machine events for failures talking to STS
Actual results:
The machine deletion hangs indefinitely
Expected results:
The failed connection is retried and the machine is deleted once AWS credentials are available
Additional info:
The events look like this:

Events:
  Type     Reason             Age                 From                           Message
  ----     ------             ----                ----                           -------
  Warning  FailedDelete       175m                awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com on 172.30.0.10:53: read udp 10.130.0.47:44112->172.30.0.10:53: i/o timeout
  Normal   DetectedUnhealthy  22m (x24 over 29m)  machinehealthcheck-controller  Machine openshift-machine-api/srep-worker-healthcheck/vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz/ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com has unhealthy node ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com
  Warning  FailedDelete       13m (x28 over 23h)  awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com: i/o timeout