OCPBUGS-55634

machine-api-controllers hangs when failing to hit sts.amazonaws.com

    • Quality / Stability / Reliability

      Description of problem:

      During an SDN-to-OVN limited live migration, we have seen some clusters get stuck deleting machines, which blocks the entire migration. The machine events show failures reaching AWS STS (sts.amazonaws.com). Deleting the machine-api-controllers pod, so that it is rescheduled, unblocks the machine deletion.

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Should be easy to reproduce, provided you can cause a network failure for the machine-api-controllers pod.

      Steps to Reproduce:

          1. Break network access for the machine-api-controllers pod (the events below show DNS lookups for sts.amazonaws.com timing out)
          2. Delete a machine
          3. Check the machine events; they show failures talking to STS

      Actual results:

      The machine deletion hangs indefinitely.

      Expected results:

      The failed connection is retried and the machine is deleted once AWS credentials are available    

      Additional info:

      The events look like this:

      Events:
        Type     Reason             Age                 From                           Message
        ----     ------             ----                ----                           -------
        Warning  FailedDelete       175m                awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
                                                                                       caused by: RequestError: send request failed
                                                                                       caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com on 172.30.0.10:53: read udp 10.130.0.47:44112->172.30.0.10:53: i/o timeout
        Normal   DetectedUnhealthy  22m (x24 over 29m)  machinehealthcheck-controller  Machine openshift-machine-api/srep-worker-healthcheck/vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz/ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com has unhealthy node ip-10-114-20-168.vs-upgrd-mgmt-si.aws.delta.com
        Warning  FailedDelete       13m (x28 over 23h)  awscontroller                  vs-upgr-mgmt-1s-mz446-workers-m6i-xlarge-us-east-1c-nkdhz: reconciler failed to Delete machine: WebIdentityErr: failed to retrieve credentials
                                                                                       caused by: RequestError: send request failed
                                                                                       caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com: i/o timeout

              rh-ee-tbarberb Theo Barber-Bany
              jbranham.openshift Josh Branham
              Huali Liu Huali Liu