OpenShift Bugs / OCPBUGS-15255

[aws] Machine got stuck in Provisioning when EC2 instance was Terminated by AWS

    • Type: Bug
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: 4.14.z
    • Sprint: CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245, CLOUD Sprint 246, CLOUD Sprint 247, CLOUD Sprint 248, CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253

      Description of problem:

      The Machine got stuck in the Provisioning phase after its EC2 instance was terminated by AWS.
      
      I hit this problem while running a rehearsal cluster in an under-development[1] job[2] for AWS Local Zones. The EC2 instance created through the MachineSet template was launched in the Local Zone us-east-1-qro-1a, but it was terminated right after creation with this message[3] (from the AWS Console):
      ~~~
      Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information.
      ~~~
      
      When I saw this problem in the Console, I deleted the Machine object, and MAPI was able to create a new instance in the same zone:
      
      ~~~
      $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
      Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
      sh-4.4$ oc get machines -A
      NAMESPACE               NAME                                                     PHASE          TYPE         REGION      ZONE         AGE
      openshift-machine-api   ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx   Provisioning                                         45m
      
      sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api
      machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted
      
      (...)
      $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
      Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
      sh-4.4$ oc get machines -n openshift-machine-api -w
      NAME                                                     PHASE         TYPE         REGION      ZONE               AGE
      ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j   Provisioned   c5.2xlarge   us-east-1   us-east-1-qro-1a   2m6s
      ~~~
      
      The job[2] didn't finish successfully due to the timeout while checking for node readiness, but the Machine was provisioned correctly (without Console errors) and stayed in the Running state.
      
      The main problem I can see in the Machine controller logs is an endless loop trying to reconcile a terminated machine/instance (i-0fc8f2e7fe7bba939):
      
      ~~~
      2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760       1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine
      2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767       1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists
      2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080       1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist
      2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096       1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create
      2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101       1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine
      2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460       1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine
      2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680       1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine
      2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761       1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
      2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796       1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
      2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847       1 controller.go:324]  "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f"
      ~~~
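      
      A quick way to confirm the discrepancy the reconciler complains about is to compare the instance ID recorded in the Machine's providerStatus against what EC2 reports for that instance. A minimal sketch, assuming the AWS provider's status layout and using the machine name from this job:
      ~~~
      machine="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx"
      
      # Instance ID persisted in providerStatus (this is what blocks the idempotent create)
      instance_id=$(oc get machine "$machine" -n openshift-machine-api \
          -o jsonpath='{.status.providerStatus.instanceId}')
      
      # Actual state of that instance according to EC2 ("terminated" in this case;
      # terminated instances stay visible in the API for a while after termination)
      aws ec2 describe-instances --instance-ids "$instance_id" \
          --query 'Reservations[].Instances[].State.Name' --output text
      ~~~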
      
      I also reviewed the account quotas for EBS gp2 and we are under the limits. The second Machine was also provisioned, so I would rule out account quotas and focus on capacity issues in the zone; considering a Local Zone does not have as much capacity as a regular Availability Zone, this could happen more frequently.
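      
      For reference, the per-region gp2 storage quota can be read from the Service Quotas API; a sketch that matches the quota by name rather than hard-coding a quota code (note the AWS error says "in this location", so the effective Local Zone limit may not be what this returns):
      ~~~
      # List EBS quotas in the region and filter for the gp2 storage limit
      aws service-quotas list-service-quotas --service-code ebs \
          --query "Quotas[?contains(QuotaName, 'gp2')].[QuotaName,Value]" --output table
      ~~~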
      
      I am asking the AWS teams for an RCA and for clarification on how we can programmatically get this error (maybe through the EC2 API; I did not describe the EC2 instance while the event was happening).
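      
      On the programmatic side, DescribeInstances exposes a StateReason code/message for terminated instances while they remain visible in the API, which may be enough to capture this without the Console; a sketch using the instance ID from this job:
      ~~~
      # StateReason carries codes such as Client.VolumeLimitExceeded or
      # Server.InsufficientInstanceCapacity for AWS-initiated terminations
      aws ec2 describe-instances --instance-ids i-0fc8f2e7fe7bba939 \
          --query 'Reservations[].Instances[].[StateReason.Code,StateReason.Message]' --output text
      ~~~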
      
      
      [1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108
      [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744
      [3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png

      Version-Release number of selected component (if applicable):

      4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest

      How reproducible:

      - Rarely, since it depends on AWS terminating the instance (mainly during zone capacity issues); an RCA has been requested from AWS to check whether we can find a way to reproduce it.

      Steps to Reproduce:

      This is hard to reproduce, as the EC2 instance was terminated by AWS.
      
      I created a script to watch the specific subnet ID and terminate any instance created in it immediately, but in that case the Machine goes to the Failed phase and gets stuck there - not in "Provisioning" as we saw in the CI job.
      
      Steps to try to reproduce:
      1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html
      2. Wait for the cluster to be created
      3. Scale down the MachineSet for the Local Zone
      4. In a new terminal (#2), watch for and terminate any EC2 instance created in the Local Zone subnet (example: us-east-1-bue-1a):
      ~~~
      machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a"
      
      # discover the subnet ID
      subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id)
      
      # discover the zone name
      zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)"
      
      # Discover instance IDs in the subnet and terminate them
      while true; do
          echo "$(date): Getting instance in the zone ${zone_name} / subnet ${subnet_id}..."
          
          instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text)
          
          echo "$(date): Instances retrieved: $instance_ids"
          if [[ -n "$instance_ids" ]]; then
              echo "Terminating instances..."
              aws ec2 terminate-instances --instance-ids $instance_ids
              sleep 1
          else
              echo "Awaiting..."
              sleep 2
          fi
      done
      ~~~
      
      5. Scale up the MachineSet
      6. Observe the Machines (see the sketch below)
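      
      For step 6, a sketch of what to watch - the phase, plus errorReason/errorMessage, which should be populated if the Machine moves to Failed (the machine name is a placeholder):
      ~~~
      # Watch phases; a Machine stuck in Provisioning never leaves this output
      oc get machines -n openshift-machine-api -w
      
      # Inspect the failure details recorded on a specific Machine, if any
      oc get machine <machine-name> -n openshift-machine-api \
          -o jsonpath='{.status.phase}{"\t"}{.status.errorReason}{"\t"}{.status.errorMessage}{"\n"}'
      ~~~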
      

      Actual results:

      The Machine stays in the Provisioning phase indefinitely while the Machine controller loops on "Machine was already created, InstanceID is set in providerStatus".

      Expected results:

      - The Machine moves to the Failed phase when the EC2 instance is terminated by AWS, or
      - the Machine self-recovers when the EC2 instance is deleted/terminated: if the Machine is managed by a MachineSet, the Machine object could be deleted automatically so that manual steps are not needed (a possible mitigation is sketched below).
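      
      A MachineHealthCheck with nodeStartupTimeout might already approximate the second option by deleting Machines whose node never joins; a hypothetical sketch, assuming the edge role label used by Local Zone MachineSets (untested against this exact scenario):
      ~~~
      # Hypothetical MachineHealthCheck: remediate edge Machines whose node
      # does not become Ready within nodeStartupTimeout
      oc apply -f - <<EOF
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineHealthCheck
      metadata:
        name: edge-localzone-mhc
        namespace: openshift-machine-api
      spec:
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machine-role: edge
        maxUnhealthy: 1
        nodeStartupTimeout: 15m
        unhealthyConditions:
        - type: Ready
          status: "Unknown"
          timeout: 300s
        - type: Ready
          status: "False"
          timeout: 300s
      EOF
      ~~~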

      Additional info:

       

            Assignee: Rachel Ryan (raryan@redhat.com)
            Reporter: Marco Braga (rhn-support-mrbraga)
            QA Contact: Huali Liu