-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.14.z
-
No
-
CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245, CLOUD Sprint 246, CLOUD Sprint 247, CLOUD Sprint 248, CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261
-
20
-
Rejected
-
False
-
Description of problem:
Machine got stuck in the Provisioning phase after the EC2 instance was terminated by AWS. I hit this while running a rehearsal cluster in an under-development[1] job[2] for AWS Local Zones. The EC2 instance created through the MachineSet template was launched in the Local Zone us-east-1-qro-1a, but it was terminated right after creation with this message[3] (from the AWS Console):
~~~
Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information.
~~~
When I saw this problem in the Console, I removed the Machine object and MAPI was able to create a new instance in the same zone:
~~~
$ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
sh-4.4$ oc get machines -A
NAMESPACE               NAME                                                     PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx   Provisioning                          45m
sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api
machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted
(...)
$ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
sh-4.4$ oc get machines -n openshift-machine-api -w
NAME                                                     PHASE         TYPE         REGION      ZONE               AGE
ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j   Provisioned   c5.2xlarge   us-east-1   us-east-1-qro-1a   2m6s
~~~
The job[2] didn't finish successfully due to a timeout while checking for node readiness, but the replacement Machine was provisioned correctly (without Console errors) and stayed in the Running state.
The main problem I can see in the Machine Controller logs is an endless loop trying to reconcile a terminated machine/instance (i-0fc8f2e7fe7bba939):
~~~
2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760 1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine
2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767 1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists
2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080 1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist
2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096 1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create
2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101 1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine
2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460 1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine
2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901 1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661 1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680 1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine
2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761 1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796 1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847 1 controller.go:324] "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f"
~~~
I also reviewed the account quotas for EBS gp2 and we are under the limits. The second machine was also provisioned, so I would rule out account quota issues and focus on capacity issues in the zone - since Local Zones do not have as much capacity as regular zones, this could happen more frequently. I am asking the AWS team for an RCA and for clarification on how we can programmatically detect this error (maybe via the EC2 API; I didn't describe the EC2 instance while the event was happening).
[1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744
[3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png
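Regarding programmatic detection: the EC2 DescribeInstances response carries a `StateReason` field that should surface the termination cause after the fact. A minimal sketch of parsing it with jq; the embedded payload is a hypothetical sample mirroring the Console message (on a live account you would pipe `aws ec2 describe-instances --instance-ids <id>` into the function instead):

```shell
#!/usr/bin/env bash
# Extract instance state and termination reason from DescribeInstances JSON.
# Live usage (assumes AWS CLI credentials and a real instance ID):
#   aws ec2 describe-instances --instance-ids i-0fc8f2e7fe7bba939 | state_reason
state_reason() {
  jq -r '.Reservations[].Instances[]
    | "\(.State.Name) \(.StateReason.Code // "n/a"): \(.StateReason.Message // "n/a")"'
}

# Hypothetical sample payload, trimmed to the fields of interest.
sample='{"Reservations":[{"Instances":[{
  "InstanceId":"i-0fc8f2e7fe7bba939",
  "State":{"Name":"terminated"},
  "StateReason":{"Code":"Client.VolumeLimitExceeded",
    "Message":"Client.VolumeLimitExceeded: Volume limit exceeded."}}]}]}'

echo "$sample" | state_reason
```

If the actuator queried `StateReason` when it finds the instance terminated, it could distinguish a capacity-driven termination from other failures.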
Version-Release number of selected component (if applicable):
4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest
How reproducible:
- Rarely; triggered by AWS (mainly by zone capacity issues - an RCA has been requested from AWS to check whether we can find a way to reproduce)
Steps to Reproduce:
This is hard to reproduce because the EC2 instance was terminated by AWS itself. I created a script to watch the specific subnet ID and immediately terminate any instance created on it, but in that case the Machine goes to the Failed phase and gets stuck there - not in "Provisioning" as we got in the CI job.

Steps to try to reproduce:
1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html
2. Wait for the cluster to be created
3. Scale down the MachineSet for the Local Zone
4. In a new terminal (#2), watch for and terminate any EC2 instance created in the Local Zone subnet (example: us-east-1-bue-1a):
~~~
machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a"

# Discover the subnet ID
subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id)

# Discover the zone name
zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)"

# Discover instance IDs in the subnet and terminate them
while true; do
  echo "$(date): Getting instances in the zone ${zone_name} / subnet ${subnet_id}..."
  instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text)
  echo "$(date): Instances retrieved: $instance_ids"
  if [[ -n "$instance_ids" ]]; then
    echo "Terminating instances..."
    aws ec2 terminate-instances --instance-ids $instance_ids
    sleep 1
  else
    echo "Awaiting..."
    sleep 2
  fi
done
~~~
5. Scale up the MachineSet
6. Observe the Machines
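While observing the Machines, one way to confirm the stuck Machine still carries the terminated instance ID is to read `providerStatus.instanceId` from the Machine object. A sketch using jq; the embedded object is a hypothetical sample trimmed to the fields of interest (on a live cluster you would pipe `oc get machine <name> -n openshift-machine-api -o json` into the function instead):

```shell
#!/usr/bin/env bash
# Print the phase and the instance ID recorded in providerStatus for a Machine.
# Live usage: oc get machine <name> -n openshift-machine-api -o json | machine_instance
machine_instance() {
  jq -r '"\(.status.phase) \(.status.providerStatus.instanceId // "unset")"'
}

# Hypothetical Machine object matching the stuck state described above.
sample='{"status":{"phase":"Provisioning",
  "providerStatus":{"instanceId":"i-0fc8f2e7fe7bba939"}}}'

echo "$sample" | machine_instance
```

A Machine reporting `Provisioning` while its recorded instance ID points at a terminated EC2 instance is the loop described in this bug.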
Actual results:
- Machine stuck in the Provisioning phase; the Machine controller endlessly requeues while trying to reconcile the terminated instance
Expected results:
- Machine moves to the Failed phase when the EC2 instance is terminated by AWS, or
- the Machine self-recovers when the EC2 instance is deleted/terminated - for example by deleting the Machine object when it is managed by a MachineSet - so manual steps are not required
Additional info:
- is documented by
-
SPLAT-1096 [aws][bug]OCPBUGS-15255: Track MAPI provisioning issues in AWS Local Zone jobs
- Closed
- links to
-
RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update