OpenShift Bugs / OCPBUGS-15255

[aws] Machine got stuck in Provisioning when EC2 instance was Terminated by AWS

    • Type: Bug
    • Priority: Normal
    • Resolution: Unresolved
    • Affects Version/s: 4.14.z
    • Sprint: CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 245, CLOUD Sprint 246, CLOUD Sprint 247, CLOUD Sprint 248, CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253

      Description of problem:

      The Machine got stuck in the Provisioning phase after its EC2 instance was terminated by AWS.
      
      I hit this problem while running a rehearsal cluster in an under-development[1] job[2] for AWS Local Zones. The EC2 instance created through the MachineSet template was launched in the Local Zone us-east-1-qro-1a, but it was terminated right after creation with this message[3] (from the AWS Console):
      ~~~
      Client.VolumeLimitExceeded: Volume limit exceeded. You have exceeded the maximum gp2 storage limit of 87040 GiB in this location. Please contact AWS Support for more information.
      ~~~
      
      When I saw this problem in the Console, I deleted the Machine object, and MAPI was able to create a new instance in the same zone:
      
      ~~~
      $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
      Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
      sh-4.4$ oc get machines -A
      NAMESPACE               NAME                                                     PHASE          TYPE         REGION      ZONE         AGE
      openshift-machine-api   ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx   Provisioning                                         45m
      
      sh-4.4$ oc delete machine ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx -n openshift-machine-api
      machine.machine.openshift.io "ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" deleted
      
      (...)
      $ oc rsh pod/e2e-aws-ovn-shared-vpc-localzones-openshift-e2e-test
      Defaulted container "test" out of: test, sidecar, ci-scheduling-dns-wait (init), place-entrypoint (init), cp-entrypoint-wrapper (init)
      sh-4.4$ oc get machines -n openshift-machine-api -w
      NAME                                                     PHASE         TYPE         REGION      ZONE               AGE
      ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-v675j   Provisioned   c5.2xlarge   us-east-1   us-east-1-qro-1a   2m6s
      ~~~
      
      The job[2] didn't finish successfully due to the timeout while checking for node readiness, but the Machine was provisioned correctly (without Console errors) and stayed in the Running state.
      
      The main problem I can see in the Machine controller logs is an endless loop trying to reconcile a terminated machine/instance (i-0fc8f2e7fe7bba939):
      
      ~~~
      2023-06-20T19:38:01.016776717Z I0620 19:38:01.016760       1 controller.go:156] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling Machine
      2023-06-20T19:38:01.016776717Z I0620 19:38:01.016767       1 actuator.go:108] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator checking if machine exists
      2023-06-20T19:38:01.079829331Z W0620 19:38:01.079800       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.132099118Z E0620 19:38:01.132063       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.132099118Z I0620 19:38:01.132080       1 reconciler.go:296] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Instance does not exist
      2023-06-20T19:38:01.132146892Z I0620 19:38:01.132096       1 controller.go:349] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciling machine triggers idempotent create
      2023-06-20T19:38:01.132146892Z I0620 19:38:01.132101       1 actuator.go:81] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: actuator creating machine
      2023-06-20T19:38:01.132489856Z I0620 19:38:01.132460       1 reconciler.go:41] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: creating machine
      2023-06-20T19:38:01.190935211Z W0620 19:38:01.190901       1 reconciler.go:481] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Failed to find existing instance by id i-0fc8f2e7fe7bba939: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.238693678Z E0620 19:38:01.238661       1 utils.go:236] Excluding instance matching ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: instance i-0fc8f2e7fe7bba939 state "terminated" is not in running, pending, stopped, stopping, shutting-down
      2023-06-20T19:38:01.238693678Z I0620 19:38:01.238680       1 machine_scope.go:90] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: patching machine
      2023-06-20T19:38:01.249796760Z E0620 19:38:01.249761       1 actuator.go:72] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx error: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
      2023-06-20T19:38:01.249824958Z W0620 19:38:01.249796       1 controller.go:351] ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: failed to create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue
      2023-06-20T19:38:01.249858967Z E0620 19:38:01.249847       1 controller.go:324]  "msg"="Reconciler error" "error"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: reconciler failed to Create machine: ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx: Machine was already created, InstanceID is set in providerStatus. Possible eventual-consistency discrepancy; returning an error to requeue" "controller"="machine-controller" "name"="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx" "namespace"="openshift-machine-api" "object"={"name":"ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx","namespace":"openshift-machine-api"} "reconcileID"="8890f9f7-2fbf-441d-a8b7-a52ec5f4ae2f"
      ~~~
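      
      A quick way to confirm the discrepancy the reconciler complains about is to compare the instance ID recorded in the Machine's providerStatus against what EC2 reports for that instance. A minimal sketch, assuming the AWS provider's status layout and using the machine name from this job:
      ~~~
      machine="ci-op-ljs7pd35-f5e2f-hzmkx-edge-us-east-1-qro-1a-frbbx"
      
      # Instance ID persisted in providerStatus (this is what blocks the idempotent create)
      instance_id=$(oc get machine "$machine" -n openshift-machine-api \
          -o jsonpath='{.status.providerStatus.instanceId}')
      
      # Actual state of that instance according to EC2 ("terminated" in this case;
      # terminated instances stay visible in the API for a while after termination)
      aws ec2 describe-instances --instance-ids "$instance_id" \
          --query 'Reservations[].Instances[].State.Name' --output text
      ~~~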
      
      I also reviewed the account quotas for EBS gp2 and we are under the limits. The second Machine was also provisioned, so I would rule out account quotas and focus on capacity issues in the zone; considering a Local Zone does not have as much capacity as a regular Availability Zone, this could happen more frequently.
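      
      For reference, the per-region gp2 storage quota can be read from the Service Quotas API; a sketch that matches the quota by name rather than hard-coding a quota code (note the AWS error says "in this location", so the effective Local Zone limit may not be what this returns):
      ~~~
      # List EBS quotas in the region and filter for the gp2 storage limit
      aws service-quotas list-service-quotas --service-code ebs \
          --query "Quotas[?contains(QuotaName, 'gp2')].[QuotaName,Value]" --output table
      ~~~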
      
      I am asking the AWS teams for an RCA and for clarification on how we can programmatically get this error (maybe through the EC2 API; I did not describe the EC2 instance while the event was happening).
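      
      On the programmatic side, DescribeInstances exposes a StateReason code/message for terminated instances while they remain visible in the API, which may be enough to capture this without the Console; a sketch using the instance ID from this job:
      ~~~
      # StateReason carries codes such as Client.VolumeLimitExceeded or
      # Server.InsufficientInstanceCapacity for AWS-initiated terminations
      aws ec2 describe-instances --instance-ids i-0fc8f2e7fe7bba939 \
          --query 'Reservations[].Instances[].[StateReason.Code,StateReason.Message]' --output text
      ~~~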
      
      
      [1] https://github.com/openshift/release/pull/39902#issuecomment-1599559108
      [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/39902/rehearse-39902-pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones/1671215930459295744
      [3] https://user-images.githubusercontent.com/3216894/247285243-3cd28306-2972-4576-a9a6-a620e01747a6.png

      Version-Release number of selected component (if applicable):

      4.14.0-0.ci.test-2023-06-20-191559-ci-op-ljs7pd35-latest

      How reproducible:

      - Rarely, since it depends on AWS terminating the instance (mainly during zone capacity issues); an RCA has been requested from AWS to check whether we can find a way to reproduce it.

      Steps to Reproduce:

      This is hard to reproduce, as the EC2 instance was terminated by AWS.
      
      I created a script to watch the specific subnet ID and terminate any instance created in it immediately, but in that case the Machine goes to the Failed phase and gets stuck there - not in "Provisioning" as we saw in the CI job.
      
      Steps to try to reproduce:
      1. Create a cluster with Local Zone support: https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-localzone.html
      2. Wait for the cluster to be created
      3. Scale down the MachineSet for the Local Zone
      4. In a new terminal (#2), watch for and terminate any EC2 instance created in the Local Zone subnet (example: us-east-1-bue-1a):
      ~~~
      machineset_monitor="byonet1-sc9fb-edge-us-east-1-bue-1a"
      
      # discover the subnet ID
      subnet_id=$(oc get machineset $machineset_monitor -n openshift-machine-api -o json | jq -r .spec.template.spec.providerSpec.value.subnet.id)
      
      # discover the zone name
      zone_name="$(aws ec2 describe-subnets --subnet-ids $subnet_id --query 'Subnets[].AvailabilityZone' --output text)"
      
      # Discover instance IDs in the subnet and terminate them
      while true; do
          echo "$(date): Getting instance in the zone ${zone_name} / subnet ${subnet_id}..."
          
          instance_ids=$(aws ec2 describe-instances --filters Name=subnet-id,Values=$subnet_id Name=instance-state-name,Values=pending,running,shutting-down,stopping --query 'Reservations[].Instances[].InstanceId' --output text)
          
          echo "$(date): Instances retrieved: $instance_ids"
          if [[ -n "$instance_ids" ]]; then
              echo "Terminating instances..."
              aws ec2 terminate-instances --instance-ids $instance_ids
              sleep 1
          else
              echo "Awaiting..."
              sleep 2
          fi
      done
      ~~~
      
      5. Scale up the MachineSet
      6. Observe the Machines (see the sketch below)
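      
      For step 6, a sketch of what to watch - the phase, plus errorReason/errorMessage, which should be populated if the Machine moves to Failed (the machine name is a placeholder):
      ~~~
      # Watch phases; a Machine stuck in Provisioning never leaves this output
      oc get machines -n openshift-machine-api -w
      
      # Inspect the failure details recorded on a specific Machine, if any
      oc get machine <machine-name> -n openshift-machine-api \
          -o jsonpath='{.status.phase}{"\t"}{.status.errorReason}{"\t"}{.status.errorMessage}{"\n"}'
      ~~~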
      

      Actual results:

      The Machine stays in the Provisioning phase indefinitely while the Machine controller loops on "Machine was already created, InstanceID is set in providerStatus".

      Expected results:

      - The Machine moves to the Failed phase when the EC2 instance is terminated by AWS, or
      - the Machine self-recovers when the EC2 instance is deleted/terminated: if the Machine is managed by a MachineSet, the Machine object could be deleted automatically so that manual steps are not needed (a possible mitigation is sketched below).
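      
      A MachineHealthCheck with nodeStartupTimeout might already approximate the second option by deleting Machines whose node never joins; a hypothetical sketch, assuming the edge role label used by Local Zone MachineSets (untested against this exact scenario):
      ~~~
      # Hypothetical MachineHealthCheck: remediate edge Machines whose node
      # does not become Ready within nodeStartupTimeout
      oc apply -f - <<EOF
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineHealthCheck
      metadata:
        name: edge-localzone-mhc
        namespace: openshift-machine-api
      spec:
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machine-role: edge
        maxUnhealthy: 1
        nodeStartupTimeout: 15m
        unhealthyConditions:
        - type: Ready
          status: "Unknown"
          timeout: 300s
        - type: Ready
          status: "False"
          timeout: 300s
      EOF
      ~~~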

      Additional info:

       

            Assignee: Rachel Ryan (raryan@redhat.com)
            Reporter: Marco Braga (rhn-support-mrbraga)
            QA Contact: Huali Liu