Bug
Resolution: Unresolved
Minor
4.22
Description of problem:
The Karpenter e2e job flaked while cleaning up; teardown appeared to be stuck waiting on EBS volumes.
Version-Release number of selected component (if applicable):
OpenShift 4.22 / Karpenter 1.8.6
How reproducible:
Occasionally
Steps to Reproduce:
1. Run e2e-hypershift
2. Watch it flake (maybe)
3. Be sad
Actual results:
Teardown times out waiting for volume cleanup: https://github.com/openshift/aws-karpenter-provider-aws/pull/25#issuecomment-3862775490
Expected results:
Test run succeeds
Additional info:
Failed run was: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_aws-karpenter-provider-aws/25/pull-ci-openshift-aws-karpenter-provider-aws-main-e2e-hypershift/2019853186537361408
Claude thinks AWS just took too long: The test timed out during the Teardown phase (line 170 in build-log.txt):
fixture.go:321: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
The teardown waited 15 minutes for AWS resources to be cleaned up, but 10 resources remained:
- 8 EBS volumes (attached to Karpenter-provisioned nodes)
- 2 EBS volumes (from CAPI-provisioned nodes)
- 1 Network Load Balancer (for ingress)

Why This Is a Flake (Not a Real Bug)

All functional tests passed; the test only failed during cleanup. Looking at the timeline:
1. Test completed successfully at ~20:47:50 (line 150)
2. Teardown started at ~20:50:00 (line 167-168)
3. Cluster destroy completed at ~20:50:38 (destroy.log shows successful cleanup)
4. The 15-minute AWS resource wait timed out at some point after that

The issue is a race condition in the teardown logic in hypershift_main/test/e2e/util/fixture.go:293:

    err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 15*time.Minute, false, func(ctx context.Context) (bool, error) {

This polls the AWS Resource Groups Tagging API every 5 seconds for up to 15 minutes, checking whether the volumes and load balancers tagged with the cluster have been deleted.

The Race Condition

From destroy.log, the infrastructure cleanup did complete successfully, but there is an inherent delay between:
1. Karpenter/AWS deleting the actual resources (EC2 instances, volumes, NLBs)
2. The AWS Resource Groups Tagging API reflecting those deletions

The 3-minute forced termination timeout (from karpenter_controller.go:46) for stuck NodeClaims worked correctly: the test created nodes with a blocking PDB at line 93, and those should have been forcefully terminated. However, AWS resource deletion is asynchronous:
- When Karpenter forcefully terminates NodeClaims, it triggers EC2 instance termination
- EBS volumes attached to those instances are deleted when the instance terminates
- However, the AWS Tagging API may not immediately reflect these deletions

Why It Flaked This Time But Not Others

Looking at the volumes left behind:
- 6 volumes tagged with karpenter.sh/nodepool=on-demand
- 1 volume tagged with karpenter.sh/nodepool=instance-profile-test
- 2 volumes from CAPI nodes (MachineName=...)
- 1 NLB for openshift-ingress/router-default

The test created 3 Karpenter nodes in the final test case (line 94: replicas := 3) with the blocking PDB. These likely took longer than usual to:
1. Receive the forced termination signal (after 3-minute timeout)
2. Actually terminate in AWS
3. Have their attached EBS volumes cleaned up
4. Be reflected in the AWS Tagging API

The timing suggests:
- Test ended: ~20:47:50
- Teardown started checking AWS resources: ~20:50:00
- 15-minute timeout expired: ~21:05:00
- But the actual cluster/resources were still being deleted asynchronously

The Real Issue

The 15-minute timeout in fixture.go is too aggressive when combined with:
- 3-minute NodeClaim forced deletion timeout
- AWS async deletion propagation delays
- Multiple nodes (3) with blocking PDBs

Recommendations to Deflake

1. Increase the teardown timeout in fixture.go:293 from 15 to 20-25 minutes to account for:
- NodeClaim forced deletion timeout (3 min)
- EC2 instance termination (1-2 min per instance)
- EBS volume deletion (1-2 min)
- AWS API propagation delay (varies)
2. Add retry logic with exponential backoff instead of fixed 5-second polling
3. Check if resources are in "deleting" state rather than just "exists" - AWS may report them as existing but in the process of deletion
4. Consider making the test less aggressive - instead of 3 replicas with a blocking PDB, use 1-2 replicas to reduce the cleanup burden

The test is functionally correct - the forced termination logic worked (all pods were cleaned up, and the cluster was destroyed successfully). The teardown validation is simply racing with AWS's eventual consistency model.