Bug
Resolution: Unresolved
Minor
4.22
Description of problem:
The Karpenter e2e job flaked while cleaning up; teardown appeared to be stuck waiting on EBS volumes.
Version-Release number of selected component (if applicable):
OpenShift 4.22 / Karpenter 1.8.6
How reproducible:
Occasionally
Steps to Reproduce:
1. Run e2e-hypershift
2. Watch it flake (maybe)
3. Be sad
Actual results:
Teardown times out waiting for volume cleanup: https://github.com/openshift/aws-karpenter-provider-aws/pull/25#issuecomment-3862775490
Expected results:
Test run succeeds
Additional info:
Failed run was: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_aws-karpenter-provider-aws/25/pull-ci-openshift-aws-karpenter-provider-aws-main-e2e-hypershift/2019853186537361408
Claude thinks AWS just took too long: The test timed out during the Teardown phase (line 170 in build-log.txt):
fixture.go:321: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
The teardown waited 15 minutes for AWS resources to be cleaned up, but 10 resources remained:
- 8 EBS volumes (attached to Karpenter-provisioned nodes)
- 2 EBS volumes (from CAPI-provisioned nodes)
- 1 Network Load Balancer (for ingress)

Why This Is a Flake (Not a Real Bug)

All functional tests passed; the test only failed during cleanup. Looking at the timeline:
1. Test completed successfully at ~20:47:50 (line 150)
2. Teardown started at ~20:50:00 (line 167-168)
3. Cluster destroy completed at ~20:50:38 (destroy.log shows successful cleanup)
4. The 15-minute AWS resource wait timed out at some point after that

The issue is a race condition in the teardown logic in hypershift_main/test/e2e/util/fixture.go:293:

    err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 15*time.Minute, false, func(ctx context.Context) (bool, error) {

This polls the AWS Resource Groups Tagging API every 5 seconds for up to 15 minutes, checking whether the volumes and load balancers tagged with the cluster have been deleted.

The Race Condition

From destroy.log, the infrastructure cleanup did complete successfully, but there is an inherent delay between:
1. Karpenter/AWS deleting the actual resources (EC2 instances, volumes, NLBs)
2. The AWS Resource Groups Tagging API reflecting those deletions

The 3-minute forced termination timeout (from karpenter_controller.go:46) for stuck NodeClaims worked correctly: the test created nodes with a blocking PDB at line 93, and those should have been forcefully terminated. However, AWS resource deletion is asynchronous:
- When Karpenter forcefully terminates NodeClaims, it triggers EC2 instance termination
- EBS volumes attached to those instances are deleted when the instance terminates
- However, the AWS Tagging API may not immediately reflect these deletions

Why It Flaked This Time But Not Others

Looking at the volumes left behind:
- 6 volumes tagged with karpenter.sh/nodepool=on-demand
- 1 volume tagged with karpenter.sh/nodepool=instance-profile-test
- 2 volumes from CAPI nodes (MachineName=...)
- 1 NLB for openshift-ingress/router-default

The test created 3 Karpenter nodes in the final test case (line 94: replicas := 3) with the blocking PDB. These likely took longer than usual to:
1. Receive the forced termination signal (after 3-minute timeout)
2. Actually terminate in AWS
3. Have their attached EBS volumes cleaned up
4. Be reflected in the AWS Tagging API

The timing suggests:
- Test ended: ~20:47:50
- Teardown started checking AWS resources: ~20:50:00
- 15-minute timeout expired: ~21:05:00
- But the actual cluster/resources were still being deleted asynchronously

The Real Issue

The 15-minute timeout in fixture.go is too aggressive when combined with:
- 3-minute NodeClaim forced deletion timeout
- AWS async deletion propagation delays
- Multiple nodes (3) with blocking PDBs

Recommendations to Deflake

1. Increase the teardown timeout in fixture.go:293 from 15 to 20-25 minutes to account for:
- NodeClaim forced deletion timeout (3 min)
- EC2 instance termination (1-2 min per instance)
- EBS volume deletion (1-2 min)
- AWS API propagation delay (varies)
2. Add retry logic with exponential backoff instead of fixed 5-second polling
3. Check if resources are in "deleting" state rather than just "exists" - AWS may report them as existing but in the process of deletion
4. Consider making the test less aggressive - instead of 3 replicas with a blocking PDB, use 1-2 replicas to reduce the cleanup burden

The test is functionally correct - the forced termination logic worked (all pods were cleaned up, and the cluster was destroyed successfully). The teardown validation is simply racing with AWS's eventual consistency model.