OpenShift Bugs / OCPBUGS-76363

Karpenter teardown test flakes in Hypershift

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: 4.22

      Description of problem:

          Karpenter teardown flaked during cleanup; it appeared to be stuck waiting on EBS volume deletion.

      Version-Release number of selected component (if applicable):

          OpenShift 4.22 / Karpenter 1.8.6

      How reproducible:

          Occasionally

      Steps to Reproduce:

          1. Run e2e-hypershift 
          2. Watch it flake (maybe) 
          3. Be sad 
          

      Actual results:

      Teardown times out waiting for volume cleanup: 
      https://github.com/openshift/aws-karpenter-provider-aws/pull/25#issuecomment-3862775490 

      Expected results:

          Test run succeeds 

      Additional info:

      Failed run was: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_aws-karpenter-provider-aws/25/pull-ci-openshift-aws-karpenter-provider-aws-main-e2e-hypershift/2019853186537361408
      
      Claude thinks AWS just took too long:

        The test timed out during the Teardown phase (line 170 in build-log.txt):

            fixture.go:321: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
                                                   
        The teardown waited 15 minutes for AWS resources to be cleaned up, but 10 resources remained:

        - 8 EBS volumes (attached to Karpenter-provisioned nodes)
        - 2 EBS volumes (from CAPI-provisioned nodes)
        - 1 Network Load Balancer (for ingress)

        Why This is a Flake (Not a Real Bug)

        All functional tests passed - the test only failed during cleanup. Looking at the timeline:

        1. Test completed successfully at ~20:47:50 (line 150)
        2. Teardown started at ~20:50:00 (lines 167-168)
        3. Cluster destroy completed at ~20:50:38 (destroy.log shows successful cleanup)
        4. The 15-minute AWS resource wait timed out some time after that

        The issue is a race condition in the teardown logic in hypershift_main/test/e2e/util/fixture.go:293:

            err := wait.PollUntilContextTimeout(ctx, 5*time.Second, 15*time.Minute, false, func(ctx context.Context) (bool, error) {

        This polls the AWS Resource Groups Tagging API every 5 seconds for up to 15 minutes, checking whether the volumes and load balancers tagged with the cluster have been deleted.

        The Race Condition

        destroy.log shows the infrastructure cleanup did complete successfully, but there is an inherent delay between:

        1. Karpenter/AWS deleting the actual resources (EC2 instances, volumes, NLBs)
        2. The AWS Resource Groups Tagging API reflecting those deletions

        The 3-minute forced-termination timeout for stuck NodeClaims (from karpenter_controller.go:46) worked correctly - the test created nodes with a blocking PDB at line 93, and those should have been forcefully terminated. However, AWS resource deletion is asynchronous:

        - When Karpenter forcefully terminates NodeClaims, it triggers EC2 instance termination
        - EBS volumes attached to those instances are deleted when the instance terminates
        - The AWS Tagging API may not immediately reflect these deletions

        Why It Flaked This Time But Not Others

        Looking at the volumes left behind:
        - 6 volumes tagged with karpenter.sh/nodepool=on-demand
        - 1 volume tagged with karpenter.sh/nodepool=instance-profile-test
        - 2 volumes from CAPI nodes (MachineName=...)
        - 1 NLB for openshift-ingress/router-default

        The test created 3 Karpenter nodes in the final test case (line 94: replicas := 3) with the blocking PDB. These likely took longer than usual to:
        1. Receive the forced-termination signal (after the 3-minute timeout)
        2. Actually terminate in AWS
        3. Have their attached EBS volumes cleaned up
        4. Be reflected in the AWS Tagging API

        The timing suggests:

        - Test ended: ~20:47:50
        - Teardown started checking AWS resources: ~20:50:00
        - 15-minute timeout expired: ~21:05:00
        - But the actual cluster/resources were still being deleted asynchronously

        The Real Issue
        The 15-minute timeout in fixture.go is too aggressive when combined with:

        - the 3-minute NodeClaim forced-deletion timeout
        - AWS async deletion propagation delays
        - multiple nodes (3) with blocking PDBs

        Recommendations to Deflake

        1. Increase the teardown timeout in fixture.go:293 from 15 to 20-25 minutes to account for:
           - NodeClaim forced-deletion timeout (3 min)
           - EC2 instance termination (1-2 min per instance)
           - EBS volume deletion (1-2 min)
           - AWS API propagation delay (varies)
        2. Add retry logic with exponential backoff instead of fixed 5-second polling
        3. Check whether resources are in a "deleting" state rather than just whether they exist - AWS may report them as existing while they are in the process of being deleted
        4. Consider making the test less aggressive - instead of 3 replicas with a blocking PDB, use 1-2 replicas to reduce the cleanup burden

        The test is functionally correct - the forced-termination logic is working (all pods were cleaned up, the cluster was destroyed successfully). The teardown validation is simply racing with AWS's eventual-consistency model.

              joelsmith.redhat Joel Smith
              jkyros@redhat.com John Kyros
              Paul Rozehnal

              Votes: 0
              Watchers: 2