OpenShift Bugs / OCPBUGS-61850

KAS Bootstrap container seeing restarts within timeout period in AKS e2es with KMS enabled


    • Quality / Stability / Reliability
    • In Progress
    • Release Note Not Required

      Bootstrap container restart issue causing test failures in k8s 1.33 AKS management cluster tests

      Problem Description

      Tests are failing in the k8s 1.33 AKS management cluster PR (https://github.com/openshift/release/pull/69180) due to bootstrap container restarts being flagged as test failures. The EnsureNoCrashingPods test validation treats any container restart as a failure, even when the system ultimately reaches a healthy state.

      Empirical Evidence Gathered

      Test Failure Pattern (Consistent across all test runs):

        testresults/1967908323000848384/build-log.txt:739:
        util.go:755: Container bootstrap in pod kube-apiserver-69b778c7b4-kmllq has a restartCount > 0 (1)
      
        testresults/1967756095346708480/build-log.txt:
        util.go:755: Container bootstrap in pod kube-apiserver-67b798b65d-zt4vh has a restartCount > 0 (1)
      
        testresults/1967714528665800704/build-log.txt:679:
        util.go:755: Container bootstrap in pod kube-apiserver-6978d766c5-6nm9z has a restartCount > 0 (1)
        

      Bootstrap Container Connection Failures (First attempt):

      From bootstrap-previous.log, showing the failed first attempt:

        {"level":"error","ts":"2025-09-16T12:03:46Z","msg":"failed to apply bootstrap resources, retrying","error":"failed to createOrUpdate file /work/0000_03_config-operator_01_clusterresourcequotas.crd.yaml: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://localhost:6443/apis/apiextensions.k8s.io/v1\": dial tcp [::1]:6443: connect: connection refused"}
        

      Timing Analysis:

      • Start: 12:03:46Z - First connection attempt
      • End: 12:04:36Z - Final failure with "context deadline exceeded"
      • Duration: 50 seconds (matches configured timeout)
      • Pattern: Connection refused errors repeated every 500ms for exactly 50 seconds

      Bootstrap Container Success (After restart):

      From bootstrap.log showing successful second attempt:

        {"level":"info","ts":"2025-09-16T12:04:39Z","logger":"kas-bootstrap","msg":"Processing file","path":"/work/0000_03_config-operator_01_clusterresourcequotas.crd.yaml"}
        ...
        {"level":"info","ts":"2025-09-16T12:04:40Z","msg":"kas-bootstrap process completed successfully, waiting for termination signal"}
        

      Key observation: The second attempt starts at 12:04:39Z, 3 seconds after the first attempt failed, and completes successfully by 12:04:40Z, about one second later.

      Validation Code Location:

      The failing validation logic at util.go:755 treats any restart as a test failure:

        === FAIL: . TestCreateClusterCustomConfig/ValidateHostedCluster/EnsureNoCrashingPods (0.26s)
        util.go:755: Container bootstrap in pod kube-apiserver-XXX has a restartCount > 0 (1)
        
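
      For discussion, one way the check could tolerate a bounded number of restarts for specific containers is sketched below. This is an illustration only and is not the code at util.go:755: the `containerStatus` type, the `findCrashingContainers` function, and the per-container restart allowance are all hypothetical, and the real test reads container statuses from the Kubernetes API rather than from plain structs.

```go
package main

import "fmt"

// containerStatus is a hypothetical, trimmed-down stand-in for the
// per-container status that the real validation reads from each pod.
type containerStatus struct {
	Pod          string
	Container    string
	RestartCount int
}

// findCrashingContainers (hypothetical) reports every container whose
// restart count exceeds its allowance. Containers that are not listed in
// allowedRestarts get an allowance of zero, which matches the current
// strict behavior.
func findCrashingContainers(statuses []containerStatus, allowedRestarts map[string]int) []string {
	var failures []string
	for _, s := range statuses {
		allowed := allowedRestarts[s.Container] // zero when absent or map is nil
		if s.RestartCount > allowed {
			failures = append(failures, fmt.Sprintf(
				"Container %s in pod %s has a restartCount > %d (%d)",
				s.Container, s.Pod, allowed, s.RestartCount))
		}
	}
	return failures
}

func main() {
	// The first entry mirrors the reported failure; the second entry is
	// illustrative data, not taken from the test logs.
	statuses := []containerStatus{
		{Pod: "kube-apiserver-69b778c7b4-kmllq", Container: "bootstrap", RestartCount: 1},
		{Pod: "kube-apiserver-69b778c7b4-kmllq", Container: "kube-apiserver", RestartCount: 0},
	}

	strict := findCrashingContainers(statuses, nil)
	tolerant := findCrashingContainers(statuses, map[string]int{"bootstrap": 1})
	fmt.Println(len(strict), len(tolerant))
}
```

      An allowance like this would hide the symptom in the test without removing the restart itself, so it does not replace fixing the startup race.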

      Actual Results

      • Test Status: FAIL
      • Failure Reason: EnsureNoCrashingPods test fails due to bootstrap container restart count > 0
      • System State: Healthy - all containers running, kube-apiserver functional, bootstrap completed successfully
      • Restart Count: Exactly 1 restart across all test runs (consistent pattern)
      • Root Cause: Race condition - bootstrap container starts before kube-apiserver is ready to accept connections

      Expected Results

      • Test Status: PASS
      • Container Behavior: Bootstrap container should complete successfully without restarts
      • System State: Healthy - all containers running with restart count = 0
      • Timing: Bootstrap should wait for kube-apiserver readiness before attempting connections

      Root Cause Analysis

      1. Race Condition: Bootstrap container attempts to connect to localhost:6443 immediately upon startup
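      2. Insufficient Timeout: The 50-second timeout is too short. The first attempt was still refused when it timed out at 12:04:36Z, and the second attempt connected at 12:04:39Z, so kube-apiserver became ready between roughly 50 and 53 seconds after the first connection attempt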
      3. Test Validation Issue: Validation logic treats expected initialization restarts as failures
      4. Not a k8s 1.33 Compatibility Issue: No version-specific errors found in logs; issue is timing-related

      Impact

      • Blocking k8s 1.33 AKS management cluster integration
      • False positive test failures masking actual issues
      • Inconsistent test results due to timing dependencies

              rh-ee-brcox Bryan Cox
              rh-ee-brcox Bryan Cox
              None
              None
              He Liu He Liu
              None