-
Bug
-
Resolution: Unresolved
-
Major
-
4.19.z
Bootstrap container restart issue causing test failures in k8s 1.33 AKS management cluster tests
Problem Description
Tests are failing in the k8s 1.33 AKS management cluster PR (https://github.com/openshift/release/pull/69180) due to bootstrap container restarts being flagged as test failures. The EnsureNoCrashingPods test validation treats any container restart as a failure, even when the system ultimately reaches a healthy state.
Empirical Evidence Gathered
Test Failure Pattern (Consistent across all test runs):
testresults/1967908323000848384/build-log.txt:739: util.go:755: Container bootstrap in pod kube-apiserver-69b778c7b4-kmllq has a restartCount > 0 (1) testresults/1967756095346708480/build-log.txt: util.go:755: Container bootstrap in pod kube-apiserver-67b798b65d-zt4vh has a restartCount > 0 (1) testresults/1967714528665800704/build-log.txt:679: util.go:755: Container bootstrap in pod kube-apiserver-6978d766c5-6nm9z has a restartCount > 0 (1)
Bootstrap Container Connection Failures (First attempt):
From bootstrap-previous.log showing failed first attempt:
{"level":"error","ts":"2025-09-16T12:03:46Z","msg":"failed to apply bootstrap resources, retrying","error":"failed to createOrUpdate file /work/0000_03_config-operator_01_clusterresourcequotas.crd.yaml: failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://localhost:6443/apis/apiextensions.k8s.io/v1\": dial tcp [::1]:6443: connect: connection refused"}
Timing Analysis:
- Start: 12:03:46Z - First connection attempt
- End: 12:04:36Z - Final failure with "context deadline exceeded"
- Duration: 50 seconds (matches configured timeout)
- Pattern: Connection refused errors repeated every 500ms for exactly 50 seconds
Bootstrap Container Success (After restart):
From bootstrap.log showing successful second attempt:
{"level":"info","ts":"2025-09-16T12:04:39Z","logger":"kas-bootstrap","msg":"Processing file","path":"/work/0000_03_config-operator_01_clusterresourcequotas.crd.yaml"} ... {"level":"info","ts":"2025-09-16T12:04:40Z","msg":"kas-bootstrap process completed successfully, waiting for termination signal"}
Key observation: Second attempt starts at 12:04:39Z (3 seconds after first failed) and completes successfully in under 1 second.
Validation Code Location:
The failing validation logic at util.go:755 treats any restart as a test failure:
=== FAIL: . TestCreateClusterCustomConfig/ValidateHostedCluster/EnsureNoCrashingPods (0.26s) util.go:755: Container bootstrap in pod kube-apiserver-XXX has a restartCount > 0 (1)
Actual Results
- Test Status: FAIL
- Failure Reason: EnsureNoCrashingPods test fails due to bootstrap container restart count > 0
- System State: Healthy - all containers running, kube-apiserver functional, bootstrap completed successfully
- Restart Count: Exactly 1 restart across all test runs (consistent pattern)
- Root Cause: Race condition - bootstrap container starts before kube-apiserver is ready to accept connections
Expected Results
- Test Status: PASS
- Container Behavior: Bootstrap container should complete successfully without restarts
- System State: Healthy - all containers running with restart count = 0
- Timing: Bootstrap should wait for kube-apiserver readiness before attempting connections
Root Cause Analysis
- Race Condition: Bootstrap container attempts to connect to localhost:6443 immediately upon startup
- Insufficient Timeout: 50-second timeout is inadequate - kube-apiserver becomes ready at ~53 seconds
- Test Validation Issue: Validation logic treats expected initialization restarts as failures
- Not a k8s 1.33 Compatibility Issue: No version-specific errors found in logs; issue is timing-related
Impact
- Blocking k8s 1.33 AKS management cluster integration
- False positive test failures masking actual issues
- Inconsistent test results due to timing dependencies
- blocks
-
OCPBUGS-62182 KAS Bootstrap container seeing restarts within timeout period in AKS e2es with KMS enabled
-
- Verified
-
- is cloned by
-
OCPBUGS-62182 KAS Bootstrap container seeing restarts within timeout period in AKS e2es with KMS enabled
-
- Verified
-
- links to