Bug
Resolution: Unresolved
Minor
4.20
Quality / Stability / Reliability
NI&D Sprint 275
In Progress
Release Note Not Required
Description of problem
CI is flaky because of test failures such as the following:
=== RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
    gateway_api_test.go:185: Creating namespace "test-e2e-gwapi-rvvsz"...
    . . .
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:915: Response headers for most recent request: map[Content-Length:[19] Content-Type:[text/plain] Date:[Mon, 18 Aug 2025 21:41:16 GMT]]
    util_gatewayapi_test.go:916: Reponse body for most recent request: no healthy upstream
    util_gatewayapi_test.go:918: Error connecting to test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded
    util_test.go:954: Dumping events in namespace "test-e2e-gwapi-rvvsz"...
    util_test.go:956: 0001-01-01 00:00:00 +0000 UTC { } Pod test-gateway-openshift-default Scheduled Successfully assigned test-e2e-gwapi-rvvsz/test-gateway-openshift-default to ip-10-0-3-15.ec2.internal
    util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {multus } Pod test-gateway-openshift-default AddedInterface Add eth0 [10.129.2.28/23] from ovn-kubernetes
    util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulling Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulled Successfully pulled image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" in 404ms (404ms including waiting). Image size: 891543594 bytes.
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Created Created container: echo
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Started Started container echo
    util_test.go:956: 2025-08-18 21:34:07 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Killing Stopping container echo
    util_test.go:958: Deleting namespace "test-e2e-gwapi-rvvsz"...
This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1268/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1957527219811127296.
Checking the logs of the machine-config daemon on the node where the test pod was scheduled, we can see that a reboot occurred around the time the test pod was terminated:
ip-10-0-3-15.ec2.internal$ ag -A5 Reboot | ag 21:34
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:103:2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000218 2629 update.go:823] Reboot
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:104-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000227 2629 drain.go:132] Checking drain required for node disruption actions
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:105-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000236 2629 update.go:1045] Drain calculated for node disruption: true for config rendered-worker-c162af78974f19592aaf515d842522d3
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:106-2025-08-18T21:34:01.029141687+00:00 stderr F I0818 21:34:01.029099 2629 update.go:2637] "Update prepared; requesting cordon and drain via annotation to controller"
Version-Release number of selected component (if applicable)
I have seen this in 4.20 CI jobs.
How reproducible
Not always. The failure requires a node reboot or drain to take place after the test workload is scheduled.
Steps to Reproduce
Actual results
CI fails.
Expected results
CI passes, or fails due to some other test failure.
Additional info
The test needs to make the workload more resilient to node reboots. Managing the workload with a controller (such as a ReplicaSet), so that the pod is recreated on another node when its node is drained or rebooted, should be enough.
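As a minimal sketch of that suggestion (names and labels here are illustrative, not the actual test's; the image and container name are taken from the events above), the test could create the echo workload through a Deployment, whose ReplicaSet recreates the pod on a schedulable node after an eviction:

```yaml
# Hypothetical manifest: run the echo workload under a Deployment instead
# of a bare Pod, so a node drain/reboot evicts the pod and the underlying
# ReplicaSet immediately recreates it elsewhere.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-gateway-echo            # illustrative name
  namespace: test-e2e-gwapi-rvvsz    # the test's temporary namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-gateway-echo
  template:
    metadata:
      labels:
        app: test-gateway-echo
    spec:
      containers:
      - name: echo
        image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
```

With this shape, the HTTPRoute's backend Service would select the label instead of a single pod name, so the 503 "no healthy upstream" window shrinks to the time it takes the replacement pod to become ready.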
Clones: OCPBUGS-60302 CI fails on TestGatewayAPI/testGatewayAPIResourcesProtection/Pod_binding_required (Verified)