Bug
Resolution: Unresolved
Minor
4.20
Quality / Stability / Reliability
NI&D Sprint 275
In Progress
Release Note Not Required
Description of problem
CI is flaky because of test failures such as the following:
=== RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
    gateway_api_test.go:185: Creating namespace "test-e2e-gwapi-rvvsz"...
    . . .
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
    util_gatewayapi_test.go:915: Response headers for most recent request: map[Content-Length:[19] Content-Type:[text/plain] Date:[Mon, 18 Aug 2025 21:41:16 GMT]]
    util_gatewayapi_test.go:916: Reponse body for most recent request: no healthy upstream
    util_gatewayapi_test.go:918: Error connecting to test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded
    util_test.go:954: Dumping events in namespace "test-e2e-gwapi-rvvsz"...
    util_test.go:956: 0001-01-01 00:00:00 +0000 UTC { } Pod test-gateway-openshift-default Scheduled Successfully assigned test-e2e-gwapi-rvvsz/test-gateway-openshift-default to ip-10-0-3-15.ec2.internal
    util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {multus } Pod test-gateway-openshift-default AddedInterface Add eth0 [10.129.2.28/23] from ovn-kubernetes
    util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulling Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulled Successfully pulled image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" in 404ms (404ms including waiting). Image size: 891543594 bytes.
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Created Created container: echo
    util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Started Started container echo
    util_test.go:956: 2025-08-18 21:34:07 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Killing Stopping container echo
    util_test.go:958: Deleting namespace "test-e2e-gwapi-rvvsz"...
This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1268/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1957527219811127296.
Checking the logs of the machine-config daemon on the node where the test pod was scheduled, we can see that a reboot occurred around the time the test pod was terminated:
ip-10-0-3-15.ec2.internal$ ag -A5 Reboot | ag 21:34
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:103:2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000218 2629 update.go:823] Reboot
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:104-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000227 2629 drain.go:132] Checking drain required for node disruption actions
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:105-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000236 2629 update.go:1045] Drain calculated for node disruption: true for config rendered-worker-c162af78974f19592aaf515d842522d3
openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:106-2025-08-18T21:34:01.029141687+00:00 stderr F I0818 21:34:01.029099 2629 update.go:2637] "Update prepared; requesting cordon and drain via annotation to controller"
Version-Release number of selected component (if applicable)
I have seen this in 4.20 CI jobs.
How reproducible
Not always. The failure requires a node reboot or drain to take place after the test workload is scheduled.
Steps to Reproduce
Actual results
CI fails.
Expected results
CI passes, or fails due to some other test failure.
Additional info
The test needs to make the workload more resilient to node reboots. Managing the workload with a controller (such as a ReplicaSet), so that the pod is recreated on another node when its node is drained or rebooted, should be enough.
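As a minimal sketch of that suggestion (names and labels here are illustrative, not the actual test's; the image and container name are taken from the events above), the test could create the echo workload through a Deployment, whose ReplicaSet recreates the pod on a schedulable node after an eviction:

```yaml
# Hypothetical manifest: run the echo workload under a Deployment instead
# of a bare Pod, so a node drain/reboot evicts the pod and the underlying
# ReplicaSet immediately recreates it elsewhere.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-gateway-echo            # illustrative name
  namespace: test-e2e-gwapi-rvvsz    # the test's temporary namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-gateway-echo
  template:
    metadata:
      labels:
        app: test-gateway-echo
    spec:
      containers:
      - name: echo
        image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
```

With this shape, the HTTPRoute's backend Service would select the label instead of a single pod name, so the 503 "no healthy upstream" window shrinks to the time it takes the replacement pod to become ready.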
Clones: OCPBUGS-60302 CI fails on TestGatewayAPI/testGatewayAPIResourcesProtection/Pod_binding_required (Verified)