OpenShift Bugs / OCPBUGS-60620

CI fails on TestGatewayAPI/testGatewayAPIObjects

      Description of problem

      CI is flaky because of test failures such as the following:

      === RUN   TestAll/serial/TestGatewayAPI/testGatewayAPIObjects
          gateway_api_test.go:185: Creating namespace "test-e2e-gwapi-rvvsz"...
          . . .
          util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
          util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
          util_gatewayapi_test.go:907: GET test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com failed: status 503, expected 200, retrying...
          util_gatewayapi_test.go:915: Response headers for most recent request: map[Content-Length:[19] Content-Type:[text/plain] Date:[Mon, 18 Aug 2025 21:41:16 GMT]]
          util_gatewayapi_test.go:916: Reponse body for most recent request: no healthy upstream
          util_gatewayapi_test.go:918: Error connecting to test-hostname-qfh94.gws.ci-op-3ctihrx1-43abb.origin-ci-int-aws.dev.rhcloud.com: context deadline exceeded
          util_test.go:954: Dumping events in namespace "test-e2e-gwapi-rvvsz"...
          util_test.go:956: 0001-01-01 00:00:00 +0000 UTC { } Pod test-gateway-openshift-default Scheduled Successfully assigned test-e2e-gwapi-rvvsz/test-gateway-openshift-default to ip-10-0-3-15.ec2.internal
          util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {multus } Pod test-gateway-openshift-default AddedInterface Add eth0 [10.129.2.28/23] from ovn-kubernetes
          util_test.go:956: 2025-08-18 21:33:41 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulling Pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
          util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Pulled Successfully pulled image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest" in 404ms (404ms including waiting). Image size: 891543594 bytes.
          util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Created Created container: echo
          util_test.go:956: 2025-08-18 21:33:42 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Started Started container echo
          util_test.go:956: 2025-08-18 21:34:07 +0000 UTC {kubelet ip-10-0-3-15.ec2.internal} Pod test-gateway-openshift-default Killing Stopping container echo
          util_test.go:958: Deleting namespace "test-e2e-gwapi-rvvsz"... 

      This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1268/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1957527219811127296.

      Checking the logs of the machine-config daemon on the node where the test pod was scheduled, we can see that there was a reboot around the time the test pod was terminated:

      ip-10-0-3-15.ec2.internal$ ag -A5 Reboot | ag 21:34
      openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:103:2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000218    2629 update.go:823] Reboot
      openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:104-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000227    2629 drain.go:132] Checking drain required for node disruption actions
      openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:105-2025-08-18T21:34:01.000271755+00:00 stderr F I0818 21:34:01.000236    2629 update.go:1045] Drain calculated for node disruption: true for config rendered-worker-c162af78974f19592aaf515d842522d3
      openshift-machine-config-operator_machine-config-daemon-t7q9c_5944055f-d6f3-40e9-b042-71e20cce8ad3/machine-config-daemon/1.log:106-2025-08-18T21:34:01.029141687+00:00 stderr F I0818 21:34:01.029099    2629 update.go:2637] "Update prepared; requesting cordon and drain via annotation to controller"

      Version-Release number of selected component (if applicable)

      I have seen this in 4.20 CI jobs.

      How reproducible

      Not always. The failure occurs when a node reboot or drain takes place after the test workload is scheduled.

      Steps to Reproduce

      Actual results

      CI fails.

      Expected results

      CI passes, or fails on some other test failure.

      Additional info

      The test needs to make the workload more resilient to node reboots. Managing the test pod with a controller (such as a ReplicaSet) should be enough, since the controller will recreate the pod on another node when the original is drained.
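
A minimal sketch of that suggestion, assuming the image from the event log above; the name, labels, and elided echo-server command are illustrative, not the ones the test actually uses:

```yaml
# Hypothetical replacement for the bare test pod: a Deployment whose
# ReplicaSet recreates the echo pod if its node is drained or rebooted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-gateway-openshift-default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gwapi-echo
  template:
    metadata:
      labels:
        app: gwapi-echo
    spec:
      containers:
      - name: echo
        image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
        command: ["/bin/sh", "-c", "..."]  # echo-server command elided; not shown in the original report
```

With replicas greater than one and a pod anti-affinity rule, the backend could even stay available during the drain itself rather than merely recovering afterward.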

              Andrey Lebedev (alebedev@redhat.com)
              Ishmam Amin
              Votes: 0
              Watchers: 5