-
Bug
-
Resolution: Obsolete
-
Major
-
None
-
4.16.0
-
Quality / Stability / Reliability
-
False
-
-
None
-
None
-
No
-
None
-
None
-
Rejected
Description of problem:
Beginning with https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4188/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1761092719893024769, e2e-gcp-op job runs began taking over 3 hours in some cases. This hits the OpenShift CI e2e timeout, which causes the tests to fail.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Nearly always
Steps to Reproduce:
- Open a PR to the openshift/machine-config-operator repository.
- Wait for the e2e-gcp-op CI job to run and eventually time out.
Actual results:
After approximately 3.5 hours, the e2e-gcp-op job hits the OpenShift CI timeout, causing the job to fail. Here is a list of sample jobs showing this timeout:
- https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4106/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1762561731067908096
- https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4106/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1762458171915374592
- https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4016/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1762426857241710592
Expected results:
I expected the jobs to complete in about 3 hours, in keeping with the prevailing trend. Additionally, I'm puzzled that the build logs contain no test output for the test execution.
Ideally, we can bring average test execution time back down to just under three hours and understand why no test logs were emitted.
Additional info:
- The last generally successful test run was https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4188/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1761092719893024769. In this job, the main branch commit was https://github.com/openshift/machine-config-operator/commit/4805507d1735dd848cafd4f0b9a1d916bbafd028.
- However, https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4214/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1762149797319413760 was built against https://github.com/openshift/machine-config-operator/commit/398c1710f12622e9c0f94f99b12abeabed3ac5f7, which is where the issue seems to begin.
- The following commits are correlated with the onset of the problem. They don't look like they could have caused it, but I'm including them here in the hope that someone more knowledgeable can confirm:
- https://github.com/openshift/machine-config-operator/commit/4805507d1735dd848cafd4f0b9a1d916bbafd028
- https://github.com/openshift/machine-config-operator/commit/98d6a374d54bf12d437be47fc7e27abf5bc02b79
- https://github.com/openshift/machine-config-operator/commit/398c1710f12622e9c0f94f99b12abeabed3ac5f7
- I've also opened https://github.com/openshift/machine-config-operator/pull/4225/, which removes the calls to our junit parsing script. I suspect something is going awry there, which may explain why we're not getting test output or timings. Having that output and those timings would go a long way toward understanding why these tests are now taking 3+ hours.
- For now, we plan to bump both the OpenShift CI timeout and the Makefile timeout to ensure that we don't block anyone.
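
Once junit output is available again, per-test timings could be pulled out of it along the lines of the sketch below. This is only an illustration, not the repository's junit parsing script: it assumes the report uses the common JUnit XML layout (`<testcase>` elements carrying `name` and `time` attributes), and the function name `slowest_testcases`, the suite name, and the test names in `SAMPLE` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def slowest_testcases(junit_xml, top_n=5):
    """Return up to top_n (name, seconds) pairs, longest-running first.

    Assumes the common JUnit XML layout: <testcase name="..." time="...">.
    """
    root = ET.fromstring(junit_xml)
    timings = []
    # iter() walks the whole tree, so this works whether the root element
    # is <testsuites> or a single <testsuite>.
    for case in root.iter("testcase"):
        name = case.get("name", "<unnamed>")
        try:
            seconds = float(case.get("time", "0"))
        except ValueError:
            # Treat a missing or malformed duration as zero rather than failing.
            seconds = 0.0
        timings.append((name, seconds))
    timings.sort(key=lambda pair: pair[1], reverse=True)
    return timings[:top_n]


# Hypothetical sample report; the names and durations are made up.
SAMPLE = """\
<testsuites>
  <testsuite name="hypothetical-suite" tests="3">
    <testcase name="TestHypotheticalFast" time="12.5"/>
    <testcase name="TestHypotheticalSlow" time="5300.0"/>
    <testcase name="TestHypotheticalMedium" time="88.0"/>
  </testsuite>
</testsuites>
"""

if __name__ == "__main__":
    for name, seconds in slowest_testcases(SAMPLE):
        print(f"{seconds:8.1f}s  {name}")
```

Running something like this against the junit artifact from a timed-out job and from the last successful job would show whether a handful of tests account for the extra time or whether the slowdown is spread across the suite.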