[OCPBUGS-26605] e2e-gcp-op-layering CI job continuously failing - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16
Affects Version/s: 4.15
Component/s: Machine Config Operator
Labels:
- mco-triaged
- pre-merge-tested

Regression:
No
Sprint:
MCO Sprint 247, MCO Sprint 248
sprint_count:
2
Release Blocker:
Proposed
Blocked:
False
Blocked Reason:

None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Target Version:

4.16.0


Description of problem:

The e2e-gcp-op-layering CI job fails consistently during teardown. Specifically, the TestOnClusterBuildRollsOutImage test fails whenever it attempts to tear down the node. For an example of a failing job, see: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-layering/1744805949165539328

Version-Release number of selected component (if applicable):

How reproducible:

Always

Steps to Reproduce:

Open a PR against the MCO repository on GitHub (openshift/machine-config-operator) so that the e2e-gcp-op-layering job runs.

Actual results:

The teardown portion of the TestOnClusterBuildRollsOutImage test fails as follows:

  utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
    utils.go:1098: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1570
            Error:          Received unexpected error:
                            exit status 1
            Test:           TestOnClusterBuildRollsOutImage
    utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
    utils.go:1098: 
            Error Trace:    /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1312
                                        /usr/lib/golang/src/runtime/panic.go:522
                                        /usr/lib/golang/src/testing/testing.go:980
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
                                        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
                                        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
                                        /usr/lib/golang/src/testing/testing.go:1150
                                        /usr/lib/golang/src/testing/testing.go:1328
                                        /usr/lib/golang/src/testing/testing.go:1570
            Error:          Received unexpected error:
                            exit status 1
            Test:           TestOnClusterBuildRollsOutImage

Expected results:

The teardown portion of the test should pass.

Additional info:

The test teardown currently shells out to the oc command to delete the underlying Machine and Node. We delete them so that the cloud provider provisions a replacement, because issues with opting out of on-cluster builds have yet to be resolved.

When this test was written, it was implemented this way to avoid having to vendor the Machine client and API into the MCO codebase. That vendoring has since happened. I suspect that oc is failing in some way, since all we get from the place where it is invoked is exit status 1. Now that the Machine client and API are vendored into the MCO codebase, it makes more sense to use them directly instead of shelling out to oc, since we would get more verbose error messages.

links to

openshift/machine-config-operator#4110: OCPBUGS-26605: use machine client instead of oc for teardown

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update
