Details
Bug
Resolution: Done
4.11
Description
Description of problem:
The ci/prow/e2e-nutanix-operator test runs failed on both 4.11 and 4.12. The runs appeared to fail at different test cases at random. Running the test suites manually against an OCP cluster deployed in the LTS environment suggested that the failures may be caused by the slow LTS network (DNS server).
Version-Release number of selected component (if applicable):
How reproducible:
The ci/prow/e2e-nutanix-operator test runs always failed on 4.11 and 4.12.
Steps to Reproduce:
Trigger the ci/prow/e2e-nutanix-operator test run with 4.11 or 4.12. Or manually run the actuator-pkg test suites with
Actual results:
The ci/prow/e2e-nutanix-operator test runs failed at different test cases at random.
Expected results:
The ci/prow/e2e-nutanix-operator test runs pass successfully.
Additional info:
Slack thread: https://coreos.slack.com/archives/C0211848DBN/p1659363922100509

When running the actuator-pkg tests manually against the OCP cluster deployed to the LTS-dev environment, I got the test failure below:

------------------------------
[Feature:Machines] Managed cluster should recover from deleted worker machines
/Users/yanhuali/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/infra/infra.go:224
I0729 18:10:15.329330 17649 request.go:601] Waited for 1.050859554s due to client-side throttling, not priority and fairness, request: GET:https://api.nutanix-dev.devcluster.openshift.com:6443/apis/monitoring.coreos.com/v1?timeout=32s
STEP: Creating a new MachineSet
E0729 18:10:49.218277 17649 machinesets.go:319] found 1 Machines in failed phase:
E0729 18:10:49.218296 17649 machinesets.go:329] Failed machine: nutanix-dev-fxq6fkhm65-xkwzb, Reason: InvalidConfiguration, Message: nutanix-dev-fxq6fkhm65-xkwzb: failed in validating machine providerSpec: spec.providerSpec.value.cluster.uuid: Invalid value: "0005d9a4-8e4f-7c33-58d1-e9d0e2d48853": Failed to find cluster with uuid 0005d9a4-8e4f-7c33-58d1-e9d0e2d48853. error: Get "https://prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com:9440/api/nutanix/v3/clusters/0005d9a4-8e4f-7c33-58d1-e9d0e2d48853": dial tcp: lookup prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com on 172.30.0.10:53: read udp 10.128.0.49:40679->172.30.0.10:53: i/o timeout
STEP: Deleting the new MachineSet
• Failure in Spec Setup (BeforeEach) [51.122 seconds]
[Feature:Machines] Managed cluster should
/Users/yanhuali/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/infra/infra.go:141
recover from deleted worker machines [BeforeEach]
/Users/yanhuali/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/infra/infra.go:224
Expected <int>: 1 to equal <int>: 0
/Users/yanhuali/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/framework/machinesets.go:332
------------------------------

The failure appears to have been caused by a DNS name lookup timeout when making the Prism Central API call.