-
Bug
-
Resolution: Done-Errata
-
Major
-
4.14
-
Important
-
No
-
SDN Sprint 236, SDN Sprint 237
-
2
-
Proposed
-
False
-
Description of problem:
Cluster upgrade failure has been affecting three consecutive nightly payloads:

https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-20-041508
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-21-120836
https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-22-035713

In all three cases, the upgrade appears to fail waiting on network. Take this job as an example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624

The cluster version operator complains that the network operator has not finished updating:

I0522 07:12:58.540244 1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)

This log can be seen in
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5565f87cc6-6sjqf_cluster-version-operator.log

The network operator keeps waiting with the following log, and this lasted over 2 hours:

I0522 07:12:58.563312 1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.

The log can be seen in
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-network-operator_network-operator-6975b7b8ff-pdxzk_network-operator.log

Compared with a working job, there is an error getting *v1alpha1.PodNetworkConnectivityCheck in the openshift-network-diagnostics_network-check-source logs:

W0522 04:34:18.527315 1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
E0522 04:34:18.527391 1 reflector.go:140] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)

It is not clear whether this is really relevant. Also worth mentioning: every time this problem happens, machine-config and dns are also stuck on the older version.

This has affected the 4.14 nightly payload three times. If it shows up more consistently, we may have to increase the severity of the bug. Please ping TRT if any more info is needed.
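For context on the reflector error above: it suggests the controlplane.operator.openshift.io/v1alpha1 group/version was not being served at that point in the upgrade. The Go snippet below is a minimal, hypothetical sketch (not taken from the affected components) of how one could reproduce that discovery step against a live cluster with client-go; the kubeconfig path and the hard-coded group/version string are assumptions for illustration only.

// Minimal sketch: ask the apiserver which resources it serves for
// controlplane.operator.openshift.io/v1alpha1, mirroring the lookup the
// failing reflector depends on. Assumes a local kubeconfig; adjust as needed.
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client config from the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("building kubeconfig: %v", err)
	}

	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatalf("creating discovery client: %v", err)
	}

	// Query the discovery endpoint for this group/version.
	resources, err := dc.ServerResourcesForGroupVersion("controlplane.operator.openshift.io/v1alpha1")
	if err != nil {
		log.Fatalf("group/version not served: %v", err)
	}
	for _, r := range resources.APIResources {
		fmt.Println(r.Name) // "podnetworkconnectivitychecks" should appear when healthy
	}
}

If the group/version is served, podnetworkconnectivitychecks should be listed; if the same "could not find the requested resource" error comes back, that would corroborate the reflector failure seen in network-check-source.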
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
links to:
RHSA-2023:5006 OpenShift Container Platform 4.14.z security update