OpenShift Bugs / OCPBUGS-13922

Cluster upgrade failed waiting on network


    • Important
    • SDN Sprint 236, SDN Sprint 237

      Description of problem:

      A cluster upgrade failure has affected three consecutive nightly payloads:
      
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-20-041508
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-21-120836
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-22-035713
      
      In all three cases, the upgrade appears to fail while waiting on the network operator. Take this job as an example:
      
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624
      
      The cluster-version operator complains that the network operator has not finished upgrading:
      
      I0522 07:12:58.540244       1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)
      
      This log can be seen at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5565f87cc6-6sjqf_cluster-version-operator.log
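      When triaging from the gather-extra artifacts, the stuck operators can be listed by grepping the saved CVO log for the ClusterOperatorUpdating errors. A minimal sketch — the file path is hypothetical, and the sample line is the one quoted above:

      ```shell
      # Hypothetical local copy of the CVO log; the sample line is copied from this report.
      cat > /tmp/cvo.log <<'EOF'
      I0522 07:12:58.540244       1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)
      EOF
      # List which cluster operators the CVO reports as still updating.
      grep -o 'Cluster operator [a-z-]* is updating' /tmp/cvo.log | sort -u
      ```

      Against the full log this should surface every operator the CVO is still waiting on, not just network.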
      
      The network operator keeps waiting with the following log:
      I0522 07:12:58.563312       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      
      This lasted over two hours. The log can be seen at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-network-operator_network-operator-6975b7b8ff-pdxzk_network-operator.log
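      One way to confirm how long the operator sat in this state is to compare the first and last klog timestamps of the waiting message in the saved network-operator log. A rough sketch — the file path and the second sample line are hypothetical; the first line is the one quoted above:

      ```shell
      # Hypothetical local copy of the network-operator log. First line is from this
      # report; the second is an invented later occurrence for illustration.
      cat > /tmp/network-operator.log <<'EOF'
      I0522 07:12:58.563312       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      I0522 09:20:11.000000       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      EOF
      # klog format is "I<MMDD> <HH:MM:SS.micros> ..."; field 2 is the timestamp.
      # Print the first and last timestamps of the waiting message.
      grep 'waiting for transition' /tmp/network-operator.log | awk '{print $2}' | sed -n '1p;$p'
      ```

      On the real log the gap between the two printed timestamps gives the duration the operator spent waiting.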
      
      Compared with a working job, there is an error listing *v1alpha1.PodNetworkConnectivityCheck in the openshift-network-diagnostics network-check-source pod logs:
      W0522 04:34:18.527315       1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
      E0522 04:34:18.527391       1 reflector.go:140] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
      
      It is not clear whether this is actually relevant. Also worth mentioning: every time this problem occurs, the machine-config and dns operators are also stuck on the older version. 
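      The reflector error above names the resource the apiserver could not find; pulling that name out of the log makes it easy to check against the CRDs actually installed on the cluster (e.g. with `oc get crd <name>`). A hypothetical one-liner, using the log line quoted above:

      ```shell
      # The log line is copied verbatim from this report.
      line='W0522 04:34:18.527315       1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)'
      # Extract the resource name from the trailing "(get <resource>)" clause.
      echo "$line" | sed -n 's/.*(get \([^)]*\)).*/\1/p'
      ```

      If the printed resource is missing from `oc get crd`, that would explain the failed list/watch, though (as noted above) it is not yet clear this is the root cause.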
      
      This has affected the 4.14 nightly payloads three times. If it occurs more consistently, we may have to increase the severity of this bug. Please ping TRT if any more info is needed. 
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            pdiak@redhat.com Patryk Diak
            kenzhang@redhat.com Ken Zhang
            Jean Chen
            Votes: 1
            Watchers: 8