OpenShift Bugs / OCPBUGS-13922

Cluster upgrade failed waiting on network


    • Important
    • SDN Sprint 236, SDN Sprint 237

      Description of problem:

      A cluster upgrade failure has affected three consecutive nightly payloads:
      
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-20-041508
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-21-120836
      https://amd64.ocp.releases.ci.openshift.org/releasestream/4.14.0-0.nightly/release/4.14.0-0.nightly-2023-05-22-035713
      
      In all three cases, the upgrade appears to fail while waiting on the network operator. Take this job as an example:
      
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624
      
      The cluster-version operator complains that the network operator has not finished upgrading:
      
      I0522 07:12:58.540244       1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)
      
      This log can be seen at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5565f87cc6-6sjqf_cluster-version-operator.log
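      When triaging from the gather-extra artifacts, the stuck operators can be listed by grepping the saved CVO log for the ClusterOperatorUpdating errors. A minimal sketch — the file path is hypothetical, and the sample line is the one quoted above:

      ```shell
      # Hypothetical local copy of the CVO log; the sample line is copied from this report.
      cat > /tmp/cvo.log <<'EOF'
      I0522 07:12:58.540244       1 sync_worker.go:1149] Update error 684 of 845: ClusterOperatorUpdating Cluster operator network is updating versions (*errors.errorString: cluster operator network is available and not degraded but has not finished updating to target version)
      EOF
      # List which cluster operators the CVO reports as still updating.
      grep -o 'Cluster operator [a-z-]* is updating' /tmp/cvo.log | sort -u
      ```

      Against the full log this should surface every operator the CVO is still waiting on, not just network.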
      
      The network operator keeps waiting with the following log:
      I0522 07:12:58.563312       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      
      This lasted over two hours. The log can be seen at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade/1660495736527130624/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/pods/openshift-network-operator_network-operator-6975b7b8ff-pdxzk_network-operator.log
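      One way to confirm how long the operator sat in this state is to compare the first and last klog timestamps of the waiting message in the saved network-operator log. A rough sketch — the file path and the second sample line are hypothetical; the first line is the one quoted above:

      ```shell
      # Hypothetical local copy of the network-operator log. First line is from this
      # report; the second is an invented later occurrence for illustration.
      cat > /tmp/network-operator.log <<'EOF'
      I0522 07:12:58.563312       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      I0522 09:20:11.000000       1 connectivity_check_controller.go:166] ConnectivityCheckController is waiting for transition to desired version (4.14.0-0.nightly-2023-05-22-035713) to be completed.
      EOF
      # klog format is "I<MMDD> <HH:MM:SS.micros> ..."; field 2 is the timestamp.
      # Print the first and last timestamps of the waiting message.
      grep 'waiting for transition' /tmp/network-operator.log | awk '{print $2}' | sed -n '1p;$p'
      ```

      On the real log the gap between the two printed timestamps gives the duration the operator spent waiting.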
      
      Compared with a working job, there is an error listing *v1alpha1.PodNetworkConnectivityCheck in the openshift-network-diagnostics network-check-source pod logs:
      W0522 04:34:18.527315       1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
      E0522 04:34:18.527391       1 reflector.go:140] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
      
      It is not clear whether this is actually relevant. Also worth mentioning: every time this problem occurs, the machine-config and dns operators are also stuck on the older version. 
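      The reflector error above names the resource the apiserver could not find; pulling that name out of the log makes it easy to check against the CRDs actually installed on the cluster (e.g. with `oc get crd <name>`). A hypothetical one-liner, using the log line quoted above:

      ```shell
      # The log line is copied verbatim from this report.
      line='W0522 04:34:18.527315       1 reflector.go:424] k8s.io/client-go@v12.0.0+incompatible/tools/cache/reflector.go:169: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)'
      # Extract the resource name from the trailing "(get <resource>)" clause.
      echo "$line" | sed -n 's/.*(get \([^)]*\)).*/\1/p'
      ```

      If the printed resource is missing from `oc get crd`, that would explain the failed list/watch, though (as noted above) it is not yet clear this is the root cause.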
      
      This has affected the 4.14 nightly payloads three times. If it occurs more consistently, we may have to increase the severity of this bug. Please ping TRT if any more info is needed. 
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            pdiak@redhat.com Patryk Diak
            kenzhang@redhat.com Ken Zhang
            Jean Chen
            Votes: 1
            Watchers: 8