OpenShift Bugs / OCPBUGS-31894

[120 Nodes] SDN --> OVNKubernetes Offline Migration Fails after reboots

      Description of problem:

      To validate offline migration from OpenShift SDN to OVN-Kubernetes (OVN-IC) at large scale, we performed an SDN-to-OVN-K migration on a 120-node cluster pre-loaded with the cluster-density-v2 workload.
      
      After updating the networkType field of the Network.config.openshift.io CR to OVNKubernetes and rebooting the nodes, the cluster's nodes remained stuck in the NotReady state for more than 6 hours. On investigation, the following was found in the ovnkube-controller container on one of the master nodes:
      
      ========================================================================
      
        ovnkube-controller:
          Container ID:  cri-o://fe8ad966f61423b6cee23c622594b834cd270566d8ea90261e7fd2023d6017ff
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Port:          29105/TCP
          Host Port:     29105/TCP
          Command:
            /bin/bash
            -c
            set -xe
            . /ovnkube-lib/ovnkube-lib.sh || exit 1
            start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
            
          State:       Running
            Started:   Tue, 09 Apr 2024 00:30:26 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   5:31.727933  208842 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-zsdr8
      I0408 18:55:31.727941  208842 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-zsdr8 in node ip-10-0-44-148.us-west-2.compute.internal
      I0408 18:55:31.727946  208842 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-zsdr8
      I0408 18:55:31.727956  208842 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
      I0408 18:55:31.727965  208842 obj_retry.go:607] Update event received for *factory.egressIPPod openshift-multus/multus-zsdr8
      I0408 18:55:31.751675  208842 ovs.go:167] Exec(1207): stdout: "7885e979-e03b-48d8-8495-331b0f3ce391\n"
      I0408 18:55:31.751694  208842 ovs.go:168] Exec(1207): stderr: ""
      I0408 18:55:31.751709  208842 default_node_network_controller.go:639] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - 7885e979-e03b-48d8-8495-331b0f3ce391 : stderr -  : err <nil>
      I0408 18:55:31.751739  208842 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
      I0408 18:55:31.751769  208842 ovs.go:164] Exec(1208): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow ma
            Exit Code:    1
            Started:      Tue, 09 Apr 2024 00:20:21 +0530
            Finished:     Tue, 09 Apr 2024 00:25:32 +0530
          Ready:          False
          Restart Count:  55
      
      ========================================================================
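
      For reference, the failing lookup in the log can be re-run by hand. The "Upgrade Hack: checkOVNSBNodeLRSR" loop issues one ovn-sbctl query per node subnet, so on a 120-node cluster each restart walks through roughly 120 sequential lookups. A sketch of the manual query, run from the sbdb container of the local ovnkube-node pod (pod name from this cluster; match string copied verbatim from the log above):

      $ oc exec -n openshift-ovn-kubernetes ovnkube-node-8742m -c sbdb -- \
          ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid \
          find logical_flow 'match="reg7 == 0 && ip4.dst == 10.128.4.0/23"'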

      Version-Release number of selected component (if applicable):

          OCP Version: 4.14.10
          ovs-vswitchd (Open vSwitch) 3.1.2

      How reproducible:

      Easily reproducible

      Steps to Reproduce:

      The steps below perform the SDN-to-OVN-K migration (a sketch of the commands the script automates follows the list):

          1. git clone https://github.com/cloud-bulldozer/e2e-benchmarking
          2. cd e2e-benchmarking/workloads/sdn2ovn/
          3. ./run.sh
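
      run.sh automates the documented offline-migration procedure. A minimal sketch of the core commands it drives (command forms taken from the standard SDN-to-OVN-Kubernetes migration flow; the script's exact flags may differ):

          $ oc patch Network.operator.openshift.io cluster --type='merge' \
              --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
          $ oc patch Network.config.openshift.io cluster --type='merge' \
              --patch '{"spec":{"networkType":"OVNKubernetes"}}'
          # reboot every node, then watch the nodes come back
          $ oc get nodes -w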

      Actual results:

          Nodes remain NotReady for more than 6 hours after the reboot; ovnkube-controller crash-loops on the master nodes and the migration never completes.

      Expected results:

          The CNI migrates to OVN-Kubernetes and all nodes return to the Ready state.

      Additional info:

      Cluster operators details:
      ========================================================================
      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.10   False       False         True       6h8m    APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
      baremetal                                  4.14.10   True        False         False      10h     
      cloud-controller-manager                   4.14.10   True        False         False      10h     
      cloud-credential                           4.14.10   True        False         False      10h     
      cluster-autoscaler                         4.14.10   True        False         False      10h     
      config-operator                            4.14.10   True        False         False      10h     
      console                                    4.14.10   False       False         False      6h8m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com): Get "https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com": EOF
      control-plane-machine-set                  4.14.10   True        False         False      7h6m    
      csi-snapshot-controller                    4.14.10   True        True          False      10h     CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
      dns                                        4.14.10   True        False         False      10h     
      etcd                                       4.14.10   True        False         True       10h     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:4398145115237214508 name:"ip-10-0-69-222.us-west-2.compute.internal" peerURLs:"https://10.0.69.222:2380" clientURLs:"https://10.0.69.222:2379"  Healthy:true Took:1.045453ms Error:<nil>} {Member:ID:7320495613934196650 name:"ip-10-0-34-215.us-west-2.compute.internal" peerURLs:"https://10.0.34.215:2380" clientURLs:"https://10.0.34.215:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.34.215:2379]: context deadline exceeded} {Member:ID:8759161088354208548 name:"ip-10-0-19-163.us-west-2.compute.internal" peerURLs:"https://10.0.19.163:2380" clientURLs:"https://10.0.19.163:2379"  Healthy:true Took:2.066578ms Error:<nil>}]...
      image-registry                             4.14.10   True        False         False      10h     
      ingress                                    4.14.10   True        False         False      10h     
      insights                                   4.14.10   True        False         False      10h     
      kube-apiserver                             4.14.10   True        False         False      10h     
      kube-controller-manager                    4.14.10   True        False         False      10h     
      kube-scheduler                             4.14.10   True        False         False      10h     
      kube-storage-version-migrator              4.14.10   True        False         False      6h44m   
      machine-api                                4.14.10   True        False         False      10h     
      machine-approver                           4.14.10   True        False         False      10h     
      machine-config                             4.14.10   True        False         False      10h     
      marketplace                                4.14.10   True        False         False      10h     
      monitoring                                 4.14.10   True        False         False      10h     
      network                                    4.14.10   True        True          True       10h     DaemonSet "/openshift-multus/multus" rollout is not making progress - pod multus-26q59 is in CrashLoopBackOff State...
      node-tuning                                4.14.10   True        False         False      10h     
      openshift-apiserver                        4.14.10   False       False         False      6h2m    APIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      openshift-controller-manager               4.14.10   True        False         False      10h     
      openshift-samples                          4.14.10   True        False         False      10h     
      operator-lifecycle-manager                 4.14.10   True        False         False      10h     
      operator-lifecycle-manager-catalog         4.14.10   True        False         False      10h     
      operator-lifecycle-manager-packageserver   4.14.10   True        False         False      10h     
      service-ca                                 4.14.10   True        True          False      10h     Progressing: ...
      storage                                    4.14.10   True        True          False      10h     AWSEBSProgressing: Waiting for Deployment to deploy pods
      $
      ========================================================================
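
      The etcd operator reports member ip-10-0-34-215 as unhealthy. One way to confirm from a healthy etcd pod (a sketch; the etcdctl container in the etcd static pods ships a preconfigured etcdctl):

      $ oc rsh -n openshift-etcd -c etcdctl etcd-ip-10-0-19-163.us-west-2.compute.internal
      sh-5.1# etcdctl endpoint health --cluster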
      
      $ oc get po -n openshift-etcd
      NAME                                                          READY   STATUS              RESTARTS   AGE
      etcd-guard-ip-10-0-19-163.us-west-2.compute.internal          0/1     ContainerCreating   1          7h44m
      etcd-guard-ip-10-0-34-215.us-west-2.compute.internal          0/1     ContainerCreating   1          7h27m
      etcd-guard-ip-10-0-69-222.us-west-2.compute.internal          0/1     ContainerCreating   1          7h34m
      etcd-ip-10-0-19-163.us-west-2.compute.internal                4/4     Running             8          10h
      etcd-ip-10-0-34-215.us-west-2.compute.internal                4/4     Running             8          10h
      etcd-ip-10-0-69-222.us-west-2.compute.internal                4/4     Running             8          10h
      revision-pruner-7-ip-10-0-19-163.us-west-2.compute.internal   0/1     Completed           0          7h47m
      revision-pruner-7-ip-10-0-34-215.us-west-2.compute.internal   0/1     Completed           0          7h33m
      revision-pruner-7-ip-10-0-69-222.us-west-2.compute.internal   0/1     Completed           0          7h39m
      ========================================================================
      ========================================================================
      $ oc describe po etcd-guard-ip-10-0-19-163.us-west-2.compute.internal  -n openshift-etcd
      Name:                 etcd-guard-ip-10-0-19-163.us-west-2.compute.internal
      Namespace:            openshift-etcd
      Priority:             2000000000
      Priority Class Name:  system-cluster-critical
      Service Account:      default
      Node:                 ip-10-0-19-163.us-west-2.compute.internal/10.0.19.163
      Start Time:           Mon, 08 Apr 2024 16:52:40 +0530
      Labels:               app=guard
      Annotations:          k8s.ovn.org/pod-networks:
                              {"default":{"ip_addresses":["10.128.44.16/23"],"mac_address":"0a:58:0a:80:2c:10","gateway_ips":["10.128.44.1"],"routes":[{"dest":"10.128.0...
                            k8s.v1.cni.cncf.io/network-status:
                              [{
                                  "name": "openshift-sdn",
                                  "interface": "eth0",
                                  "ips": [
                                      "10.128.0.17"
                                  ],
                                  "default": true,
                                  "dns": {}
                              }]
      Status:               Running
      IP:                   
      IPs:                  <none>
      Containers:
        guard:
          Container ID: 
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c691f68a37812bf1501bc243ebabf6cb845a927873960e294c677f21fcade49
          Image ID:      
          Port:          <none>
          Host Port:     <none>
          Command:
            /bin/bash
          Args:
            -c
            # properly handle TERM and exit as soon as it is signaled
            set -euo pipefail
            trap 'jobs -p | xargs -r kill; exit 0' TERM
            sleep infinity & wait
            
          State:          Waiting
            Reason:       ContainerCreating
          Last State:     Terminated
            Reason:       ContainerStatusUnknown
            Message:      The container could not be located when the pod was deleted.  The container used to be Running
            Exit Code:    137
            Started:      Mon, 01 Jan 0001 00:00:00 +0000
            Finished:     Mon, 01 Jan 0001 00:00:00 +0000
          Ready:          False
          Restart Count:  1
          Requests:
            cpu:        10m
            memory:     5Mi
          Readiness:    http-get https://10.0.19.163:9980/readyz delay=0s timeout=5s period=5s #success=1 #failure=3
          Environment:  <none>
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwq7f (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             False 
        ContainersReady   False 
        PodScheduled      True 
      Volumes:
        kube-api-access-fwq7f:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/etcd:NoSchedule op=Exists
                                   node-role.kubernetes.io/master:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists
      Events:
        Type     Reason           Age                       From     Message
        ----     ------           ----                      ----     -------
        Warning  FailedMount      37m (x179 over 6h27m)     kubelet  MountVolume.SetUp failed for volume "kube-api-access-fwq7f" : [object "openshift-etcd"/"kube-root-ca.crt" not registered, object "openshift-etcd"/"openshift-service-ca.crt" not registered]
        Warning  NetworkNotReady  2m2s (x11537 over 6h27m)  kubelet  network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
      $
      ========================================================================
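      The NetworkNotReady events above point at a missing CNI config on the node. A quick check of the directory kubelet is complaining about (a sketch using oc debug; the path is taken verbatim from the kubelet error):

      $ oc debug node/ip-10-0-19-163.us-west-2.compute.internal -- \
          chroot /host ls -l /etc/kubernetes/cni/net.d/
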
      $ oc get po -n openshift-ovn-kubernetes -o wide | egrep -i "ip-10-0-19-163.us-west-2.compute.internal|ip-10-0-34-215.us-west-2.compute.internal|ip-10-0-69-222.us-west-2.compute.internal"
      ovnkube-control-plane-58b785bcd-5hmv5   2/2     Running            2                6h32m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>           <none>
      ovnkube-control-plane-58b785bcd-rm59v   2/2     Running            2                6h32m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>           <none>
      ovnkube-control-plane-58b785bcd-shjhf   2/2     Running            2                6h32m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-8742m                      7/8     Running            57 (2m23s ago)   6h26m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-nrj8k                      7/8     Running            56 (3m24s ago)   6h27m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-zqg8k                      7/8     Running            64 (9m58s ago)   6h27m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>           <none>
      $
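      With restart counts in the mid-50s and climbing, the tail of the previous ovnkube-controller attempt is the most useful artifact to capture:

      $ oc logs -n openshift-ovn-kubernetes ovnkube-node-8742m \
          -c ovnkube-controller --previous > ovnkube-controller-prev.log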
      ========================================================================
      Snippet from oc describe on the ovnkube-node pod (ovnkube-node-8742m):
        ovnkube-controller:
          Container ID:  cri-o://116653ee3b5d24d2957e8a371e81fd9edc201b92f6f9a7463d00815c102f2758
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Port:          29105/TCP
          Host Port:     29105/TCP
          Command:
            /bin/bash
            -c
            set -xe
            . /ovnkube-lib/ovnkube-lib.sh || exit 1
            start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
            
          State:       Running
            Started:   Tue, 09 Apr 2024 00:36:44 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   SBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - aca9df32-208f-48f5-a6e0-acfb0e2b4d5e : stderr -  : err <nil>
      I0408 19:06:31.987410  209557 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
      I0408 19:06:31.987425  209557 ovs.go:164] Exec(1091): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow match="reg7 == 0 && ip4.dst == 10.131.34.0/23"
      I0408 19:06:32.088331  209557 obj_retry.go:555] Update event received for resource *v1.Pod, old object is equal to new: false
      I0408 19:06:32.088355  209557 default_network_controller.go:650] Recording update event on pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088371  209557 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088382  209557 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-qdwc2 in node ip-10-0-28-47.us-west-2.compute.internal
      I0408 19:06:32.088388  209557 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088395  209557 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
      I0408 19:06:32.088401  209557 obj_retry.go:607] Update event received for *facto
            Exit Code:    1
            Started:      Tue, 09 Apr 2024 00:31:26 +0530
            Finished:     Tue, 09 Apr 2024 00:36:32 +0530
          Ready:          False
          Restart Count:  56
          Requests:
            cpu:      10m
            memory:   600Mi
          Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=30s #success=1 #failure=3
          Environment:
            KUBERNETES_SERVICE_PORT:          6443
            KUBERNETES_SERVICE_HOST:          api-int.har-sdn-120.perfscale.devcluster.openshift.com
            OVN_CONTROLLER_INACTIVITY_PROBE:  180000
            OVN_KUBE_LOG_LEVEL:               4
            K8S_NODE:                          (v1:spec.nodeName)
            POD_NAME:                         ovnkube-node-8742m (v1:metadata.name)
          Mounts:
            /cni-bin-dir from host-cni-bin (rw)
            /env from env-overrides (rw)
            /etc/cni/net.d from host-cni-netd (rw)
            /etc/openvswitch from etc-openvswitch (rw)
            /etc/ovn/ from etc-openvswitch (rw)
            /etc/systemd/system from systemd-units (ro)
            /host from host-slash (ro)
            /ovnkube-lib from ovnkube-script-lib (rw)
            /run/netns from host-run-netns (ro)
            /run/openvswitch from run-openvswitch (rw)
            /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
            /run/ovn/ from run-ovn (rw)
            /run/ovnkube-config/ from ovnkube-config (rw)
            /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
            /var/lib/kubelet from host-kubelet (ro)
            /var/lib/openvswitch from var-lib-openvswitch (rw)
            /var/log/ovnkube/ from etc-openvswitch (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rg296 (ro)
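
      Each terminated attempt above runs for roughly five minutes (Started 00:31:26, Finished 00:36:32) before exiting with code 1, which is consistent with the per-node ovn-sbctl lookups (--timeout=15) eventually failing or timing out partway through the node list. The crash cadence across the masters can be watched with:

      $ oc get pods -n openshift-ovn-kubernetes -o wide -w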


            Assignee: Peng Liu (pliurh)
            Reporter: Krishna Harsha Voora (rh-ee-krvoora)
            QA Contact: Anurag Saxena
