  1. OpenShift Bugs
  2. OCPBUGS-32581

[249 Nodes] Offline SDN -> OVNK Migration Fails

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.14.z
    • None
    • No
    • False

      Description of problem:

      To validate offline migration from OpenShift SDN to OVN-IC at large scale, an SDN-to-OVNK migration was performed on a cluster pre-loaded with the cluster-density-v2 workload.

      After updating the networkType field of the Network.config.openshift.io CR to OVNKubernetes and rebooting, the nodes hosting the Monitoring Operator were in the "False" state.
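
For reference, the networkType update described above is usually applied as a merge patch to the cluster-scoped Network.config.openshift.io object named `cluster`. The following is a minimal sketch under that assumption, not the exact command issued by the run.sh script referenced in this report. The helper names `build_network_type_patch` and `apply_network_type` and the `DRY_RUN` variable are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: set spec.networkType during an offline SDN -> OVN-K migration.
# build_network_type_patch, apply_network_type and DRY_RUN are illustrative
# names, not part of the oc CLI.

build_network_type_patch() {
  # Emit the JSON merge patch that sets spec.networkType to the first argument.
  printf '{"spec":{"networkType":"%s"}}' "$1"
}

apply_network_type() {
  local patch
  patch="$(build_network_type_patch "$1")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command, so nothing changes by accident.
    echo "oc patch Network.config.openshift.io cluster --type=merge --patch '${patch}'"
  else
    oc patch Network.config.openshift.io cluster --type=merge --patch "${patch}"
  fi
}

apply_network_type OVNKubernetes
```

With `DRY_RUN` unset the script only prints the command; setting `DRY_RUN=0` would run it against the currently logged-in cluster.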

      Version-Release number of selected component (if applicable):

          OCP Version: 4.14.10
          ovs-vswitchd (Open vSwitch) 3.1.2

      How reproducible:

          Reproducible at Scale (252 nodes)

      The steps listed below perform the SDN -> OVN-K migration:

          1. git clone https://github.com/krishvoor/e2e-benchmarking
          2. cd e2e-benchmarking/workloads/sdn2ovn/
          3. ./run.sh

      Actual results:

      [root@vkommadi aws_249_nodes]# oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.10   True        False         False      10m     
      baremetal                                  4.14.10   True        False         False      6h17m   
      cloud-controller-manager                   4.14.10   True        False         False      6h20m   
      cloud-credential                           4.14.10   True        False         False      6h21m   
      cluster-autoscaler                         4.14.10   True        False         False      6h18m   
      config-operator                            4.14.10   True        False         False      6h19m   
      console                                    4.14.10   True        False         False      10m     
      control-plane-machine-set                  4.14.10   True        False         False      17m     
      csi-snapshot-controller                    4.14.10   True        False         False      21m     
      dns                                        4.14.10   True        False         False      6h17m   
      etcd                                       4.14.10   True        False         False      6h16m   
      image-registry                             4.14.10   True        False         False      14m     
      ingress                                    4.14.10   True        False         False      19m     
      insights                                   4.14.10   True        False         False      6h12m   
      kube-apiserver                             4.14.10   True        False         False      6h14m   
      kube-controller-manager                    4.14.10   True        False         False      6h15m   
      kube-scheduler                             4.14.10   True        False         False      6h15m   
      kube-storage-version-migrator              4.14.10   True        False         False      20m     
      machine-api                                4.14.10   True        False         False      6h14m   
      machine-approver                           4.14.10   True        False         False      6h18m   
      machine-config                             4.14.10   True        False         False      122m    
      marketplace                                4.14.10   True        False         False      6h18m   
      monitoring                                 4.14.10   False       True          True       11m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
      network                                    4.14.10   True        True          True       6h19m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-4vq2k is in CrashLoopBackOff State...
      node-tuning                                4.14.10   True        False         False      6h17m   
      openshift-apiserver                        4.14.10   True        False         False      19m     
      openshift-controller-manager               4.14.10   True        False         False      6h17m   
      openshift-samples                          4.14.10   True        False         False      6h11m   
      operator-lifecycle-manager                 4.14.10   True        False         False      6h18m   
      operator-lifecycle-manager-catalog         4.14.10   True        False         False      6h18m   
      operator-lifecycle-manager-packageserver   4.14.10   True        False         False      19m     
      service-ca                                 4.14.10   True        False         False      6h18m   
      storage                                    4.14.10   True        False         False      18m     
      [root@vkommadi aws_249_nodes]#
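
On a cluster of this size it can help to reduce the `oc get co` listing above to the operators that are not healthy. The sketch below assumes the default column order shown in that output (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED); the function name `unhealthy_operators` is hypothetical, and the sample rows are copied from this report with the messages shortened.

```shell
#!/usr/bin/env bash
# Sketch: list cluster operators that are not Available=True,
# Progressing=False, Degraded=False. Reads `oc get co` text on stdin.

unhealthy_operators() {
  awk '
    $1 == "NAME" || NF < 5 { next }                       # skip header and short/blank lines
    $3 != "True" || $4 != "False" || $5 != "False" { print $1 }
  '
}

# Demo on rows taken from the output in this report.
unhealthy_operators <<'EOF'
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
dns           4.14.10   True        False         False      6h17m
monitoring    4.14.10   False       True          True       11m     reconciling Prometheus Operator Admission Webhook Deployment failed
network       4.14.10   True        True          True       6h19m   DaemonSet rollout is not making progress
EOF
```

Against a live cluster the same filter would be used as `oc get co | unhealthy_operators`; for the listing in this report it selects `monitoring` and `network`.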
      
      

      Expected results:

      The CNI is successfully migrated to OVN-Kubernetes, and all nodes are up and active.

      Additional info:

      [root@vkommadi aws_249_nodes]# oc get po -n openshift-monitoring -o wide | grep -v Running
      NAME                                                     READY   STATUS              RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
      monitoring-plugin-764d5bd484-4nmks                       0/1     ContainerCreating   0          20m     <none>          ip-10-0-68-231.us-west-2.compute.internal   <none>           <none>
      monitoring-plugin-764d5bd484-t9dhq                       0/1     ContainerCreating   1          86m     <none>          ip-10-0-45-234.us-west-2.compute.internal   <none>           <none>
      prometheus-operator-admission-webhook-6f5668f5dd-g2j6d   0/1     ContainerCreating   1          135m    <none>          ip-10-0-20-163.us-west-2.compute.internal   <none>           <none>
      prometheus-operator-admission-webhook-6f5668f5dd-gq5sh   0/1     ContainerCreating   1          86m     <none>          ip-10-0-57-31.us-west-2.compute.internal    <none>           <none>
      [root@vkommadi aws_249_nodes]# oc get no/ip-10-0-20-163.us-west-2.compute.internal -oyaml | grep -i machineConfig
          machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
          machineconfiguration.openshift.io/currentConfig: rendered-worker-3e2e53c81c94205dce819f2824ea82ff
          machineconfiguration.openshift.io/desiredConfig: rendered-worker-3e2e53c81c94205dce819f2824ea82ff
          machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-3e2e53c81c94205dce819f2824ea82ff
          machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-3e2e53c81c94205dce819f2824ea82ff
          machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: "1470650"
          machineconfiguration.openshift.io/reason: ""
          machineconfiguration.openshift.io/state: Done
      [root@vkommadi aws_249_nodes]# oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-6d95af83deed644562dc33d38a3712ba   True      False      False      3              3                   3                     0                      6h20m
      worker   rendered-worker-3e2e53c81c94205dce819f2824ea82ff   False     True       False      252            3                   252                   0                      6h20m
      [root@vkommadi aws_249_nodes]#
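
The `oc get mcp` output above shows the worker pool fully updated but with only 3 of 252 machines ready. A small filter can surface that gap directly. This is a sketch that assumes the default column order shown above (MACHINECOUNT in column 6, READYMACHINECOUNT in column 7); the function name `pools_not_ready` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report MachineConfigPools whose READYMACHINECOUNT is below
# MACHINECOUNT. Reads `oc get mcp` text on stdin.

pools_not_ready() {
  awk '
    $1 == "NAME" || NF < 7 { next }                       # skip header and short/blank lines
    ($7 + 0) < ($6 + 0) { printf "%s: %d of %d machines ready\n", $1, $7, $6 }
  '
}

# Demo on the rows from this report.
pools_not_ready <<'EOF'
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6d95af83deed644562dc33d38a3712ba   True      False      False      3              3                   3                     0                      6h20m
worker   rendered-worker-3e2e53c81c94205dce819f2824ea82ff   False     True       False      252            3                   252                   0                      6h20m
EOF
```

Against a live cluster this would be run as `oc get mcp | pools_not_ready`.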
      ==================================================
      
      [root@vkommadi aws_249_nodes]# oc describe po prometheus-operator-admission-webhook-6f5668f5dd-g2j6d -n openshift-monitoring
      ......
        Type     Reason           Age                    From               Message
        ----     ------           ----                   ----               -------
        Normal   Scheduled        137m                   default-scheduler  Successfully assigned openshift-monitoring/prometheus-operator-admission-webhook-6f5668f5dd-g2j6d to ip-10-0-20-163.us-west-2.compute.internal
        Normal   AddedInterface   137m                   multus             Add eth0 [10.130.40.9/23] from openshift-sdn
        Normal   Pulling          137m                   kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d85ff677a4e42abc3d951b761e61421eb3b9c92e5bd7e33a2085a18580349d5"
        Normal   Pulled           137m                   kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d85ff677a4e42abc3d951b761e61421eb3b9c92e5bd7e33a2085a18580349d5" in 3.127892796s (3.127910981s including waiting)
        Normal   Created          137m                   kubelet            Created container prometheus-operator-admission-webhook
        Normal   Started          137m                   kubelet            Started container prometheus-operator-admission-webhook
        Warning  FailedMount      29m                    kubelet            MountVolume.SetUp failed for volume "tls-certificates" : object "openshift-monitoring"/"prometheus-operator-admission-webhook-tls" not registered
        Warning  NetworkNotReady  4m16s (x730 over 29m)  kubelet            network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
      [root@vkommadi aws_249_nodes]#
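
The NetworkNotReady event above reports that no CNI configuration file exists in /etc/kubernetes/cni/net.d/. One way to confirm this on an affected node is to check that directory from a node shell (for example one opened with `oc debug node/<node>` followed by `chroot /host`). The sketch below is illustrative only: the function name `has_cni_config` is hypothetical, and the set of file extensions it looks for (.conf, .conflist, .json) is an assumption about which files count as CNI configuration.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a directory holds at least one CNI configuration file.
# The extension list is an assumption; adjust it to the runtime in use.

has_cni_config() {
  local dir="${1:-/etc/kubernetes/cni/net.d}"
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.conf' -o -name '*.conflist' -o -name '*.json' \) 2>/dev/null \
    | grep -q .
}

# Demo against a temporary directory so the sketch runs anywhere.
demo_dir="$(mktemp -d)"
has_cni_config "$demo_dir" || echo "no CNI config in $demo_dir"
touch "$demo_dir/10-example.conflist"
has_cni_config "$demo_dir" && echo "CNI config found in $demo_dir"
rm -rf "$demo_dir"
```

On a node, calling `has_cni_config` with no argument checks the path named in the kubelet event.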

            pliurh Peng Liu
            rh-ee-krvoora Krishna Harsha Voora
            Anurag Saxena Anurag Saxena