- Bug
- Resolution: Done
- Critical
- None
- 4.14.z
- Important
- No
- False
Description of problem:
To validate offline migration of OpenShift SDN to OVN-IC at large scale, an SDN-to-OVN-Kubernetes migration was performed on a cluster pre-loaded with the cluster-density-v2 workload. After the networkType field of the Network.config.openshift.io CR was updated to OVNKubernetes and the nodes were rebooted, the nodes remained stuck in NotReady state for more than 6 hours. Investigation turned up the following on the ovnkube-controller container of one of the master nodes:
========================================================================
   ovnkube-controller:
    Container ID:  cri-o://fe8ad966f61423b6cee23c622594b834cd270566d8ea90261e7fd2023d6017ff
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
    Port:          29105/TCP
    Host Port:     29105/TCP
    Command:
      /bin/bash
      -c
      set -xe
      . /ovnkube-lib/ovnkube-lib.sh || exit 1
      start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
    State:          Running
      Started:      Tue, 09 Apr 2024 00:30:26 +0530
    Last State:     Terminated
      Reason:       Error
      Message:      5:31.727933  208842 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-zsdr8
                    I0408 18:55:31.727941  208842 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-zsdr8 in node ip-10-0-44-148.us-west-2.compute.internal
                    I0408 18:55:31.727946  208842 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-zsdr8
                    I0408 18:55:31.727956  208842 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
                    I0408 18:55:31.727965  208842 obj_retry.go:607] Update event received for *factory.egressIPPod openshift-multus/multus-zsdr8
                    I0408 18:55:31.751675  208842 ovs.go:167] Exec(1207): stdout: "7885e979-e03b-48d8-8495-331b0f3ce391\n"
                    I0408 18:55:31.751694  208842 ovs.go:168] Exec(1207): stderr: ""
                    I0408 18:55:31.751709  208842 default_node_network_controller.go:639] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - 7885e979-e03b-48d8-8495-331b0f3ce391 : stderr - : err <nil>
                    I0408 18:55:31.751739  208842 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
                    I0408 18:55:31.751769  208842 ovs.go:164] Exec(1208): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow ma
      Exit Code:    1
      Started:      Tue, 09 Apr 2024 00:20:21 +0530
      Finished:     Tue, 09 Apr 2024 00:25:32 +0530
    Ready:          False
    Restart Count:  55
========================================================================
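Each "Upgrade Hack: node ..., subnet ..." log line above is followed by one ovn-sbctl exec against the Southbound DB, and the previous container instance ran for only about five minutes (00:20:21 to 00:25:32) before exiting with code 1, after 55 restarts, while still working through these lookups. For reference, the query logged in Exec(1208) can be reproduced by hand; this is only a sketch copied from the log line above, assuming a shell that has ovn-sbctl and access to the SB DB (e.g. inside the ovnkube-controller container), and the subnet value is simply the one that happens to appear in the log:

# Sketch only: the per-subnet Southbound DB lookup seen in the Exec(...) log lines.
$ /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid \
    find logical_flow 'match="reg7 == 0 && ip4.dst == 10.131.34.0/23"'
# The logs suggest one such query is issued per node subnet, which is why the
# terminating output is dominated by these Exec(...) lines at cluster-density-v2 scale.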
Version-Release number of selected component (if applicable):
OCP Version: 4.14.10
ovs-vswitchd (Open vSwitch) 3.1.2
How reproducible:
Easily reproducible
Steps to Reproduce:
The steps listed below perform the SDN to OVN-Kubernetes migration via the e2e-benchmarking wrapper; a rough sketch of the equivalent manual CR changes follows the list.
1. git clone https://github.com/cloud-bulldozer/e2e-benchmarking
2. cd e2e-benchmarking/workloads/sdn2ovn/
3. ./run.sh
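For reference, a minimal sketch of what the migration boils down to, assuming the standard documented offline-migration procedure rather than the exact contents of run.sh:

# Sketch of the documented offline SDN -> OVN-Kubernetes migration trigger;
# run.sh wraps this (plus the cluster-density-v2 pre-load) and may differ in detail.
$ oc patch Network.operator.openshift.io cluster --type='merge' \
    --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
$ oc patch Network.config.openshift.io cluster --type='merge' \
    --patch '{"spec":{"networkType":"OVNKubernetes"}}'
# After the nodes are rebooted, check the reported network type and node readiness:
$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'
$ oc get nodes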
Actual results:
The migration does not complete: after the networkType update and reboot, the nodes remain in NotReady state for more than 6 hours and the ovnkube-controller container on the master nodes keeps crash-looping (50+ restarts).
Expected results:
The CNI is migrated to OVN-Kubernetes and the nodes return to Ready state.
Additional info:
Cluster operator details:
========================================================================
$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.10   False       False         True       6h8m    APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal                                  4.14.10   True        False         False      10h
cloud-controller-manager                   4.14.10   True        False         False      10h
cloud-credential                           4.14.10   True        False         False      10h
cluster-autoscaler                         4.14.10   True        False         False      10h
config-operator                            4.14.10   True        False         False      10h
console                                    4.14.10   False       False         False      6h8m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com): Get "https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com": EOF
control-plane-machine-set                  4.14.10   True        False         False      7h6m
csi-snapshot-controller                    4.14.10   True        True          False      10h     CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
dns                                        4.14.10   True        False         False      10h
etcd                                       4.14.10   True        False         True       10h     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:4398145115237214508 name:"ip-10-0-69-222.us-west-2.compute.internal" peerURLs:"https://10.0.69.222:2380" clientURLs:"https://10.0.69.222:2379" Healthy:true Took:1.045453ms Error:<nil>} {Member:ID:7320495613934196650 name:"ip-10-0-34-215.us-west-2.compute.internal" peerURLs:"https://10.0.34.215:2380" clientURLs:"https://10.0.34.215:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.34.215:2379]: context deadline exceeded} {Member:ID:8759161088354208548 name:"ip-10-0-19-163.us-west-2.compute.internal" peerURLs:"https://10.0.19.163:2380" clientURLs:"https://10.0.19.163:2379" Healthy:true Took:2.066578ms Error:<nil>}]...
image-registry                             4.14.10   True        False         False      10h
ingress                                    4.14.10   True        False         False      10h
insights                                   4.14.10   True        False         False      10h
kube-apiserver                             4.14.10   True        False         False      10h
kube-controller-manager                    4.14.10   True        False         False      10h
kube-scheduler                             4.14.10   True        False         False      10h
kube-storage-version-migrator              4.14.10   True        False         False      6h44m
machine-api                                4.14.10   True        False         False      10h
machine-approver                           4.14.10   True        False         False      10h
machine-config                             4.14.10   True        False         False      10h
marketplace                                4.14.10   True        False         False      10h
monitoring                                 4.14.10   True        False         False      10h
network                                    4.14.10   True        True          True       10h     DaemonSet "/openshift-multus/multus" rollout is not making progress - pod multus-26q59 is in CrashLoopBackOff State...
node-tuning                                4.14.10   True        False         False      10h
openshift-apiserver                        4.14.10   False       False         False      6h2m    APIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
openshift-controller-manager               4.14.10   True        False         False      10h
openshift-samples                          4.14.10   True        False         False      10h
operator-lifecycle-manager                 4.14.10   True        False         False      10h
operator-lifecycle-manager-catalog         4.14.10   True        False         False      10h
operator-lifecycle-manager-packageserver   4.14.10   True        False         False      10h
service-ca                                 4.14.10   True        True          False      10h     Progressing: ...
storage                                    4.14.10   True        True          False      10h     AWSEBSProgressing: Waiting for Deployment to deploy pods
========================================================================
$ oc get po -n openshift-etcd
NAME                                                          READY   STATUS              RESTARTS   AGE
etcd-guard-ip-10-0-19-163.us-west-2.compute.internal          0/1     ContainerCreating   1          7h44m
etcd-guard-ip-10-0-34-215.us-west-2.compute.internal          0/1     ContainerCreating   1          7h27m
etcd-guard-ip-10-0-69-222.us-west-2.compute.internal          0/1     ContainerCreating   1          7h34m
etcd-ip-10-0-19-163.us-west-2.compute.internal                4/4     Running             8          10h
etcd-ip-10-0-34-215.us-west-2.compute.internal                4/4     Running             8          10h
etcd-ip-10-0-69-222.us-west-2.compute.internal                4/4     Running             8          10h
revision-pruner-7-ip-10-0-19-163.us-west-2.compute.internal   0/1     Completed           0          7h47m
revision-pruner-7-ip-10-0-34-215.us-west-2.compute.internal   0/1     Completed           0          7h33m
revision-pruner-7-ip-10-0-69-222.us-west-2.compute.internal   0/1     Completed           0          7h39m
========================================================================
$ oc describe po etcd-guard-ip-10-0-19-163.us-west-2.compute.internal -n openshift-etcd
Name:                 etcd-guard-ip-10-0-19-163.us-west-2.compute.internal
Namespace:            openshift-etcd
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      default
Node:                 ip-10-0-19-163.us-west-2.compute.internal/10.0.19.163
Start Time:           Mon, 08 Apr 2024 16:52:40 +0530
Labels:               app=guard
Annotations:          k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.128.44.16/23"],"mac_address":"0a:58:0a:80:2c:10","gateway_ips":["10.128.44.1"],"routes":[{"dest":"10.128.0...
                      k8s.v1.cni.cncf.io/network-status: [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.128.0.17" ], "default": true, "dns": {} }]
Status:               Running
IP:
IPs:                  <none>
Containers:
  guard:
    Container ID:
    Image:          quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c691f68a37812bf1501bc243ebabf6cb845a927873960e294c677f21fcade49
    Image ID:
    Port:           <none>
    Host Port:      <none>
    Command:
      /bin/bash
    Args:
      -c
      # properly handle TERM and exit as soon as it is signaled
      set -euo pipefail
      trap 'jobs -p | xargs -r kill; exit 0' TERM
      sleep infinity & wait
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted. The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  1
    Requests:
      cpu:        10m
      memory:     5Mi
    Readiness:    http-get https://10.0.19.163:9980/readyz delay=0s timeout=5s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwq7f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-fwq7f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/etcd:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
Events:
  Type     Reason           Age                       From     Message
  ----     ------           ----                      ----     -------
  Warning  FailedMount      37m (x179 over 6h27m)     kubelet  MountVolume.SetUp failed for volume "kube-api-access-fwq7f" : [object "openshift-etcd"/"kube-root-ca.crt" not registered, object "openshift-etcd"/"openshift-service-ca.crt" not registered]
  Warning  NetworkNotReady  2m2s (x11537 over 6h27m)  kubelet  network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
========================================================================
$ oc get po -n openshift-ovn-kubernetes -o wide | egrep -i "ip-10-0-19-163.us-west-2.compute.internal|ip-10-0-34-215.us-west-2.compute.internal|ip-10-0-69-222.us-west-2.compute.internal"
ovnkube-control-plane-58b785bcd-5hmv5   2/2   Running   2                6h32m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>   <none>
ovnkube-control-plane-58b785bcd-rm59v   2/2   Running   2                6h32m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>   <none>
ovnkube-control-plane-58b785bcd-shjhf   2/2   Running   2                6h32m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>   <none>
ovnkube-node-8742m                      7/8   Running   57 (2m23s ago)   6h26m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>   <none>
ovnkube-node-nrj8k                      7/8   Running   56 (3m24s ago)   6h27m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>   <none>
ovnkube-node-zqg8k                      7/8   Running   64 (9m58s ago)   6h27m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>   <none>
========================================================================
Snippet from oc describe on pod ovnkube-node-8742m:
   ovnkube-controller:
    Container ID:  cri-o://116653ee3b5d24d2957e8a371e81fd9edc201b92f6f9a7463d00815c102f2758
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
    Port:          29105/TCP
    Host Port:     29105/TCP
    Command:
      /bin/bash
      -c
      set -xe
      . /ovnkube-lib/ovnkube-lib.sh || exit 1
      start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
    State:          Running
      Started:      Tue, 09 Apr 2024 00:36:44 +0530
    Last State:     Terminated
      Reason:       Error
      Message:      SBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - aca9df32-208f-48f5-a6e0-acfb0e2b4d5e : stderr - : err <nil>
                    I0408 19:06:31.987410  209557 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
                    I0408 19:06:31.987425  209557 ovs.go:164] Exec(1091): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow match="reg7 == 0 && ip4.dst == 10.131.34.0/23"
                    I0408 19:06:32.088331  209557 obj_retry.go:555] Update event received for resource *v1.Pod, old object is equal to new: false
                    I0408 19:06:32.088355  209557 default_network_controller.go:650] Recording update event on pod openshift-multus/multus-qdwc2
                    I0408 19:06:32.088371  209557 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-qdwc2
                    I0408 19:06:32.088382  209557 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-qdwc2 in node ip-10-0-28-47.us-west-2.compute.internal
                    I0408 19:06:32.088388  209557 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-qdwc2
                    I0408 19:06:32.088395  209557 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
                    I0408 19:06:32.088401  209557 obj_retry.go:607] Update event received for *facto
      Exit Code:    1
      Started:      Tue, 09 Apr 2024 00:31:26 +0530
      Finished:     Tue, 09 Apr 2024 00:36:32 +0530
    Ready:          False
    Restart Count:  56
    Requests:
      cpu:      10m
      memory:   600Mi
    Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=30s #success=1 #failure=3
    Environment:
      KUBERNETES_SERVICE_PORT:          6443
      KUBERNETES_SERVICE_HOST:          api-int.har-sdn-120.perfscale.devcluster.openshift.com
      OVN_CONTROLLER_INACTIVITY_PROBE:  180000
      OVN_KUBE_LOG_LEVEL:               4
      K8S_NODE:                          (v1:spec.nodeName)
      POD_NAME:                         ovnkube-node-8742m (v1:metadata.name)
    Mounts:
      /cni-bin-dir from host-cni-bin (rw)
      /env from env-overrides (rw)
      /etc/cni/net.d from host-cni-netd (rw)
      /etc/openvswitch from etc-openvswitch (rw)
      /etc/ovn/ from etc-openvswitch (rw)
      /etc/systemd/system from systemd-units (ro)
      /host from host-slash (ro)
      /ovnkube-lib from ovnkube-script-lib (rw)
      /run/netns from host-run-netns (ro)
      /run/openvswitch from run-openvswitch (rw)
      /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
      /run/ovn/ from run-ovn (rw)
      /run/ovnkube-config/ from ovnkube-config (rw)
      /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
      /var/lib/kubelet from host-kubelet (ro)
      /var/lib/openvswitch from var-lib-openvswitch (rw)
      /var/log/ovnkube/ from etc-openvswitch (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rg296 (ro)
========================================================================
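Given the ovnkube-controller readiness probe (test -f /etc/cni/net.d/10-ovn-kubernetes.conf) and the kubelet NetworkNotReady event above, one way to confirm that the OVN-Kubernetes CNI config was never written on an affected node is sketched below. The node name is taken from the outputs above, and oc debug assumes the NotReady node still admits a host-network debug pod; otherwise SSH to the node and run the ls directly.

# Troubleshooting sketch: list the kubelet CNI config directory named in the
# NetworkNotReady event to see whether 10-ovn-kubernetes.conf ever appeared on the host.
$ oc debug node/ip-10-0-19-163.us-west-2.compute.internal -- chroot /host \
    ls -l /etc/kubernetes/cni/net.d/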