Bug
Resolution: Done
Critical
4.14.z
Quality / Stability / Reliability
Important
Description of problem:
To validate offline SDN migration from OpenShift SDN to OVN-IC at large scale, an SDN to OVN-Kubernetes migration was performed on a cluster pre-loaded with the cluster-density-v2 workload.
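For reference, the offline migration is typically triggered by patching the cluster network configuration along these lines (a minimal sketch following the standard documented procedure; the exact commands issued by the run.sh script referenced under Steps to Reproduce may differ):
$ # illustrative only; the migration in this report was driven by run.sh
$ oc patch Network.operator.openshift.io cluster --type='merge' \
    --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
$ oc patch Network.config.openshift.io cluster --type='merge' \
    --patch '{"spec":{"networkType":"OVNKubernetes"}}'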
After the networkType field of the Network.config.openshift.io CR was updated to OVNKubernetes and the nodes were rebooted, the cluster nodes have been stuck in NotReady state for more than 6 hours. Upon investigation, the following was found in the ovnkube-controller container on one of the master nodes:
========================================================================
ovnkube-controller:
Container ID: cri-o://fe8ad966f61423b6cee23c622594b834cd270566d8ea90261e7fd2023d6017ff
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
Port: 29105/TCP
Host Port: 29105/TCP
Command:
/bin/bash
-c
set -xe
. /ovnkube-lib/ovnkube-lib.sh || exit 1
start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
State: Running
Started: Tue, 09 Apr 2024 00:30:26 +0530
Last State: Terminated
Reason: Error
Message: 5:31.727933 208842 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-zsdr8
I0408 18:55:31.727941 208842 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-zsdr8 in node ip-10-0-44-148.us-west-2.compute.internal
I0408 18:55:31.727946 208842 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-zsdr8
I0408 18:55:31.727956 208842 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
I0408 18:55:31.727965 208842 obj_retry.go:607] Update event received for *factory.egressIPPod openshift-multus/multus-zsdr8
I0408 18:55:31.751675 208842 ovs.go:167] Exec(1207): stdout: "7885e979-e03b-48d8-8495-331b0f3ce391\n"
I0408 18:55:31.751694 208842 ovs.go:168] Exec(1207): stderr: ""
I0408 18:55:31.751709 208842 default_node_network_controller.go:639] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - 7885e979-e03b-48d8-8495-331b0f3ce391 : stderr - : err <nil>
I0408 18:55:31.751739 208842 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
I0408 18:55:31.751769 208842 ovs.go:164] Exec(1208): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow ma
Exit Code: 1
Started: Tue, 09 Apr 2024 00:20:21 +0530
Finished: Tue, 09 Apr 2024 00:25:32 +0530
Ready: False
Restart Count: 55
========================================================================
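The terminated container's log shows the controller looping over per-node "Upgrade Hack" checks via ovn-sbctl before exiting. For debugging, the same southbound query from the last log line can be re-run manually inside the affected ovnkube-node pod; a sketch, where the pod name, container name, and subnet are illustrative assumptions:
$ # assumes the sbdb container in the ovnkube-node pod carries ovn-sbctl
$ oc -n openshift-ovn-kubernetes exec ovnkube-node-nrj8k -c sbdb -- \
    ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid \
    find logical_flow 'match="reg7 == 0 && ip4.dst == 10.131.34.0/23"'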
Version-Release number of selected component (if applicable):
OCP Version: 4.14.10
ovs-vswitchd (Open vSwitch) 3.1.2
How reproducible:
Easily reproducible
Steps to Reproduce:
The steps listed below perform the SDN to OVN-Kubernetes migration; verification commands follow the steps.
1. git clone https://github.com/cloud-bulldozer/e2e-benchmarking
2. cd e2e-benchmarking/workloads/sdn2ovn/
3. ./run.sh
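After run.sh completes, the migration outcome can be verified with commands like the following (illustrative checks, not part of the script):
$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'   # should report OVNKubernetes
$ oc get nodes    # all nodes should be Ready
$ oc get co       # cluster operators should be Available and not Degraded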
Actual results:
The migration does not complete: nodes remain in NotReady state for more than 6 hours, and the ovnkube-controller container crash-loops on the master nodes (Restart Count 55+).
Expected results:
CNI migrated to OVN-Kubernetes, with all nodes returning to Ready state.
Additional info:
Cluster operators details:
========================================================================
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.14.10 False False True 6h8m APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal 4.14.10 True False False 10h
cloud-controller-manager 4.14.10 True False False 10h
cloud-credential 4.14.10 True False False 10h
cluster-autoscaler 4.14.10 True False False 10h
config-operator 4.14.10 True False False 10h
console 4.14.10 False False False 6h8m RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com): Get "https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com": EOF
control-plane-machine-set 4.14.10 True False False 7h6m
csi-snapshot-controller 4.14.10 True True False 10h CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
dns 4.14.10 True False False 10h
etcd 4.14.10 True False True 10h EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:4398145115237214508 name:"ip-10-0-69-222.us-west-2.compute.internal" peerURLs:"https://10.0.69.222:2380" clientURLs:"https://10.0.69.222:2379" Healthy:true Took:1.045453ms Error:<nil>} {Member:ID:7320495613934196650 name:"ip-10-0-34-215.us-west-2.compute.internal" peerURLs:"https://10.0.34.215:2380" clientURLs:"https://10.0.34.215:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.34.215:2379]: context deadline exceeded} {Member:ID:8759161088354208548 name:"ip-10-0-19-163.us-west-2.compute.internal" peerURLs:"https://10.0.19.163:2380" clientURLs:"https://10.0.19.163:2379" Healthy:true Took:2.066578ms Error:<nil>}]...
image-registry 4.14.10 True False False 10h
ingress 4.14.10 True False False 10h
insights 4.14.10 True False False 10h
kube-apiserver 4.14.10 True False False 10h
kube-controller-manager 4.14.10 True False False 10h
kube-scheduler 4.14.10 True False False 10h
kube-storage-version-migrator 4.14.10 True False False 6h44m
machine-api 4.14.10 True False False 10h
machine-approver 4.14.10 True False False 10h
machine-config 4.14.10 True False False 10h
marketplace 4.14.10 True False False 10h
monitoring 4.14.10 True False False 10h
network 4.14.10 True True True 10h DaemonSet "/openshift-multus/multus" rollout is not making progress - pod multus-26q59 is in CrashLoopBackOff State...
node-tuning 4.14.10 True False False 10h
openshift-apiserver 4.14.10 False False False 6h2m APIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
openshift-controller-manager 4.14.10 True False False 10h
openshift-samples 4.14.10 True False False 10h
operator-lifecycle-manager 4.14.10 True False False 10h
operator-lifecycle-manager-catalog 4.14.10 True False False 10h
operator-lifecycle-manager-packageserver 4.14.10 True False False 10h
service-ca 4.14.10 True True False 10h Progressing: ...
storage 4.14.10 True True False 10h AWSEBSProgressing: Waiting for Deployment to deploy pods
$
========================================================================
$ oc get po -n openshift-etcd
NAME READY STATUS RESTARTS AGE
etcd-guard-ip-10-0-19-163.us-west-2.compute.internal 0/1 ContainerCreating 1 7h44m
etcd-guard-ip-10-0-34-215.us-west-2.compute.internal 0/1 ContainerCreating 1 7h27m
etcd-guard-ip-10-0-69-222.us-west-2.compute.internal 0/1 ContainerCreating 1 7h34m
etcd-ip-10-0-19-163.us-west-2.compute.internal 4/4 Running 8 10h
etcd-ip-10-0-34-215.us-west-2.compute.internal 4/4 Running 8 10h
etcd-ip-10-0-69-222.us-west-2.compute.internal 4/4 Running 8 10h
revision-pruner-7-ip-10-0-19-163.us-west-2.compute.internal 0/1 Completed 0 7h47m
revision-pruner-7-ip-10-0-34-215.us-west-2.compute.internal 0/1 Completed 0 7h33m
revision-pruner-7-ip-10-0-69-222.us-west-2.compute.internal 0/1 Completed 0 7h39m
========================================================================
========================================================================
$ oc describe po etcd-guard-ip-10-0-19-163.us-west-2.compute.internal -n openshift-etcd
Name: etcd-guard-ip-10-0-19-163.us-west-2.compute.internal
Namespace: openshift-etcd
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: default
Node: ip-10-0-19-163.us-west-2.compute.internal/10.0.19.163
Start Time: Mon, 08 Apr 2024 16:52:40 +0530
Labels: app=guard
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.44.16/23"],"mac_address":"0a:58:0a:80:2c:10","gateway_ips":["10.128.44.1"],"routes":[{"dest":"10.128.0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.128.0.17"
],
"default": true,
"dns": {}
}]
Status: Running
IP:
IPs: <none>
Containers:
guard:
Container ID:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c691f68a37812bf1501bc243ebabf6cb845a927873960e294c677f21fcade49
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/bash
Args:
-c
# properly handle TERM and exit as soon as it is signaled
set -euo pipefail
trap 'jobs -p | xargs -r kill; exit 0' TERM
sleep infinity & wait
State: Waiting
Reason: ContainerCreating
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 1
Requests:
cpu: 10m
memory: 5Mi
Readiness: http-get https://10.0.19.163:9980/readyz delay=0s timeout=5s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwq7f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-fwq7f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/etcd:NoSchedule op=Exists
node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 37m (x179 over 6h27m) kubelet MountVolume.SetUp failed for volume "kube-api-access-fwq7f" : [object "openshift-etcd"/"kube-root-ca.crt" not registered, object "openshift-etcd"/"openshift-service-ca.crt" not registered]
Warning NetworkNotReady 2m2s (x11537 over 6h27m) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
$
$
========================================================================
$ oc get po -n openshift-ovn-kubernetes -o wide | egrep -i "ip-10-0-19-163.us-west-2.compute.internal|ip-10-0-34-215.us-west-2.compute.internal|ip-10-0-69-222.us-west-2.compute.internal"
ovnkube-control-plane-58b785bcd-5hmv5 2/2 Running 2 6h32m 10.0.19.163 ip-10-0-19-163.us-west-2.compute.internal <none> <none>
ovnkube-control-plane-58b785bcd-rm59v 2/2 Running 2 6h32m 10.0.69.222 ip-10-0-69-222.us-west-2.compute.internal <none> <none>
ovnkube-control-plane-58b785bcd-shjhf 2/2 Running 2 6h32m 10.0.34.215 ip-10-0-34-215.us-west-2.compute.internal <none> <none>
ovnkube-node-8742m 7/8 Running 57 (2m23s ago) 6h26m 10.0.34.215 ip-10-0-34-215.us-west-2.compute.internal <none> <none>
ovnkube-node-nrj8k 7/8 Running 56 (3m24s ago) 6h27m 10.0.19.163 ip-10-0-19-163.us-west-2.compute.internal <none> <none>
ovnkube-node-zqg8k 7/8 Running 64 (9m58s ago) 6h27m 10.0.69.222 ip-10-0-69-222.us-west-2.compute.internal <none> <none>
$
========================================================================
========================================================================
Snippet from oc describe on the ovnkube-node-8742m pod:
ovnkube-controller:
Container ID: cri-o://116653ee3b5d24d2957e8a371e81fd9edc201b92f6f9a7463d00815c102f2758
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
Port: 29105/TCP
Host Port: 29105/TCP
Command:
/bin/bash
-c
set -xe
. /ovnkube-lib/ovnkube-lib.sh || exit 1
start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
State: Running
Started: Tue, 09 Apr 2024 00:36:44 +0530
Last State: Terminated
Reason: Error
Message: SBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - aca9df32-208f-48f5-a6e0-acfb0e2b4d5e : stderr - : err <nil>
I0408 19:06:31.987410 209557 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
I0408 19:06:31.987425 209557 ovs.go:164] Exec(1091): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow match="reg7 == 0 && ip4.dst == 10.131.34.0/23"
I0408 19:06:32.088331 209557 obj_retry.go:555] Update event received for resource *v1.Pod, old object is equal to new: false
I0408 19:06:32.088355 209557 default_network_controller.go:650] Recording update event on pod openshift-multus/multus-qdwc2
I0408 19:06:32.088371 209557 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-qdwc2
I0408 19:06:32.088382 209557 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-qdwc2 in node ip-10-0-28-47.us-west-2.compute.internal
I0408 19:06:32.088388 209557 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-qdwc2
I0408 19:06:32.088395 209557 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
I0408 19:06:32.088401 209557 obj_retry.go:607] Update event received for *facto
Exit Code: 1
Started: Tue, 09 Apr 2024 00:31:26 +0530
Finished: Tue, 09 Apr 2024 00:36:32 +0530
Ready: False
Restart Count: 56
Requests:
cpu: 10m
memory: 600Mi
Readiness: exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=30s #success=1 #failure=3
Environment:
KUBERNETES_SERVICE_PORT: 6443
KUBERNETES_SERVICE_HOST: api-int.har-sdn-120.perfscale.devcluster.openshift.com
OVN_CONTROLLER_INACTIVITY_PROBE: 180000
OVN_KUBE_LOG_LEVEL: 4
K8S_NODE: (v1:spec.nodeName)
POD_NAME: ovnkube-node-8742m (v1:metadata.name)
Mounts:
/cni-bin-dir from host-cni-bin (rw)
/env from env-overrides (rw)
/etc/cni/net.d from host-cni-netd (rw)
/etc/openvswitch from etc-openvswitch (rw)
/etc/ovn/ from etc-openvswitch (rw)
/etc/systemd/system from systemd-units (ro)
/host from host-slash (ro)
/ovnkube-lib from ovnkube-script-lib (rw)
/run/netns from host-run-netns (ro)
/run/openvswitch from run-openvswitch (rw)
/run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
/run/ovn/ from run-ovn (rw)
/run/ovnkube-config/ from ovnkube-config (rw)
/var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
/var/lib/kubelet from host-kubelet (ro)
/var/lib/openvswitch from var-lib-openvswitch (rw)
/var/log/ovnkube/ from etc-openvswitch (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rg296 (ro)