OpenShift Bugs / OCPBUGS-31894

[120 Nodes] SDN --> OVNKubernetes Offline Migration Fails after reboots

      Description of problem:

      To validate offline migration from OpenShift SDN to OVN-Kubernetes (OVN-IC) at large scale, we performed an SDN-to-OVN-K migration on a 120-node cluster pre-loaded with the cluster-density-v2 workload.
      
      After updating the networkType field of the Network.config.openshift.io CR to OVNKubernetes and rebooting the nodes, the cluster's nodes remained stuck in the NotReady state for more than 6 hours. On investigation, the following was found in the ovnkube-controller container on one of the master nodes:
      
      ========================================================================
      
        ovnkube-controller:
          Container ID:  cri-o://fe8ad966f61423b6cee23c622594b834cd270566d8ea90261e7fd2023d6017ff
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Port:          29105/TCP
          Host Port:     29105/TCP
          Command:
            /bin/bash
            -c
            set -xe
            . /ovnkube-lib/ovnkube-lib.sh || exit 1
            start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
            
          State:       Running
            Started:   Tue, 09 Apr 2024 00:30:26 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   5:31.727933  208842 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-zsdr8
      I0408 18:55:31.727941  208842 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-zsdr8 in node ip-10-0-44-148.us-west-2.compute.internal
      I0408 18:55:31.727946  208842 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-zsdr8
      I0408 18:55:31.727956  208842 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
      I0408 18:55:31.727965  208842 obj_retry.go:607] Update event received for *factory.egressIPPod openshift-multus/multus-zsdr8
      I0408 18:55:31.751675  208842 ovs.go:167] Exec(1207): stdout: "7885e979-e03b-48d8-8495-331b0f3ce391\n"
      I0408 18:55:31.751694  208842 ovs.go:168] Exec(1207): stderr: ""
      I0408 18:55:31.751709  208842 default_node_network_controller.go:639] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - 7885e979-e03b-48d8-8495-331b0f3ce391 : stderr -  : err <nil>
      I0408 18:55:31.751739  208842 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
      I0408 18:55:31.751769  208842 ovs.go:164] Exec(1208): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow ma
            Exit Code:    1
            Started:      Tue, 09 Apr 2024 00:20:21 +0530
            Finished:     Tue, 09 Apr 2024 00:25:32 +0530
          Ready:          False
          Restart Count:  55
      
      ========================================================================
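
      For reference, the failing lookup in the log can be re-run by hand. The "Upgrade Hack: checkOVNSBNodeLRSR" loop issues one ovn-sbctl query per node subnet, so on a 120-node cluster each restart walks through roughly 120 sequential lookups. A sketch of the manual query, run from the sbdb container of the local ovnkube-node pod (pod name from this cluster; match string copied verbatim from the log above):

      $ oc exec -n openshift-ovn-kubernetes ovnkube-node-8742m -c sbdb -- \
          ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid \
          find logical_flow 'match="reg7 == 0 && ip4.dst == 10.128.4.0/23"'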

      Version-Release number of selected component (if applicable):

          OCP Version: 4.14.10
          ovs-vswitchd (Open vSwitch) 3.1.2

      How reproducible:

      Easily reproducible

      Steps to Reproduce:

      The steps below perform the SDN-to-OVN-K migration (a sketch of the commands the script automates follows the list):

          1. git clone https://github.com/cloud-bulldozer/e2e-benchmarking
          2. cd e2e-benchmarking/workloads/sdn2ovn/
          3. ./run.sh
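
      run.sh automates the documented offline-migration procedure. A minimal sketch of the core commands it drives (command forms taken from the standard SDN-to-OVN-Kubernetes migration flow; the script's exact flags may differ):

          $ oc patch Network.operator.openshift.io cluster --type='merge' \
              --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'
          $ oc patch Network.config.openshift.io cluster --type='merge' \
              --patch '{"spec":{"networkType":"OVNKubernetes"}}'
          # reboot every node, then watch the nodes come back
          $ oc get nodes -w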

      Actual results:

          Nodes remain NotReady for more than 6 hours after the reboot; ovnkube-controller crash-loops on the master nodes and the migration never completes.

      Expected results:

          The CNI migrates to OVN-Kubernetes and all nodes return to the Ready state.

      Additional info:

      Cluster operators details:
      ========================================================================
      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.10   False       False         True       6h8m    APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
      baremetal                                  4.14.10   True        False         False      10h     
      cloud-controller-manager                   4.14.10   True        False         False      10h     
      cloud-credential                           4.14.10   True        False         False      10h     
      cluster-autoscaler                         4.14.10   True        False         False      10h     
      config-operator                            4.14.10   True        False         False      10h     
      console                                    4.14.10   False       False         False      6h8m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com): Get "https://console-openshift-console.apps.har-sdn-120.perfscale.devcluster.openshift.com": EOF
      control-plane-machine-set                  4.14.10   True        False         False      7h6m    
      csi-snapshot-controller                    4.14.10   True        True          False      10h     CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
      dns                                        4.14.10   True        False         False      10h     
      etcd                                       4.14.10   True        False         True       10h     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:4398145115237214508 name:"ip-10-0-69-222.us-west-2.compute.internal" peerURLs:"https://10.0.69.222:2380" clientURLs:"https://10.0.69.222:2379"  Healthy:true Took:1.045453ms Error:<nil>} {Member:ID:7320495613934196650 name:"ip-10-0-34-215.us-west-2.compute.internal" peerURLs:"https://10.0.34.215:2380" clientURLs:"https://10.0.34.215:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.34.215:2379]: context deadline exceeded} {Member:ID:8759161088354208548 name:"ip-10-0-19-163.us-west-2.compute.internal" peerURLs:"https://10.0.19.163:2380" clientURLs:"https://10.0.19.163:2379"  Healthy:true Took:2.066578ms Error:<nil>}]...
      image-registry                             4.14.10   True        False         False      10h     
      ingress                                    4.14.10   True        False         False      10h     
      insights                                   4.14.10   True        False         False      10h     
      kube-apiserver                             4.14.10   True        False         False      10h     
      kube-controller-manager                    4.14.10   True        False         False      10h     
      kube-scheduler                             4.14.10   True        False         False      10h     
      kube-storage-version-migrator              4.14.10   True        False         False      6h44m   
      machine-api                                4.14.10   True        False         False      10h     
      machine-approver                           4.14.10   True        False         False      10h     
      machine-config                             4.14.10   True        False         False      10h     
      marketplace                                4.14.10   True        False         False      10h     
      monitoring                                 4.14.10   True        False         False      10h     
      network                                    4.14.10   True        True          True       10h     DaemonSet "/openshift-multus/multus" rollout is not making progress - pod multus-26q59 is in CrashLoopBackOff State...
      node-tuning                                4.14.10   True        False         False      10h     
      openshift-apiserver                        4.14.10   False       False         False      6h2m    APIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      openshift-controller-manager               4.14.10   True        False         False      10h     
      openshift-samples                          4.14.10   True        False         False      10h     
      operator-lifecycle-manager                 4.14.10   True        False         False      10h     
      operator-lifecycle-manager-catalog         4.14.10   True        False         False      10h     
      operator-lifecycle-manager-packageserver   4.14.10   True        False         False      10h     
      service-ca                                 4.14.10   True        True          False      10h     Progressing: ...
      storage                                    4.14.10   True        True          False      10h     AWSEBSProgressing: Waiting for Deployment to deploy pods
      $
      ========================================================================
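
      The etcd operator reports member ip-10-0-34-215 as unhealthy. One way to confirm from a healthy etcd pod (a sketch; the etcdctl container in the etcd static pods ships a preconfigured etcdctl):

      $ oc rsh -n openshift-etcd -c etcdctl etcd-ip-10-0-19-163.us-west-2.compute.internal
      sh-5.1# etcdctl endpoint health --cluster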
      
      $ oc get po -n openshift-etcd
      NAME                                                          READY   STATUS              RESTARTS   AGE
      etcd-guard-ip-10-0-19-163.us-west-2.compute.internal          0/1     ContainerCreating   1          7h44m
      etcd-guard-ip-10-0-34-215.us-west-2.compute.internal          0/1     ContainerCreating   1          7h27m
      etcd-guard-ip-10-0-69-222.us-west-2.compute.internal          0/1     ContainerCreating   1          7h34m
      etcd-ip-10-0-19-163.us-west-2.compute.internal                4/4     Running             8          10h
      etcd-ip-10-0-34-215.us-west-2.compute.internal                4/4     Running             8          10h
      etcd-ip-10-0-69-222.us-west-2.compute.internal                4/4     Running             8          10h
      revision-pruner-7-ip-10-0-19-163.us-west-2.compute.internal   0/1     Completed           0          7h47m
      revision-pruner-7-ip-10-0-34-215.us-west-2.compute.internal   0/1     Completed           0          7h33m
      revision-pruner-7-ip-10-0-69-222.us-west-2.compute.internal   0/1     Completed           0          7h39m
      ========================================================================
      ========================================================================
      $ oc describe po etcd-guard-ip-10-0-19-163.us-west-2.compute.internal  -n openshift-etcd
      Name:                 etcd-guard-ip-10-0-19-163.us-west-2.compute.internal
      Namespace:            openshift-etcd
      Priority:             2000000000
      Priority Class Name:  system-cluster-critical
      Service Account:      default
      Node:                 ip-10-0-19-163.us-west-2.compute.internal/10.0.19.163
      Start Time:           Mon, 08 Apr 2024 16:52:40 +0530
      Labels:               app=guard
      Annotations:          k8s.ovn.org/pod-networks:
                              {"default":{"ip_addresses":["10.128.44.16/23"],"mac_address":"0a:58:0a:80:2c:10","gateway_ips":["10.128.44.1"],"routes":[{"dest":"10.128.0...
                            k8s.v1.cni.cncf.io/network-status:
                              [{
                                  "name": "openshift-sdn",
                                  "interface": "eth0",
                                  "ips": [
                                      "10.128.0.17"
                                  ],
                                  "default": true,
                                  "dns": {}
                              }]
      Status:               Running
      IP:                   
      IPs:                  <none>
      Containers:
        guard:
          Container ID: 
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c691f68a37812bf1501bc243ebabf6cb845a927873960e294c677f21fcade49
          Image ID:      
          Port:          <none>
          Host Port:     <none>
          Command:
            /bin/bash
          Args:
            -c
            # properly handle TERM and exit as soon as it is signaled
            set -euo pipefail
            trap 'jobs -p | xargs -r kill; exit 0' TERM
            sleep infinity & wait
            
          State:          Waiting
            Reason:       ContainerCreating
          Last State:     Terminated
            Reason:       ContainerStatusUnknown
            Message:      The container could not be located when the pod was deleted.  The container used to be Running
            Exit Code:    137
            Started:      Mon, 01 Jan 0001 00:00:00 +0000
            Finished:     Mon, 01 Jan 0001 00:00:00 +0000
          Ready:          False
          Restart Count:  1
          Requests:
            cpu:        10m
            memory:     5Mi
          Readiness:    http-get https://10.0.19.163:9980/readyz delay=0s timeout=5s period=5s #success=1 #failure=3
          Environment:  <none>
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fwq7f (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             False 
        ContainersReady   False 
        PodScheduled      True 
      Volumes:
        kube-api-access-fwq7f:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/etcd:NoSchedule op=Exists
                                   node-role.kubernetes.io/master:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists
      Events:
        Type     Reason           Age                       From     Message
        ----     ------           ----                      ----     -------
        Warning  FailedMount      37m (x179 over 6h27m)     kubelet  MountVolume.SetUp failed for volume "kube-api-access-fwq7f" : [object "openshift-etcd"/"kube-root-ca.crt" not registered, object "openshift-etcd"/"openshift-service-ca.crt" not registered]
        Warning  NetworkNotReady  2m2s (x11537 over 6h27m)  kubelet  network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
      $
      ========================================================================
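      The NetworkNotReady events above point at a missing CNI config on the node. A quick check of the directory kubelet is complaining about (a sketch using oc debug; the path is taken verbatim from the kubelet error):

      $ oc debug node/ip-10-0-19-163.us-west-2.compute.internal -- \
          chroot /host ls -l /etc/kubernetes/cni/net.d/
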
      $ oc get po -n openshift-ovn-kubernetes -o wide | egrep -i "ip-10-0-19-163.us-west-2.compute.internal|ip-10-0-34-215.us-west-2.compute.internal|ip-10-0-69-222.us-west-2.compute.internal"
      ovnkube-control-plane-58b785bcd-5hmv5   2/2     Running            2                6h32m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>           <none>
      ovnkube-control-plane-58b785bcd-rm59v   2/2     Running            2                6h32m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>           <none>
      ovnkube-control-plane-58b785bcd-shjhf   2/2     Running            2                6h32m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-8742m                      7/8     Running            57 (2m23s ago)   6h26m   10.0.34.215   ip-10-0-34-215.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-nrj8k                      7/8     Running            56 (3m24s ago)   6h27m   10.0.19.163   ip-10-0-19-163.us-west-2.compute.internal   <none>           <none>
      ovnkube-node-zqg8k                      7/8     Running            64 (9m58s ago)   6h27m   10.0.69.222   ip-10-0-69-222.us-west-2.compute.internal   <none>           <none>
      $
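      With restart counts in the mid-50s and climbing, the tail of the previous ovnkube-controller attempt is the most useful artifact to capture:

      $ oc logs -n openshift-ovn-kubernetes ovnkube-node-8742m \
          -c ovnkube-controller --previous > ovnkube-controller-prev.log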
      ========================================================================
      Snippet from oc describe on the ovnkube-node pod (ovnkube-node-8742m):
        ovnkube-controller:
          Container ID:  cri-o://116653ee3b5d24d2957e8a371e81fd9edc201b92f6f9a7463d00815c102f2758
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:368149fc734294fe7c851246f91738ef4d652fc83c32e2477d4eb20f1f41643a
          Port:          29105/TCP
          Host Port:     29105/TCP
          Command:
            /bin/bash
            -c
            set -xe
            . /ovnkube-lib/ovnkube-lib.sh || exit 1
            start-ovnkube-node ${OVN_KUBE_LOG_LEVEL} 29103 29105
            
          State:       Running
            Started:   Tue, 09 Apr 2024 00:36:44 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   SBNodeLRSR for node - 10.128.4.0/23 : match match="reg7 == 0 && ip4.dst == 10.128.4.0/23" : stdout - aca9df32-208f-48f5-a6e0-acfb0e2b4d5e : stderr -  : err <nil>
      I0408 19:06:31.987410  209557 default_node_network_controller.go:878] Upgrade Hack: node ip-10-0-28-110.us-west-2.compute.internal, subnet 10.131.34.0/23
      I0408 19:06:31.987425  209557 ovs.go:164] Exec(1091): /usr/bin/ovn-sbctl --timeout=15 --no-leader-only --bare --columns _uuid find logical_flow match="reg7 == 0 && ip4.dst == 10.131.34.0/23"
      I0408 19:06:32.088331  209557 obj_retry.go:555] Update event received for resource *v1.Pod, old object is equal to new: false
      I0408 19:06:32.088355  209557 default_network_controller.go:650] Recording update event on pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088371  209557 obj_retry.go:607] Update event received for *v1.Pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088382  209557 ovn.go:132] Ensuring zone remote for Pod openshift-multus/multus-qdwc2 in node ip-10-0-28-47.us-west-2.compute.internal
      I0408 19:06:32.088388  209557 default_network_controller.go:679] Recording success event on pod openshift-multus/multus-qdwc2
      I0408 19:06:32.088395  209557 obj_retry.go:555] Update event received for resource *factory.egressIPPod, old object is equal to new: false
      I0408 19:06:32.088401  209557 obj_retry.go:607] Update event received for *facto
            Exit Code:    1
            Started:      Tue, 09 Apr 2024 00:31:26 +0530
            Finished:     Tue, 09 Apr 2024 00:36:32 +0530
          Ready:          False
          Restart Count:  56
          Requests:
            cpu:      10m
            memory:   600Mi
          Readiness:  exec [test -f /etc/cni/net.d/10-ovn-kubernetes.conf] delay=5s timeout=1s period=30s #success=1 #failure=3
          Environment:
            KUBERNETES_SERVICE_PORT:          6443
            KUBERNETES_SERVICE_HOST:          api-int.har-sdn-120.perfscale.devcluster.openshift.com
            OVN_CONTROLLER_INACTIVITY_PROBE:  180000
            OVN_KUBE_LOG_LEVEL:               4
            K8S_NODE:                          (v1:spec.nodeName)
            POD_NAME:                         ovnkube-node-8742m (v1:metadata.name)
          Mounts:
            /cni-bin-dir from host-cni-bin (rw)
            /env from env-overrides (rw)
            /etc/cni/net.d from host-cni-netd (rw)
            /etc/openvswitch from etc-openvswitch (rw)
            /etc/ovn/ from etc-openvswitch (rw)
            /etc/systemd/system from systemd-units (ro)
            /host from host-slash (ro)
            /ovnkube-lib from ovnkube-script-lib (rw)
            /run/netns from host-run-netns (ro)
            /run/openvswitch from run-openvswitch (rw)
            /run/ovn-kubernetes/ from host-run-ovn-kubernetes (rw)
            /run/ovn/ from run-ovn (rw)
            /run/ovnkube-config/ from ovnkube-config (rw)
            /var/lib/cni/networks/ovn-k8s-cni-overlay from host-var-lib-cni-networks-ovn-kubernetes (rw)
            /var/lib/kubelet from host-kubelet (ro)
            /var/lib/openvswitch from var-lib-openvswitch (rw)
            /var/log/ovnkube/ from etc-openvswitch (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rg296 (ro)
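
      Each terminated attempt above runs for roughly five minutes (Started 00:31:26, Finished 00:36:32) before exiting with code 1, which is consistent with the per-node ovn-sbctl lookups (--timeout=15) eventually failing or timing out partway through the node list. The crash cadence across the masters can be watched with:

      $ oc get pods -n openshift-ovn-kubernetes -o wide -w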


            Assignee: Peng Liu (pliurh)
            Reporter: Krishna Harsha Voora (rh-ee-krvoora)
            QA Contact: Anurag Saxena
