OpenShift Bugs / OCPBUGS-59289

worker node fails with NotReady status after network migration from OpenShiftSDN to OVNKubernetes


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Normal
    • Affects Version: 4.16.z
    • Component: Networking / multus
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

      After the network migration from OpenShiftSDN to OVNKubernetes, a worker node fails with NotReady status.
      The network migration was started with:
      ~~~
      oc patch Network.config.openshift.io cluster --type='merge' --patch '{ "spec": { "networkType": "OVNKubernetes" } }'
      ~~~
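      As a sanity check after the patch, the reported network type and the relevant cluster operators can be inspected with standard oc queries; the following is a minimal sketch, not taken from the case data:
      ~~~
      # Confirm which network type the cluster reports
      oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'

      # Watch the network and machine-config cluster operators converge
      oc get co network machine-config

      # Verify that all nodes return to Ready after the reboots
      oc get nodes -o wide
      ~~~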
      On the failed worker, the multus container did not start successfully after the network migration:
      
      ~~~
      sosreport-eu-de-1-worker-1-2025-07-08-gqihiua]$ cat var/log/pods/openshift-multus_multus-mvwzz_bbfe9cf0-c10e-4823-ba21-0f192def7205/kube-multus/6.log
      2025-07-07T08:35:08.587500899+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4
      2025-07-07T08:35:08.596884428+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4 to /host/opt/cni/bin/
      2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] multus-daemon started
      2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check
      2025-07-07T08:35:08.627244118+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check done!
      2025-07-07T08:35:08.629822630+00:00 stderr F I0707 08:35:08.629784 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
      2025-07-07T08:35:08.630357963+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Waiting for certificate
      2025-07-07T08:35:09.630554486+00:00 stderr F I0707 08:35:09.630501 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
      2025-07-07T08:35:09.630869517+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Certificate found!
      2025-07-07T08:35:09.631810762+00:00 stderr F 2025-07-07T08:35:09Z [verbose] server configured with chroot: /hostroot
      2025-07-07T08:35:09.631810762+00:00 stderr P 2025-07-07T08:35:09Z [verbose] 
      2025-07-07T08:35:09.631835813+00:00 stderr F Filtering pod watch for node "eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl"
      2025-07-07T08:35:09.732732980+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check
      2025-07-07T08:35:09.734630985+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check done!
      2025-07-07T08:35:09.734917535+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
      2025-07-07T08:35:09.735167154+00:00 stderr F 2025-07-07T08:35:09Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
      2025-07-07T08:37:53.566649210+00:00 stderr P 2025-07-07T08:37:53Z [verbose] 
      2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
      2025-07-07T08:37:53.566809591+00:00 stderr F 
      2025-07-07T08:37:53.567264616+00:00 stderr P 2025-07-07T08:37:53Z [verbose] 
      2025-07-07T08:37:53.567707684+00:00 stderr P Stopped monitoring, closing channel ...
      2025-07-07T08:37:53.567835148+00:00 stderr F 
      2025-07-07T08:37:53.568047155+00:00 stderr F E0707 08:37:53.567507 1589078 server.go:382] CNI server Serve() failed: http: Server closed
      2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] ConfigWatcher done
      2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] Delete old config @ /host/etc/cni/net.d/00-multus.conf
      2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] multus daemon is exited
      ~~~
      On the other workers the multus pods started successfully; for example, on another worker:
      ~~~
      2025-07-07T08:35:12.376118722+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0
      2025-07-07T08:35:12.387408513+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0 to /host/opt/cni/bin/
      2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] multus-daemon started
      2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check
      2025-07-07T08:35:12.461252061+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check done!
      2025-07-07T08:35:12.563177618+00:00 stderr F I0707 08:35:12.563104 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
      2025-07-07T08:35:12.710691906+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Waiting for certificate
      2025-07-07T08:35:13.711028242+00:00 stderr F I0707 08:35:13.710960 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
      2025-07-07T08:35:13.711352733+00:00 stderr F 2025-07-07T08:35:13Z [verbose] Certificate found!
      2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] server configured with chroot: /hostroot
      2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Filtering pod watch for node "eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl"
      2025-07-07T08:35:14.418547335+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check
      2025-07-07T08:35:14.419008535+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check done!
      2025-07-07T08:35:14.419169223+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
      2025-07-07T08:35:14.419323397+00:00 stderr F 2025-07-07T08:35:14Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
      2025-07-07T08:37:18.114014219+00:00 stderr F 2025-07-07T08:37:18Z [verbose] ADD starting CNI request ContainerID:"1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd" Netns:"/var/run/netns/83d5af63-1ee9-49d0-92af-12ef06809b11" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-storage;K8S_POD_NAME=rook-ceph-mgr-b-858bcfff9-jhxz7;K8S_POD_INFRA_CONTAINER_ID=1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd;K8S_POD_UID=97e271b8-df18-42b5-bb44-1fb226fca1b0" Path:""
      ~~~
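      For a live comparison across nodes, the per-node multus pods and the logs of their previously terminated kube-multus containers can also be pulled from the cluster directly (sketch only; the pod name below is the failed worker's pod from the sosreport, and the label selector assumes the default multus daemonset labels):
      ~~~
      # List the multus daemonset pods and the node each one runs on
      oc -n openshift-multus get pods -l app=multus -o wide

      # Log of the previously terminated kube-multus container on the failed worker
      oc -n openshift-multus logs multus-mvwzz -c kube-multus --previous
      ~~~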
      
      A SIGTERM signal was received by the multus pods on worker-1 and worker-2:
      
      worker-1:
      
      ~~~
      $ grep stopping *
      1.log:2025-06-26T10:30:22.097752411+00:00 stderr F 2025-06-26T10:30:22Z [verbose] caught terminated, stopping...
      2.log:2025-06-27T02:09:36.412298271+00:00 stderr P 2025-06-27T02:09:36Z [verbose] caught terminated, stopping...
      3.log:2025-07-04T12:46:18.008239319+00:00 stderr P caught terminated, stopping...
      6.log:2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
      ~~~
      
      
      worker-2:
      
      ~~~
      $ grep stopping *
      1.log:2025-06-26T10:40:13.127917610+00:00 stderr F 2025-06-26T10:40:13Z [verbose] caught terminated, stopping...
      2.log:2025-06-27T02:25:35.022315563+00:00 stderr F caught terminated, stopping...
      3.log:2025-07-04T14:16:43.758816180+00:00 stderr P caught terminated, stopping...
      6.log:2025-07-07T08:44:59.452652892+00:00 stderr P caught terminated, stopping...
      7.log:2025-07-11T02:23:06.160519470+00:00 stderr F 2025-07-11T02:23:06Z [verbose] caught terminated, stopping...
      ~~~
      
      The SIGTERM received on 7th July is present on both workers.
      We need to know the root cause of the SIGTERM sent to the process, which led to this condition/status, as per the signal handling in the code [1].
      
      [1]: https://github.com/k8snetworkplumbingwg/multus-cni/blob/00adf22482945de41ff702258593aaf867a48e2d/cmd/multus-daemon/main.go#L126
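      To narrow down what sent the SIGTERM (kubelet stopping the pod, CRI-O, or a drain/reboot driven by the machine-config-daemon), the node journal around that timestamp can be checked; a rough sketch, with the node name taken from the logs above and the time window as an assumption:
      ~~~
      NODE=eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl

      # Kubelet and CRI-O activity around the time the multus daemon caught SIGTERM
      oc debug node/${NODE} -- chroot /host \
        journalctl -u kubelet -u crio --since "2025-07-07 08:35:00" --until "2025-07-07 08:40:00"

      # Locate the machine-config-daemon pod on that node to check for drain/reboot activity
      oc -n openshift-machine-config-operator get pods -o wide | grep ${NODE}
      ~~~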
      
      

      Version-Release number of selected component (if applicable):

          OCP v4.16.42

      How reproducible:

          NA

      Steps to Reproduce:

          1. Perform the offline network migration from OpenShiftSDN to OVNKubernetes (a rough sketch of the documented sequence is shown below).
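          For reference, the offline migration is roughly the following documented sequence (sketch only; the exact steps, waits, and rollback handling are in the OpenShift SDN-to-OVN-Kubernetes migration documentation):
      ~~~
      # 1. Ask the cluster network operator to prepare the migration
      oc patch Network.operator.openshift.io cluster --type='merge' \
        --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'

      # 2. Wait for the machine config pools to finish updating
      oc get mcp

      # 3. Switch the default cluster network (the command from the description above)
      oc patch Network.config.openshift.io cluster --type='merge' \
        --patch '{"spec":{"networkType":"OVNKubernetes"}}'

      # 4. Reboot the nodes and confirm they return to Ready
      oc get nodes
      ~~~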
         

      Actual results:

          When rebooting the nodes, one node is stuck in NotReady state. The machine-config cluster operator gives the following error:
      ~~~
      machine-config                             4.16.42   True        False         True       11d     Failed to resync 4.16.42 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
      ~~~
      One pod stays in Pending state:
      ~~~
      [root@temp-41451-04-eu-de-1-bastion-0 ~]# oc get pods
      NAME                                                                        READY   STATUS    RESTARTS   AGE
      kube-rbac-proxy-crio-eu-de-1-master-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
      kube-rbac-proxy-crio-eu-de-1-master-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
      kube-rbac-proxy-crio-eu-de-1-master-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
      kube-rbac-proxy-crio-eu-de-1-worker-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
      kube-rbac-proxy-crio-eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   2          11d
      kube-rbac-proxy-crio-eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
      kube-rbac-proxy-crio-eu-de-1-worker-3.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   5          8d
      machine-config-controller-7557499fbd-dxdnk                                  2/2     Running   2          2d21h
      machine-config-daemon-7nmk5                                                 2/2     Running   8          11d
      machine-config-daemon-b8vlc                                                 2/2     Running   8          11d
      machine-config-daemon-bqjcd                                                 2/2     Running   8          11d
      machine-config-daemon-dv6lq                                                 2/2     Running   8          11d
      machine-config-daemon-jdwqv                                                 2/2     Running   4          8d
      machine-config-daemon-mvrdw                                                 2/2     Running   8          11d
      machine-config-daemon-wfvkc                                                 0/2     Pending   0          16m
      machine-config-operator-665c9b8858-28hc7                                    2/2     Running   2          2d21h
      machine-config-server-r8c7s                                                 1/1     Running   4          11d
      machine-config-server-tjxn4                                                 1/1     Running   4          11d
      machine-config-server-x8pmz                                                 1/1     Running   4          11d
      ~~~
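      To see why the machine-config-daemon pod stays Pending (typically because its target node is NotReady/unschedulable), the pod events and the node conditions can be checked; a minimal sketch using the pod and node names from the output and logs above:
      ~~~
      # Events usually show why the pod cannot be scheduled or started
      oc -n openshift-machine-config-operator describe pod machine-config-daemon-wfvkc

      # Conditions and taints of the NotReady worker
      oc describe node eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl
      oc get nodes
      ~~~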

      Expected results:

          nodes are in ready state after performing the migration

      Additional info:

      - Timestamp for the first step of the migration: Friday 4th July ~14:30 (CEST)
      - Timestamp for the issue appearing: Monday 7th July ~12:00 (CEST)
      - Timestamp for the last step of the migration process: Monday 7th July ~11:30 (CEST)
      Note: The customer started the migration on Friday 4th July and completed it on Monday 7th July, after which the issue was encountered.

              Peng Liu (pliurh)
              John Soliman (rh-ee-jsoliman)
              Weibin Liang