Bug
Resolution: Not a Bug
Normal
4.16.z
Quality / Stability / Reliability
Important
Stage
Description of problem:
After the network migration from OpenShiftSDN to OVNKubernetes, a worker node fails with NotReady status.
The network migration was started with:
~~~
oc patch Network.config.openshift.io cluster --type='merge' --patch '{ "spec": { "networkType": "OVNKubernetes" } }'
~~~
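For reference, a minimal way to confirm the patch was applied and to follow the rollout (standard oc commands, not output captured from this cluster) would be:
~~~
# Confirm the default network type requested in the cluster network config
oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'

# Watch the network cluster operator, the nodes and the multus pods while the migration rolls out
oc get co network
oc get nodes
oc get pods -n openshift-multus -o wide
~~~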
On the failed worker, the multus container did not start successfully after the network migration:
~~~
sosreport-eu-de-1-worker-1-2025-07-08-gqihiua]$ cat var/log/pods/openshift-multus_multus-mvwzz_bbfe9cf0-c10e-4823-ba21-0f192def7205/kube-multus/6.log
2025-07-07T08:35:08.587500899+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4
2025-07-07T08:35:08.596884428+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4 to /host/opt/cni/bin/
2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] multus-daemon started
2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check
2025-07-07T08:35:08.627244118+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check done!
2025-07-07T08:35:08.629822630+00:00 stderr F I0707 08:35:08.629784 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:08.630357963+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Waiting for certificate
2025-07-07T08:35:09.630554486+00:00 stderr F I0707 08:35:09.630501 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:09.630869517+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Certificate found!
2025-07-07T08:35:09.631810762+00:00 stderr F 2025-07-07T08:35:09Z [verbose] server configured with chroot: /hostroot
2025-07-07T08:35:09.631810762+00:00 stderr P 2025-07-07T08:35:09Z [verbose]
2025-07-07T08:35:09.631835813+00:00 stderr F Filtering pod watch for node "eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl"
2025-07-07T08:35:09.732732980+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check
2025-07-07T08:35:09.734630985+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check done!
2025-07-07T08:35:09.734917535+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
2025-07-07T08:35:09.735167154+00:00 stderr F 2025-07-07T08:35:09Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
2025-07-07T08:37:53.566649210+00:00 stderr P 2025-07-07T08:37:53Z [verbose]
2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
2025-07-07T08:37:53.566809591+00:00 stderr F
2025-07-07T08:37:53.567264616+00:00 stderr P 2025-07-07T08:37:53Z [verbose]
2025-07-07T08:37:53.567707684+00:00 stderr P Stopped monitoring, closing channel ...
2025-07-07T08:37:53.567835148+00:00 stderr F
2025-07-07T08:37:53.568047155+00:00 stderr F E0707 08:37:53.567507 1589078 server.go:382] CNI server Serve() failed: http: Server closed
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] ConfigWatcher done
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] Delete old config @ /host/etc/cni/net.d/00-multus.conf
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] multus daemon is exited
~~~
On the other workers, the multus pods started successfully. For example, on another worker we have:
~~~
2025-07-07T08:35:12.376118722+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0
2025-07-07T08:35:12.387408513+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0 to /host/opt/cni/bin/
2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] multus-daemon started
2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check
2025-07-07T08:35:12.461252061+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check done!
2025-07-07T08:35:12.563177618+00:00 stderr F I0707 08:35:12.563104 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:12.710691906+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Waiting for certificate
2025-07-07T08:35:13.711028242+00:00 stderr F I0707 08:35:13.710960 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:13.711352733+00:00 stderr F 2025-07-07T08:35:13Z [verbose] Certificate found!
2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] server configured with chroot: /hostroot
2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Filtering pod watch for node "eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl"
2025-07-07T08:35:14.418547335+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check
2025-07-07T08:35:14.419008535+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check done!
2025-07-07T08:35:14.419169223+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
2025-07-07T08:35:14.419323397+00:00 stderr F 2025-07-07T08:35:14Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
2025-07-07T08:37:18.114014219+00:00 stderr F 2025-07-07T08:37:18Z [verbose] ADD starting CNI request ContainerID:"1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd" Netns:"/var/run/netns/83d5af63-1ee9-49d0-92af-12ef06809b11" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-storage;K8S_POD_NAME=rook-ceph-mgr-b-858bcfff9-jhxz7;K8S_POD_INFRA_CONTAINER_ID=1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd;K8S_POD_UID=97e271b8-df18-42b5-bb44-1fb226fca1b0" Path:""
~~~
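The logs above were taken from the sosreports; if the cluster is still in this state, the live equivalent can be pulled directly (the pod name below is the one from this case, multus-mvwzz on worker-1):
~~~
# List the multus pods and the nodes they run on
oc get pods -n openshift-multus -o wide

# Dump the kube-multus container log for the pod on the affected worker
oc logs -n openshift-multus multus-mvwzz -c kube-multus --previous
~~~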
A SIGTERM signal was received by the multus pods on both worker-1 and worker-2:
worker-1:
~~~
$ grep stopping *
1.log:2025-06-26T10:30:22.097752411+00:00 stderr F 2025-06-26T10:30:22Z [verbose] caught terminated, stopping...
2.log:2025-06-27T02:09:36.412298271+00:00 stderr P 2025-06-27T02:09:36Z [verbose] caught terminated, stopping...
3.log:2025-07-04T12:46:18.008239319+00:00 stderr P caught terminated, stopping...
6.log:2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
~~~
worker-2:
~~~
$ grep stopping *
1.log:2025-06-26T10:40:13.127917610+00:00 stderr F 2025-06-26T10:40:13Z [verbose] caught terminated, stopping...
2.log:2025-06-27T02:25:35.022315563+00:00 stderr F caught terminated, stopping...
3.log:2025-07-04T14:16:43.758816180+00:00 stderr P caught terminated, stopping...
6.log:2025-07-07T08:44:59.452652892+00:00 stderr P caught terminated, stopping...
7.log:2025-07-11T02:23:06.160519470+00:00 stderr F 2025-07-11T02:23:06Z [verbose] caught terminated, stopping...
~~~
The SIGTERM received on 7th July is present on both workers.
We need to know the RCA for the SIGTERM signal that was sent to the process and led to this condition/status, as per the code [1].
[1]: https://github.com/k8snetworkplumbingwg/multus-cni/blob/00adf22482945de41ff702258593aaf867a48e2d/cmd/multus-daemon/main.go#L126
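As a possible next step for that RCA (a suggestion, not something already collected in this case), the sender of the SIGTERM could be recorded on the affected node with an audit rule on the kill syscall, for example:
~~~
# On the node: log every kill() delivering signal 15 (SIGTERM), then search the
# records for the multus daemon PID to see which process/comm sent the signal.
auditctl -a always,exit -F arch=b64 -S kill -F a1=15 -k sigterm-trace
ausearch -k sigterm-trace --interpret
~~~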
Version-Release number of selected component (if applicable):
OCP v4.16.42
How reproducible:
NA
Steps to Reproduce:
1. Perform the offline network migration from OpenShiftSDN to OVNKubernetes (an abbreviated outline of the documented procedure follows below).
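For completeness, an abbreviated outline of the documented offline migration (see the OpenShift SDN-to-OVN-Kubernetes migration docs for the full procedure; only the second patch is quoted in the description above):
~~~
# Flag the migration on the network operator config
oc patch Network.operator.openshift.io cluster --type='merge' \
  --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'

# Switch the default cluster network type (the command from the description)
oc patch Network.config.openshift.io cluster --type='merge' \
  --patch '{"spec":{"networkType":"OVNKubernetes"}}'

# Reboot the nodes and wait for the machine config pools and nodes to settle
oc get mcp
oc get nodes
~~~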
Actual results:
When rebooting the nodes, one node is stuck in NotReady state. The machine-config cluster operator reports the following error:
~~~
machine-config   4.16.42   True   False   True   11d   Failed to resync 4.16.42 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
~~~
One pod stays in Pending state:
~~~
[root@temp-41451-04-eu-de-1-bastion-0 ~]# oc get pods
NAME                                                                        READY   STATUS    RESTARTS   AGE
kube-rbac-proxy-crio-eu-de-1-master-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-master-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-master-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-worker-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
kube-rbac-proxy-crio-eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   2          11d
kube-rbac-proxy-crio-eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
kube-rbac-proxy-crio-eu-de-1-worker-3.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   5          8d
machine-config-controller-7557499fbd-dxdnk                                  2/2     Running   2          2d21h
machine-config-daemon-7nmk5                                                 2/2     Running   8          11d
machine-config-daemon-b8vlc                                                 2/2     Running   8          11d
machine-config-daemon-bqjcd                                                 2/2     Running   8          11d
machine-config-daemon-dv6lq                                                 2/2     Running   8          11d
machine-config-daemon-jdwqv                                                 2/2     Running   4          8d
machine-config-daemon-mvrdw                                                 2/2     Running   8          11d
machine-config-daemon-wfvkc                                                 0/2     Pending   0          16m
machine-config-operator-665c9b8858-28hc7                                    2/2     Running   2          2d21h
machine-config-server-r8c7s                                                 1/1     Running   4          11d
machine-config-server-tjxn4                                                 1/1     Running   4          11d
machine-config-server-x8pmz                                                 1/1     Running   4          11d
~~~
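For context, the usual way to see why machine-config-daemon-wfvkc stays Pending (typically because the NotReady node cannot run the daemonset pod) would be something along these lines:
~~~
# Show scheduling events for the pending daemon pod
oc describe pod machine-config-daemon-wfvkc -n openshift-machine-config-operator

# Check the conditions and taints on the NotReady worker (worker-1 in this case)
oc describe node eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl
~~~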
Expected results:
Nodes are in Ready state after performing the migration.
Additional info:
- Timestamp for the first step of the migration --> 4th of July ~14:30 (CEST)
- Timestamp for the issue appearing --> Monday 7th July ~12:00 (CEST)
- Timestamp for the last step of the migration process --> Monday 7th July ~11:30 (CEST)

Note: On Friday 4th the customer started the migration, then completed it on Monday 7th, after which the customer encountered the issue.