Bug
Resolution: Not a Bug
Normal
4.16.z
Quality / Stability / Reliability
Important
Stage
Description of problem:
After the network migration from OpenShiftSDN to OVNKubernetes, a worker node fails with NotReady status.
The network migration was started with:
~~~
oc patch Network.config.openshift.io cluster --type='merge' --patch '{ "spec": { "networkType": "OVNKubernetes" } }'
~~~
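For reference, a minimal way to confirm the patch was applied and to follow the rollout (standard oc commands, not output captured from this cluster) would be:
~~~
# Confirm the default network type requested in the cluster network config
oc get network.config/cluster -o jsonpath='{.spec.networkType}{"\n"}'

# Watch the network cluster operator, the nodes and the multus pods while the migration rolls out
oc get co network
oc get nodes
oc get pods -n openshift-multus -o wide
~~~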
On the failed worker, the multus container did not start successfully after the network migration:
~~~
sosreport-eu-de-1-worker-1-2025-07-08-gqihiua]$ cat var/log/pods/openshift-multus_multus-mvwzz_bbfe9cf0-c10e-4823-ba21-0f192def7205/kube-multus/6.log
2025-07-07T08:35:08.587500899+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4
2025-07-07T08:35:08.596884428+00:00 stdout F 2025-07-07T08:35:08+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_2fa4d037-b63a-44f4-b7cb-cfb0593c7eb4 to /host/opt/cni/bin/
2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] multus-daemon started
2025-07-07T08:35:08.627185136+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check
2025-07-07T08:35:08.627244118+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Readiness Indicator file check done!
2025-07-07T08:35:08.629822630+00:00 stderr F I0707 08:35:08.629784 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:08.630357963+00:00 stderr F 2025-07-07T08:35:08Z [verbose] Waiting for certificate
2025-07-07T08:35:09.630554486+00:00 stderr F I0707 08:35:09.630501 1589078 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:09.630869517+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Certificate found!
2025-07-07T08:35:09.631810762+00:00 stderr F 2025-07-07T08:35:09Z [verbose] server configured with chroot: /hostroot
2025-07-07T08:35:09.631810762+00:00 stderr P 2025-07-07T08:35:09Z [verbose]
2025-07-07T08:35:09.631835813+00:00 stderr F Filtering pod watch for node "eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl"
2025-07-07T08:35:09.732732980+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check
2025-07-07T08:35:09.734630985+00:00 stderr F 2025-07-07T08:35:09Z [verbose] API readiness check done!
2025-07-07T08:35:09.734917535+00:00 stderr F 2025-07-07T08:35:09Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
2025-07-07T08:35:09.735167154+00:00 stderr F 2025-07-07T08:35:09Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
2025-07-07T08:37:53.566649210+00:00 stderr P 2025-07-07T08:37:53Z [verbose]
2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
2025-07-07T08:37:53.566809591+00:00 stderr F
2025-07-07T08:37:53.567264616+00:00 stderr P 2025-07-07T08:37:53Z [verbose]
2025-07-07T08:37:53.567707684+00:00 stderr P Stopped monitoring, closing channel ...
2025-07-07T08:37:53.567835148+00:00 stderr F
2025-07-07T08:37:53.568047155+00:00 stderr F E0707 08:37:53.567507 1589078 server.go:382] CNI server Serve() failed: http: Server closed
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] ConfigWatcher done
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] Delete old config @ /host/etc/cni/net.d/00-multus.conf
2025-07-07T08:37:53.634495130+00:00 stderr F 2025-07-07T08:37:53Z [verbose] multus daemon is exited
~~~
On the other workers, the multus pods started successfully. For example, on another worker we have:
~~~
2025-07-07T08:35:12.376118722+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0
2025-07-07T08:35:12.387408513+00:00 stdout F 2025-07-07T08:35:12+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_cfd0b1bb-5c1d-457e-91d2-3bb338a061f0 to /host/opt/cni/bin/
2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] multus-daemon started
2025-07-07T08:35:12.460887267+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check
2025-07-07T08:35:12.461252061+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Readiness Indicator file check done!
2025-07-07T08:35:12.563177618+00:00 stderr F I0707 08:35:12.563104 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:12.710691906+00:00 stderr F 2025-07-07T08:35:12Z [verbose] Waiting for certificate
2025-07-07T08:35:13.711028242+00:00 stderr F I0707 08:35:13.710960 1947016 certificate_store.go:130] Loading cert/key pair from "/etc/cni/multus/certs/multus-client-current.pem".
2025-07-07T08:35:13.711352733+00:00 stderr F 2025-07-07T08:35:13Z [verbose] Certificate found!
2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] server configured with chroot: /hostroot
2025-07-07T08:35:14.017772521+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Filtering pod watch for node "eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl"
2025-07-07T08:35:14.418547335+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check
2025-07-07T08:35:14.419008535+00:00 stderr F 2025-07-07T08:35:14Z [verbose] API readiness check done!
2025-07-07T08:35:14.419169223+00:00 stderr F 2025-07-07T08:35:14Z [verbose] Generated MultusCNI config: {"binDir":"/var/lib/cni/bin","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","clusterNetwork":"/host/run/multus/cni/net.d/10-ovn-kubernetes.conf","namespaceIsolation":true,"globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","type":"multus-shim","daemonSocketDir":"/run/multus/socket"}
2025-07-07T08:35:14.419323397+00:00 stderr F 2025-07-07T08:35:14Z [verbose] started to watch file /host/run/multus/cni/net.d/10-ovn-kubernetes.conf
2025-07-07T08:37:18.114014219+00:00 stderr F 2025-07-07T08:37:18Z [verbose] ADD starting CNI request ContainerID:"1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd" Netns:"/var/run/netns/83d5af63-1ee9-49d0-92af-12ef06809b11" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-storage;K8S_POD_NAME=rook-ceph-mgr-b-858bcfff9-jhxz7;K8S_POD_INFRA_CONTAINER_ID=1f1a4101a92396b337f1182dce3a27c37d3dc0148a52bc753bcd817d3cb057bd;K8S_POD_UID=97e271b8-df18-42b5-bb44-1fb226fca1b0" Path:""
~~~
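The logs above were taken from the sosreports; if the cluster is still in this state, the live equivalent can be pulled directly (the pod name below is the one from this case, multus-mvwzz on worker-1):
~~~
# List the multus pods and the nodes they run on
oc get pods -n openshift-multus -o wide

# Dump the kube-multus container log for the pod on the affected worker
oc logs -n openshift-multus multus-mvwzz -c kube-multus --previous
~~~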
A SIGTERM signal was received by the multus pods on both worker-1 and worker-2:
worker-1:
~~~
$ grep stopping *
1.log:2025-06-26T10:30:22.097752411+00:00 stderr F 2025-06-26T10:30:22Z [verbose] caught terminated, stopping...
2.log:2025-06-27T02:09:36.412298271+00:00 stderr P 2025-06-27T02:09:36Z [verbose] caught terminated, stopping...
3.log:2025-07-04T12:46:18.008239319+00:00 stderr P caught terminated, stopping...
6.log:2025-07-07T08:37:53.566745132+00:00 stderr P caught terminated, stopping...
~~~
worker-2:
~~~
$ grep stopping *
1.log:2025-06-26T10:40:13.127917610+00:00 stderr F 2025-06-26T10:40:13Z [verbose] caught terminated, stopping...
2.log:2025-06-27T02:25:35.022315563+00:00 stderr F caught terminated, stopping...
3.log:2025-07-04T14:16:43.758816180+00:00 stderr P caught terminated, stopping...
6.log:2025-07-07T08:44:59.452652892+00:00 stderr P caught terminated, stopping...
7.log:2025-07-11T02:23:06.160519470+00:00 stderr F 2025-07-11T02:23:06Z [verbose] caught terminated, stopping...
~~~
The SIGTERM received on 7th July is present on both workers.
We need to know the RCA for the SIGTERM signal that was sent to the process and led to this condition/status, as per the code [1].
[1]: https://github.com/k8snetworkplumbingwg/multus-cni/blob/00adf22482945de41ff702258593aaf867a48e2d/cmd/multus-daemon/main.go#L126
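As a possible next step for that RCA (a suggestion, not something already collected in this case), the sender of the SIGTERM could be recorded on the affected node with an audit rule on the kill syscall, for example:
~~~
# On the node: log every kill() delivering signal 15 (SIGTERM), then search the
# records for the multus daemon PID to see which process/comm sent the signal.
auditctl -a always,exit -F arch=b64 -S kill -F a1=15 -k sigterm-trace
ausearch -k sigterm-trace --interpret
~~~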
Version-Release number of selected component (if applicable):
OCP v4.16.42
How reproducible:
NA
Steps to Reproduce:
1. Perform the offline network migration from OpenShiftSDN to OVNKubernetes (an abbreviated outline of the documented procedure follows below).
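For completeness, an abbreviated outline of the documented offline migration (see the OpenShift SDN-to-OVN-Kubernetes migration docs for the full procedure; only the second patch is quoted in the description above):
~~~
# Flag the migration on the network operator config
oc patch Network.operator.openshift.io cluster --type='merge' \
  --patch '{"spec":{"migration":{"networkType":"OVNKubernetes"}}}'

# Switch the default cluster network type (the command from the description)
oc patch Network.config.openshift.io cluster --type='merge' \
  --patch '{"spec":{"networkType":"OVNKubernetes"}}'

# Reboot the nodes and wait for the machine config pools and nodes to settle
oc get mcp
oc get nodes
~~~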
Actual results:
When rebooting the nodes, one node is stuck in NotReady state. The machine-config cluster operator reports the following error:
~~~
machine-config   4.16.42   True   False   True   11d   Failed to resync 4.16.42 because: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
~~~
One pod stays in Pending state:
~~~
[root@temp-41451-04-eu-de-1-bastion-0 ~]# oc get pods
NAME                                                                        READY   STATUS    RESTARTS   AGE
kube-rbac-proxy-crio-eu-de-1-master-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-master-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-master-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   4          11d
kube-rbac-proxy-crio-eu-de-1-worker-0.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
kube-rbac-proxy-crio-eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   2          11d
kube-rbac-proxy-crio-eu-de-1-worker-2.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   3          11d
kube-rbac-proxy-crio-eu-de-1-worker-3.temp-41451-04.ibm-cloud-schiphol.nl   1/1     Running   5          8d
machine-config-controller-7557499fbd-dxdnk                                  2/2     Running   2          2d21h
machine-config-daemon-7nmk5                                                 2/2     Running   8          11d
machine-config-daemon-b8vlc                                                 2/2     Running   8          11d
machine-config-daemon-bqjcd                                                 2/2     Running   8          11d
machine-config-daemon-dv6lq                                                 2/2     Running   8          11d
machine-config-daemon-jdwqv                                                 2/2     Running   4          8d
machine-config-daemon-mvrdw                                                 2/2     Running   8          11d
machine-config-daemon-wfvkc                                                 0/2     Pending   0          16m
machine-config-operator-665c9b8858-28hc7                                    2/2     Running   2          2d21h
machine-config-server-r8c7s                                                 1/1     Running   4          11d
machine-config-server-tjxn4                                                 1/1     Running   4          11d
machine-config-server-x8pmz                                                 1/1     Running   4          11d
~~~
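For context, the usual way to see why machine-config-daemon-wfvkc stays Pending (typically because the NotReady node cannot run the daemonset pod) would be something along these lines:
~~~
# Show scheduling events for the pending daemon pod
oc describe pod machine-config-daemon-wfvkc -n openshift-machine-config-operator

# Check the conditions and taints on the NotReady worker (worker-1 in this case)
oc describe node eu-de-1-worker-1.temp-41451-04.ibm-cloud-schiphol.nl
~~~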
Expected results:
Nodes are in Ready state after performing the migration.
Additional info:
- Timestamp for the first step of the migration --> 4th of July ~14:30 (CEST)
- Timestamp for the issue appearing --> Monday 7th July ~12:00 (CEST)
- Timestamp for the last step of the migration process --> Monday 7th July ~11:30 (CEST)

Note: On Friday 4th the customer started the migration, then completed it on Monday 7th, after which the customer encountered the issue.