Bug
Resolution: Won't Do
Critical
4.16.z
Quality / Stability / Reliability
False
Important
The SDN-to-OVN offline migration on OCP 4.16.17 failed after step 12 of the documented procedure (link below), and the nodes became NotReady.
https://docs.openshift.com/container-platform/4.16/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#:~:text=t%203%0Adone-,Confirm,-that%20the%20migration
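For reference, these are the kind of checks used to confirm the state after that step (a sketch only, assuming cluster-admin access; the oc commands below are standard CLI calls, not copied verbatim from the case):

$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'   # should report OVNKubernetes after the migration
$ oc get nodes -o wide                                                      # several nodes were NotReady at this point
$ oc -n openshift-ovn-kubernetes get pods -o wide                           # the ovnkube-node pods were crash-looping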
The OVN pods were failing with the error below:
2025-02-15T00:04:29.241135547Z F0215 00:04:29.241113  351606 ovnkube.go:136] failed to run ovnkube: failed to initialize libovsdb SB client: failed to connect to unix:/var/run/ovn/ovnsb_db.sock: context deadline exceeded
2025-02-15T00:04:29.241135547Z failed to start node network controller: failed to start default node network controller: timed out waiting for the node zone xxxxxx2-xxxxxx-master-1 to match the OVN Southbound db zone, err: context canceled, err1: failed to get the zone name from the OVN Southbound db server, err : OVN command '/usr/bin/ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name' failed: signal: alarm clock
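The second error says the node controller could not read the zone name back from the local Southbound DB. A rough way to compare the two sides (a sketch, under the assumptions that the zone is recorded in the k8s.ovn.org/zone-name node annotation and that the SB DB runs in the sbdb container of that node's ovnkube-node pod):

$ oc get node xxxxxx2-xxxxxx-master-1 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/zone-name}{"\n"}'
$ oc -n openshift-ovn-kubernetes rsh -c sbdb <ovnkube-node pod on that node> \
    ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name   # the same command that fails in the log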
Rebuilding the OVN DBs and rebooting the nodes did not help much. However, we noticed that CPU consumption on the master and worker nodes was above 100%. On the master nodes, ovsdb-server, ovn-controller, and kube-apiserver were consuming most of the CPU. We doubled the CPU resources on the master nodes, but these components still consumed ~100% CPU.
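For completeness, this is roughly how the CPU usage was observed (a sketch; the node name is a placeholder):

$ oc adm top nodes                                                               # overall node CPU from the metrics API
$ oc debug node/<master-node> -- chroot /host top -b -n 1 -o %CPU | head -n 25   # per-process view; ovsdb-server, ovn-controller and kube-apiserver topped the list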
The SBDB container logs contain the following messages:
2025-02-15T00:45:14.596521949Z 2025-02-15T00:45:14.596Z|00743|coverage|INFO|Skipping details of duplicate event coverage for hash=13bca397
2025-02-15T00:45:14.604195158Z 2025-02-15T00:45:14.604Z|00744|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.614502657Z 2025-02-15T00:45:14.614Z|00745|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.676709916Z 2025-02-15T00:45:14.676Z|00746|reconnect|WARN|unix#456: connection dropped (Broken pipe)
2025-02-15T00:45:14.692542301Z 2025-02-15T00:45:14.692Z|00747|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.697681894Z 2025-02-15T00:45:14.697Z|00748|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 25 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
2025-02-15T00:45:14.720813439Z 2025-02-15T00:45:14.720Z|00749|reconnect|WARN|unix#463: connection dropped (Broken pipe)
2025-02-15T00:45:14.736965964Z 2025-02-15T00:45:14.736Z|00750|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 19 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
2025-02-15T00:45:14.739631694Z 2025-02-15T00:45:14.739Z|00751|reconnect|WARN|unix#465: connection dropped (Broken pipe)
2025-02-15T00:45:17.090438548Z 2025-02-15T00:45:17.089Z|00752|poll_loop|INFO|wakeup due to 2341-ms timeout at ../ovsdb/ovsdb-server.c:400 (99% CPU usage)
2025-02-15T00:45:17.097818464Z 2025-02-15T00:45:17.097Z|00753|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:202 (99% CPU usage)
2025-02-15T00:45:17.487237902Z 2025-02-15T00:45:17.480Z|00754|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
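To see what the ovsdb-server process is busy with, it can also be queried over its control socket from inside the sbdb container (a sketch; memory/show and coverage/show are generic ovsdb-server appctl commands, and the socket path is the one from the log above):

$ ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show     # cells, monitors and sessions held by the server
$ ovn-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show   # counters behind the "duplicate event coverage" messages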
The customer confirmed that there are very few NetworkPolicies, EgressFirewalls, etc. on this cluster, so a large number of ACLs is unlikely to be the cause here (rough counts sketched below). We need the engineering team's help to identify why the OVN pods are failing with the above error.
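A quick way to quantify this (a sketch; the ACL count is taken from the local Northbound DB inside the nbdb container of an ovnkube-node pod):

$ oc get networkpolicy -A --no-headers | wc -l             # NetworkPolicies across all namespaces
$ oc get egressfirewall -A --no-headers | wc -l            # EgressFirewalls across all namespaces
$ ovn-nbctl --no-leader-only list ACL | grep -c '^_uuid'   # ACL rows in the local NB DB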
For now, I am sharing the sosreport link for master-2 in the comment below. Once the customer shares the sosreports from the other master nodes and a must-gather, I will share those links as well.