Bug
Resolution: Won't Do
Critical
4.16.z
Quality / Stability / Reliability
False
Important
The SDN-to-OVN offline migration on OCP 4.16.17 failed after step 12 of the documented procedure (link below), and the nodes became NotReady.
https://docs.openshift.com/container-platform/4.16/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#:~:text=t%203%0Adone-,Confirm,-that%20the%20migration
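For reference, these are the kind of checks used to confirm the state after that step (a sketch only, assuming cluster-admin access; the oc commands below are standard CLI calls, not copied verbatim from the case):

$ oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'   # should report OVNKubernetes after the migration
$ oc get nodes -o wide                                                      # several nodes were NotReady at this point
$ oc -n openshift-ovn-kubernetes get pods -o wide                           # the ovnkube-node pods were crash-looping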
The OVN pods were failing with the error below:
2025-02-15T00:04:29.241135547Z F0215 00:04:29.241113  351606 ovnkube.go:136] failed to run ovnkube: failed to initialize libovsdb SB client: failed to connect to unix:/var/run/ovn/ovnsb_db.sock: context deadline exceeded
2025-02-15T00:04:29.241135547Z failed to start node network controller: failed to start default node network controller: timed out waiting for the node zone xxxxxx2-xxxxxx-master-1 to match the OVN Southbound db zone, err: context canceled, err1: failed to get the zone name from the OVN Southbound db server, err : OVN command '/usr/bin/ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name' failed: signal: alarm clock
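The second error says the node controller could not read the zone name back from the local Southbound DB. A rough way to compare the two sides (a sketch, under the assumptions that the zone is recorded in the k8s.ovn.org/zone-name node annotation and that the SB DB runs in the sbdb container of that node's ovnkube-node pod):

$ oc get node xxxxxx2-xxxxxx-master-1 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/zone-name}{"\n"}'
$ oc -n openshift-ovn-kubernetes rsh -c sbdb <ovnkube-node pod on that node> \
    ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name   # the same command that fails in the log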
Rebuilding the OVN DBs and rebooting the nodes did not help much. However, we noticed that CPU consumption on the master and worker nodes was above 100%. On the master nodes, ovsdb-server, ovn-controller, and kube-apiserver were consuming most of the CPU. We doubled the CPU resources on the master nodes, but these components still consumed ~100% CPU.
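For completeness, this is roughly how the CPU usage was observed (a sketch; the node name is a placeholder):

$ oc adm top nodes                                                               # overall node CPU from the metrics API
$ oc debug node/<master-node> -- chroot /host top -b -n 1 -o %CPU | head -n 25   # per-process view; ovsdb-server, ovn-controller and kube-apiserver topped the list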
The SBDB container logs contain the following messages:
2025-02-15T00:45:14.596521949Z 2025-02-15T00:45:14.596Z|00743|coverage|INFO|Skipping details of duplicate event coverage for hash=13bca397
2025-02-15T00:45:14.604195158Z 2025-02-15T00:45:14.604Z|00744|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.614502657Z 2025-02-15T00:45:14.614Z|00745|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.676709916Z 2025-02-15T00:45:14.676Z|00746|reconnect|WARN|unix#456: connection dropped (Broken pipe)
2025-02-15T00:45:14.692542301Z 2025-02-15T00:45:14.692Z|00747|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
2025-02-15T00:45:14.697681894Z 2025-02-15T00:45:14.697Z|00748|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 25 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
2025-02-15T00:45:14.720813439Z 2025-02-15T00:45:14.720Z|00749|reconnect|WARN|unix#463: connection dropped (Broken pipe)
2025-02-15T00:45:14.736965964Z 2025-02-15T00:45:14.736Z|00750|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 19 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
2025-02-15T00:45:14.739631694Z 2025-02-15T00:45:14.739Z|00751|reconnect|WARN|unix#465: connection dropped (Broken pipe)
2025-02-15T00:45:17.090438548Z 2025-02-15T00:45:17.089Z|00752|poll_loop|INFO|wakeup due to 2341-ms timeout at ../ovsdb/ovsdb-server.c:400 (99% CPU usage)
2025-02-15T00:45:17.097818464Z 2025-02-15T00:45:17.097Z|00753|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:202 (99% CPU usage)
2025-02-15T00:45:17.487237902Z 2025-02-15T00:45:17.480Z|00754|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
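To see what the ovsdb-server process is busy with, it can also be queried over its control socket from inside the sbdb container (a sketch; memory/show and coverage/show are generic ovsdb-server appctl commands, and the socket path is the one from the log above):

$ ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show     # cells, monitors and sessions held by the server
$ ovn-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show   # counters behind the "duplicate event coverage" messages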
The customer confirmed that there are very few NetworkPolicies, EgressFirewalls, etc. on this cluster, so a large number of ACLs is unlikely to be the cause here (rough counts sketched below). We need the engineering team's help to identify why the OVN pods are failing with the above error.
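A quick way to quantify this (a sketch; the ACL count is taken from the local Northbound DB inside the nbdb container of an ovnkube-node pod):

$ oc get networkpolicy -A --no-headers | wc -l             # NetworkPolicies across all namespaces
$ oc get egressfirewall -A --no-headers | wc -l            # EgressFirewalls across all namespaces
$ ovn-nbctl --no-leader-only list ACL | grep -c '^_uuid'   # ACL rows in the local NB DB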
For now, I am sharing the sosreport link for master-2 in the comment below. Once the customer shares the sosreports from the other master nodes and a must-gather, I will share those links as well.