OCPBUGS-50897

SDN to OVN Migration fails with "failed: signal: alarm clock" error in OVN pod logs

The SDN to OVN offline migration of an OCP 4.16.17 cluster failed after step 12 of the documented procedure, and the nodes became NotReady:
https://docs.openshift.com/container-platform/4.16/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#:~:text=t%203%0Adone-,Confirm,-that%20the%20migration
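
For reference, the post-migration state can be confirmed with standard oc commands (nothing cluster-specific assumed here):

      # Network type should report OVNKubernetes after the migration
      oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'

      # Node readiness and the state of the OVN-Kubernetes pods
      oc get nodes
      oc get pods -n openshift-ovn-kubernetes -o wide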

       

The OVN pods were failing with the following error:

      2025-02-15T00:04:29.241135547Z F0215 00:04:29.241113  351606 ovnkube.go:136] failed to run ovnkube: failed to initialize libovsdb SB client: failed to connect to unix:/var/run/ovn/ovnsb_db.sock: context deadline exceeded
      2025-02-15T00:04:29.241135547Z failed to start node network controller: failed to start default node network controller: timed out waiting for the node zone xxxxxx2-xxxxxx-master-1 to match the OVN Southbound db zone, err: context canceled, err1: failed to get the zone name from the OVN Southbound db server, err : OVN command '/usr/bin/ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name' failed: signal: alarm clock 
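
The failing command from that message can be re-run by hand from the sbdb container of the affected ovnkube-node pod (container name sbdb assumed from the standard 4.16 ovnkube-node pod layout). The "signal: alarm clock" corresponds to the SIGALRM raised when --timeout=15 expires, i.e. the local southbound ovsdb-server is not answering within 15 seconds:

      # Replace <ovnkube-node-pod> with the pod running on the affected node
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c sbdb -- \
          ovn-sbctl --timeout=15 --no-leader-only get SB_Global . options:name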

Rebuilding the OVN databases and rebooting the nodes did not help. However, we noticed that CPU consumption on the master and worker nodes was above 100%. On the master nodes, ovsdb-server, ovn-controller, and kube-apiserver were consuming most of the CPU. We doubled the CPU resources on the master nodes, but these components still consumed 100% CPU.
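
The CPU numbers can be cross-checked with standard tooling, for example:

      # Per-node and per-container CPU usage (requires the metrics/monitoring stack)
      oc adm top nodes
      oc adm top pods -n openshift-ovn-kubernetes --containers

      # Top consumers directly on an affected master
      oc debug node/<master-node> -- chroot /host top -b -n 1 -o %CPU | head -n 25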

The SBDB container logs contain the following messages:

      2025-02-15T00:45:14.596521949Z 2025-02-15T00:45:14.596Z|00743|coverage|INFO|Skipping details of duplicate event coverage for hash=13bca397
      2025-02-15T00:45:14.604195158Z 2025-02-15T00:45:14.604Z|00744|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
      2025-02-15T00:45:14.614502657Z 2025-02-15T00:45:14.614Z|00745|poll_loop|INFO|wakeup due to [POLLIN] on fd 17 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:274 (99% CPU usage)
      2025-02-15T00:45:14.676709916Z 2025-02-15T00:45:14.676Z|00746|reconnect|WARN|unix#456: connection dropped (Broken pipe)
      2025-02-15T00:45:14.692542301Z 2025-02-15T00:45:14.692Z|00747|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage)
      2025-02-15T00:45:14.697681894Z 2025-02-15T00:45:14.697Z|00748|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 25 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
      2025-02-15T00:45:14.720813439Z 2025-02-15T00:45:14.720Z|00749|reconnect|WARN|unix#463: connection dropped (Broken pipe)
      2025-02-15T00:45:14.736965964Z 2025-02-15T00:45:14.736Z|00750|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 19 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (99% CPU usage)
      2025-02-15T00:45:14.739631694Z 2025-02-15T00:45:14.739Z|00751|reconnect|WARN|unix#465: connection dropped (Broken pipe)
      2025-02-15T00:45:17.090438548Z 2025-02-15T00:45:17.089Z|00752|poll_loop|INFO|wakeup due to 2341-ms timeout at ../ovsdb/ovsdb-server.c:400 (99% CPU usage)
      2025-02-15T00:45:17.097818464Z 2025-02-15T00:45:17.097Z|00753|poll_loop|INFO|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:202 (99% CPU usage)
      2025-02-15T00:45:17.487237902Z 2025-02-15T00:45:17.480Z|00754|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (/var/run/ovn/ovnsb_db.sock<->) at ../lib/stream-fd.c:274 (99% CPU usage) 
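
To understand what the southbound ovsdb-server is spending its time on, it can be queried over the control socket shown in the log lines above (memory/show and coverage/show are standard appctl commands; the /etc/ovn path for the database file is an assumption based on the default container layout):

      # Memory / database statistics of the SB ovsdb-server
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c sbdb -- \
          ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

      # Counters for the events it keeps waking up on
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c sbdb -- \
          ovn-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show

      # Size of the on-disk database file (path assumed)
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c sbdb -- \
          ls -lh /etc/ovn/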

       

The customer confirmed that there are very few NetworkPolicies, EgressFirewalls, etc. on this cluster, so an inflated ACL count should not be a factor here. We need the engineering team's help to identify why the OVN pods are failing with the above error.
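
The object counts can be cross-checked with simple queries such as:

      # EgressFirewall and EgressIP are OVN-Kubernetes CRDs, present after the migration
      oc get networkpolicy -A --no-headers | wc -l
      oc get egressfirewall -A --no-headers | wc -l
      oc get egressip --no-headers | wc -l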

I am sharing the sosreport link for master-2 in a comment below. Once the customer shares the sosreports from the other master nodes and a must-gather, I will share those links as well.
       
