OCPBUGS-47762

Open vSwitch loses system-id causing pod creation failures


    • Critical
    • Customer Escalated

      Description of problem:

      The cluster started seeing pod creation failures. Investigating, we found: failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed)

      7m          Warning   FailedCreatePodSandBox    pod/prometheus-k8s-0                                          Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-0_openshift-monitoring_086ff33e-b490-4667-9c59-89c23827529f_0(1c228301fcc8bc5f0ac353b48ece4eb0c349ae850425be3fc18b9bc96c5d500a): error adding pod openshift-monitoring_prometheus-k8s-0 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"1c228301fcc8bc5f0ac353b48ece4eb0c349ae850425be3fc18b9bc96c5d500a" Netns:"/var/run/netns/e59001c2-b9e0-4505-807a-140ecf38e4e0" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-monitoring;K8S_POD_NAME=prometheus-k8s-0;K8S_POD_INFRA_CONTAINER_ID=1c228301fcc8bc5f0ac353b48ece4eb0c349ae850425be3fc18b9bc96c5d500a;K8S_POD_UID=086ff33e-b490-4667-9c59-89c23827529f" Path:"" ERRORED: error configuring pod [openshift-monitoring/prometheus-k8s-0] networking: [openshift-monitoring/prometheus-k8s-0/086ff33e-b490-4667-9c59-89c23827529f:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-0 1c228301fcc8bc5f0ac353b48ece4eb0c349ae850425be3fc18b9bc96c5d500a network default NAD default] [openshift-monitoring/prometheus-k8s-0 1c228301fcc8bc5f0ac353b48ece4eb0c349ae850425be3fc18b9bc96c5d500a network default NAD default] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:83:01:d0 [10.131.1.208/23]...
      

      The OVS pods themselves were running fine, but pod networking on the affected nodes was not actually working.
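
      The scope of the failures can be gauged by listing the sandbox-creation events cluster-wide; a minimal sketch (the reason field selector is an assumption based on the event shown above):

      # List recent FailedCreatePodSandBox events across all namespaces, newest last
      oc get events -A --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp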

      Upon restarting the ovnkube-node pods, we could see the ovnkube-controller container failing to start.

      1) No system-id configured in the local host

      2025-01-02T19:41:48.894551028Z failed to start node network controller: failed to start default node network controller: no system-id configured in the local host
      

      This was being reported despite the node having a correct /etc/openvswitch/system-id.conf.
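
      A quick way to confirm the mismatch on an affected node is to compare the on-disk file with what the running OVS database reports; a minimal sketch, assuming default OVS paths and node access via oc debug:

      # From a debug shell on the affected node (oc debug node/<node>, then chroot /host)
      cat /etc/openvswitch/system-id.conf

      # The same query ovnkube-controller runs at startup; on a broken node this returns an empty string
      ovs-vsctl --timeout=15 --if-exists get Open_vSwitch . external_ids:system-id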

      Based on that, I went through the ovn-kubernetes source [1] and determined that the missing system-id is fatal and would have triggered the abrupt shutdown.

      [1] https://github.com/openshift/ovn-kubernetes/blob/release-4.16/go-controller/pkg/util/util.go#L171

      2025-01-02T19:41:48.890131040Z I0102 19:41:48.890101 1615889 ovs.go:159] Exec(35): /usr/bin/ovs-vsctl --timeout=15 --if-exists get Open_vSwitch . external_ids:system-id
      2025-01-02T19:41:48.894090656Z I0102 19:41:48.894021 1615889 ovs.go:162] Exec(35): stdout: "\n"
      2025-01-02T19:41:48.894090656Z I0102 19:41:48.894062 1615889 ovs.go:163] Exec(35): stderr: ""
      2025-01-02T19:41:48.894197778Z I0102 19:41:48.894181 1615889 ovnkube.go:577] Stopping ovnkube...
      

      Going back through data previously collected from this cluster, I could see that the ovn-controller container started complaining about the missing system-id on December 18th.

      2024-12-18T13:02:39.197266189Z 2024-12-18T13:02:39.197Z|04467|chassis|WARN|'system-id' in Open_vSwitch database is missing.
      

      Looking at the older logs, we could see the same pattern across all 5 worker nodes: the issue was preceded by a restart of the virt-handler pods. The virt-handlers only run on the workers, and this was the only time they had restarted during the life of this cluster.
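
      To see when each worker first hit this, the ovn-controller containers can be checked for the same warning; a sketch (the app=ovnkube-node label selector is an assumption based on the standard ovnkube-node DaemonSet):

      # Print the first 'system-id ... missing' warning from each ovn-controller container
      for pod in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name); do
        echo "== ${pod}"
        oc -n openshift-ovn-kubernetes logs "${pod}" -c ovn-controller | grep -m1 "'system-id' in Open_vSwitch database is missing"
      done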

      The rollout of the virt-handler pods and the rest of the CNV pods was triggered by a CNV upgrade from 4.16.4 to 4.16.5.

      We could also see that this triggered a reapplication of the node network states and bond interfaces.
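
      The reapply operations are visible in each worker's NetworkManager journal (excerpt below); a sketch of how to pull them for the relevant window, assuming node access via oc debug:

      # From a debug shell on a worker node (oc debug node/<node>, then chroot /host)
      journalctl -u NetworkManager --since "2024-12-18 12:55" --until "2024-12-18 13:10" | grep -E 'device-reapply|checkpoint'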

      Dec 18 13:02:39 lb3svrcd40e35 NetworkManager[3203]: <info>  [1734526959.1919] audit: op="device-reapply" interface="bond0.204" ifindex=10 pid=1842026 uid=0 result="success"
      Dec 18 13:02:39 lb3svrcd40e35 NetworkManager[3203]: <info>  [1734526959.1935] audit: op="checkpoint-adjust-rollback-timeout" arg="/org/freedesktop/NetworkManager/Checkpoint/11" pid=1842026 uid=0 result="success"
      Dec 18 13:02:39 lb3svrcd40e35 nm-dispatcher[1842408]: NM resolv-prepender triggered by bond0.204 reapply.
      

      All 5 worker nodes show the same pattern: the CNV pods roll out, followed by 'system-id' in Open_vSwitch database is missing.

      2024-12-18T13:02:27.935016581Z 2024-12-18T13:02:27.934Z|04459|binding|INFO|Releasing lport openshift-cnv_virt-handler-mqsln from this chassis (sb_readonly=0)
      2024-12-18T13:02:27.935016581Z 2024-12-18T13:02:27.934Z|04460|if_status|WARN|Dropped 1 log messages in last 109 seconds (most recently, 109 seconds ago) due to excessive rate
      2024-12-18T13:02:27.935016581Z 2024-12-18T13:02:27.934Z|04461|if_status|WARN|Trying to release unknown interface openshift-cnv_virt-handler-mqsln
      2024-12-18T13:02:27.935016581Z 2024-12-18T13:02:27.934Z|04462|binding|INFO|Setting lport openshift-cnv_virt-handler-mqsln down in Southbound
      2024-12-18T13:02:32.812830148Z 2024-12-18T13:02:32.812Z|04463|binding|INFO|Claiming lport openshift-cnv_virt-handler-np447 for this chassis.
      2024-12-18T13:02:32.812830148Z 2024-12-18T13:02:32.812Z|04464|binding|INFO|openshift-cnv_virt-handler-np447: Claiming 0a:58:0a:82:03:1d 10.130.3.29
      2024-12-18T13:02:33.049519842Z 2024-12-18T13:02:33.049Z|04465|binding|INFO|Setting lport openshift-cnv_virt-handler-np447 ovn-installed in OVS
      2024-12-18T13:02:33.049519842Z 2024-12-18T13:02:33.049Z|04466|binding|INFO|Setting lport openshift-cnv_virt-handler-np447 up in Southbound
      2024-12-18T13:02:39.197266189Z 2024-12-18T13:02:39.197Z|04467|chassis|WARN|'system-id' in Open_vSwitch database is missing.
      2024-12-18T13:02:39.197296693Z 2024-12-18T13:02:39.197Z|04468|main|INFO|OVNSB commit failed, force recompute next time.
      

      Version-Release number of selected component (if applicable):
      4.16.20

      How reproducible:
      Appears to have been triggered by a CNV operator upgrade.

      Steps to Reproduce:
      1. On OCP 4.16.20 with CNV 4.16.4, upgrade CNV to 4.16.5

      Actual results:
      OVN failed to configure new pod interfaces, so new pods could not be created.

      Expected results:
      The OVS system-id should persist across the CNV upgrade and new pod interfaces should continue to be configured.

      Additional info:

      There was a previous issue like this caused by nmstate:
      https://issues.redhat.com/browse/OCPBUGS-18729 / https://issues.redhat.com/browse/OCPBUGS-18869

      Workaround:
      Drain and reboot the node(s) to resolve the issue: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-gracefully_nodes-nodes-rebooting
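
      A sketch of the documented drain-and-reboot procedure for an affected node (<node> is a placeholder):

      # Mark the node unschedulable and drain workloads from it
      oc adm cordon <node>
      oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force

      # Reboot the node, then bring it back into scheduling once it is Ready
      oc debug node/<node> -- chroot /host systemctl reboot
      oc adm uncordon <node>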
