OCPBUGS-60786

ovnkube-cluster-manager cycling between finished syncing and error processing network

      Description of problem:

      The ovnkube-cluster-manager log cycles through "finished syncing NAD", "finished syncing network ovs-bridge-vlan204", and "error found while processing ovs-bridge-vlan204" for every NAD on the cluster.

      This does not appear to impact the NADs, their networks, or the VMs attached to them.

      The errors are a red herring: they generate thousands of log lines, make real debugging harder, and create the illusion of a problem where there is none.

      2025-06-24T19:38:01.736891443+00:00 stderr F I0624 19:38:01.736872       1 controller.go:132] Adding controller [clustermanager-nad-controller NAD controller] event handlers
      2025-06-24T19:38:01.736946275+00:00 stderr F I0624 19:38:01.736937       1 shared_informer.go:313] Waiting for caches to sync for [clustermanager-nad-controller NAD controller]
      2025-06-24T19:38:01.736950611+00:00 stderr F I0624 19:38:01.736945       1 shared_informer.go:320] Caches are synced for [clustermanager-nad-controller NAD controller]
      2025-06-24T19:38:01.737199252+00:00 stderr F I0624 19:38:01.737192       1 controller.go:156] Starting controller [clustermanager-nad-controller NAD controller] with 1 workers
      2025-06-24T19:38:01.737253446+00:00 stderr F I0624 19:38:01.737247       1 network_controller.go:246] [clustermanager-nad-controller network controller]: syncing all networks
      2025-06-24T19:38:01.737265634+00:00 stderr F I0624 19:38:01.737258       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan210, took 5.422µs
      2025-06-24T19:38:01.737269503+00:00 stderr F I0624 19:38:01.737267       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan204, took 1.573µs
      2025-06-24T19:38:01.737273898+00:00 stderr F I0624 19:38:01.737271       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan205, took 1.46µs
      2025-06-24T19:38:01.737278240+00:00 stderr F I0624 19:38:01.737275       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan206, took 1.398µs
      2025-06-24T19:38:01.737281830+00:00 stderr F I0624 19:38:01.737279       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan207, took 1.364µs
      2025-06-24T19:38:01.737289253+00:00 stderr F I0624 19:38:01.737283       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan208, took 2.123µs
      2025-06-24T19:38:01.737304651+00:00 stderr F I0624 19:38:01.737290       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan209, took 2.074µs
      2025-06-24T19:38:01.737309600+00:00 stderr F I0624 19:38:01.737304       1 network_controller.go:257] [clustermanager-nad-controller network controller]: finished syncing all networks. Time taken: 56.699µs
      2025-06-24T19:38:01.737313128+00:00 stderr F I0624 19:38:01.737308       1 controller.go:156] Starting controller [clustermanager-nad-controller network controller] with 1 workers
      2025-06-24T19:38:01.737317467+00:00 stderr F I0624 19:38:01.737315       1 nad_controller.go:162] [clustermanager-nad-controller NAD controller]: started
      2025-06-24T19:38:01.737326680+00:00 stderr F I0624 19:38:01.737321       1 network_cluster_controller.go:376] Initializing cluster manager network controller "default" ...
      2025-06-24T19:38:01.737357487+00:00 stderr F I0624 19:38:01.737351       1 network_cluster_controller.go:382] Cluster manager network controller "default" initialized. Took: 32.626µs
      2025-06-24T19:38:01.737357487+00:00 stderr F I0624 19:38:01.737355       1 network_cluster_controller.go:386] Cluster manager network controller "default" starting node watcher...
      
      2025-06-24T19:38:01.737391928+00:00 stderr F I0624 19:38:01.737383       1 nad_controller.go:246] [clustermanager-nad-controller NAD controller]: finished syncing NAD default/ovs-bridge-vlan206, took 174.417µs
      2025-06-24T19:38:01.737416439+00:00 stderr F I0624 19:38:01.737403       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan204, took 19.002µs
      2025-06-24T19:38:01.737421181+00:00 stderr F I0624 19:38:01.737416       1 controller.go:257] Controller [clustermanager-nad-controller network controller]: error found while processing ovs-bridge-vlan204: [clustermanager-nad-controller network controller]: failed to ensure network ovs-bridge-vlan204: failed to create network ovs-bridge-vlan204: no cluster network controller to manage topology
      2025-06-24T19:38:01.737439932+00:00 stderr F I0624 19:38:01.737433       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan205, took 3.123µs
      2025-06-24T19:38:01.737443920+00:00 stderr F I0624 19:38:01.737439       1 controller.go:257] Controller [clustermanager-nad-controller network controller]: error found while processing ovs-bridge-vlan205: [clustermanager-nad-controller network controller]: failed to ensure network ovs-bridge-vlan205: failed to create network ovs-bridge-vlan205: no cluster network controller to manage topology
      2025-06-24T19:38:01.737443920+00:00 stderr F I0624 19:38:01.737440       1 nad_controller.go:246] [clustermanager-nad-controller NAD controller]: finished syncing NAD default/ovs-bridge-vlan207, took 45.213µs
      2025-06-24T19:38:01.737456415+00:00 stderr F I0624 19:38:01.737449       1 network_controller.go:275] [clustermanager-nad-controller network controller]: finished syncing network ovs-bridge-vlan206, took 2.899µs
      2025-06-24T19:38:01.737459992+00:00 stderr F I0624 19:38:01.737455       1 controller.go:257] Controller [clustermanager-nad-controller network controller]: error found while processing ovs-bridge-vlan206: [clustermanager-nad-controller network controller]: failed to ensure network ovs-bridge-vlan206: failed to create network ovs-bridge-vlan206: no cluster network controller to manage topology
      2025-06-24T19:38:01.737964590+00:00 stderr F I0624 19:38:01.737954       1 network_cluster_controller.go:391] Cluster manager network controller "default" completed watch nodes. Took: 597.124µs
      2025-06-24T19:38:01.737983410+00:00 stderr F I0624 19:38:01.737978       1 zone_cluster_controller.go:217] Node qq2dsfcd27e34.exp-corp.cloud has the id 12 set
      2025-06-24T19:38:01.737987080+00:00 stderr F I0624 19:38:01.737983       1 zone_cluster_controller.go:217] Node qq2dsfcd27e40.exp-corp.cloud has the id 9 set
      2025-06-24T19:38:01.737987080+00:00 stderr F I0624 19:38:01.737985       1 zone_cluster_controller.go:217] Node qq2dsfcd40e36.exp-corp.cloud has the id 3 set
      2025-06-24T19:38:01.737990738+00:00 stderr F I0624 19:38:01.737988       1 zone_cluster_controller.go:217] Node qq2dsfcd40e37.exp-corp.cloud has the id 4 set
      2025-06-24T19:38:01.737994296+00:00 stderr F I0624 19:38:01.737990       1 zone_cluster_controller.go:217] Node qq2dsfcd40e38.exp-corp.cloud has the id 2 set
      2025-06-24T19:38:01.737994296+00:00 stderr F I0624 19:38:01.737993       1 zone_cluster_controller.go:217] Node qq2dsfcc27e34.exp-corp.cloud has the id 8 set
      2025-06-24T19:38:01.737997812+00:00 stderr F I0624 19:38:01.737995       1 zone_cluster_controller.go:217] Node qq2dsfcc27e36.exp-corp.cloud has the id 10 set
      2025-06-24T19:38:01.738001273+00:00 stderr F I0624 19:38:01.737998       1 zone_cluster_controller.go:217] Node qq2dsfcc27e40.exp-corp.cloud has the id 6 set
      2025-06-24T19:38:01.738004749+00:00 stderr F I0624 19:38:01.738000       1 zone_cluster_controller.go:217] Node qq2dsfcc27e38.exp-corp.cloud has the id 5 set
      2025-06-24T19:38:01.738004749+00:00 stderr F I0624 19:38:01.738003       1 zone_cluster_controller.go:217] Node qq2dsfcd27e36.exp-corp.cloud has the id 11 set
      2025-06-24T19:38:01.738008234+00:00 stderr F I0624 19:38:01.738005       1 zone_cluster_controller.go:217] Node qq2dsfcd27e38.exp-corp.cloud has the id 7 set
      2025-06-24T19:38:01.738107914+00:00 stderr F I0624 19:38:01.738089       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:2 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.2/16"}] on node qq2dsfcd40e38.exp-corp.cloud
      2025-06-24T19:38:01.738147623+00:00 stderr F I0624 19:38:01.738118       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:8 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.8/16"}] on node qq2dsfcc27e34.exp-corp.cloud
      2025-06-24T19:38:01.738171690+00:00 stderr F I0624 19:38:01.738136       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:12 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.12/16"}] on node qq2dsfcd27e34.exp-corp.cloud
      2025-06-24T19:38:01.738171690+00:00 stderr F I0624 19:38:01.738142       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:3 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.3/16"}] on node qq2dsfcd40e36.exp-corp.cloud
      2025-06-24T19:38:01.738171690+00:00 stderr F I0624 19:38:01.738147       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:7 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.7/16"}] on node qq2dsfcd27e38.exp-corp.cloud
      2025-06-24T19:38:01.738212015+00:00 stderr F I0624 19:38:01.738171       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:9 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.9/16"}] on node qq2dsfcd27e40.exp-corp.cloud
      2025-06-24T19:38:01.738212015+00:00 stderr F I0624 19:38:01.738119       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:5 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.5/16"}] on node qq2dsfcc27e38.exp-corp.cloud
      2025-06-24T19:38:01.738220716+00:00 stderr F I0624 19:38:01.738167       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:11 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.11/16"}] on node qq2dsfcd27e36.exp-corp.cloud
      2025-06-24T19:38:01.738225904+00:00 stderr F I0624 19:38:01.738141       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:10 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.10/16"}] on node qq2dsfcc27e36.exp-corp.cloud
      2025-06-24T19:38:01.738324583+00:00 stderr F I0624 19:38:01.738284       1 kube.go:133] Setting annotations map[k8s.ovn.org/node-id:6 k8s.ovn.org/node-transit-switch-port-ifaddr:{"ipv4":"100.88.0.6/16"}] on node qq2dsfcc27e40.exp-corp.cloud
      

      Version-Release number of selected component (if applicable):
      4.19.1

      How reproducible:
      Always; it has been happening since the upgrade from 4.18.

      Steps to Reproduce:
      1. An NNCP used as a VLAN trunk (a complete manifest sketch follows the steps):

      spec:
        desiredState:
          ovn:
            bridge-mappings:
            - bridge: br-ex
              localnet: vlan-trunk
              state: present
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      

      2. A localnet NAD on a VLAN, referencing the trunk via physicalNetworkName (see the sketch after the steps):

      spec:
        config: |-
          {
              "cniVersion": "0.4.0",
              "name": "ovs-bridge-vlan204",
              "type": "ovn-k8s-cni-overlay",
              "mtu": 9000,
              "netAttachDefName": "default/ovs-bridge-vlan204",
              "topology": "localnet",
              "physicalNetworkName": "vlan-trunk",
              "vlanID": 204
          }
      

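      For convenience, here is a minimal combined manifest sketch of the two resources above, assuming the NMState (nmstate.io/v1) and Multus (k8s.cni.cncf.io/v1) CRDs are installed; the NNCP name is illustrative, while the bridge mapping and NAD config are copied from the fragments in steps 1 and 2.

      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: vlan-trunk                 # illustrative name, not taken from the cluster
      spec:
        desiredState:
          ovn:
            bridge-mappings:
            - bridge: br-ex
              localnet: vlan-trunk
              state: present
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      ---
      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        name: ovs-bridge-vlan204
        namespace: default
      spec:
        config: |-
          {
              "cniVersion": "0.4.0",
              "name": "ovs-bridge-vlan204",
              "type": "ovn-k8s-cni-overlay",
              "mtu": 9000,
              "netAttachDefName": "default/ovs-bridge-vlan204",
              "topology": "localnet",
              "physicalNetworkName": "vlan-trunk",
              "vlanID": 204
          }

      Applying both resources (for example with oc apply -f) on a 4.19.1 cluster with worker nodes matching the nodeSelector should be enough to reproduce the repeated sync/error messages shown above.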

      Actual results:

      The error above is logged roughly four times per minute for every NAD. The volume is substantial and it is hard to see real problems through the noise.

      Expected results:

      ovnkube-cluster-manager should sync the localnet NADs once and stay quiet; it should not keep re-queuing them and logging "no cluster network controller to manage topology" several times per minute.

      Additional info:

      If it is a customer / SD issue:

      • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
      • Don’t presume that Engineering has access to Salesforce.
      • Do presume that Engineering will access attachments through supportshell.
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
          • Please provide the UTC timestamp of the networking outage window from the must-gather
          • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
      • When showing the results from commands, include the entire command in the output.  
      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
      • For guidance on using this template, please see OCPBUGS Template Training for Networking components
