OCPBUGS-49662: ovnkube-node pods crash after restarting when NAD is configured on cluster

      Description of problem:

      This bug was found while verifying https://issues.redhat.com/browse/OCPBUGS-48412: with a primary-UDN NetworkAttachmentDefinition configured on the cluster, the ovnkube-node pods go into CrashLoopBackOff after being restarted.

      Version-Release number of selected component (if applicable):

      4.18.0

      build 4.18.0-0.nightly-2025-01-30-093109, openshift/api#2127

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a UDN namespace
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: ns1
        labels:
          k8s.ovn.org/primary-user-defined-network: ""

      2. Create L3 NAD in namespace

      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        name: l3-network-ns1
        namespace: ns1
      spec:
        config: |2
          {
                  "cniVersion": "0.3.1",
                  "name": "l3-network-ns1",
                  "type": "ovn-k8s-cni-overlay",
                  "topology":"layer3",
                  "subnets": "10.20.0.0/16/24",
                  "mtu": 1300,
                  "netAttachDefName": "ns1/l3-network-ns1",
                  "role": "primary"
          }

      3. Restart ovnkube-node pods
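
      For reference, one minimal way to run these steps from the CLI (the manifest file names and the app=ovnkube-node pod label are assumptions; the exact restart method used is not recorded in this report):

      # Steps 1-2: apply the namespace and NAD manifests shown above
      oc apply -f udn-namespace.yaml
      oc apply -f l3-network-ns1-nad.yaml
      # Step 3: delete the ovnkube-node pods so the DaemonSet recreates them
      oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node
      # Watch the replacement pods come back up
      oc -n openshift-ovn-kubernetes get pods -w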

      Actual results:

      oc -n openshift-ovn-kubernetes get pods

      NAME                                     READY   STATUS             RESTARTS        AGE
      ovnkube-control-plane-65d7c9ddf4-blqsb   2/2     Running            0               86m
      ovnkube-control-plane-65d7c9ddf4-pc67l   2/2     Running            0               86m
      ovnkube-node-2fjt7                       7/8     CrashLoopBackOff   10 (114s ago)   28m
      ovnkube-node-64h2p                       7/8     CrashLoopBackOff   10 (94s ago)    28m
      ovnkube-node-7w2vx                       7/8     CrashLoopBackOff   10 (113s ago)   28m
      ovnkube-node-c4r2z                       7/8     CrashLoopBackOff   10 (111s ago)   28m
      ovnkube-node-djq2w                       7/8     CrashLoopBackOff   10 (73s ago)    28m
      ovnkube-node-gclwv                       7/8     CrashLoopBackOff   10 (101s ago)   28m
       
      

      Expected results:

      The ovnkube-node pods should restart and return to Running without any issues.

      Additional info:

      Error found in the ovnkube-controller container of the ovnkube-node pod:

      0130 15:28:02.898200   18833 ovnkube.go:137] failed to run ovnkube: [failed to start network controller: failed to start default network controller: error running OVN Kubernetes Services controller: handler {0x1e34440 0x1e34120 0x1e340c0} was not added to shared informer because it has stopped already, failed to start node network controller: failed to start NAD controller: initial sync failed: failed to sync network l3-network-ns1: [node-nad-controller network controller]: failed to ensure network l3-network-ns1: failed to create network l3-network-ns1: error creating UDN gateway for network l3-network-ns1: failed to get v4 masquerade IP, network l3-network-ns1 (1): failed generating network id '1' v4-masquerade-ips gateway router ip: generated ip 169.254.169.11 from the idx 11 is out of range in the network 169.254.169.0/29]
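
      For context on that range check: 169.254.169.0/29 spans only 2^(32-29) = 8 addresses (169.254.169.0-169.254.169.7), so the masquerade IP generated at index 11 (169.254.169.11, per the log above) cannot fit, while the new 4.18 default 169.254.0.0/17 spans 32768 addresses. A quick shell sanity check of the two sizes:

      # Number of addresses in a /29 vs. a /17 prefix: 2^(32 - prefix_len)
      echo $(( 2 ** (32 - 29) ))   # 8
      echo $(( 2 ** (32 - 17) ))   # 32768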

       

            [OCPBUGS-49662] ovnkube-node pods crash after restarting when NAD is configured on cluster

            Miguel Duarte de Mora Barroso added a comment:

            The issue seems to be that CNO is resetting the masquerade subnet to /29, which renders pretty much all UDNs invalid.

            Nothing for CNV-SDN WG to see. Removing our label from it.

             


            Felix Enrique Llorente Pastora added a comment:

            So the issue is that CNO thinks this is an upgrade and therefore "keeps" the previous masquerade default. The problem is that /29 is hardcoded as the previous default, instead of the /17 default that only applies to new clusters.

            There is this trace:

            1 ovn_kubernetes.go:546] ovnk components: ovnkube-node: isRunning=true, update=true; ovnkube-control-plane: isRunning=true, update=true

            That comes from

             klog.Infof("ovnk components: ovnkube-node: isRunning=%t, update=%t; ovnkube-control-plane: isRunning=%t, update=%t",
                    bootstrapResult.OVN.NodeUpdateStatus != nil, updateNode,                
                    bootstrapResult.OVN.ControlPlaneUpdateStatus != nil, updateControlPlane) 

            So the following condition is false in the linked code that configures the /17 masquerade subnet:

            res.ControlPlaneUpdateStatus == nil && res.NodeUpdateStatus == nil  

            https://github.com/openshift/cluster-network-operator/blob/680b38f0e36e4614430f6b8f8cdea9ec91ea9b95/pkg/network/ovn_kubernetes.go#L1309-L1314
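
            One way to see which default ended up applied is to check the per-node masquerade annotation that ovnkube publishes (the same check Arnab runs further down):

            oc get nodes -o yaml | grep -i node-masquerade-subnet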

             

            Also, this is a non-blocker since:

            • Deleting a DaemonSet is something that should not be done.
            • There is a workaround: delete the NAD and the ovnkube-node pods (see the example commands below).
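
            A rough sketch of that workaround (the namespace ns1 and NAD name l3-network-ns1 come from the reproducer above, and the app=ovnkube-node pod label is an assumption; adjust for the NAD actually in use):

            # Remove the primary-UDN NAD for the affected namespace
            oc -n ns1 delete net-attach-def l3-network-ns1
            # Then delete the crashing ovnkube-node pods so the DaemonSet recreates them
            oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node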

             

            rhn-support-arghosh is preparing a patch for it.

             

             


            Arnab Ghosh added a comment:

            I am not able to reproduce the issue following the reproducer steps mentioned in this Jira.

            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get clusterversion
            NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.18.0-0.test-2025-02-04-100836-ci-ln-026xphb-latest   True        False         59m     Cluster version is 4.18.0-0.test-2025-02-04-100836-ci-ln-026xphb-latest 
            
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get ns ns1 --show-labels
            NAME   STATUS   AGE   LABELS
            ns1    Active   11m   k8s.ovn.org/primary-user-defined-network=,kubernetes.io/metadata.name=ns1,pod-security.kubernetes.io/audit-version=latest,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=latest,pod-security.kubernetes.io/warn=restricted
            
            
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get net-attach-def l3-network-ns1 -n ns1 -oyaml
            apiVersion: k8s.cni.cncf.io/v1
            kind: NetworkAttachmentDefinition
            metadata:
              annotations:
                k8s.ovn.org/network-id: "1"
                k8s.ovn.org/network-name: l3-network-ns1
              name: l3-network-ns1
              namespace: ns1
            spec:
              config: |
                {
                        "cniVersion": "0.3.1",
                        "name": "l3-network-ns1",
                        "type": "ovn-k8s-cni-overlay",
                        "topology":"layer3",
                        "subnets": "10.20.0.0/16/24",
                        "mtu": 1300,
                        "netAttachDefName": "ns1/l3-network-ns1",
                        "role": "primary"
                }

            Then I restarted all ovnkube-node and ovnkube-control-plane pods. All pods are running fine after the restart. I have also checked the node-masquerade-subnet annotation on all nodes, and it has not been reverted to the old default masquerade subnet.

            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get po
            NAME                                    READY   STATUS    RESTARTS   AGE
            ovnkube-control-plane-f995d45f4-p98hd   2/2     Running   0          7m20s
            ovnkube-control-plane-f995d45f4-qdlvr   2/2     Running   0          7m20s
            ovnkube-node-4m4mb                      8/8     Running   0          7m18s
            ovnkube-node-6pfc4                      8/8     Running   0          7m16s
            ovnkube-node-7zcx4                      8/8     Running   0          7m16s
            ovnkube-node-r8kzc                      8/8     Running   0          7m16s
            ovnkube-node-sb59r                      8/8     Running   0          7m18s
            ovnkube-node-vg2gz                      8/8     Running   0          7m16s
             
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get no -oyaml|grep -i masq
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'

            There must be some other steps required to reproduce the issue.

            Note: If the cluster has been upgraded from 4.17, the masquerade subnet remains the same (169.254.169.0/29) for backward compatibility. For a new 4.18 cluster, the default masquerade subnet is set to 169.254.0.0/17. Please refer to the commit below for more info:

            https://github.com/openshift/cluster-network-operator/commit/64ce003cdab4bab61f608871f3f6cced2c80cbd2


            Felix Enrique Llorente Pastora added a comment:

            The latest output from the QE team is that after an ovnkube-node restart the masquerade subnet ignores the configuration and goes back to the hardcoded default /29, which makes it crash loop.

            rhn-support-asood is this about right?


            Felix Enrique Llorente Pastora added a comment:

            There are some docs related to UDN on configuring the masquerade subnet depending on the number of networks:

            https://github.com/openshift/openshift-docs/pull/82764/files#diff-a4571cfec54163a4dbb46db34db71f3ea359a229f8c22eb0413727622160a934R18
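
            A minimal sketch of such a configuration change, assuming the masquerade subnet is set through the gatewayConfig section of the OVN-Kubernetes settings in the Network operator config (the exact field path is an assumption; verify it against the linked docs before applying):

            # Inspect the currently configured IPv4 masquerade subnet (empty means the default is in use)
            oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.ipv4.internalMasqueradeSubnet}'
            # Example: explicitly set the larger /17 subnet used by new 4.18 clusters
            oc patch network.operator.openshift.io cluster --type=merge -p \
              '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"ipv4":{"internalMasqueradeSubnet":"169.254.0.0/17"}}}}}}'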

