OCPBUGS-49662: ovnkube-node pods crash after restarting when NAD is configured on cluster

      Description of problem:

      This bug was found while verifying https://issues.redhat.com/browse/OCPBUGS-48412: with a primary-UDN NetworkAttachmentDefinition configured on the cluster, the ovnkube-node pods go into CrashLoopBackOff after being restarted.

      Version-Release number of selected component (if applicable):

      4.18.0

      build 4.18.0-0.nightly-2025-01-30-093109, openshift/api#2127

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a UDN namespace
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: ns1
        labels:
          k8s.ovn.org/primary-user-defined-network: ""

      2. Create L3 NAD in namespace

      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        name: l3-network-ns1
        namespace: ns1
      spec:
        config: |2
          {
                  "cniVersion": "0.3.1",
                  "name": "l3-network-ns1",
                  "type": "ovn-k8s-cni-overlay",
                  "topology":"layer3",
                  "subnets": "10.20.0.0/16/24",
                  "mtu": 1300,
                  "netAttachDefName": "ns1/l3-network-ns1",
                  "role": "primary"
          }

      3. Restart ovnkube-node pods
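
      For reference, one minimal way to run these steps from the CLI (the manifest file names and the app=ovnkube-node pod label are assumptions; the exact restart method used is not recorded in this report):

      # Steps 1-2: apply the namespace and NAD manifests shown above
      oc apply -f udn-namespace.yaml
      oc apply -f l3-network-ns1-nad.yaml
      # Step 3: delete the ovnkube-node pods so the DaemonSet recreates them
      oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node
      # Watch the replacement pods come back up
      oc -n openshift-ovn-kubernetes get pods -w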

      Actual results:

      oc -n openshift-ovn-kubernetes get pods

      NAME                                     READY   STATUS             RESTARTS        AGE
      ovnkube-control-plane-65d7c9ddf4-blqsb   2/2     Running            0               86m
      ovnkube-control-plane-65d7c9ddf4-pc67l   2/2     Running            0               86m
      ovnkube-node-2fjt7                       7/8     CrashLoopBackOff   10 (114s ago)   28m
      ovnkube-node-64h2p                       7/8     CrashLoopBackOff   10 (94s ago)    28m
      ovnkube-node-7w2vx                       7/8     CrashLoopBackOff   10 (113s ago)   28m
      ovnkube-node-c4r2z                       7/8     CrashLoopBackOff   10 (111s ago)   28m
      ovnkube-node-djq2w                       7/8     CrashLoopBackOff   10 (73s ago)    28m
      ovnkube-node-gclwv                       7/8     CrashLoopBackOff   10 (101s ago)   28m
       
      

      Expected results:

      The ovnkube-node pods should restart and return to Running without any issues.

      Additional info:

      Error found in the ovnkube-controller container of the ovnkube-node pod:

      0130 15:28:02.898200   18833 ovnkube.go:137] failed to run ovnkube: [failed to start network controller: failed to start default network controller: error running OVN Kubernetes Services controller: handler {0x1e34440 0x1e34120 0x1e340c0} was not added to shared informer because it has stopped already, failed to start node network controller: failed to start NAD controller: initial sync failed: failed to sync network l3-network-ns1: [node-nad-controller network controller]: failed to ensure network l3-network-ns1: failed to create network l3-network-ns1: error creating UDN gateway for network l3-network-ns1: failed to get v4 masquerade IP, network l3-network-ns1 (1): failed generating network id '1' v4-masquerade-ips gateway router ip: generated ip 169.254.169.11 from the idx 11 is out of range in the network 169.254.169.0/29]
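
      For context on that range check: 169.254.169.0/29 spans only 2^(32-29) = 8 addresses (169.254.169.0-169.254.169.7), so the masquerade IP generated at index 11 (169.254.169.11, per the log above) cannot fit, while the new 4.18 default 169.254.0.0/17 spans 32768 addresses. A quick shell sanity check of the two sizes:

      # Number of addresses in a /29 vs. a /17 prefix: 2^(32 - prefix_len)
      echo $(( 2 ** (32 - 29) ))   # 8
      echo $(( 2 ** (32 - 17) ))   # 32768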

       

            [OCPBUGS-49662] ovnkube-node pods crash after restarting when NAD is configured on cluster

            Miguel Duarte de Mora Barroso added a comment:

            The issue seems to be that CNO is resetting the masquerade subnet to /29, which renders pretty much all UDNs invalid.

            Nothing for CNV-SDN WG to see. Removing our label from it.

             


            Felix Enrique Llorente Pastora added a comment:

            So the issue is that CNO thinks this is an upgrade and therefore "keeps" the previous masquerade default. The problem is that /29 is hardcoded as the previous default, instead of the /17 default that only applies to new clusters.

            There is this trace:

            1 ovn_kubernetes.go:546] ovnk components: ovnkube-node: isRunning=true, update=true; ovnkube-control-plane: isRunning=true, update=true

            That comes from

             klog.Infof("ovnk components: ovnkube-node: isRunning=%t, update=%t; ovnkube-control-plane: isRunning=%t, update=%t",
                    bootstrapResult.OVN.NodeUpdateStatus != nil, updateNode,                
                    bootstrapResult.OVN.ControlPlaneUpdateStatus != nil, updateControlPlane) 

            So the following condition is false in the linked code that configures the /17 masquerade subnet:

            res.ControlPlaneUpdateStatus == nil && res.NodeUpdateStatus == nil  

            https://github.com/openshift/cluster-network-operator/blob/680b38f0e36e4614430f6b8f8cdea9ec91ea9b95/pkg/network/ovn_kubernetes.go#L1309-L1314
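
            One way to see which default ended up applied is to check the per-node masquerade annotation that ovnkube publishes (the same check Arnab runs further down):

            oc get nodes -o yaml | grep -i node-masquerade-subnet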

             

            Also, this is a non-blocker since:

            • Deleting a DaemonSet is something that should not be done.
            • There is a workaround: delete the NAD and the ovnkube-node pods (see the example commands below).
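
            A rough sketch of that workaround (the namespace ns1 and NAD name l3-network-ns1 come from the reproducer above, and the app=ovnkube-node pod label is an assumption; adjust for the NAD actually in use):

            # Remove the primary-UDN NAD for the affected namespace
            oc -n ns1 delete net-attach-def l3-network-ns1
            # Then delete the crashing ovnkube-node pods so the DaemonSet recreates them
            oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node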

             

            rhn-support-arghosh is preparing a patch for it.

             

             


            Arnab Ghosh added a comment:

            I am not able to reproduce the issue following the reproducer steps mentioned in this Jira.

            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get clusterversion
            NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.18.0-0.test-2025-02-04-100836-ci-ln-026xphb-latest   True        False         59m     Cluster version is 4.18.0-0.test-2025-02-04-100836-ci-ln-026xphb-latest 
            
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get ns ns1 --show-labels
            NAME   STATUS   AGE   LABELS
            ns1    Active   11m   k8s.ovn.org/primary-user-defined-network=,kubernetes.io/metadata.name=ns1,pod-security.kubernetes.io/audit-version=latest,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/warn-version=latest,pod-security.kubernetes.io/warn=restricted
            
            
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get net-attach-def l3-network-ns1 -n ns1 -oyaml
            apiVersion: k8s.cni.cncf.io/v1
            kind: NetworkAttachmentDefinition
            metadata:
              annotations:
                k8s.ovn.org/network-id: "1"
                k8s.ovn.org/network-name: l3-network-ns1
              name: l3-network-ns1
              namespace: ns1
            spec:
              config: |
                {
                        "cniVersion": "0.3.1",
                        "name": "l3-network-ns1",
                        "type": "ovn-k8s-cni-overlay",
                        "topology":"layer3",
                        "subnets": "10.20.0.0/16/24",
                        "mtu": 1300,
                        "netAttachDefName": "ns1/l3-network-ns1",
                        "role": "primary"
                }

            Then I restarted all ovnkube-node and ovnkube-control-plane pods. All pods are running fine after the restart. I have also checked the node-masquerade-subnet annotation on all nodes, and it has not been reverted to the old default masquerade subnet.

            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get po
            NAME                                    READY   STATUS    RESTARTS   AGE
            ovnkube-control-plane-f995d45f4-p98hd   2/2     Running   0          7m20s
            ovnkube-control-plane-f995d45f4-qdlvr   2/2     Running   0          7m20s
            ovnkube-node-4m4mb                      8/8     Running   0          7m18s
            ovnkube-node-6pfc4                      8/8     Running   0          7m16s
            ovnkube-node-7zcx4                      8/8     Running   0          7m16s
            ovnkube-node-r8kzc                      8/8     Running   0          7m16s
            ovnkube-node-sb59r                      8/8     Running   0          7m18s
            ovnkube-node-vg2gz                      8/8     Running   0          7m16s
             
            [arghosh@arghosh-thinkpadp1gen3 ~]$ oc get no -oyaml|grep -i masq
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'
                  k8s.ovn.org/node-masquerade-subnet: '{"ipv4":"169.254.0.0/17","ipv6":"fd69::/112"}'

            There must be some other steps required to reproduce the issue.

            Note: If the cluster has been upgraded from 4.17, the masquerade subnet remains the same (169.254.169.0/29) for backward compatibility. For a new 4.18 cluster, the default masquerade subnet is set to 169.254.0.0/17. Please refer to the commit below for more info:

            https://github.com/openshift/cluster-network-operator/commit/64ce003cdab4bab61f608871f3f6cced2c80cbd2


            Felix Enrique Llorente Pastora added a comment:

            The latest output from the QE team is that after an ovnkube-node restart the masquerade subnet ignores the configuration and goes back to the hardcoded default /29, which makes it crash loop.

            rhn-support-asood is this about right?


            Felix Enrique Llorente Pastora added a comment:

            There are some docs related to UDN on configuring the masquerade subnet depending on the number of networks:

            https://github.com/openshift/openshift-docs/pull/82764/files#diff-a4571cfec54163a4dbb46db34db71f3ea359a229f8c22eb0413727622160a934R18
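
            A minimal sketch of such a configuration change, assuming the masquerade subnet is set through the gatewayConfig section of the OVN-Kubernetes settings in the Network operator config (the exact field path is an assumption; verify it against the linked docs before applying):

            # Inspect the currently configured IPv4 masquerade subnet (empty means the default is in use)
            oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.ipv4.internalMasqueradeSubnet}'
            # Example: explicitly set the larger /17 subnet used by new 4.18 clusters
            oc patch network.operator.openshift.io cluster --type=merge -p \
              '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"ipv4":{"internalMasqueradeSubnet":"169.254.0.0/17"}}}}}}'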

