Bug
Resolution: Not a Bug
Normal
4.18, 4.18.0, 4.18.z
Quality / Stability / Reliability
Description of problem:
When a UserDefinedNetwork (UDN) object exists in the cluster and any node is rebooted, that node becomes NotReady.
Version-Release number of selected component (if applicable): 4.18.6
How reproducible: Every time
Steps to Reproduce:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.6    True        False         2d21h   Cluster version is 4.18.6
=> Created a namespace called udn1 with the primary-UDN label:
$ cat udn1-ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: udn1
  labels:
    k8s.ovn.org/primary-user-defined-network: ""
=> Created a Layer2 UDN using the YAML below:
$ cat udn1-l2.yaml
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: udn-1-l2
  namespace: udn1
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
      - "192.168.123.0/24"
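(Not part of the original report.) Since the crash later in this report involves EgressIP no-reroute policies matched against this subnet, one quick sanity check when picking a UDN subnet is that it does not overlap the cluster's default CIDRs. A minimal sketch, assuming the common OCP defaults of 10.128.0.0/14 (cluster network) and 172.30.0.0/16 (service network); the real values should be confirmed against `oc get network.config cluster -o yaml`:

```python
import ipaddress

# Subnet from the UserDefinedNetwork spec above.
udn_subnet = ipaddress.ip_network("192.168.123.0/24")

# Assumed defaults -- confirm against the actual cluster configuration.
cluster_cidrs = {
    "cluster network": ipaddress.ip_network("10.128.0.0/14"),
    "service network": ipaddress.ip_network("172.30.0.0/16"),
}

for name, cidr in cluster_cidrs.items():
    status = "OVERLAP" if udn_subnet.overlaps(cidr) else "ok"
    print(f"{status}: {udn_subnet} vs {name} {cidr}")
```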
=> Created a privileged pod using the following YAML:
$ cat dep.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ub24-00
  namespace: udn1
spec:
  selector:
    matchLabels:
      app: ub24-00
  replicas: 2
  template:
    metadata:
      labels:
        app: ub24-00
    spec:
      serviceAccount: demo
      securityContext:
        runAsUser: 0
      containers:
      - name: ub24-00
        image: ghcr.io/rameshsahoo111/ub24:latest
        command: ["/bin/sleep"]
        args: ["infinity"]
        securityContext:
          privileged: true
          runAsUser: 0
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
        imagePullPolicy: Always
# Before creating the deployment, create the service account and grant it the privileged SCC:
# oc create sa demo
# oc adm policy add-scc-to-user privileged -z demo
=> Reboot any node in the cluster.
Actual results: The rebooted node becomes NotReady.
-> The ovnkube-node pod on the rebooted node is stuck at 7/8:

$ oc get po -o wide
NAME                                    READY   STATUS             RESTARTS        AGE     IP               NODE       NOMINATED NODE   READINESS GATES
ovnkube-control-plane-86567b4c5-57x78   2/2     Running            0               5d22h   192.168.100.7    master-1   <none>           <none>
ovnkube-control-plane-86567b4c5-dmtgc   2/2     Running            0               5d22h   192.168.100.8    master-2   <none>           <none>
ovnkube-node-bmxcc                      8/8     Running            9 (5d21h ago)   5d22h   192.168.100.9    worker-0   <none>           <none>
ovnkube-node-dgf49                      8/8     Running            8               5d22h   192.168.100.6    master-0   <none>           <none>
ovnkube-node-l79p4                      8/8     Running            9 (5d22h ago)   5d22h   192.168.100.8    master-2   <none>           <none>
ovnkube-node-mdtxz                      7/8     CrashLoopBackOff   22 (69s ago)    18h     192.168.100.11   worker-2   <none>           <none>
ovnkube-node-p8l8x                      8/8     Running            9               18h     192.168.100.10   worker-1   <none>           <none>
ovnkube-node-s645g                      8/8     Running            9 (5d22h ago)   5d22h   192.168.100.7    master-1   <none>           <none>

-> Upon checking, the ovnkube-controller container is in CrashLoopBackOff:

- containerID: cri-o://578955777d13e775142cf38f7228a0dffa2a1c6fcd3110f6940f580e8e7f21ad
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5adc4825487255b6d29defb8cebbde1cd7a701b90fa78c4cb6e8b0230458a940
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5adc4825487255b6d29defb8cebbde1cd7a701b90fa78c4cb6e8b0230458a940
  lastState:
    terminated:
      containerID: cri-o://578955777d13e775142cf38f7228a0dffa2a1c6fcd3110f6940f580e8e7f21ad
      exitCode: 1
      finishedAt: "2025-04-07T12:34:51Z"
      message: |-
        == 192.168.123.0/24 && ip4.dst == 192.168.123.0/24 Nexthop:<nil> Nexthops:[] Options:map[] Priority:102} on router GR_udn1_udn.1.l2_worker-2: object not found
        E0407 12:34:51.251115 21787 factory.go:1320] Failed (will retry) while processing existing *v1.Node items: failed to initialize networks cluster logical router egress policies for network udn1_udn-1-l2: failed to create no reroute policies for pods on network udn1_udn-1-l2: unable to create IPv4 no-reroute pod policies, err: error creating logical router policy {UUID:u2351433412 Action:allow BFDSessions:[] ExternalIDs:map[ip-family:ip4 k8s.ovn.org/id:default-network-controller:EgressIP:102:EIP-No-Reroute-Pod-To-Pod:ip4:udn1_udn-1-l2 k8s.ovn.org/name:EIP-No-Reroute-Pod-To-Pod k8s.ovn.org/owner-controller:default-network-controller k8s.ovn.org/owner-type:EgressIP network:udn1_udn-1-l2 priority:102] Match:ip4.src == 192.168.123.0/24 && ip4.dst == 192.168.123.0/24 Nexthop:<nil> Nexthops:[] Options:map[] Priority:102} on router GR_udn1_udn.1.l2_worker-2: object not found
        I0407 12:34:51.252043 21787 factory.go:656] Stopping watch factory
        I0407 12:34:51.252120 21787 ovnkube.go:599] Stopped ovnkube
        I0407 12:34:51.252239 21787 metrics.go:553] Stopping metrics server at address "127.0.0.1:29103"
        F0407 12:34:51.252510 21787 ovnkube.go:137] failed to run ovnkube: [failed to start network c
      reason: Error
      startedAt: "2025-04-07T12:33:49Z"
  name: ovnkube-controller
  ready: false
  restartCount: 8
  started: false
  state:
    waiting:
      message: back-off 5m0s restarting failed container=ovnkube-controller pod=ovnkube-node-mdtxz_openshift-ovn-kubernetes(73dc8d00-edc5-48f8-a82a-1ec081112578)
      reason: CrashLoopBackOff
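For triage, the klog-formatted lines in the crash message above can be filtered by severity (E = error, F = fatal). A minimal sketch using an inline excerpt of the log above (truncated); on a live cluster the full log would come from `oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovnkube-controller --previous`:

```python
import re

# Excerpt of klog lines from the ovnkube-controller crash above (truncated).
log = """\
E0407 12:34:51.251115 21787 factory.go:1320] Failed (will retry) while processing existing *v1.Node items: failed to initialize networks cluster logical router egress policies for network udn1_udn-1-l2
I0407 12:34:51.252043 21787 factory.go:656] Stopping watch factory
I0407 12:34:51.252120 21787 ovnkube.go:599] Stopped ovnkube
F0407 12:34:51.252510 21787 ovnkube.go:137] failed to run ovnkube: [failed to start network c
"""

# klog lines start with a severity letter (I/W/E/F), MMDD, then HH:MM:SS.micros.
KLOG = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")

def errors_and_fatals(text):
    """Return only error- and fatal-level klog lines."""
    out = []
    for line in text.splitlines():
        m = KLOG.match(line)
        if m and m.group(1) in ("E", "F"):
            out.append(line)
    return out

for line in errors_and_fatals(log):
    print(line)
```

Here the fatal line shows the process exiting after the EgressIP policy failure, which is what drives the container into CrashLoopBackOff.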
Expected results: The rebooted node should rejoin the cluster and become Ready.
Additional info:
Attaching the network must-gather (MG) to the Jira.