
Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.19.0
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:OVNK:BGP

Severity: Critical
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

After converting from SGW (shared gateway mode) to LGW (local gateway mode) with a UDN advertised, the ovnkube-node pod crashed.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1. Create a cluster with SGW

2. Create a UDN in the namespace, along with a RouteAdvertisements resource

apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: l3-udn
  labels:
    app: udn
spec:
  topology: Layer3
  layer3:
    role: Primary
    subnets:
    - cidr: "22.100.0.0/16"
      hostSubnet: 24
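
As a quick sanity check on the addressing in this manifest, the sketch below uses Python's standard `ipaddress` module to carve the `22.100.0.0/16` CIDR into `/24` per-node subnets, as the `hostSubnet: 24` field suggests. The helper name `node_subnet_for` is hypothetical and exists only for illustration; it is not part of ovn-kubernetes.

```python
import ipaddress

# Values taken from the UserDefinedNetwork manifest above.
UDN_CIDR = ipaddress.ip_network("22.100.0.0/16")
HOST_SUBNET_PREFIX = 24


def node_subnet_for(ip: str) -> ipaddress.IPv4Network:
    """Return the per-node subnet that contains ip (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    if addr not in UDN_CIDR:
        raise ValueError(f"{ip} is outside {UDN_CIDR}")
    # strict=False masks the host bits down to the /24 boundary.
    return ipaddress.ip_network(f"{ip}/{HOST_SUBNET_PREFIX}", strict=False)


# Count the /24 node subnets that fit into the /16.
node_subnet_count = sum(1 for _ in UDN_CIDR.subnets(new_prefix=HOST_SUBNET_PREFIX))

if __name__ == "__main__":
    print(node_subnet_count)
    print(node_subnet_for("22.100.4.4"))
```

Under these assumptions, any address seen in the report (for example a pod IP or a route gateway) can be mapped back to the node subnet it belongs to.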

3. Advertise the UDN with a RouteAdvertisements (RA) resource

oc get ra udn -o yaml
apiVersion: k8s.ovn.org/v1
kind: RouteAdvertisements
metadata:
  creationTimestamp: "2025-01-17T08:13:21Z"
  generation: 1
  name: udn
  resourceVersion: "632296"
  uid: 518debf6-9916-4bf1-99e3-e181739334c9
spec:
  advertisements:
  - PodNetwork
  networkSelector:
    matchLabels:
      app: udn
status:
  conditions:
  - lastTransitionTime: "2025-01-17T12:25:05Z"
    message: ovn-kubernetes cluster-manager validated the resource and requested the
      necessary configuration changes
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  status: Accepted

4. Create a test pod in the namespace

oc rsh -n z1 test-rc-bpm4p ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if58: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
    link/ether 0a:58:0a:81:02:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.129.2.13/23 brd 10.129.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fe81:20d/64 scope link
       valid_lft forever preferred_lft forever
3: ovn-udn1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
    link/ether 0a:58:16:64:04:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 22.100.4.4/24 brd 22.100.4.255 scope global ovn-udn1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:16ff:fe64:404/64 scope link
       valid_lft forever preferred_lft forever

5. Access the UDN pod from the external router

curl 22.100.4.4:8080
Hello OpenShift!

6. Convert to LGW at runtime

oc patch network.operator cluster -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost": true}}}}}' --type=merge
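
Because the nested JSON in this patch is easy to mistype by hand, one option is to generate it. The sketch below builds the same merge-patch body with Python's `json` module and prints the corresponding `oc patch` command line. The function names `gateway_mode_patch` and `oc_patch_command` are hypothetical helpers for illustration, not an OpenShift API.

```python
import json
import shlex


def gateway_mode_patch(routing_via_host: bool) -> str:
    """Build the merge-patch body that sets routingViaHost (hypothetical helper)."""
    patch = {
        "spec": {
            "defaultNetwork": {
                "ovnKubernetesConfig": {
                    "gatewayConfig": {"routingViaHost": routing_via_host}
                }
            }
        }
    }
    return json.dumps(patch)


def oc_patch_command(routing_via_host: bool) -> str:
    """Return a shell-quoted oc patch command line for the given mode."""
    args = [
        "oc", "patch", "network.operator", "cluster",
        "-p", gateway_mode_patch(routing_via_host),
        "--type=merge",
    ]
    # shlex.quote wraps the JSON argument in single quotes for the shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # True corresponds to the LGW setting used in this step.
    print(oc_patch_command(True))
```

Generating the body this way guarantees balanced braces; passing `False` would produce the patch for switching back to SGW.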

7. Check the pods in the openshift-ovn-kubernetes namespace

oc get pod -n openshift-ovn-kubernetes
NAME                                     READY   STATUS             RESTARTS      AGE
ovnkube-control-plane-57cb55c887-45jgz   2/2     Running            0             24h
ovnkube-control-plane-57cb55c887-sf4p5   2/2     Running            0             24h
ovnkube-node-27dkl                       7/8     CrashLoopBackOff   7 (10s ago)   11m
ovnkube-node-7s866                       8/8     Running            0             14m
ovnkube-node-8cvz5                       8/8     Running            0             13m
ovnkube-node-dxjmk                       8/8     Running            0             14m
ovnkube-node-qk9rd                       8/8     Running            0             15m
ovnkube-node-xdzsj                       8/8     Running            0             15m
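
To spot the failing pod in a listing like this without reading every row, a small filter can help. The sketch below parses the whitespace-separated text and reports pods whose READY count is incomplete. The function name `not_ready_pods` is hypothetical, and the parsing assumes the column layout shown in this listing.

```python
SAMPLE_LISTING = """\
NAME READY STATUS RESTARTS AGE
ovnkube-control-plane-57cb55c887-45jgz 2/2 Running 0 24h
ovnkube-control-plane-57cb55c887-sf4p5 2/2 Running 0 24h
ovnkube-node-27dkl 7/8 CrashLoopBackOff 7 (10s ago) 11m
ovnkube-node-7s866 8/8 Running 0 14m
ovnkube-node-8cvz5 8/8 Running 0 13m
ovnkube-node-dxjmk 8/8 Running 0 14m
ovnkube-node-qk9rd 8/8 Running 0 15m
ovnkube-node-xdzsj 8/8 Running 0 15m
"""


def not_ready_pods(listing: str) -> list:
    """Return (name, ready, status) tuples for pods with fewer ready containers than expected."""
    results = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready, status = fields[0], fields[1], fields[2]
        have, want = ready.split("/")
        if int(have) < int(want):
            results.append((name, ready, status))
    return results


if __name__ == "__main__":
    for name, ready, status in not_ready_pods(SAMPLE_LISTING):
        print(f"{name}: {ready} {status}")
```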

Actual results:

The ovnkube-node pod crashed with the following error:

I0117 14:07:13.802611 1750314 obj_retry.go:459] Detected object openshift-kube-controller-manager/revision-pruner-8-master-2 of type *v1.Pod in terminal state (e.g. completed) during add event: will remove it
I0117 14:07:13.802631 1750314 pods.go:174] Deleting pod: openshift-kube-controller-manager/revision-pruner-8-master-2
W0117 14:07:13.802663 1750314 base_network_controller_pods.go:222] No cached port info for deleting pod default/openshift-kube-apiserver/installer-11-master-2. Using logical switch master-2 port uuid and addrs [10.130.0.51/23]
W0117 14:07:13.802700 1750314 base_network_controller_pods.go:222] No cached port info for deleting pod default/openshift-kube-controller-manager/revision-pruner-8-master-2. Using logical switch master-2 port uuid and addrs [10.130.0.14/23]
I0117 14:07:13.802717 1750314 base_network_controller_pods.go:1027] Completed pod openshift-kube-controller-manager/revision-pruner-8-master-2 was already released for nad default before startup
I0117 14:07:13.803539 1750314 pods.go:252] [openshift-network-diagnostics/network-check-target-6mzjm] addLogicalPort took 1.374018ms, libovsdb time 1.115711ms
I0117 14:07:13.803802 1750314 pods.go:217] Attempting to release IPs for pod: openshift-kube-apiserver/installer-11-master-2, ips: 10.130.0.51
I0117 14:07:13.803915 1750314 ovnkube.go:599] Stopped ovnkube
I0117 14:07:13.803937 1750314 metrics.go:553] Stopping metrics server at address "127.0.0.1:29103"
F0117 14:07:13.803991 1750314 ovnkube.go:137] failed to run ovnkube: [failed to start network controller: failed to start default network controller: unable to create admin network policy controller, err: could not add Event Handler for anpInformer during admin network policy controller initialization, handler {0x1fcb100 0x1fcade0 0x1fcad80} was not added to shared informer because it has stopped already, failed to start node network controller: failed to start NAD controller: initial sync failed: failed to sync network z1.l3-udn: [node-nad-controller network controller]: failed to ensure network z1.l3-udn: failed to start network z1.l3-udn: failed to add network to node gateway for network z1.l3-udn at node master-2: could not add VRF mp10-udn-vrf routes for network z1.l3-udn, err: failed to add route {Ifindex: 81 Dst: 169.254.0.3/32 Src: <nil> Gw: 22.100.1.1 Flags: [] Table: 1081 Realm: 0} for VRF device mp10-udn-vrf, err: route manager: failed to add route ({Ifindex: 81 Dst: 169.254.0.3/32 Src: <nil> Gw: 22.100.1.1 Flags: [] Table: 1081 Realm: 0}): failed to apply route ({Ifindex: 81 Dst: 169.254.0.3/32 Src: <nil> Gw: 22.100.1.1 Flags: [] Table: 1081 Realm: 0}): failed to add route (gw: 22.100.1.1, subnet 169.254.0.3/32, mtu 0, src IP <nil>): network is unreachable]
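
When triaging, it can help to pull the failing route out of the fatal message. The sketch below uses a regular expression to extract the fields of the route description as it is printed in the log above. The pattern is an assumption derived from this one log line, not a documented log format, and `parse_route` is a hypothetical helper name.

```python
import re

# Pattern inferred from the route text in the fatal log line (an assumption).
ROUTE_RE = re.compile(
    r"\{Ifindex: (?P<ifindex>\d+) Dst: (?P<dst>\S+) Src: (?P<src>\S+) "
    r"Gw: (?P<gw>\S+) Flags: \[(?P<flags>[^\]]*)\] "
    r"Table: (?P<table>\d+) Realm: (?P<realm>\d+)\}"
)

SAMPLE_MESSAGE = (
    "failed to add route {Ifindex: 81 Dst: 169.254.0.3/32 Src: <nil> "
    "Gw: 22.100.1.1 Flags: [] Table: 1081 Realm: 0} for VRF device mp10-udn-vrf"
)


def parse_route(message: str):
    """Return the first route found in message as a dict, or None if absent."""
    match = ROUTE_RE.search(message)
    if match is None:
        return None
    route = match.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    for key in ("ifindex", "table", "realm"):
        route[key] = int(route[key])
    return route


if __name__ == "__main__":
    print(parse_route(SAMPLE_MESSAGE))
```

With the table ID and gateway extracted, one could, for instance, inspect that routing table on the affected node with `ip route show table 1081` to see whether a route covering the gateway `22.100.1.1` is present.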

Expected results:

Additional info:
