Bug
Resolution: Unresolved
Major
None
4.18
Important
No
Rejected
False
Description of problem:
Cross-node UDN pod connectivity is broken for layer2 after restarting the OVN pods.
Version-Release number of selected component (if applicable):
build 4.18.0-0.nightly,openshift/api#1997,openshift/ovn-kubernetes#2334
How reproducible:
Always
Steps to Reproduce:
1. Create a namespace test2
2. Create a layer2 UserDefinedNetwork CR in test2
3. Create a service and pods in test2 (example manifests for steps 2 and 3 are sketched after the output below)
% oc get svc -n test2
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
test-service   ClusterIP   172.30.94.248   <none>        27017/TCP   31s
% oc get pods -n test2
NAME            READY   STATUS    RESTARTS   AGE
hello-pod       1/1     Running   0          5s
test-rc-6kf5l   1/1     Running   0          18s
test-rc-g4nv2   1/1     Running   0          18s
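For reference (steps 2 and 3), the layer2 UDN and the service would look roughly like the sketch below. The object names, subnet, selector, and ports are assumptions reconstructed from the output above, not the exact manifests used in this test:
% cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: udn-l2            # assumed name
  namespace: test2
spec:
  topology: Layer2
  layer2:
    role: Primary
    subnets:
    - 10.200.0.0/16       # assumed; the observed pod IPs fall in this range
---
apiVersion: v1
kind: Service
metadata:
  name: test-service
  namespace: test2
spec:
  selector:
    name: test-pods       # assumed selector
  ports:
  - port: 27017           # cluster IP port seen above
    protocol: TCP
    targetPort: 8080      # backend pods serve on 8080 (see pod-to-pod curl below)
EOF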
Before restarting the OVN pods, check pod-to-service connectivity; there are no issues:
% oc rsh -n test2 hello-pod
~ $ while true; do curl 172.30.94.248:27017 --connect-timeout 5; sleep 2; echo ""; done
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Then restart the OVN pods:
% oc delete pods --all -n openshift-ovn-kubernetes
pod "ovnkube-control-plane-58b858b9fd-md59k" deleted
pod "ovnkube-control-plane-58b858b9fd-wzcjq" deleted
pod "ovnkube-node-h57tt" deleted
pod "ovnkube-node-l8jjj" deleted
pod "ovnkube-node-pbbpz" deleted
pod "ovnkube-node-pkfbd" deleted
pod "ovnkube-node-s8djs" deleted
pod "ovnkube-node-vprtg" deleted
% oc get pods -n openshift-ovn-kubernetes
NAME                                     READY   STATUS    RESTARTS   AGE
ovnkube-control-plane-58b858b9fd-4rn8t   2/2     Running   0          98s
ovnkube-control-plane-58b858b9fd-w24n6   2/2     Running   0          98s
ovnkube-node-9n7h5                       8/8     Running   0          96s
ovnkube-node-b8579                       8/8     Running   0          94s
ovnkube-node-f7t5n                       8/8     Running   0          96s
ovnkube-node-flzzx                       8/8     Running   0          94s
ovnkube-node-k9tmd                       8/8     Running   0          95s
ovnkube-node-nt8st                       8/8     Running   0          94s
Check the pod-to-service connection again; requests are now intermittently dropped:
% oc rsh -n test2 hello-pod
~ $ while true; do curl 172.30.94.248:27017 --connect-timeout 5; sleep 2; echo ""; done
curl: (28) Connection timeout after 5000 ms
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
curl: (28) Connection timeout after 5000 ms
curl: (28) Connection timeout after 5000 ms
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
Hello OpenShift!
curl: (28) Connection timeout after 5001 ms
curl: (28) Connection timeout after 5001 ms
Then check the backend pods of the service and their UDN addresses:
% oc get pods -n test2 -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                  NOMINATED NODE   READINESS GATES
hello-pod       1/1     Running   0          11m   10.131.0.56   huirwang-1104a-q6h7m-worker-b-qtfrk   <none>           <none>
test-rc-6kf5l   1/1     Running   0          12m   10.129.2.38   huirwang-1104a-q6h7m-worker-c-ggglk   <none>           <none>
test-rc-g4nv2   1/1     Running   0          12m   10.131.0.55   huirwang-1104a-q6h7m-worker-b-qtfrk   <none>           <none>
% oc exec -n test2 test-rc-6kf5l -- ip a show ovn-udn1
3: ovn-udn1@if105: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default
    link/ether 0a:58:0a:c8:04:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.4.3/24 brd 10.200.4.255 scope global ovn-udn1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fec8:403/64 scope link
       valid_lft forever preferred_lft forever
% oc exec -n test2 test-rc-g4nv2 -- ip a show ovn-udn1
3: ovn-udn1@if136: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1360 qdisc noqueue state UP group default
    link/ether 0a:58:0a:c8:03:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.200.3.4/24 brd 10.200.3.255 scope global ovn-udn1
       valid_lft forever preferred_lft forever
    inet6 fe80::858:aff:fec8:304/64 scope link
       valid_lft forever preferred_lft forever
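The service endpoints can also be listed to confirm that both backend pods are registered (a standard oc command; this output was not captured in the original run):
% oc get endpoints -n test2 test-service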
From the UDN client pod, access both UDN backend pods directly: the pod on the same node as the client is reachable, but the pod on a different node is not:
% oc rsh -n test2 hello-pod
~ $ curl 10.200.3.4:8080
Hello OpenShift!
~ $ curl 10.200.3.4:8080
Hello OpenShift!
~ $ curl 10.200.3.4:8080
Hello OpenShift!
~ $ curl 10.200.3.4:8080
Hello OpenShift!
~ $ curl 10.200.4.3:8080
% oc rsh -n test2 hello-pod
~ $ curl 10.200.4.3:8080 --connect-timeout 10
curl: (28) Connection timeout after 10000 ms
Actual results:
Cross-node UDN pod connectivity is broken for layer2 after restarting the OVN pods.
Expected results:
Pod-to-pod connectivity should not be broken after restarting the OVN pods.
Additional info:
The same test with a layer3 UDN does not hit this issue.
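For comparison, the layer3 UDN used for that check would have roughly this shape (a sketch only; the actual manifest is not part of this report, and the names and subnet below are assumptions):
% cat <<'EOF' | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: udn-l3            # assumed name
  namespace: test2
spec:
  topology: Layer3
  layer3:
    role: Primary
    subnets:
    - cidr: 10.150.0.0/16 # assumed
      hostSubnet: 24      # assumed per-node prefix length
EOF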
Affected Platforms: