Type: Bug
Resolution: Unresolved
Priority: Normal
Target Version: 4.18.z
Severity: Moderate
Description of problem:
The node was cordoned after multiple workloads entered the "CrashLoopBackOff / Error" state.
In parallel, ovn-controller on the node showed > 90% CPU usage with long poll intervals. This intermittently affected scheduling stability and service availability for workloads.
On this particular node, users noticed that some pods were unable to reach the Kubernetes API service.
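Roughly, these symptoms can be confirmed with checks like the following sketch. The namespace, pod and container names are the usual OVN-Kubernetes ones and may differ per deployment, the last command assumes curl is available in the pod image, and the API service address 10.224.0.1:443 is the one seen in the trace below.
# Confirm ovn-controller CPU usage on the node
$> top -b -n 1 | grep ovn-controller
# Look for long poll interval warnings from ovn-controller (logged as
# "Unreasonably long ... ms poll interval")
$> oc -n openshift-ovn-kubernetes logs <ovnkube-node-pod> -c ovn-controller | grep -i 'poll interval'
# From one of the affected pods, check reachability of the Kubernetes API service
$> oc rsh <affected-pod> curl -k -m 5 https://10.224.0.1:443/version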
We traced the OVS flows used for these calls:
$> ovs-appctl ofproto/trace br-int in_port=3967,tcp,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.224.0.1,tp_dst=443,nw_ttl=42,dp_hash=3 | tail -30
resubmit(,28)
28. ip,metadata=0x5, priority 1, cookie 0xfe2c80da
push:NXM_NX_REG0[]
push:NXM_NX_XXREG0[96..127]
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xa3c1201
set_field:00:00:00:00:00:00->eth_dst
resubmit(,66)
66. No match.
drop
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xa3c1201
resubmit(,29)
29. metadata=0x5, priority 0, cookie 0x1de89847
resubmit(,30)
30. metadata=0x5, priority 0, cookie 0xcfe11f91
resubmit(,31)
31. metadata=0x5, priority 0, cookie 0xd391eba8
resubmit(,32)
32. ip,reg0=0xa3c1200/0xfffffe00,reg15=0x2,metadata=0x5, priority 110, cookie 0x3dc6457a
set_field:0/0xf0000000->reg10
resubmit(,33)
33. ip,metadata=0x5,dl_dst=00:00:00:00:00:00, priority 100, cookie 0x81f3bdfb
controller(userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.ff.ff.00.00.00.1c.00.18.00.20.00.40.00.00.00.00.00.01.de.10.80.00.2c.04.00.00.00.00.00.1c.00.18.00.20.00.60.00.00.00.00.00.01.de.10.80.00.2e.04.00.00.00.00.00.19.00.10.80.00.2a.02.00.01.00.00.00.00.00.00.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.28.00.00.00,pause,meter_id=41)
pop:NXM_OF_IN_PORT[]
-> NXM_OF_IN_PORT[] is now 3967
Final flow: recirc_id=0x4c6d735,dp_hash=0x3,eth,tcp,reg0=0x300,reg11=0x3,reg12=0x2,reg13=0x66,reg14=0xfbf,reg15=0x1,metadata=0x3,in_port=3967,vlan_tci=0x0000,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.19.137.104,nw_tos=0,nw_ecn=0,nw_ttl=42,nw_frag=no,tp_src=0,tp_dst=6443,tcp_flags=0
Megaflow: pkt_mark=0,recirc_id=0x4c6d735,ct_state=+new-est-rel-rpl-inv+trk,ct_mark=0x2/0xf,eth,tcp,in_port=3967,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.16/28,nw_dst=10.19.137.104,nw_ttl=42,nw_frag=no,tp_dst=6443
Datapath actions: ct(commit,zone=102,mark=0/0x1,nat(src)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=80139629,rule_cookie=0x57ec6443,controller_id=0,max_len=65535))))),le(set(eth(src=0a:58:64:40:00:01,dst=0a:58:64:40:00:f3)),set(ipv4(ttl=41)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=99,rule_cookie=0x9dc0c4f3,controller_id=0,max_len=65535))))),le(set(eth(src=7c:1e:52:fb:5f:d9,dst=00:00:00:00:00:00)),set(ipv4(ttl=40))))))
Here, table 66, which holds the ARP resolution flows, has no matching flow, so the packet is punted to ovn-controller to resolve ARP. On the rest of the nodes the flow is present in table 66 and the traffic works.
This looks like a bug in ovn-controller (similar to OCPBUGS-53151), where ovn-controller fails to install the ARP resolution flows.
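A minimal sketch of how the missing flow can be verified; the reg0 value 0xa3c1201 (which decodes to the next-hop IP 10.60.18.1) is taken from the trace above, and how ovn-sbctl reaches the node-local Southbound DB depends on the deployment, so treat these as illustrative rather than the exact commands used:
# Affected node: no output, the ARP resolution flow is missing
$> ovs-ofctl dump-flows br-int table=66 | grep 'reg0=0xa3c1201'
# A healthy node returns a flow here that sets eth_dst to the resolved MAC
# Check whether the Southbound DB still carries a MAC_Binding row that
# ovn-controller would normally translate into the table 66 flow
$> ovn-sbctl find mac_binding ip=10.60.18.1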
Version-Release number of selected component (if applicable):
OpenShift 4.18
How reproducible:
One occurrence so far
Steps to Reproduce:
Unknown
Actual results:
The ARP resolution flow is missing from table 66 on the affected node.
Expected results:
Flows are installed on every node with no issues.
Additional info: