Type: Bug
Resolution: Unresolved
Priority: Normal
Target Version: 4.18.z
Severity: Moderate
Description of problem:
The node was cordoned after multiple workloads entered the "CrashLoopBackOff / Error" state.
In parallel, ovn-controller on the node showed > 90% CPU usage with long poll intervals. This intermittently affected scheduling stability and service availability for workloads.
On this particular node, users noticed that some pods were unable to reach the Kubernetes API service.
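Roughly, these symptoms can be confirmed with checks like the following sketch. The namespace, pod and container names are the usual OVN-Kubernetes ones and may differ per deployment, the last command assumes curl is available in the pod image, and the API service address 10.224.0.1:443 is the one seen in the trace below.
# Confirm ovn-controller CPU usage on the node
$> top -b -n 1 | grep ovn-controller
# Look for long poll interval warnings from ovn-controller (logged as
# "Unreasonably long ... ms poll interval")
$> oc -n openshift-ovn-kubernetes logs <ovnkube-node-pod> -c ovn-controller | grep -i 'poll interval'
# From one of the affected pods, check reachability of the Kubernetes API service
$> oc rsh <affected-pod> curl -k -m 5 https://10.224.0.1:443/version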
We traced the OVS flows used for these calls:
$> ovs-appctl ofproto/trace br-int in_port=3967,tcp,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.224.0.1,tp_dst=443,nw_ttl=42,dp_hash=3 | tail -30
resubmit(,28)
28. ip,metadata=0x5, priority 1, cookie 0xfe2c80da
push:NXM_NX_REG0[]
push:NXM_NX_XXREG0[96..127]
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xa3c1201
set_field:00:00:00:00:00:00->eth_dst
resubmit(,66)
66. No match.
drop
pop:NXM_NX_REG0[]
-> NXM_NX_REG0[] is now 0xa3c1201
resubmit(,29)
29. metadata=0x5, priority 0, cookie 0x1de89847
resubmit(,30)
30. metadata=0x5, priority 0, cookie 0xcfe11f91
resubmit(,31)
31. metadata=0x5, priority 0, cookie 0xd391eba8
resubmit(,32)
32. ip,reg0=0xa3c1200/0xfffffe00,reg15=0x2,metadata=0x5, priority 110, cookie 0x3dc6457a
set_field:0/0xf0000000->reg10
resubmit(,33)
33. ip,metadata=0x5,dl_dst=00:00:00:00:00:00, priority 100, cookie 0x81f3bdfb
controller(userdata=00.00.00.00.00.00.00.00.00.19.00.10.80.00.06.06.ff.ff.ff.ff.ff.ff.00.00.00.1c.00.18.00.20.00.40.00.00.00.00.00.01.de.10.80.00.2c.04.00.00.00.00.00.1c.00.18.00.20.00.60.00.00.00.00.00.01.de.10.80.00.2e.04.00.00.00.00.00.19.00.10.80.00.2a.02.00.01.00.00.00.00.00.00.ff.ff.00.10.00.00.23.20.00.0e.ff.f8.28.00.00.00,pause,meter_id=41)
pop:NXM_OF_IN_PORT[]
-> NXM_OF_IN_PORT[] is now 3967
Final flow: recirc_id=0x4c6d735,dp_hash=0x3,eth,tcp,reg0=0x300,reg11=0x3,reg12=0x2,reg13=0x66,reg14=0xfbf,reg15=0x1,metadata=0x3,in_port=3967,vlan_tci=0x0000,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.22,nw_dst=10.19.137.104,nw_tos=0,nw_ecn=0,nw_ttl=42,nw_frag=no,tp_src=0,tp_dst=6443,tcp_flags=0
Megaflow: pkt_mark=0,recirc_id=0x4c6d735,ct_state=+new-est-rel-rpl-inv+trk,ct_mark=0x2/0xf,eth,tcp,in_port=3967,dl_src=0a:58:0a:e1:dc:16,dl_dst=0a:58:0a:e1:dc:01,nw_src=10.225.220.16/28,nw_dst=10.19.137.104,nw_ttl=42,nw_frag=no,tp_dst=6443
Datapath actions: ct(commit,zone=102,mark=0/0x1,nat(src)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=80139629,rule_cookie=0x57ec6443,controller_id=0,max_len=65535))))),le(set(eth(src=0a:58:64:40:00:01,dst=0a:58:64:40:00:f3)),set(ipv4(ttl=41)),check_pkt_len(size=1414,gt(sample(sample=100.0%,actions(meter(100),userspace(pid=4294967295,controller(reason=1,dont_send=1,continuation=0,recirc_id=99,rule_cookie=0x9dc0c4f3,controller_id=0,max_len=65535))))),le(set(eth(src=7c:1e:52:fb:5f:d9,dst=00:00:00:00:00:00)),set(ipv4(ttl=40))))))
Here, table 66, which holds the ARP resolution flows, has no matching flow, so the packet is punted to ovn-controller to resolve ARP. On the rest of the nodes the flow is present in table 66 and the traffic works.
This looks like a bug in ovn-controller (similar to OCPBUGS-53151), where ovn-controller fails to install the ARP resolution flows.
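A minimal sketch of how the missing flow can be verified; the reg0 value 0xa3c1201 (which decodes to the next-hop IP 10.60.18.1) is taken from the trace above, and how ovn-sbctl reaches the node-local Southbound DB depends on the deployment, so treat these as illustrative rather than the exact commands used:
# Affected node: no output, the ARP resolution flow is missing
$> ovs-ofctl dump-flows br-int table=66 | grep 'reg0=0xa3c1201'
# A healthy node returns a flow here that sets eth_dst to the resolved MAC
# Check whether the Southbound DB still carries a MAC_Binding row that
# ovn-controller would normally translate into the table 66 flow
$> ovn-sbctl find mac_binding ip=10.60.18.1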
Version-Release number of selected component (if applicable):
OpenShift 4.18
How reproducible:
One occurrence so far
Steps to Reproduce:
Unknown
Actual results:
The ARP resolution flow is missing from table 66 on the affected node.
Expected results:
Flows are installed on every node with no issues.
Additional info: